Project audit experiences

Back in January 2020, I wrote How to pick up a project with an audit. I received some comments about it over the last couple of years, but I don't think I really did anything with them. Then Sumana sent an email asking whether I'd blogged about my experiences auditing projects and estimating how long it takes and things like that.

That got me to re-reading the original blog post and it was clear it needed an update, so I did that. One thing I focused on was differentiating between "service" and "non-service" projects. The post feels better now.

But that's not this post! This post is about my experiences with auditing. What happened in that summer of 2019 that formed the basis of that blog post? What were those 5 [1] fabled projects? How did those audits go? Where are those projects now?

[1] It ended up being 6 projects. I think I didn't originally count Mozilla Location Services for some reason.

Read more…

Everett v3.0.0 released!

What is it?

Everett is a configuration library for Python apps.

Goals of Everett:

  1. flexible configuration from multiple configured environments

  2. easy testing with configuration

  3. easy automated documentation of configuration for users

From that, Everett has the following features:

  • is flexible for your configuration environment needs and supports process environment, env files, dicts, INI files, YAML files, and writing your own configuration environments

  • facilitates helpful error messages for users trying to configure your software

  • has a Sphinx extension for documenting configuration including autocomponentconfig and automoduleconfig directives for automatically generating configuration documentation

  • facilitates testing of configuration values

  • supports parsing values of a variety of types like bool, int, lists of things, classes, and others and lets you write your own parsers

  • supports key namespaces

  • supports component architectures

  • works with whatever you're writing–command line tools, web sites, system daemons, etc.
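To give a rough sense of what that looks like, here's a minimal sketch (assuming Everett 3.x and the names in the everett.manager module) that reads configuration from the process environment and falls back to a dict of defaults:

    from everett.manager import ConfigDictEnv, ConfigManager, ConfigOSEnv

    config = ConfigManager(
        [
            # check the process environment first
            ConfigOSEnv(),
            # then fall back to hard-coded defaults
            ConfigDictEnv({"HOST": "localhost", "PORT": "8000"}),
        ]
    )

    host = config("host")
    port = config("port", parser=int)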

v3.0.0 released!

This is a major release that sports three things:

  • Adjustments in Python support.

    Everett 3.0.0 drops support for Python 3.6 and picks up support for Python 3.10.

  • Reworked namespaces so they work better with Everett components.

    Previously, you couldn't apply a namespace after binding the configuration to a component. Now you can.

    This handles situations like this component, where options are defined with "http" and "db" prefixes:

    from everett.manager import ConfigManager, Option

    class MyComponent:
        class Config:
            http_host = Option(default="localhost")
            http_port = Option(default="8000", parser=int)
    
            db_host = Option(default="localhost")
            db_port = Option(default="5432", parser=int)
    
    config = ConfigManager.basic_config()
    
    # Bind the configuration to a specific component so you can only use
    # options defined in that component
    component_config = config.with_options(MyComponent)
    
    # Apply a namespace which acts as a prefix for options defined in
    # the component
    http_config = component_config.with_namespace("http")
    
    db_config = component_config.with_namespace("db")
    
  • Overhauled Sphinx extension.

    This is the new thing that I'm most excited about. This fixes a lot of my problems with documenting configuration.

    Everett now lets you:

    • document options and components:

      Example option:

      .. everett:option:: SOME_OPTION
         :parser: int
         :default: "5"
      
         Here's some option.
      

      Example component:

      .. everett:component:: SOME_COMPONENT
      
         .. rubric:: Options
      
         .. everett:option:: SOME_OPTION
            :parser: int
            :default: "5"
      
            Here's some option.
      
    • autodocument all the options defined in a Python class

      Example autocomponentconfig:

      .. autocomponentconfig:: myproject.module.MyComponent
         :show-table:
         :case: upper
      
    • autodocument all the options defined in a Python module

      Example automoduleconfig:

      .. automoduleconfig:: mydjangoproject.settings._config
         :hide-name:
         :show-table:
         :case: upper
      

    This works much better with configuration in Django settings modules. This works with component architectures. This works with centrally defining configuration with a configuration class.

    Further, all options and components are added to the index, have unique links, and are easier to link to in your documentation.

    I updated the Antenna (Mozilla crash ingestion collector) docs:

    https://antenna.readthedocs.io/en/latest/configuration.html

    I updated the Eliot (Mozilla Symbolication Service) docs:

    https://tecken.readthedocs.io/en/latest/configuration.html#symbolication-service-configuration-eliot

Why you should take a look at Everett

Everett makes it easy to:

  1. deal with different configurations between local development and server environments

  2. write tests for configuration values

  3. document configuration

  4. debug configuration issues

First-class docs. First-class configuration error help. First-class testing. This is why I created Everett.

If this sounds useful to you, take it for a spin. It's almost a drop-in replacement for python-decouple and the os.environ.get('CONFIGVAR', 'default_value') style of configuration, so it's easy to test out.
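For instance, here's a rough sketch (assuming Everett 3.x) of what an os.environ.get('DEBUG', 'false') style lookup becomes:

    from everett.manager import ConfigManager, parse_bool

    config = ConfigManager.basic_config()

    # looks up DEBUG in the process environment (and a .env file if there
    # is one), falls back to "false", and parses the value into a bool
    debug = config("debug", default="false", parser=parse_bool)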

Where to go for more

For more specifics on this release, see here: https://everett.readthedocs.io/en/latest/history.html#january-13th-2022

Documentation and quickstart here: https://everett.readthedocs.io/

Source code and issue tracker here: https://github.com/willkg/everett

Kent v0.1.0 released! And the story of Kent in the first place....

What is it?

Before explaining what it is, I want to talk about Why.

A couple of years ago, we migrated from the Raven Sentry client (Python) to sentry-sdk. One of the things we did was implement our own sanitization code which removed personally identifiable information and secret information (as best as possible) from error reports.

I find the documentation for writing sanitization filters really confusing. before_send? before_breadcrumb? When do those hooks kick off? What does an event look like? There's a link to a page that describes an event, but there's a lot of verbiage and no schema, so it's not wildly clear what the errors my application is sending look like. [1]

Anyhow, when we switched to sentry-sdk, we implemented some sanitization code because while Raven had some, sentry-sdk did not. Then at some point between then and now, the sanitization code stopped working. It's probably my fault. I bet something changed in sentry-sdk and I didn't notice.

Why didn't I notice? Am I a crappy engineer? Sure, but in this case the problem is that the sanitization code runs in the context of handling an unhandled error. In handling the unhandled error, Sentry passes the event through our broken sanitization code and that throws an exception. Nothing gets sent to Sentry--neither the original error nor the sanitization error.
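To illustrate the mechanism, here's a minimal sketch (not the actual Mozilla code; sanitize_event is a hypothetical stand-in for the real sanitization code). sentry-sdk calls the before_send hook with each event before sending it, and if that hook raises, the event gets dropped. Wrapping the sanitization in a try/except makes the breakage visible instead of silent:

    import logging

    import sentry_sdk

    logger = logging.getLogger(__name__)


    def sanitize_event(event):
        # hypothetical stand-in: scrub PII and secrets from the event structure
        return event


    def before_send(event, hint):
        # sentry-sdk calls this before sending each event; if it raises,
        # nothing gets sent--neither the event nor the sanitization error
        try:
            return sanitize_event(event)
        except Exception:
            # log (or emit a metric) so the broken sanitization shows up,
            # then decide whether to drop the event or send it unsanitized
            logger.exception("sanitization failed")
            return None


    sentry_sdk.init(dsn="https://key@example.com/1", before_send=before_send)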

Once I realized there were errors, I looked in the logs and could see the original errors--but not the sanitization errors.

Note

Fun fact: turns out John Whitlock thought about this when he wrote the sanitization code and added some code to emit a metric if the sanitization code errors out. If I'd had a graph in the dashboard showing this metric, I would have seen it.

"You should test your sanitization code!" you say! Right on! That's what we should be doing! We have unit tests but they run with ficticious data in a pocket dimension. So they passed wonderfully despite the issue!

What we needed was a few things:

  1. I needed to be able to run a fake Sentry service that I could throw errors at and debug the sanitization code in my local environment without having to spin up a real Sentry instance.

  2. I needed to be able to see exactly what is in the error payloads for my application.

  3. I needed something I could use for integration tests with the sentry-sdk.

That's how I ended up putting aside all the things I needed to do and building Kent.

[1] I don't intend to bash Sentry, the Sentry folks, or all the work they do. Their docs may be great. I'm probably the dumb one here.

So what is Kent?

Kent is a fake Sentry service. You can run it, set the Sentry DSN of your application to something like http://public@localhost:8000/1, and then Kent will capture Sentry error reports.
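With the Python sentry-sdk, that looks something like this (a minimal sketch; the DSN is the example one from above):

    import sentry_sdk

    # point the sdk at the locally running Kent; "1" is the project id
    sentry_sdk.init(dsn="http://public@localhost:8000/1")

    try:
        raise ValueError("kaboom")
    except ValueError:
        # this error report goes to Kent instead of a real Sentry service
        sentry_sdk.capture_exception()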

Kent takes 2 seconds to set up. You can run it locally:

$ pip install kent
$ kent-server run

You can run it in a Docker container. There's a sample Dockerfile in the repo.

It doesn't require databases, credentials, caching, or any of that stuff.

Kent stores things in-memory. You don't have to clean up after it.

Kent has a website letting you view errors with your browser.

Kent has an API letting you build integration tests that create the errors and then fetch them and assert things against them.
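Building on the snippet above, an integration test could trigger an error and then ask Kent what it captured. This is a rough sketch; the endpoint path here is an assumption on my part, so check the Kent README for the actual API:

    import requests
    import sentry_sdk


    def test_kent_captured_the_error():
        sentry_sdk.init(dsn="http://public@localhost:8000/1")

        try:
            raise ValueError("kaboom")
        except ValueError:
            sentry_sdk.capture_exception()

        # make sure the sdk has sent the event before asking Kent about it
        sentry_sdk.flush()

        # assumed endpoint for listing captured errors; verify against the
        # Kent README
        resp = requests.get("http://localhost:8000/api/errorlist/")
        assert resp.status_code == 200
        # ... assert against the returned error payloads ...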

What questionable architectural decisions did you make?

I built it with Flask. Flask is great for stuff like this--that part is fine.

The part that's less fine is that I decided to put the least amount of effort into standing it up as a service and putting it behind a real WSGI server, so I'm (ab)using Flask's cli and monkeypatching werkzeug to not print out "helpful" (but in this case unhelpful) messages to the console.

I used pico.css because I read about it like yesterday and it seemed easier to use that than to go fiddling with CSS frameworks to get a really lovely looking site for a fake Sentry service.

I may replace that at some point with something that involves less horizontal space.

I only wrote one test. I have testing set up, but only wrote one test to make sure it's minimally viable. I may write more at some point.

I only tested with Python sentry-sdk. I figure if other people need it, they can let me know what else it works with and we can fix any issues that come up.

I decided to store errors in memory rather than persist things to disk. That was easy to do and seems like the right move. Maybe we'll hit something that requires us to do something different.

I named it Kent. I like short names. Friends suggested I name it Caerbannog because it was a sentry of a sort. I love that name, but I can't reliably spell it.

0.1.0 released!

I thought about making this 1.0.0, but then decided to put it into the world, use it for a bit, fix any issues that come up, and then release 1.0.0.

Initial release with minimally viable feature set.

  • capture errors and keep them in memory

  • API endpoint to list errors

  • API endpoint to fetch error

Where to go for more

History of releases: https://github.com/willkg/kent/blob/main/HISTORY.rst

Source code, issue tracker, documentation, and quickstart here: https://github.com/willkg/kent

Let me know how this helps you!

I say that in a lot of my posts. "Let me know how this helps you!" or "Comment by sending me an email!" or something like that. I occasionally get a response--usually from Sumana--but most often, it's me talking to the void. I do an awful lot of work that theoretically positively affects thousands of people only to be constantly talking to the void.

Let me know if you have positive or negative feelings about Kent by:

  1. clicking on this link: https://github.com/willkg/kent/issues/3

  2. adding a reaction to the description, which should be like two clicks

Socorro Engineering: 2021 retrospective

Summary

2020h1 was rough and 2020h2 was not to be outdone. 2021h1 was worse in a lot of ways, but I got really lucky and a bunch of things happened that made 2021h2 much better. I'll talk a bit more about that towards the end.

But this post isn't about stymying the corrosion of multi-year burnout--it's a dizzying retrospective of Socorro engineering in 2021.

Read more…

Mozilla: 10 years

It's been a long while since I wrote Mozilla: 1 year review. I hit my 10-year "Moziversary" as an employee on September 6th. I was hired in a "doubling" period of Mozilla, so there are a fair number of people who are hitting 10 year anniversaries right now. It's interesting to see that even though we're all at the same company, we had different journeys here.

I started out as a Software Engineer or something like that. Then I was promoted to Senior Software Engineer and then Staff Software Engineer. Then last week, I was promoted to Senior Staff Software Engineer. My role at work over time has changed significantly. It was a weird path to get to where I am now, but that's probably a topic for another post.

I've worked on dozens of projects in a variety of capacities. Here's a handful of the ones that were interesting experiences in one way or another:

  • SUMO (support.mozilla.org): Mozilla's support site

  • Input: Mozilla's feedback site, user sentiment analysis, and Mozilla's initial experiments with Heartbeat and experiments platforms

  • MDN Web Docs: documentation, tutorials, and such for web standards

  • Mozilla Location Service: Mozilla's device location query system

  • Buildhub and Buildhub2: index for build information

  • Socorro: Mozilla's crash ingestion pipeline for collecting, processing, and analyzing crash reports for Mozilla products

  • Tecken: Mozilla's symbols server for uploading and downloading symbols and also symbolicating stacks

  • Standup: system for reporting and viewing status

  • FirefoxOS: Mozilla's mobile operating system

I also worked on a bunch of libraries and tools:

  • siggen: library for generating crash signatures using the same algorithm that Socorro uses (Python)

  • Everett: configuration library (Python)

  • Markus: metrics client library (Python)

  • Bleach: sanitizer for user-provided text for use in an HTML context (Python)

  • ElasticUtils: Elasticsearch query DSL library (Python)

  • mozilla-django-oidc: OIDC authentication for Django (Python)

  • Puente: convenience library for using gettext strings in Django (Python)

  • crashstats-tools: command line tools for accessing Socorro APIs (Python)

  • rob-bugson: Firefox addon that adds Bugzilla links to GitHub PR pages (JS)

  • paul-mclendahand: tool for combining GitHub PRs into a single branch (Python)

  • Dennis: gettext translated strings linter (Python)

I was a part of things:

I've given a few presentations [1]:

[1] I thought there were more, but I can't recall what they might have been.

I've left lots of FIXME notes everywhere.

I made some stickers:

/images/soloist_2017_handdrawn.thumbnail.png

"Soloists" sticker (2017)

/images/ted_sticker.thumbnail.png

"Ted maintained this" sticker (2019)

I've worked with a lot of people and created some really warm, wonderful friendships. Some have left Mozilla, but we keep in touch.

I've been to many work weeks, conferences, summits, and all hands trips.

I've gone through a few profile pictures:

/images/profile_2011.thumbnail.jpg

Me in 2011

/images/profile_2013.thumbnail.jpg

Me in 2013

/images/profile_2016.thumbnail.jpg

Me in 2016 (taken by Erik Rose in London)

/images/profile_2021.thumbnail.jpg

Me in 2021

I've built a few desks, though my pictures are pretty meagre:

/images/standing_desk_rough_sketch.thumbnail.jpg

Rough sketch of a standing desk

/images/standing_desk_1.thumbnail.jpg

Standing desk and a stool I built

/images/desk_2021.thumbnail.jpg

My current chaos of a desk

I've written lots of blog posts on status, project retrospectives, releases, initiatives, and such. Some of them are fun reads still.

It's been a long 10 years. I wonder if I'll be here for 10 more. It's possible!

Socorro Overview: 2021, presentation

Socorro became part of the Data Org at Mozilla back in August 2020. I had intended to give this presentation in October 2020 after I had given one on Tecken, [1] but then the team I was on got re-orged and I never got around to redoing the presentation for a different group.

Fast-forward to March. I got around to updating the presentation and then presented it to Data Club on March 26th, 2021.

I was asked if I wanted it posted on YouTube, and while that'd be cool, I don't think video is very accessible on its own. [2] Instead, I decided I wanted to convert it to a blog post. It took a while to do that for various reasons that I'll cover in another blog post.

This blog post goes through the slides and narrative of that presentation.

[1] I should write that as a blog post, too.

[2] This is one of the big reasons I worked on pyvideo for so long.

Read more…

Data Org Working Groups: retrospective (2020)

Project

time: 1 month

impact: established cross-organization groups as a tool for grouping people

Problem statement

Data Org architects, builds, and maintains a data ingestion system and the ecosystem of pieces around it. It covers a swath of engineering and data science disciplines and problem domains. Many of us are generalists and have expertise and interests in multiple areas. Many projects cut across disciplines, problem domains, and organizational structures. Some projects, disciplines, and problem domains benefit from participation of other stakeholders who aren't in Data Org.

In order to succeed in tackling the projects of tomorrow, we need to formalize creating, maintaining, and disbanding groups composed of interested stakeholders focusing on specific missions. Further, we need a set of best practices to help make these groups successful.

Read more…

Markus v3.0.0 released! Better metrics API for Python projects.

What is it?

Markus is a Python library for generating metrics.

Markus makes it easier to generate metrics in your program by:

  • providing multiple backends (Datadog statsd, statsd, logging, logging roll-up, and so on) for sending metrics data to different places

  • sending metrics to multiple backends at the same time

  • providing testing helpers for easy verification of metrics generation

  • providing a decoupled architecture making it easier to write code to generate metrics without having to worry about whether a metrics client has been created and configured yet--similar to the Python logging module in this way

We use it at Mozilla on many projects.
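Here's a minimal sketch of what that looks like (assuming Markus 3.x; the logging backend just writes metrics to Python logging output):

    import markus

    # configure backends once at application startup
    markus.configure(
        backends=[{"class": "markus.backends.logging.LoggingMetrics"}]
    )

    # grab a metrics interface wherever you need one; the name becomes the
    # key prefix, similar to how the logging module works
    metrics = markus.get_metrics("myapp")

    metrics.incr("page_view")
    metrics.timing("render_time", value=13.2)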

v3.0.0 released!

I released v3.0.0 just now. Changes:

Features

  • Added support for Python 3.9 (#79). Thank you, Brady!

  • Changed assert_* helper methods on markus.testing.MetricsMock to print the records to stdout if the assertion fails. This can save some time debugging failing tests. (#74)

Backwards incompatible changes

  • Dropped support for Python 3.5 (#78). Thank you, Brady!

  • markus.testing.MetricsMock.get_records and markus.testing.MetricsMock.filter_records return markus.main.MetricsRecord instances now.

    This might require you to rewrite/update tests that use the MetricsMock.
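If you're updating tests, here's a rough sketch of MetricsMock usage under 3.0.0, assuming a "myapp" metrics interface like in the sketch above:

    import markus
    from markus.testing import MetricsMock


    def test_page_view_metric():
        with MetricsMock() as mm:
            # ... call the code under test that should emit the metric ...
            markus.get_metrics("myapp").incr("page_view")

            # prints the captured records to stdout if the assertion fails
            mm.assert_incr("myapp.page_view")

            # get_records() now returns markus.main.MetricsRecord instances
            records = mm.get_records()
            assert len(records) == 1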

Where to go for more

Changes for this release: https://markus.readthedocs.io/en/latest/history.html#february-5th-2021

Documentation and quickstart here: https://markus.readthedocs.io/en/latest/index.html

Source code and issue tracker here: https://github.com/willkg/markus/

Let me know how this helps you!

Socorro: This Period in Crash Stats: Volume 2021.1

New features and changes in Crash Stats

Crash Stats crash report view pages show Breadcrumbs information

In 2020q3, Roger and I worked out a new Breadcrumbs crash annotation for crash reports generated by the android-components crash reporter. It's a JSON-encoded field with a structure just like the one that the sentry-sdk sends. Incoming crash reports that have this annotation will show the data on the crash report view page to people who have protected data access.

/images/tpics_2021_1_breadcrumbs.thumbnail.png

Figure 1: Screenshot of Breadcrumbs data in Details tab of crash report view on Crash Stats.

I implemented it based on what we get in the structure and what's in the Sentry interface.

Breadcrumbs information is not searchable with Supersearch. Currently, it's only shown in the crash report view in the Details tab.

Does this help your work? Are there ways to improve this? If so, let me know!

This work was done in [bug 1671276].

Crash Stats crash report view pages show Java exceptions

For the longest of long times, crash reports from a Java process included a JavaStackTrace annotation which was a big unstructured string that was problematic to parse, so I couldn't do much with it.

In 2020q4, Roger and I worked out a new JavaException crash annotation which was a JSON-encoded structured string containing the exception information. Not only does it have the final exception, but it also has cascading exceptions if there are any! With a structured form of the exception, we can suddenly do a lot of interesting things.

As a first step, I added display of the Java exception information to the crash report view page in the Details tab. It's in the same place that you would see the crashing thread stack if this were a C++/Rust crash.

Just like JavaStackTrace, the JavaException annotation has some data in it that can contain PII. Because of that, the Socorro processor generates two versions of the data: one that's sanitized (no Java exception message values) and one that's raw. If you have protected data access, you can see the raw one.

The interface is pretty wide and exceeds the screenshot. Sorry about that.

/images/tpics_2021_1_exception.thumbnail.png

Figure 2: Screenshot of Java exception data in Details tab of crash report view in Crash Stats.

My next step is to use the structured exception information to improve Java crash report signatures. I'm doing that work in [bug 1541120] and hoping to land that in 2021q1. More on that later.

Does this help your work? Are there ways to improve this? If so, let me know!

This work was done in [bug 1675560].

Changes to crash report view

One of the things I don't like about the crash report view is that it's impossible to intuit where the data you're looking at is from. Further, some of the tabs were unclear about what bits of data were protected data and what bits weren't. I've been improving that over time.

The most recent step involved the following changes:

  1. The "Metadata" tab was renamed to "Crash Annotations". This tab holds the crash annotation data from the raw crash before processing as well as a few fields that the collector adds when accepting a crash report from the client. Most of the fields are defined in the CrashAnnotations.yaml file in mozilla-central. The ones that aren't, yet, should get added. I have that on my list of things to get to.

  2. The "Crash Annotations" tab is now split into public and protected data sections. I hope this makes it a little clearer which is which.

  3. I removed some unneeded fields that the collector adds at ingestion.

Does this help your work? Are there ways to improve this? If so, let me know!

What's in the queue

In addition to the day-to-day stuff, I'm working on the following big projects in 2021q1.

Remove the Email field

Last year, I looked into who's using the Email field and for what, whether the data was any good, and in what circumstances we even get Email data. That work was done in [bug 1626277].

The consensus was that we should remove it: not all of the crash reporter clients let a user enter an email address, it doesn't seem like we use the data, and it's pretty toxic data to have.

The first step of that is to delete the field from the crash report at ingestion. I'll be doing that work in [bug 1688905].

The second step is to remove it from the webapp. I'll be doing that work in [bug 1688907].

Once that's done, I'll write up some bugs to remove it from the crash reporter clients and wherever else it is in products.

Does this affect you? If so, let me know!

Redo signature generation for Java crashes

Currently, signature generation for Java crashes is pretty basic and it's not flexible in the ways we need it to be. Now we can fix that.

I need some Java crash expertise to bounce ideas off of and to help me verify "goodness" of signatures. If you're interested in helping in any capacity or if you have opinions on how it should work or what you need out of it, please let me know.

I'm hoping to do this work in 2021q1.

The tracker bug is [bug 1541120].

Closing

Thank you to Roger Yang who implemented Breadcrumbs and JavaException reporting and Gabriele Svelto who advised on new annotations and how things should work! Thank you to everyone who submits signature generation changes--I really appreciate your efforts!

Socorro Engineering: Half in Review 2020 h2 and 2020 retrospective

Summary

2020h1 was rough. 2020h2 was also rough: more layoffs, 2 re-orgs, Covid-19.

I (and Socorro and Tecken) got re-orged into the Data Org. Data Org manages the Telemetry ingestion pipeline as well as all the things related to it. There's a lot of overlap between Socorro and Telemetry and being in the Data Org might help reduce that overlap and ease maintenance.

But this post isn't about the future--it's about the past! Let's talk about what happened in 2020h2 and then a brief retrospective of 2020.

Prepare to dive in!

Read more…