Socorro/Tecken Overview: 2022, presentation

Monday May 16, 2022, Will Kahn-Greene

Socorro and Tecken make up the services part of our crash reporting system at Mozilla. We ran a small Data Sprint day to onboard a new ops person and a new engineer. I took my existing Socorro presentation and Tecken presentation [1], combined them, reduced them, and then fixed a bunch of issues. This is that presentation.

Project audit experiences

Sunday January 16, 2022, Will Kahn-Greene

Back in January 2020, I wrote How to pick up a project with an audit. I received some comments about it over the last couple of years, but I don't think I really did anything with them. Then Sumana sent an email asking whether I'd blogged about my experiences auditing projects and estimating how long it takes and things like that.

That got me to re-reading the original blog post and it was clear it needed an update, so I did that. One thing I focused on was differentiating between "service" and "non-service" projects. The post feels better now.

But that's not this post! This post is about my experiences with auditing. What happened in the summer of 2019 that formed the basis of that blog post? What were those 5 [1] fabled projects? How did those audits go? Where are those projects now?

Everett v3.0.0 released!

Thursday January 13, 2022, Will Kahn-Greene

What is it?

Everett is a configuration library for Python apps.

Goals of Everett:

- flexible configuration from multiple configured environments
- easy testing with configuration
- easy automated documentation of configuration for users

From that, Everett has the following features:

- is flexible for your configuration environment needs and supports process environment, env files, dicts, INI files, YAML files, and writing your own configuration environments
- facilitates helpful error messages for users trying to configure your software
- has a Sphinx extension for documenting configuration including autocomponentconfig and automoduleconfig directives for automatically generating configuration documentation
- facilitates testing of configuration values
- supports parsing values of a variety of types like bool, int, lists of things, classes, and others and lets you write your own parsers
- supports key namespaces
- supports component architectures
- works with whatever you’re writing: command line tools, web sites, system daemons, etc.

v3.0.0 released!

This is a major release that sports three things:

Adjustments in Python support.

Everett 3.0.0 drops support for Python 3.6 and picks up support for Python 3.10.

Reworked namespaces so they work better with Everett components.

Previously, you couldn't apply a namespace after binding the configuration to a component. Now you can.

This handles situations like this component:

from everett.manager import ConfigManager, Option


class MyComponent:
    class Config:
        http_host = Option(default="localhost")
        http_port = Option(default="8000", parser=int)

        db_host = Option(default="localhost")
        db_port = Option(default="5432", parser=int)

config = ConfigManager.basic_config()

# Bind the configuration to a specific component so you can only use
# options defined in that component
component_config = config.with_options(MyComponent)

# Apply a namespace which acts as a prefix for options defined in
# the component
http_config = component_config.with_namespace("http")

db_config = component_config.with_namespace("db")

Overhauled Sphinx extension.

This is the new thing that I'm most excited about. This fixes a lot of my problems with documenting configuration.

Everett now lets you:
- document options and components:
  
  Example option:
```
.. everett:option:: SOME_OPTION
   :parser: int
   :default: "5"

   Here's some option.
```
  Example component:
```
.. everett:component:: SOME_COMPONENT

   .. rubric:: Options

   .. everett:option:: SOME_OPTION
      :parser: int
      :default: "5"

      Here's some option.
```
- autodocument all the options defined in a Python class
  
  Example autocomponentconfig:
```
.. autocomponentconfig:: myproject.module.MyComponent
   :show-table:
   :case: upper
```
- autodocument all the options defined in a Python module
  
  Example automoduleconfig:
```
.. automoduleconfig:: mydjangoproject.settings._config
   :hide-name:
   :show-table:
   :case: upper
```
This works much better with configuration in Django settings modules. This works with component architectures. This works with centrally defining configuration with a configuration class.

Further, all options and components are added to the index, have unique links, and are easier to link to in your documentation.
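To use these directives, the extension has to be enabled in the Sphinx configuration. A minimal conf.py fragment might look like this; the rest of the file is whatever your project already has.

```python
# docs/conf.py (fragment)
# Enable the Everett Sphinx extension so the everett:option,
# everett:component, autocomponentconfig, and automoduleconfig
# directives are available when building the docs.
extensions = [
    "everett.sphinxext",
]
```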

I updated the Antenna (Mozilla crash ingestion collector) docs:

https://antenna.readthedocs.io/en/latest/configuration.html

I updated the Eliot (Mozilla Symbolication Service) docs:

https://tecken.readthedocs.io/en/latest/configuration.html#symbolication-service-configuration-eliot

Why you should take a look at Everett

Everett makes it easy to:

- deal with different configurations between local development and server environments
- write tests for configuration values
- document configuration
- debug configuration issues

First-class docs. First-class configuration error help. First-class testing. This is why I created Everett.

If this sounds useful to you, take it for a spin. It's almost a drop-in replacement for python-decouple and os.environ.get('CONFIGVAR', 'default_value') style of configuration so it's easy to test out.

Where to go for more

For more specifics on this release, see here: https://everett.readthedocs.io/en/latest/history.html#january-13th-2022

Documentation and quickstart here: https://everett.readthedocs.io/

Source code and issue tracker here: https://github.com/willkg/everett

Kent v0.1.0 released! And the story of Kent in the first place....

Tuesday January 4, 2022, Will Kahn-Greene

What is it?

Before explaining what it is, I want to talk about Why.

A couple of years ago, we migrated from the Raven Sentry client (Python) to sentry-sdk. One of the things we did was implement our own sanitization code which removed personally identifiable information and secret information (as best as possible) from error reports.

I find the documentation for writing sanitization filters really confusing. before_send? before_breadcrumb? When do those hooks kick off? What does an event look like? There's a link to a page that describes an event, but there's a lot of verbiage and no schema so it's not wildly clear what the errors my application is sending look like. [1]

Anyhow, so when we switched to sentry-sdk, we implemented some sanitization code because while Raven had some code, sentry-sdk did not. Then at some point between then and now, the sanitization code stopped working. It's my fault probably. I bet something changed in the sentry-sdk and I didn't notice.

Why didn't I notice? Am I a crappy engineer? Sure, but in this case the problem here is that the sanitization code runs in the context of handling an unhandled error. In handling the unhandled error, Sentry passes the event through our broken sanitization code and that throws an exception. Nothing gets sent to Sentry--neither the original error nor the sanitization error.

Once I realized there were errors, I looked in the logs and I could see the original errors--but not the sanitization errors.
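One way to make that failure mode visible is to wrap the sanitizer so its own exceptions are logged instead of vanishing. The following is a minimal sketch, not the code from the project: scrub_event, SENSITIVE_HEADERS, and the assumed event layout (a "request" dict with a "headers" dict) are hypothetical. The before_send(event, hint) signature is the one sentry-sdk uses, and returning None tells the SDK to drop the event.

```python
import logging

logger = logging.getLogger("sanitizer")

# Hypothetical list of header names to mask.
SENSITIVE_HEADERS = {"authorization", "cookie"}


def scrub_event(event):
    """Hypothetical sanitizer: mask sensitive request headers in place.

    Assumes the event has event["request"]["headers"] as a dict. It raises
    KeyError or TypeError if the event is shaped differently.
    """
    headers = event["request"]["headers"]
    for name in headers:
        if name.lower() in SENSITIVE_HEADERS:
            headers[name] = "[Scrubbed]"
    return event


def before_send(event, hint):
    """Run the sanitizer and log its failures so they show up somewhere."""
    try:
        return scrub_event(event)
    except Exception:
        # Log the sanitization error so it lands in the application logs,
        # then drop the event rather than send unsanitized data.
        logger.exception("sanitization failed; dropping event")
        return None
```

In an application this function would be passed as the before_send argument to sentry_sdk.init(). Dropping the event on failure is a design choice here: it trades a lost report for not leaking data, and the log line is what tells you the sanitizer is broken.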

"You should test your sanitization code!" you say! Right on! That's what we should be doing! We have unit tests but they run with fictitious data in a pocket dimension. So they passed wonderfully despite the issue!

What we needed was a few things:

- I needed to be able to run a fake Sentry service that I could throw errors at and debug the sanitization code in my local environment without having to spin up a real Sentry instance.
- I needed to be able to see exactly what is in the error payloads for my application.
- I needed something I can use for integration tests with the sentry-sdk.

That's how I ended up putting aside all the things I needed to do and built Kent.

So what is Kent?

Kent is a fake Sentry service. You can run it, set the Sentry DSN of your application to something like http://public@localhost:8000/1, and then Kent will capture Sentry error reports.
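To make the pieces of that DSN concrete, here is a small stdlib-only sketch that splits it apart. The parse_dsn helper is hypothetical and written for this example; it is not part of Kent or sentry-sdk.

```python
from urllib.parse import urlsplit


def parse_dsn(dsn):
    """Hypothetical helper: split a Sentry DSN into its parts."""
    parts = urlsplit(dsn)
    return {
        "scheme": parts.scheme,
        # The username portion of the URL is the public key.
        "public_key": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        # The last path segment is the project id.
        "project_id": parts.path.rsplit("/", 1)[-1],
    }


dsn_parts = parse_dsn("http://public@localhost:8000/1")
```

For the example DSN, the host and port point at the local Kent server, and the key and project id are placeholder values.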

Kent takes 2 seconds to set up. You can run it locally:

$ pip install kent
$ kent-server run

You can run it in a Docker container. There's a sample Dockerfile in the repo.

It doesn't require databases, credentials, caching, or any of that stuff.

Kent stores things in-memory. You don't have to clean up after it.

Kent has a website letting you view errors with your browser.

Kent has an API letting you build integration tests that create the errors and then fetch them and assert things against them.

What questionable architectural decisions did you make?

I built it with Flask. Flask is great for stuff like this--that part is fine.

The part that's less fine is that I decided to put in the least amount of effort in standing it up as a service and putting it behind a real WSGI server, so I'm (ab)using Flask's cli and monkeypatching werkzeug to not print out "helpful" (but in this case--unhelpful) messages to the console.

I used pico.css because I read about it like yesterday and it seemed easier to use that than to go fiddling with CSS frameworks to get a really lovely looking site for a fake Sentry service.

I may replace that at some point with something that involves less horizontal space.

I only wrote one test. I have testing set up, but only wrote one test to make sure it's minimally viable. I may write more at some point.

I only tested with Python sentry-sdk. I figure if other people need it, they can let me know what else it works with and we can fix any issues that come up.

I decided to store errors in memory rather than persist things to disk. That was easy to do and seems like the right move. Maybe we'll hit something that requires us to do something different.

I named it Kent. I like short names. Friends suggested I name it Caerbannog because it was a sentry of a sort. I love that name, but I can't reliably spell it.

0.1.0 released!

I thought about making this 1.0.0, but then decided to put it into the world and use it for a bit and fix any issues that come up and then release 1.0.0.

Initial release with minimally viable feature set.

- capture errors and keep them in memory
- API endpoint to list errors
- API endpoint to fetch error

Where to go for more

History of releases: https://github.com/willkg/kent/blob/main/HISTORY.rst

Source code, issue tracker, documentation, and quickstart here: https://github.com/willkg/kent

Let me know how this helps you!

I say that in a lot of my posts. "Let me know how this helps you!" or "Comment by sending me an email!" or something like that. I occasionally get a response--usually from Sumana--but most often, it's me talking to the void. I do an awful lot of work that theoretically positively affects thousands of people to be constantly talking to the void.

Let me know if you have positive or negative feelings about Kent by:

- clicking on this link: https://github.com/willkg/kent/issues/3
- adding a reaction to the description, which should be like two clicks

Socorro Engineering: 2021 retrospective

Wednesday December 22, 2021, Will Kahn-Greene

Summary

2020h1 was rough and 2020h2 was not to be outdone. 2021h1 was worse in a lot of ways, but I got really lucky and a bunch of things happened that made 2021h2 much better. I'll talk a bit more about that towards the end.

But this post isn't about stymying the corrosion of multi-year burnout--it's a dizzying retrospective of Socorro engineering in 2021.

Eliot: retrospective (2021)

Monday November 15, 2021, Will Kahn-Greene

Project

time:

1 year

impact:

- reduces risk of a Mozilla Symbols Server outage, which affects symbols uploads from the build system
- improves maintainability of the symbolication service by offloading parsing of Breakpad symbols files and symbol lookups to an external library that's used by other tools in the crash reporting ecosystem at Mozilla
- opens up possible futures around supporting inline functions and using other debug file types

Problem statement

Tecken is the project for the Mozilla Symbols Service. This service manages several things:

- symbol upload API for uploading and storing debugging symbols generated by build systems for products like Firefox, Fenix, etc
- download API for downloading symbols which could be located in a variety of different places supporting tools like Visual Studio, stackwalkers, profilers, symbolicators, etc
- symbolication API for finding symbols for memory addresses

It also has a webapp for querying symbols, debugging symbols problems, managing API tokens, and granting permissions for uploading symbols.

All of those functions are currently handled by a single webapp service.

There are a few problems here.

First, we want to reduce risk of an outage for uploading symbols. When we have service outages, the build systems can't upload symbols. They try really hard to upload symbols, so this increases the build times for Firefox and other products. On top of that, if the build system doesn't successfully upload symbols, any crashes in tests or channels result in unsymbolicated stacks, which obscures the details of the crash.

There are several projects waiting in the wings to dramatically increase their use of the symbolication API, which increases the likelihood of an outage with the service that affects symbol uploads.

Second, the existing symbolication API implementation is an independent implementation of sym file parsing, lookups, and symbolication. Whenever we make adjustments to how sym files are built or structured, or to the lookup algorithms, we have to additionally update the symbolication API code.
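For a sense of what "sym file parsing and lookups" involves, here is a heavily simplified, stdlib-only sketch. It is not the Tecken code or the Symbolic library: it only reads FUNC records from Breakpad-style symbol text and finds the function covering an address with a binary search. The Func type, the function names, and the sample data are all made up for illustration.

```python
import bisect
from typing import List, NamedTuple, Optional


class Func(NamedTuple):
    address: int
    size: int
    name: str


def parse_funcs(sym_text: str) -> List[Func]:
    """Collect FUNC records from Breakpad-style symbol text, sorted by address.

    FUNC records look like "FUNC [m] <address> <size> <parameter_size> <name>"
    with the numbers in hex. All other record types are ignored here.
    """
    funcs = []
    for line in sym_text.splitlines():
        fields = line.split(" ")
        if fields[0] != "FUNC":
            continue
        if len(fields) > 1 and fields[1] == "m":
            # Drop the optional "m" flag.
            fields.pop(1)
        if len(fields) < 5:
            continue
        # Function names can contain spaces, so rejoin the remainder.
        name = " ".join(fields[4:])
        funcs.append(Func(int(fields[1], 16), int(fields[2], 16), name))
    return sorted(funcs)


def lookup(funcs: List[Func], address: int) -> Optional[Func]:
    """Return the function whose address range contains address, if any."""
    starts = [func.address for func in funcs]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None
    func = funcs[index]
    if address < func.address + func.size:
        return func
    return None


# Made-up sample data for the example.
SAMPLE_SYM = """\
MODULE Linux x86_64 ABC123 libexample.so
FUNC 1000 20 0 first_function()
1000 10 42 0
FUNC 1020 10 0 second_function(int)
"""

funcs = parse_funcs(SAMPLE_SYM)
```

Even this toy version has to track the file format, which is the maintenance burden described above: every change to how sym files are structured means another change here too.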

Mozilla is in the process of rewriting crash reporting related code in Rust. It behooves us greatly to switch from our independent implementation to a shared library.

Third, the symbolication API is missing some critical features like support for line numbers and inline functions. The existing code can't be extended to support either line numbers or inline functions--we need to rewrite it.

In September of 2020, I embarked on a project to break out the symbolication API as a separate microservice and implement it using the Symbolic library. That had the following effects:

- eases the risk of outage due to increasing usage of the symbolication API,
- adds support for line numbers and sets us up for adding support for inline functions, and
- reduces the maintenance work because we'll be using a library used by other parts of the crash reporting ecosystem

This post covers that project.