I went to NormConf 2022, but didn't attend the whole thing. It was entirely
online as a YouTube livestream for something like 14 hours split into three
sessions. It had a very active Slack instance.
I like doing post-conference write-ups because then I have some record of what
I was thinking at the time. Sometimes that's useful for other people. Often
it's helpful for me.
I'm data engineer adjacent. I work on a data pipeline for crash reporting, but
it's a streaming pipeline, entirely bespoke, and doesn't use any/many of the
tools in the data engineer toolkit. There's no ML. There's no NLP. I don't have
a data large-body-of-water. I'm not using SQL much. I'm not having Python
packaging problems. Because of that, I kind of skipped over the data engineer
related talks.
The conference was well done. Everyone did a great job. The Slack channels I
lurked in were hopping. The way they did questions worked really well.
I work at Mozilla. We get a laptop refresh periodically. I got a new laptop to replace my older one. I'm a software engineer and I
work on services that are built using Docker and tooling that runs on Linux.
This post covers my attempt at setting up a Windows laptop for software
development for the projects I work on after having spent the last 20 years
predominantly using Linux and Linux-like environments.
Spoiler: This is a failed attempt and I gave up and stuck with Linux.
Back in June, I saw a note about Volunteer Responsibility Amnesty Day in Sumana's Changeset Consulting
newsletter.
The idea of it really struck a chord with me. I wondered whether running an
event like this at work would help. With that, I coordinated an event, ran it,
and this is the blog post summarizing how it went.
The context
As people leave Mozilla, the libraries, processes, services, and other
responsibilities (hidden and visible) all suddenly become unowned. In some
cases, these things get passed to teams and individuals and there's a clear
handoff. In a lot of cases, stuff just gets dropped on the floor.
Some of these things should remain on the floor--we shouldn't maintain all the
things forever. Sometimes things get maintained because of inertia rather than
actual need. Letting these drop and decay over time is fine.
Some of these things turn out to be critical cogs in the machinations of
complex systems. Letting these drop and decay over time can sometimes lead to a
huge emergency involving a lot of unscheduled scrambling to fix. That's bad. No
one likes that.
In the last year, I had picked up a bunch of stuff from people who had left and
it was increasingly hard to juggle it all. Thus taking a day to audit all the things on my plate and figure out which ones I didn't want to do anymore seemed really helpful.
Further, even without people leaving, new projects show up, pipelines are
added, new services are stood up--there's more stuff running and more stuff to
do to keep it all running.
Thus I wondered, what if other people in Data Org at Mozilla had similar
issues? What if there were tasks and responsibilities that we had accumulated
over the years that, if we stepped back and looked at them, didn't really need
to be done anymore? What if there were people who had too many things on their
plate and people who had a lot of space? Maybe an audit would surface this and
let us collectively shuffle some things around.
Setting it up
In that context, I decided to coordinate a Volunteer Responsibility Amnesty Day
for Data Org.
I decided to structure it a little differently because I wanted to run
something that people could participate in regardless of what time zone they
were in. I wanted it to produce an output that individuals could talk with
their managers about--something they could use to take stock of where things
were at, surface work individuals were doing that managers may not know about,
and provide a punch list of actions to fix any problems that came up.
I threw together a Google doc that summarized the goals, provided a template for the audit, and included next steps, which were pretty much: tell us on Slack and bring it up with your manager in your next 1:1. Here's the doc:
I talked to my manager about it. I mentioned it in meetings and in various
channels on Slack.
On the actual day, I posted a few reminders in Slack.
How'd it go?
I figured it was worth doing once. Maybe it would be helpful? Maybe not? Maybe
it helps us reduce the amount of stuff we're doing solely for inertia
purposes?
I didn't get a lot of signal about how it went, though.
I know chutten participated and the audit was helpful for him. He has a ton of
stuff on his plate.
I know Jan-Erik participated. I don't know if it was helpful for him.
I heard that Alessio decided to do this with his team every 6 months or so.
While I did organize the event, I actually didn't participate. I forget what
happened, but something came up and I was bogged down with that.
That's about all I know. I think there are specific people who have a lot of
stuff on their plate and this was helpful, but generally either people didn't
participate (Maybe they were bogged down like me? Maybe they don't have much
they're juggling?) or I never found out they participated.
Epilog
I think it was useful to do. It was a very low-effort experiment to see if
something like this would be helpful. If it was the case that people had a lot on their plates, it seems like this would have surfaced a bunch of things, allowing us to improve people's work lives. I think for specific people who have a lot
on their plate, it was a helpful exercise.
I didn't get enough signal to make me want to spend the time to run it again in
December.
Given that:
I think it's good to run individually. If you're feeling overwhelmed with stuff, an audit is a great place to start figuring out how to fix that.
It might be good to run in a small team as an exercise in taking stock of what's going on and rebalancing things.
It's probably not helpful to run across a whole org, where it likely ends up being more bookkeeping work than it's worth.
There's an additional backwards-incompatible change here in which we drop
the --color and --no-color arguments from dennis-cmd lint.
658f951 Document dubstep (#74)
adb4ae1 Rework CI so it uses a matrix
transfer project from willkg to mozilla for ongoing maintenance and support
Retrospective
I worked on Dennis for 9 years.
It was incredibly helpful! It eliminated an entire class of bugs we were
plagued with for critical Mozilla sites like AMO, MDN, SUMO, Input [1], and others. It did it in a way that
supported and was respectful of our localization community.
It was pretty fun! The translation transforms are incredibly helpful for fixing layout issues. Some of them also produce hilarious results.
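For example, here's roughly how you'd run the dubstep transform over a PO file with the dennis-cmd CLI (flag spelling from memory; check the dennis docs for the exact invocation):

$ dennis-cmd translate --pipeline=dubstep messages.po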
I enjoyed writing silly things at the bottom of all the release blog posts.
I learned a lot about gettext, localization, and languages! Learning about the
nuances of plurals was fascinating.
The code isn't great. I wish I had redone the tokenization pipeline. I wish I
had gotten around to adding support for other gettext variable formats.
Regardless, this project had a significant impact on Mozilla sites which I
covered briefly in my Dennis Retrospective (2013).
Handing it off
It's been 6 years since I worked on sites that have localization, so I haven't
really used Dennis in a long time and I'm no longer a stakeholder for it.
I need to reduce my maintenance load, so I looked into whether to end this
project altogether. Several Mozilla projects still use it for linting PO files
for deploys, so I decided not to end the project, but instead hand it off.
Socorro and Tecken make up the services part of our crash reporting system
at Mozilla. We ran a small Data Sprint day to onboard a new ops person and a
new engineer. I took my existing Socorro presentation and Tecken presentation [1], combined them, reduced
them, and then fixed a bunch of issues. This is that presentation.
Back in January 2020, I wrote How to pick up a project with an audit. I received some
comments about it over the last couple of years, but I don't think I really did
anything with them. Then Sumana sent an email asking whether I'd blogged about
my experiences auditing projects and estimating how long it takes and things
like that.
That got me to re-reading the original blog post and it was clear it needed an
update, so I did that. One thing I focused on was differentiating between
"service" and "non-service" projects. The post feels better now.
But that's not this post! This post is about my experiences with auditing. What
happened in that Summer of 2019 which formed the basis of that blog post? What
were those 5 [1] fabled projects? How did those audits go? Where are those
projects now?
Everett is a configuration library for Python
apps.
Goals of Everett:
flexible configuration from multiple configured environments
easy testing with configuration
easy automated documentation of configuration for users
From that, Everett has the following features:
is flexible for your configuration environment needs and supports process
environment, env files, dicts, INI files, YAML files, and writing your own
configuration environments
facilitates helpful error messages for users trying to configure your
software
has a Sphinx extension for documenting configuration including
autocomponentconfig and automoduleconfig directives for automatically
generating configuration documentation
facilitates testing of configuration values
supports parsing values of a variety of types like bool, int, lists of
things, classes, and others and lets you write your own parsers
supports key namespaces
supports component architectures
works with whatever you're writing--command line tools, web sites, system daemons, etc
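As a minimal sketch of basic usage (the option names here are hypothetical; ConfigManager.basic_config and the parser arguments are Everett's API as I understand it):

from everett.manager import ConfigManager

# Pull configuration from the process environment
config = ConfigManager.basic_config()

# Option names are made up for illustration; defaults are strings and
# get run through the parser just like user-provided values
host = config("host", default="localhost")
port = config("port", default="8000", parser=int)
debug = config("debug", default="false", parser=bool)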
v3.0.0 released!
This is a major release that sports three things:
Adjustments in Python support.
Everett 3.0.0 drops support for Python 3.6 and picks up support for Python
3.10.
Reworked namespaces so they work better with Everett components.
Previously, you couldn't apply a namespace after binding the configuration to
a component. Now you can.
This handles situations like this component:
class MyComponent:
    class Config:
        http_host = Option(default="localhost")
        http_port = Option(default="8000", parser=int)
        db_host = Option(default="localhost")
        db_port = Option(default="5432", parser=int)

config = ConfigManager.basic_config()

# Bind the configuration to a specific component so you can only use
# options defined in that component
component_config = config.with_options(MyComponent)

# Apply a namespace which acts as a prefix for options defined in
# the component
http_config = component_config.with_namespace("http")
db_config = component_config.with_namespace("db")
Overhauled Sphinx extension.
This is the new thing that I'm most excited about. This fixes a lot of my
problems with documenting configuration.
Everett now lets you:
document options and components:
Example option:
.. everett:option:: SOME_OPTION
   :parser: int
   :default: "5"

   Here's some option.
Example component:
.. everett:component:: SOME_COMPONENT

   .. rubric:: Options

   .. everett:option:: SOME_OPTION
      :parser: int
      :default: "5"

      Here's some option.
autodocument all the options defined in a Python class
This works much better with configuration in Django settings modules. This
works with component architectures. This works with centrally defining
configuration with a configuration class.
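For example, a sketch of autodocumenting a component's options with the autocomponentconfig directive (the dotted path is hypothetical):

.. autocomponentconfig:: myapp.app.AppConfig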
Further, all options and components are added to the index, have unique
links, and are easier to link to in your documentation.
I updated the Antenna (Mozilla crash ingestion collector) docs using the new extension.
Everett makes it easier to:
deal with different configurations between local development and server environments
write tests for configuration values
document configuration
debug configuration issues
First-class docs. First-class configuration error help. First-class testing.
This is why I created Everett.
If this sounds useful to you, take it for a spin. It's almost a drop-in
replacement for python-decouple and os.environ.get('CONFIGVAR', 'default_value')
style of configuration so it's easy to test out.
Before explaining what it is, I want to talk about Why.
A couple of years ago, we migrated from the Raven Sentry client (Python) to
sentry-sdk. One of the things we did was implement our own sanitization code
which removed personally identifiable information and secret information (as
best as possible) from error reports.
I find the documentation for writing sanitization filters really confusing.
before_send? before_breadcrumb? When do those hooks kick off? What does
an event look like? There's a link to a page that describes an event, but there's a lot of
verbiage and no schema so it's not wildly clear what the errors my application
is sending look like. [1]
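For reference, the before_send hook is a callable you pass to sentry_sdk.init that receives each event right before it's sent; here's a minimal sanitization sketch (the header being scrubbed is just an example):

import sentry_sdk

def sanitize_event(event, hint):
    # Scrub a sensitive header; a real sanitizer would walk much more
    # of the event structure than this
    headers = event.get("request", {}).get("headers", {})
    headers.pop("Authorization", None)
    return event

sentry_sdk.init(
    dsn="https://key@sentry.example.com/1",  # hypothetical DSN
    before_send=sanitize_event,
)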
Anyhow, so when we switched to sentry-sdk, we implemented some sanitization
code because while Raven had some code, sentry-sdk did not. Then at some point
between then and now, the sanitization code stopped working. It's my fault
probably. I bet something changed in the sentry-sdk and I didn't notice.
Why didn't I notice? Am I a crappy engineer? Sure, but in this case the problem
here is that the sanitization code runs in the context of handling an unhandled
error. In handling the unhandled error, Sentry passes the event through our
broken sanitization code and that throws an exception. Nothing gets sent to
Sentry--neither the original error nor the sanitization error.
Once I realized there were errors, I looked in the logs and I can see the
original errors--but not the sanitization errors.
"You should test your sanitization code!" you say! Right on! That's what we
should be doing! We have unit tests but they run with fictitious data in a
pocket dimension. So they passed wonderfully despite the issue!
What we needed was a few things:
I needed to be able to run a fake Sentry service that I could throw errors
at and debug the sanitization code in my local environment without having to
spin up a real Sentry instance
I needed to be able to see exactly what is in the error payloads for my
application.
I needed something I can use for integration tests with the sentry-sdk.
That's how I ended up putting aside all the things I needed to do and built
Kent.
So what is Kent?
Kent is a fake Sentry service. You can run it, set the Sentry DSN of your
application to something like http://public@localhost:8000/1, and then Kent
will capture Sentry error reports.
Kent takes 2 seconds to set up. You can run it locally:
$ pip install kent
$ kent-server run
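Then point your application at it. As a sketch with the Python sentry-sdk:

import sentry_sdk

# Use the local Kent instance as the Sentry service
sentry_sdk.init(dsn="http://public@localhost:8000/1")

# This error report goes to Kent instead of a real Sentry server
sentry_sdk.capture_exception(ValueError("kaboom"))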
You can run it in a Docker container. There's a sample Dockerfile in the
repo.
It doesn't require databases, credentials, caching, or any of that stuff.
Kent stores things in-memory. You don't have to clean up after it.
Kent has a website letting you view errors with your browser.
Kent has an API letting you build integration tests that create the errors and
then fetch them and assert things against them.
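An integration test might trigger an error and then assert against what Kent captured. A rough sketch with hypothetical endpoint paths (check the Kent README for the actual routes):

import requests

# These endpoint paths are assumptions--the real route names may differ
resp = requests.get("http://localhost:8000/api/errorlist/")
error_ids = resp.json()["errors"]

error = requests.get(f"http://localhost:8000/api/error/{error_ids[0]}").json()
assert "exception" in error["payload"]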
What questionable architectural decisions did you make?
I built it with Flask. Flask is great
for stuff like this--that part is fine.
The part that's less fine is that I decided to put in the least amount of
effort in standing it up as a service and putting it behind a real WSGI server,
so I'm (ab)using Flask's cli and monkeypatching werkzeug to not print out
"helpful" (but in this case--unhelpful) messages to the console.
I used pico.css because I read about it like
yesterday and it seemed easier to use that than to go fiddling with CSS
frameworks to get a really lovely looking site for a fake Sentry service.
I may replace that at some point with something that involves less horizontal
space.
I only wrote one test. I have testing set up, but only wrote one test to make
sure it's minimally viable. I may write more at some point.
I only tested with Python sentry-sdk. I figure if other people need it, they
can let me know what else it works with and we can fix any issues that come up.
I decided to store errors in memory rather than persist things to disk. That
was easy to do and seems like the right move. Maybe we'll hit something that
requires us to do something different.
I named it Kent. I like short names. Friends suggested I name it Caerbannog
because it was a sentry of a sort. I love that name, but I can't reliably spell
it.
0.1.0 released!
I thought about making this 1.0.0, but then decided to put it into the world
and use it for a bit and fix any issues that come up and then release 1.0.0.
Initial release with minimally viable feature set.
I say that in a lot of my posts. "Let me know how this helps you!" or "Comment
by sending me an email!" or something like that. I occasionally get a
response--usually from Sumana--but most often, it's me talking to the void. I do an awful lot of work that theoretically positively affects thousands of people, only to be constantly talking to the void.
Let me know if you have positive or negative feelings about Kent by:
2020h1 was rough and 2020h2 was not to be outdone. 2021h1
was worse in a lot of ways, but I got really lucky and a bunch of things
happened that made 2021h2 much better. I'll talk a bit more about that towards
the end.
But this post isn't about stymying the corrosion of multi-year burnout--it's a
dizzying retrospective of Socorro engineering in 2021.
reduces the risk of a Mozilla Symbols Server outage, which affects symbol uploads from the build system
improves maintainability of the symbolication service by offloading parsing of Breakpad symbols files and symbol lookups to an external library that's used by other tools in the crash reporting ecosystem at Mozilla
opens up possible futures around supporting inline functions and using other debug file types
The symbols service provides three main APIs:
symbol upload API for uploading and storing debugging symbols generated by build systems for products like Firefox, Fenix, etc
download API for downloading symbols which could be located in a variety of different places supporting tools like Visual Studio, stackwalkers, profilers, symbolicators, etc
symbolication API for finding symbols for memory addresses
It also has a webapp for querying symbols, debugging symbols problems, managing
API tokens, and granting permissions for uploading symbols.
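As an illustration of the symbolication API's shape, here's a sketch based on my recollection of the v5 request format (the module name, debug id, and addresses are made up):

import requests

# Each stack frame is a (module index, module offset) pair; the
# memoryMap lists (debug filename, debug id) pairs. Values are made up.
payload = {
    "jobs": [
        {
            "memoryMap": [["xul.pdb", "44E4EC8C2F41492B9369D6B9A059577C2"]],
            "stacks": [[[0, 0x1010A]]],
        }
    ]
}
resp = requests.post(
    "https://symbolication.services.mozilla.com/symbolicate/v5",
    json=payload,
)
print(resp.json())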
All of those functions are currently handled by a single webapp service.
There are a few problems here.
First, we want to reduce the risk of an outage for uploading symbols. When we have service outages, the build systems can't upload symbols. They retry really hard, which increases the build times for Firefox and other products. On top of that, if the build system doesn't successfully upload symbols, any crashes in tests or channels result in unsymbolicated stacks, which obscures the details of the crash.
There are several projects waiting in the wings to dramatically increase their use of the symbolication API, which increases the likelihood of an outage with the service that affects symbol uploads.
Second, the existing symbolication API implementation is an independent implementation of sym file parsing, lookups, and symbolication. Whenever we make adjustments to how sym files are built or structured, or to the lookup algorithms, we have to additionally update the symbolication API code.
Mozilla is in the process of rewriting crash reporting related code in Rust. It behooves us greatly to switch from our independent implementation to a shared library.
Third, the symbolication API is missing some critical features like support
for line numbers and inline functions. The existing code can't be extended
to support either line numbers or inline functions--we need to rewrite it.
In September of 2020, I embarked on a project to break out the symbolication
API as a separate microservice and implement it using the Symbolic library. That had the following effects:
eases the risk of outage due to increasing usage of the symbolication API,
adds support for line numbers and sets us up for adding support for inline
functions, and
reduces the maintenance work because we'll be using a library used by other
parts of the crash reporting ecosystem