Open Source Project Maintenance 2025

Published: Tuesday October 28, 2025, Will Kahn-Greene

Every October, I do a maintenance pass on all my projects. At a minimum, that involves dropping support for whatever Python version is no longer supported and adding support for the most recently released Python version. While doing that, I go through the issue tracker, answer questions, and fix whatever I can fix. Then I release new versions. Then I think about which projects I should deprecate and figure out a deprecation plan for them.

This post covers the 2025 round.

TL;DR

sphinx-js -- transferred to pyodide organization
crashstats-tools and siggen -- transferred to the Mozilla crash ingestion team, which I'm no longer on
paul-mclendahand -- deprecated and archived
pip-stale -- deprecated and archived
everett -- released v3.5.0, then deprecated and archived
fillmore -- released v2.2.0, then deprecated and archived
kent -- released v2.2.0
markus -- released v5.2.0
bleach -- released v6.3.0

Switching from pyenv to uv

Published: Thursday September 12, 2024, Will Kahn-Greene

Premise

The 0.4.0 release of uv does everything I currently do with pip, pyenv, pipx, pip-tools, and pipdeptree. Because of that, I'm in the process of switching to uv.

This blog post covers switching from pyenv to uv.

History

2024-08-29: Initial writing.
2024-09-12: Minor updates and publishing.
2024-09-20: Rename uv-sync (which is confusing) to uv-python-symlink.

Start state

I'm running Ubuntu Linux 24.04. I have pyenv installed using the automatic installer. pyenv is located in $HOME/.pyenv/bin/.

I have the following Pythons installed with pyenv:

$ pyenv versions
  system
  3.7.17
  3.8.19
  3.9.19
* 3.10.14 (set by /home/willkg/mozilla/everett/.python-version)
  3.11.9
  3.12.3

I'm not sure why I have 3.7 still installed. I don't think I use that for anything.

My default version is 3.10.14 for some reason. I'm not sure why I haven't updated that to 3.12, yet.

In my 3.10.14, I have the following Python packages installed:

$ pip freeze
appdirs==1.4.4
argcomplete==3.1.1
attrs==22.2.0
cffi==1.15.1
click==8.1.3
colorama==0.4.6
diskcache==5.4.0
distlib==0.3.8
distro==1.8.0
filelock==3.14.0
glean-parser==6.1.1
glean-sdk==50.1.4
Jinja2==3.1.2
jsonschema==4.17.3
MarkupSafe==2.0.1
MozPhab==1.5.1
packaging==24.0
pathspec==0.11.0
pbr==6.0.0
pipx==1.5.0
platformdirs==4.2.1
pycparser==2.21
pyrsistent==0.19.3
python-hglib==2.6.2
PyYAML==6.0
sentry-sdk==1.16.0
stevedore==5.2.0
tomli==2.0.1
userpath==1.8.0
virtualenv==20.26.2
virtualenv-clone==0.5.7
virtualenvwrapper==6.1.0
yamllint==1.29.0

That probably means I installed the following in the Python 3.10.14 Python environment:

MozPhab
pipx
virtualenvwrapper

Maybe I installed some other things for some reason lost in the sands of time.

Then I had a whole bunch of things installed with pipx.

I have many open source projects all of which have a .python-version file listing the Python versions the project uses.

I think that covers the start state.

Steps

First, I made a list of things I had.

I listed all the versions of Python I have installed so I know what I need to reinstall with uv.
```
$ pyenv versions
```
I listed all the packages I have installed in my 3.10.14 environment (the default one).
```
$ pip freeze
```
I listed all the packages I installed with pipx.
```
$ pipx list
```

I uninstalled all the packages I installed with pipx.

$ pipx uninstall PACKAGE

Then I uninstalled pyenv and everything it uses. I followed the pyenv uninstall instructions:

$ rm -rf $(pyenv root)

Then I removed the bits in my shell that add to the PATH and set up pyenv and virtualenvwrapper.

Then I started a new shell that didn't have all the pyenv and virtualenvwrapper stuff in it.

Then I installed uv using the uv standalone installer.

Then I ran uv --version to make sure it was installed.

Then I installed the shell autocompletion.

$ echo 'eval "$(uv generate-shell-completion bash)"' >> ~/dotfiles/bash.d/20-uv.bash

Then I started a new shell to pick up those changes.

Then I installed Python versions:

$ uv python install 3.8 3.9 3.10 3.11 3.12
Searching for Python versions matching: Python 3.10
Searching for Python versions matching: Python 3.11
Searching for Python versions matching: Python 3.12
Searching for Python versions matching: Python 3.8
Searching for Python versions matching: Python 3.9
Installed 5 versions in 8.14s
 + cpython-3.8.19-linux-x86_64-gnu
 + cpython-3.9.19-linux-x86_64-gnu
 + cpython-3.10.14-linux-x86_64-gnu
 + cpython-3.11.9-linux-x86_64-gnu
 + cpython-3.12.5-linux-x86_64-gnu

When I type "python", I want it to be a Python managed by uv. Also, I like having "pythonX.Y" symlinks, so I created a uv-python-symlink script which creates symlinks to uv-managed Python versions:

https://github.com/willkg/dotfiles/blob/main/dotfiles/bin/uv-python-symlink
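The linked script isn't reproduced here. As a rough illustration of the idea only, here is a minimal Python sketch. It assumes install directories are named like the ones uv printed above (for example cpython-3.12.5-linux-x86_64-gnu) and that each contains bin/pythonX.Y. The function names plan_symlinks and apply_symlinks are hypothetical and not taken from the real script.

```python
import re
from pathlib import Path

# Assumed naming scheme for uv-managed installs, based on the names uv
# printed above, e.g. "cpython-3.12.5-linux-x86_64-gnu".
INSTALL_RE = re.compile(r"^cpython-(\d+)\.(\d+)\.(\d+)-")


def plan_symlinks(install_root: Path) -> dict[str, Path]:
    """Map link names ("python3.12", "python") to interpreter paths."""
    newest: dict[tuple[int, int], tuple[int, Path]] = {}
    for entry in install_root.iterdir():
        match = INSTALL_RE.match(entry.name)
        if not match or not entry.is_dir():
            continue
        major, minor, patch = (int(part) for part in match.groups())
        # Keep only the highest patch release for each X.Y.
        if (major, minor) not in newest or patch > newest[(major, minor)][0]:
            newest[(major, minor)] = (patch, entry)

    plan: dict[str, Path] = {}
    for (major, minor), (_, entry) in sorted(newest.items()):
        # Assumption: the interpreter lives at bin/pythonX.Y in the install.
        target = entry / "bin" / f"python{major}.{minor}"
        plan[f"python{major}.{minor}"] = target
        # Sorted ascending, so the last assignment is the latest version.
        plan["python"] = target
    return plan


def apply_symlinks(plan: dict[str, Path], bin_dir: Path) -> None:
    """Create or replace the symlinks in bin_dir."""
    bin_dir.mkdir(parents=True, exist_ok=True)
    for name, target in plan.items():
        link = bin_dir / name
        if link.is_symlink() or link.exists():
            link.unlink()
        link.symlink_to(target)
```

To try it, pass the directory that `uv python dir` prints as install_root and a directory on your PATH as bin_dir.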

Then I installed all my tools using uv tool install.

$ uv tool install PACKAGE

For tox, I had to install the tox-uv package in the tox environment:

$ uv tool install --with tox-uv tox

Now I've got everything I do mostly working.

So what does that give me?

I installed uv and I can upgrade uv using uv self update.

Python interpreters are managed using uv python. I can create symlinks to interpreters using the uv-python-symlink script. Adding new interpreters and removing old ones is pretty straightforward.

When I type python, it opens up a Python shell with the latest uv-managed Python version. I can type pythonX.Y and get specific shells.

I can use tools written in Python and manage them with uv tool including ones where I want to install them in an "editable" mode.

I can write scripts that require dependencies and it's a lot easier to run them now.

I can create and manage virtual environments with uv venv.

Next steps

Delete all the .python-version files I've got.

Update documentation for my projects and add a uv tool install PACKAGE option to installation instructions.

Probably discover some additional things to add to this doc.

Thanks

Thank you to the Astral crew who wrote uv.

Thank you to Rob Hudson who goaded me into posting this finally rather than sit on it another month.

crashstats-tools v2.0.0 released!

Published: Thursday April 25, 2024, Will Kahn-Greene

What is it?

crashstats-tools is a set of command-line tools for working with Crash Stats (https://crash-stats.mozilla.org/).

crashstats-tools comes with four commands:

supersearch: for performing Crash Stats Super Search queries
supersearchfacet: for performing aggregations, histograms, and cardinality Crash Stats Super Search queries
fetch-data: for fetching raw crash, dumps, and processed crash data for specified crash ids
reprocess: for sending crash report reprocess requests

v2.0.0 released!

There have been a lot of improvements since the last blog post for the v1.0.1 release. New commands, new features, improved cli ui, etc.

v2.0.0 focused on two major things:

improving supersearchfacet to support nested aggregation, histogram, and cardinality queries
moving some of the code into a crashstats_tools.libcrashstats module, which makes it easier to use as a library

Improved supersearchfacet

The other day, Alex and team finished up the crash reporter Rust rewrite. The crash reporter rewrite landed and is available in Firefox, nightly channel, where build_id >= 20240321093532.

The crash reporter is one of the clients that submits crash reports to Socorro which is now maintained by the Observability Team. Firefox has multiple crash reporter clients and there are many ways that crash reports can get submitted to Socorro.

One of the changes we can see in the crash report data now is the change in the User-Agent header. The rewritten crash reporter sends a header of crashreporter/1.0.0. That gets captured by the collector and put in the raw crash metadata.user_agent field. It doesn't get indexed, so we can't search on it directly.

We can get a sampling of the last 100 crash reports, download the raw crash data, and look at the user agents.

$ supersearch --num=100 --product=Firefox --build_id='>=20240321093532' \
    --release_channel=nightly > crashids.txt
$ fetch-data --raw --no-dumps --no-processed crashdata < crashids.txt
$ jq .metadata.user_agent crashdata/raw_crash/*/* | sort | uniq -c
     16 "crashreporter/1.0.0"
      2 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:127.0) Gecko/20100101 Firefox/127.0"
      1 "Mozilla/5.0 (Windows NT 10.0; rv:127.0) Gecko/20100101 Firefox/127.0"
      2 "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0"
     63 "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:127.0) Gecko/20100101 Firefox/127.0"
      1 "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0"
     12 "Mozilla/5.0 (X11; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
      3 "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"

16 out of 100 crash reports were submitted by the new crash reporter. We were surprised there are so many Firefox user agents. We discussed this on Slack. I loosely repeat it here because it's a great way to show off some of the changes of supersearchfacet in v2.0.0.
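The jq, sort, and uniq pipeline above can also be written in Python if you want to do more with the tallies. This is a minimal sketch, assuming the raw crash files are JSON documents laid out to match the crashdata/raw_crash/*/* glob used above. count_user_agents is a hypothetical helper, not part of crashstats-tools.

```python
import json
from collections import Counter
from pathlib import Path


def count_user_agents(raw_crash_dir: Path) -> Counter:
    """Tally metadata.user_agent across raw crash files."""
    counts: Counter = Counter()
    # Mirrors the crashdata/raw_crash/*/* glob used with jq above.
    for path in sorted(raw_crash_dir.glob("*/*")):
        if not path.is_file():
            continue
        raw_crash = json.loads(path.read_text())
        # Missing values are counted under None, like jq's null.
        user_agent = (raw_crash.get("metadata") or {}).get("user_agent")
        counts[user_agent] += 1
    return counts
```

Calling count_user_agents(Path("crashdata/raw_crash")).most_common() gives the user agents ordered by count.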

First, the rewritten crash reporter only affects the parent (aka main) process. The other processes have different crash reporters that weren't rewritten.

How many process types are there for Firefox crash reports in the last week? We can see that in the ProcessType annotation (docs) which is processed and saved in the process_type field (docs).

$ supersearchfacet --product=Firefox --build_id='>=20240321093532' --release_channel=nightly \
    --_facets=process_type
process_type
 process_type | count
--------------|-------
 content      | 3664
 parent       | 2323
 gpu          | 855
 utility      | 225
 rdd          | 60
 plugin       | 18
 socket       | 2
 total        | 7147

Judging by that output, I would expect to see a higher percentage of crashreporter/1.0.0 in our sampling of 100 crash reports.
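To put a number on that expectation, here is the arithmetic as a short Python snippet. The counts are copied from the facet output above. Treating every parent-process crash report as a candidate for the rewritten crash reporter is the simplifying assumption being checked here.

```python
# Counts copied from the process_type facet output above.
process_type_counts = {
    "content": 3664,
    "parent": 2323,
    "gpu": 855,
    "utility": 225,
    "rdd": 60,
    "plugin": 18,
    "socket": 2,
}

total = sum(process_type_counts.values())
parent_share = process_type_counts["parent"] / total

print(f"total: {total}")                    # total: 7147
print(f"parent share: {parent_share:.1%}")  # parent share: 32.5%
```

Parent-process crash reports are about a third of the total, yet only 16 of the 100 sampled crash reports had the new user agent.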

Turns out that Firefox uses different code to submit crash reports not just by process type, but also by user action. That's in the SubmittedFrom annotation (docs) which is processed and saved in the submitted_from field (docs).

$ supersearchfacet --product=Firefox --build_id='>=20240321093532' --release_channel=nightly \
    --_facets=submitted_from
submitted_from
 submitted_from | count
----------------|-------
 Auto           | 3477
 Client         | 1741
 CrashedTab     | 928
 Infobar        | 792
 AboutCrashes   | 209
 total          | 7147

What is "Auto"? The user can opt-in to auto-send crash reports. When Firefox upgrades and this setting is set, then Firefox will auto-send any unsubmitted crash reports. The nightly channel has two updates a day, so there's lots of opportunity for this event to trigger.

What're the counts for submitted_from/process_type pairs?

$ supersearchfacet --product=Firefox --build_id='>=20240321093532' --release_channel=nightly \
    --_aggs.process_type=submitted_from
process_type / submitted_from
 process_type / submitted_from | count
-------------------------------|-------
 content / Auto                | 2214
 content / CrashedTab          | 926
 content / Infobar             | 399
 content / AboutCrashes        | 125
 parent / Client               | 1741
 parent / Auto                 | 450
 parent / Infobar              | 107
 parent / AboutCrashes         | 25
 gpu / Auto                    | 565
 gpu / Infobar                 | 236
 gpu / AboutCrashes            | 54
 utility / Auto                | 198
 utility / Infobar             | 25
 utility / AboutCrashes        | 2
 rdd / Auto                    | 34
 rdd / Infobar                 | 23
 rdd / AboutCrashes            | 3
 plugin / Auto                 | 14
 plugin / CrashedTab           | 2
 plugin / Infobar              | 2
 socket / Auto                 | 2
 total                         | 7147
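If you're working with nested aggregation results in your own scripts, it can help to flatten them into pairs like the table above. This is a minimal sketch. The response shape (buckets with "term", "count", and a nested "facets" key) is my assumption about what the Super Search API returns, and flatten_nested_facet is a hypothetical helper, so check both against a real response before relying on this.

```python
def flatten_nested_facet(facets: dict, outer: str, inner: str) -> list[tuple[str, str, int]]:
    """Turn nested facet buckets into (outer term, inner term, count) rows."""
    rows = []
    for outer_bucket in facets.get(outer, []):
        # Assumed shape: each outer bucket carries its own "facets" dict.
        inner_buckets = outer_bucket.get("facets", {}).get(inner, [])
        for inner_bucket in inner_buckets:
            rows.append((outer_bucket["term"], inner_bucket["term"], inner_bucket["count"]))
    return rows


# Example data built by hand from two rows of the table above.
example = {
    "process_type": [
        {
            "term": "parent",
            "count": 2323,
            "facets": {
                "submitted_from": [
                    {"term": "Client", "count": 1741},
                    {"term": "Auto", "count": 450},
                    {"term": "Infobar", "count": 107},
                    {"term": "AboutCrashes", "count": 25},
                ]
            },
        },
        {
            "term": "socket",
            "count": 2,
            "facets": {"submitted_from": [{"term": "Auto", "count": 2}]},
        },
    ]
}

for outer_term, inner_term, count in flatten_nested_facet(example, "process_type", "submitted_from"):
    print(f"{outer_term} / {inner_term} | {count}")
```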

We can spot check these different combinations to see what the user-agent looks like.

For brevity, we'll just look at parent / Client in this blog post.

$ supersearch --num=100 --product=Firefox --build_id='>=20240321093532' --release_channel=nightly \
    --process_type=parent --submitted_from='~Client' > crashids_clarified.txt
$ fetch-data --raw --no-dumps --no-processed crashdata_clarified < crashids_clarified.txt
$ jq .metadata.user_agent crashdata_clarified/raw_crash/*/* | sort | uniq -c
    100 "crashreporter/1.0.0"

Seems like the crash reporter rewrite only affects crash reports where ProcessType=parent and SubmittedFrom=Client. All the other process_type/submitted_from combinations get submitted a different way where the user agent is the browser itself.

How many crash reports has the new crash reporter submitted over time?

$ supersearchfacet --_histogram.date=product --_histogram.interval=1d --denote-weekends \
    --date='>=2024-03-20' --date='<=2024-04-25' \
    --release_channel=nightly --product=Firefox --build_id='>=20240321093532' \
    --submitted_from='~Client' --process_type=parent
histogram_date.product
 histogram_date | Firefox | total
----------------|---------|-------
 2024-03-21     | 58      | 58
 2024-03-22     | 124     | 124
 2024-03-23 **  | 189     | 189
 2024-03-24 **  | 289     | 289
 2024-03-25     | 202     | 202
 2024-03-26     | 164     | 164
 2024-03-27     | 199     | 199
 2024-03-28     | 187     | 187
 2024-03-29     | 188     | 188
 2024-03-30 **  | 155     | 155
 2024-03-31 **  | 146     | 146
 2024-04-01     | 201     | 201
 2024-04-02     | 226     | 226
 2024-04-03     | 236     | 236
 2024-04-04     | 266     | 266
 2024-04-05     | 259     | 259
 2024-04-06 **  | 227     | 227
 2024-04-07 **  | 214     | 214
 2024-04-08     | 259     | 259
 2024-04-09     | 257     | 257
 2024-04-10     | 223     | 223
 2024-04-11     | 250     | 250
 2024-04-12     | 235     | 235
 2024-04-13 **  | 154     | 154
 2024-04-14 **  | 162     | 162
 2024-04-15     | 207     | 207
 2024-04-16     | 201     | 201
 2024-04-17     | 346     | 346
 2024-04-18     | 270     | 270
 2024-04-19     | 221     | 221
 2024-04-20 **  | 190     | 190
 2024-04-21 **  | 183     | 183
 2024-04-22     | 266     | 266
 2024-04-23     | 303     | 303
 2024-04-24     | 308     | 308

There are more examples in the crashstats-tools README.

crashstats_tools.libcrashstats library

Starting with v2.0.0, you can use crashstats_tools.libcrashstats as a library for Python scripts.

For example:

from crashstats_tools.libcrashstats import supersearch

results = supersearch(params={"_columns": "uuid"}, num_results=100)

for result in results:
    print(f"{result}")

libcrashstats makes using the Crash Stats API a little more ergonomic.

See the crashstats_tools.libcrashstats library documentation.

Be thoughtful about using data

Make sure to use these tools in compliance with our data policy:

https://crash-stats.mozilla.org/documentation/protected_data_access/

Where to go for more

See the project on GitHub, which includes the README (with everything about the project, including usage examples), the issue tracker, and the source code:

https://github.com/willkg/crashstats-tools

Let me know whether this helps you!

Observability Team Newsletter (2024q1)

Published: Tuesday April 16, 2024, Will Kahn-Greene

Observability Team is a team dedicated to the problem domain and discipline of Observability at Mozilla.

We own, manage, and support monitoring infrastructure and tools supporting Mozilla products and services. Currently this includes Sentry and crash ingestion related services (Crash Stats (Socorro), Mozilla Symbols Server (Tecken), and Mozilla Symbolication Service (Eliot)).

In 2024, we'll be working with SRE to take over other monitoring services they are currently supporting like New Relic, InfluxDB/Grafana, and others.

This newsletter covers an overview of 2024q1. Please forward it to interested readers.

Highlights

🤹 Observability Services: Change in user support
🏆 Sentry: Change in ownership
‼️ Sentry: Please don't start new trials
⏲️ Sentry: Cron monitoring trial ending April 30th
⏱️ Sentry: Performance monitoring pilot
🤖 Socorro: Improvements to Fenix support
🐛 Socorro: Support guard page access information

See details below.

Blog posts

None this quarter.

Detailed project updates

Observability Services: Change in user support

We overhauled our pages in Confluence, started an #obs-help Slack channel, created a new Jira OBSHELP project, built out a support rotation, and leveled up our ability to do support for Observability-owned services.

See our User Support Confluence page for:

where to get user support
documentation for common tasks (get protected data access, create a Sentry team, etc)
self-serve instructions

Hop in #obs-help in Slack to ask for service support, help with monitoring problems, and advice.

Sentry: Change in ownership

The Observability team now owns Sentry service at Mozilla!

We successfully completed Phase 1 of the transition in Q1. If you're a member of the Mozilla Sentry organization, you should have received a separate email about this to the sentry-users Google group.

We've overhauled Sentry user support documentation to improve it in a few ways:

easier to find "how to" articles for common tasks
best practices to help you set up and configure Sentry for your project needs

Check out our Sentry user guide.

There's still a lot that we're figuring out, so we appreciate your patience and cooperation.

Sentry: Please don't start new trials

Sentry sends marketing and promotional emails to Sentry users which often include links to start a new trial. Please contact us before starting any new feature trials in Sentry.

Starting new trials may prevent us from trialing those features in the future when we’re in a better position to evaluate the feature. There's no way for admins to prevent users from starting a trial.

Sentry: Cron monitoring trial ending April 30th

The Cron Monitoring trial that was started a couple of months ago will end April 30th.

Based on feedback so far and other factors, we will not be enabling this feature once the trial ends.

This is a good reminder to build in redundancy in your monitoring systems. Don't rely solely on trial or pilot features for mission critical information!

Once the trial is over, we'll put together an evaluation summary.

Sentry: Performance monitoring pilot

Performance Monitoring is being piloted by a couple of teams; it is not currently available for general use.

In the meantime, if you are not one of these pilot teams, please do not use Performance Monitoring. There is a shared transaction event quota for the entire Mozilla Sentry organization. Once we hit that quota, events are dropped.

If you have questions about any of this, please reach out.

Once the trial is over, we'll put together an evaluation summary.

Socorro: Improvements to Fenix support

We worked on improvements to crash ingestion and the Crash Stats site for the Fenix project:

1812771: Fenix crash reporter's Socorro crash reports for Java exceptions have "Platform" = "Unknown" instead of "Android"

Previously, the platform would be "Unknown". Now the platform for Fenix crash reports is "Android". Further, the platform_pretty_version includes the Android ABI version.

Figure 1: Screenshot of Crash Stats Super Search results showing Android versions for crash reports.

1819628: reject crash reports for unsupported Fenix forks

Forks of Fenix outside of our control periodically send large swaths of crash reports to Socorro. When these sudden spikes happened, Mozillians would spend time looking into them only to discover they're not related to our code or our users. This is a waste of our time and resources.

We implemented support for the Android_PackageName crash annotation and added a throttle rule to the collector to drop crash reports from any non-Mozilla releases of Fenix.

From 2024-01-18 to 2024-03-31, Socorro accepted 2,072,785 Fenix crash reports for processing and rejected 37,483 unhelpful crash reports with this new rule. That's roughly 1.8%. That's not a huge amount, but because they sometimes come in bursts with the same signature, they show up in Top Crashers, wasting investigation time.
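To make the shape of such a rule concrete, here is a minimal Python sketch of a package-name check. It is not the collector's actual throttle rule. The allow-list contents, the use of the ProductName annotation to spot Fenix reports, the choice to let reports without Android_PackageName through, and the should_reject name are all assumptions for illustration.

```python
# Hypothetical allow-list; the package names Socorro actually accepts are
# not given in this newsletter.
ALLOWED_PACKAGE_NAMES = frozenset(
    {
        "org.mozilla.fenix",
        "org.mozilla.firefox",
        "org.mozilla.firefox_beta",
    }
)


def should_reject(annotations: dict) -> bool:
    """Return True if a Fenix crash report looks like it's from a fork."""
    # Assumption: Fenix crash reports are identified by ProductName.
    if annotations.get("ProductName") != "Fenix":
        return False
    package_name = annotations.get("Android_PackageName")
    # Assumption: reports without the annotation are let through so that
    # older clients that don't send it aren't dropped.
    if package_name is None:
        return False
    return package_name not in ALLOWED_PACKAGE_NAMES
```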

1884041: fix create-a-bug links to work with java_exception

A long time ago, in an age partially forgotten, Fenix crash reports from a crash in Java code would send a crash report with a JavaStackTrace crash annotation. This crash annotation was a string representation of the Java exception. As such, it was difficult-to-impossible to parse reliably.

In 2020, Roger Yang and Will Kahn-Greene spec'd out a new JavaException crash annotation. The value is a JSON-encoded structure mirroring what Sentry uses for exception information. This structure provides more information than the JavaStackTrace crash annotation did and is much easier to work with because we don't have to parse it first.

Between 2020 and now, we have been transitioning from crash reports that only contained a JavaStackTrace to crash reports that contained both a JavaStackTrace and a JavaException. Once all Fenix crash reports from crashes in Java code contained a JavaException, we could transition Socorro code to use the JavaException value for Crash Stats views, signature generation, generate-create-bug-url, and other things.

Recently, Fenix dropped the JavaStackTrace crash annotation. However, we hadn't yet gotten to updating Socorro code to use--and prefer--the JavaException values. This broke the ability to generate a bug for a Fenix crash with the needed data added to the bug description. Work on bug 1884041 fixed that.

Comments for Fenix Java crash reports went from:

Crash report: https://crash-stats.mozilla.org/report/index/eb6f852b-4656-4cf5-8350-fd91a0240408

to:

Crash report: https://crash-stats.mozilla.org/report/index/eb6f852b-4656-4cf5-8350-fd91a0240408

Top 10 frames:

0  android.database.sqlite.SQLiteConnection  nativePrepareStatement  SQLiteConnection.java:-2
1  android.database.sqlite.SQLiteConnection  acquirePreparedStatement  SQLiteConnection.java:939
2  android.database.sqlite.SQLiteConnection  executeForString  SQLiteConnection.java:684
3  android.database.sqlite.SQLiteConnection  setJournalMode  SQLiteConnection.java:369
4  android.database.sqlite.SQLiteConnection  setWalModeFromConfiguration  SQLiteConnection.java:299
5  android.database.sqlite.SQLiteConnection  open  SQLiteConnection.java:218
6  android.database.sqlite.SQLiteConnection  open  SQLiteConnection.java:196
7  android.database.sqlite.SQLiteConnectionPool  openConnectionLocked  SQLiteConnectionPool.java:503
8  android.database.sqlite.SQLiteConnectionPool  open  SQLiteConnectionPool.java:204
9  android.database.sqlite.SQLiteConnectionPool  open  SQLiteConnectionPool.java:196

This both fixes the bug and also vastly improves the bug comments from what we were previously doing with JavaStackTrace.
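For a sense of why the structured annotation is easier to work with, here is a minimal Python sketch that renders a frames list like the one above. The dictionary layout (exception, values, stacktrace, frames with module, function, filename, and lineno keys) is my assumption based on the Sentry-style structure mentioned earlier. The frame order is also assumed to already be most recent call first. format_top_frames is a hypothetical helper, not Socorro's code.

```python
def format_top_frames(java_exception: dict, limit: int = 10) -> str:
    """Render the first `limit` frames of a JavaException-like structure."""
    values = java_exception.get("exception", {}).get("values", [])
    if not values:
        return ""
    # Assumption: frames are already ordered most recent call first.
    frames = values[0].get("stacktrace", {}).get("frames", [])[:limit]
    lines = [f"Top {limit} frames:", ""]
    for index, frame in enumerate(frames):
        location = f"{frame.get('filename', '?')}:{frame.get('lineno', '?')}"
        lines.append(
            f"{index}  {frame.get('module', '?')}  {frame.get('function', '?')}  {location}"
        )
    return "\n".join(lines)
```

There's no string parsing involved, which is the main difference from working with JavaStackTrace.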

Between 2024-03-31 and 2024-04-06, there were 158,729 Fenix crash reports processed. Of those, 15,556 have the circumstances affected by this bug: they have a JavaException but don't have a JavaStackTrace. That's roughly 10% of incoming Fenix crash reports.

While working on this, we refactored the code that generates these crash report bugs, so it's in a separate module that's easier to copy and use in external systems in case others want to generate bug comments from processed crash data.

Further, we changed the code so that instead of dropping arguments in function signatures, it now truncates them at 80 characters.

We're hoping to improve signature generation for Java crashes using JavaException values in 2024q2. That work is tracked in bug #1541120.

Socorro: Support guard page access information

1830954: Expose crashes which were likely accessing a guard page

We updated the stackwalker to pick up the changes for determining is_likely_guard_page. Then we exposed that in crash reports in the has_guard_page_access field. We added this field to the Details tab in crash reports and made it searchable. We also added this to the signature report.

This helps us know if a crash is possibly due to a bug with memory access that could be a possible security vulnerability vector--something we want to prioritize fixing.

Since this field is security sensitive, it requires protected data access to view and search with.

Socorro misc

crashstats-tools 2.0.0 release
socorro-siggen 2.1.20240412 release
4 signature generation changes. Thank you Andrew McCreight and Jim Blandy!
Maintenance and documentation improvements.
6 production deploys. Created 71 issues. Resolved 61 issues.

Tecken/Eliot misc

Maintenance and documentation improvements.
5 production deploys. Created 21 issues. Resolved 28 issues.

More information

Find us:

Confluence page: Observability Team
User support hub: User Support
Support: #obs-help (Slack)
Crash ingestion: #crashreporting (Matrix)

Thank you for reading!

Observability Team Newsletter (2023q4)

Published: Friday December 22, 2023, Will Kahn-Greene

Observability Team is a team dedicated to the problem domain and discipline of Observability at Mozilla. We will own, manage, and support infrastructure and tools supporting Mozilla products and services. Currently this includes crash ingestion related services: Crash Stats and crash ingestion pipeline (Socorro), Mozilla Symbols Server (Tecken), and Mozilla Symbolication Service (Eliot). In 2024, we'll be working with SRE to take over many of the observability tools that they are currently supporting like Sentry, Grafana, New Relic, and others.

This newsletter covers an overview of 2023q4. Please forward it to interested readers.

Highlights

🎉 Team Changes: Socorro Engineering becomes Observability Team and picks up new members.
📄 Documentation: Overhauled support documentation for crash ingestion services.
❤️‍🩹 Socorro: Stability: Fixed ongoing Socorro processor stability problem. [bug 1795017]
🏆 Socorro: Code-info lookup: Implemented code-info lookup for symbols files. [bug 1746940] [Retro]
🔒 Tecken: Removed private symbols bucket support. [bug 1843356]
📚 Tecken: Removed missing symbols bookkeeping. [bug 1774004]
📱 Evolving SRE: Took over application support for crash ingestion services.

See details below.

Blog posts

Detailed project updates

Team changes

Prior to October, 2023, the Socorro Engineering team maintained crash ingestion systems and related services: Crash Stats and the crash ingestion pipeline, Mozilla Symbols Server, and Mozilla Symbolication Service.

In October 2023, that team picked up a couple of new people--Bianca and Sven--and changed names to become the Observability Team. In mid-December, Observability Team picked up a fourth teammate: Relud.

As we move into 2024, we expect to pick up other observability related services and work on service stability, support, and building out documentation of best practices across them supporting Mozilla products and services.

See our Confluence page for contact information, roughly what we're working on, how to do various things (add crash annotations, get protected data access, etc), and service/support documentation.

Overhauled support documentation for crash ingestion services

Documentation for crash ingestion services has been kind of all over the place. Going forward, we're working to make it clearer and easier to find.

We're moving some "how to" documentation into this tree in Confluence. Some interesting ones:

We'll add to that and improve it as time goes on. We'll be looking at centralizing API, tools, data dictionary, and other documentation over the next year as well.

If there are things you have questions about and can't find documentation for, please let us know.

Socorro: Fixed ongoing Socorro processor stability problem

In September 2022, Mozilla began adding inline function data into symbols files. This increased the size of symbols files significantly. For example, the symbols file for xul.dll files went from around 200 MB to 700 MB. The increase in file size increased the time it takes for the stackwalker to download and parse symbols files, reduced the number of files the processor could store in the on-disk symbols cache, and caused the processor instances to suddenly slow down in periods of high load. This in turn would cause the processing queue to back up and page SRE, causing work disruption as we scrambled to manually add more processor instances to increase throughput and reduce the queue.
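As a rough illustration of the cache effect, here is the arithmetic in Python. The 40,000 MB cache size is a made-up number for the example and is not Socorro's actual configuration. The 200 MB and 700 MB file sizes come from the xul.dll example above.

```python
def files_that_fit(cache_size_mb: int, symbols_file_mb: int) -> int:
    """Number of whole symbols files of a given size that fit in the cache."""
    return cache_size_mb // symbols_file_mb


CACHE_SIZE_MB = 40_000  # hypothetical cache size, for illustration only

before = files_that_fit(CACHE_SIZE_MB, 200)
after = files_that_fit(CACHE_SIZE_MB, 700)

print(f"before: {before} files")  # before: 200 files
print(f"after: {after} files")    # after: 57 files
```

Whatever the real cache size is, files 3.5 times larger mean the cache holds less than a third as many of them, so cache misses and downloads go up.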

We spent a lot of time analyzing the situation, adding new metrics, and rewriting portions of the code based on our theories at the time. We ended up with several mitigations that reduced the likelihood of the processing queue backing up, and we sat with that for several months while we worked on other things.

One of the first things the Observability Team did was revisit the issue. New minds brought new theories, one of which was to change the instance type to one with a local SSD. That eliminated the disk I/O throttling the processors were incurring from using EBS for the symbols cache.

Now the Socorro processors are performing much like they did prior to September 2022, we've removed all the mitigations we had in place, and the processor queue isn't backing up anymore during periods of high load due to increased crash report volume and reprocessing. [bug 1795017]

Socorro: Stackwalker will use code id when debug id isn't available to fetch the symbols file

This allows symbolication of stacks where the debug id for modules is unknown. This improves crash signatures. Better signatures give us better visibility into what crashes our users are encountering and how often.

For example, one of the problem signatures (#3 in Top Crashers at the time) looked like this:

OOM | large | mozalloc_abort | xul.dll | _PR_NativeRunThread | pr_root

and now looks like this:

OOM | large | mozalloc_abort | webrender::renderer::Renderer::render_impl

Rough estimate is that this significantly improved the crash signatures for 10k out of the 300k Firefox Windows crash reports we get a week.

See Code info lookup: retrospective (https://bluesock.org/~willkg/blog/mozilla/socorro_tecken_code_info_retro.html) for details. [bug 1746940]
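The fallback idea can be sketched in a few lines of Python. This is not the stackwalker's code. The module dictionary keys (debug_file, debug_id, code_file, code_id), the symbol_lookup_key name, and the example values are all hypothetical and only illustrate preferring the debug id and falling back to the code id.

```python
from typing import Optional


def symbol_lookup_key(module: dict) -> Optional[tuple[str, str]]:
    """Pick the (filename, id) pair to look up a module's symbols file with."""
    debug_file = module.get("debug_file")
    debug_id = module.get("debug_id")
    if debug_file and debug_id:
        # Preferred path: the debug id is known.
        return (debug_file, debug_id)
    code_file = module.get("code_file")
    code_id = module.get("code_id")
    if code_file and code_id:
        # Fallback path: no debug id, so use the code file and code id.
        return (code_file, code_id)
    # Nothing to look up with; the frames stay unsymbolicated.
    return None


# Placeholder ids, not real values.
print(symbol_lookup_key({"debug_file": "xul.pdb", "debug_id": "EXAMPLEDEBUGID"}))
print(symbol_lookup_key({"code_file": "xul.dll", "code_id": "EXAMPLECODEID"}))
```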

Socorro misc

socorro-siggen v2.0.20231009 release. [v2.0.20231009]
11 signature generation changes most of which were self-serve.
Lots of maintenance and documentation improvements.
11 production deploys. Created 61 issues. Resolved 58 issues.

Tecken: Remove support for private symbols bucket

The Mozilla Symbols Server stored uploaded symbols in several places: a default storage bucket for build symbols, a "try" storage bucket for symbols from try builds, and a private symbols bucket. Mozilla primarily used the private symbols bucket for Flash symbols. However, we don't support Flash anymore, so we removed the private symbols bucket and all the code to support it. Removing this simplified symbols upload/download code significantly. [bug 1843356]

Tecken: Remove missing symbols bookkeeping

Mozilla Symbols Server used to keep track of symbols that were requested but didn't exist in the symbols buckets. Tecken had an API for querying this data which was used for reporting on which symbols Mozilla is missing. This helps us understand which symbols files we're missing when unwinding and symbolicating stacks in crash ingestion.

There are better ways to get this data and keeping track of missing symbols in Tecken isn't helpful. We migrated users of this API and removed the data and code from Tecken. Removing this reduced the size of the database and simplified the download API code. [bug 1774004]

Tecken misc

fx-crash-sig v1.0.1 and v1.0.2 releases. [v1.0.1, v1.0.2]
Lots of maintenance and documentation improvements.
16 production deploys. Created 58 issues. Resolved 56 issues.

Prototyping Evolving SRE

In December, we finished the work to transition the application support role from Data SRE to the Observability Team, making us an engineering team that also owns application support for the services we maintain.

We've accrued a lot of experience in how to migrate from the separate Engineering team and SRE team model to the combined Engineering and SRE team model. If you're thinking about transitioning to a combined Engineering and SRE team model and have questions, come find us.

More information

Find us:

Confluence page: Observability Team
Matrix: #crashreporting

Thank you for reading!

Tecken: The long windy journey to reproducing a problem remove_orphaned_files fixes

Published: Thursday November 30, 2023, Will Kahn-Greene | share this (mastodon)

Summary

This post talks about a stability problem we have with Tecken wherein the instance runs out of disk and then becomes unhealthy, some work we're doing to make it better, and the steps to reproduce the problem in a local dev environment so we can test possible fixes.

Tecken/Socorro: Code info lookup: retrospective (2023)

Published: Monday October 30, 2023, Will Kahn-Greene | share this (mastodon)

Project

time:

6 weeks

impact:

improved visibility for 3% (10k / 300k) of Firefox crash reports from Windows users by fixing symbolication and signatures
better understanding of consequences from sampling Firefox / Windows < 8.1 / ESR crash reports

Summary

In November 2021, we wrote up a bug in the Tecken product to support downloading symbols files using the code file and code id.

In July 2023, Mozilla migrated users for Windows 7, 8, and 8.1 from Firefox release channel to ESR channel. Firefox / Windows / release is sampled by the Socorro collector, so the system only accepts and processes 10% of incoming crash reports. When the users were migrated, their crash reports moved to an unsampled group, so then we were getting 100% of those incoming crash reports. That caused a volume increase of 30k.
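
As a rough illustration of what sampling at the collector means, here's a minimal Python sketch. The rule structure, the `sample_rate` and `should_accept` names, and the report field names are assumptions made for illustration, not Socorro's actual collector code. Only the 10% rate for Firefox / Windows / release comes from the text above.

```python
import random

# Hypothetical sampling rules: (predicate, percent of reports to accept).
# The first matching rule wins; anything unmatched is accepted in full.
SAMPLING_RULES = [
    (
        lambda r: r.get("product") == "Firefox"
        and r.get("os") == "Windows"
        and r.get("channel") == "release",
        10,
    ),
]


def sample_rate(report):
    """Return the percentage of matching crash reports to accept."""
    for predicate, percent in SAMPLING_RULES:
        if predicate(report):
            return percent
    return 100


def should_accept(report, rng=random):
    """Decide whether the collector keeps this crash report."""
    return rng.random() * 100 < sample_rate(report)
```

Under rules like these, a crash report from a user who moved from the release channel to ESR no longer matches the 10% rule, so every one of their reports is accepted.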

While looking into adding a sampling rule for Firefox / Windows < 8.1 / ESR, I noticed many crash reports listed a xul module without a debug file and debug id. Because of that, the stackwalker isn't able to get symbols and we end up with crash reports with generic signatures that we have no visibility into.

I looked at [bug 1746940] and worked out how to fix it. I thought it would be relatively straight-forward to implement and it would solve our visibility problem, so I prioritized working on it with the assumption it'd take a week to do.

Work wasn't as straight-forward as I predicted--I hit a bunch of road bumps and it took me 6 weeks to work through several attempts, settle on a final architecture, implement it, test it, and push all the pieces to production. I finished the work on October 24th, 2023.

The end result is improved visibility for 3% of Firefox Windows crash reports and a reduction in crash reports with generic signatures because the stackwalker couldn't find the symbols file for xul.dll.

Socorro Engineering: 2022 retrospective

Published: Monday January 23, 2023, Will Kahn-Greene | share this (mastodon)

Summary

2022 took forever. At the same time, it kind of flew by. 2023 is already moving along, so this post is a month late. Here's the retrospective of Socorro engineering in 2022.

Bleach 6.0.0 release and deprecation

Published: Monday January 23, 2023, Will Kahn-Greene | share this (mastodon)

What is it?

Bleach is a Python library for sanitizing and linkifying text from untrusted sources for safe usage in HTML.

Bleach v6.0.0 released!

Bleach 6.0.0 cleans up some issues in linkify and with the way it uses html5lib so it's easier to reason about. It also adds support for Python 3.11 and cleans up the project infrastructure.

There are several backwards-incompatible changes, hence the 6.0.0 version.

https://bleach.readthedocs.io/en/latest/changes.html#version-6-0-0-january-23rd-2023

I did some rough testing with a corpus of Standup messages data and it looks like bleach.clean is slightly faster with 6.0.0 than 5.0.0.

Using Python 3.10.9:

5.0.0: bleach.clean on 58,630 items 10x: minimum 2.793s
6.0.0: bleach.clean on 58,630 items 10x: minimum 2.304s
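
A timing like the one above can be taken with the standard library's `timeit` module. This is a sketch, not the script I actually used: `time_clean` and the stand-in corpus are hypothetical, and the cleaning function is passed in so that whichever Bleach version is installed can be measured.

```python
import timeit


def time_clean(clean_func, items, repeat=10):
    """Run clean_func over every item, `repeat` times; return the minimum seconds."""

    def run_once():
        for item in items:
            clean_func(item)

    # number=1 means each timing covers exactly one full pass over the corpus.
    timings = timeit.repeat(run_once, repeat=repeat, number=1)
    return min(timings)


if __name__ == "__main__":
    import html

    # Stand-in corpus and cleaner; swap in bleach.clean and real messages.
    corpus = ["<b>status</b> update <script>alert(1)</script>"] * 1000
    print(f"minimum {time_clean(html.escape, corpus):.3f}s")
```

Taking the minimum of several passes reduces noise from whatever else the machine is doing at the time.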

The other big change in 6.0.0 is that I've deprecated the project and am planning to move to a minimum-maintenance mode for the foreseeable future.

Bleach is deprecated

Bleach sits on top of html5lib which is not actively maintained. It is increasingly difficult to maintain Bleach in that context and I think it's nuts to build a security library on top of a library that's not in active development.

Over the years, we've talked about other options:

find another library to switch to
take over html5lib development
fork html5lib and vendor and maintain our fork
write a new HTML parser
etc

With the exception of option 1, they greatly increase the scope of the work for Bleach. They all feel exhausting to me.

Given that, I think Bleach has run its course and this journey is over.

What happens now?

Possibilities:

Pass it to someone else?

No, I won't be passing Bleach to someone else to maintain. Bleach is a security-related library, so making a mistake when passing it to someone else would be a mess. I'm not going to do that.

Switch to an alternative?

I'm not aware of any alternatives to Bleach. I don't plan to work on coordinating the migration for everyone from Bleach to something else.

Oh my goodness--you're leaving us with nothing?

Sort of.

I'm going to continue doing minimal maintenance:

security updates
support for new Python versions
fixes for egregious bugs (begrudgingly)

I'll do that for at least a year. At some point, I'll stop doing that, too.

I think that gives the world enough time for either something to take Bleach's place, or for the sanitizing web api to kick in, or for everyone to come to the consensus that they never really needed Bleach in the first place.

/images/bleach_deprecation.thumbnail.jpg — Bleach. Tired. At the end of its journey.

Thanks!

Many thanks to Greg, who I worked with on Bleach for a long while and who maintained Bleach for several years. Working with Greg was always easy and his reviews were thoughtful and spot-on.

Many thanks to Jonathan who, over the years, provided a lot of insight into how best to solve some of Bleach's more squirrely problems.

Many thanks to Sam who was an indispensable resource on HTML parsing and sanitizing text in the context of HTML.

Where to go for more

For more specifics on this release, see here: https://bleach.readthedocs.io/en/latest/changes.html#version-6-0-0-january-23rd-2023

Documentation and quickstart here: https://bleach.readthedocs.io/en/latest/

Source code and issue tracker here: https://github.com/mozilla/bleach

Socorro: Schema-Based Overhaul of Crash Ingestion: Retrospective (2022)

Published: Wednesday January 18, 2023, Will Kahn-Greene | share this (mastodon)

Project

time:

2+ years

impact:

radically reduces the risk of data leaks due to misconfigured permissions
centralizes and simplifies configuration and management of fields
normalization and validation are performed during processing
documentation of data reviews, data caveats, etc.
reduces the risk of bugs when adding new fields—testing is done in CI
new crash reporting data dictionary with Markdown-formatted descriptions, real examples, and relevant links

Summary

I've been working on Socorro, the crash ingestion pipeline at Mozilla, since the beginning of 2016. During that time, I've focused on streamlining maintenance of the project, paying down technical debt, reducing risk, and improving crash analysis tooling.

Early on, I observed that the crash ingestion pipeline was difficult to reason about, poorly documented, and full of risk. What did the incoming data look like? What did the processed data look like? Was it valid? Which fields contained data anyone could look at? Which fields contained data that was sensitive and required access controls to access? How do we add support for new crash annotations? What happens when data is invalid or malformed? At a given point in the system, what did we know about the data?

In 2020, Socorro moved into the Data Org, which has multiple data pipelines. After spending some time looking at how their pipelines work, I decided to rework the crash ingestion pipeline to be schema-driven and to move data validation and normalization earlier in the processor.

The end result of this project is that:

the project is easier to maintain:
- adding support for new crash annotations is done in a couple of schema files and possibly a processor rule
risk of security issues and data breaches is lower:
- typos, bugs, and mistakes when adding support for a new crash annotation are caught in CI
- permissions are specified in a central location; changing permission for fields is trivial and takes effect in the next deploy; setting permissions supports complex data structures in easy-to-reason-about ways; and mistakes are caught in CI
the data is easier to use and reason about:
- normalization and validation of crash annotation data happens during processing, and downstream uses of the data can expect it to be valid; further, we get a signal when the data isn't valid, which can indicate product bugs
- schemas describing incoming and processed data
- a crash reporting data dictionary documenting incoming data fields, processed data fields, descriptions, sources, data gotchas, examples, and permissions

What is Socorro?

Socorro is the crash ingestion pipeline for Mozilla products like Firefox, Fenix, Thunderbird, and MozillaVPN.

When Firefox crashes, the crash reporter asks the user if they would like to send a crash report. If the user answers "yes!", the crash reporter collects data related to the crash, generates a crash report, and submits that crash report as an HTTP POST to Socorro. Socorro saves the submitted crash report, processes it, and has tools for viewing and analyzing crash data.

State of crash ingestion at the beginning

The crash ingestion system was working and it was usable, but it was in a bad state.

Poor data management

Normalization and validation of data was all over the codebase and not consistent:
- processor rule code
- AWS S3 crash storage code
- Elasticsearch indexing code
- Telemetry crash storage code
- Super Search querying and result rendering code
- report view and template code
- signature report code and template code
- crontabber job code
- any scripts that used the data
- tests -- many of which had bad test data so who knows what they were really testing
Naive handling of minidump stackwalker output meant that any changes in the stackwalker output were predominantly unnoticed, and there was no indication as to whether changed output created issues in the system.

Further, since it was all over the place, there were no guarantees for data validity when downloading it using the RawCrash, ProcessedCrash, and SuperSearch APIs. Anyone writing downstream systems would also have to normalize and validate the data.

Poor permissions management

Permissions were defined in multiple places:
- Elasticsearch json redactor
- Super Search fields
- RawCrash API allow list
- ProcessedCrash API allow list
- report view and template code
- Telemetry crash storage code
- and other places
We couldn't effectively manage the permissions of fields in the stackwalker output because we had no idea what was there.

Poor documentation

No documentation of crash annotation fields other than CrashAnnotations.yaml, which didn't enforce anything in crash ingestion (process, valid type, data correctness, etc.) and was missing important information like data gotchas, data review URLs, and examples.

No documentation of processed crash fields at all.

Making changes was high risk

Changing fields from public to protected was high risk because you had to find all the places it might show up, which was intractable. Adding support for new fields often took multiple passes over several weeks because we'd miss things. Server errors happened with some regularity due to weirdness with crash annotation values affecting the Crash Stats site.

Tangled concerns across the codebase

Lots of tangled concerns where things defined in one place affected other places that shouldn't be related. For example, the Super Search fields definition was acting as a "schema" for other parts of the system that had nothing to do with Elasticsearch or Super Search.

Difficult to maintain

It was difficult to support new products.

It was difficult to debug issues in crash ingestion and crash reporting.

The Crash Stats web app contained lots of if/then/else bits to handle weirdness in the crash annotation values. Nulls, incorrect types, different structures, etc.

Socorro contained lots of vestigial code from half-done field removal, deprecated fields, fields that were removed from crash reports, etc. These vestigial bits were all over the code base. Discovering and removing these bits was time consuming and error prone.

The code for exporting data to Telemetry built the export data using a list of fields to exclude rather than a list of fields to include. This is backward and impossible to maintain—we never should have been doing this. Further, it pulled data from the raw crash, for which we had no validation guarantees, which would cause issues downstream in the Telemetry import code.

There was no way to validate the data used in the unit tests, which meant that a lot of it was invalid. As a result, CI would pass, but we'd see errors in our stage and production environments.

Different from other similar systems

In 2020, Socorro was moved to the Data Org in Mozilla which had a set of standards and conventions for collecting, storing, analyzing, and providing access to data. Socorro didn't follow any of it, which made it difficult to work on, to connect with, and to staff. Things the Data Org has that Socorro didn't:
- a schema specifying fields, types, and documentation
- data flow documentation
- data review policy, process, and artifacts for data being collected and how to add new data
- a data dictionary for fields for users including documentation, data review URLs, and data gotchas

In summary, we had a system that took a lot of effort to maintain, wasn't serving our users' needs, and was at high risk of a security/data breach.

Project plan

Many of these issues can be alleviated and reduced by moving to a schema-driven system where we:

define a schema for annotations and a schema for the processed crash
change crash ingestion and the Crash Stats site to use those schemas

When designing this schema-driven system, we should be thinking about:

how easy is it to maintain the system?
how easy is it to explain?
how flexible is it for solving other kinds of problems in the future?
what kinds of errors will likely happen when maintaining the system, and how can we avert them in CI?
what kinds of errors can happen and how much risk do they pose for data leaks? what of those can we avert in CI?
how flexible is the system, which needs to support multiple products potentially with different needs?

I worked out a minimal version of that vision that we could migrate to and then work with going forward.

The crash annotations schema should define:

what annotations are in the crash report?
which permissions are required to view a field
field documentation (provenance, description, data review, related bugs, gotchas, analysis tips, etc)

The processed crash schema should define:

what's in the processed crash?
which permissions are required to view a field
field documentation (provenance, description, related bugs, gotchas, analysis tips, etc)

Then we make the following changes to the system:

write a processor rule to copy, normalize, and validate data from the raw crash based on the processed crash schema
switch the Telemetry export code to using the processed crash for data to export
switch the Telemetry export code to using the processed crash schema for permissions
switch Super Search to using the processed crash for data to index
switch Super Search to using the processed crash schema for documentation and permissions
switch the Crash Stats site to using the processed crash for data to render
switch the Crash Stats site to using the processed crash schema for documentation and permissions
switch the RawCrash, ProcessedCrash, and SuperSearch APIs to using the crash annotations and processed crash schemas for documentation and permissions

After doing that, we have:

field documentation is managed in the schemas
permissions are managed in the schemas
data is normalized and validated once in the processor and everything uses the processed crash data for indexing, searching, and rendering
adding support for new fields and changing existing fields is easier and problems are caught in CI

Implementation decisions

Use JSON Schema.

Data Org at Mozilla uses JSON Schema for schema specification. The schema is written using YAML.

https://mozilla.github.io/glean_parser/metrics-yaml.html

The metrics schema is used to define metrics.yaml files which specify the metrics being emitted and collected.

For example:

https://searchfox.org/mozilla-central/source/toolkit/mozapps/update/metrics.yaml

One long-term goal for Socorro is to unify standards and practices with the Data Ingestion system. Towards that goal, it's prudent to build out crash annotation and processed crash schemas using whatever we can take from the equivalent metrics schemas.

We'll also need to build out tooling for verifying, validating, and testing schema modifications to make ongoing maintenance easier.

Use schemas to define and drive everything.

We've got permissions, structures, normalization, validation, definition, documentation, and several other things related to the data and how it's used throughout crash ingestion spread out across the codebase.

Instead of that, let's pull it all together into a single schema and change the system to be driven from this schema.

The schema will include:

structure specification
documentation including data gotchas, examples, and implementation details
permissions
processing instructions

We'll have a schema for supported annotations and a schema for the processed crash.

We'll rewrite existing parts of crash ingestion to use the schema:

processing
1. use processing instructions to validate and normalize annotation data
super search
1. field documentation
2. permissions
3. remove all the normalization and validation code from indexing
crash stats
1. field documentation
2. permissions
3. remove all the normalization and validation code from page rendering

Only use processed crash data for indexing and analysis.

The indexing system has its own normalization and validation code since it pulls data to be indexed from the raw crash.

The crash stats code has its own normalization and validation code since it renders data from the raw crash in various parts of the site.

We're going to change this so that all normalization and validation happens during processing, the results are stored in the processed crash, and indexing, searching, and crash analysis only work on processed crash data.
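
One way to picture "normalize and validate once, during processing" is a single rule that walks the processed crash schema and copies annotations over from the raw crash. The schema layout, the `source_annotation` key, the field-to-annotation mapping, and the `normalize` and `copy_from_raw_crash` functions below are hypothetical and much simpler than Socorro's real schemas and processor rules.

```python
# Hypothetical, heavily simplified processed crash schema.
PROCESSED_CRASH_SCHEMA = {
    "release_channel": {"type": "string", "source_annotation": "ReleaseChannel"},
    "total_physical_memory": {
        "type": "integer",
        "source_annotation": "TotalPhysicalMemory",
    },
    "startup_crash": {"type": "boolean", "source_annotation": "StartupCrash"},
}


def normalize(value, type_name):
    """Convert a raw annotation string to the schema type or raise ValueError."""
    if type_name == "string":
        return str(value)
    if type_name == "integer":
        return int(value)
    if type_name == "boolean":
        if value in ("1", "true"):
            return True
        if value in ("0", "false"):
            return False
        raise ValueError(f"not a boolean: {value!r}")
    raise ValueError(f"unsupported type: {type_name}")


def copy_from_raw_crash(raw_crash, schema=PROCESSED_CRASH_SCHEMA):
    """Build processed crash fields from raw annotations; collect invalid ones."""
    processed, notes = {}, []
    for field, spec in schema.items():
        annotation = spec["source_annotation"]
        if annotation not in raw_crash:
            continue
        try:
            processed[field] = normalize(raw_crash[annotation], spec["type"])
        except ValueError as exc:
            # Invalid data is dropped, but noted so it can point at product bugs.
            notes.append(f"{annotation}: {exc}")
    return processed, notes
```

In this sketch, annotations that aren't in the schema never reach the processed crash, and anything downstream that only reads the processed crash can assume the values have the declared types.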

By default, all data is protected.

By default, all data is protected unless it is explicitly marked as public. This has some consequences for the code:

any data not specified in a schema is treated as protected
all schema fields need to specify permissions for that field
any data in a schema is either:
- marked public, OR
- lists the permissions required to view that data
for nested structures, any child field that is public has public ancestors

We can catch some of these issues in CI and need to write tests to verify them.

This is slightly awkward when maintaining the schema because it would be more reasonable to have "no permissions required" mean that the field is public. However, it's possible to accidentally not specify the permissions, and we don't want to be in that situation. Thus, we decided to go with explicitly marking public fields as public.
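
The rules above lend themselves to a CI check plus a small "reducer" that strips anything a user isn't allowed to see. The sketch below assumes a hypothetical nested schema where every node carries a `permissions` list and the string "public" marks public fields. The permission name `view_pii`, the field names, and all the function names are made up for illustration; the real Socorro schema format differs.

```python
PUBLIC = "public"

# Hypothetical schema: every node must carry an explicit "permissions" list.
SCHEMA = {
    "signature": {"permissions": [PUBLIC]},
    "user_comments": {"permissions": ["view_pii"]},
    "crash_info": {
        "permissions": [PUBLIC],
        "properties": {
            "type": {"permissions": [PUBLIC]},
            "address": {"permissions": ["view_pii"]},
        },
    },
}


def find_schema_problems(schema, parent_public=True, path=""):
    """Return a list of rule violations; an empty list means the schema is OK."""
    problems = []
    for name, node in schema.items():
        here = f"{path}.{name}" if path else name
        perms = node.get("permissions")
        if not perms:
            problems.append(f"{here}: permissions not specified")
        is_public = bool(perms) and PUBLIC in perms
        if is_public and not parent_public:
            problems.append(f"{here}: public field under a protected ancestor")
        problems.extend(
            find_schema_problems(node.get("properties", {}), is_public, here)
        )
    return problems


def can_view(node, user_permissions):
    """Public fields are visible to all; others need every listed permission."""
    perms = node.get("permissions") or []
    if not perms:
        return False  # unspecified permissions are treated as protected
    return PUBLIC in perms or set(perms) <= set(user_permissions)


def reduce_data(data, schema, user_permissions):
    """Keep only the fields the user may view; unknown fields are dropped."""
    reduced = {}
    for name, value in data.items():
        node = schema.get(name)
        if node is None or not can_view(node, user_permissions):
            continue  # not in the schema, or not permitted: protected
        if "properties" in node and isinstance(value, dict):
            value = reduce_data(value, node["properties"], user_permissions)
        reduced[name] = value
    return reduced
```

A test in CI can then assert that `find_schema_problems(SCHEMA)` returns an empty list, so a forgotten permission fails the build instead of quietly making a field public.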

Work done

Phase 1: cleaning up

We had a lot of work to do before we could start defining schemas and changing the system to use those schemas.

remove vestigial code (some of this work was done in other phases as it was discovered)
- [bug 1724933]: remove unused/obsolete annotations (2021-08)
- [bug 1743487]: remove total_frames (2021-11)
- [bug 1743704]: remove jit crash classifier (2022-02)
- [bug 1762000]: remove vestigial Winsock_LSP code (2022-03)
- [bug 1784485]: remove vestigial exploitability code (2022-08)
- [bug 1784095]: remove vestigial contains_memory_report code (2022-08)
- [bug 1787933]: exorcise flash things from the codebase (2022-09)
fix signature generation
- [bug 1753521]: use fields from processed crash (2022-02)
- [bug 1755523]: fix signature generation so it only uses processed crash data (2022-02)
- [bug 1762207]: remove hang_type (2022-04)
fix Super Search
- [bug 1624345]: stop saving random data to Elasticsearch crashstorage (2020-06)
- [bug 1706076]: remove dead Super Search fields (2021-04)
- [bug 1712055]: remove system_error from Super Search fields (2021-07)
- [bug 1712085]: remove obsolete Super Search fields (2021-08)
- [bug 1697051]: add crash_report_keys field (2021-11)
- [bug 1736928]: remove largest_free_vm_block and tiny_block_size (2021-11)
- [bug 1754874]: remove unused annotations from Super Search (2022-02)
- [bug 1753521]: stop indexing items from raw crash (2022-02)
- [bug 1762005]: migrate to lower-cased versions of Plugin* fields in processed crash (2022-03)
- [bug 1755528]: fix flag/boolean handling (2022-03)
- [bug 1762207]: remove hang_type (2022-04)
- [bug 1763264]: clean up super search fields from migration (2022-07)
fix data flow and usage
- [bug 1740397]: rewrite CrashingThreadInfoRule to normalize crashing thread (2021-11)
- [bug 1755095]: fix TelemetryBotoS3CrashStorage so it doesn't use Super Search fields (2022-03)
- [bug 1740397]: change webapp to pull crashing_thread from processed crash (2022-07)
- [bug 1710725]: stop using DotDict for raw and processed data (2022-09)
clean up the raw crash structure
- [bug 1687987]: restructure raw crash (2021-01 through 2022-10)

Phase 2: define schemas and all the tooling we needed to work with them

After cleaning up the code base, removing vestigial code, fixing Super Search, and fixing Telemetry export code, we could move on to defining schemas and writing all the code we needed to maintain the schemas and work with them.

[bug 1762271]: rewrite json schema reducer (2022-03)
[bug 1764395]: schema for processed crash, reducers, traversers (2022-08)
[bug 1788533]: fix validate_processed_crash to handle pattern_properties (2022-08)
[bug 1626698]: schema for crash annotations in crash reports (2022-11)

Phase 3: fix everything to use the schemas

That allowed us to fix a bunch of things:

[bug 1784927]: remove elasticsearch redactor code (2022-08)
[bug 1746630]: support new threads.N.frames.N.unloaded_modules minidump-stackwalk fields (2022-08)
[bug 1697001]: get rid of UnredactedCrash API and model (2022-08)
[bug 1100352]: remove hard-coded allow lists from RawCrash (2022-08)
[bug 1787929]: rewrite Breadcrumbs validation (2022-09)
[bug 1787931]: fix Super Search fields to pull permissions from processed crash schema (2022-09)
[bug 1787937]: fix Super Search fields to pull documentation from processed crash schema (2022-09)
[bug 1787931]: use processed crash schema permissions for super search (2022-09)
[bug 1100352]: remove hard-coded allow lists from ProcessedCrash models (2022-11)
[bug 1792255]: add telemetry_environment to processed crash (2022-11)
[bug 1784558]: add collector metadata to processed crash (2022-11)
[bug 1787932]: add data review urls for crash annotations that have data reviews (2022-11)

Phase 4: improve

With fields specified in schemas, we can write a crash reporting data dictionary:

[bug 1803558]: crash reporting data dictionary (2023-01)
[bug 1795700]: document raw and processed schemas and how to maintain them (2023-01)

Then we can finish:

[bug 1677143]: documenting analysis gotchas (ongoing)
[bug 1755525]: fixing the report view to only use the processed crash (future)
[bug 1795699]: validate test data (future)

Random thoughts

This was a very, very long-term project with many small steps and some really big ones. Getting large projects done can feel futile, and the only way to do it successfully is to break it into a million small steps, each of which stands on its own and doesn't create urgency for getting the next step done.

Any time I changed field names or types, I'd have to do a data migration. Data migrations take 6 months to do because I have to wait for existing data to expire from storage. On the one hand, it's a blessing I could do migrations at all--you can't do this with larger data sets or with data sets where the data doesn't expire without each migration becoming a huge project. On the other hand, it's hard to juggle being in the middle of multiple migrations and sometimes the contortions one has to perform are grueling.

If you're working on a big project that's going to require changing data structures, figure out how to do migrations early with as little work as possible and use that process as often as you can.

Conclusion and where we could go from here

This was such a huge project that spanned years. It's so hard to finish projects like this because the landscape for the project is constantly changing. Meanwhile, being mid-project has its own set of complexities and hardships.

I'm glad I tackled it and I'm glad it's mostly done. There are some minor things to do, still, but this new schema-driven system has a lot going for it. Adding support for new crash annotations is much easier, less risky, and takes less time.

It took me about a month to pull this post together.

That's it!

That's the story of the schema-based overhaul of crash ingestion. There's probably some bits missing and/or wrong, but the gist of it is here.

If you have any questions or bump into bugs, I hang out on #crashreporting on chat.mozilla.org. You can also write up a bug for Socorro.

Hopefully this helps. If not, let us know!