This is long because 2022 was busy busy busy. I didn't spend much time going through my engineering journal for things I missed, so there's probably stuff I missed.

Found a bug in Fenix crash reporter that had eluded us for a long while

When I rewrote the collector's payload parsing, we started seeing weird errors with crash reports reported by Fenix. I looked into that and discovered a minor bug in the Fenix crash reporter code that had been there for a while.

bug: [bug 1757854]

bug: android components issue #11809

Fix telemetry marionette test cases

This was at the beginning of 2022. If I recall, there were two problems. The first is that minidump_stackwalk needed an additional flag, otherwise it wouldn't download symbols and then when the telemetry marionette tests crashed, the stacks would be unsymbolicated--but only on Windows. I found and fixed that.

bug: [bug 1594515]

After I fixed that, Aria replaced that stackwalker with the new rust-minidump one which was infinitely better in a million ways. Then I wanted to verify that when the Telemetry marionette tests crashed and there were symbols on disk, whether the stackwalker would use the symbols on disk instead of pulling them from symbols.mozilla.org. After a series of try pushes verifying the different possibilities, I was able to verify it was using the symbols on disk and I closed out the bug.

bug: [bug 1746747]

Socorro/Tecken overview

I did a "Socorro/Tecken Overview 2022" presentation. I then wrote a script to export the slides into images and converted the whole thing to a blog post.

Blog post: Socorro/Tecken Overview: 2022, presentation

Migrate from Mozilla on-prem Sentry to hosted Sentry

From what I understand, for other projects at Mozilla, migrating to hosted Sentry was straight-forward and it consisted of changing the sentry_dsn somewhere--no big deal. For Socorro and Tecken, it was a long project and consisted of a bunch of steps:

Audited Sentry usage in Socorro and Tecken. In Socorro, we had these using Sentry: collector, processor, processor disk cache manager, js in the webapp, webapp, crontabber scheduled task runner, and migration deploy script. In Tecken, we had these using Sentry: webapp, js in the webapp, migration deploy script, celery, symbolication webapp, and symbolication disk cache manager. In total, this involved auditing Sentry usage across 13 things.

Removed Sentry client usage from parts of Socorro and Tecken that weren't really using it.

Migrated Socorro and Tecken services that were still using the old, deprecated Raven Python client to the new sentry-sdk client.

Had to read through the Sentry integration code. Socorro and Tecken are composed of 13 things that use Sentry, all of which pick up different integrations by default. I had to figure out which integrations to use and which to not use because they violated our data policy restrictions.

Had to implement Sentry event filtering. Since I have like 6 services across 3 repositories for 2 projects, it was better to do this as a library. Thus was born fillmore. Fillmore makes it possible to filter Sentry events, test filtering, and verify that events aren't changing as you update sentry-sdk or other libraries.

Had to implement a fake Sentry service I could run with docker-compose. Since I have 3 repositories across 2 projects, it was better to do this as a library. Thus was born kent. Kent makes it possible to see what's getting sent in Sentry event payloads, test filtering, and build automated integration tests.

Since Socorro holds category 4 data and we were now sending exception data to a third party, I had to do a security-focused audit and an RRA to make sure what Socorro could send was safe to send and that we had all the infrastructure we needed to ensure nothing changed in the future.

It was a lot of work. Some of it was done at the end of 2021 and the rest in 2022.

Socorro bug: [bug 1812325] (backfilled to collect all the work)

Tecken bug: [bug 1812330] (backfilled to collect all the work)

Fillmore: https://fillmore.readthedocs.io/

Kent: https://github.com/willkg/kent/

Blog post: Kent v0.1.0 released! And the story of Kent in the first place.....
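To give a rough idea of what the event filtering step above involves, here is a minimal sketch of a scrubbing function. This is not Fillmore's API. The deny-list, the `scrub` helper, and the sample event are all made up for illustration. The only thing taken from sentry-sdk is that a `before_send` callback receives `(event, hint)` and returns the event to send.

```python
# Hypothetical deny-list; real rules depend on your own data policy.
SENSITIVE_KEYS = {"cookies", "authorization", "email", "ip_address"}
FILTERED = "[filtered]"


def scrub(value):
    """Return a copy of value with anything under a sensitive key replaced."""
    if isinstance(value, dict):
        return {
            key: FILTERED
            if isinstance(key, str) and key.lower() in SENSITIVE_KEYS
            else scrub(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value


def before_send(event, hint):
    """Has the (event, hint) shape that sentry-sdk expects for before_send."""
    return scrub(event)


# Made-up event payload to exercise the filter.
event = {
    "message": "boom",
    "request": {"url": "https://example.com/", "cookies": {"sid": "abc"}},
    "user": {"id": "42", "email": "someone@example.com"},
}
scrubbed = before_send(event, hint={})
```

A function like this could be passed as `before_send=` to `sentry_sdk.init()`. Because it is a plain function, you can also call it directly in tests with recorded event payloads to check that nothing sensitive slips through when dependencies change.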

Ran a Volunteer Responsibility Amnesty Day for Data Org

I learned about Volunteer Responsibility Amnesty Day where you take some time to take stock of all the open source things you're doing and which commitments you need to change or end. One of the goals we had in Data Org was to "land planes" and reduce maintenance burden. I wondered if Volunteer Responsibility Amnesty Day could be a helpful exercise towards that goal. Ultimately, I don't know how much it really helped, but it was worth doing and thinking about and maybe it's a tool we can use more effectively some day.

Blog post: Volunteer Responsibility Amnesty Day: 06-2022

Redid how the stackwalker gets into Socorro Docker images

Originally, Socorro built the stackwalker when building the Docker image. That took a lot of time and the stackwalker doesn't change often. I split the stackwalker building out to another project that generates an artifact when we update it and then changed the Socorro Docker image building to use that artifact. This dropped the build time from 22 minutes to 7.

Bug: [bug 1759065]

Offloaded/ended/deprecated a bunch of projects

I ended the Puente project.

I passed Dennis off to new maintainers. Blog post: Dennis v1.0.0 released! Retrospective! Handing it off!

I ended github-bugzilla-pr-linker.

I planned to deprecate Bleach, but didn't finish that work until today. Blog post: Bleach 6.0.0 release and deprecation

Each one of these involved auditing the project and figuring out:

who used the project

what state the project was in

what alternatives existed and possible migration options

Then I shopped those audits around to stakeholders and other people who might have opinions on the future direction for the project. Ending projects takes more energy than continuing to maintain them.

Schema-driven overhaul of crash ingestion

For a while, I've been slogging through converting Socorro into a schema-driven system. In 2022, I finally finished it. Adding support for new annotations takes an hour for most annotations. Making the changes is straight-forward. Testing them is also straight-forward. There's a lot of tooling and validation tests to make sure everything is correct. I can make changes with confidence. I can't overstate how fantastic this is.

Also, I was able to throw together a crash ingestion data dictionary in 4 hours. When people ask questions on #crashreporting, I can more easily point them to documentation on specific fields and what the values look like.

It was a lot of work, but Socorro is easier to maintain now.

Blog post: Socorro: Schema based overhaul of crash ingestion: retrospective (2022)

Unloaded module support

Just after we switched from minidump-stackwalk to the rust-minidump stackwalker, Aria added support for listing unloaded module information. I did the minimum I could do to support that until I had finished up the processed crash schema of the schema-driven overhaul. Once I had a processed crash schema, I:

added the relevant fields to the processed crash schema

added support for the unloaded module information to the report view

added support for unloaded modules to signature generation

This was a big improvement for crash analysis.

Bug: [bug 1746630]

Bug: [bug 1797742]

Inline function support

In July 2022, Markus took on implementing support for inline function data in crash reporting. He did a ton of work across a bunch of projects. However, all that work was blocked by me adding support to Tecken (symbolication) and Socorro (stackwalker, report view, signature generation, etc). Once I finished up the processed crash schema part of the schema-driven overhaul, I tackled reviewing, verifying, and landing all of Markus' changes:

update the stackwalker to 0.14.0 ([bug 1779630])

the processed crash schema ([bug 1788267])

the crashing thread stack in the report view ([bug 1788103])

signature generation ([bug 1788269])

the top 10 frames of the stack in create-a-bug bugzilla links ([bug 1791271])

fix "top most filenames" to include inline function data ([bug 1800141])

Then Gabriele, Markus, Jeff, and others tweaked signature generation to account for inline function data. That work is still ongoing.

The other thing that happened is that the size of sym files ballooned. I knew they were going to grow in size, but I wasn't paying enough attention to the conversations in July (I might have been on PTO--I forget what happened) and didn't realize just how much they were growing. I spent the bulk of 2022q4 dealing with service degradation issues around the sym file size changes.

Sym files had increased dramatically

I knew that sym files had increased in size, but I didn't understand the magnitude. We have limited data available and I can't look at the sym files directly (easily), so it took me a while to figure out how to measure what I needed to measure to see what we were looking at. The end result was a Jupyter notebook.

https://github.com/willkg/socorro-jupyter/blob/main/notebooks/bug_1796120_sym_sizes.ipynb

The summary is that the size of the xul module sym files had increased by 500-600 MB. Yipes.

Bug: [bug 1796120]

Mozilla Symbols Server nodes monotonically use disk space and die

The size of the symbols.zip file had increased so much that when the Firefox build system went to upload the symbols.zip file, it took so long it triggered a timeout in the Gunicorn worker. That caused Gunicorn to kill the worker off, leaving whatever data the worker was working on on disk. Since there's nothing to clean up the data on disk and the Firefox build system would try again, eventually the disk for an instance would fill up and then the instance would die.

Bug: [bug 1790808]

Mozilla Symbolication Server ran out of memory

Symbolication involves downloading sym files from the Mozilla Symbols Server, parsing them, converting them to symcache files, and then doing address lookups in those files to get the frame symbols. The sym files had increased dramatically in size. Having multiple workers downloading, parsing, and looking up addresses in sym files meant we had a bunch of sym files in memory at the same time. Because the stacks we're symbolicating are all stacks for Firefox or Fenix or other Mozilla products, they almost always had a xul module involved. And just like that, we were running out of memory and the Eliot instances were crashing.

We fixed this in a few ways, but the big changes were that we reduced the number of workers on an instance and I rewrote symbolication so that it reordered the symbols to look up so that a worker only had the sym file for a single module in memory at any given time.

Bug: [bug 1793984]
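The reordering idea can be sketched roughly like this. This is not the actual Eliot code. `load_symbols` is a hypothetical stand-in for downloading and parsing a sym file, the symbol names are invented, and the lookup is an exact-match dictionary lookup rather than the nearest-address search a real symbolicator does. The point is only the grouping: collect all lookups for one module, resolve them, then move on to the next module.

```python
from collections import defaultdict


def load_symbols(module):
    """Hypothetical stand-in for downloading and parsing one module's sym file."""
    table = {
        "xul.dll": {0x1000: "nsFoo::Bar", 0x2000: "nsBaz::Qux"},
        "ntdll.dll": {0x10: "RtlUserThreadStart"},
    }
    return table.get(module, {})


def symbolicate(frames):
    """Resolve (module, offset) frames while holding one module's symbols at a time."""
    # Remember each frame's position so results come back in the original order.
    by_module = defaultdict(list)
    for index, (module, offset) in enumerate(frames):
        by_module[module].append((index, offset))

    results = [None] * len(frames)
    for module, lookups in by_module.items():
        symbols = load_symbols(module)  # only this module's symbols are held now
        for index, offset in lookups:
            # Fall back to "module+offset" when there is no symbol.
            results[index] = symbols.get(offset, f"{module}+{offset:#x}")
        del symbols  # drop the reference before loading the next module
    return results


frames = [
    ("xul.dll", 0x1000),
    ("ntdll.dll", 0x10),
    ("xul.dll", 0x2000),
    ("xul.dll", 0x3000),
]
resolved = symbolicate(frames)
```

The trade-off is a little bookkeeping to put results back in stack order, in exchange for peak memory being bounded by the largest single sym file rather than by however many modules show up in a request.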

SymCacheErrorBadDebugFile

There was a bug in dump_syms where the INLINE_ORIGIN line had an address, but no symbol. We were getting a lot of errors. I wrote some new tooling to help figure out what was going on. Gabriele wrote up an issue and fixed it.

Then we had to figure out what to do with the errant sym files. I decided to fix them manually since there were tens of them and once fixed, a sym file wouldn't be problematic anymore. I wrote tooling to do that.

I wrote up issue 487 to add an INFO line to the sym file that indicated what version of dump_syms generated it. While manually fixing sym files, I noticed that I was seeing ones with this new INFO line. Then we discovered a second issue with dump_syms where symbols had naughty characters like newlines in them which caused them to get split across lines. I noticed this and wrote up issue 511 which Markus fixed.

I'm still fixing the occasional bad sym file.

Bug: [bug 1794095]

Numeric 2,644,960,066 out of range

The symbols.zip files were so big that when Tecken went to record the size in the upload_upload.size field, it exceeded the maximum for that field type. We had to do a database migration to fix the type and allow the larger values. Oh, but we hadn't done a database migration in a long time, so we had to dust off all the migration code. Also, this was on a big table, so I had to coordinate with SRE and do an outage window to run the migration.

Bug: [bug 1796264]
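For a sense of the arithmetic, here is a small check. It assumes the column was a 32-bit signed integer and that the fix was to move to a 64-bit one. That's a guess based on the size in the error, not something stated above. The `fits` helper is made up for illustration.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647: upper bound of a 32-bit signed integer
INT64_MAX = 2**63 - 1  # upper bound of a 64-bit signed integer


def fits(value, maximum):
    """Return True if a non-negative size can be stored under the given maximum."""
    return 0 <= value <= maximum


size = 2_644_960_066  # the value from the error message, in bytes

fits_in_32_bit = fits(size, INT32_MAX)
fits_in_64_bit = fits(size, INT64_MAX)
overshoot = size - INT32_MAX  # how far past the 32-bit limit the upload was
```

Under that assumption, the upload was roughly half a gigabyte past what the field could hold, which is about 2.46 GiB of symbols.zip in total.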

Processor lag causes queue build up

After we landed all the inline function data code, the processors started taking longer to process crash reports because they were spending more time downloading sym files, saving sym files to disk, and loading files from disk.

Also, the processor is a single Python process that runs multi-threaded. My theory is that the increased size means the processor is spending more time downloading sym files from symbols.mozilla.org and since it's multi-threaded, it's spending a lot more time in io_wait and only using a single VCPU. That'd be fine, except they were taking longer and spending more time in io_wait, so the average CPU usage dropped, and then our scaling triggers didn't work, so the processor cluster stopped scaling up under load.

This took a long time to sort out and while I have a theory, I haven't had a chance to address it yet. We've worked around the issue by increasing the minimum number of processor instances. I'm hoping to fix this in 2023.

Bug: [bug 1795017]

It's wild because Socorro and Tecken have largely been stable for years. Increasing the sym file size tipped the boat over and with the skeleton crew we have, it took a lot of calendar days to stabilize things. But now we've got better infrastructure dashboards, I have a bunch of new tools for analysis, the code is emitting some new helpful metrics, and I rewrote a bunch of code that was due for fixing.

Helped with a sphinx-js release unblocking Python 3.10 support for Firefox engineering

sphinx-js is a Sphinx extension used to build the Firefox source docs. It hadn't really been actively maintained in a while. I helped Lonnen sort out the issues preventing it from working in Python 3.10 and Sphinx 4.1.0+.

sphinx-js issue #209

sphinx-js PR #210

Sphinx issue #11021

[bug 1763971]