Socorro in 2017
Date: 2018-01-08
Summary
Socorro is the crash ingestion pipeline for Mozilla's products like Firefox. When Firefox crashes, the Breakpad crash reporter asks the user if the user would like to send a crash report. If the user answers "yes!", then the Breakpad crash reporter collects data related to the crash, generates a crash report, and submits that crash report as an HTTP POST to Socorro. Socorro saves the crash report, processes it, and provides an interface for aggregating, searching, and looking at crash reports.
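To make the submission step a bit more concrete, here is a minimal sketch of packaging a crash report as a multipart/form-data HTTP POST body. The helper name, the annotation names and values, and the fake minidump bytes are assumptions for illustration; only the `upload_file_minidump` field name appears in Socorro bug titles. The sketch builds the payload and does not send it anywhere.

```python
import uuid


def build_crash_payload(annotations, minidump):
    """Return (body, content_type) for a multipart/form-data crash report.

    Hypothetical helper for illustration; this is not Breakpad's code.
    """
    boundary = uuid.uuid4().hex
    parts = []
    # Each crash annotation becomes a plain form field.
    for name, value in annotations.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    # The minidump is attached as a binary file field.
    parts.append(
        (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="upload_file_minidump"; '
            'filename="crash.dmp"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        + minidump
        + b"\r\n"
    )
    # The closing delimiter ends the multipart body.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


# Made-up annotations and minidump bytes, purely for illustration.
body, content_type = build_crash_payload(
    {"ProductName": "Firefox", "Version": "57.0"}, b"MDMP-fake-bytes"
)
```

A client would send `body` with the `Content-Type` header set to `content_type`; the collector URL is deliberately left out of this sketch.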
2017 was a big year for Socorro. In this blog post, I opine about our accomplishments.
Turnover in 2017
At the beginning of the year, we had three full-time developers. Then Adrian left for Pontoon (a completely different project) and Peter left to continue working on Tecken (all things symbols in Socorro) on another team. That left me alone-ish for a month which was tough. Then we picked up Mike in October.
Throughout, Lonnen managed to find time to fix Socorro things and review PRs. That alleviated some of the lack-of-critical-mass problems we had during the year.
We also had a cadre of engineers and other people who contributed fixes mostly around signature generation.
That's a lot of turnover for a very very small team. At one point, I was the only developer, which was really hard because Socorro is a huge code base. We made do and still did great things.
Highlights 2017
2017 was a big year. I really can't overstate that. Despite the turnover, we accomplished a lot. Some highlights:
-
Replaced the Socorro collector. Replaced the Socorro collector with a top-to-bottom rewrite code-named Antenna. We put it in production in April 2017 and fixed a few minor issues that came up. We haven't touched it since then--it's been solid.
In July, I wrote a post-mortem and project wrap-up.
Ops, QA, engineering--we did awesome on this project!
-
Created a new Docker-based local development environment. This radically improved our ability to trouble-shoot, debug, reproduce issues, fix issues, and verify correctness of fixes. It was a game-changer.
In September, I wrote Socorro local development environment.
-
Rewrote signature generation code and added a command line interface. This allows us to verify signature generation changes and experiment with new ones. We can confidently make changes to signature generation code now and know roughly what the effects will be.
Not only that, but the tools are easy to use and make it possible for anyone to test their signature generation changes.
In October, I wrote Socorro signature generation overhaul and command line interface.
-
Built a new Docker-based -stage environment. Our current infrastructure has some rough edges and it's really different than the other systems at Mozilla. In order to be more like other systems, we're building a new infrastructure for Socorro that uses Ops-preferred Dockerflow bits. This new infrastructure will make it easier to scale individual components, deploy, back out deploys, and manage everything.
Getting a working -stage environment was a huge accomplishment. It involved everything from writing Docker files and new command scripts, to infrastructure glue and deploy pipeline bits, to getting everything including our tests working on Circle CI, to rewriting Socorro code that had underlying assumptions about how it was being run so that it works with the new system.
Work for this ongoing project is covered in [bug 1391034] and a bunch of bugs blocking that one.
-
Rewrote Snappy symbolification server and all things symbols. We rewrote the Snappy symbolification server which engineers use to symbolicate stacks to get meaningful stack traces. This new system is code-named Tecken.
In addition to that, Peter took the project several steps further and centralized all things symbols into Tecken.
Socorro's minidump stackwalker now asks Tecken for symbol lookups allowing Tecken to keep track of missing symbols. Soon, we'll be able to remove all the missing symbol bookkeeping code from Socorro.
We're also switching to Tecken for symbol uploads. Soon, we'll be able to remove all the symbol upload code from Socorro, too.
Peter wrote a plog entry on load testing Tecken which covers some other bits about Tecken as well.
-
Removed lots of code and other things from the repository. Adrian and Peter worked on the "deprecation rampage" focusing on removing unused API endpoints. We spent time removing Postgres tables, stored procedures, and views we weren't using. We removed the fakedata generation code. We removed the middleware component (most of it was folded into the webapp). We removed the aging and broken Vagrant development environment. We removed a bunch of scripts whose purpose has long been forgotten. We removed code for cron jobs we no longer run. We removed bits and bobs for projects long abandoned (running Socorro on Heroku, using hbase, etc).
There's still a lot of code ripe for removal and cleaning up, but we made significant progress towards reducing the code base to a size that's maintainable by a small team.
This is covered by a bunch of bugs like [bug 1361394], [bug 1314814], [bug 1424027], [bug 1424370], [bug 1398946], and [bug 1387493].
-
Updated Python dependencies and reworked how we manage them. We updated all the Python dependencies (some of which were several years old), switched to a requirements file and constraints file to specify them, and set up monthly dependency reviews for non-security updates and daily dependency reviews for security updates.
This automates the majority of the work required to stay up-to-date.
This work is covered in [bug 1306731].
-
Updated JavaScript dependencies and switched to npm to manage them. Our webapp relies on a bunch of JavaScript libraries. We had copies of these libraries in the repository. We removed the vendored copies and switched to npm to install them from a requirements file. Additionally, we updated the dependencies to more recent versions and set up a monthly review for updates.
This work is covered in [bug 1388593].
-
Built better metrics infrastructure for the webapp. We switched the webapp to use a library I wrote for Antenna called Markus. This makes it much easier to measure things like how often API endpoints are being used. Adding metrics to the webapp is now a two-line code-change.
I want to update the rest of Socorro in similar ways. Hopefully, I can get to that in early 2018.
This work is covered in [bug 1412590].
-
Cleaned up bugs. We triaged and resolved 1,221 bugs. We resolved bugs that were obsolete, were for projects we abandoned, were already fixed, or were otherwise not helpful anymore. We're down to under 500 bugs now.
-
Switched from nose to pytest. We switched from nose to pytest. We have hundreds of tests, so this was an overhaul of our test code which took a while. The end result is that we're now using a test library that has features that will make writing and maintaining tests much easier.
This is covered in [bug 1361764] and [bug 1405675].
-
Linted Python code and added linting to CI. We linted all the Python code, fixed issues, and added linting to our CI. Linting is an important tool for finding certain classes of bugs. Being able to lint in CI reduces the risk of code changes.
This is covered in [bug 1377254].
-
Overhauled documentation. We overhauled the documentation. We now have a new Getting Started guide that gets you a local development environment in roughly 4 steps. It documents the scripts we use for manipulating that environment and running the various components of Socorro individually as well as in conjunction with other components.
We also updated all the documentation related to administrating and maintaining the infrastructure.
There's still a lot of work to do here, but we made significant progress.
-
Wrote a system checklist. We wrote up a system checklist for verifying that the entire system is working as expected. This is helpful after big changes like upgrading Python versions or critical libraries.
This also gives us a list of important things in the system so we can automate verification as much as possible and change parts that are hard to verify.
-
Radically reduced onboarding time for new developers. When I started in 2016, it took me more than 6 months before I was up-to-speed and had a development environment that worked well enough for me to be productive.
Contrast that experience with Mike who was up-to-speed in a few weeks.
It was a good year!
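The "two-line code-change" idea from the metrics highlight above can be sketched roughly as follows. This is a standard-library stand-in and not Markus's actual API: `MetricsClient`, `track_usage`, and the metric key are hypothetical names chosen for illustration.

```python
from collections import Counter
from functools import wraps


class MetricsClient:
    """Hypothetical in-memory metrics client (a stand-in, not Markus)."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.counts = Counter()

    def incr(self, key, value=1):
        # Count under a namespaced key, e.g. "webapp.api.example_view".
        self.counts[f"{self.prefix}.{key}"] += value

    def track_usage(self, key):
        """Decorator that counts every call to the wrapped view."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                self.incr(key)
                return func(*args, **kwargs)
            return wrapper
        return decorator


metrics = MetricsClient("webapp")          # line 1: get a metrics client


@metrics.track_usage("api.example_view")   # line 2: instrument the endpoint
def example_view(request):
    return {"status": "ok"}


example_view(request=None)
example_view(request=None)
print(metrics.counts["webapp.api.example_view"])  # prints 2
```

The point of the pattern is that instrumenting an endpoint touches only the line that gets the client and the line that decorates the view, which keeps "how often is this endpoint used?" cheap to answer.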
Lowlights 2017
We had a bunch of highlights, but we also had some lowlights:
-
Elasticsearch cluster upgrade fails. We've been having problems with our Elasticsearch cluster for a while now. In 2017, we tried several times to upgrade our Elasticsearch cluster from 1.4 to 5.1, hoping that this would alleviate some of our problems.
We tried this in our -stage environment several times failing each time. This project was supposed to be pretty straight-forward, but it had complexities we didn't understand until later.
First, we discovered we had a lot of problems in our data making it difficult to migrate it over. We have a lot of data, so we had to copy the data from one cluster to another cluster and transform it along the way. It's really difficult to do that quickly. We were mucking with Groovy script embedded in a reindex-from-remote command. The iteration cycle was rough, too--we'd run the script for a day and then discover more issues that we had to fix.
Second, we had to rewrite and update a lot of code and our testing had a lot of holes in it. We'd get some things working in -stage only to discover new issues.
Since we have only one -stage environment, these experiments blocked all Socorro development.
After the third abandoned attempt, I suggested we back up a step and build a local development environment with both Elasticsearch cluster versions, test everything out there, and work out the issues. In the meantime, we could fix some of our data problems, which is probably a good idea anyhow.
Meanwhile, we're also trying to redo our infrastructure. We have a really small team. We can't do two big projects like this at the same time. I reprioritized them a few times hoping we could get one of them done and reduce the number of big projects we were juggling. I think that only made things worse.
That work is being done in [bug 1322630].
-
Another year with Postgres crash storage. The Socorro processor processes a raw crash into a processed crash and then saves it to a bunch of crash stores. We've been trying to remove Postgres from the crash store destinations.
This work has been really hard. The code is really tangled and slides between Python-land and Postgres-stored-procedure-land. Some of it is well tested, but some of it has no tests at all and interesting side-effects.
I thought we were really close to dropping Postgres as a crash store. I tried to pick up where Adrian and Peter left off, but essentially ran out of time in the year to finish this off.
This work is being done in [bug 1257531].
-
Switch from ftpscraper to buildhub. We currently have a script called ftpscraper that scrapes archive.mozilla.org for new and updated build information. It has a bunch of "interesting logic" for traversing the directory trees and interpreting the data. It then executes a bunch of stored procedures that convert that build information into some form and stores it in the database.
Those stored procedures do interesting things. They handle a bunch of "one-off" scenarios in the build information some of which stem from goofs and some from the ever evolving Firefox build system. They also enforce invariants that aren't true anymore as far as I can tell. They have no tests.
Socorro's system for accruing build information is really hard to debug. It takes days to understand how the data flowed and why weird things happened. Many issues are ephemeral, so they're not reproducible after the fact.
Over the summer, Buildhub was written and stood up to build and maintain a set of build data much like what we're getting with ftpscraper. I looked at dropping our ftpscraper script for a similar Buildhub-based script, but haven't had time to continue that work and keep pushing it off in order to finish other things. In the meantime, we continue to have problems with build information which we spend/waste gobs of time debugging.
This work is being done in [bug 1366301].
-
Spent the bulk of our time addressing technical debt. We worked through a lot of technical debt that had been accreting for years. That's great, but it was at the cost of spending time improving things that people use.
We could have spent more time honing the Crash Stats webapp interface. We could have spent more time improving bits to make QA easier. We could have spent more time fixing our API documentation to make it more usable.
There's never enough time to do everything, but it would be better if we had accomplished more user-facing things.
Hopefully, we get to these in 2018.
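As a rough illustration of the crash storage lowlight above, the processor's save step can be pictured as a fan-out over a list of destinations, so that dropping one store is ideally a matter of leaving it out of that list. The class names, the in-memory stores, the "archive" destination, and the toy `process` function are hypothetical stand-ins, not Socorro's actual crash storage code.

```python
class InMemoryCrashStore:
    """Hypothetical crash store that keeps processed crashes in a dict."""

    def __init__(self, name):
        self.name = name
        self.saved = {}

    def save_processed(self, crash_id, processed_crash):
        self.saved[crash_id] = processed_crash


class FanOutCrashStorage:
    """Saves each processed crash to every configured destination."""

    def __init__(self, destinations):
        self.destinations = list(destinations)

    def save_processed(self, crash_id, processed_crash):
        errors = []
        for store in self.destinations:
            try:
                store.save_processed(crash_id, processed_crash)
            except Exception as exc:
                # Keep going so one failing store does not block the others.
                errors.append((store.name, exc))
        return errors


def process(raw_crash):
    """Toy stand-in for the processor: derive a processed crash."""
    return {"product": raw_crash["ProductName"], "signature": "example_signature"}


archive = InMemoryCrashStore("archive")
es = InMemoryCrashStore("elasticsearch")
pg = InMemoryCrashStore("postgres")

# Removing Postgres as a destination means leaving `pg` out of this list.
storage = FanOutCrashStorage([archive, es])
errors = storage.save_processed("crash-0001", process({"ProductName": "Firefox"}))
```

In practice the removal is not this clean, because the real code slides between Python and Postgres stored procedures, which is exactly what made the work hard.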
Thanks!
Thank you to the Socorro team: Lonnen, Peter, Adrian, Matt, Miles, Grumpy, Greg, Mike, and Will!
We accomplished a lot this year. We're in a really good position coming into 2018.
Bugzilla and GitHub stats for 2017
data platform
1304902: [tracker] Remove correlations from Socorro
1306731: [tracker] Don't be so behind on python dependencies
1314814: [tracker] Deprecation Rampage
1315258: [tracker] switch to antenna for incoming crashes
1351302: [tracker] Remove GCCrashes
1357444: [tracker] Remove obsolete cron jobs
1373997: [tracker] rewrite docs
1387104: [tracker] make running webapp and processor in docker environment useful
1427117: [tracker] purge data associated with bug 1427111
Statistics
Youngest bug: 0.0d: 1329736: upload_file_minidump files should get...
Average bug age: 632.3d
Median bug age: 41.0d
Oldest bug: 3419.0d: 425399: Allow querying by modules in crashing...