Socorro Engineering: Year in Review 2019


Last year at about this time, I wrote a year in review blog post. Since I only worked on Socorro at the time, it was all about Socorro. In 2019, that changed, so this blog post covers the efforts of two people across a bunch of projects.

2019 was pretty nuts. We accomplished a lot, but picking up a bunch of new projects really threw a wrench in the wheel of ongoing work.

This year in review covers highlights, some numbers, and some things I took away.

Here's the list of projects we worked on over the year:

Highlights 2019

While there are good reasons for why 2019 was nuts, it was soooo nuts. Some highlights:

  • I released Everett v1.0.0.

    Everett is a Python library for managing configuration. It's similar to other libraries, but it includes support for documenting configuration and for testing with configuration, which makes developing and using projects that use Everett a lot easier.

    I released version 1.0.0 in January.
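
    To make the "documenting and testing configuration" idea concrete, here's a minimal standard-library sketch. It is not Everett's API. The `Config`, `Option`, and `parse_bool` names are made up for illustration. The point is that each option carries its own documentation and that the configuration source can be swapped out in tests.

```python
import os
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Mapping, Optional


@dataclass
class Option:
    """One configuration option plus the documentation that travels with it."""

    key: str
    default: str
    parser: Callable[[str], Any]
    doc: str


def parse_bool(value: str) -> bool:
    """Parse common truthy strings; anything else is False."""
    return value.strip().lower() in ("1", "true", "yes", "on")


class Config:
    """Hypothetical config manager: documented options, swappable source."""

    def __init__(self, source: Optional[Mapping[str, str]] = None) -> None:
        # Default to the process environment; tests can pass a plain dict.
        self.source: Mapping[str, str] = os.environ if source is None else source
        self.options: Dict[str, Option] = {}

    def option(self, key: str, default: str, parser: Callable[[str], Any] = str, doc: str = "") -> None:
        """Declare an option along with its default, parser, and docs."""
        self.options[key] = Option(key, default, parser, doc)

    def get(self, key: str) -> Any:
        """Look up the raw value (upper-cased key) and run it through the parser."""
        opt = self.options[key]
        raw = self.source.get(key.upper(), opt.default)
        return opt.parser(raw)

    def document(self) -> List[str]:
        """Render one documentation line per declared option."""
        return [
            f"{opt.key.upper()} (default: {opt.default!r}): {opt.doc}"
            for opt in self.options.values()
        ]


# In a test, pass a dict instead of touching the real environment.
config = Config(source={"DEBUG": "true"})
config.option("debug", default="false", parser=parse_bool, doc="Enable debug mode.")
config.option("port", default="8000", parser=int, doc="Port to listen on.")
```

    Because the source is just a mapping, a test can pin configuration values without touching the real environment, and `document()` can feed generated docs.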

  • I reimplemented crontabber in Socorro.

    Socorro has a scheduled tasks system that relied on a library called crontabber. crontabber was initially part of Socorro and was extracted so other people could use it.

    The crontabber code wasn't well maintained and it had a lot of issues. I decided it was easier and more convenient to rewrite it as Django management commands than to take on maintaining the crontabber library. So I did.

  • I stepped down as the Bleach maintainer.

    Bleach is a Python library that makes user-provided content safe in an HTML context. It's used in a lot of places. It's slow, it's a difficult and complex problem domain, it's finicky and fragile, and it relies on another library called html5lib which has its own set of daunting problems.

    In March, I stepped down because I was burned out and needed to reduce the number of things I was working on out of sheer obligation.

    Months after I put it down, I still feel lousy that I walked away. Every time I think about how I feel lousy, I tell myself it was the right thing to do.

    I talk about stepping down from Bleach.

  • I migrated Socorro from RabbitMQ to Google Pub/Sub.

    I redid how Socorro handles queuing crash reports for processing. Previously, it used RabbitMQ. I switched it to Google Pub/Sub. In doing this, I removed one of the components between the collector and the processor which was sometimes flaky, so that was good. This was the first step in moving all of Socorro to Google Cloud Platform.

    Later in the year, we decided not to move Socorro to Google Cloud Platform. Fun times!

  • I got a new co-worker!

    I was working with Osmose for parts of 2018, but he left in early 2019. Even when he was around, he was on other projects, so I was mostly working on my own and increasingly feeling disconnected and isolated. That kind of sucked.

    In April-ish, John joined me on Socorro. John's great to work with! Not only does he improve the bus factor for our projects, allow me to go on vacations with less anxiety, and have a predilection towards deep dives into mysterious problems, but he's also wonderful to talk with. Yay for co-workers!

  • We took over and audited Buildhub.

    In April-ish, John and I inherited Buildhub. It was written a couple of years prior to be an index of build information for Mozilla projects. The build process publishes artifacts to an archive site, which is an AWS S3 bucket with a web interface for browsing the directory structure. Buildhub consumes that information, puts it into an index, and provides a search interface for it.

    Socorro needs this information for converting (buildid, channel, version) to a proper version string. Mission Control needs this information for similar things. There are other systems that use it, too.

    John and I audited the Buildhub project. We wrote up a bunch of issues for things that surfaced from the audit. We fixed security issues and performed some necessary maintenance.
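
    As a rough illustration of the (buildid, channel, version) lookup mentioned above, here's a minimal sketch. The index contents, the build ids, and the function names are all hypothetical. A real index would be populated from build records rather than hard-coded, and this isn't how Socorro or Buildhub actually implement it.

```python
from typing import Dict, Optional, Tuple

# Key used to look up a build: (buildid, channel, version).
BuildKey = Tuple[str, str, str]

# Hypothetical index contents, made up for this example.
BUILD_INDEX: Dict[BuildKey, str] = {
    ("20190601000000", "beta", "68.0"): "68.0b7",
    ("20190610000000", "release", "67.0.2"): "67.0.2",
}


def lookup_version_string(
    buildid: str,
    channel: str,
    version: str,
    index: Dict[BuildKey, str] = BUILD_INDEX,
) -> Optional[str]:
    """Return the full version string for a build, or None if it isn't indexed."""
    return index.get((buildid, channel, version))


def version_for_crash(
    buildid: str,
    channel: str,
    version: str,
    index: Dict[BuildKey, str] = BUILD_INDEX,
) -> str:
    """Fall back to the reported version when the index has no record."""
    found = lookup_version_string(buildid, channel, version, index)
    return found if found is not None else version
```

    The fallback matters when the index is missing builds: the caller still gets a usable, if less precise, version.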

  • We took over and audited Buildhub2.

    The way Buildhub was built, it was really challenging to debug problems with build artifact ingestion. It had a lot of problems with missing information for unclear reasons.

    Buildhub2 was another attempt at building a build information index. It was launched at the end of 2018. It took a different approach: it was a stricter mirror of the published build information, and it didn't attempt to infer anything from older builds which didn't have buildhub.json files.

    We audited the Buildhub2 project, wrote up a bunch of issues that surfaced from the audit, fixed security issues, updated dependencies, rewrote the documentation, and wrote a basic runbook.

  • We shut down Buildhub.

    During the Buildhub and Buildhub2 audits, I decided that while Buildhub2 has a different set of issues with its data, maintaining just Buildhub2 was better than maintaining two indexes. I wrote up a plan to shut down Buildhub, identified and fixed blocking issues in Buildhub2, and migrated projects from Buildhub to Buildhub2.

    Then we shut down and dismantled Buildhub.

  • We took over PollBot and Dependency Dashboard.

    "Took over" is a bit of a stretch here. We did a rough audit of both projects and fixed some security issues with dependencies. However, we didn't get very far in absorbing either of these and still don't know much about them.

  • We took over and audited Tecken.

    Tecken is the Mozilla Symbols Server. It's used by a bunch of projects including Socorro for symbolicating stacks.

    Tecken was in pretty good shape, so we haven't had to spend a lot of time on urgent work.

  • I wrote an essay on crash pings (Telemetry) vs crash reports (Socorro/Crash Stats).

    In July, I wrote Crash pings (Telemetry) and crash reports (Socorro/Crash Stats). It took a while to write because it goes into a lot of detail for specific things. I know there have been changes in Telemetry-land as they moved to GCP, so I bet parts of it are wrong now. Writing it sure helped me and other people understand the current situation regarding crash report data and which data is good to use for what purposes.

    Will Lachance and I bandied about writing a more permanent manual for crash report data. I think that's a good idea, but I had to switch projects and haven't had time to spend on it, yet.

    I want to go through the essay and do an update at some point soon.

  • I released crashstats-tools v1.0.1.

    In 2018, I was tinkering with crashstats-tools as a standalone set of command-line tools that make it easier to work with crash report data from Crash Stats using the Crash Stats APIs.

    I use these tools in a few different ways, mostly when looking into issues with Socorro processing. I wasn't sure if anyone else would use them, so I didn't tell anyone for a while--I didn't want to add another project to my plate that required ongoing maintenance work.

    In 2019, Gabriele and Marco spent a lot of time improving the situation around system library symbols. Up until recently, we had system library symbols for Windows in some cases and for some versions of Mac OS, but parts of the process were really manual, we didn't have a good story for Linux, and it was generally just not great. This is a problem when walking and symbolicating stacks. Without symbols, the stackwalker has to guess where the frames are and that's problematic. Further, the result isn't human readable. For example, you end up with stuff like this:

    0             context
    1             frame_pointer
    2             frame_pointer
    3             scan
    4             scan
    5             scan
    6             scan
    7             scan
    8             scan
    9             scan
    10             scan
    11             scan
    12             scan
    13             scan
    14             scan
    15             scan
    16             scan
    17             scan

    After Gabriele and Marco's work, we now have this:

    0       <hashglobe::hash_map::HashMap<K, V, S>>::clear  /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/servo/components/hashglobe/src/         context
    1       style::stylist::CascadeData::clear      /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/servo/components/style/  cfi
    2       Servo_StyleSet_FlushStyleSheets         /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/servo/components/style/  cfi
    3       mozilla::ServoStyleSet::UpdateStylist()         /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/layout/style/ServoStyleSet.cpp:1374     cfi
    4       mozilla::ServoStyleSet::ResolveInheritingAnonymousBoxStyle(nsAtom*, mozilla::ServoStyleContext*)        /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/layout/style/ServoStyleSet.cpp:592      cfi
    5       nsCSSFrameConstructor::ConstructRootFrame()     /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/layout/base/nsCSSFrameConstructor.cpp:2661      cfi
    6       mozilla::PresShell::Initialize()        /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/layout/base/PresShell.cpp:1685  cfi
    7       nsContentSink::StartLayout(bool)        /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/dom/base/nsContentSink.cpp:1203         cfi
    8       nsHtml5TreeOpExecutor::StartLayout(bool*)       /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/parser/html/nsHtml5TreeOpExecutor.cpp:639       cfi
    9       nsHtml5TreeOperation::Perform(nsHtml5TreeOpExecutor*, nsIContent**, bool*, bool*)       /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/parser/html/nsHtml5TreeOperation.cpp:1110       cfi
    10       nsHtml5TreeOpExecutor::RunFlushLoop()   /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/parser/html/nsHtml5TreeOpExecutor.cpp:456       cfi
    11       nsHtml5ExecutorFlusher::Run()   /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/parser/html/nsHtml5StreamParser.cpp:125         cfi
    12       mozilla::SchedulerGroup::Runnable::Run()        /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/xpcom/threads/SchedulerGroup.cpp:370    cfi
    13       nsThread::ProcessNextEvent(bool, bool*)         /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/xpcom/threads/nsThread.cpp:975  cfi
    14       NS_ProcessNextEvent(nsIThread*, bool)   /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/xpcom/threads/nsThreadUtils.cpp:455     cfi
    15       mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*)    /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/ipc/glue/MessagePump.cpp:88     cfi
    16       MessageLoop::Run()      /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/ipc/chromium/src/base/       cfi
    17       nsBaseAppShell::Run()   /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/widget/nsBaseAppShell.cpp:136   cfi
    18       XRE_RunAppShell()       /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/toolkit/xre/nsEmbedFunctions.cpp:860    cfi
    19       MessageLoop::Run()      /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/ipc/chromium/src/base/       cfi
    20       XRE_InitChildProcess(int, char**, XREChildData const*)  /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/toolkit/xre/nsEmbedFunctions.cpp:698    cfi
    21  firefox-esr     content_process_main(mozilla::Bootstrap*, int, char**)  /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/browser/app/../../ipc/contentproc/plugin-container.cpp:49       cfi
    22  firefox-esr     main    /build/firefox-esr-Mag8OK/firefox-esr-60.7.1esr/browser/app/nsBrowserApp.cpp:254        cfi
    Ø 23            cfi
    24  firefox-esr     firefox-esr@0x561f              scan
    25  firefox-esr     firefox-esr@0x596f              scan
    Ø 26               scan
    27  firefox-esr     firefox-esr@0x596f              scan
    28  firefox-esr     _start          scan
    29          @0x7ffe1c04eb37

    Big difference, right?!

    Gabriele told me he's using crashstats-tools in their symbols upload scripts. The scripts upload symbols for modules that are missing from the Mozilla Symbols Server, then search Crash Stats for crash reports where those modules show up in the stack, and then reprocess those crash reports. That's immensely helpful.

    I wrote about the crashstats-tools release.
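
    The upload, search, and reprocess flow described above can be sketched roughly like this. This is not Gabriele's script and it doesn't call crashstats-tools or the Crash Stats API. The search and reprocess steps are passed in as plain callables, and every name here is hypothetical.

```python
from typing import Callable, Iterable, List, Set


def reprocess_after_symbol_upload(
    uploaded_modules: Iterable[str],
    search_crashes_by_module: Callable[[str], List[str]],
    reprocess: Callable[[List[str]], None],
    batch_size: int = 50,
) -> List[str]:
    """Find crash reports that mention newly symbolized modules and reprocess them.

    search_crashes_by_module and reprocess are stand-ins for whatever
    performs the actual search and reprocessing requests.
    """
    seen: Set[str] = set()
    crash_ids: List[str] = []

    # Collect crash ids per module, skipping ones we've already seen so a
    # crash that mentions several modules is only reprocessed once.
    for module in uploaded_modules:
        for crash_id in search_crashes_by_module(module):
            if crash_id not in seen:
                seen.add(crash_id)
                crash_ids.append(crash_id)

    # Submit in batches so no single request carries too many crash ids.
    for start in range(0, len(crash_ids), batch_size):
        reprocess(crash_ids[start:start + batch_size])

    return crash_ids


# Usage with in-memory fakes standing in for the real services.
fake_search_results = {
    "libgtk-3.so.0": ["crash-a", "crash-b"],
    "libc.so.6": ["crash-b", "crash-c"],
}
submitted_batches: List[List[str]] = []
reprocessed = reprocess_after_symbol_upload(
    ["libgtk-3.so.0", "libc.so.6"],
    lambda module: fake_search_results.get(module, []),
    submitted_batches.append,
    batch_size=2,
)
```

    Passing the search and reprocess steps in as callables keeps the flow testable without network access.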

  • John and I picked up Mozilla Location Services.

    Mozilla Location Services had been dormant for years. It was running Python 2.6 on Scientific Linux. It had a deploy pipeline that was several generations old. It was in an unmaintainable state.

    John, our ops person ckolos, and I overhauled the project, finished up the Docker-ization of the services, finished the mostly-done migration from Python 2.6 to Python 3, updated dependencies, reduced a bunch of complexity, wrote a lot of documentation, fixed a ton of issues, pushed out a new deploy pipeline and Docker-based infrastructure, and did a series of stop-gap fixes for processing.

    It was a massive undertaking. The infrastructure migration went smoothly--the site was unavailable for like 15 minutes during the switch over from the old infrastructure and old code base to the new one.

    There are still a bunch of issues with the system. We're triaging them now. However, it's maintainable and we can do deploys, so it's a vastly improved situation.

    This is currently our primary project, so we'll be spending most of our time on this in early 2020.

  • We passed off some projects.

    After picking up Mozilla Location Services, John and I were spread waaaay too thin, so we passed off Buildhub2 and PollBot to the build engineering team. That was a little tricky because I had only had these projects for a short period of time, so it was hard to answer questions about them.

  • I released Markus v2.0.0.

    Markus is a Python library for metrics generation. It wraps statsd and dogstatsd and some other libraries and makes it much easier to develop and test code that generates metrics. I use it in all my projects.

    Version 2.0.0 involved a minor rewrite to support filters. Filters let you adjust metrics before they get emitted. This makes it easier to add tags to all metrics generated for a service with things like the host and service type.

    I wrote about Markus v2.0.0 here.
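
    Here's a minimal standard-library sketch of the filter idea: every metric passes through a chain of filters before it's emitted, so tags like host and service type get added in one place. It is not Markus's API. `MetricRecord`, `add_tag`, and `Metrics` are names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class MetricRecord:
    """A single metric on its way to being emitted."""

    stat_type: str
    key: str
    value: float
    tags: List[str] = field(default_factory=list)


# A filter takes a record and returns the (possibly adjusted) record.
Filter = Callable[[MetricRecord], MetricRecord]


def add_tag(tag: str) -> Filter:
    """Build a filter that appends a fixed tag to every record."""

    def _filter(record: MetricRecord) -> MetricRecord:
        record.tags.append(tag)
        return record

    return _filter


class Metrics:
    """Hypothetical metrics client that runs filters before emitting."""

    def __init__(self, filters: Optional[List[Filter]] = None) -> None:
        self.filters: List[Filter] = list(filters or [])
        # Stand-in for a real backend such as statsd: records are kept in a list.
        self.emitted: List[MetricRecord] = []

    def incr(self, key: str, value: float = 1, tags: Optional[List[str]] = None) -> None:
        """Count something, applying each filter in order before emitting."""
        record = MetricRecord("incr", key, value, list(tags or []))
        for metrics_filter in self.filters:
            record = metrics_filter(record)
        self.emitted.append(record)


# Every metric from this service gets host and service tags automatically.
metrics = Metrics(filters=[add_tag("host:web1"), add_tag("service:processor")])
metrics.incr("crash.processed", tags=["env:prod"])
```

    Call sites stay short because the service-wide tags live in the filter chain rather than in every `incr` call.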

  • I redid the queueing code in Socorro to use AWS SQS.

    Since I switched Socorro to use Google Pub/Sub for queuing crash reports to process, Socorro has been split across two cloud platforms. That's kind of annoying. Since Socorro was staying in AWS, I decided to switch it to use AWS SQS.

    That turned out to be a lot harder than I thought it would be because I had to fix a bunch of technical debt around boto -> boto3.

    I managed to get all the code changes done. We'll probably migrate from Google Pub/Sub to AWS SQS in 2020q1.

Those are the highlights of 2019. While I think we accomplished a lot, it's frustrating because there were so many things I wanted to do in 2019, but just couldn't find time or energy for. For example, we still have a bunch of technical debt in Socorro that I want to get through. It's good to have a team of more-than-one. I feel a little less anxious to go on vacation.

2020 will have us focusing on Mozilla Location Services with Socorro and Tecken in maintenance mode.

Thank you!

Thank you to Gabriele and Marco for their work on getting system library symbols and for working with us to improve Socorro and Tecken. I look forward to the efforts on Rust-ifying everything. That'll help a lot!

Thank you to Steven Michaud who spent a ton of time improving crash reporting and analysis on Mac OS x86_64h (I think that's what it was)! Keeping up with him was tough because there were so many other things going on, but he did awesome work!

Thank you to everyone who submitted signature generation fixes in Socorro!

Thank you to Liz and Marcia who are very patient with me!

Thank you to everyone who submitted bugs and pull requests and helped in other ways!

Thank you to Brian and Miles for Socorro and Tecken ops! Thank you ckolos for MLS ops!

Thank you to Lonnen and Laura for helping us survive 2019!

Summarized Bugzilla and GitHub stats for 2019

We've got so many projects now and we did so much work, the output of my review script is nuts, so this is a summary of the bits I think are interesting.

Period (2019-01-01 -> 2019-12-31)


  Bugs created: 457
  Creators: 101
  Top 10 count-wise:

       Will Kahn-Greene [:willkg] ET  : 236
           John Whitlock [:jwhitlock] : 22
       Steven Michaud [:smichaud] (Re : 21
           Gabriele Svelto [:gsvelto] : 16
                          Brian Pitts : 16
       Marcia Knous [:marcia - needin : 14
       Kartikaya Gupta (email:kats@mo : 5
           Jeff Muizelaar [:jrmuizel] : 4
                 Liz Henry (:lizzard) : 4
            Andrew McCreight [:mccr8] : 4

  Bugs resolved: 462

                              INVALID : 14
                                FIXED : 384
                           INCOMPLETE : 5
                              WONTFIX : 36
                           WORKSFORME : 11
                            DUPLICATE : 6
                                      : 1
                             INACTIVE : 5

  Resolvers: 32
  Top 10 count-wise:

       Will Kahn-Greene [:willkg] ET  : 332
           John Whitlock [:jwhitlock] : 66
                          Brian Pitts : 16
            Osmose [:osmose, :mkelly] : 6
       Marco Castelluccio [:marco] (P : 6
       Miles Crabill [:miles] [also m : 6
            Andrew McCreight [:mccr8] : 5
           Peter Bengtsson [:peterbe] : 4
                  mozillamarcia.knous : 3
                         chris.lonnen : 1

  Commenters: 161
  Top 10 count-wise:

                               willkg : 1681
                            jwhitlock : 200
                             smichaud : 78
                              peterbe : 61
                              gsvelto : 52
                  mozillamarcia.knous : 46
                    mozilla+bugcloser : 40
                               bpitts : 40
                        mcastelluccio : 36
                                   me : 31


      Youngest bug : 0.0d: 1517290: socorro: deploy 358
   Average bug age : 148.2d
    Median bug age : 5.0d
        Oldest bug : 3383.0d: 539370: Missing symbols for GTK system libraries, libgt...



  Socorro:

    Merged PRs: 307
               willkg :   230  (+37601, -39109,  653 files)
            jwhitlock :    39  ( +2140,  -1399,  106 files)
        rvandermeulen :     9  (    +9,     -9,    1 files)
               Osmose :     7  (  +315,   -250,   26 files)
             pyup-bot :     5  (  +586,   -565,   33 files)
      dependabot[bot] :     3  (   +28,    -20,    2 files)
             jrmuizel :     3  (    +4,     -0,    1 files)
           amccreight :     2  (    +3,     -3,    2 files)
              lizzard :     1  (    +1,     -1,    1 files)
             jcristau :     1  (    +1,     -1,    1 files)
             glandium :     1  (    +1,     -0,    1 files)
               emilio :     1  (    +2,     -2,    1 files)
            bobbyg603 :     1  (    +1,     -1,    1 files)
         philipp-sumo :     1  (    +2,     -0,    2 files)
            staktrace :     1  (    +1,     -0,    1 files)
              peterbe :     1  (    +1,     -1,    1 files)
              froydnj :     1  (    +2,     -2,    1 files)

                Total :        (+40698, -41363,  664 files)

    Most changed files:
      requirements/default.txt (33)
      webapp-django/crashstats/settings/ (31)
      socorro/signature/siglists/prefix_signature_re.txt (31)
      webapp-django/crashstats/crashstats/ (25)
      requirements/constraints.txt (22)
      socorro/external/es/ (19)
      webapp-django/crashstats/api/tests/ (14)
      webapp-django/crashstats/crashstats/tests/ (14)
      docker/config/local_dev.env (14)
      webapp-django/crashstats/crashstats/ (14)

    Age stats:
          Youngest PR : 0.0d: 5076: bug 1597730: remove quit check
       Average PR age : 0.5d
        Median PR age : 0.0d
            Oldest PR : 27.0d: 4931: bug 1545446: Remove Fira-Sans, reduce font list


  Antenna (Socorro collector):

    Merged PRs: 56
               willkg :    54  ( +5811,  -3992,   79 files)
             pyup-bot :     2  (   +36,    -36,    1 files)

                Total :        ( +5847,  -4028,   79 files)

    Most changed files:
      requirements/default.txt (19)
      requirements/constraints.txt (12)
      antenna/ (11)
      antenna/ext/s3/ (7)
      tests/unittest/ (7)
      antenna/ext/pubsub/ (7)
      docker/Dockerfile (6)
      antenna/ (6)
      docker/config/local_dev.env (5)
      docker/ (5)

    Age stats:
          Youngest PR : 0.0d: 373: bug 1601455: add AWS SQS crashpublish support
       Average PR age : 0.0d
        Median PR age : 0.0d
            Oldest PR : 1.0d: 299: fix bug 1527343: implement publishing to Pub/Sub


  Tecken:

    Closed issues: 2
    Merged PRs: 62
        renovate[bot] :    25  (  +448,   -502,    6 files)
             pyup-bot :    21  (  +272,   -257,    4 files)
               willkg :    12  ( +2705,  -3557,   16 files)
            jwhitlock :     4  ( +2246,  -3466,   26 files)

                Total :        ( +5671,  -7782,   35 files)

    Most changed files:
      requirements.txt (25)
      frontend/package.json (20)
      frontend/yarn.lock (20)
      Dockerfile (9)
      frontend/Dockerfile (4)
      requirements-constraints.txt (3)
      docs-requirements.txt (3)
      docs/dev.rst (3)
      tecken/ (2)
      tecken/api/ (2)

    Age stats:
          Youngest PR : 0.0d: 1908: Update handlerbars to 4.5.3
       Average PR age : 1.6d
        Median PR age : 0.0d
            Oldest PR : 9.0d: 1903: Update react monorepo to v16.12.0


  Mozilla Location Services (ichnaea):

    Closed issues: 79
                               willkg : 29
                            jwhitlock : 21
                               ckolos : 1

    Merged PRs: 106
            jwhitlock :    55  (+10755,  -6313,  121 files)
               willkg :    43  (+14501, -12999,  239 files)
      dependabot[bot] :     2  (    +6,     -6,    1 files)
             pyup-bot :     2  (  +222,   -212,    2 files)
              rindeal :     1  (    +5,     -5,    1 files)
      Mozilla-GitHub- :     1  (   +15,     -0,    1 files)
               ckolos :     1  (    +7,     -1,    1 files)
               lonnen :     1  (  +275,    -32,   12 files)

                Total :        (+25786, -19568,  285 files)

    Most changed files:
      requirements/default.txt (31)
      requirements/constraints.txt (24)
      ichnaea/ (12)
      ichnaea/ (11)
      ichnaea/ (11)
      ichnaea/ (11)
      ichnaea/webapp/ (8)
      ichnaea/taskapp/ (7)
      .circleci/config.yml (7)
      ichnaea/content/tests/ (7)

    Age stats:
          Youngest PR : 0.0d: 1017: Bump waitress from 1.3.1 to 1.4.0 in /requirements
       Average PR age : 6.0d
        Median PR age : 0.0d
            Oldest PR : 411.0d: 522: fix line separator issues in public csv exporter

  All repositories:

    Total closed issues: 81
    Total merged PRs: 531
Want to comment? Send an email to willkg at bluesock dot org. Include the url for the blog entry in your comment so I have some context as to what you're talking about.