Socorro in 2017

Summary

Socorro is the crash ingestion pipeline for Mozilla's products like Firefox. When Firefox crashes, the Breakpad crash reporter asks the user if the user would like to send a crash report. If the user answers "yes!", then the Breakpad crash reporter collects data related to the crash, generates a crash report, and submits that crash report as an HTTP POST to Socorro. Socorro saves the crash report, processes it, and provides an interface for aggregating, searching, and looking at crash reports.

2017 was a big year for Socorro. In this blog post, I opine about our accomplishments.


html5lib-python 1.0 released!: retrospective (2017)

Project

time: 3 months

impact:
  • reduced technical debt and maintenance friction for html5lib which impacts a variety of projects like PyPI, pip, readme_renderer, Jupyter, TensorFlow

  • reduced security risks for Bleach

html5lib-python v1.0 released!

Yesterday, Sam released html5lib 1.0 [1]! The changes aren't wildly interesting for users, but are important for the health of the project.

The more interesting part for me is how the release happened and experimenting with interim maintainers to get projects going again. I'm going to spend the rest of this post talking about that.

The story of Bleach and html5lib

I work on Bleach which is a Python library for sanitizing and linkifying text from untrusted sources for safe usage in HTML. It relies heavily on another library called html5lib-python. Most of the work that I do on Bleach consists of figuring out how to make html5lib do what I need it to do.

Over the last few years, maintainers of the html5lib library have been working towards a 1.0. Those well-meaning efforts got them into a versioning model which had some unenthusing properties. I would often talk to people about how I was having difficulties with Bleach and html5lib 0.99999999 (8 9s) and I'd have to mentally count how many 9s I had said. It was goofy [2].

In an attempt to deal with the effects of the versioning, there's a parallel set of versions that start with 1.0b. Because there are two sets of versions, it was a total pain in the ass to correctly specify which versions of html5lib Bleach worked with.

While working on Bleach 2.0, I bumped into a few bugs and upstreamed a patch for at least one of them. That patch sat in the PR queue for months. That's what got me wondering--what's going on with html5lib?

I tracked down Sam and talked with her a bit on IRC. She seemed to be the only active maintainer. She was really busy with other things, html5lib doesn't pay, there was a ton of stuff to do, she was burned out, and recently there had been spates of negative comments in the issues and PRs. Generally, the project had a lot of stop energy.

Some time in August, I offered to step up as an interim maintainer and shepherd html5lib to 1.0. The goals were:

  1. land or close as many old PRs as possible

  2. triage, fix, and close as many issues as possible

  3. clean up testing and CI

  4. clean up documentation

  5. ship 1.0 which ends the versioning issues

Thoughts on being an interim maintainer

I see a lot of open source projects that are in trouble in the sense that they don't have a critical mass of people and energy. When the sole part-time volunteer maintainer burns out, the project languishes. Then users show up, complain, demand changes, and talk about how horrible the situation is and everyone should be ashamed. It's tough--people are frustrated and then do a bunch of things that make everything so much worse. How do projects escape the raging inferno death spiral?

For a while now, I've been thinking about a model for open source projects where someone else pops in as an interim maintainer for a short period of time with specific goals, works to achieve those goals, and then steps down. Maybe this alleviates users' frustrations? Maybe this gives the part-time volunteer burned-out maintainer a breather? Maybe this can get the project moving again? Maybe the temporary interim maintainer can make some of the hard decisions that a regular long-term maintainer just can't?

I wondered if I should try that model out here. In the process of convincing myself that stepping up as an interim maintainer was a good idea [3], I looked at projects that rely on html5lib [4]:

  • pip vendors it

  • Bleach relies upon it heavily, so anything that uses Bleach uses html5lib (jupyter, hypermark, readme_renderer, TensorFlow, ...)

  • anything that uses readme_renderer like PyPI and tools around Python packages

  • most web browsers (Firefox, Chrome, servo, etc) have it in their repositories because web-platform-tests uses it

I talked with Sam and offered to step up with these goals in mind.

I started with cleaning up the milestones in GitHub. I decided the 0.9999999999 (10 9s) milestone was going to be 1.0. I bumped everything from the 0.9999999999 (10 9s) milestone to the 1.0 milestone. I went through all the issues and PRs and threw any that piqued my interest in the 1.0 milestone bucket.

Then I went through the issue tracker and triaged all the issues. I tried to get steps to reproduce and any other data that would help resolve the issue. I closed some issues I didn't think would ever get resolved.

I triaged all the pull requests. Some of them had been open for a long time. I apologized to people who had spent their time to upstream a fix that sat around for years. In some cases, the changes had bitrotted so severely that they had to be redone [5].

Then I plugged away at issues and pull requests for a couple of months, landing and fixing, and pushed out of the milestone anything that wasn't well-defined or that we couldn't fix in a week.

At the end of all that, Sam released version 1.0 and here we are today!

Conclusion and more thoughts

I finished up as interim maintainer for html5lib. I don't think I'm going to continue actively as a maintainer. Yes, Bleach uses it, but I've got other things I should be doing.

I think this was an interesting experiment. I also think it was a successful experiment with regard to achieving my stated goals, but I don't know if it gave the project much momentum to continue forward.

I'd love to see other examples of interim maintainers stepping up, achieving specific goals, and then stepping down again. Does it bring in new people to the community? Does it affect the raging inferno death spiral at all? What kinds of projects would benefit from this the most? What kinds of projects wouldn't benefit at all?

Markus v1.0 released! Better metrics API for Python projects.

What is it?

Markus is a Python library for generating metrics.

Markus makes it easier to generate metrics in your program by:

  • providing multiple backends (Datadog statsd, statsd, logging, logging roll-up, and so on) for sending data to different places

  • sending metrics to multiple backends at the same time

  • providing a testing framework for easy testing

  • providing a decoupled architecture that makes it easier to write code that generates metrics without having to make sure a metrics client has been created and configured first--similar to the Python logging module in this way

I use it at Mozilla in the collector of our crash ingestion pipeline. Peter used it to build our symbols lookup server, too.

v1.0 released!

This is the v1.0 release. I pushed out v0.2 back in April 2017. We've been using it in Antenna (the collector of the Firefox crash ingestion pipeline) since then. At this point, I think the API is sound and it's being used in production, ergo it's production-ready.

This release also adds Python 2.7 support.

Why you should take a look at Markus

Markus does three things that make generating metrics a lot easier.

First, it separates creating and configuring the metrics backends from generating metrics.

Let's create a metrics client that sends data nowhere:

import markus

markus.configure()

That's not wildly helpful, but it works and it's 2 lines.

Say we're doing development on a laptop on a speeding train and want to spit out metrics to the Python logging module so we can see what's being generated. We can do this:

import markus

markus.configure(
    backends=[
        {
            'class': 'markus.backends.logging.LoggingMetrics'
        }
    ]
)

That will spit out lines to Python logging. Now I can see metrics getting generated while I'm testing my code.

I'm ready to put my code in production, so let's add a statsd backend, too:

import markus

markus.configure(
    backends=[
        {
            # Log metrics to the logs
            'class': 'markus.backends.logging.LoggingMetrics',
        },
        {
            # Log metrics to statsd
            'class': 'markus.backends.statsd.StatsdMetrics',
            'options': {
                'statsd_host': 'statsd.example.com',
                'statsd_port': 8125,
                'statsd_prefix': '',
            }
        }
    ]
)

That's it. Tada!
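For context on what a statsd backend ends up putting on the wire, here's a minimal sketch of the plain statsd line format (`<key>:<value>|<type>`) sent over UDP. This is not Markus's code: `format_statsd_line` and `send_statsd` are hypothetical helper names, and the host and port are the placeholder values from the configuration above.

```python
import socket

# statsd type codes: counter, timer (milliseconds), gauge
STATSD_TYPES = {'incr': 'c', 'timing': 'ms', 'gauge': 'g'}


def format_statsd_line(kind, key, value, prefix=''):
    """Build one plain-statsd datagram payload, e.g. b'vegetable:1|c'."""
    full_key = f'{prefix}.{key}' if prefix else key
    return f'{full_key}:{value}|{STATSD_TYPES[kind]}'.encode('utf-8')


def send_statsd(line, host='statsd.example.com', port=8125):
    """Fire-and-forget: statsd listens on UDP, so there is no reply to wait for."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line, (host, port))


# Formatting only; nothing is sent here.
print(format_statsd_line('incr', 'vegetable', 1))
print(format_statsd_line('timing', 'chopping_vegetables', 12.5, prefix='kitchen'))
```

Because the transport is UDP, a misconfigured host fails silently, which is one reason it helps to also log metrics during development.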

Markus can support any number of backends. You can send data to multiple statsd servers. You can use the LoggingRollupBackend, which every flush_interval logs roll-up statistics for the metrics data: count, current, min, and max for incr stats, and count, min, average, median, 95%, and max for timing/histogram stats.
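To make the roll-up concrete, here's a rough sketch of computing those timing statistics for one flush interval using only the standard library. It isn't Markus's implementation: `rollup_timings` is a hypothetical name, and the 95th percentile uses a simple nearest-rank rule, which may differ from what the real backend does.

```python
import math
import statistics


def rollup_timings(values):
    """Summarize timing values from one flush interval (needs at least one value)."""
    ordered = sorted(values)
    # Nearest-rank percentile: the smallest value with at least 95% of the
    # samples at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        'count': len(ordered),
        'min': ordered[0],
        'average': statistics.mean(ordered),
        'median': statistics.median(ordered),
        '95%': ordered[rank - 1],
        'max': ordered[-1],
    }


print(rollup_timings([12, 15, 11, 90, 14]))
```

A single slow outlier (the 90 here) pulls the average well above the median, which is why a roll-up that reports both is handy when eyeballing logs.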

If Markus doesn't have the backends you need, writing your own metrics backend is straightforward.

For more details, see the usage documentation and the backends documentation.

Second, writing code to generate metrics is straightforward.

Much like the Python logging module, you add import markus at the top of the Python module and get a metrics interface. The interface can be module-level or in a class. It doesn't matter.

Here's a module-level metrics example:

import markus

metrics = markus.get_metrics(__name__)

Then you use it:

@metrics.timer_decorator('chopping_vegetables')
def some_long_function(vegetables):
    for veg in vegetables:
        chop_vegetable(veg)
        # Count each vegetable chopped
        metrics.incr('vegetable', 1)

That's it. No bootstrapping problems, nice handling of metrics key prefixes, decorators, context managers, and so on. You can use multiple metrics interfaces in the same file. You can pass them around. You can reconfigure the metrics client and backends dynamically while your program is running.
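One way to get that kind of behavior is for the metrics interface to look up the configured backends each time a metric is emitted, rather than holding on to a client. Here's a toy sketch of that pattern using only the standard library. It illustrates the idea and is not Markus's code: `configure`, `get_metrics`, `MetricsInterface`, `ListBackend`, and `_backends` are stand-in names written for this example.

```python
# Shared registry; starts empty, so emitting before configuring is a no-op.
_backends = []


def configure(backends):
    """Swap the active backends in place so existing interfaces see them."""
    _backends[:] = backends


class MetricsInterface:
    def __init__(self, prefix):
        self.prefix = prefix

    def incr(self, key, value=1):
        # Backends are looked up at call time, not at creation time.
        for backend in _backends:
            backend.incr(f'{self.prefix}.{key}', value)


def get_metrics(name):
    return MetricsInterface(name)


class ListBackend:
    """Toy backend that records what it was sent."""

    def __init__(self):
        self.records = []

    def incr(self, key, value):
        self.records.append(('incr', key, value))


metrics = get_metrics('kitchen')   # created before anything is configured
metrics.incr('vegetable')          # goes nowhere, raises nothing

first, second = ListBackend(), ListBackend()
configure([first, second])         # reconfigure while "running"
metrics.incr('vegetable')          # fans out to both backends
print(first.records, second.records)
```

The same late lookup is what lets one call fan out to several backends and lets a test swap in a recording backend without touching the code under test.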

For more details, see the metrics overview documentation.

Third, testing metrics generation is easy to do.

Markus provides a MetricsMock to make testing easier:

import markus
from markus.testing import MetricsMock


def test_something():
    with MetricsMock() as mm:
        # ... Do things that might publish metrics

        # This helps you debug and write your test
        mm.print_records()

        # Make assertions on metrics published
        assert mm.has_metric(markus.INCR, 'some.key', {'value': 1})

I use it with pytest on my projects, but it is testing-system agnostic.

For more details, see the testing documentation.

Why not use statsd directly?

You can definitely use statsd/dogstatsd libraries directly, but using Markus is a lot easier.

With Markus you don't have to worry about the order in which you create/configure the statsd client versus using the statsd client. You don't have to pass around the statsd client. It's a lot easier to use in Django and Flask where bootstrapping the app and passing things around is tricky sometimes.

With Markus you get to degrade to sending metrics data to the Python logging library which helps surface issues in development. I've had a few occasions when I thought I wrote code to send data, but it turns out I hadn't or that I had messed up the keys or tags.

With Markus you get a testing mock which lets you write tests guaranteeing that your code is generating metrics the way you're expecting.

If you go with using the statsd/dogstatsd libraries directly, that's fine, but you'll probably want to write some/most of these things yourself.

Where to go for more

For more specifics on this release, see here: https://markus.readthedocs.io/en/latest/history.html#october-30th-2017

Documentation and quickstart here: https://markus.readthedocs.io/en/latest/index.html

Source code and issue tracker here: https://github.com/willkg/markus

Let me know whether this helps you!

rob-bugson 1.0: or how I wrote a webextension

I work on Socorro and other projects which use GitHub for version control and code review and use Mozilla's Bugzilla for bug tracking.

After creating a pull request in GitHub, I attach it to the related Bugzilla bug, which is a contra-dance of clicking and copy-and-paste. Github tweaks for Bugzilla simplified that by adding a link to the GitHub pull request page that I could click on, edit, and then submit the resulting form. However, that's a legacy addon, I use Firefox Nightly, and it doesn't look like anyone wrote a webextension version of it, so I was out of luck.

Today, I had to bring in my car for service and was sitting around at the dealership for a few hours. I figured instead of working on Socorro things, I'd take a break and implement an attach-pr-to-bug webextension.

I've never written a webextension before. I had written a couple of addons years ago using the SDK and then Jetpack (or something like that). My JavaScript is a bit rusty, especially ES6 stuff. I figured this would be a good way to learn about webextensions.

It took me about 4 hours of puzzling through docs, writing code, and debugging and then I had something that worked. Along the way, I discovered exciting things like:

  • host permissions let you run content scripts in web pages

  • content scripts can't access browser.tabs--you need a background script for that

  • you can pass messages from content scripts to background scripts

  • seems like everything returns a promise, but async/await make that a lot easier to work with

  • the attachment page on Bugzilla isn't like the create-bug page and ignores querystring params

The MDN docs for writing webextensions and the APIs involved are fantastic. The webextension samples are also great--I started with them when I was getting my bearings.

I created a new GitHub repository. I threw the code into a pull request making it easier for someone else to review it. Mike Cooper kindly skimmed it and provided insightful comments. I fixed the issues he brought up.

TheOne helped me resurrect my AMO account which I created in 2012 back when Gaia apps were the thing.

I read through Publishing your webextension, generated a .zip, and submitted a new addon.

About 10 minutes later, the addon had been reviewed and approved.

Now it's a thing and you can install rob-bugson.

Socorro signature generation overhaul and command line interface: retrospective (2017)

Project

time: 6 months

impact:
  • improved ease of contribution for signature generation changes

  • improved ability to experiment with signatures

  • improved ability to use Socorro-style crash signatures in other projects

Summary

This quarter I worked on creating a command line interface for signature generation and in doing that extracted it from the processor into a standalone-ish module.

The end result of this work is that:

  1. anyone making changes to signature generation can test the changes out on their local machine using a Socorro local development environment

  2. I can trivially test incoming signature generation changes--this both saves me time and gives me a much higher confidence of correctness without having to merge the code and test it in our -stage environment [1]

  3. we can research and experiment with changes to the signature generation algorithm and how that affects existing crash signatures

  4. it's a step closer to being usable by other groups

This blog post talks about that work briefly and then talks about some of the things I've been able to do with it.


Socorro local development environment: retrospective (2017)

Project

time: 1 year

impact:
  • vastly reduced time-to-onboard for new developers and contributors

  • vastly improved developer efficacy

Summary

Socorro is the crash ingestion pipeline for Mozilla's products like Firefox. When Firefox crashes, the Breakpad crash reporter asks the user if the user would like to send a crash report. If the user answers "yes!", then the Breakpad crash reporter collects data related to the crash, generates a crash report, and submits that crash report as an HTTP POST to Socorro. Socorro saves the crash report, processes it, and provides an interface for aggregating, searching, and looking at crash reports.

This (long-ish) blog post talks about how when I started on Socorro, there wasn't really a local development environment and how I went on a magical journey through dark forests and craggy mountains to find one.

If you do anything with Socorro at Mozilla, you definitely want to at least read the "Tell me more about this local development environment" part.


Socorro and Firefox 57

Summary

Socorro is the crash ingestion pipeline for Mozilla's products like Firefox. When Firefox crashes, the Breakpad crash reporter asks the user if the user would like to send a crash report. If the user answers "yes!", then the Breakpad crash reporter collects data related to the crash, generates a crash report, and submits that crash report as an HTTP POST to Socorro--specifically the Socorro collector.

Teams at Mozilla are feverishly working on Firefox 57. That's super important work and we're getting down to the wire. Socorro is a critical part of that development work as it collects incoming crashes, processes them, and has tools for analysis.

This blog post covers some of the things Socorro engineering has been doing to facilitate that work and what we're planning from now until Firefox 57 release.

This quarter

This quarter, we replaced Snappy with Tecken for more reliable symbol lookup in Visual Studio and other clients.

We built a Docker-based local dev environment for Socorro, making it easier to run Socorro on your local machine configured like crash-stats.mozilla.com. It now takes five steps to get Socorro running on your computer.

We also overhauled the signature generation system in Socorro and slapped on a command-line interface. Now you can test the effects of signature generation changes on specific crashes as well as groups of crashes on your local machine.

We've also been fixing stability issues and bugs and myriad other things.

Now until Firefox 57

Starting today and continuing until after Firefox 57 release, we are:

  1. prioritizing your signature generation changes, getting them landed, and pushing them to -prod

  2. triaging Socorro bugs into "need it right now" and "everything else" buckets

  3. deferring big changes to Socorro until after Firefox 57 including API endpoint deprecation, major UI changes to the crash-stats interface, and other things that would affect your workflow

We want to make sure crash analysis is working as well as it can so that you can do your best work and we can have a successful Firefox 57.

Please contact us if you need something!

We hang out on #breakpad on irc.mozilla.org. You can also write up bugs.

Hopefully this helps. If not, let us know!

Soloists: code review on a solo project

Summary

I work on some projects with other people, but I also spend a lot of time working on projects by myself. When I'm working by myself, I have difficulties with the following:

  1. code review

  2. bouncing ideas off of people

  3. pair programming

  4. long slogs

  5. getting help when I'm stuck

  6. publicizing my work

  7. dealing with loneliness

  8. going on vacation

I started a #soloists group at Mozilla figuring there are a bunch of other Mozillians who are working on solo projects and maybe if we all work alone together, then that might alleviate some of the problems of working solo. We hang out in the #soloists IRC channel on irc.mozilla.org. If you're solo, join us!

I keep thinking about writing a set of blog posts for things we've talked about in the channel and how I do things. Maybe they'll help you.

This one covers code review.


Antenna: retrospective (2017)

Project

time: 6 months

impact:
  • reduced risk for deployments

  • improved reliability of collector

  • reduced technical debt

  • improved developer efficacy

Problem statement

Socorro is the crash ingestion pipeline for Mozilla's products like Firefox. When Firefox crashes, the Breakpad crash reporter asks the user if the user would like to send a crash report. If the user answers "yes!", then the Breakpad crash reporter collects data related to the crash, generates a crash report, and submits that crash report as an HTTP POST to Socorro--specifically the Socorro collector.
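To give a feel for what the collector receives, here's a hedged sketch that builds (but doesn't send) a multipart/form-data POST shaped like a crash report. The URL, the annotation values, the minidump bytes, and the helper name `build_crash_post` are made up for illustration; the `upload_file_minidump` field name is assumed to follow Breakpad's convention for the minidump payload.

```python
import uuid
from urllib.request import Request


def build_crash_post(url, annotations, minidump):
    """Build a multipart/form-data POST carrying annotations and a minidump."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in annotations.items():
        field = (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'
        )
        parts.append(field.encode('utf-8'))

    # The binary minidump goes in as a file part.
    file_header = (
        f'--{boundary}\r\n'
        'Content-Disposition: form-data; name="upload_file_minidump"; '
        'filename="crash.dmp"\r\n'
        'Content-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(file_header.encode('utf-8') + minidump + b'\r\n')
    parts.append(f'--{boundary}--\r\n'.encode('utf-8'))

    headers = {'Content-Type': f'multipart/form-data; boundary={boundary}'}
    return Request(url, data=b''.join(parts), headers=headers, method='POST')


request = build_crash_post(
    'https://collector.example.com/submit',  # hypothetical URL
    {'ProductName': 'Firefox', 'Version': '57.0'},  # placeholder annotations
    b'MDMP fake minidump bytes',  # placeholder payload
)
print(request.get_method(), len(request.data), 'bytes')
```

Passing the request to `urllib.request.urlopen` would actually send it; that step is left out here on purpose.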

The Socorro collector is one of several components that comprise Socorro. Each of the components has different uptime requirements and different security risk profiles. However, all the code is maintained in a single repository and we deploy everything every time we do a deploy. This is increasingly inflexible and makes it difficult for us to make architectural changes to Socorro without affecting everything and incurring uptime risk for components that have high uptime requirements.

Because of that, in early 2016, we embarked on a re-architecture to split out some components of Socorro into separate services. The first component to get split out was the Socorro collector since it has the highest uptime requirements of all the Socorro components, but rarely changes, so it'd be a lot easier to meet those requirements if it was separate from the rest of Socorro.

Thus I was tasked with splitting out the Socorro collector and this blog post covers that project. It's a bit stream-of-consciousness, because I think there's some merit to explaining the thought process behind how I did the work over the course of the project for other people working on projects.


The Soloists

Building Firefox is a big endeavor. There are many teams and projects covering initiatives, maintenance, bug fixing, triage, localization, support, understanding feedback, marketing, communication, releasing, supporting infrastructure, crash analysis, and a bazillion other activities all to build a family of browsers and applications.

Teams and projects aren't static. People move around as priorities change and the landscape shifts and projects complete or are scuttled.

Sometimes projects get started up with a single person. Sometimes all the people except one move off a project. Sometimes we find ourselves working alone, in a basement office, with only a stapler equivalent to keep us company.

We are the soloists. You wouldn't believe the list of things we work on. Alone.

Where to find soloists: IRC, Slack

There's an IRC channel #soloists on irc.mozilla.org.

There's also a Slack channel #soloists on the Mozilla Slack [1].

These two places (and whatever other places soloists want to hang out at) are places where we can:

  • find some solace from the weary drudgery of being alone on our projects for days on end

  • ask for help

  • bounce ideas off each other

  • vent frustrations in a friendly forgiving place

  • get advice on dealing with things like code reviews and how to go on vacation

  • get recognition for a job well done

and a variety of other things that alleviate many of the problems we have as soloists.

Stickers at the All Hands!

Over the last month or so, we spent some time figuring out #soloists stickers because we like stickers and you like stickers and everyone likes stickers.

They look like this:

[Image: Soloist 2017 sticker.]

They're 2" by 2" and round. They're warm to the touch. They make you want to climb things. By yourself. Alone. With appropriate safety gear. [2]

If you're a soloist, come find one of us and get a sticker. Also, consider joining soloist channels.

If you support soloists, come find one of us and get a sticker. Ask us about the things we're working on. We may be solo, but we're working on real projects that almost certainly affect you. As a group, we did great things in the last 6 months. Alone. So alone.