Socorro: October 2018 happenings

Summary

Socorro is the crash ingestion pipeline for Mozilla's products like Firefox. When Firefox crashes, the Breakpad crash reporter asks the user if the user would like to send a crash report. If the user answers "yes!", then the Breakpad crash reporter collects data related to the crash, generates a crash report, and submits that crash report as an HTTP POST to Socorro. Socorro saves the crash report, processes it, and provides an interface for aggregating, searching, and looking at crash reports.

October was a busy month! This blog post covers what happened.

Read more…

Bleach v3.0.0 released!

What is it?

Bleach is a Python library for sanitizing and linkifying text from untrusted sources for safe usage in HTML.

Bleach v3.0.0 released!

Bleach 3.0.0 focused on easing the problems with the html5lib dependency and fixing regressions created in the Bleach 2.0 rewrite.

For the first, I vendored html5lib 1.0.1 into Bleach and wrote a shim module. Bleach code uses things in the shim module, which imports things from html5lib. This way:

  1. the two stay separated to some extent
  2. the shim is easy to test on its own
  3. it shouldn't be too hard to update html5lib versions
  4. we don't have to test Bleach against multiple versions of html5lib (which took a lot of time)
  5. no one has to deal with Bleach requiring one version of html5lib and other libraries requiring other versions

I think this is a big win for all of us.
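The shim pattern itself is simple. Here's a toy illustration using a stdlib module as a stand-in for the vendored html5lib (Bleach's real shim is more involved; the module names here are made up):

```python
# shim.py -- the one module allowed to touch the vendored library.
# Everything else in the codebase imports from the shim, so updating
# the vendored library means updating only this file.
import html.parser as _vendored  # stand-in for the vendored html5lib

HTMLParser = _vendored.HTMLParser

def make_parser():
    """Build a parser configured the way the rest of the codebase needs it."""
    return HTMLParser(convert_charrefs=False)

parser = make_parser()
```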

The second was tricky. The Bleach 2.0 rewrite changed clean and linkify from running in the tokenizing step of HTML parsing to running after parsing is done. The parser (un)helpfully would clean up the HTML before passing it to Bleach. Because of that, the cleaned text would end up with all this extra stuff.

For example, with Bleach 2.1.4, you'd have this:

>>> import bleach
>>> bleach.clean('This is terrible.<sarcasm>')
'This is terrible.&lt;sarcasm&gt;&lt;/sarcasm&gt;'

The tokenizer would parse out things that looked like HTML tags; the parser would see a start tag that didn't have an end tag and would add the end tag; then clean would escape the start and end tags because they weren't in the list of allowed tags. Blech.

Bleach 3.0.0 fixes that by tweaking the tokenizer to know about the list of allowed tags. With this knowledge, it can see a start, end, or empty tag and strip or escape it during tokenization. Then the parser doesn't try to fix anything.

With Bleach 3.0.0, we get this:

>>> import bleach
>>> bleach.clean('This is terrible.<sarcasm>')
'This is terrible.&lt;sarcasm&gt;'
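To make the tokenizer-level approach concrete, here's a toy sketch using the stdlib's html.parser rather than Bleach's actual html5lib-based tokenizer: tags in the allowed set pass through, and everything else gets escaped as it's tokenized, so the parser never gets a chance to "fix" anything.

```python
import html
from html.parser import HTMLParser

class TinyCleaner(HTMLParser):
    """Toy illustration only: escape tags not in the allowed set
    as they are tokenized."""
    def __init__(self, allowed):
        super().__init__(convert_charrefs=False)
        self.allowed = allowed
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            self.out.append('<%s>' % tag)
        else:
            # Escape the raw start tag text instead of passing it through
            self.out.append(html.escape(self.get_starttag_text()))

    def handle_endtag(self, tag):
        text = '</%s>' % tag
        self.out.append(text if tag in self.allowed else html.escape(text))

    def handle_data(self, data):
        self.out.append(data)

def tiny_clean(text, allowed=('b', 'i')):
    cleaner = TinyCleaner(set(allowed))
    cleaner.feed(text)
    cleaner.close()
    return ''.join(cleaner.out)

print(tiny_clean('This is terrible.<sarcasm>'))
# -> This is terrible.&lt;sarcasm&gt;
```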

What I could use help with

I could use help with improving the documentation. I think it's dense and all over the place focus-wise. I find it difficult to read.

If you're good with documentation, I sure could use your help. See issue 397 for more.

Where to go for more

For more specifics on this release, see here: https://bleach.readthedocs.io/en/latest/changes.html#version-3-0-0-october-3rd-2018

Documentation and quickstart here: https://bleach.readthedocs.io/en/latest/

Source code and issue tracker here: https://github.com/mozilla/bleach

Socorro: 2018q3 review

Summary

Socorro is the crash ingestion pipeline for Mozilla's products like Firefox. When Firefox crashes, the Breakpad crash reporter asks the user if the user would like to send a crash report. If the user answers "yes!", then the Breakpad crash reporter collects data related to the crash, generates a crash report, and submits that crash report as an HTTP POST to Socorro. Socorro saves the crash report, processes it, and provides an interface for aggregating, searching, and looking at crash reports.

2018q3 was a busy quarter. This blog post covers what happened.

Read more…

Siggen (Socorro signature generator) v0.2.0 released!

Siggen

Siggen (sig-gen) is a Socorro-style signature generator extracted from Socorro and packaged with pretty bows and wrapping paper in a Python library. Siggen generates Socorro-style signatures from your crash data, making it easier for you to bucket your crash data using the same buckets that Socorro uses.

The story

Back in June of 2017, the signature generation code was deeply embedded in Socorro's processor. I spent a couple of weeks extracting it and adding tooling so as to:

  1. make it easier for others to make and test signature generation changes
  2. make it easier for me to review signature generation changes
  3. make it easier to experiment with algorithm changes and understand how it affects existing signatures
  4. make it easier for other groups to use with their crash data

I wrote a blog post about extracting signature generation. That project went really well and as a result, we've made many big changes to signature generation with full confidence about how they would affect things. I claim this was a big success.

The fourth item in that list was a "hope", but wasn't meaningfully true. While it was theoretically possible, the code, despite living in its own Python module, was still all tied up with the rest of Socorro and effectively impossible for other people to use.

A year passed....

Early this summer, Will Lachance took on Ben Wu as an intern to look at Telemetry crash ping data. One of the things Ben wanted to do was generate Socorro-style signatures from the data. Then we could do analysis on crash ping data using Telemetry tools and then do deep dives on specific crashes in Socorro.

I forked the Socorro signature generation code and created Siggen and released it on PyPI. Ben and I fixed some realllly rough edges and did a few releases. We documented parts of signature generation that had never been documented before.

Ben wrote some symbolication code to convert the frames to symbols, then ran that through Siggen to generate a Socorro-style signature. That's in fix-crash-sig. He did some great things with his internship project!

So then I had this problem where I had two very different versions of Socorro's signature generation code. I did several passes at unifying the two versions and fixing both sides so the code worked inside of Socorro as well as outside of Socorro. It was effectively a rewrite of the code.

The result of that work is Siggen v0.2.0.

Usage

Siggen can be installed using pip:

$ pip install siggen

Siggen comes with two command line tools.

You can generate a signature from crash data on Socorro given crashids:

$ signature <CRASHID> [<CRASHID> ...]

This is the same as doing socorro-cmd signature in the Socorro local development environment.

You can also generate a signature from crash data in JSON format:

$ signify <JSONFILE>

You can use it as a library in your Python code:

from siggen.generator import SignatureGenerator

generator = SignatureGenerator()

crash_data = {
    ...
}

ret = generator.generate(crash_data)
print(ret['signature'])

The schema is "documented" in the README which can be viewed online at https://github.com/willkg/socorro-siggen#crash-data-schema.
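For example, the bucketing use case looks something like this (a hedged sketch: generate_signature here is a stand-in for Siggen's SignatureGenerator, and the toy frame data is made up rather than taken from the real schema):

```python
from collections import Counter

def generate_signature(crash):
    # Stand-in for SignatureGenerator().generate(crash_data)['signature'];
    # here we just take the top frame's function name.
    frames = crash.get('frames', [])
    return frames[0]['function'] if frames else '(empty signature)'

crashes = [
    {'frames': [{'function': 'mozilla::dom::Foo::Bar'}]},
    {'frames': [{'function': 'mozilla::dom::Foo::Bar'}]},
    {'frames': [{'function': 'js::RunScript'}]},
]

# Bucket the crash data by signature
buckets = Counter(generate_signature(crash) for crash in crashes)
print(buckets.most_common())
# -> [('mozilla::dom::Foo::Bar', 2), ('js::RunScript', 1)]
```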

There's more Siggen documentation in the README though that's one area where this project is sort of lacking. There's also more documentation about the signature generation algorithm in the Socorro docs on signature generation.

What's the future of this library

This is alpha-quality software. It's possible the command line tools and API bits will change as people use it and issues pop up. Having said that, it's in use in a couple of places now, so it probably won't change much.

Some people want different kinds of signature generation. That's cool--this neither helps nor hinders that.

This doesn't solve everyone's Socorro signature generation problems, but I think it gives us a starting point for some of them and it was a doable first step.

Some people want to produce Socorro-style signatures from their crash data. This will help with that. Unless you need the code in some other language, in which case this is probably not helpful.

I wrote some tools to update Siggen from changes in Socorro. That was how I built v0.2.0. I think that worked well and it's pretty easy to do, so I plan to keep this going for a while.

If you use this library, pleeeeeeease tell me where you're using it. That's how I'll know it's being used and that the time and effort to maintain it are worthwhile. Even better, star it on GitHub so I have a list of you all and can contact you later. Plus it's a (terrible) indicator of library popularity.

If no one uses this library or if no one tells me (I can't tell the difference), then I'll probably stop maintaining it.

If there's interest in this algorithm, but implemented with a different language, pleeeeeeeease let me know. I'm interested in helping to build a version in Rust. Possibly other languages.

If there's interest in throwing a webapp with an API around this, chime in with specifics in [bug 828452].

Hopefully this helps. If so, let me know! If not, let me know!

Standup report: End of days

What is Standup?

Standup is a system for capturing standup-style posts from individuals, making it easier to see what's going on for teams and projects. It has an associated IRC bot standups for posting messages from IRC.

Short version

The Standups project is done and we're going to decommission all the parts in the next couple of months.

Long version: End of days

Paul and I run Standups and have for a couple of years now. We do the vast bulk of work on it covering code changes, feature implementing, bug fixing, site administration and user support. Neither of us use it.

There are occasional contributions from other people, but not enough to keep the site going without us.

Standups has a lot of issues and a crappy UI/UX. Standups continues to accrue technical debt.

The activity seems to be dwindling over time. Groups are going elsewhere.

In June, I wrote Standup report: June 8th, 2018 in which I talked about us switching to swag-driven development as a way to boost our energy level in the project, pull in contributors, etc. We added a link to the site. It was a sort of last-ditch attempt to get the project going again.

Nothing happened. I heard absolutely nothing from anyone in any medium about the post or any of the thoughts therein.

Sometimes, it's hard to know when a project is dead. You sometimes have metrics that could mean something about the health of a project, but it's hard to know for sure. Sometimes it's hard to understand why you're sitting in a room all by yourself. Will something happen? Will someone show up? What if we just wait 15 minutes more?

I don't want to wait anymore. The project is dead. If it's not actually really totally dead, it's such a fantastic facsimile of dead that I can't tell the difference. There's no point in me waiting anymore; nothing's going to change and no one is going to show up.

Sure, maybe I could wait another 15 minutes--what's the harm since it's so easy to just sit and wait? The harm is that I've got so many things on my plate that are more important and have more value than this project. Also, I don't really like working on this project. All I've experienced in the last year was the pointy tips of bug reports most of them related to authentication. The only time anyone appreciates me spending my very very precious little free time on Standups is when I solicit it.

Also, I don't even know if it's "Standups" with an "s" or "Standup" without the "s". It might be both. I'm tired of looking it up to avoid embarrassment.

Timeline for shutdown

Shutting down projects like Standups is tricky and takes time and energy. Tentatively, I think it'll be something like this:

  • August 29th -- Announce end of Standups. Add site message (assuming the site comes back up). Adjust IRC bot reply message.
  • October 1st -- Shut down IRC bots.
  • October 15th -- Decommission infrastructure: websites, DNS records, Heroku infra; archive github repositories, etc.

I make no promises on that timeline. Maybe things will happen faster or slower depending on circumstances.

What we're not going to do

There are a few things we're definitely not going to do when decommissioning:

First, we will not be giving the data to anyone. No db dumps, no db access, nothing. If you want the data, you can slurp it down via the existing APIs on your own. [1]

Second, we're not going to keep the site going in read-only mode. No read-only mode or anything like that unless someone gives us a really really compelling reason to do that and has a last name that rhymes with Honnen. If you want to keep a historical mirror of the site, you can do that on your own.

Third, we're not going to point www.standu.ps or standu.ps at another project. Pretty sure the plan will be to have those names point to nothing or something like that. IT probably has a process for that.

[1] Turns out when we did the rewrite in 2016, we didn't reimplement the GET API. Issue 478 covers creating a temporary new one. If you want it, please help them out.

Alternatives for people using it

If you're looking for alternatives or want to discuss alternatives with other people, check out issue 476.

But what if you want to save it!

Maybe you want to save it and you've been waiting all this time for just the right moment? If you want to save it, check out issue 477.

Many thanks!

Thank you to all the people who worked on Standups in the early days! I liked those days--they were fun.

Thank you to everyone who used Standups over the years. I hope it helped more than it hindered.

Update August 30, 2018: Added note about GET API not existing.

Thoughts on Guido retiring as BDFL of Python

I read the news of Guido van Rossum announcing his retirement as BDFL of Python and it made me a bit sad.

I've been programming in Python for almost 20 years on a myriad of open source projects, tools for personal use, and work. I helped out with several PyCon US conferences and attended several others. I met a lot of amazing people who have influenced me as a person and as a programmer.

I started PyVideo in March 2012. At a PyCon US after that (maybe 2015?), I found myself in an elevator with Guido and somehow we got to talking about PyVideo and he asked point-blank, "Why work on that?" I tried to explain what I was trying to do with it: create an index of conference videos across video sites, improve the meta-data, transcriptions, subtitles, feeds, etc. I remember he patiently listened to me and then said something along the lines of how it was a good thing to work on. I really appreciated that moment of validation. I think about it periodically. It was one of the reasons Sheila and I worked hard to transition PyVideo to a new group after we were burned out.

It wouldn't be an overstatement to say that through programming in Python, I've done some good things and become a better person.

Thank you, Guido, for everything!

Standup report: June 8th, 2018

What is Standup?

Standup is a system for capturing standup-style posts from individuals, making it easier to see what's going on for teams and projects. It has an associated IRC bot standups for posting messages from IRC.

Project report

Over the last six months, we've done:

  • monthly library updates
  • revamped static assets management infrastructure
  • service maintenance
  • fixed the textarea to be resizeable (Thanks, Arai!)

The monthly library updates have helped with reducing technical debt. That takes a few hours each month to work through.

Paul redid how Standup does static assets. We no longer use django-pipeline, but instead use gulp. It works muuuuuuch better and makes it possible to upgrade to Django 2.0 soon. That was a ton of work over the course of a few days for both of us.

We've been keeping the Standup service running. That includes stage and production websites as well as stage and production IRC bots. That also includes helping users who are stuck--usually with accounts management. That's been a handful of hours.

Arai fixed the textareas so they're resizeable. That helps a ton! I'd love to get more help with UI/UX fixing.

Some GitHub stats:

GitHub
======

  mozilla/standup: 15 prs

    Committers:
             pyup-bot :     6  (  +588,   -541,   20 files)
               willkg :     5  (  +383,   -169,   27 files)
                 pmac :     2  ( +4179,   -223,   58 files)
               arai-a :     1  (    +2,     -1,    1 files)
                  g-k :     1  (    +3,     -3,    1 files)

                Total :        ( +5155,   -937,   89 files)

    Most changed files:
      requirements.txt (11)
      requirements-dev.txt (7)
      standup/settings.py (5)
      docker-compose.yml (4)
      standup/status/jinja2/base.html (3)
      standup/status/models.py (3)
      standup/status/tests/test_views.py (3)
      standup/status/urls.py (3)
      standup/status/views.py (3)
      standup/urls.py (3)

    Age stats:
          Youngest PR : 0.0d: 466: Add site-wide messaging
       Average PR age : 2.3d
        Median PR age : 0.0d
            Oldest PR : 10.0d: 459: Scheduled monthly dependency update for May


  All repositories:

    Total merged PRs: 15


Contributors
============

  arai-a
  g-k
  pmac
  pyup-bot
  willkg

That's it for the last six months!

Switching to swag-driven development

Do you use Standup?

Did you use Standup, but the glacial pace of fixing issues was too much so you switched to something else?

Do you want to use Standup?

We think there's still some value in having Standup around and there are still people using it. There's still some technical debt to fix that makes working on it harder than it should be. We've been working through that glacially.

As a project, we have the following problems:

  1. The bulk of the work is being done by Paul and Will.
  2. We don't have time to work on Standup.
  3. There isn't anyone else contributing.

Why aren't users contributing? Probably a lot of reasons. Maybe everyone has their own reason! Have I spent a lot of time to look into this? No, because I don't have a lot of time to work on Standup.

Instead, we're just going to make some changes and see whether that helps. So we're doing the following:

  1. Will promises to send out Standup project reports every 6 months before the All Hands and in doing this raise some awareness of what's going on and thank people who contributed.
  2. We're fixing the Standup site to be clearer on who's doing work and how things get fixed so it's more likely your ideas come to fruition rather than get stale.
  3. We're switching Standup to swag-driven development!

What's that you say? What's swag-driven development?

I mulled over the idea in my post on swag-driven development.

It's a couple of things, but mainly an explicit statement that people work on Standup in our spare time at the cost of not spending that time on other things. While we don't feel entitled to feeling appreciated, it would be nice to feel appreciated sometimes. Not feeling appreciated makes me wonder whether I should spend the time elsewhere. (And maybe that's the case--I have no idea.) Maybe other people would be more interested in spending their spare time on Standup if they knew there were swag incentives?

So what does this mean?

It means that we're encouraging swag donations!

  • If your team has stickers at the All Hands and you use Standup, find Paul and Will and other Standup contributors and give them one!
  • If there are features/bugs you want fixed and they've been sitting in the queue forever, maybe bribing is an option.

For the next quarter

Paul and I were going to try to get together at the All Hands and discuss what's next.

We don't really have an agenda. I know I look at the issue tracker and go, "ugh" and that's about where my energy level is these days.

Possible things to tackle in the next 6 months off the top of my head:

If you're interested in meeting up with us, toss me an email at willkg at mozilla dot com.

AWS Lambda dev with Python

A story of a pigeon

I work on Socorro which is the crash ingestion pipeline for Mozilla's products.

The pipeline starts at the collector which handles incoming HTTP POST requests, pulls out the payload, futzes with it a little, and then saves it to AWS S3. Socorro then processes some of those crashes in the processor. The part that connects the two is called Pigeon. It was intended as a short-term solution to bridge the collector and the processor, but it's still around a year later and the green grass grows all around all around and the green grass grows all around.

Pigeon is an AWS Lambda function that triggers on S3 ObjectCreated:Put events, looks at the filename, and then adds things to the processing queue depending on the filename structure. We called it Pigeon for various hilarious reasons that are too mundane to go into in this blog post.
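The key-dispatch idea can be sketched like this (hypothetical code: the real Pigeon logic lives in the repo; the key shape matches the v2/raw_crash keys shown later in this post):

```python
import re

# Keys look like: v2/raw_crash/<entropy>/<date>/<crash id>
CRASH_ID_RE = re.compile(
    r'^v2/raw_crash/'
    r'(?P<entropy>[0-9a-f]{3})/'
    r'(?P<date>\d{8})/'
    r'(?P<crashid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$'
)

def get_crash_id(key):
    """Return the crash id from an S3 key, or None if the key doesn't
    look like a raw crash we should queue for processing."""
    match = CRASH_ID_RE.match(key)
    return match.group('crashid') if match else None

key = 'v2/raw_crash/000/20180313/00007bd0-2d1c-4865-af09-80bc00180313'
print(get_crash_id(key))
# -> 00007bd0-2d1c-4865-af09-80bc00180313
```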

It's pretty basic. It doesn't do much. It was a short term solution we thought we'd throw away pretty quickly. I wrote some unit tests for the individual parts of it and a "client" that invoked the function in a faux AWS Lambda like way. That was good enough.

But then some problems

Pigeon was written with Python 2 because at the time AWS Lambda didn't have a Python 3 runtime. That changed--now there's one with Python 3.6.

In January, I decided to update Pigeon to work with Python 3.6. I tweaked the code, tweaked the unit tests, and voila--it was done! Then we deployed it to our -stage environment where it failed epically in technicolor glory (but no sound!) and we had to back it out and return to the Python 2 version.

What happened? I'll tell you what happened--we had a shit testing environment. Sure, we had tests, but they lacked several things:

  1. At no point do we test against the build artifact for Pigeon. The build artifact for AWS Lambda jobs in Python is a .zip file that includes the code and all the libraries that it uses.
  2. The tests "invoke" Pigeon with a "client", but that client was pretty unlike the AWS Lambda Python 3.6 runtime.
  3. Turns out I had completely misunderstood how I should be doing exception handling in AWS Lambda.

So our tests tested some things, but missed some important things and a big bug didn't get caught before going to -stage.

It sucked. I felt chagrined. I like to think I have a tolerance for failure since I do it a lot, but this felt particularly faily and some basic safeguards would have prevented it from happening.

Fleshing out AWS Lambda in Python project

We were thinking of converting another part of the Socorro pipeline to AWS Lambda, but I put that on hold until I had wrapped my head around how to build a development environment that included scaffolding for testing AWS Lambda functions in a real runtime.

Miles or Brian mentioned aws-sam-local. I looked into that. It's written in Go, they suggest installing it with npm, it does a bunch of things, and it has some event generation code. But for the things I needed, it seemed like it would just be a convenience CLI for docker-lambda.

I had been aware of docker-lambda for a while, but hadn't looked at the project recently. They added support for passing events via stdin. Their docs have examples of invoking Lambda functions. That seemed like what I needed.

I took that and built the developer environment scaffolding that we've got in Pigeon now. Further, I decided to use this same model for future AWS Lambda function development.

How does it work?

Pigeon is a Python project, so it uses Python libraries. I maintain those requirements in a requirements.txt file.

I install the requirements into a ./build directory:

$ pip install --ignore-installed --no-cache-dir -r requirements.txt -t build/

I copy the Pigeon source into that directory, too:

$ cp pigeon.py build/

That's all I need for the runtime to use.

The tests are in the tests/ directory. I'm using pytest and in the conftest.py file have this at the top:

import os
import sys

# Insert build/ directory in sys.path so we can import pigeon
sys.path.insert(
    0,
    os.path.join(
        os.path.dirname(os.path.dirname(__file__)),
        'build'
    )
)

I'm using Docker and docker-compose to aid development. I use a test container which is a python:3.6 image with the test requirements installed in it.

In this way, tests run against the ./build directory.

Now I want to be able to invoke Pigeon in an AWS Lambda runtime so I can debug issues and also write an integration test.

I set up a lambda-run container that uses the lambci/lambda:python3.6 image. I mount ./build as /var/task since that's where the AWS Lambda runtime expects things to be.

I created a shell script for invoking Pigeon:

#!/bin/bash

docker-compose run \
    --rm \
    -v "$PWD/build":/var/task \
    --service-ports \
    -e DOCKER_LAMBDA_USE_STDIN=1 \
    lambda-run pigeon.handler "$@"

That's based on the docker-lambda invoke examples.

Let's walk through that:

  1. It runs the lambda-run container with the services it depends on as defined in my docker-compose.yml file.
  2. It mounts the ./build directory as /var/task because that's where the runtime expects the code it's running to be.
  3. The DOCKER_LAMBDA_USE_STDIN=1 environment variable causes it to look at stdin for the event. That's pretty convenient.
  4. It invokes pigeon.handler which is the handler function in the pigeon Python module.

I have another script that generates fake AWS S3 ObjectCreated:Put events. I cat the result of that into the invoke shell script. That runs everything nicely:

$ ./bin/generate_event.py --key v2/raw_crash/000/20180313/00007bd0-2d1c-4865-af09-80bc00180313 > event.json
$ cat event.json | ./bin/run_invoke.sh
Starting socorropigeon_rabbitmq_1 ... done
START RequestId: 921b4ecf-6e3f-4bc1-adf6-7d58e4d41f47 Version: $LATEST
{"Timestamp": 1523588759480920064, "Type": "pigeon", "Logger": "antenna", "Hostname": "300fca32d996", "EnvVersion": "2.0", "Severity": 4, "Pid": 1, "Fields": {"msg": "Please set PIGEON_AWS_REGION. Returning original unencrypted data."}}
{"Timestamp": 1523588759481024512, "Type": "pigeon", "Logger": "antenna", "Hostname": "300fca32d996", "EnvVersion": "2.0", "Severity": 4, "Pid": 1, "Fields": {"msg": "Please set PIGEON_AWS_REGION. Returning original unencrypted data."}}
{"Timestamp": 1523588759481599232, "Type": "pigeon", "Logger": "antenna", "Hostname": "300fca32d996", "EnvVersion": "2.0", "Severity": 6, "Pid": 1, "Fields": {"msg": "number of records: 1"}}
{"Timestamp": 1523588759481796864, "Type": "pigeon", "Logger": "antenna", "Hostname": "300fca32d996", "EnvVersion": "2.0", "Severity": 6, "Pid": 1, "Fields": {"msg": "looking at key: v2/raw_crash/000/20180313/00007bd0-2d1c-4865-af09-80bc00180313"}}
{"Timestamp": 1523588759481933056, "Type": "pigeon", "Logger": "antenna", "Hostname": "300fca32d996", "EnvVersion": "2.0", "Severity": 6, "Pid": 1, "Fields": {"msg": "crash id: 00007bd0-2d1c-4865-af09-80bc00180313 in dev_bucket"}}
MONITORING|1523588759|1|count|socorro.pigeon.accept|#env:test
{"Timestamp": 1523588759497482240, "Type": "pigeon", "Logger": "antenna", "Hostname": "300fca32d996", "EnvVersion": "2.0", "Severity": 6, "Pid": 1, "Fields": {"msg": "00007bd0-2d1c-4865-af09-80bc00180313: publishing to socorrodev.normal"}}
END RequestId: 921b4ecf-6e3f-4bc1-adf6-7d58e4d41f47
REPORT RequestId: 921b4ecf-6e3f-4bc1-adf6-7d58e4d41f47 Duration: 101 ms Billed Duration: 200 ms Memory Size: 1536 MB Max Memory Used: 28 MB

null
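The fake event generation is along these lines (a minimal sketch with only the fields a handler like this typically reads, not the full S3 event schema; the real script is bin/generate_event.py in the repo):

```python
import json

def generate_event(bucket, key):
    """Build a minimal fake S3 ObjectCreated:Put event."""
    return {
        'Records': [
            {
                'eventSource': 'aws:s3',
                'eventName': 'ObjectCreated:Put',
                's3': {
                    'bucket': {'name': bucket},
                    'object': {'key': key},
                },
            }
        ]
    }

event = generate_event(
    'dev_bucket',
    'v2/raw_crash/000/20180313/00007bd0-2d1c-4865-af09-80bc00180313'
)
print(json.dumps(event, indent=2))
```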

Then I wrote an integration test that cleared the RabbitMQ queue, ran the invoke script with a bunch of different keys, and then checked what was in the processor queue.

Now I've got:

  • tests that test the individual bits of Pigeon
  • a way to run Pigeon in the same environment as -stage and -prod
  • an integration test that runs the whole setup

A thing I hadn't mentioned was that Pigeon's documentation is entirely in the README. The docs cover setup and development well enough that I can hand this off to normal people and future me. I like simple docs. Building scaffolding such that docs are simple makes me happy.

Summary

You can see the project at https://github.com/mozilla-services/socorro-pigeon.

Socorro Smooth Mega-Migration 2018

Summary

Socorro is the crash ingestion pipeline for Mozilla's products like Firefox. When Firefox crashes, the Breakpad crash reporter asks the user if the user would like to send a crash report. If the user answers "yes!", then the Breakpad crash reporter collects data related to the crash, generates a crash report, and submits that crash report as an HTTP POST to Socorro. Socorro collects and saves the crash report, processes it, and provides an interface for aggregating, searching, and looking at crash reports.

Over the last year and a half, we've been working on a new infrastructure for Socorro and migrating the project to it. It was a massive undertaking and involved changing a lot of code and some architecture and then redoing all the infrastructure scripts and deploy pipelines.

On Thursday, March 28th, we pushed the button and switched to the new infrastructure. The transition was super smooth. Now we're on new infra!

This blog post talks a little about the old and new infrastructures and the work we did to migrate.

Read more…