Data Org Working Groups: retrospective (2020)

Project

time

1 month

impact

established cross organization groups as a tool for grouping people

Problem statement

Data Org architects, builds, and maintains a data ingestion system and the ecosystem of pieces around it. It covers a swath of engineering and data science disciplines and problem domains. Many of us are generalists and have expertise and interests in multiple areas. Many projects cut across disciplines, problem domains, and organizational structures. Some projects, disciplines, and problem domains benefit from participation of other stakeholders who aren't in Data Org.

In order to succeed in tackling the projects of tomorrow, we need to formalize creating, maintaining, and disbanding groups composed of interested stakeholders focusing on specific missions. Further, we need a set of best practices to help make these groups successful.

Read more…

Markus v3.0.0 released! Better metrics API for Python projects.

What is it?

Markus is a Python library for generating metrics.

Markus makes it easier to generate metrics in your program by:

  • providing multiple backends (Datadog statsd, statsd, logging, logging roll-up, and so on) for sending metrics data to different places

  • sending metrics to multiple backends at the same time

  • providing testing helpers for easy verification of metrics generation

  • providing a decoupled architecture making it easier to write code to generate metrics without having to worry about making sure creating and configuring a metrics client has been done--similar to the Python logging module in this way

We use it at Mozilla on many projects.

v3.0.0 released!

I released v3.0.0 just now. Changes:

Features

  • Added support for Python 3.9 (#79). Thank you, Brady!

  • Changed assert_* helper methods on markus.testing.MetricsMock to print the records to stdout if the assertion fails. This can save some time debugging failing tests. (#74)

Backwards incompatible changes

  • Dropped support for Python 3.5 (#78). Thank you, Brady!

  • markus.testing.MetricsMock.get_records and markus.testing.MetricsMock.filter_records return markus.main.MetricsRecord instances now.

    This might require you to rewrite/update tests that use the MetricsMock.

Where to go for more

Changes for this release: https://markus.readthedocs.io/en/latest/history.html#february-5th-2021

Documentation and quickstart here: https://markus.readthedocs.io/en/latest/index.html

Source code and issue tracker here: https://github.com/willkg/markus/

Let me know how this helps you!

Socorro: This Period in Crash Stats: Volume 2021.1

New features and changes in Crash Stats

Crash Stats crash report view pages show Breadcrumbs information

In 2020q3, Roger and I worked out a new Breadcrumbs crash annotation for crash reports generated by the android-components crash reporter. It's a JSON-encoded field with a structure just like the one that the sentry-sdk sends. Incoming crash reports that have this annotation will show the data on the crash report view page to people who have protected data access.

/images/tpics_2021_1_breadcrumbs.thumbnail.png

Figure 1: Screenshot of Breadcrumbs data in Details tab of crash report view on Crash Stats.

I implemented it based on what we get in the structure and what's in the Sentry interface.

Breadcrumbs information is not searchable with Supersearch. Currently, it's only shown in the crash report view in the Details tab.

Does this help your work? Are there ways to improve this? If so, let me know!

This work was done in [bug 1671276].

Crash Stats crash report view pages show Java exceptions

For the longest of long times, crash reports from a Java process included a JavaStackTrace annotation which was a big unstructured string of problematic parseability and I couldn't do much with it.

In 2020q4, Roger and I worked out a new JavaException crash annotation which was a JSON-encoded structured string containing the exception information. Not only does it have the final exception, but it also has cascading exceptions if there are any! With a structured form of the exception, we can suddenly do a lot of interesting things.

As a first step, I added display of the Java exception information to the crash report view page in the Display tab. It's in the same place that you would see the crashing thread stack if this were a C++/Rust crash.

Just like JavaStackTrace, the JavaException annotation has some data in it that can have PII in it. Because of that, the Socorro processor generates two versions of the data: one that's sanitized (no java exception message values) and one that's raw. If you have protected data access, you can see the raw one.

The interface is pretty wide and exceeds the screenshot. Sorry about that.

/images/tpics_2021_1_exception.thumbnail.png

Figure 2: Screenshot of Java exception data in Details tab of crash report view in Crash Stats.

My next step is to use the structured exception information to improve Java crash report signatures. I'm doing that work in [bug 1541120] and hoping to land that in 2021q1. More on that later.

Does this help your work? Are there ways to improve this? If so, let me know!

This work was done in [bug 1675560].

Changes to crash report view

One of the things I don't like about the crash report view is that it's impossible to intuit where the data you're looking at is from. Further, some of the tabs were unclear about what bits of data were protected data and what bits weren't. I've been improving that over time.

The most recent step involved the following changes:

  1. The "Metadata" tab was renamed to "Crash Annotations". This tab holds the crash annotation data from the raw crash before processing as well as a few fields that the collector adds when accepting a crash report from the client. Most of the fields are defined in the CrashAnnotations.yaml file in mozilla-central. The ones that aren't, yet, should get added. I have that on my list of things to get to.

  2. The "Crash Annotations" tab is now split into public and protected data sections. I hope this makes it a little clearer which is which.

  3. I removed some unneeded fields that the collector adds at ingestion.

Does this help your work? Are there ways to improve this? If so, let me know!

What's in the queue

In addition to the day-to-day stuff, I'm working on the following big projects in 2021q1.

Remove the Email field

Last year, I looked into who's using the Email field and for what, whether the data was any good, and in what circumstances do we even get Email data. That work was done in [bug 1626277].

The consensus is that since not all of the crash reporter clients let a user enter in an email address, it doesn't seem like we use the data, and it's pretty toxic data to have, we should remove it.

The first step of that is to delete the field from the crash report at ingestion. I'll be doing that work in [bug 1688905].

The second step is to remove it from the webapp. I'll be doing that work in [bug 1688907].

Once that's done, I'll write up some bugs to remove it from the crash reporter clients and wherever else it is in products.

Does this affect you? If so, let me know!

Redo signature generation for Java crashes

Currently, signature generation for Java crashes is pretty basic and it's not flexible in the ways we need it. Now we can fix that.

I need some Java crash expertise to bounce ideas off of and to help me verify "goodness" of signatures. If you're interested in helping in any capacity or if you have opinions on how it should work or what you need out of it, please let me know.

I'm hoping to do this work in 2021q1.

The tracker bug is [bug 1541120].

Closing

Thank you to Roger Yang who implemented Breadcrumbs and JavaException reporting and Gabriele Svelto who advised on new annotations and how things should work! Thank you to everyone who submits signature generation changes--I really appreciate your efforts!

Socorro Engineering: Half in Review 2020 h2 and 2020 retrospective

Summary

2020h1 was rough. 2020h2 was also rough: more layoffs, 2 re-orgs, Covid-19.

I (and Socorro and Tecken) got re-orged into the Data Org. Data Org manages the Telemetry ingestion pipeline as well as all the things related to it. There's a lot of overlap between Socorro and Telemetry and being in the Data Org might help reduce that overlap and ease maintenance.

But this post isn't about the future--it's about the past! Let's talk about what happened in 2020h2 and then a brief retrospective of 2020.

Prepare to dive in!

Read more…

Everett v1.0.3 released!

What is it?

Everett is a configuration library for Python apps.

Goals of Everett:

  1. flexible configuration from multiple configured environments

  2. easy testing with configuration

  3. easy documentation of configuration for users

From that, Everett has the following features:

  • is composeable and flexible

  • makes it easier to provide helpful error messages for users trying to configure your software

  • supports auto-documentation of configuration with a Sphinx autocomponent directive

  • has an API for testing configuration variations in your tests

  • can pull configuration from a variety of specified sources (environment, INI files, YAML files, dict, write-your-own)

  • supports parsing values (bool, int, lists of things, classes, write-your-own)

  • supports key namespaces

  • supports component architectures

  • works with whatever you're writing--command line tools, web sites, system daemons, etc

v1.0.3 released!

This is a minor maintenance update that fixes a couple of minor bugs, addresses a Sphinx deprecation issue, drops support for Python 3.4 and 3.5, and adds support for Python 3.8 and 3.9 (largely adding those environments to the test suite).

Why you should take a look at Everett

At Mozilla, I'm using Everett for a variety of projects: Mozilla symbols server, Mozilla crash ingestion pipeline, and some other tooling. We use it in a bunch of other places at Mozilla, too.

Everett makes it easy to:

  1. deal with different configurations between local development and server environments

  2. test different configuration values

  3. document configuration options

First-class docs. First-class configuration error help. First-class testing. This is why I created Everett.

If this sounds useful to you, take it for a spin. It's a drop-in replacement for python-decouple and os.environ.get('CONFIGVAR', 'default_value') style of configuration so it's easy to test out.

Enjoy!

Where to go for more

For more specifics on this release, see here: https://everett.readthedocs.io/en/latest/history.html#october-28th-2020

Documentation and quickstart here: https://everett.readthedocs.io/

Source code and issue tracker here: https://github.com/willkg/everett

Socorro Engineering: Half in Review 2020 h1

Summary

2020h1 was rough. Layoffs, re-org, Berlin All Hands, Covid-19, focused on MLS for a while, then I switched back to Socorro/Tecken full time, then virtual All Hands.

It's September now and 2020h1 ended a long time ago, but I'm only just getting a chance to catch up and some things happened in 2020h1 that are important to divulge and we don't tell anyone about Socorro events via any other medium.

Prepare to dive in!

Read more…

RustConf 2020 thoughts

Last year, I went to RustConf 2019 in Portland. It was a lovely conference. Everyone I saw was so exuberantly happy to be there--it was just remarkable. It was my first RustConf. Plus while I've been sort-of learning Rust for a while and cursorily related to Rust things (I work on crash ingestion and debug symbols things), I haven't really done any Rust work. Still, it was a remarkable and very exciting conference.

RustConf 2020 was entirely online. I'm in UTC-4, so it occurred during my afternoon and evening. I spent the entire time watching the RustConf 2020 stream and skimming the channels on Discord. Everyone I saw on the channels were so exuberantly happy to be there and supportive of one another--it was just remarkable. Again! Even virtually!

I missed the in-person aspect of a conference a bit. I've still got this thing about conferences that I'm getting over, so I liked that it was virtual because of that and also it meant I didn't have to travel to go.

I enjoyed all of the sessions--they're all top-notch! They were all pretty different in their topics and difficulty level. The organizers should get gold stars for the children's programming between sessions. I really enjoyed the "CAT!" sightings in the channels--that was worth the entrance fee.

This is a summary of the talks I wrote notes for.

Read more…

Experimenting with Symbolic

One of the things I work on is Tecken which runs Mozilla Symbols Server. It's a server that handles Breakpad symbols files upload, download, and stack symbolication.

Bug #1614928 covers adding line numbers to the symbolicated stack results for the symbolication API. The current code doesn't parse line records in Breakpad symbols files, so it doesn't know anything about line numbers. I spent some time looking at how much effort it'd take to improve the hand-written Breakpad symbol file parsing code to parse line records which requires us to carry those changes through to the caching layer and some related parts--it seemed really tricky.

That's the point where I decided to go look at Symbolic which I had been meaning to look at since Jan wrote the Native Crash Reporting: Symbol Servers, PDBs, and SDK for C and c++ blog post a year ago.

What is symbolication?

There are lots of places where stacks are interesting. For example:

  1. the stack of the crashing thread in a crash report

  2. the stacks of the parent and child processes in a hung IPC channel

  3. the stack of a thread being profiled

  4. the stack of a thread at a given point in time for debugging

"The stack" is an array of addresses in memory corresponding to the value of the instruction pointer for each of those stack frames. You can use the module information to convert that array of memory offsets to an array of [module, module_offset] pairs. Something like this:

[ 3, 6516407 ],
[ 3, 12856365 ],
[ 3, 12899916 ],
[ 3, 13034426 ],
[ 3, 13581214 ],
[ 3, 13646510 ],
...

with modules:

[ "firefox.pdb", "5F84ACF1D63667F44C4C44205044422E1" ],
[ "mozavcodec.pdb", "9A8AF7836EE6141F4C4C44205044422E1" ],
[ "Windows.Media.pdb", "01B7C51B62E95FD9C8CD73A45B4446C71" ],
[ "xul.pdb", "09F9D7ECF31F60E34C4C44205044422E1" ],
...

That's neat, but hard to work with.

What you really want is a human-readable stack of function names and files and line numbers. Then you can go look at the code in question and start your debugging adventure.

When the program is compiled, the act of compiling produces a bunch of compiler debugging information. We use dump_syms to extract the symbol information and put it into the Breakpad symbols file format. Those files get uploaded to Mozilla Symbols Server where they join all the symbols files for all the builds for the last 2 years.

Symbolication takes the array of [module, module_offset] pairs, the list of modules in memory, and the Breakpad symbols files for those modules and looks up the symbols for the [module, module_offset] pairs producing symbolicated frames.

Then you get something nicer like this:

0  xul.pdb  mozilla::ConsoleReportCollector::FlushReportsToConsole(unsigned long long, nsIConsoleReportCollector::ReportAction)
1  xul.pdb  mozilla::net::HttpBaseChannel::MaybeFlushConsoleReports()",
2  xul.pdb  mozilla::net::HttpChannelChild::OnStopRequest(nsresult const&, mozilla::net::ResourceTimingStructArgs const&, mozilla::net::nsHttpHeaderArray const&, nsTArray<mozilla::net::ConsoleReportCollected> const&)
3  xul.pdb  std::_Func_impl_no_alloc<`lambda at /builds/worker/checkouts/gecko/netwerk/protocol/http/HttpChannelChild.cpp:1001:11',void>::_Do_call()
...

Yay! Much rejoicing! Something we can do something with!

I wrote about this a bit in Crash pings and crash reports.

Tecken has a symbolication API, so you can send in a well-crafted HTTP POST and it'll symbolicate the stack for you and return it.

Quickstart with Symbolic in Python

Symbolic is a Rust crate with a Python library wrapper. The Sentry folks do a great job of generating wheels and uploading those to PyPI, so installing Symbolic is as easy as:

pip install symbolic

The Symbolic docs are terse. I found the following documentation:

That helped, but I had questions those didn't answer. I have an intrepid freshman understanding of Rust, so I ended up reading the code, tests, and examples.

The one big thing that tripped me up was that Symbolic can't parse Breakpad symbols files from a byte stream--they need to be files on disk. Tecken doesn't store Breakpad symbols files on disk--they're in AWS S3 buckets. So it downloads them and parses the byte stream. In order to use Symbolic, we'll have to adjust that to save the file to disk, then parse it, then delete the file afterwards. 1

1

If that's not true, please let me know.

Anyhow, here's some sample annotated code using Symbolic to do symbol lookups:

import symbolic

# This is a Breakpad symbols file I have on disk.
archive = symbolic.Archive.open("XUL/75A79CFA0E783A35810F8ADF2931659A0/XUL.sym")

# We do debug ids as all-uppercase with no hyphens. However, symbolic
# requires that get normalized into the form it likes.
debug_id = symbolic.normalize_debug_id("75A79CFA0E783A35810F8ADF2931659A0")

# This parses the Breakpad symbols file and returns a symcache that we can
# look up addresses in.
obj = archive.get_object(debug_id=ndebug_id)
symcache = obj.make_symcache()

# Symbol lookup returns a list of LineInfo objects.
lineinfos = symcache.lookup(0xf5aa0)

print("line: %s symbol: %s" % (lineinfos[0].line, lineinfos[0].symbol))

Cool!

Symbolic parses Breakpad symbols files. It uses a cache format for fast symbol lookups. Loading the cache file is very fast.

Further, Symbolic parses files of a variety of other debug binary formats. This could be handy for skipping the intermediary Breakpad symbol file and using the debug binaries directly. More on that idea later.

Tecken is maintained by a team of two and we have other projects, so it spends a lot of time sitting in the corner feeling sad. Meanwhile, Symbolic is actively worked on by Sentry and a cadre of other contributors including Mozilla engineers because it's one of the cornerstone crates for the great Rust rewrite of Breakpad things. That's a big win for me.

So then I built a prototype

Today, I threw together a web app that does symbolication using Symbolic and called it Sherwin Syms.

https://github.com/willkg/sherwin-syms/

Building a separate prototype gives me something to tinker with that's not in production. I was able to add line number information pretty quickly. I can experiment with caching on disk. I can compare the symbolication API output for stacks between the prototype and what the Mozilla Symbols Server produces.

There's a lot of scaffolding in there. The Symbolic-using bits are in this file:

https://github.com/willkg/sherwin-syms/blob/master/src/sherwin_syms/symbols.py

Next steps

I need to integrate this into Tecken. I think that means writing a new v6 API view because the v4 and v5 code is tangled up with downloading and caching.

Markus and Gabriele suggested Tecken skip the Breakpad symbols files and use the debug binaries directly--Symbolic can handle those, too. The compelling reason for this is that Breakpad symbols files lose all the information for symbolicating inline functions correctly. I hope to look into that soon.

Summary

That summarizes the week I spent with Symbolic.

Switching from pyup to dependabot

Switching from pyup to dependabot

I maintain a bunch of Python-based projects including some major projects like Crash Stats, Mozilla Symbols Server, and Mozilla Location Services. In order to keep up with dependency updates, we used pyup to monitor dependencies in those projects and create GitHub pull requests for updates.

pyup was pretty nice. It would create a single pull request with many dependency updates in it. I could then review the details, wait for CI to test everything, make adjustments as necessary, and then land the pull request and go do other things.

Starting in October of 2019, pyup stopped doing monthly updates. A co-worker of mine tried to contact them to no avail. I don't know what happened. I got tired of waiting for it to start working again.

Since my projects are all on GitHub, we had already switched to GitHub security alerts. Given that, I decided it was time to switch from pyup to dependabot (also owned by GitHub).

Switching from pyup to dependabot

I had to do a bunch of projects, so I ended up with a process along these lines:

  1. Remove projects from pyup.

    All my projects are either in mozilla or mozilla-services organizations on GitHub.

    We had a separate service account configure pyup, so I'm not able to make changes to pyup myself.

    I had to ask Greg to remove my projects from pyup.

    I wouldn't suggest proceeding until your project has been removed from pyup. Otherwise, it's possible you'll get PRs from pyup and dependabot for the same updates.

  2. Add dependabot configuration to repo.

    Then I added the required dependabot configuration to my repository and removed the pyup configuration.

    I used these resources:

    I created a pull request with these changes, reviewed it, and landed it.

  3. Enable dependabot.

    For some reason, I couldn't enable dependabot for my projects. I had to ask Greg who I think asked Hal to enable dependabot for my projects.

    Once this was done, then dependabot created a plethora of pull requests.

While there are Mozilla-specific bits in here, it's probably generally helpful.

Dealing with incoming pull requests

dependabot isn't as nice as pyup was. It can only update one dependency per PR. That stinks for a bunch of reasons:

  1. working through 30 PRs is extremely time consuming

  2. every time you finish up work on one PR, it triggers dependabot to update the others and that triggers email notifications, CI builds, and a bunch of spam and resource usage

  3. dependencies often depend on each other and need to get updated as a group

Since we hadn't been keeping up with Python dependencies, we ended up with between 20 and 60 pull requests to deal with per repository.

For Antenna, I rebased each PR, reviewed it, and merged it by hand. That took a day to do. It sucked. I can't imagine doing this four times every month.

While working on PRs for Socorro, I hit a case where I needed to update multiple dependencies at the same time. I decided to write a tool that combined pull requests.

Thus was born paul-mclendahand. Using this tool, I can combine pull requests. Using paul-mclendahand, I worked through 20 pull requests for Tecken in about an hour. This saves me tons of time!

My process goes like this:

  1. create a new branch on my laptop based off of the main branch

  2. list all open pull requests by running pmac listprs

  3. make a list of pull requests to combine into it

  4. for each pull request, I:

    1. run pmac add PR

    2. resolve any cherry-pick conflicts

    3. (optional) rebuild my project and run tests

  5. push the new branch to GitHub

  6. create a pull request

  7. run pmac prmsg and copy-and-paste the output as the pull request description

I can then review the pull request. It has links to the other pull requests and the data that dependabot puts together for each update. I can rebase, add additional commits, etc.

When I'm done, I merge it and that's it!

paul-mclendahand v1.0.0

I released paul-mclendahand 1.0.0!

Install it with pipx:

pipx install paul-mclendahand

Install it with pip:

pip install paul-mclendahand

It doesn't just combine pull requests from dependabot--it's general and can work on any pull requests.

If you find any issues, please report them in the issue tracker.

I hope this helps you!

How to pick up a project with an audit

Over the last year, I was handed a bunch of projects in various states. One of the first things I do when getting a new project that I'm suddenly responsible for is to audit the project. That helps me figure out what I'm looking at and what I need to do with it next.

This blog post covers my process for auditing projects I'm suddenly the proud owner of.

Read more…