Experimenting with Symbolic ๐Ÿ”—

One of the things I work on is Tecken which runs Mozilla Symbols Server. It's a server that handles Breakpad symbols files upload, download, and stack symbolication.

Bug #1614928 covers adding line numbers to the symbolicated stack results for the symbolication API. The current code doesn't parse line records in Breakpad symbols files, so it doesn't know anything about line numbers. I spent some time looking at how much effort it'd take to improve the hand-written Breakpad symbol file parsing code to parse line records which requires us to carry those changes through to the caching layer and some related parts--it seemed really tricky.

That's the point where I decided to go look at Symbolic which I had been meaning to look at since Jan wrote the Native Crash Reporting: Symbol Servers, PDBs, and SDK for C and c++ blog post a year ago.

What is symbolication?

There are lots of places where stacks are interesting. For example:

  1. the stack of the crashing thread in a crash report

  2. the stacks of the parent and child processes in a hung IPC channel

  3. the stack of a thread being profiled

  4. the stack of a thread at a given point in time for debugging

"The stack" is an array of addresses in memory corresponding to the value of the instruction pointer for each of those stack frames. You can use the module information to convert that array of memory offsets to an array of [module, module_offset] pairs. Something like this:

[ 3, 6516407 ],
[ 3, 12856365 ],
[ 3, 12899916 ],
[ 3, 13034426 ],
[ 3, 13581214 ],
[ 3, 13646510 ],
...

with modules:

[ "firefox.pdb", "5F84ACF1D63667F44C4C44205044422E1" ],
[ "mozavcodec.pdb", "9A8AF7836EE6141F4C4C44205044422E1" ],
[ "Windows.Media.pdb", "01B7C51B62E95FD9C8CD73A45B4446C71" ],
[ "xul.pdb", "09F9D7ECF31F60E34C4C44205044422E1" ],
...

That's neat, but hard to work with.

What you really want is a human-readable stack of function names and files and line numbers. Then you can go look at the code in question and start your debugging adventure.

When the program is compiled, the act of compiling produces a bunch of compiler debugging information. We use dump_syms to extract the symbol information and put it into the Breakpad symbols file format. Those files get uploaded to Mozilla Symbols Server where they join all the symbols files for all the builds for the last 2 years.

Symbolication takes the array of [module, module_offset] pairs, the list of modules in memory, and the Breakpad symbols files for those modules and looks up the symbols for the [module, module_offset] pairs producing symbolicated frames.

Then you get something nicer like this:

0  xul.pdb  mozilla::ConsoleReportCollector::FlushReportsToConsole(unsigned long long, nsIConsoleReportCollector::ReportAction)
1  xul.pdb  mozilla::net::HttpBaseChannel::MaybeFlushConsoleReports()",
2  xul.pdb  mozilla::net::HttpChannelChild::OnStopRequest(nsresult const&, mozilla::net::ResourceTimingStructArgs const&, mozilla::net::nsHttpHeaderArray const&, nsTArray<mozilla::net::ConsoleReportCollected> const&)
3  xul.pdb  std::_Func_impl_no_alloc<`lambda at /builds/worker/checkouts/gecko/netwerk/protocol/http/HttpChannelChild.cpp:1001:11',void>::_Do_call()
...

Yay! Much rejoicing! Something we can do something with!

I wrote about this a bit in Crash pings and crash reports.

Tecken has a symbolication API, so you can send in a well-crafted HTTP POST and it'll symbolicate the stack for you and return it.

Quickstart with Symbolic in Python

Symbolic is a Rust crate with a Python library wrapper. The Sentry folks do a great job of generating wheels and uploading those to PyPI, so installing Symbolic is as easy as:

pip install symbolic

The Symbolic docs are terse. I found the following documentation:

That helped, but I had questions those didn't answer. I have an intrepid freshman understanding of Rust, so I ended up reading the code, tests, and examples.

The one big thing that tripped me up was that Symbolic can't parse Breakpad symbols files from a byte stream--they need to be files on disk. Tecken doesn't store Breakpad symbols files on disk--they're in AWS S3 buckets. So it downloads them and parses the byte stream. In order to use Symbolic, we'll have to adjust that to save the file to disk, then parse it, then delete the file afterwards. 1

1

If that's not true, please let me know.

Anyhow, here's some sample annotated code using Symbolic to do symbol lookups:

import symbolic

# This is a Breakpad symbols file I have on disk.
archive = symbolic.Archive.open("XUL/75A79CFA0E783A35810F8ADF2931659A0/XUL.sym")

# We do debug ids as all-uppercase with no hyphens. However, symbolic
# requires that get normalized into the form it likes.
debug_id = symbolic.normalize_debug_id("75A79CFA0E783A35810F8ADF2931659A0")

# This parses the Breakpad symbols file and returns a symcache that we can
# look up addresses in.
obj = archive.get_object(debug_id=ndebug_id)
symcache = obj.make_symcache()

# Symbol lookup returns a list of LineInfo objects.
lineinfos = symcache.lookup(0xf5aa0)

print("line: %s symbol: %s" % (lineinfos[0].line, lineinfos[0].symbol))

Cool!

Symbolic parses Breakpad symbols files. It uses a cache format for fast symbol lookups. Loading the cache file is very fast.

Further, Symbolic parses files of a variety of other debug binary formats. This could be handy for skipping the intermediary Breakpad symbol file and using the debug binaries directly. More on that idea later.

Tecken is maintained by a team of two and we have other projects, so it spends a lot of time sitting in the corner feeling sad. Meanwhile, Symbolic is actively worked on by Sentry and a cadre of other contributors including Mozilla engineers because it's one of the cornerstone crates for the great Rust rewrite of Breakpad things. That's a big win for me.

So then I built a prototype

Today, I threw together a web app that does symbolication using Symbolic and called it Sherwin Syms.

https://github.com/willkg/sherwin-syms/

Building a separate prototype gives me something to tinker with that's not in production. I was able to add line number information pretty quickly. I can experiment with caching on disk. I can compare the symbolication API output for stacks between the prototype and what the Mozilla Symbols Server produces.

There's a lot of scaffolding in there. The Symbolic-using bits are in this file:

https://github.com/willkg/sherwin-syms/blob/master/src/sherwin_syms/symbols.py

Next steps

I need to integrate this into Tecken. I think that means writing a new v6 API view because the v4 and v5 code is tangled up with downloading and caching.

Markus and Gabriele suggested Tecken skip the Breakpad symbols files and use the debug binaries directly--Symbolic can handle those, too. The compelling reason for this is that Breakpad symbols files lose all the information for symbolicating inline functions correctly. I hope to look into that soon.

Summary

That summarizes the week I spent with Symbolic.

Switching from pyup to dependabot ๐Ÿ”—

Switching from pyup to dependabot

I maintain a bunch of Python-based projects including some major projects like Crash Stats, Mozilla Symbols Server, and Mozilla Location Services. In order to keep up with dependency updates, we used pyup to monitor dependencies in those projects and create GitHub pull requests for updates.

pyup was pretty nice. It would create a single pull request with many dependency updates in it. I could then review the details, wait for CI to test everything, make adjustments as necessary, and then land the pull request and go do other things.

Starting in October of 2019, pyup stopped doing monthly updates. A co-worker of mine tried to contact them to no avail. I don't know what happened. I got tired of waiting for it to start working again.

Since my projects are all on GitHub, we had already switched to GitHub security alerts. Given that, I decided it was time to switch from pyup to dependabot (also owned by GitHub).

Switching from pyup to dependabot

I had to do a bunch of projects, so I ended up with a process along these lines:

  1. Remove projects from pyup.

    All my projects are either in mozilla or mozilla-services organizations on GitHub.

    We had a separate service account configure pyup, so I'm not able to make changes to pyup myself.

    I had to ask Greg to remove my projects from pyup.

    I wouldn't suggest proceeding until your project has been removed from pyup. Otherwise, it's possible you'll get PRs from pyup and dependabot for the same updates.

  2. Add dependabot configuration to repo.

    Then I added the required dependabot configuration to my repository and removed the pyup configuration.

    I used these resources:

    I created a pull request with these changes, reviewed it, and landed it.

  3. Enable dependabot.

    For some reason, I couldn't enable dependabot for my projects. I had to ask Greg who I think asked Hal to enable dependabot for my projects.

    Once this was done, then dependabot created a plethora of pull requests.

While there are Mozilla-specific bits in here, it's probably generally helpful.

Dealing with incoming pull requests

dependabot isn't as nice as pyup was. It can only update one dependency per PR. That stinks for a bunch of reasons:

  1. working through 30 PRs is extremely time consuming

  2. every time you finish up work on one PR, it triggers dependabot to update the others and that triggers email notifications, CI builds, and a bunch of spam and resource usage

  3. dependencies often depend on each other and need to get updated as a group

Since we hadn't been keeping up with Python dependencies, we ended up with between 20 and 60 pull requests to deal with per repository.

For Antenna, I rebased each PR, reviewed it, and merged it by hand. That took a day to do. It sucked. I can't imagine doing this four times every month.

While working on PRs for Socorro, I hit a case where I needed to update multiple dependencies at the same time. I decided to write a tool that combined pull requests.

Thus was born paul-mclendahand. Using this tool, I can combine pull requests. Using paul-mclendahand, I worked through 20 pull requests for Tecken in about an hour. This saves me tons of time!

My process goes like this:

  1. create a new branch on my laptop based off of the main branch

  2. list all open pull requests by running pmac listprs

  3. make a list of pull requests to combine into it

  4. for each pull request, I:

    1. run pmac add PR

    2. resolve any cherry-pick conflicts

    3. (optional) rebuild my project and run tests

  5. push the new branch to GitHub

  6. create a pull request

  7. run pmac prmsg and copy-and-paste the output as the pull request description

I can then review the pull request. It has links to the other pull requests and the data that dependabot puts together for each update. I can rebase, add additional commits, etc.

When I'm done, I merge it and that's it!

paul-mclendahand v1.0.0

I released paul-mclendahand 1.0.0!

Install it with pipx:

pipx install paul-mclendahand

Install it with pip:

pip install paul-mclendahand

It doesn't just combine pull requests from dependabot--it's general and can work on any pull requests.

If you find any issues, please report them in the issue tracker.

I hope this helps you!

How to pick up a project with an audit ๐Ÿ”—

Over the last year, I was handed a bunch of projects in various states. One of the first things I do when getting a new project that I'm suddenly responsible for is to audit the project. That helps me figure out what I'm looking at and what I need to do with it next.

This blog post covers my process for auditing projects I'm suddenly the proud owner of.

Read moreโ€ฆ

Socorro Engineering: Year in Review 2019 ๐Ÿ”—

Summary

Last year at about this time, I wrote a year in review blog post. Since I only worked on Socorro at the time, it was all about Socorro. In 2019, that changed, so this blog post covers the efforts of two people across a bunch of projects.

2019 was pretty crazy. We accomplished a lot, but picking up a bunch of new projects really threw a wrench in the wheel of ongoing work.

This year in review covers highlights, some numbers, and some things I took away.

Here's the list of projects we worked on over the year:

Read moreโ€ฆ

Markus v2.0.0 released! Better metrics API for Python projects. ๐Ÿ”—

What is it?

Markus is a Python library for generating metrics.

Markus makes it easier to generate metrics in your program by:

  • providing multiple backends (Datadog statsd, statsd, logging, logging roll-up, and so on) for sending metrics data to different places
  • sending metrics to multiple backends at the same time
  • providing a testing framework for easy metrics generation testing
  • providing a decoupled architecture making it easier to write code to generate metrics without having to worry about making sure creating and configuring a metrics client has been done--similar to the Python logging module in this way

We use it at Mozilla on many projects.

v2.0.0 released!

I released v2.0.0 just now. Changes:

Features

  • Use time.perf_counter() if available. Thank you, Mike! (#34)
  • Support Python 3.7 officially.
  • Add filters for adjusting and dropping metrics getting emitted. See documentation for more details. (#40)

Backwards incompatible changes

  • tags now defaults to [] instead of None which may affect some expected test output.

  • Adjust internals to run .emit() on backends. If you wrote your own backend, you may need to adjust it.

  • Drop support for Python 3.4. (#39)

  • Drop support for Python 2.7.

    If you're still using Python 2.7, you'll need to pin to <2.0.0. (#42)

Bug fixes

  • Document feature support in backends. (#47)
  • Fix MetricsMock.has_record() example. Thank you, John!

Where to go for more

Changes for this release: https://markus.readthedocs.io/en/latest/history.html#september-19th-2019

Documentation and quickstart here: https://markus.readthedocs.io/en/latest/index.html

Source code and issue tracker here: https://github.com/willkg/markus

Let me know whether this helps you!

Socorro Engineering: July 2019 happenings and putting it on hold ๐Ÿ”—

Summary

Socorro Engineering team covers several projects:

This blog post summarizes our activities in July.

Highlights of July

  • Socorro: Added modules_in_stack field to super search allowing people to search the set of module/debugid for functions that are in teh stack of the crashing thread.

    This lets us reprocess crash reports that have modules for which symbols were just uploaded.

  • Socorro: Added PHC related fields, dom_fission_enabled, and bug_1541161 to super search.

  • Socorro: Fixed some things further streamlining the local dev environment.

  • Socorro: Reformatted Python code with Black.

  • Socorro: Extracted supersearch and fetch-data commands as a separate Python library: https://github.com/willkg/crashstats-tools

  • Tecken: Upgraded to Python 3.7 and adjusted storage bucket code to work better for multiple storage providers.

  • Tecken: Added GCS emulator for local development environment.

  • PollBot: Updated to use Buildhub2.

Hiatus and project changes

In April, we picked up Tecken, Buildhub, Buildhub2, and PollBot in addition to working on Socorro. Since then, we've:

  • audited Tecken, Buildhub, Buildhub2, and PollBot
  • updated all projects, updated dependencies, and performed other necessary maintenance
  • documented deploy procedures and basic runbooks
  • deprecated Buildhub in favor of Buildhub2 and updated projects to use Buildhub2

Buildhub is decomissioned now and is being dismantled.

We're passing Buildhub2 and PollBot off to another team. They'll take ownership of those projects going forward.

Socorro and Tecken are switching to maintenance mode as of last week. All Socorro/Tecken related projects are on hold. We'll continue to maintain the two sites doing "keep the lights on" type things:

  • granting access to memory dumps
  • adding new products
  • adding fields to super search
  • making changes to signature generation and updating siggen library
  • responding to outages
  • fixing security issues

All other non-urgent work will be pushed off.

As of August 1st, we've switched to Mozilla Location Services. We'll be auditing that project, getting it back into a healthy state, and bringing it in line with current standards and practices.

Given that, this is the last Socorro Engineering status post for a while.

Read moreโ€ฆ

crashstats-tools v1.0.1 released! cli for Crash Stats. ๐Ÿ”—

What is it?

crashstats-tools is a set of command-line tools for working with Crash Stats (https://crash-stats.mozilla.org/).

crashstats-tools comes with two commands:

  • supersearch: for performing Crash Stats Super Search queries
  • fetch-data: for fetching raw crash, dumps, and processed crash data for specified crash ids

v1.0.1 released!

I extracted two commands we have in the Socorro local dev environment as a separate Python project. This allows anyone to use those two commands without having to set up a Socorro local dev environment.

The audience for this is pretty limited, but I think it'll help significantly for testing analysis tools.

Say I'm working on an analysis tool that looks at crash report minidump files and does some additional analysis on it. I could use supersearch command to get me a list of crash ids to download data for and the fetch-data command to download the requisite data.

$ export CRASHSTATS_API_TOKEN=foo
$ mkdir crashdata
$ supersearch --product=Firefox --num=10 | \
    fetch-data --raw --dumps --no-processed crashdata

Then I can run my tools on the dumps in crashdata/upload_file_minidump/.

Be thoughtful about using data

Make sure to use these tools in compliance with our data policy:

https://crash-stats.mozilla.org/documentation/memory_dump_access/

Where to go for more

See the project on GitHub which includes a README which contains everything about the project including examples of usage, the issue tracker, and the source code:

https://github.com/willkg/crashstats-tools

Let me know whether this helps you!

Crash pings (Telemetry) and crash reports (Socorro/Crash Stats) ๐Ÿ”—

I keep getting asked questions that stem from confusion about crash pings and crash reports, the details of where they come from, differences between the two data sets, what each is currently good for, and possible future directions for work on both. I figured I'd write it all down.

This is a brain dump and sort of a blog post and possibly not a good version of either. I desperately wished it was more formal and mind-blowing like something written by Chutten or Alessio.

It's likely that this is 90% true today but as time goes on, things will change and it may be horribly wrong depending on how far in the future you're reading this. As I find out things are wrong, I'll keep notes. Any errors are my own.

Summary

We (Mozilla) have two different data sets for crashes: crash pings in Telemetry and crash reports in Socorro/Crash Stats. When Firefox crashes, the crash reporter collects information about the crash and this results in crash ping and crash report data. From there, the two different data things travel two different paths and end up in two different systems.

This blog post covers these two different journeys, their destinations, and the resulting properties of both data sets.

This blog post specifically talks about Firefox and not other products which have different crash reporting stories.

Updates

Updates and changes to this blog post since I wrote it:

  • 2019-07-04: Crash ping data is not publicly available. Blog post updated accordingly. Thank you, Chutten!
  • 2019-11-07: Added note that you can download symbols files using any http client and that the file you get back is gzipped. Thank you, Steven!

CRASH!

Firefox crashes. It happens.

The crash reporter kicks in. It uses the Breakpad library to collect data about the crashed process, package it up into a minidump. The minidump has information about the registers, what's in memory, the stack of the crashing thread, stacks of other threads, what modules are in memory, and so on.

Additionally, the crash reporter collects a set of annotations for the crash. Annotations like ProductName, Version, ReleaseChannel, BuildID and others help us group crashes for the same product and build.

The crash reporter assembles the portions of the crash report that don't have personally identifiable information (PII) in them into a crash ping. It uses minidump-analyzer to unwind the stack. The crash ping with this stack is sent via the crash ping sender to Telemetry.

If Telemetry is disabled, then the crash ping will not get sent to Telemetry.

The crash reporter will show a crash report dialog informing the user that Firefox crashed. The crash report dialog includes a comments field for additional data about the crash and an email field. The user can choose to send the crash report or not.

If the user chooses to send the crash report, then the crash report is sent via HTTP POST to the collector for the crash ingestion pipeline. The entire crash ingestion pipeline is called Socorro. The website part is called Crash Stats.

If the user chooses not to send the crash report, then the crash report never goes to Socorro.

If Firefox is unable to send the crash report, it keeps it on disk. It might ask the user to try to send it again later. The user can access about:crashes and send it explicitly.

Relevant backstory

What is symbolication?

Before we get too much further, let's talk about symbolication.

minidump-whatever will walk the stack starting with the top-most frame. It uses frame information to find the caller frame and works backwards to produce a list of frames. It also includes a list of modules that are in memory.

For example, part of the crash ping might look like this:

  "modules": [
    ...
    {
      "debug_file": "xul.pdb",
      "base_addr": "0x7fecca50000",
      "version": "69.0.0.7091",
      "debug_id": "4E1555BE725E9E5C4C4C44205044422E1",
      "filename": "xul.dll",
      "end_addr": "0x7fed32a9000",
      "code_id": "5CF2591C6859000"
    },
    ...
  ],
  "threads": [
    {
      "frames": [
        {
          "trust": "context",
          "module_index": 8,
          "ip": "0x7feccfc3337"
        },
        {
          "trust": "cfi",
          "module_index": 8,
          "ip": "0x7feccfb0c8f"
        },
        {
          "trust": "cfi",
          "module_index": 8,
          "ip": "0x7feccfae0af"
        },
        {
          "trust": "cfi",
          "module_index": 8,
          "ip": "0x7feccfae1be"
        },
...

The "ip" is an instruction pointer.

The "module_index" refers to another list of modules that were all in memory at the time.

The "trust" refers to how the stack unwinding figured out how to unwind that frame. Sometimes it doesn't have enough information and it does an educated guess.

Symbolication takes the module name, the module debug id, and the offset and looks it up with the symbols it knows about. So for the first frame, it'd do this:

  1. module index 8 is xul.dll
  2. get the symbols for xul.pdb debug id 4E1555BE725E9E5C4C4C44205044422E1 which is at https://symbols.mozilla.org/xul.pdb/4E1555BE725E9E5C4C4C44205044422E1/xul.sym
  3. figure out that 0x7feccfc3337 (ip) - 0x7fecca50000 (base addr for xul.pdb module) is 0x573337
  4. look up 0x573337 in the SYM file and I think that's nsTimerImpl::InitCommon(mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const &,unsigned int,nsTimerImpl::Callback &&)

Symbolication does that for every frame and then we end up with a helpful symbolicated stack.

Note

You can use wget, curl, or any http client to download symbols using urls like in step 2. One thing to know is that the file you get back is gzipped, so you'll need to un-gzip it to read it.

Tecken has a symbolication API which takes the module and stack information in a minimal form and symbolicates using symbols it manages.

It takes a form like this:

{
  "memoryMap": [
    [
      "xul.pdb",
      "44E4EC8C2F41492B9369D6B9A059577C2"
    ],
    [
      "wntdll.pdb",
      "D74F79EB1F8D4A45ABCD2F476CCABACC2"
    ]
  ],
  "stacks": [
    [
      [0, 11723767],
      [1, 65802]
    ]
  ]
}

This has two data structures. The first is a list of (module name, module debug id) tuples. The second is a list of (module id, memory offset) tuples.

What is Socorro-style signature generation?

Socorro has a signature generator that goes through the stack, normalizes the frames so that frames look the same across platforms, and then uses that to generate a "signature" for the crash that suggests a common cause for all the crash reports with that signature.

It's a fragile and finicky process. It works well for some things and poorly for others. There are other ways to generate signatures. This is the one that Socorro currently uses. We're constantly honing it.

I export Socorro's signature generation system as a Python library called siggen.

For examples of stacks -> signatures, look at crash reports on Crash Stats.

What is it and where does it go?

Crash pings in Telemetry

Crash pings are only sent if Telemetry is enabled in Firefox.

The crash ping contains the stack for the crash, but little else about the crashed process. No register data, no memory data, nothing found on the heap.

The stack is unwound by minidump-analyzer in the client on the user's machine. Because of that, driver information can be used by unwinding so for some kinds of crashes, we may get a better stack than crash reports in Socorro.

Stacks in crash pings are not symbolicated.

There's an error aggregates data set generated from the crash pings which is used by Mission Control.

Crash reports in Socorro

Socorro does not get crash reports if the user chooses not to send a crash report.

Socorro collector discards crash reports for unsupported products.

Socorro collector throttles incoming crash reports for Firefox release channel--it accepts 10% of those for processing and rejects the other 90%.

The Socorro processor runs minidump-stackwalk on the minidump which unwinds the stack. Then it symbolicates the stack using symbols uploaded during the build process to symbols.mozilla.org.

If we don't have symbols for modules, minidump-stackwalk will guess at the unwinding. This can work poorly for crashes that involve drivers and system libraries we don't have symbols for.

Crash pings vs. Crash reports

Because of the above, there are big differences in collection of crash data between the two systems and what you can do with it.

Representative of the real world

Because crash ping data doesn't require explicit consent by users on a crash-by-crash basis and crash pings are sent using the Telemetry infrastructure which is pretty resilient to network issues and other problems, crash ping data in Telemetry is likely more representative of crashes happening for our users.

Crash report data in Socorro is limited to what users explicitly send us. Further, there are cases where Firefox isn't able to run the crash reporter dialog to ask the user.

For example, on Friday, June 28th, 2019 for Firefox release channel:

  • Telemetry got 1,706,041 crash pings
  • Socorro processed 42,939 crash reports, so figure it got around 420,000 crash reports

Stack quality

A crash report can have a different stack in the crash ping than in the crash report.

Crash ping data in Telemetry is unwound in the client. On Windows, minidump-analyzer can access CFI unwinding data, so the stacks can be better especially in cases where the stack contains system libraries and drivers.

We haven't implemented this yet on non-Windows platforms.

Crash report data in Socorro is unwound by the Socorro processor and is heavily dependent on what symbols we have available. It doesn't do a good job with unwinding through drivers and we often don't have symbols for Linux system libraries.

Gabriele says sometimes stacks are unwound better for crashes on MacOS and Linux platforms than what the crash ping contains.

Symbolication and signatures

Crash ping data is not symbolicated and we're not generating Socorro-style signatures, so it's hard to bucket them and see change in crash rates for specific crashes.

There's an fx-crash-sig Python library which has code to symbolicate crash ping stacks and generate a Socorro-style signature from that stack. This is helpful for one-off analysis but this is not a long-term solution.

Crash report data in Socorro is symbolicated and has Socorro-style signatures.

The consequence of this is that in Telemetry, we can look at crash rates for builds, but can't look at crash rates for specific kinds of crashes as bucketed by signatures.

The Signature Report and Top Crashers Report in Crash Stats can't be implemented in Telemetry (yet).

Tooling

Telemetry has better tooling for analyzing crash ping data.

Crash ping data drives Mission Control.

Socorro's tooling is limited to Supersearch web ui and API which is ok at some things and not great at others. I've heard some people really like the Supersearch web ui.

There are parts of the crash report which are not searchable. For example, it's not possible to search for crash reports where a certain module is in the stack. Socorro has a signature report and a topcrashers page which help, but they're not flexible for answering questions outside of what we've explicitly coded them for.

Socorro sends a copy of processed crash reports to Telemetry and this is in the "socorro_crash" dataset.

PII and data access

Telemetry crash ping data does not contain PII. It is not publicly available, except in aggregate via Mission Control.

Socorro crash report data contains PII. A subset of the crash data is available to view and search by anyone. The PII data is restricted to users explicitly granted access to it. PII data includes user email addresses, user-provided comments, CPU register data, what else was in memory, and other things.

Data expiration

Telemetry crash ping data isn't expired, but I think that's changing at some point.

Socorro crash report data is kept for 6 months.

Data latency

Socorro data is near real-time. Crash reports get collected and processed and are available in searches and reports within a few minutes.

Crash ping data gets to Telemetry almost immediately.

Main ping data has some latency between when it's generated and when it is collected. This affects normalization numbers if you were looking at crash rates from crash ping data.

Derived data sets may have some latency depending on how they're generated.

Conclusions and future plans

Socorro

Socorro is still good for deep dives into specific crash reports since it contains the full minidump and sometimes a user email address and user comments.

Socorro has Socorro-style signatures which make it possible to aggregate crash reports into signature buckets. Signatures are kind of fickle and we adjust how they're generated over time as compilers, symbols extraction, and other things change. We can build Signature Reports and Top Crasher reports and those are ok, but they use total counts and not rates.

I want to tackle switching from Socorro's minidump-stackwalk to minidump-analyzer so we're using the same stack walker in both places. I don't know when that will happen.

Socorro is going to GCP which means there will be different tools available for data analysis. Further, we may switch to BigQuery or some other data store that lets us search the stack. That'd be a big win.

Telemetry

Telemetry crash ping data is more representative of the real world, but stacks are symbolicated and there's no signature generation, so you can't look at aggregates by cause.

Symbolication and signature generation of crash pings will get figured out at some point.

Work continues on Mission Control 2.0.

Telemetry is going to GCP which means there will be different tools available for data analysis.

Together

At the All Hands, I had a few conversations about fixing tooling for both crash reports and crash pings so the resulting data sets were more similar and you could move from one to the other. For example, if you notice a troubling trend in the crash ping data, can you then go to Crash Stats and find crash reports to deep dive into?

I also had conversations around use cases. Which data set is better for answering certain questions?

We think having a guide that covers which data set is good for what kinds of analysis, tools to use, and how to move between the data sets would be really helpful.

Thanks!

Many thanks to everyone who helped with this: Will Lachance, W Chris Beard, Gabriele Svelto, Nathan Froyd, and Chutten.

Also, many thanks to Chutten and Alessio who write fantastic blog posts about Telemetry things. Those are gold.

Socorro Engineering: June 2019 happenings ๐Ÿ”—

Summary

Socorro Engineering team covers several projects:

This blog post summarizes our activities in June.

Highlights of June

  • Socorro: Fixed the collector's support of a single JSON-encoded field in the HTTP POST payload for crash reports. This is a big deal because we'll get less junk data in crash reports going forward.
  • Socorro: Reworked how Crash Stats manages featured versions: if the product defines a product_details/PRODUCTNAME.json file, it'll pull from that. Otherwise it calculates featured versions based on the crash reports it's received.
  • Buildhub: deprecated Buildhub in favor of Buildhub2. Current plan is to decommission Buildhub in July.
  • Across projects: Updated tons of dependencies that had security vulnerabilities. It was like a hamster wheel of updates, PRs, and deploys.
  • Tecken: Worked on GCS emulator for local dev environment.
  • All hands discussions:
    • GCP migration plan for Tecken and figure out what needs to be done.
    • Possible GCP migration schedule for Tecken and Socorro.
    • Migrating applications using Buildhub to Buildhub2 and decommissioning Buildhub in July.
    • What would happen if we switched from Elasticsearch to BigQuery?
    • Switching from Socorro's minidump-stackwalk to minidump-analyzer.
    • Re-implementing the Socorro Top Crashers and Signature reports using Telemetry tools and data.
    • Writing a symbolicator and Socorro-style signature generator in Rust that can be used for crash reports in Socorro and crash pings in Telemetry.
    • The crash ping vs. crash report situation (blog post coming soon).

Read moreโ€ฆ

Socorro: May 2019 happenings ๐Ÿ”—

Summary

Socorro is the crash ingestion pipeline for Mozilla's products like Firefox. When Firefox crashes, the crash reporter collects data about the crash, generates a crash report, and submits that report to Socorro. Socorro saves the crash report, processes it, and provides an interface for aggregating, searching, and looking at crash reports.

This blog post summarizes Socorro activities in May.

Read moreโ€ฆ