Eliot: retrospective (2021)

Project

time: 1 year

impact:
  • reduces the risk of a Mozilla Symbols Server outage that would block symbol uploads from the build system

  • improves maintainability of the symbolication service by offloading parsing of Breakpad symbols files and symbol lookups to an external library that's used by other tools in the crash reporting ecosystem at Mozilla

  • opens up possible futures around supporting inline functions and using other debug file types

Problem statement

Tecken is the project for the Mozilla Symbols Service. This service manages several things:

  • symbol upload API for uploading and storing debugging symbols generated by build systems for products like Firefox, Fenix, etc

  • download API for downloading symbols, which could be located in a variety of different places, in support of tools like Visual Studio, stackwalkers, profilers, symbolicators, etc

  • symbolication API for finding symbols for memory addresses

It also has a webapp for querying symbols, debugging symbols problems, managing API tokens, and granting permissions for uploading symbols.

All of those functions are currently handled by a single webapp service.

There are a few problems here.

First, we want to reduce the risk of an outage for uploading symbols. When the service has an outage, the build systems can't upload symbols. They try really hard to upload symbols, retrying when uploads fail, so an outage increases the build times for Firefox and other products. On top of that, if the build system doesn't successfully upload symbols, any crashes in tests or channels result in unsymbolicated stacks, which obscures the details of the crash.

There are several projects waiting in the wings to dramatically increase their use of the symbolication API, which increases the likelihood of a service outage that affects symbol uploads.

Second, the existing symbolication API implementation is an independent implementation of sym file parsing, lookups, and symbolication. Whenever we make adjustments to how sym files are built or structured, or to the lookup algorithms, we have to additionally update the symbolication API code.

Mozilla is in the process of rewriting crash reporting related code in Rust. It behooves us greatly to switch from our independent implementation to a shared library.

Third, the symbolication API is missing some critical features like support for line numbers and inline functions. The existing code can't be extended to support either line numbers or inline functions--we need to rewrite it.

In September of 2020, I embarked on a project to break out the symbolication API as a separate microservice and implement it using the Symbolic library. That had the following effects:

  1. eases the risk of outage due to increasing usage of the symbolication API,

  2. adds support for line numbers and sets us up for adding support for inline functions, and

  3. reduces the maintenance work because we'll be using a library used by other parts of the crash reporting ecosystem

This post covers that project.

What is symbolication?

The symbolication API takes a list of memory addresses and a list of modules, retrieves the symbol files for the modules from the symbols store, and then looks up the memory addresses in those symbols files.
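To make the lookup step concrete, here is a minimal sketch of resolving a module offset to a function with a sorted table and a binary search. It is an illustration only, not how Tecken or Symbolic do it: the names `FunctionRecord`, `lookup`, and `FUNCTIONS` are hypothetical, the function sizes are made up, and a real sym file also carries file and line records.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FunctionRecord:
    start: int  # module offset where the function begins
    size: int   # number of bytes the function covers (made-up values below)
    name: str


def lookup(functions: List[FunctionRecord], offset: int) -> Optional[Tuple[str, int]]:
    """Return (function name, offset into the function), or None if not found.

    `functions` must be sorted by `start`.
    """
    starts = [record.start for record in functions]
    # Find the last function that starts at or before the offset.
    index = bisect.bisect_right(starts, offset) - 1
    if index < 0:
        return None
    record = functions[index]
    # The offset may fall in a gap after the end of that function.
    if offset >= record.start + record.size:
        return None
    return record.name, offset - record.start


# Hypothetical table for one module, sorted by start address.
FUNCTIONS = [
    FunctionRecord(0x62CC7C0, 0x100, "WorkerScriptLoader::DispatchMaybeMoveToLoadedList"),
    FunctionRecord(0x62E1E30, 0x500, "NetworkLoadHandler::OnStreamComplete"),
]

name, function_offset = lookup(FUNCTIONS, 103598100)
print(name, hex(function_offset))
```

The address 103598100 is 0x62cc814 in hex, so with this table the lookup lands 0x54 bytes into the first function.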

For example, a symbolication API request payload might have something like this:

{
  "jobs": [
    {
      "stacks": [
        [
          [0, 103598100],
          [0, 103686693]
        ]
      ],
      "memoryMap": [
        ["libxul.so", "F53783197E19F4D010E2FEE918021D060"],
        ["firefox-bin", "C8D098E2FE2082391BA6F602315F6C6B0"]
      ]
    }
  ]
}

which would return something like this:

{
  "results": [
    {
      "stacks": [
        [
          {
            "frame": 0,
            "module": "libxul.so",
            "module_offset": "0x62cc814",
            "function": "mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList(JS::loader::ScriptLoadRequest*)",
            "function_offset": "0x54",
            "file": "hg:hg.mozilla.org/mozilla-central:dom/workers/ScriptLoader.cpp:8a97830c36c3a6aedd56c1519773df80ddd68f9c",
            "line": 977
          },
          {
            "frame": 1,
            "module": "libxul.so",
            "module_offset": "0x62e2225",
            "function": "mozilla::dom::workerinternals::loader::NetworkLoadHandler::OnStreamComplete(nsIStreamLoader*, nsISupports*, nsresult, unsigned int, unsigned char const*)",
            "function_offset": "0x3f5",
            "file": "hg:hg.mozilla.org/mozilla-central:dom/workers/loader/NetworkLoadHandler.cpp:8a97830c36c3a6aedd56c1519773df80ddd68f9c",
            "line": 61
          }
        ]
      ],
      "found_modules": {
        "libxul.so/F53783197E19F4D010E2FEE918021D060": true,
        "firefox-bin/C8D098E2FE2082391BA6F602315F6C6B0": true
      }
    }
  ]
}
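A request payload like the one shown earlier could be assembled with the Python standard library. This is a sketch rather than a client from the Tecken project: the helper name `build_symbolication_request` is hypothetical, and `API_URL` is a placeholder to swap for the real endpoint. It builds the request object without sending anything over the network.

```python
import json
import urllib.request

# Placeholder endpoint: an assumption for illustration, not taken from the post.
API_URL = "https://symbolication.example.com/symbolicate/v5"


def build_symbolication_request(stacks, memory_map, url=API_URL):
    """Build a POST request containing a single symbolication job.

    stacks: list of stacks; each stack is a list of [module_index, module_offset]
    memory_map: list of [debug_filename, debug_id]; module_index points into it
    """
    payload = {"jobs": [{"stacks": stacks, "memoryMap": memory_map}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_symbolication_request(
    stacks=[[[0, 103598100], [0, 103686693]]],
    memory_map=[
        ["libxul.so", "F53783197E19F4D010E2FEE918021D060"],
        ["firefox-bin", "C8D098E2FE2082391BA6F602315F6C6B0"],
    ],
)
# To send it: urllib.request.urlopen(request), then json.load() the response.
```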

With symbolication, instead of something like this:

0  libxul.so/F53783197E19F4D010E2FEE918021D060  0x62cc814
1  libxul.so/F53783197E19F4D010E2FEE918021D060  0x62e2225

We get to see something like this:

0  mozilla::dom::workerinternals::loader::WorkerScriptLoader::DispatchMaybeMoveToLoadedList(JS::loader::ScriptLoadRequest*)
   file: hg:hg.mozilla.org/mozilla-central:dom/workers/ScriptLoader.cpp:8a97830c36c3a6aedd56c1519773df80ddd68f9c
   line: 977
1  mozilla::dom::workerinternals::loader::NetworkLoadHandler::OnStreamComplete(nsIStreamLoader*, nsISupports*, nsresult, unsigned int, unsigned char const*)
   file: hg:hg.mozilla.org/mozilla-central:dom/workers/loader/NetworkLoadHandler.cpp:8a97830c36c3a6aedd56c1519773df80ddd68f9c
   line: 61

That's a lot more informative and actionable.
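Turning the response into that readable form takes only a few lines. The sketch below assumes each frame is a dict shaped like the ones in the response above, and that a frame which could not be symbolicated simply lacks the "function" key. The helper name `format_stack` is hypothetical, and the sample values are shortened for readability.

```python
def format_stack(frames):
    """Render symbolicated frames as text.

    Symbolicated frames show the function, file, and line; other frames fall
    back to the module name and module offset.
    """
    lines = []
    for frame in frames:
        number = frame["frame"]
        if "function" in frame:
            lines.append(f"{number}  {frame['function']}")
            if "file" in frame:
                lines.append(f"   file: {frame['file']}")
            if "line" in frame:
                lines.append(f"   line: {frame['line']}")
        else:
            lines.append(f"{number}  {frame['module']}  {frame['module_offset']}")
    return "\n".join(lines)


# Shortened sample frames; the second one has no symbol information.
frames = [
    {
        "frame": 0,
        "module": "libxul.so",
        "module_offset": "0x62cc814",
        "function": "WorkerScriptLoader::DispatchMaybeMoveToLoadedList(JS::loader::ScriptLoadRequest*)",
        "function_offset": "0x54",
        "file": "dom/workers/ScriptLoader.cpp",
        "line": 977,
    },
    {"frame": 1, "module": "libxul.so", "module_offset": "0x62e2225"},
]
print(format_stack(frames))
```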

Requirements

  1. Implement symbolication API as a new microservice.

  2. It will use libraries Mozilla is standardizing on for parsing sym files and doing lookups.

  3. Implement existing Symbolication v4 and v5 APIs.

  4. Support Dockerflow and other mozilla-services standards.

  5. It should point to symbols.mozilla.org for downloading symbols. Then we don't have to worry about AWS and GCP implementations in this service.

  6. Support CORS headers so that browser applications like the Firefox profiler can use the API for symbolication.

  7. It will use existing infrastructure for service support.
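As an illustration of the CORS requirement, here is a minimal sketch of a WSGI wrapper that adds an `Access-Control-Allow-Origin` header to every response. It is not the service's actual code: the names `add_cors` and `hello_app` are hypothetical, allowing every origin with `*` is an assumption, and a real deployment would also need to answer preflight `OPTIONS` requests.

```python
def add_cors(app, allow_origin="*"):
    """Wrap a WSGI app so every response carries a CORS allow-origin header."""

    def wrapped(environ, start_response):
        def cors_start_response(status, headers, exc_info=None):
            # Append the CORS header to whatever the wrapped app set.
            headers = list(headers) + [("Access-Control-Allow-Origin", allow_origin)]
            return start_response(status, headers, exc_info)

        return app(environ, cors_start_response)

    return wrapped


def hello_app(environ, start_response):
    """Tiny stand-in app that returns an empty JSON object."""
    start_response("200 OK", [("Content-Type", "application/json")])
    return [b"{}"]


cors_app = add_cors(hello_app)
```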

Implementation decisions

It will re-use existing Tecken project infrastructure.

We will reuse existing metrics, logging, deploy pipelines, and repository management. If we need to, we can split it out later.

This will reduce the SRE work significantly and reduce the time it takes to put into production.

Use Symbolic Python library.

We will use the Symbolic Python library which is a wrapper around the Symbolic Rust crate and maintained by Sentry.

This is what we're standardizing on at Mozilla for all code that works with sym files.

It will have an on-disk LRU cache for symcache files.

Parsing sym files is very expensive. For example, libxul.so/1410FAF03AD925211450AE25E0CB9AE50 is 569 MB and takes 8.2 seconds to parse.

Symbolic can parse sym files and then export the internal symcache structure as a binary blob. We can save this to disk and use it for future fetches. Parsing a symcache file from disk takes milliseconds.

Because symbolication tends to use recent sym files (crash ingestion tends to receive crash reports from recent versions of products), we can cache them with an LRU to improve the ratio of cache hits to misses.

Because Symbolic's symcache format is not stable and not guaranteed to work across versions, we're going to go with an on-disk LRU cache for now. When we do a new deploy of the service, existing symcache files will be lost.
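The caching idea can be sketched in a few lines. What follows is a simplified illustration, not the service's implementation: the class `DiskLRUCache` and the helper `get_symcache` are hypothetical names, the size limit is arbitrary, and the `parse` callable stands in for the expensive step of parsing a sym file and exporting the symcache blob. The index of what is on disk lives in memory, so a restart begins with an empty cache, which matches the behavior described above for deploys.

```python
import os
import tempfile
from collections import OrderedDict


class DiskLRUCache:
    """Keep at most `max_bytes` of blobs on disk, evicting least recently used."""

    def __init__(self, directory, max_bytes):
        self.directory = directory
        self.max_bytes = max_bytes
        self.sizes = OrderedDict()  # key -> size in bytes, least recent first
        self.total = 0

    def _path(self, key):
        # Keys like "libxul.so/ABC" contain "/", so flatten them for a filename.
        return os.path.join(self.directory, key.replace("/", "__"))

    def get(self, key):
        if key not in self.sizes:
            return None
        self.sizes.move_to_end(key)  # mark as most recently used
        with open(self._path(key), "rb") as fp:
            return fp.read()

    def set(self, key, data):
        if key in self.sizes:
            self.total -= self.sizes.pop(key)
        with open(self._path(key), "wb") as fp:
            fp.write(data)
        self.sizes[key] = len(data)
        self.total += len(data)
        # Evict the oldest entries until we fit, but never the one just written.
        while self.total > self.max_bytes and len(self.sizes) > 1:
            old_key, old_size = self.sizes.popitem(last=False)
            os.remove(self._path(old_key))
            self.total -= old_size


def get_symcache(cache, key, parse):
    """Return the cached blob for key, calling parse() only on a cache miss."""
    data = cache.get(key)
    if data is None:
        data = parse()
        cache.set(key, data)
    return data


with tempfile.TemporaryDirectory() as tmpdir:
    cache = DiskLRUCache(tmpdir, max_bytes=1024)
    key = "libxul.so/1410FAF03AD925211450AE25E0CB9AE50"
    # The lambda stands in for the slow parse-and-export step.
    blob = get_symcache(cache, key, lambda: b"fake symcache bytes")
```

Tracking sizes in bytes rather than counting entries matters here because symcache files vary a lot in size; one large module could otherwise crowd out the disk.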

Timeline

  1. Initial prototype with testing to see how it handles.

  2. Flesh out service in staging environment.

  3. Build out regression testing and load testing tools for verifying and validating the service.

  4. Put service in production.

  5. Add redirects from old service to new service.

Finishing up

The service is in production and has been working for a while. Switching to Symbolic has been fruitful since we can now take advantage of shared work (bug fixes, new features, etc).

Self-assessment

Regrets:

  • This took much longer than I wanted it to. Symbolic had a couple of bugs that were blockers and I needed to wait for them to get fixed. Then I hit logistical and scheduling issues that delayed figuring out production deployment.

  • I'm working on the Socorro-verse by myself. Because of that, it's hard to push big projects over the finish line because I'm often juggling multiple projects.

Contents:

  • I feel good about the job I did prototyping, figuring out the bounds of the project, building a project plan, building the service, working through infrastructure decisions, testing, verifying, validating, figuring out migration, and getting it into production. It's a lot of different kinds of roles.

  • This was a big win project-wise. Finishing this project unblocks several other projects at Mozilla including symbolicating and generating crash signatures for crash pings. That has potential to tell us a lot more about what's going on crash-wise for our users including illuminating crashes that we can't otherwise see evidence of because they don't result in a crash report.

  • Switching to Symbolic is a big win--it parses sym files faster than our independent implementation. Further, it allowed us to add support for file and line numbers trivially.

Thanks!

Thank you to everyone involved: Gabriele, Markus, Aria, Mark, and Jason.
