Observability Team Newsletter (2024q1)

Tuesday April 16, 2024, Will Kahn-Greene | share this (mastodon)

Observability Team is a team dedicated to the problem domain and discipline of Observability at Mozilla.

We own, manage, and support monitoring infrastructure and tools supporting Mozilla products and services. Currently this includes Sentry and crash ingestion related services (Crash Stats (Socorro), Mozilla Symbols Server (Tecken), and Mozilla Symbolication Service (Eliot)).

In 2024, we'll be working with SRE to take over other monitoring services they are currently supporting like New Relic, InfluxDB/Grafana, and others.

This newsletter covers an overview of 2024q1. Please forward it to interested readers.

Highlights

🤹 Observability Services: Change in user support
🏆 Sentry: Change in ownership
‼️ Sentry: Please don't start new trials
⏲️ Sentry: Cron monitoring trial ending April 30th
⏱️ Sentry: Performance monitoring pilot
🤖 Socorro: Improvements to Fenix support
🐛 Socorro: Support guard page access information

See details below.

Blog posts

None this quarter.

Detailed project updates

Observability Services: Change in user support

We overhauled our pages in Confluence, started an #obs-help Slack channel, created a new Jira OBSHELP project, built out a support rotation, and leveled up our ability to do support for Observability-owned services.

See our User Support Confluence page for:

where to get user support
documentation for common tasks (get protected data access, create a Sentry team, etc)
self-serve instructions

Hop in #obs-help in Slack to ask for service support, help with monitoring problems, and advice.

Sentry: Change in ownership

The Observability team now owns Sentry service at Mozilla!

We successfully completed Phase 1 of the transition in Q1. If you're a member of the Mozilla Sentry organization, you should have received a separate email about this to the sentry-users Google group.

We've overhauled Sentry user support documentation to improve it in a few ways:

easier to find "how to" articles for common tasks
best practices to help you set up and configure Sentry for your project needs

Check out our Sentry user guide.

There's still a lot that we're figuring out, so we appreciate your patience and cooperation.

Sentry: Please don't start new trials

Sentry sends marketing and promotional emails to Sentry users which often include links to start a new trial. Please contact us before starting any new feature trials in Sentry.

Starting new trials may prevent us from trialing those features in the future when we’re in a better position to evaluate the feature. There's no way for admins to prevent users from starting a trial.

Sentry: Cron monitoring trial ending April 30th

The Cron Monitoring trial that was started a couple of months ago will end April 30th.

Based on feedback so far and other factors, we will not be enabling this feature once the trial ends.

This is a good reminder to build in redundancy in your monitoring systems. Don't rely solely on trial or pilot features for mission critical information!

Once the trial is over, we'll put together an evaluation summary.

Sentry: Performance monitoring pilot

Performance Monitoring is being piloted by a couple of teams; it is not currently available for general use.

In the meantime, if you are not one of these pilot teams, please do not use Performance Monitoring. There is a shared transaction event quota for the entire Mozilla Sentry organization. Once we hit that quota, events are dumped.

If you have questions about any of this, please reach out.

Once the trial is over, we'll put together an evaluation summary.

Socorro: Improvements to Fenix support

We worked on improvements to crash ingestion and the Crash Stats site for the Fenix project:

1812771: Fenix crash reporter's Socorro crash reports for Java exceptions have "Platform" = "Unknown" instead of "Android"

Previously, the platform would be "Unknown". Now the platform for Fenix crash reports is "Android". Further, the platform_pretty_version includes the Android ABI version.

/images/obs_2024q1_android_version.thumbnail.png — Figure 1: Screenshot of Crash Stats Super Search results showing Android versions for crash reports.

1819628: reject crash reports for unsupported Fenix forks

Forks of Fenix outside of our control periodically send large swaths of crash reports to Socorro. When these sudden spikes happened, Mozillians would spend time looking into them only to discover they're not related to our code or our users. This is a waste of our time and resources.

We implemented support for the Android_PackageName crash annotation and added a throttle rule to the collector to drop crash reports from any non-Mozilla releases of Fenix.

From 2024-01-18 to 2024-03-31, Socorro accepted 2,072,785 Fenix crash reports for processing and rejected 37,483 unhelpful crash reports with this new rule. That's roughly 1.7%. That's not a huge amount, but because they sometimes come in bursts with the same signature, they show up in Top Crashers wasting investigation time.

1884041: fix create-a-bug links to work with java_exception

A long time ago, in an age partially forgotten, Fenix crash reports from a crash in Java code would send a crash report with a JavaStackTrace crash annotation. This crash annotation was a string representation of the Java exception. As such, it was difficult-to-impossible to parse reliably.

In 2020, Roger Yang and Will Kahn-Greene spec'd out a new JavaException crash annotation. The value is a JSON-encoded structure mirroring what Sentry uses for exception information. This structure provides more information than the JavaStackTrace crash annotation did and is much easier to work with because we don't have to parse it first.

Between 2020 and now, we have been transitioning from crash reports that only contained a JavaStackTrace to crash reports that contained both a JavaStackTrace and a JavaException. Once all Fenix crash reports from crashes in Java code contained a JavaException, we could transition Socorro code to use the JavaException value for Crash Stats views, signature generation, generate-create-bug-url, and other things.

Recently, Fenix dropped the JavaStackTrace crash annotation. However, we hadn't yet gotten to updating Socorro code to use--and prefer--the JavaException values. This broke the ability to generate a bug for a Fenix crash with the needed data added to the bug description. Work on bug 1884041 fixed that.

Comments for Fenix Java crash reports went from:

Crash report: https://crash-stats.mozilla.org/report/index/eb6f852b-4656-4cf5-8350-fd91a0240408

to:

Crash report: https://crash-stats.mozilla.org/report/index/eb6f852b-4656-4cf5-8350-fd91a0240408

Top 10 frames:

0  android.database.sqlite.SQLiteConnection  nativePrepareStatement  SQLiteConnection.java:-2
1  android.database.sqlite.SQLiteConnection  acquirePreparedStatement  SQLiteConnection.java:939
2  android.database.sqlite.SQLiteConnection  executeForString  SQLiteConnection.java:684
3  android.database.sqlite.SQLiteConnection  setJournalMode  SQLiteConnection.java:369
4  android.database.sqlite.SQLiteConnection  setWalModeFromConfiguration  SQLiteConnection.java:299
5  android.database.sqlite.SQLiteConnection  open  SQLiteConnection.java:218
6  android.database.sqlite.SQLiteConnection  open  SQLiteConnection.java:196
7  android.database.sqlite.SQLiteConnectionPool  openConnectionLocked  SQLiteConnectionPool.java:503
8  android.database.sqlite.SQLiteConnectionPool  open  SQLiteConnectionPool.java:204
9  android.database.sqlite.SQLiteConnectionPool  open  SQLiteConnectionPool.java:196

This both fixes the bug and also vastly improves the bug comments from what we were previously doing with JavaStackTrace.

Between 2024-03-31 and 2024-04-06, there were 158,729 Fenix crash reports processed. Of those, 15,556 have the circumstances affected by this bug: a JavaException but don't have a JavaStackTrace. That's roughly 10% of incoming Fenix crash reports.

While working on this, we refactored the code that generates these crash report bugs, so it's in a separate module that's easier to copy and use in external systems in case others want to generate bug comments from processed crash data.

Further, we changed the code so that instead of dropping arguments in function signatures, it now truncates them at 80 characters.

We're hoping to improve signature generation for Java crashes using JavaException values in 2024q2. That work is tracked in bug #1541120.

Socorro: Support guard page access information

1830954: Expose crashes which were likely accessing a guard page

We updated the stackwalker to pick up the changes for determining is_likely_guard_page. Then we exposed that in crash reports in the has_guard_page_access field. We added this field to the Details tab in crash reports and made it searchable. We also added this to the signature report.

This helps us know if a crash is possibly due to a bug with memory access that could be a possible security vulnerability vector--something we want to prioritize fixing.

Since this field is security sensitive, it requires protected data access to view and search with.

Socorro misc

crashstats-tools 2.0.0 release
socorro-siggen 2.1.20240412 release
4 signature generation changes. Thank you Andrew McCreight and Jim Blandy!
Maintenance and documentation improvements.
6 production deploys. Created 71 issues. Resolved 61 issues.

Tecken/Eliot misc

Maintenance and documentation improvements.
5 production deploys. Created 21 issues. Resolved 28 issues.

More information

Find us:

Confluence page: Observability Team
User support hub: User Support
Support: #obs-help (Slack)
Crash ingestion: #crashreporting (Matrix)

Thank you for reading!

Want to comment? Send an email to willkg at bluesock dot org. Include the url for the blog entry in your comment so I have some context as to what you're talking about.