Date: 2018-04-04

Summary

Socorro is the crash ingestion pipeline for Mozilla's products like Firefox. When Firefox crashes, the Breakpad crash reporter asks the user whether they would like to send a crash report. If the user answers "yes!", then the Breakpad crash reporter collects data related to the crash, generates a crash report, and submits that crash report as an HTTP POST to Socorro. Socorro collects and saves the crash report, processes it, and provides an interface for aggregating, searching, and looking at crash reports.

Over the last year and a half, we've been working on a new infrastructure for Socorro and migrating the project to it. It was a massive undertaking and involved changing a lot of code and some architecture and then redoing all the infrastructure scripts and deploy pipelines. On Thursday, March 28th, we pushed the button and switched to the new infrastructure. The transition was super smooth. Now we're on new infra!

This blog post talks a little about the old and new infrastructures and the work we did to migrate.

Why switch this time?

First, some context. Way back in the day, Socorro was hosted in a Mozilla data center on long-running virtual servers running CentOS. The hardware was out of warranty and dying, so work was under way to figure out the next infrastructure. Then the plan changed: that data center was scheduled for decommission and the Socorro team had to scramble to move it somewhere else. They decided to move it to AWS, but the timing was such that they didn't have much time to re-architect and rebuild Socorro to work well in AWS. For the purposes of expedience, they opted for a hybrid approach between the old way of doing things on servers in the data center and doing things with AWS best practices.

That AWS infrastructure was a vast improvement over the one before it. Developers had a lot of autonomy and visibility into the complex system. Deploys were simpler and more automated. Nodes could be resized to be larger as computational requirements changed. There were scaling groups so an increase in load could be handled by throwing more nodes at it. So on and so forth. That migration project was considered a success. I wasn't on the project at the time, but I was sitting at the next table over at the All Hands when they finished the migration. It was cool stuff.

However, the hybrid approach resulted in a unicorn infrastructure that was unlike anything else at Mozilla. It was cute but quirky in some ways and really awkward in others. The plan was never to leave it in this state, but to incrementally change it into a more AWS-like system over time. We worked on that for a while, but it became clear that it would be easier to build a new infrastructure and migrate rather than continue to iterate on the existing one.

Let's talk about some of the quirky and awkward things about the old infrastructure.

Deploys to stage happened automatically any time we merged to the master branch.
These would build an RPM of Socorro components, install the RPM on CentOS, bake an AMI, and push that out to stage. Deploys to prod happened by tagging a commit in the master branch. The AMIs associated with that commit would get pushed to prod. RPM filenames included the tag of the Socorro build they were built on. However, since RPMs were built in the deploy to stage and the tag was created to deploy to prod, the RPMs that were in prod had the previous tag. That was constantly confusing for me.

Our local development environment was completely different from the stage and production environments. It was really hard to get it to match them more closely. Periodically, we'd have problems in stage and production that we couldn't reproduce locally and vice versa.

Because migrations and configuration changes were done manually, our stage and production environments weren't like one another. Over the years, some changes that should have been made were forgotten and the environments diverged. I had several instances where a migration I wrote would work fine on my local machine, work fine on stage, but fail in production because there was a stored procedure or a table with foreign keys that didn't exist in other environments.

Both stage and production had a long-running admin node that we updated manually after deploys. Further, we had to manually run migrations and do configuration changes.

Configuration was managed in Consul. We had 170+ environment variables (some with structurally complex values more than 100 characters long) per environment controlling how Socorro worked. Configuration data wasn't version controlled and had a "review process" that consisted of conversations like this:

    <willkg> can someone review this config change?
    <willkg> consulate kv set socorro/processor/new_crash_source.new_crash_source_class=socorro.external.fs.fs_new_crash_source.FSNewCrashSource
    * peterbe looks.
    <peterbe> looks ok to me.
    * willkg makes the change.
    <willkg> done!
Since we weren't running Consul in the local development environment, configuration changes were effectively tested in stage.

One nice thing about Consul the way we had it set up was that after a configuration change, Consul would restart all the processor processes immediately--we didn't have to wait for a deploy. That made the occasional feature-flipping or A/B testing a lot easier.

Logs were created and existed on the individual nodes. There was no log aggregation and no central storage. To look at logs, we'd have to log into individual nodes. Every time we did a deploy, we'd lose all the logs. Thus we could never look very far back in time at logs.

The processor nodes were in a fixed autoscaling group and didn't scale automatically. Periodically, something in the world would happen and we'd get a larger-than-normal flow of crashes and the processors would be working furiously, but the queue would back up and our ops person would have to manually add nodes to the group. After a deploy, it'd return to the original number and we'd have to manually scale it up again.

We use AWS S3 for storage of crash data. However, when we set that up years ago, we put periods in our bucket names. For example, we had a bucket like org.allizom.crash-stats.rhelmer-test.crashes. That becomes a problem when you're using HTTPS because the wildcard SSL certificate doesn't match hostnames built from bucket names that contain periods.

Another thing we wanted to do was further reduce which things had access to what storage systems. Reducing access would reduce the likelihood of security breaches and data leaks.
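The bucket-name problem mentioned above comes down to how wildcard certificates match hostnames. Here is a minimal sketch of that rule, assuming the certificate name is `*.s3.amazonaws.com` and that buckets are addressed in virtual-hosted style (`<bucket>.s3.amazonaws.com`). The helper names and the period-free bucket name are made up for illustration, and the matching is a simplified version of what TLS clients do: the `*` stands for exactly one leftmost DNS label.

```python
def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Return True if a certificate name like '*.example.com' covers hostname.

    Simplified rule: '*' may only replace the single leftmost DNS label.
    """
    pattern_labels = pattern.lower().split(".")
    host_labels = hostname.lower().split(".")
    # A wildcard never spans several labels, so the label counts must agree.
    if len(pattern_labels) != len(host_labels):
        return False
    if pattern_labels[0] != "*" and pattern_labels[0] != host_labels[0]:
        return False
    return pattern_labels[1:] == host_labels[1:]


def virtual_hosted_name(bucket: str) -> str:
    """Hostname used when the bucket name is part of the host (assumed style)."""
    return f"{bucket}.s3.amazonaws.com"


CERT_NAME = "*.s3.amazonaws.com"  # assumption about the certificate name

dotted = virtual_hosted_name("org.allizom.crash-stats.rhelmer-test.crashes")
plain = virtual_hosted_name("crash-stats-rhelmer-test-crashes")  # hypothetical name

print(wildcard_matches(CERT_NAME, dotted))  # False: the bucket adds extra labels
print(wildcard_matches(CERT_NAME, plain))   # True: the bucket is a single label
```

Every period in a bucket name adds another DNS label, which is why a name without periods avoids the mismatch.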
Thus we had a list of things that we wanted:

- aggregated, centralized logs and log history
- Docker-based deploys
- no more manual post-deploy steps leading to diverging environments
- disposable nodes
- configuration that was in version control alongside code and infrastructure and requiring review of changes
- reduced access to storage systems
- automatic scaling
- AWS S3 bucket names that don't have periods

Knowing what we wanted out of a new infrastructure, we set about moving forward.
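To make the "configuration in version control" item a bit more concrete, here is a rough sketch of one way it could look. This is not how Socorro actually does it: the `SETTINGS` table, the `SOCORRO_` prefix, the setting names, and the default worker count are all invented for illustration. The idea is only that defaults and types live in a file in the repository, so changing them goes through code review, while per-environment overrides come from environment variables and unknown names are rejected instead of silently ignored.

```python
import os

# Hypothetical settings table: name -> (parser, default). Because it lives in
# the repository, any change to it shows up in a reviewed diff.
SETTINGS = {
    "NEW_CRASH_SOURCE_CLASS": (
        str,
        "socorro.external.fs.fs_new_crash_source.FSNewCrashSource",
    ),
    "PROCESSOR_WORKERS": (int, 4),
}

PREFIX = "SOCORRO_"  # hypothetical prefix for override variables


def load_config(environ=None):
    """Build a config dict from the defaults plus prefixed overrides."""
    environ = os.environ if environ is None else environ
    config = {name: default for name, (_, default) in SETTINGS.items()}
    for key, raw in environ.items():
        if not key.startswith(PREFIX):
            continue  # unrelated environment variable
        name = key[len(PREFIX):]
        if name not in SETTINGS:
            # Fail loudly on typos rather than diverging quietly.
            raise KeyError(f"unknown setting: {key}")
        parser, _ = SETTINGS[name]
        config[name] = parser(raw)
    return config


# Example: one override, everything else falls back to the reviewed defaults.
print(load_config({"SOCORRO_PROCESSOR_WORKERS": "8"}))
```

Because the same loader can run locally, in stage, and in production, a bad value would tend to fail in the same way everywhere instead of being discovered in stage.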

Where are we at now?

On March 28th, we cut over to the new system. We had the minorest of minor issues:

I forgot that the data flow for the thing I shall not name and despise, because it is the unholiest of unholy things, works differently in production than in all the other environments, and when we cut over, we needed to manually tweak the crontabber record for it so that it would run correctly on Friday. We discovered the issue after a few hours, tweaked the crontabber record, and we're fine now.

We discovered there was a bug in this thing we decided to rewrite wherein the process ends before it has time to ack the crashes in RabbitMQ that it just pushed. The next time it starts up, it runs through the same crashes. Again. And again. And again. Every two minutes. Then on Sunday, those crashes started raising IntegrityErrors since the date embedded in the crash id didn't match the submitted_timestamp and so the processor was trying to jam it in the wrong database table. We shut it off and now that's fine.
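The redelivery loop described above can be reproduced with a toy model. The sketch below does not use RabbitMQ or any real client library: `AckQueue` is a made-up in-memory stand-in that only mimics one behavior, namely that a message delivered but never acknowledged goes back on the queue when the consumer goes away. The crash names and the `run_once` helper are likewise invented.

```python
from collections import deque


class AckQueue:
    """Toy in-memory queue with explicit acknowledgements (not a real broker)."""

    def __init__(self, messages):
        self._ready = deque(messages)
        self._unacked = {}
        self._next_tag = 0

    def get(self):
        """Deliver the next message as (tag, body), or None if the queue is empty."""
        if not self._ready:
            return None
        self._next_tag += 1
        body = self._ready.popleft()
        self._unacked[self._next_tag] = body
        return self._next_tag, body

    def ack(self, tag):
        """Confirm a delivery so it will not be delivered again."""
        self._unacked.pop(tag, None)

    def connection_lost(self):
        """Requeue everything that was delivered but never acked, in order."""
        for tag in sorted(self._unacked, reverse=True):
            self._ready.appendleft(self._unacked.pop(tag))


def run_once(source, forwarded, ack_before_exit):
    """One run of the hypothetical job: forward every crash, then exit."""
    tags = []
    while True:
        delivery = source.get()
        if delivery is None:
            break
        tag, crash_id = delivery
        forwarded.append(crash_id)
        tags.append(tag)
    if ack_before_exit:
        for tag in tags:
            source.ack(tag)
    source.connection_lost()  # the process ends here


def simulate(runs, ack_before_exit):
    source = AckQueue(["crash-1", "crash-2"])
    forwarded = []
    for _ in range(runs):
        run_once(source, forwarded, ack_before_exit)
    return forwarded


print(simulate(3, ack_before_exit=False))  # both crashes forwarded on every run
print(simulate(3, ack_before_exit=True))   # both crashes forwarded once
```

In the toy model, the fix is simply making sure the acks happen before the process exits; what the right fix looks like in a real deployment depends on the client library and how the job shuts down.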

We discovered we needed to raise the nginx upload max file size for the reverse proxy that sits in front of Elasticsearch because some crashes are big. Like, really big. We raised it. Those crashes are saved to Elasticsearch now. Now that's fine.

We had to wait for the last S3 mirror to finish, which took a couple of days. During that time, we were missing some crash data that had been collected and processed last week but was indexed in Elasticsearch, so it was searchable, so only sort of missing. We knew this and had notified users accordingly. This is fine now.

All minor things--no data loss. The equivalent of moving from one mansion to another mansion in four hours and in the process misplacing your golf clubs in the shower stall of the bathroom for ten minutes. Nothing broke. No data loss. No biggie.

This was a successful project. There are some minor things left to do. This unblocks a bunch of other work. Things are good.

We probably could have done better. We did some of the work a few times, and if we had done it "right" the first time, we might have finished earlier. We had a lot of failures caught by simulations, tests, loadtests, runthroughs of system checklists, Sentry error reporting, Datadog graphs, and other places. It's likely we'll hit some more issues over the next few weeks as we get a feel for the new system.

Still, it feels good to be done with this project.