Over the last year, I rewrote the Socorro collector which is the edge of the
Mozilla crash ingestion pipeline. Antenna (the new collector) receives crashes from
Breakpad clients as HTTP POSTs, converts the POST payload into JSON, and then
saves a bunch of stuff to AWS S3. One of the problems with this is that it's a
pain in the ass to develop on my laptop without an Internet connection and a
real AWS S3 to talk to.
This post covers the various things I did to have a locally running fake AWS S3
service.
This release focused on overhauling the Sphinx extension. It now:
- has an Everett domain
- supports roles
- indexes Everett components and options
- looks a lot better
This was the last big thing I wanted to do before doing a 1.0 release. I
consider Everett 0.9 to be a solid beta. Next release will be a 1.0.
Why you should take a look at Everett
At Mozilla, I'm using Everett 0.9 for Antenna which is running in our -stage
environment and will go to -prod very soon. Antenna is the edge of the crash
ingestion pipeline for Mozilla Firefox.
When writing Antenna, I started out with python-decouple, but I didn't like the
way python-decouple dealt with configuration errors (it's pretty hands-off) and
I really wanted to automatically generate documentation from my configuration
code. Why write the same stuff twice, especially when it's a critical part of
setting up Antenna and the part everyone will trip over first?
Here's the configuration documentation for Antenna:
When you configure Antenna incorrectly, it spits out an error message like this:
1 <traceback omitted, but it'd be here>
2 everett.InvalidValueError: ValueError: invalid literal for int() with base 10: 'foo'
3 namespace=None key=statsd_port requires a value parseable by int
4 Port for the statsd server
5 For configuration help, see https://antenna.readthedocs.io/en/latest/configuration.html
So what's here?
- Block 1 is the traceback, so you can trace the code if you need to.
- Line 2 is the exception type and message.
- Line 3 tells you the namespace, key, and parser used.
- Line 4 is the documentation for that specific configuration option.
- Line 5 is the "see also" documentation for the component with that
  configuration option.
Is it beautiful? No. [1] But it gives you enough information to know what the
problem is and where to go for more information.
Further, in Python 3, Everett will always raise a subclass of
ConfigurationError so if you don't like the output, you can tailor it to
your project's needs. [2]
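As a rough stdlib-only sketch (not Everett's actual internals; the class and helper names here are made up for illustration), the error message above could be produced by an exception subclass that carries the namespace, key, parser, and option documentation:

```python
class ConfigurationError(Exception):
    """Base class for configuration errors (mirrors the role of
    everett.ConfigurationError; simplified for illustration)."""


class InvalidValueError(ConfigurationError):
    """Raised when a config value fails to parse."""

    def __init__(self, namespace, key, parser, doc, see_also, original_exc):
        self.namespace = namespace
        self.key = key
        # Build the multi-part message: exception, key info, option doc,
        # and a "see also" pointer
        msg = (
            "%s: %s\n"
            "namespace=%s key=%s requires a value parseable by %s\n"
            "%s\n"
            "For configuration help, see %s"
        ) % (
            original_exc.__class__.__name__, original_exc,
            namespace, key, parser.__name__, doc, see_also,
        )
        super().__init__(msg)


def get_config(environ, key, parser=str, doc="", see_also="", namespace=None):
    """Look up key in environ and parse it, raising a rich error on failure."""
    raw = environ[key.upper()]
    try:
        return parser(raw)
    except ValueError as exc:
        raise InvalidValueError(namespace, key, parser, doc, see_also, exc)


# A bad statsd port produces the kind of message shown above
try:
    get_config(
        {"STATSD_PORT": "foo"},
        "statsd_port",
        parser=int,
        doc="Port for the statsd server",
        see_also="https://antenna.readthedocs.io/en/latest/configuration.html",
    )
except InvalidValueError as exc:
    print(exc)
```

Because everything is a subclass of one base error, an application can catch `ConfigurationError` at startup and render the message however it likes.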
First-class docs. First-class configuration error help. First-class testing.
This is why I created Everett.
If this sounds useful to you, take it for a spin. It's almost a drop-in
replacement for python-decouple [3] and the os.environ.get('CONFIGVAR',
'default_value') style of configuration.
Bleach is a Python library for sanitizing
and linkifying text from untrusted sources for safe usage in HTML.
Bleach v2.0 released!
Bleach 2.0 is a massive rewrite. Bleach relies on the html5lib library, and
html5lib 0.99999999 (8 9s) changed the APIs that Bleach was using to sanitize
text. In order to support html5lib >= 0.99999999, I needed to rewrite Bleach.
Before embarking on the rewrite, I improved the tests and added a set of tests
based on XSS example strings from the OWASP site. Spending quality time with
tests before a rewrite or refactor is both illuminating (you get a better
understanding of what the requirements are) and also immensely helpful (you know
when your rewrite/refactor differs from the original). That was time well spent.
Given that I was doing a rewrite anyways, I decided to take this opportunity to
break the Bleach API to make it more flexible and easier to use:
- added Cleaner and Linkifier classes that you can create once and reuse to
  reduce redundant work--suggested in #125
- created BleachSanitizerFilter, which is now an html5lib filter that can be
  used anywhere you can use an html5lib filter
- created LinkifyFilter as an html5lib filter that can be used anywhere you use
  an html5lib filter, including as part of cleaning, allowing you to clean and
  linkify in one pass--suggested in #46
- changed arguments for attribute callables and linkify callbacks
- and so on
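For example, assuming Bleach 2.0 or later is installed, the create-once-and-reuse pattern looks roughly like this:

```python
from bleach.sanitizer import Cleaner
from bleach.linkifier import Linker

# Build these once and reuse them; construction does setup work,
# so reusing instances avoids paying for it on every call.
cleaner = Cleaner()  # default allowed tags and attributes
linker = Linker()

dirty = '<b>hi</b> <script>alert("xss")</script> see http://example.com'

cleaned = cleaner.clean(dirty)   # escapes the disallowed <script> tag
linked = linker.linkify(cleaned)  # turns the bare URL into an anchor
print(linked)
```

The output keeps the allowed `<b>` tag, escapes the `<script>` tag, and wraps the URL in an `<a>` element.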
During and after the rewrite, I improved the documentation, converting all the
examples to doctest format so they're testable and verifiable, and adding
examples where there weren't any. This uncovered bugs in the documentation and
pointed out some annoyances with the new API.
As I rewrote and refactored code, I focused on making the code simpler and
easier to maintain going forward and also documented the intentions so I and
others can know what the code should be doing.
I also adjusted the internals to make it easier for users to extend, subclass,
swap out and whatever else to adjust the functionality to meet their needs
without making Bleach harder to maintain for me or less safe because of
additional complexity.
For API-adjustment inspiration, I went through the Bleach issue tracker and
tried to address every possible issue with this update: infinite loops,
unintended behavior, inflexible APIs, suggested refactorings, features, bugs,
etc.
The rewrite took a while. I tried to be meticulous because this is a security
library and it's a complicated problem domain and I was working on my own during
slow times on work projects. When you're working on your own, you don't have
the benefit of review. Making sure to have good test coverage and waiting a day
to self-review after posting a PR caught a lot of issues. I also go through the
PR and add comments explaining why I did things to give context to future me.
Those habits help a lot, but probably aren't as good as a code review by
someone else.
Some stats
OMG! This blog post is so boring! Walls of text everywhere so far!
There were 61 commits between v1.5 and v2.0:
- Vadim Kotov: 1
- Alexandr N. Zamaraev: 2
- me: 58
I closed out 22 issues--possibly some more.
The rewrite has the following git diff --shortstat:
I ran some timings between Bleach 1.5 and various uses of Bleach 2.0 on the
Standup corpus.
Here's the results:
what?                                     time to clean and linkify
----------------------------------------  -------------------------
Bleach 1.5                                1m33s
Bleach 2.0 (no code changes)              41s
Bleach 2.0 (using Cleaner and Linker)     10s
Bleach 2.0 (clean and linkify--one pass)  7s
How'd I compute the timings?
- I'm using the Standup corpus, which has 42000 status messages in it. Each
  status message is like a tweet--it's short, has some links, possibly has
  HTML in it, etc.
- I wrote a timing harness that goes through all those status messages and
  times how long it takes to clean and linkify the status message content,
  accumulates those timings, and then returns the total time spent cleaning
  and linkifying.
- I ran that 10 times and took the median. The timing numbers were remarkably
  stable, with only a few seconds' difference between the high and low for all
  of the sets.
- I wrote the median number down in the table above.
- Then I'd adjust the code as specified in the table and run the timings again.
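A minimal sketch of that kind of harness (with a stand-in process function and corpus, since the real one ran Bleach's clean and linkify over the Standup corpus):

```python
import statistics
import time


def time_corpus(process, corpus):
    """Sum how long process() takes across every item in corpus, in seconds."""
    total = 0.0
    for text in corpus:
        start = time.monotonic()
        process(text)
        total += time.monotonic() - start
    return total


def median_timing(process, corpus, runs=10):
    """Run the harness several times and take the median total."""
    return statistics.median(time_corpus(process, corpus) for _ in range(runs))


# Stand-in for clean-and-linkify on a stand-in corpus
corpus = ["status message with a link to http://example.com"] * 100
print(median_timing(str.upper, corpus, runs=3))
```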
I have several observations/thoughts:
First, holy moly--1m33s to 7s is a HUGE performance improvement.
Second, just switching from Bleach 1.5 to 2.0 and making no code changes (in
other words, keeping your calls as bleach.clean and bleach.linkify
rather than using Cleaner and Linker and LinkifyFilter), gets you a
lot. Depending on whether you have attribute filter callables and linkify
callbacks, you may be able to just upgrade the libs and WIN!
Third, switching to reusing Cleaner and Linker also gets you a lot.
Fourth, your mileage may vary depending on the nature of your corpus. For
example, Standup status messages are short so if your text fragments are larger,
you may see more savings by clean-and-linkify in one pass because HTML parsing
takes more time.
How to upgrade
Upgrading should be straightforward.
Here's the minimal upgrade path:

1. Update Bleach to 2.0 and html5lib to >= 0.99999999 (8 9s).
2. If you're using attribute callables, you'll need to update them.
3. If you're using linkify callbacks, you'll need to update them.
4. Read through the version 2.0 changes for any other backwards-incompatible
   changes that might affect you.
5. Run your tests and see how it goes.

If you want better performance, switch to reusing bleach.sanitizer.Cleaner and
bleach.linkifier.Linker.

If you have large text fragments, switch to reusing bleach.sanitizer.Cleaner
and set filters to include LinkifyFilter, which lets you clean and linkify in
one step.
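The one-pass setup looks something like this (assuming Bleach 2.0 or later is installed):

```python
from functools import partial

from bleach.sanitizer import Cleaner
from bleach.linkifier import LinkifyFilter

# LinkifyFilter runs as part of the cleaning pass, so html5lib only
# parses the text once. skip_tags keeps links out of <pre> blocks.
cleaner = Cleaner(filters=[partial(LinkifyFilter, skip_tags=["pre"])])

print(cleaner.clean("see http://example.com"))
```

One parse instead of two is where the clean-and-linkify-in-one-pass timing win in the table above comes from.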
Many thanks
Many thanks to James Socol (previous maintainer) for walking me through why
things were the way they were.
Many thanks to Geoffrey Sneddon (html5lib maintainer) for answering questions,
helping with problems I encountered and all his efforts on html5lib which is a
huge library that he works on in his spare time for which he doesn't get
anywhere near enough gratitude.
Many thanks to Lonnen (my manager) who heard me talk about html5lib zero point
nine nine nine nine nine nine nine nine a bunch.
Also, many thanks to Mozilla for letting me work on this during slow periods of
the projects I should be working on. A bunch of Mozilla sites use Bleach, but
none of mine do.
I work on a lot of different things. Some are applications, some are libraries,
some I started, some other people started, etc. I have way more stuff to do than
I could possibly get done, so I try to spend my time on things "that matter".
For Open Source software that doesn't have an established community, this is
difficult.
This post is a wandering stream of consciousness covering my journey figuring
out who uses Bleach.
As I sat down to write this, I discovered I'd never written about Everett
before. I wrote it initially as part of another project and then extracted it
and did a first release in August 2016.
Since then, I've been tinkering with how it works in relation to how it's used
and talking with peers to understand their thoughts on configuration.
At this stage, I like Everett and it's at a point where it's worth telling
others about and probably due for a 1.0 release.
This is v0.8. In this release, I spent some time polishing the autoconfig
Sphinx directive to make it more flexible to use in your project documentation.
Instead of having configuration bits documented all over your project, you
centralize them in one place, and then in your Sphinx docs you have something
like:
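Something along these lines (the module path here is hypothetical; point the directive at your own component):

```rst
.. autoconfig:: antenna.app.AntennaConfig
```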
Standup is a system for capturing standup-style posts
from individuals making it easier to see what's going on for teams and projects.
It has an associated IRC bot standups for posting messages from IRC.
This post talks a bit about the Standup v2 rewrite. Why and how we did it and
what's next.
Join us for a Standup v2 system test!
Paul and I did a ground-up rewrite of the Standup web-app to transition from
Persona to GitHub auth, release us from the shackles of the old architecture and
usher in a new era for Standup and its users.
We're done with the most minimal of minimal viable products. It's missing some
features that the current Standup has mostly around team management, but
otherwise it's the same-ish down to the lavish shade of purple in the header
that Rehan graced the site with so long ago.
If you're a Standup user, we need your help testing Standup v2 on the -stage
environment before Thursday, September 22nd, 2016!
We've thrown together a GitHub issue to (ab)use as a forum for test
results and working out what needs to get fixed before we push Standup v2 to
production. It's got instructions that should cover everything you need to know.
Why you would want to help:
- You get to see Standup v2 before it rolls out and point out anything that's
  missing that affects you.
- You get a chance to discover parts of Standup you may not have known about
  previously.
- This is a chance for you to lend a hand on this community project that helps
  you, which we're all working on in our free time.
- Once we get Standup v2 up, there are a bunch of things we can do with Standup
  that will make it more useful. Freddy is itching to fix IRC-related issues
  and wants https support [1]. I want to implement user API tokens, a CLI, and
  search. Paul wants to have better weekly team reports and project pages.
  There are others listed in the issue tracker and some that we never wrote
  down.
pyvideo.org is an index of Python-related conference and user-group
videos on the Internet. Saw a session you liked and want to share it? It's
likely you can find it, watch it, and share it with pyvideo.org.
This is my last update. pyvideo.org is now in new and better hands and
will continue going forward.
Over the weekend, I wanted to implement something that acted as both a class and
function decorator, but could also be used as a context manager. I needed this
flexibility for overriding configuration values making it easier to write tests.
I wanted to use it in the following ways:
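A helper like that can be built on `__enter__`/`__exit__` for the context-manager half and `__call__` for the decorator half. Here's a rough stdlib-only sketch with hypothetical names (`override_env` is made up for illustration; it's not the actual implementation):

```python
import functools
import os


class override_env:
    """Temporarily override os.environ values.

    Works as a context manager, a function decorator, and a class
    decorator (wrapping every test_* method). Illustrative sketch only.
    """

    def __init__(self, **overrides):
        self.overrides = overrides
        self._saved = None

    def __enter__(self):
        # Save current values so __exit__ can restore them
        self._saved = {k: os.environ.get(k) for k in self.overrides}
        os.environ.update(self.overrides)
        return self

    def __exit__(self, exc_type, exc_value, tb):
        for key, value in self._saved.items():
            if value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = value
        return False

    def __call__(self, target):
        if isinstance(target, type):
            # Class decorator: wrap every test_* method
            for name in dir(target):
                if name.startswith("test_"):
                    setattr(target, name, self(getattr(target, name)))
            return target

        @functools.wraps(target)
        def wrapper(*args, **kwargs):
            # Fresh instance per call so the decorator is re-entrant
            with self.__class__(**self.overrides):
                return target(*args, **kwargs)

        return wrapper


# As a context manager:
with override_env(DEBUG="true"):
    assert os.environ["DEBUG"] == "true"


# As a function decorator:
@override_env(DEBUG="true")
def check():
    return os.environ["DEBUG"]
```

The trick is that decorating creates a fresh instance per invocation inside the wrapper, so a single decorator object can be applied to many functions without sharing saved state.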