Friday March 01, 2019 09:00, Will Kahn-Greene

What is it?

Bleach is a Python library for sanitizing and linkifying text from untrusted sources for safe usage in HTML.

I'm stepping down

In October 2015, I had a conversation with James Socol that resulted in me picking up Bleach maintenance from him. That was a little over 3 years ago. In that time, I:

did 12 releases
improved the tests; switched from nose to pytest, added test coverage for all supported versions of Python and html5lib, added regression tests for xss strings in OWASP Testing Guide 4.0 appendix
worked with Greg to add browser testing for cleaned strings
improved documentation; added docstrings, added lots of examples, added automated testing of examples, improved copy
worked with Jannis to implement a security bug disclosure policy
improved performance (Bleach v2.0 released!)
switched to semver so the version number was more meaningful
did a rewrite to work with the extensive html5lib API changes
spent a couple of years dealing with the regressions from the rewrite
stepped up as maintainer for html5lib and did a 1.0 release
added support for Python 3.6 and 3.7

I accomplished a lot.

A retrospective on OSS project maintenance

I'm really proud of the work I did on Bleach. I took a great project and moved it forward in important and meaningful ways. Bleach is used by a ton of projects in the Python ecosystem. You have likely benefitted from my toil.

While I used Bleach on projects like SUMO and Input years ago, I wasn't really using Bleach on anything while I was a maintainer. I picked up maintenance of the project because I was familiar with it, James really wanted to step down, and Mozilla was using it on a bunch of sites--I picked it up because I felt an obligation to make sure it didn't drop on the floor and I knew I could do it.

I never really liked working on Bleach. The problem domain is a total fucking pain-in-the-ass. Parsing HTML like a browser--oh, but not exactly like a browser because we want the output of parsing to be as much like the input as possible, but as safe. Plus, have you seen XSS attack strings? Holy moly! Ugh!

Anyhow, so I did a bunch of work on a project I don't really use, but felt obligated to make sure it didn't fall on the floor, that has a pain-in-the-ass problem domain. I did that for 3+ years.

Recently, I had a conversation with Osmose that made me rethink that. Why am I spending my time and energy on this?

Does it further my career? I don't think so. Time will tell, I suppose.

Does it get me fame and glory? No.

Am I learning while working on this? I learned a lot about HTML parsing. I have scars. It's so crazy what browsers are doing.

Is it a community through which I'm meeting other people and creating friendships? Sort of. I like working with James, Jannis, and Greg. But I interact and work with them on non-Bleach things, too, so Bleach doesn't help here.

Am I getting paid to work on it? Not really. I did some of the work on work-time, but I should have been using that time to improve my skills and my career. So, yes, I spent some work-time on it, but it's not a project I've been tasked with working on. For the record, I work on Socorro which is the Mozilla crash-ingestion pipeline. I don't use Bleach on that.

Do I like working on it? No.

Seems like I shouldn't be working on it anymore.

I moved Bleach forward significantly. I did a great job. I don't have any half-finished things to do. It's at a good stopping point. It's a good time to thank everyone and get off the stage.

What happens to Bleach?

I'm stepping down without working on what comes next. I think Greg is going to figure that out.

Thank you!

Jannis was a co-maintainer at the beginning because I didn't want to maintain it alone. Jannis stepped down and Greg joined. Both Jannis and Greg were a tremendous help and fantastic people to work with. Thank you!

Sam Snedders helped me figure out a ton of stuff with how Bleach interacts with html5lib. Sam was kind enough to deputize me as a temporary html5lib maintainer to get 1.0 out the door. I really appreciated Sam putting faith in me. Conversations about the particulars of HTML parsing--I'll miss those. Thank you!

While James wasn't maintaining Bleach anymore, he always took the time to answer questions I had. His historical knowledge, guidance, and thoughtfulness were crucial. James was my manager for a while. I miss him. Thank you!

There were a handful of people who contributed patches, too. Thank you!

Thank your maintainers!

My experience from 20 years of OSS projects is that many people are in similar situations: continuing to maintain something because of internal obligations long after they've stopped getting any value from the project.

Take care of the maintainers of the projects you use! You can't thank them enough for their time, their energy, their diligence, their help! Not just the big successful projects, but also the one-person projects, too.

Shout-out for PyCon 2019 maintainers summit

Sumana mentioned that PyCon 2019 has a maintainers summit. That looks fantastic! If you're in the doldrums of maintaining an OSS project, definitely go if you can.

Changes to this blog post

Update March 2, 2019: I completely forgot to thank Sam Snedders which is a really horrible omission. Sam's the best!

Everett v1.0.0 released!

Monday January 07, 2019 10:00, Will Kahn-Greene

What is it?

Everett is a configuration library for Python apps.

Goals of Everett:

flexible configuration from multiple configured environments
easy testing with configuration
easy documentation of configuration for users

From that, Everett has the following features:

is composable and flexible
makes it easier to provide helpful error messages for users trying to configure your software
supports auto-documentation of configuration with a Sphinx autocomponent directive
has an API for testing configuration variations in your tests
can pull configuration from a variety of specified sources (environment, INI files, YAML files, dict, write-your-own)
supports parsing values (bool, int, lists of things, classes, write-your-own)
supports key namespaces
supports component architectures
works with whatever you're writing--command line tools, web sites, system daemons, etc
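
To make the "multiple sources" and "parsing values" items above a bit more concrete, here's a minimal standard-library sketch of the general idea. It is not Everett's API: LayeredConfig and parse_bool are hypothetical names made up for illustration, the two dicts stand in for a process environment and an INI file, and the real library does a lot more (namespaces, documentation, better errors).

```python
def parse_bool(value):
    """Parse common truthy/falsy strings into a bool."""
    lowered = value.strip().lower()
    if lowered in ("1", "true", "yes", "on"):
        return True
    if lowered in ("0", "false", "no", "off"):
        return False
    raise ValueError("%r is not a valid bool value" % value)


class LayeredConfig:
    """Look a key up in several sources; the first source that has it wins."""

    def __init__(self, sources):
        # Each source is any mapping of upper-case keys to strings
        # (os.environ would work here, too).
        self.sources = list(sources)

    def __call__(self, key, default=None, parser=str):
        key = key.upper()
        for source in self.sources:
            if key in source:
                return parser(source[key])
        if default is None:
            raise KeyError("%s is required but not set in any source" % key)
        # Defaults are strings, so they go through the same parser.
        return parser(default)


# Hypothetical stand-ins for "process environment" and "INI file values".
fake_environ = {"DEBUG": "true"}
fake_ini = {"DEBUG": "false", "PORT": "8000"}

config = LayeredConfig([fake_environ, fake_ini])
debug = config("debug", parser=parse_bool)    # found in the first source
port = config("port", parser=int)             # falls through to the second
host = config("host", default="localhost")    # not set anywhere, use default
```

The point of routing everything through one callable is that source order, parsing, and "required but missing" handling live in a single place instead of being repeated at every os.environ.get call.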

v1.0.0 released!

This release fixes many sharp edges, adds a YAML configuration environment, and fixes Everett so that it has no dependencies unless you want to use YAML or INI.

It also drops support for Python 2.7--Everett no longer supports Python 2.

Why you should take a look at Everett

At Mozilla, I'm using Everett for Antenna which is the edge collector for the crash ingestion pipeline for Mozilla products including Firefox and Fennec. It's been in production for a little under a year now and doing super. Using Everett makes it much easier to:

deal with different configurations between local development and server environments
test different configuration values
document configuration options
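
As a rough illustration of the "test different configuration values" item, here's a standard-library sketch of a temporary configuration override. Config and its override method are hypothetical names for this example only. Everett ships its own testing helper, which this sketch doesn't try to reproduce.

```python
from contextlib import contextmanager


class Config:
    """Tiny config object: overrides are checked before the base values."""

    def __init__(self, values):
        self.values = dict(values)
        self.overrides = []  # stack of dicts, most recent last

    def get(self, key):
        for layer in reversed(self.overrides):
            if key in layer:
                return layer[key]
        return self.values[key]

    @contextmanager
    def override(self, **values):
        """Temporarily layer `values` on top of the config."""
        self.overrides.append(values)
        try:
            yield self
        finally:
            # Always restore, even if the test body raises.
            self.overrides.pop()


config = Config({"statsd_port": "8125"})

with config.override(statsd_port="9999"):
    inside = config.get("statsd_port")   # sees the override

outside = config.get("statsd_port")      # back to the base value
```

Because the override is undone in a finally block, one test's configuration can't leak into the next test.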

It's also used in a few other places and I plan to use it for the rest of the components in the crash ingestion pipeline.

First-class docs. First-class configuration error help. First-class testing. This is why I created Everett.

If this sounds useful to you, take it for a spin. It's almost a drop-in replacement for python-decouple and os.environ.get('CONFIGVAR', 'default_value') style of configuration.

Enjoy!

Thank you!

Thank you to Paul Jimenez who helped fix issues and provided thoughtful insight on API ergonomics!

Where to go for more

For more specifics on this release, see here: https://everett.readthedocs.io/en/latest/history.html#january-7th-2019

Documentation and quickstart here: https://everett.readthedocs.io/en/latest/

Source code and issue tracker here: https://github.com/willkg/everett

Bleach v3.0.0 released!

Wednesday October 03, 2018 12:00, Will Kahn-Greene

What is it?

Bleach is a Python library for sanitizing and linkifying text from untrusted sources for safe usage in HTML.

Bleach v3.0.0 released!

Bleach 3.0.0 focused on easing the problems with the html5lib dependency and fixing regressions created in the Bleach 2.0 rewrite.

For the first, I vendored html5lib 1.0.1 into Bleach and wrote a shim module. Bleach code uses things in the shim module, which imports things from html5lib. This way:

the two are kept separated to some extent
the shim is easy to test on its own
it shouldn't be too hard to update html5lib versions
we don't have to test Bleach against multiple versions of html5lib (which took a lot of time)
no one has to deal with Bleach requiring one version of html5lib and other libraries requiring other versions

I think this is a big win for all of us.

The second was tricky. The Bleach 2.0 rewrite changed clean and linkify from running in the tokenizing step of HTML parsing to running after parsing is done. The parser (un)helpfully would clean up the HTML before passing it to Bleach. Because of that, the cleaned text would end up with all this extra stuff.

For example, with Bleach 2.1.4, you'd have this:

>>> import bleach
>>> bleach.clean('This is terrible.<sarcasm>')
'This is terrible.&lt;sarcasm&gt;&lt;/sarcasm&gt;'

The tokenizer would parse out things that looked like HTML tags, the parser would see a start tag that didn't have an end tag and would add the end tag, then clean would escape the start and end tags because they weren't in the list of allowed tags. Blech.

Bleach 3.0.0 fixes that by tweaking the tokenizer to know about the list of allowed tags. With this knowledge, it can see a start, end, or empty tag and strip or escape it during tokenization. Then the parser doesn't try to fix anything.

With Bleach 3.0.0, we get this:

>>> import bleach
>>> bleach.clean('This is terrible.<sarcasm>')
'This is terrible.&lt;sarcasm&gt;'
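
The idea of deciding what to do with a tag while tokenizing, before any tree-building parser gets a chance to "fix" things, can be sketched with the standard library's html.parser. This is only an illustration of the approach and not Bleach's implementation: AllowlistEscaper and clean_sketch are hypothetical names, attributes on allowed tags are simply dropped, comments and declarations are silently discarded, and none of this should be treated as a safe sanitizer.

```python
from html import escape
from html.parser import HTMLParser


class AllowlistEscaper(HTMLParser):
    """Escape tags that aren't allowed as they are tokenized."""

    def __init__(self, allowed_tags):
        super().__init__(convert_charrefs=False)
        self.allowed_tags = set(allowed_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed_tags:
            # Simplification: keep the tag, drop every attribute.
            self.parts.append("<%s>" % tag)
        else:
            # Escape the tag text exactly as it appeared in the input.
            self.parts.append(escape(self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are handled like start tags,
        # so no separate end tag is invented for them.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        text = "</%s>" % tag
        self.parts.append(text if tag in self.allowed_tags else escape(text))

    def handle_data(self, data):
        self.parts.append(escape(data))

    def handle_entityref(self, name):
        # Pass named character references through unchanged.
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        # Pass numeric character references through unchanged.
        self.parts.append("&#%s;" % name)


def clean_sketch(text, allowed_tags=("b", "em")):
    parser = AllowlistEscaper(allowed_tags)
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)
```

Since nothing builds a tree here, nothing adds a matching end tag: clean_sketch('This is terrible.<sarcasm>') returns 'This is terrible.&lt;sarcasm&gt;'.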

What I could use help with

I could use help with improving the documentation. I think it's dense and all over the place focus-wise. I find it difficult to read.

If you're good with documentation, I sure could use your help. See issue 397 for more.

Where to go for more

For more specifics on this release, see here: https://bleach.readthedocs.io/en/latest/changes.html#version-3-0-0-october-3rd-2018

Documentation and quickstart here: https://bleach.readthedocs.io/en/latest/

Source code and issue tracker here: https://github.com/mozilla/bleach

Thoughts on Guido retiring as BDFL of Python

Monday July 16, 2018 08:00, Will Kahn-Greene

I read the news of Guido van Rossum announcing his retirement as BDFL of Python and it made me a bit sad.

I've been programming in Python for almost 20 years on a myriad of open source projects, tools for personal use, and work. I helped out with several PyCon US conferences and attended several others. I met a lot of amazing people who have influenced me as a person and as a programmer.

I started PyVideo in March 2012. At a PyCon US after that (maybe 2015?), I found myself in an elevator with Guido and somehow we got to talking about PyVideo and he asked point-blank, "Why work on that?" I tried to explain what I was trying to do with it: create an index of conference videos across video sites, improve the meta-data, transcriptions, subtitles, feeds, etc. I remember he patiently listened to me and then said something along the lines of how it was a good thing to work on. I really appreciated that moment of validation. I think about it periodically. It was one of the reasons Sheila and I worked hard to transition PyVideo to a new group after we were burned out.

It wouldn't be an overstatement to say that through programming in Python, I've done some good things and become a better person.

Thank you, Guido, for everything!

html5lib-python 1.0 released!

Friday December 08, 2017 12:00, Will Kahn-Greene

html5lib-python v1.0 released!

Yesterday, Geoffrey released html5lib 1.0 [1]! The changes aren't wildly interesting.

The more interesting part for me is how the release happened. I'm going to spend the rest of this post talking about that.

[1]	Technically there was a 1.0 release followed by a 1.0.1 release because the 1.0 release had issues.

The story of Bleach and html5lib

I work on Bleach which is a Python library for sanitizing and linkifying text from untrusted sources for safe usage in HTML. It relies heavily on another library called html5lib-python. Most of the work that I do on Bleach consists of figuring out how to make html5lib do what I need it to do.

Over the last few years, maintainers of the html5lib library have been working towards a 1.0. Those well-meaning efforts got them into a versioning model which had some unenthusing properties. I would often talk to people about how I was having difficulties with Bleach and html5lib 0.99999999 (8 9s) and I'd have to mentally count how many 9s I had said. It was goofy [2].

In an attempt to deal with the effects of the versioning, there's a parallel set of versions that start with 1.0b. Because there are two sets of versions, it was a total pain in the ass to correctly specify which versions of html5lib that Bleach worked with.

While working on Bleach 2.0, I bumped into a few bugs and upstreamed a patch for at least one of them. That patch sat in the PR queue for months. That's what got me wondering--is this project dead?

I tracked down Geoffrey and talked with him a bit on IRC. He seems to be the only active maintainer. He was really busy with other things, html5lib doesn't pay at all, there's a ton of stuff to do, he's burned out, and recently there have been spats of negative comments in the issues and PRs. Generally the project had a lot of stop energy.

Some time in August, I offered to step up as an interim maintainer and shepherd html5lib to 1.0. The goals being:

land or close as many old PRs as possible
triage, fix, and close as many issues as possible
clean up testing and CI
clean up documentation
ship 1.0 which ends the versioning issues

[2]	Many things in life are goofy.

Thoughts on being an interim maintainer

I see a lot of open source projects that are in trouble in the sense that they don't have a critical mass of people and energy. When the sole part-time volunteer maintainer burns out, the project languishes. Then the entitled users show up, complain, demand changes, and talk about how horrible the situation is and everyone should be ashamed. It's tough--people are frustrated and then do a bunch of things that make everything so much worse. How do projects escape the raging inferno death spiral?

For a while now, I've been thinking about a model for open source projects where someone else pops in as an interim maintainer for a short period of time with specific goals and then steps down. Maybe this alleviates users' frustrations? Maybe this gives the part-time volunteer burned-out maintainer a breather? Maybe this can get the project moving again? Maybe the temporary interim maintainer can make some of the hard decisions that a regular long-term maintainer just can't?

I wondered if I should try that model out here. In the process of convincing myself that stepping up as an interim maintainer was a good idea [3], I looked at projects that rely on html5lib [4]:

pip vendors it
Bleach relies upon it heavily, so anything that uses Bleach uses html5lib (jupyter, hypermark, readme_renderer, tensorflow, ...)
most web browsers (Firefox, Chrome, servo, etc) have it in their repositories because web-platform-tests uses it

I talked with Geoffrey and offered to step up with these goals in mind.

I started with cleaning up the milestones in GitHub. I bumped everything from the 0.9999999999 (10 9s) milestone which I determined will never happen into a 1.0 milestone. I used this as a bucket for collecting all the issues and PRs that piqued my interest.

I went through the issue tracker and triaged all the issues. I tried to get steps to reproduce and any other data that would help resolve the issue. I closed some issues I didn't think would ever get resolved.

I triaged all the pull requests. Some of them had been open for a long time. I apologized to people who had spent their time to upstream a fix that sat around for years. In some cases, the changes had bitrotted severely and had to be redone [5].

Then I plugged away at issues and pull requests for a couple of months and pushed out of the milestone anything that wasn't well-defined or that we couldn't fix in a week.

At the end of all that, Geoffrey released version 1.0 and here we are today!

[3]	I have precious little free time, so this decision had sweeping consequences for my life, my work, and people around me.

[4]	Recently, I discovered libraries.io--it's a pretty amazing project. They have a page for html5lib. I had written a (mediocre) tool that does vaguely similar things.

[5]	This is what happens on projects that don't have a critical mass of energy/people. It sucks for everyone involved.

Conclusion and thoughts

I finished up as interim maintainer for html5lib. I don't think I'm going to continue actively as a maintainer. Yes, Bleach uses it, but I've got other things I should be doing.

I think this was an interesting experiment. I also think it was a successful experiment with regard to achieving my stated goals, but I don't know if it gave the project much momentum to continue forward.

I'd love to see other examples of interim maintainers stepping up, achieving specific goals, and then stepping down again. Does it bring in new people to the community? Does it affect the raging inferno death spiral at all? What kinds of projects would benefit from this the most? What kinds of projects wouldn't benefit at all?

Markus v1.0 released! Better metrics API for Python projects.

Monday October 30, 2017 09:00, Will Kahn-Greene

What is it?

Markus is a Python library for generating metrics.

Markus makes it easier to generate metrics in your program by:

providing multiple backends (Datadog statsd, statsd, logging, logging roll-up, and so on) for sending data to different places
sending metrics to multiple backends at the same time
providing a testing framework for easy testing
providing a decoupled architecture that makes it easier to write code that generates metrics without having to worry about whether a metrics client has already been created and configured--similar to the Python logging module in this way

I use it at Mozilla in the collector of our crash ingestion pipeline. Peter used it to build our symbols lookup server, too.

v1.0 released!

This is the v1.0 release. I pushed out v0.2 back in April 2017. We've been using it in Antenna (the collector of the Firefox crash ingestion pipeline) since then. At this point, I think the API is sound and it's being used in production, ergo it's production-ready.

This release also adds Python 2.7 support.

Why you should take a look at Markus

Markus does three things that make generating metrics a lot easier.

First, it separates creating and configuring the metrics backends from generating metrics.

Let's create a metrics client that sends data nowhere:

import markus

markus.configure()

That's not wildly helpful, but it works and it's 2 lines.

Say we're doing development on a laptop on a speeding train and want to spit out metrics to the Python logging module so we can see what's being generated. We can do this:

import markus

markus.configure(
    backends=[
        {
            'class': 'markus.backends.logging.LoggingMetrics'
        }
    ]
)

That will spit out lines to Python logging. Now I can see metrics getting generated while I'm testing my code.

I'm ready to put my code in production, so let's add a statsd backend, too:

import markus

markus.configure(
    backends=[
        {
            # Log metrics to the logs
            'class': 'markus.backends.logging.LoggingMetrics',
        },
        {
            # Log metrics to statsd
            'class': 'markus.backends.statsd.StatsdMetrics',
            'options': {
                'statsd_host': 'statsd.example.com',
                'statsd_port': 8125,
                'statsd_prefix': '',
            }
        }
    ]
)

That's it. Tada!

Markus can support any number of backends. You can send data to multiple statsd servers. You can use the LoggingRollupBackend which will generate statistics every flush_interval of count, current, min, and max for incr stats and count, min, average, median, 95%, and max for timing/histogram stats for metrics data.
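
For a sense of what a roll-up of timing values involves, here's a small standard-library sketch that computes the statistics listed above for one flush interval. It's an illustration only: rollup_timings is a hypothetical helper, and the nearest-rank method used for the 95% figure is my assumption here, not necessarily how Markus computes it.

```python
import statistics


def rollup_timings(values):
    """Summarize timing values collected during one flush interval."""
    ordered = sorted(values)
    count = len(ordered)
    if count == 0:
        raise ValueError("no timing values to roll up")
    # Nearest-rank 95th percentile: rank = ceil(0.95 * count), done in
    # integer arithmetic to avoid floating point surprises.
    rank = (95 * count + 99) // 100
    return {
        "count": count,
        "min": ordered[0],
        "average": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "95%": ordered[rank - 1],
        "max": ordered[-1],
    }


# Hypothetical data: timings of 1 through 100 milliseconds.
summary = rollup_timings(range(1, 101))
```

With timings of 1 through 100, this gives a count of 100, a min of 1, an average and median of 50.5, a 95% value of 95, and a max of 100.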

If Markus doesn't have the backends you need, writing your own metrics backend is straightforward.

For more details, see the usage documentation and the backends documentation.

Second, writing code to generate metrics is straightforward and easy to do.

Much like the Python logging module, you add import markus at the top of the Python module and get a metrics interface. The interface can be module-level or in a class. It doesn't matter.

Here's a module-level metrics example:

import markus

metrics = markus.get_metrics(__name__)

Then you use it:

@metrics.timer_decorator('chopping_vegetables')
def some_long_function(vegetables):
    for veg in vegetables:
        # chop_vegetable is whatever does the actual work in your code
        chop_vegetable(veg)
        metrics.incr('vegetable', 1)

That's it. No bootstrapping problems, nice handling of metrics key prefixes, decorators, context managers, and so on. You can use multiple metrics interfaces in the same file. You can pass them around. You can reconfigure the metrics client and backends dynamically while your program is running.
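
One way to get the "no bootstrapping problems" property is to have the metrics interface look up the configured backends at the moment a metric is emitted rather than when the interface is created. The sketch below shows that pattern with the standard library only. The names configure and get_metrics mirror the examples above, but this is a toy re-implementation for illustration and not Markus internals. RecordingBackend is hypothetical.

```python
_backends = []  # module-level registry shared by every interface


def configure(backends):
    """Replace the active backends; safe to call at any time."""
    _backends[:] = backends


class MetricsInterface:
    def __init__(self, prefix):
        self.prefix = prefix

    def incr(self, stat, value=1):
        # Backends are looked up at emit time, not at creation time, so
        # interfaces created before configure() still work afterwards.
        for backend in _backends:
            backend.incr(self.prefix + "." + stat, value)


def get_metrics(prefix):
    return MetricsInterface(prefix)


class RecordingBackend:
    """Toy backend that remembers what it was sent."""

    def __init__(self):
        self.records = []

    def incr(self, key, value):
        self.records.append((key, value))


metrics = get_metrics("kitchen")   # created before any configuration
metrics.incr("vegetable")          # no backends yet, so nothing happens

backend = RecordingBackend()
configure([backend])
metrics.incr("vegetable")          # now reaches the backend
```

The module that emits metrics never needs to know when or how configuration happened, which is the same trade the Python logging module makes with loggers and handlers.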

For more details, see the metrics overview documentation.

Third, testing metrics generation is easy to do.

Markus provides a MetricsMock to make testing easier:

import markus
from markus.testing import MetricsMock


def test_something():
    with MetricsMock() as mm:
        # ... Do things that might publish metrics

        # This helps you debug and write your test
        mm.print_records()

        # Make assertions on metrics published
        assert mm.has_metric(markus.INCR, 'some.key', {'value': 1})

I use it with pytest on my projects, but it is testing-system agnostic.

For more details, see the testing documentation.

Why not use statsd directly?

You can definitely use statsd/dogstatsd libraries directly, but using Markus is a lot easier.

With Markus you don't have to worry about the order in which you create/configure the statsd client versus using the statsd client. You don't have to pass around the statsd client. It's a lot easier to use in Django and Flask where bootstrapping the app and passing things around is tricky sometimes.

With Markus you get to degrade to sending metrics data to the Python logging library which helps surface issues in development. I've had a few occasions when I thought I wrote code to send data, but it turns out I hadn't or that I had messed up the keys or tags.

With Markus you get a testing mock which lets you write tests guaranteeing that your code is generating metrics the way you're expecting.

If you go with using the statsd/dogstatsd libraries directly, that's fine, but you'll probably want to write some/most of these things yourself.

Where to go for more

For more specifics on this release, see here: https://markus.readthedocs.io/en/latest/history.html#october-30th-2017

Documentation and quickstart here: https://markus.readthedocs.io/en/latest/index.html

Source code and issue tracker here: https://github.com/willkg/markus

Let me know whether this helps you!

Using Localstack for a fake AWS S3 for local development

Friday April 28, 2017 14:00, Will Kahn-Greene

Summary

Over the last year, I rewrote the Socorro collector which is the edge of the Mozilla crash ingestion pipeline. Antenna (the new collector) receives crashes from Breakpad clients as HTTP POSTs, converts the POST payload into JSON, and then saves a bunch of stuff to AWS S3. One of the problems with this is that it's a pain in the ass to do development on my laptop without being connected to the Internet and using AWS S3.

This post covers the various things I did to have a locally running fake AWS S3 service.

Everett v0.9 released and why you should use Everett

Friday April 07, 2017 10:00, Will Kahn-Greene

What is it?

Everett is a Python configuration library.

Configuration with Everett:

is composable and flexible
makes it easier to provide helpful error messages for users trying to configure your software
supports auto-documentation of configuration with a Sphinx autocomponent directive
supports easy testing with configuration override
can pull configuration from a variety of specified sources (environment, ini files, dict, write-your-own)
supports parsing values (bool, int, lists of things, classes, write-your-own)
supports key namespaces
supports component architectures
works with whatever you're writing--command line tools, web sites, system daemons, etc

Everett is inspired by python-decouple and configman.

v0.9 released!

This release focused on overhauling the Sphinx extension. It now:

has an Everett domain
supports roles
indexes Everett components and options
looks a lot better

This was the last big thing I wanted to do before doing a 1.0 release. I consider Everett 0.9 to be a solid beta. Next release will be a 1.0.

Why you should take a look at Everett

At Mozilla, I'm using Everett 0.9 for Antenna which is running in our -stage environment and will go to -prod very soon. Antenna is the edge of the crash ingestion pipeline for Mozilla Firefox.

When writing Antenna, I started out with python-decouple, but I didn't like the way python-decouple dealt with configuration errors (it's pretty hands-off) and I really wanted to automatically generate documentation from my configuration code. Why write the same stuff twice especially where it's a critical part of setting Antenna up and the part everyone will trip over first?

Here's the configuration documentation for Antenna:

http://antenna.readthedocs.io/en/latest/configuration.html#application

Here's the index which makes it easy to find things by component or by option (in this case, environment variables):

http://antenna.readthedocs.io/en/latest/genindex.html

When you configure Antenna incorrectly, it spits out an error message like this:

1  <traceback omitted, but it'd be here>
2  everett.InvalidValueError: ValueError: invalid literal for int() with base 10: 'foo'
3  namespace=None key=statsd_port requires a value parseable by int
4  Port for the statsd server
5  For configuration help, see https://antenna.readthedocs.io/en/latest/configuration.html

So what's here?

Block 1 is the traceback so you can trace the code if you need to.
Line 2 is the exception type and message
Line 3 tells you the namespace, key, and parser used
Line 4 is the documentation for that specific configuration option
Line 5 is the "see also" documentation for the component with that configuration option

Is it beautiful? No. [1] But it gives you enough information to know what the problem is and where to go for more information.

Further, in Python 3, Everett will always raise a subclass of ConfigurationError so if you don't like the output, you can tailor it to your project's needs. [2]
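
Here's a minimal sketch of the exception-wrapping pattern described above, using Python 3 exception chaining. The two exception classes are stand-ins that borrow the names mentioned in this post, and get_option is a hypothetical helper. This is not Everett's code, just the shape of the idea: wrap the parser's error in one project-specific exception type and attach the key, parser, and documentation to the message.

```python
class ConfigurationError(Exception):
    """Stand-in base class for every configuration problem."""


class InvalidValueError(ConfigurationError):
    """Stand-in for 'the value was set but could not be parsed'."""


def get_option(source, key, parser, doc, see_also):
    raw = source[key]
    try:
        return parser(raw)
    except Exception as exc:
        message = "\n".join([
            "%s: %s" % (type(exc).__name__, exc),
            "key=%s requires a value parseable by %s" % (key, parser.__name__),
            doc,
            "For configuration help, see %s" % see_also,
        ])
        # "raise ... from" keeps the original exception as __cause__,
        # so the underlying error is not lost.
        raise InvalidValueError(message) from exc


try:
    get_option(
        {"statsd_port": "foo"},
        "statsd_port",
        int,
        doc="Port for the statsd server",
        see_also="https://antenna.readthedocs.io/en/latest/configuration.html",
    )
except ConfigurationError as exc:
    caught = exc
```

Catching the base class in one place is what lets a project reformat or re-log configuration errors however it likes.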

First-class docs. First-class configuration error help. First-class testing. This is why I created Everett.

If this sounds useful to you, take it for a spin. It's almost a drop-in replacement for python-decouple [3] and os.environ.get('CONFIGVAR', 'default_value') style of configuration.

Enjoy!

[1]	I would love some help here--making that information easier to parse would be great for a 1.0 release.

[2]	Python 2 doesn't support exception chaining and I didn't want to stomp on the original exception thrown, so in Python 2, Everett doesn't wrap exceptions.

[3]	python-decouple is a great project and does a good job at what it was built to do. I don't mean to demean it in any way. I have additional requirements that python-decouple doesn't do well and that's where I'm coming from.

Where to go for more

For more specifics on this release, see here: http://everett.readthedocs.io/en/latest/history.html#april-7th-2017

Documentation and quickstart here: https://everett.readthedocs.org/en/v0.9/

Source code and issue tracker here: https://github.com/willkg/everett

Bleach v2.0 released!

Wednesday March 08, 2017 14:00, Will Kahn-Greene

What is it?

Bleach is a Python library for sanitizing and linkifying text from untrusted sources for safe usage in HTML.

Bleach v2.0 released!

Bleach 2.0 is a massive rewrite. Bleach relies on the html5lib library. html5lib 0.99999999 (8 9s) changed the APIs that Bleach was using to sanitize text. As such, in order to support html5lib >= 0.99999999 (8 9s), I needed to rewrite Bleach.

Before embarking on the rewrite, I improved the tests and added a set of tests based on XSS example strings from the OWASP site. Spending quality time with tests before a rewrite or refactor is both illuminating (you get a better understanding of what the requirements are) and also immensely helpful (you know when your rewrite/refactor differs from the original). That was time well spent.

Given that I was doing a rewrite anyways, I decided to take this opportunity to break the Bleach API to make it more flexible and easier to use:

added Cleaner and Linker classes that you can create once and reuse to reduce redundant work--suggested in #125
created BleachSanitizerFilter which is now an html5lib filter that can be used anywhere you can use an html5lib filter
created LinkifyFilter as an html5lib filter that can be used anywhere you use an html5lib filter including as part of cleaning allowing you to clean and linkify in one pass--suggested in #46
changed arguments for attribute callables and linkify callbacks
and so on

During and after the rewrite, I improved the documentation, converting all the examples to doctest format so they're testable and verifiable, and adding examples where there weren't any. This uncovered bugs in the documentation and pointed out some annoyances with the new API.

As I rewrote and refactored code, I focused on making the code simpler and easier to maintain going forward and also documented the intentions so I and others can know what the code should be doing.

I also adjusted the internals to make it easier for users to extend, subclass, swap out and whatever else to adjust the functionality to meet their needs without making Bleach harder to maintain for me or less safe because of additional complexity.

For API-adjustment inspiration, I went through the Bleach issue tracker and tried to address every possible issue with this update: infinite loops, unintended behavior, inflexible APIs, suggested refactorings, features, bugs, etc.

The rewrite took a while. I tried to be meticulous because this is a security library and it's a complicated problem domain and I was working on my own during slow times on work projects. When working on one's own, you don't have the benefit of review. Making sure to have good test coverage and waiting a day to self-review after posting a PR caught a lot of issues. I also went through each PR and added comments explaining why I did things to give context to future me. Those habits help a lot, but probably aren't as good as a code review by someone else.

Some stats

OMG! This blog post is so boring! Walls of text everywhere so far!

There were 61 commits between v1.5 and v2.0:

Vadim Kotov: 1
Alexandr N. Zamaraev: 2
me: 58

I closed out 22 issues--possibly some more.

The rewrite has the following git diff --shortstat:

64 files changed, 2330 insertions(+), 1128 deletions(-)

Lines of code for Bleach 1.5:

~/mozilla/bleach> cloc bleach/ tests/
      11 text files.
      11 unique files.
       0 files ignored.

http://cloc.sourceforge.net v 1.60  T=0.07 s (152.4 files/s, 25287.2 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          11            353            200           1272
-------------------------------------------------------------------------------
SUM:                            11            353            200           1272
-------------------------------------------------------------------------------
~/mozilla/bleach>

Lines of code for Bleach 2.0:

~/mozilla/bleach> cloc bleach/ tests/
      49 text files.
      49 unique files.
      36 files ignored.

http://cloc.sourceforge.net v 1.60  T=0.13 s (101.7 files/s, 20128.5 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          13            545            406           1621
-------------------------------------------------------------------------------
SUM:                            13            545            406           1621
-------------------------------------------------------------------------------
~/mozilla/bleach>

Some off-the-cuff performance benchmarks

I ran some timings between Bleach 1.5 and various uses of Bleach 2.0 on the Standup corpus.

Here are the results:

what?	time to clean and linkify
Bleach 1.5	1m33s
Bleach 2.0 (no code changes)	41s
Bleach 2.0 (using Cleaner and Linker)	10s
Bleach 2.0 (clean and linkify--one pass)	7s

How'd I compute the timings?

I'm using the Standup corpus which has 42000 status messages in it. Each status message is like a tweet--it's short, has some links, possibly has HTML in it, etc.
I wrote a timing harness that goes through all those status messages and times how long it takes to clean and linkify the status message content, accumulates those timings and then returns the total time spent cleaning and linking.
I ran that 10 times and took the median. The timing numbers were remarkably stable and there was only a few seconds' difference between the high and low for all of the sets.
I wrote the median number down in that table above.
Then I'd adjust the code as specified in the table and run the timings again.
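
A harness along those lines could look roughly like the sketch below. This is not the actual harness used for the numbers above; the names time_corpus and median_timing are made up for illustration, and process stands in for whatever clean-and-linkify call is being measured.

```python
import statistics
import time


def time_corpus(process, corpus):
    """Return the total seconds spent calling process() on each item."""
    total = 0.0
    for text in corpus:
        start = time.perf_counter()
        process(text)
        # Accumulate only the time spent inside process().
        total += time.perf_counter() - start
    return total


def median_timing(process, corpus, runs=10):
    """Time the whole corpus `runs` times and return the median total."""
    return statistics.median(time_corpus(process, corpus) for _ in range(runs))


if __name__ == "__main__":
    import html

    # Stand-in corpus and stand-in processing function (assumptions).
    corpus = ["status <b>update</b> http://example.com"] * 1000
    print(median_timing(html.escape, corpus))
```

Taking the median rather than the mean keeps one slow run (a busy machine, a cold cache) from skewing the number that ends up in the table.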

I have several observations/thoughts:

First, holy moly--1m33s to 7s is a HUGE performance improvement.

Second, just switching from Bleach 1.5 to 2.0 and making no code changes (in other words, keeping your calls as bleach.clean and bleach.linkify rather than using Cleaner and Linker and LinkifyFilter) gets you a lot. Depending on whether you have attribute filter callables and linkify callbacks, you may be able to just upgrade the libs and WIN!

Third, switching to reusing Cleaner and Linker also gets you a lot.

Fourth, your mileage may vary depending on the nature of your corpus. For example, Standup status messages are short, so if your text fragments are larger, you may see more savings from cleaning and linkifying in one pass because HTML parsing takes more time.

How to upgrade

Upgrading should be straightforward.

Here's the minimal upgrade path:

Update Bleach to 2.0 and html5lib to >= 0.99999999 (8 9s).
If you're using attribute callables, you'll need to update them.
If you're using linkify callbacks, you'll need to update them.
Read through version 2.0 changes for any other backwards-incompatible changes that might affect you.
Run your tests and see how it goes.
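
For the two callback steps, here's a hedged sketch of what updated callables can look like. In 2.0, attribute callables take (tag, name, value) instead of (name, value), and the attrs dict passed to linkify callbacks is keyed by (namespace, name) tuples instead of plain strings. The function names and the INTERNAL_HOSTS set are hypothetical; check the version 2.0 changes for the full details.

```python
from urllib.parse import urlparse

# Hypothetical set of hosts that count as "ours".
INTERNAL_HOSTS = {"example.com"}


def allow_https_src(tag, name, value):
    """Attribute callable, 2.0 style: receives tag, name and value."""
    if tag == "img" and name == "src":
        # Only keep image sources served over https.
        return urlparse(value).scheme == "https"
    return tag == "img" and name == "alt"


def set_target(attrs, new=False):
    """Linkify callback, 2.0 style: attrs keys are (namespace, name)."""
    host = urlparse(attrs.get((None, "href"), "")).netloc
    if host in INTERNAL_HOSTS:
        attrs.pop((None, "target"), None)
    else:
        # Open external links in a new tab.
        attrs[(None, "target")] = "_blank"
    return attrs
```

These would then be passed in the usual way, something like bleach.clean(text, tags=['img'], attributes=allow_https_src) and bleach.linkify(text, callbacks=[set_target]).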

Note

If you're using html5lib 1.0b8, then you have to explicitly upgrade the version. 1.0b8 is equivalent to html5lib 0.9999999 (7 9s) and that's not supported by Bleach 2.0.

You have to explicitly upgrade because pip will think that 1.0b8 comes after 0.99999999 (8 9s) and it doesn't. So it won't upgrade html5lib for you.

If you're doing 9s, make sure to upgrade to 0.99999999 (8 9s) or higher.

If you're doing 1.0bs, make sure to upgrade to 1.0b9 or higher.
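
To see why pip gets this "wrong", here's a simplified stand-in for the version ordering it applies. This is an illustration only: sketch_version_key is a made-up helper that handles just the plain and "bN" version shapes mentioned above, and the real rules (PEP 440) cover many more cases.

```python
import re


def sketch_version_key(version):
    """Simplified sort key: release numbers first, then beta before final."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:b(\d+))?", version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    beta = match.group(2)
    # A beta sorts before the final release with the same release numbers.
    # (Unlike PEP 440, this sketch does not pad releases of unequal length.)
    stage = (0, int(beta)) if beta is not None else (1, 0)
    return release, stage


versions = ["1.0b8", "0.99999999", "0.9999999", "1.0b9"]
print(sorted(versions, key=sketch_version_key))
# ['0.9999999', '0.99999999', '1.0b8', '1.0b9']
```

The release numbers are compared first, so anything starting with 1 sorts after anything starting with 0, no matter how many 9s there are. That's why 1.0b8 looks newer than 0.99999999 (8 9s) even though it isn't.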

If you want better performance:

Switch to reusing bleach.sanitizer.Cleaner and bleach.linkifier.Linker.

If you have large text fragments:

Switch to reusing bleach.sanitizer.Cleaner and set filters to include LinkifyFilter which lets you clean and linkify in one step.
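
As a minimal sketch of both options, assuming Bleach 2.0 or later is installed and the default allowed tags and attributes are fine for your use (the helper function names are made up for illustration):

```python
from bleach.linkifier import Linker, LinkifyFilter
from bleach.sanitizer import Cleaner

# Build these once (for example at import time) and reuse them for
# every string instead of calling bleach.clean / bleach.linkify.
cleaner = Cleaner()
linker = Linker()

# A Cleaner that also linkifies as part of the same pass.
one_pass_cleaner = Cleaner(filters=[LinkifyFilter])


def clean_then_linkify(text):
    """Two passes over the text, but no per-call setup."""
    return linker.linkify(cleaner.clean(text))


def clean_and_linkify(text):
    """One pass: LinkifyFilter runs while the text is being cleaned."""
    return one_pass_cleaner.clean(text)
```

The exact attributes on the generated links (rel="nofollow", for instance) depend on the callbacks in use and on the Bleach version, so compare the output of the two approaches on your own corpus before switching.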

Many thanks

Many thanks to James Socol (previous maintainer) for walking me through why things were the way they were.

Many thanks to Geoffrey Sneddon (html5lib maintainer) for answering questions, helping with problems I encountered and all his efforts on html5lib, which is a huge library that he works on in his spare time and for which he doesn't get anywhere near enough gratitude.

Many thanks to Lonnen (my manager) who heard me talk about html5lib zero point nine nine nine nine nine nine nine nine a bunch.

Also, many thanks to Mozilla for letting me work on this during slow periods of the projects I should be working on. A bunch of Mozilla sites use Bleach, but none of mine do.

Where to go for more

For more specifics on this release, see here: https://bleach.readthedocs.io/en/latest/changes.html#version-2-0-march-8th-2017

Documentation and quickstart here: https://bleach.readthedocs.org/en/v2.0/

Source code and issue tracker here: https://github.com/mozilla/bleach

Who uses my stuff?

Friday February 24, 2017 11:00, Will Kahn-Greene

Summary

I work on a lot of different things. Some are applications, some are libraries, some I started, some other people started, etc. I have way more stuff to do than I could possibly get done, so I try to spend my time on things "that matter".

For Open Source software that doesn't have an established community, this is difficult.

This post is a wandering stream of consciousness covering my journey figuring out who uses Bleach.