Chrome has fuzzing infrastructure called ClusterFuzz, which randomly bombards various components with input in an effort to find bugs. Because the inputs are randomly generated, the bugs it mainly finds are sets of inputs or user actions that crash Chrome.
Some of the bugs ClusterFuzz finds are in Skia. We would like to find these bugs before ClusterFuzz does, and to find bugs in APIs that Chrome does not use, by running more focused tests than are possible with ClusterFuzz.
A fuzz is some form of random input into Skia. There are two categories of fuzzes: binary and API.
A binary fuzz is a randomly generated or mutated file that Skia imports, parses, or loads. Binary fuzzes can be of several file types, .gif being one example. They are generated efficiently and tested effectively using a tool called afl-fuzz.
An API fuzz is a unit-test-like program that exercises the public APIs of Skia. The inputs to API fuzzes are also generated by afl-fuzz, which creates a file of random bytes that are interpreted as “random” numbers for use in the program. The API fuzz uses these numbers to exercise the APIs, e.g. by generating a random SVG string and then parsing it. afl-fuzz is smart enough to try numbers like infinity, NaN, and so on in interesting places, so this works quite well.
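To make the byte-to-number step concrete, here is a minimal sketch of how a file of random bytes could be consumed as typed values and used to build a path-like string. It is not the actual Fuzz.h interface: the `FuzzBytes` class, its method names, the zero-padding behaviour, and the toy `random_path_string` helper are all assumptions made for illustration.

```python
import struct


class FuzzBytes:
    """Hypothetical reader that turns a fuzz file's raw bytes into typed values."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0

    def _take(self, n: int) -> bytes:
        # Assumption: pad with zero bytes once the input runs out,
        # so a short fuzz file never makes a read fail.
        chunk = self._data[self._pos:self._pos + n]
        self._pos += n
        return chunk.ljust(n, b"\x00")

    def next_u8(self) -> int:
        return self._take(1)[0]

    def next_i32(self) -> int:
        return struct.unpack("<i", self._take(4))[0]

    def next_f32(self) -> float:
        # Every 32-bit pattern decodes to some float, so NaN and
        # infinity show up whenever the fuzzer emits those patterns.
        return struct.unpack("<f", self._take(4))[0]

    def next_range(self, lo: int, hi: int) -> int:
        """Map the next byte into the inclusive range [lo, hi]."""
        return lo + self.next_u8() % (hi - lo + 1)


def random_path_string(fuzz: FuzzBytes) -> str:
    """Build a small SVG-like path string from fuzz bytes (illustrative only)."""
    commands = "MLZ"
    parts = []
    for _ in range(fuzz.next_range(1, 4)):
        cmd = commands[fuzz.next_range(0, len(commands) - 1)]
        if cmd == "Z":
            parts.append(cmd)
        else:
            parts.append(f"{cmd}{fuzz.next_u8()} {fuzz.next_u8()}")
    return " ".join(parts)
```

The resulting string would then be handed to the API under test. Because every decision is driven by the input bytes, a given fuzz file always reproduces the same sequence of API calls.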
We are not so much interested in the graphical output of a fuzz, but rather whether the fuzz caused Skia to crash or not. A fuzz that induces a crash is called a bad fuzz. Once Skia has been patched and a bad fuzz no longer crashes Skia, that fuzz is then called a grey fuzz.
Fuzzes are named by a SHA1 hash of their contents, i.e. the input itself. A fuzz's name does not change after it is created.
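As a small illustration of content-based naming, the sketch below derives a name from the raw bytes of a fuzz using Python's standard `hashlib`. The function name `fuzz_name` and the choice of a lowercase hex digest are assumptions; the text only states that the name is a SHA1 hash of the contents.

```python
import hashlib


def fuzz_name(contents: bytes) -> str:
    """Return a name derived only from the fuzz's bytes (hex SHA1 digest)."""
    return hashlib.sha1(contents).hexdigest()
```

Since the name depends only on the bytes, the same input found by two generators gets the same name, and the name stays stable as the fuzz moves from bad to grey.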
The fuzzing infrastructure consists of several components that can run independently:
Generators continuously generate random fuzzes and execute them against Skia to see if they cause a crash (i.e. are bad fuzzes). There is one generator for each fuzz category; categories include skpicture, skcodec_mode, skcodec_scale, api_parse_path, and so on. Throughout the code base, the term fuzzer is used as a synonym for a specific generator, e.g. the “skpicture fuzzer”.
Although binary and API fuzzes are conceptually different, because they both use afl-fuzz as their source of “random” inputs, the generators are very similar, and have an identical interface.
fuzz.cpp is the Skia binary that controls most of the generator logic. The --bytes flag is the input point for the source of randomness produced by afl-fuzz. For binary fuzzes, this will be a mutated SkPicture, .png file, or similar. (To generate these, binary fuzzes require some sample binary files; the smaller and faster to parse, i.e. < 100kb and < 50ms, the better.) For API fuzzes, this will just be random bytes that Fuzz.h reads and formats into “random” integers, floats, etc.
fuzz.cpp is built as Release and instrumented with afl-fuzz's instrumentation, similar to a code coverage tool. This allows afl-fuzz to find bad fuzzes that crash in different ways, unlike a naive approach, which may find thousands of bad fuzzes that all exercise the same bad execution path.
The fuzz generators simply run afl-fuzz on as many cores as wanted, one or more for each fuzz category. These afl-fuzz processes dump their results to disk, where the aggregator will scan to detect new ones (see below). When a new version of Skia is “under fuzz”, all afl-fuzz seeds are updated for a fresh analysis.
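The "one or more afl-fuzz processes per category" arrangement can be sketched as a function that builds the command line for each process. The `-i`, `-o`, `-M`, and `-S` options and the `@@` input-file placeholder are standard afl-fuzz usage; everything else is an assumption: the directory layout, the instance naming, and passing the input as `--bytes @@`. Any extra flags the fuzz binary needs to select a category are not described in the text and are left out.

```python
from typing import Dict, List


def afl_command(category: str, instance: int, seeds_dir: str,
                out_dir: str, fuzz_binary: str) -> List[str]:
    """Build one afl-fuzz invocation for a category (hypothetical layout)."""
    # afl-fuzz's parallel mode: one main instance (-M) and any number of
    # secondary instances (-S) that share the same output directory.
    role = "-M" if instance == 0 else "-S"
    return [
        "afl-fuzz",
        "-i", f"{seeds_dir}/{category}",   # sample inputs for this category
        "-o", f"{out_dir}/{category}",     # results the aggregator would scan
        role, f"{category}{instance}",
        "--", fuzz_binary, "--bytes", "@@",  # assumption: input passed via --bytes
    ]


def plan(cores_per_category: Dict[str, int], seeds_dir: str,
         out_dir: str, fuzz_binary: str) -> List[List[str]]:
    """Return one command per requested core across all categories."""
    return [
        afl_command(category, i, seeds_dir, out_dir, fuzz_binary)
        for category, cores in cores_per_category.items()
        for i in range(cores)
    ]
```

Each returned list could be passed to `subprocess.Popen` to start the process; the sketch stops at building the commands so it can be inspected without afl-fuzz installed.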
The aggregator will find new bad fuzzes, create some analytics for them, and upload both the fuzzes and the analytics to Google Storage. One aggregator will run on every VM that is running one or more generators.
Every time it triggers (say, once per minute), it finds new fuzzes and runs them against several different builds of Skia, recording the output as metadata/analytics. For now, those builds are (Debug, Release) x (Clang, AddressSanitizer). These builds are built from the same Skia commit that the generator used. The output is parsed into analytics that include a stacktrace and several flags, such as whether any version crashed, whether asserts were hit, whether AddressSanitizer found anything, etc.
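The build matrix and the flag summary described above can be sketched as follows. The configuration naming (`Debug-Clang` and so on), the per-build result fields, and the flag names are hypothetical; the text only says that the analytics include a stacktrace and flags of this kind.

```python
from itertools import product
from typing import Dict, List

BUILD_TYPES = ("Debug", "Release")
TOOLCHAINS = ("Clang", "AddressSanitizer")


def build_matrix() -> List[str]:
    """Enumerate (Debug, Release) x (Clang, AddressSanitizer)."""
    return [f"{build}-{tool}" for build, tool in product(BUILD_TYPES, TOOLCHAINS)]


def summarize(results: Dict[str, Dict[str, bool]]) -> Dict[str, bool]:
    """Collapse per-build results into flags for one fuzz.

    `results` maps a configuration name to hypothetical boolean fields
    'crashed', 'assert_hit', and 'asan_error'.
    """
    runs = list(results.values())
    return {
        "any_crashed": any(r["crashed"] for r in runs),
        "assert_hit": any(r["assert_hit"] for r in runs),
        "asan_error": any(r["asan_error"] for r in runs),
    }
```

A fuzz that only trips a Debug assert, for example, would get `assert_hit` set while `asan_error` stays false, which is the kind of distinction the front end can later filter on.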
After analyzing the fuzz, the aggregator deduplicates it against all other bad fuzzes. If something like it has already been seen (e.g. it has the same top 5 stacktrace frames and flags), it will be skipped. The first iteration did not do this deduplication, and even though afl-fuzz seeks out different execution paths, there were many duplicates. This deduplication strategy is not perfect, but it removes a lot of obvious duplication, improving the signal-to-noise ratio.
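The deduplication rule given as an example (same top 5 stacktrace frames and same flags) can be sketched as a set of keys that have already been seen. The `Analysis` type, its field names, and the in-memory set are assumptions for illustration; the real aggregator may store and compare this data differently.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple


@dataclass(frozen=True)
class Analysis:
    """Hypothetical analytics record for one bad fuzz."""
    stack_frames: Tuple[str, ...]  # innermost frame first
    flags: FrozenSet[str]


def dedup_key(analysis: Analysis, depth: int = 5):
    """Two fuzzes count as duplicates when their top `depth` frames and flags match."""
    return (analysis.stack_frames[:depth], analysis.flags)


class Deduplicator:
    def __init__(self) -> None:
        self._seen: Set = set()

    def is_new(self, analysis: Analysis) -> bool:
        """Return True the first time a key is seen, False for duplicates."""
        key = dedup_key(analysis)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

Frames below the fifth are ignored, so two crashes that reach the same failing code through different callers collapse into one report. This is the trade-off the paragraph mentions: some distinct bugs may be merged in exchange for far less noise.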
The aggregator uploads the non-duplicate bad fuzzes and the analytics to Google Storage.
When a new version of Skia is “under fuzz”, the aggregator is used to download all old fuzzes and re-analyze them to see if they stop crashing (or regress), creating new analytics for them.
In the event the storage requirement becomes too large on Google Storage, we can use a sanitizer to trim that down. The sanitizer will periodically purge grey fuzzes from Google Storage older than some threshold. The sanitizer will also be in charge of deleting API fuzzes that are obsolete (e.g. test code that has been removed/deprecated).
Care will be taken to keep some of the older ones to easily detect regressions. For example, if there are 600 old grey fuzzes with the same stack trace, the sanitizer might delete all but a few of these.
The sanitizer functionality could be implemented as part of the Web Front End.
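The retention rule for the sanitizer (purge old grey fuzzes but keep a few per stack trace) can be sketched as below. The `GreyFuzz` record, the default thresholds, and the choice to keep the newest fuzzes in each group are all assumptions; the text leaves both the age threshold and the number kept open.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class GreyFuzz(NamedTuple):
    """Hypothetical record for a grey fuzz held in storage."""
    name: str
    trace_key: str   # e.g. a digest of the stack trace
    age_days: int


def select_for_deletion(fuzzes: List[GreyFuzz], min_age_days: int = 90,
                        keep_per_trace: int = 3) -> List[str]:
    """Return names of old grey fuzzes to purge, keeping a few per stack trace."""
    by_trace: Dict[str, List[GreyFuzz]] = defaultdict(list)
    for fuzz in fuzzes:
        if fuzz.age_days >= min_age_days:   # younger fuzzes are never purged
            by_trace[fuzz.trace_key].append(fuzz)

    doomed: List[str] = []
    for group in by_trace.values():
        group.sort(key=lambda f: f.age_days)   # newest first
        doomed.extend(f.name for f in group[keep_per_trace:])
    return doomed
```

With 600 old grey fuzzes sharing one stack trace and `keep_per_trace=3`, this selects 597 for deletion and leaves three behind to catch regressions.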
Skia developers are able to visually browse the history of the fuzzers and get quick access to any failing fuzzes. The front end has the ability to filter fuzzes by file, function and line number, as well as by any of the analytic tags found during aggregation.
The Go web server will periodically:
When web clients make a request for fuzzes by file, function, or line, the web server will slice off a piece of the intermediate tree, convert it to JSON, and return it to the client to be rendered. The client sorts and filters the data by the analytic tags.
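The slicing step can be sketched with a nested mapping of file, then function, then line number, ending in a list of fuzz records. This layout, the record fields, and the `slice_tree` name are assumptions, and the sketch is written in Python for illustration even though the server itself is described as a Go program.

```python
import json
from typing import Optional


def slice_tree(tree: dict, file: str, function: Optional[str] = None,
               line: Optional[int] = None) -> str:
    """Return the JSON for the part of the tree matching the request.

    `tree` is a hypothetical structure: file -> function -> line -> [fuzz records].
    Unknown keys yield an empty JSON object or list instead of an error.
    """
    node = tree.get(file, {})
    if function is not None:
        node = node.get(function, {})
        if line is not None:
            node = node.get(line, [])
    # Note: json.dumps writes integer line numbers as string keys.
    return json.dumps(node, sort_keys=True)
```

The server returns the whole slice, tags included, and leaves sorting and tag filtering to the client as described above.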
If a developer is working on turning bad fuzzes into grey fuzzes, they may want to test their pre-committed or committed code against a series of bad fuzzes. The fuzz trybot will do exactly that.
There will likely be several settings for the trybots: run all bad fuzzes found this week, run all regressing fuzzes, run all bad and grey fuzzes from the last 3 months, run all bad and grey fuzzes from all time, etc.
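Some of these settings can be sketched as named presets that pair a set of statuses with an optional maximum age. The preset names, the record fields, and the 7-day and 90-day windows are assumptions; the "regressing fuzzes" setting is omitted because it would need per-fuzz history that this sketch does not model.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical presets: (statuses to include, maximum age or None for all time).
PRESETS: Dict[str, Tuple[Set[str], Optional[timedelta]]] = {
    "bad_this_week": ({"bad"}, timedelta(days=7)),
    "bad_and_grey_3_months": ({"bad", "grey"}, timedelta(days=90)),
    "bad_and_grey_all_time": ({"bad", "grey"}, None),
}


def select_fuzzes(fuzzes: List[dict], preset: str, now: datetime) -> List[str]:
    """Return the names of fuzzes a trybot run with this preset would execute.

    Each fuzz is a hypothetical dict with 'name', 'status', and 'found' keys.
    """
    statuses, max_age = PRESETS[preset]
    return [
        f["name"]
        for f in fuzzes
        if f["status"] in statuses
        and (max_age is None or now - f["found"] <= max_age)
    ]
```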
All components are run on GCE.
Executing continuously against HEAD makes little sense, because fuzzing takes some time to get good coverage (especially with afl-fuzz). To account for this, Skia will be pulled from the latest DEPS roll every week for fuzzing, unless a developer requests a fresh pull (e.g. after fixing some broken code and executing the fuzz trybot).