commit | 5a1903fdfba070c48be0a4b8caede2275a704469 | [log] [tgz] |
---|---|---|
author | Nigel Tao <nigeltao@golang.org> | Mon Jul 17 17:31:47 2017 +1000 |
committer | Nigel Tao <nigeltao@golang.org> | Mon Jul 17 17:31:47 2017 +1000 |
tree | f7e0892d2a0cc8f577b6189b34e247c4e25a309d | |
parent | c463f4956f336741b56ae1c00a0bd9298b476655 [diff] |
Add TODO to describe more keywords.
Puffs is a domain-specific language and library for parsing untrusted file formats safely. Examples of such file formats include images, audio, video, fonts and compressed archives.
Unlike the C programming language, Puffs is safe with respect to buffer overflows, integer arithmetic overflows and null pointer dereferences. The key difference between Puffs and other memory-safe languages is that all such checks are done at compile time, not at run time. If it compiles, it is safe, with respect to those three bug classes.
The aim is to produce software libraries that are as safe as Go or Rust, roughly speaking, but as fast as C, and that can be used anywhere C libraries are used. This includes very large C/C++ products, such as popular web browsers and operating systems (using that term to include desktop and mobile user interfaces, not just the kernel).
The trade-off in aiming for both safety and speed is that Puffs programs take longer for a programmer to write, as they have to explicitly annotate their programs with proofs of safety. A statement like x += 1
unsurprisingly means to increment the variable x
by 1
. However, in Puffs, such a statement is a compile time error unless the compiler can also prove that x
is not the maximal value of x
's type (e.g. x
is not 255
if x
is a u8
), as the increment would otherwise overflow. Similarly, an integer arithmetic expression like x / y
is a compile time error unless the compiler can also prove that y
is not zero.
Puffs is not a general purpose programming language. While technically possible, it is unlikely that a Puffs compiler would be worth writing in Puffs.
The std/gif/decode_lzw.puffs
file is a good example. See the “Poking Around” section below for more guidance.
For example, making this one-line edit to the GIF codec leads to a compile time error. puffs gen
fails to generate the C code, i.e. fails to compile (transpile) the Puffs code to C code:
$ git diff diff --git a/std/gif/decode_lzw.puffs b/std/gif/decode_lzw.puffs index f878c5e..f10dcee 100644 --- a/std/gif/decode_lzw.puffs +++ b/std/gif/decode_lzw.puffs @@ -98,7 +98,7 @@ pub func lzw_decoder.decode?(dst ptr buf1, src ptr buf1, src_final bool)() { in.dst.write?(x:s) if use_save_code { - this.suffixes[save_code] = c as u8 + this.suffixes[save_code] = (c + 1) as u8 this.prefixes[save_code] = prev_code as u16 } $ puffs gen std/gif check: expression "(c + 1) as u8" bounds [1..256] is not within bounds [0..255] at /home/n/go/src/github.com/google/puffs/std/gif/decode_lzw.puffs:101. Facts: n_bits < 8 c < 256 this.stack[s] == (c as u8) use_save_code
In comparison, this two-line edit will compile (but the “does it decode GIF correctly” tests then fail):
$ git diff diff --git a/std/gif/decode_lzw.puffs b/std/gif/decode_lzw.puffs index f878c5e..b43443d 100644 --- a/std/gif/decode_lzw.puffs +++ b/std/gif/decode_lzw.puffs @@ -97,8 +97,8 @@ pub func lzw_decoder.decode?(dst ptr buf1, src ptr buf1, src_final bool)() { // type checking, bounds checking and code generation for it). in.dst.write?(x:s) - if use_save_code { - this.suffixes[save_code] = c as u8 + if use_save_code and (c < 200) { + this.suffixes[save_code] = (c + 1) as u8 this.prefixes[save_code] = prev_code as u16 } $ puffs gen std/gif gen wrote: /home/n/go/src/github.com/google/puffs/gen/c/gif.c gen unchanged: /home/n/go/src/github.com/google/puffs/gen/h/gif.h $ puffs test std/gif gen unchanged: /home/n/go/src/github.com/google/puffs/gen/c/gif.c gen unchanged: /home/n/go/src/github.com/google/puffs/gen/h/gif.h test: /home/n/go/src/github.com/google/puffs/test/c/gif gif/basic.c clang PASS (8 tests run) gif/basic.c gcc PASS (8 tests run) gif/gif.c clang FAIL test_lzw_decode: bufs1_equal: wi: got 19311, want 19200. contents differ at byte 3 (in hex: 0x000003): 000000: dcdc dc00 00d9 f5f9 f6df dc5f 393a 3a3a ..........._9::: 000010: 3a3b 618e c8e4 e4e4 e5e4 e600 00e4 bbbb :;a............. 000020: eded 8f91 9191 9090 9090 9190 9192 9192 ................ 000030: 9191 9292 9191 9293 93f0 f0f0 f1f1 f2f2 ................ excerpts of got (above) versus want (below): 000000: dcdc dcdc dcd9 f5f9 f6df dc5f 393a 3a3a ..........._9::: 000010: 3a3a 618e c8e4 e4e4 e5e4 e6e4 e4e4 bbbb ::a............. 000020: eded 8f91 9191 9090 9090 9090 9191 9191 ................ 000030: 9191 9191 9191 9193 93f0 f0f0 f1f1 f2f2 ................ gif/gif.c gcc FAIL test_lzw_decode: bufs1_equal: wi: got 19311, want 19200. contents differ at byte 3 (in hex: 0x000003): 000000: dcdc dc00 00d9 f5f9 f6df dc5f 393a 3a3a ..........._9::: 000010: 3a3b 618e c8e4 e4e4 e5e4 e600 00e4 bbbb :;a............. 000020: eded 8f91 9191 9090 9090 9190 9192 9192 ................ 000030: 9191 9292 9191 9293 93f0 f0f0 f1f1 f2f2 ................ excerpts of got (above) versus want (below): 000000: dcdc dcdc dcd9 f5f9 f6df dc5f 393a 3a3a ..........._9::: 000010: 3a3a 618e c8e4 e4e4 e5e4 e6e4 e4e4 bbbb ::a............. 000020: eded 8f91 9191 9090 9090 9090 9191 9191 ................ 000030: 9191 9191 9191 9193 93f0 f0f0 f1f1 f2f2 ................ puffs-test-c: some tests failed puffs test: some tests failed
Parsing untrusted data, such as images downloaded from across the web, have a long history of security vulnerabilities. As of 2017, libpng is over 18 years old, and the PNG specification is dated 2003, but that well examined C library is still getting CVE's published in 2017.
Sandboxing and fuzzing can mitigate the danger, but they are reactions to C's fundamental unsafety. Newer programming languages remove entire classes of potential security bugs. Buffer overflows and null pointer dereferences are amongst the most well known.
Less well known are integer overflow bugs. Offset-length pairs, defining a sub-section of a file, are seen in many file formats, such as OpenType fonts and PDF documents. A conscientious C programmer might think to check that a section of a file or a buffer is within bounds by writing if (offset + length < end)
before processing that section, but that addition can silently overflow, and a maliciously crafted file might bypass the check.
A variation on this theme is where offset
is a pointer, exemplified by capnproto's CVE-2017-7892 and another example. For a pointer-typed offset, witnessing such a vulnerability can depend on both the malicious input itself and the addresses of the memory the software used to process that input. Those addresses can vary from run to run and from system to system, e.g. 32-bit versus 64-bit systems and whether dynamically allocated memory can have sufficiently high address values, and that variability makes it harder to reproduce and to catch such subtle bugs from fuzzing.
In C, some integer overflow is undefined behavior, as per the C99 spec section 3.4.3. In Go, integer overflow is silently ignored. In Rust, integer overflow is checked at run time in debug mode and silently ignored in release mode by default, as the run time performance penalty was deemed too great. In Swift, it‘s a run time error. In D, it’s configurable. Other languages like Python and Haskell can automatically spill into ‘big integers’ larger than 64 bits, but this can have a performance impact when such integers are used in inner loops.
Even if overflow is checked, it is usually checked at run time. Similarly, modern languages do their bounds checking at run time. An expression like a[i]
is really if ((0 <= i) && (i < a.length)) { use a[i] } else { throw }
, in mangled pseudo-code. Compilers for these languages can often eliminate many of these bounds checks, e.g. if i
is an iterator index, but not always all of them.
The run time cost is small, measured in nanoseconds. But if an image decoding library has to eat this cost per pixel, and you have a megapixel image, then nanoseconds become milliseconds, and milliseconds can matter.
In comparison, in Puffs, all bounds checks and arithmetic overflow checks happen at compile time, with zero run time overhead.
Puffs code (that is proved safe via explicit assertions) is compiled to C code (with those assertions removed) - it is transpiled. If you are a C/C++ programmer and just want to use the C edition of the Puffs standard library, then clone the repository and look at the files in the gen/c
and gen/h
directories. No other software tools are required and there are no library dependencies, other than C standard library concepts like <stdint.h>
's uint32_t
type and <string.h>
's memset
function.
If your C/C++ project is large, you might want both the .c files (adding each to your build system) and the .h files. If your C/C++ project is small, you might only need the .c files, not the .h files, as the .c files are designed to be a drop-in library. For example, if you want a GIF decoder, you only need gif.c
. See TODO for an example. More complicated decoders might require multiple .c files - multiple modules. For example, the PNG codec (TODO) requires the flate codec (TODO), but they are separate files, since HTTP can use also flate compression (also known as gzip or zlib, roughly speaking) without necessarily processing PNG images.
If you want to modify the Puffs standard library, or compile your own Puffs code, you will need to do a little more work, and will have to install at least the Go toolchain in order to build the Puffs tools. To run the test suite, you might also have to install C compilers like clang and gcc, as well as C libraries (and their .h files) like libjpeg and libpng, as some tests (TODO) compare that Puffs produces exactly the same output as these other libraries.
Puffs is not published yet, but the working assumption is that the eventual home of the software project will be at github.com/google/puffs
. After cloning the repository, move it to that path under your $GOPATH
.
Running go install -v github.com/google/puffs/cmd/...
will then install the puffs tools. The ones that you'll most often use are puffsfmt
(analogous to clang-format
, gofmt
or rustfmt
) and puffs
(roughly analogous to make
, go
or cargo
).
You should now be able to run puffs test
. If all goes well, you should see some output containing the word “PASS” multiple times.
Feel free to edit the std/gif/decode_lzw.puffs
file, which implements the GIF LZW decoder. After editing, run puffs gen std/gif
or puffs test std/gif
to re-generate the C edition of the Puffs standard library's GIF codec, and optionally run its tests.
Try deleting an assert statement and re-running puffs gen
. The result should be syntactically valid, but a compile error, as some bounds checks can no longer be proven.
Find the line var bits u32
, which declares the bits variable and initializes it to zero. Try adding bits -= 1
on a new line of code after it. Again, puffs gen
should fail, as the computation can underflow.
Similarly, replacing the line var n_bits u32
with var n_bits u32 = 10
should fail, as an n_bits < 8
assertion, a pre-condition, a few lines further down again cannot be proven.
Similarly, changing the 4095
in var prev_code u32[..4095]
either higher or lower should fail.
Try adding assert false
at various places, which should obviously fail, but should also cause puffs gen
to print what facts the compiler can prove at that point. This can be useful when debugging why Puffs can't prove something you think it should be able to.
If you've changed any of the tools (i.e. changed any .go
code), re-run go install -v github.com/google/puffs/cmd/...
and go test github.com/google/puffs/lang/...
.
If you‘ve changed any of the libraries (i.e. changed any .puffs
code), run puffs test
or, ideally, puffs test -mimic
to also check that Puffs’ output mimics (i.e. exactly matches) other libraries' output, such as giflib for GIF, libpng for PNG, etc.
If your library change is an optimization, run puffs bench
or puffs bench -mimic
both before and after your change to quantify the improvement. The mimic benchmark numbers should‘t change if you’re only changing .puffs
code, but seeing zero change in those numbers is a sanity check on any unrelated system variance, such as software updates or virus checkers running in the background.
lang
holds the Go libraries that implement the Puffs language: tokenizer, AST, parser, renderer, etc. The Puffs tools are written in Go, but as mentioned above, Puffs transpiles to C code, and Go is not necessarily involved if all you want is to use the C edition of Puffs.cmd
holds Puffs' command line tools, also written in Go.std
holds the Puffs standard library's code. The initial focus is on popular image codecs: BMP, GIF, JPEG, PNG, TIFF and WEBP.gen
holds the transpiled editions of that standard library. The initial focus is generating C code. Later on, the repository might include generated Go and Rust code.test
holds the tests for the Puffs standard library.script
holds miscellaneous utility programs.doc
holds documentation.example
holds example programs.For a guide on how various things work together, the “99ff8e2 Let fields have default values” commit is an example of adding new Puffs syntax and threading that all the way through to C code generation and testing.
Puffs is a language by itself, integrated with a compiler, not something embedded in the comments of another language's program, supported by a separate program checker, as the bounds check elimination affects the generated code.
Checked C has a similar goal of safety, but aims to be backwards compatible with C, including the idiomatic ways in which C code plays fast and loose with pointers and arrays, and implicitly inserting run time checks while maintaining existing data layout (e.g. the sizeof a struct type) for interoperability with other systems. This is a much harder problem than Puffs' approach of a new, simple, domain-specific programming language.
Extended Static Checker for Java (ESC/Java) and its successor OpenJML, which obviously target the Java programming language, similarly has to analyze a language that is more complicated that Puffs. Java is also not commonly used for writing e.g. low level image codecs, as the language lacks unsigned integer types, it is garbage collected and idiomatic code often allocates.
Spark ADA targets a subset of the Ada programming language, which again is more complicated than Puffs. It also allows for pluggable theorem provers such as Z3. This can increase programmer productivity in that it can take less effort to prove more programs safe, but it also affects reproducibility: when some programs that are provable in one programmer‘s configuration are unprovable in another. It’s also not obvious up front what effect different theorem provers, and their different heuristics, will have on compile times or proof times. In constrast, Puffs' proof system is fast (and dumb) instead of smart (and slow).
We‘re not claiming that “use a state of the art theorem prover” is an unworkable approach. It’s just a different approach, with different trade-offs.
Simpler languages are easier to prove things about. Macros, inheritance, closures, generics, operator overloading, goto's and built-in container types are all useful features, but as mentioned at the top, Puffs is not a general purpose programming language. Instead, its focus is on compile-time provable memory safety, with C-like performance, for CPU-intensive file format decoders.
Nonetheless, Puffs is an imperative language, not a functional language, despite the long history of functional languages and provability. Inner loop performance usage matters, and it is easier to match the optimization techniques of incumbent C libraries with a C-like language than with a functional language. Compared to something like Liquid Haskell, Puffs also has no garbage collector overhead, as it has no garbage collector.
Memory usage also matters, again considering per-pixel costs can be multiplied by millions of pixels. It is important to represent e.g. an RGBA pixel as four u8 values (or one u32), not as four naturally-sized-for-the-CPU integers or four big precision integers.
Puffs transpiles to standard C99 code with no dependencies, and should run anywhere that has a C99 compliant C compiler. An existing C/C++ project, such as the Chromium web browser, can choose to replace e.g. libpng with Puffs PNG, without needing any additional toolchains. Sure, languages like D and Rust have great binary-level interop with C code, and Mozilla are reporting progress with parsing media formats in Rust, but it‘s still a non-zero operational hurdle to grow a project’s build process and to assess build times and binary sizes.
Again, we‘re not claiming that “write it in Rust” is an unworkable approach. It’s just a different approach, with different trade-offs.
Nom is a promising parser combinator library, in Rust. Puffs differs from nom by itself, in that Puffs is an end to end implementation, not limited to that part of a file format that is easily expressible as a formal grammar. In particular, it also handles entropy encodings such as LZW (for GIF), ZLIB (for PNG, TODO) and Huffman (for JPEG, TODO). Puffs differs from nom combined with other Rust code (e.g. a Rust LZW implementation) in that bounds and overflow checks not just ubiquitous but also completely eliminated at compile time. Once again, we're not claiming that nom or Rust are unworkable, just different.
Proof of concept. Version 0.1 at best. API and ABI aren't stabilized yet. There are plenty of tests to create, docs to write and TODOs to do. The compiler undoubtedly has bugs. Assertion checking needs more rigor, especially around side effects and aliasing, and being sufficiently well specified to allow alternative implementations. Lots of detail needs work, but the broad brushstrokes are there.
TODO: get a decently large corpus of test images from a web crawl. Ensure that Puffs produces 100% identical decodings to the suite of giflib, libjpeg, libpng, etc.
TODO: also trawl through Go's bug tracker for “this image failed to load”.
Preliminary puffs bench -mimic
summarized throughput numbers for the GIF codec are below. Higher is better.
“Mimic” tests check that Puffs' output mimics (i.e. exactly matches) other libraries' output. “Mimic” benchmarks give the numbers for those other libraries, as shipped with the OS, measured here on Ubunty 14.04 LTS “Trusty”.
The 1k, 10k, etc. numbers are approximately how many bytes of pixel data there is in the decoded image. For example, the test/testdata/harvesters.*
images are 1165 × 859 (approximately 1000k pixels) and a GIF image (a paletted image) is 1 byte per pixel.
name speed gif_puffs_decode_1k/clang 389MB/s ± 2% gif_puffs_decode_10k/clang 137MB/s ± 0% gif_puffs_decode_100k/clang 121MB/s ± 0% gif_puffs_decode_1000k/clang 124MB/s ± 0% gif_mimic_decode_1k/clang 158MB/s ± 1% gif_mimic_decode_10k/clang 94.4MB/s ± 0% gif_mimic_decode_100k/clang 100MB/s ± 0% gif_mimic_decode_1000k/clang 102MB/s ± 0% gif_puffs_decode_1k/gcc 406MB/s ± 1% gif_puffs_decode_10k/gcc 158MB/s ± 0% gif_puffs_decode_100k/gcc 138MB/s ± 0% gif_puffs_decode_1000k/gcc 142MB/s ± 0% gif_mimic_decode_1k/gcc 158MB/s ± 0% gif_mimic_decode_10k/gcc 94.4MB/s ± 0% gif_mimic_decode_100k/gcc 100MB/s ± 0% gif_mimic_decode_1000k/gcc 102MB/s ± 0%
The mimic numbers measure the pre-compiled library that shipped with the OS, so it is unsurprising that they don't depend on the C compiler (clang or gcc) used to run the test harness.
TODO.
Short term:
Medium term:
Long term:
Very long term:
TODO: set up a mailing list.
The CONTRIBUTING.md file contains instructions on how to file the Contributor License Agreement before sending any pull requests (PRs). Of course, if you‘re new to the project, it’s usually best to discuss any proposals and reach consensus before sending your first PR.
BSD-style with patent grant. See the LICENSE and PATENTS files for details.
This is not an official Google product, it is just code that happens to be owned by Google.
Updated on July 2017.