std/crc32/README.md - external/github.com/google/wuffs - Git at Google

 # CRC-32

 CRC-32 (Cyclic Redundancy Check 32) is a checksum algorithm that hashes byte
 sequences to 32 bit values.

 The algorithm is based on polynomial division. In theory, a variety of
 polynomials can be used. In practice, only two are widely used. The IEEE
 polynomial is used by Bzip2, Ethernet (IEEE 802.3), Gzip, MPEG-2, PNG, SATA,
 Zip and other formats. The Castagnoli polynomial is used by Btrfs, Ext4, iSCSI,
 SCTP and other formats.


 # Polynomial Division

 In arithmetic, two numbers `A` and `B` can be divided to yield a quotient `Q`
 and a remainder `R`, so that `A = (B × Q) + R`. Two polynomials can similarly
 be divided, yielding quotient and remainder polynomials. This remains true when
 restricting polynomial coefficients to the [Galois Field
 GF(2)](https://en.wikipedia.org/wiki/GF(2)), equivalent to the integers modulo
 2, i.e. the integers 0 and 1. Addition and multiplication are equivalent to XOR
 and AND, so that `foo + foo = 0` and `foo + 0 = foo`. For example,

 ```
   ((x⁴ + x¹ + x⁰) × (x⁴ + x⁰)) + (x¹ + x⁰)
 =  (x⁸ + x⁵ + x⁴) + (x⁴ + x¹ + x⁰) + (x¹ + x⁰)
 =   x⁸ + x⁵ + (x⁴ + x⁴) + (x¹ + x¹) + (x⁰ + x⁰)
 =   x⁸ + x⁵ + 0 + 0 + 0
 =   x⁸ + x⁵
 ```

 Therefore, dividing `A(x) = x⁸ + x⁵` by `B(x) = x⁴ + x¹ + x⁰` yields a
 remainder of `R(x) = x¹ + x⁰`. Such polynomials can be represented as binary
 numbers: `A` is `0b100100000`, `B` is `0b10011` and `R` is `0b11`.


 ## Bitwise Operations

 [Polynomial division](https://en.wikipedia.org/wiki/Polynomial_long_division)
 over GF(2) can be efficiently implemented as a sequence of XORs and shifts.
 Extending the example above (with a divisor polynomial `B` of degree `N` equal
 to 4), an input `0b10010` is first padded with `N` trailing zeroes to give the
 initial state: the `A` value `0b100100000`.

 The algorithm loops, alternating between shifting a copy of the `B` polynomial
 so that its left-most 1 bit aligns with the left-most 1 bit of the state, and
 XOR-ing those two values, producing a new state whose left-most 1 bit is
 further to right. The loop continues until no such shifted `B` can be found, or
 in other words, all of the state's bits to the left of the right-most `N` bits
 are all zero. What remains is the remainder polynomial, `R`.

 ```
 input   00010010
 pad     00010010 0000
 divisor    10011
 xor     00000001 0000
 divisor        1 0011
 xor     00000000 0011
 remain           0011
 ```

 This `B` divisor polynomial, `0b10011`, is also known as the CRC-4-ITU
 polynomial. The `4` in "CRC-4-ITU" is the `N` in "an N bit CRC". The polynomial
 has `N+1` bits, and the hash value or remainder `0b0011` (what remains after
 dividing a longer padded-input polynomial by the shorter hash polynomial) has
 `N` bits.


 ## Normal (MSB) and Reversed (LSB) Representation

 Consider that CRC-4-ITU polynomial again: (x⁴ + x¹ + x⁰) or `0b10011` in
 binary. The high bit of an `N`-degree polynomial is always 1, so an equivalent
 representation of that polynomial is `N=4, MSB, BITS=0b0011`. This is the
 "normal" or Most Significant Bit first order. Another equivalent representation
 is to visit the bits right-to-left: `N=4, LSB, BITS=0b1100`. This is the
 "reversed" or Least Significant Bit first order. The binary number `0b1100` in
 hexadecimal is `0xC`, so yet another equivalent representation of that
 polynomial is `N=4, LSB, BITS=0xC`. If `N` and `LSB`-ness is agreed beforehand,
 `0xC` is all you need to specify the polynomial.

 For 32 bit CRCs, the two popular polynomials (presented in `LSB` order) are
 `0xEDB8_8320` and `0x82F6_3B78`, also called the IEEE and Castagnoli
 polynomials, also called CRC-32 and CRC-32C. For example, the bit string
 representation of the IEEE polynomial `0xEDB8_8320`, un-reversed and with the
 implicit high bit, is `0b1_00000100_11000001_00011101_10110111 =
 0x1_04C1_1DB7`.


 # Worked Example

 A worked example for calculating the CRC-32 hash of the three byte input "Hi\n"
 (equivalently, "\x48\x69\x0A" but note in the work below's input that the bits
 within each byte are reversed to be LSB-first) proceeds similarly to the
 simpler worked example for the CRC-4-ITU hash, above, with two additional
 inversion steps, described below.

 Both the CRC-4-ITU and the CRC-32 worked examples are output by the
 `script/print-crc32-example.go` program.

 ```
 input   00010010 10010110 01010000
 pad     00010010 10010110 01010000 00000000 00000000 00000000 00000000
 invert  11101101 01101001 10101111 11111111 00000000 00000000 00000000
 divisor 10000010 01100000 10001110 11011011 1
 xor     01101111 00001001 00100001 00100100 10000000 00000000 00000000
 divisor  1000001 00110000 01000111 01101101 11
 xor     00101110 00111001 01100110 01001001 01000000 00000000 00000000
 divisor   100000 10011000 00100011 10110110 111
 xor     00001110 10100001 01000101 11111111 10100000 00000000 00000000
 divisor     1000 00100110 00001000 11101101 10111
 xor     00000110 10000111 01001101 00010010 00011000 00000000 00000000
 divisor      100 00010011 00000100 01110110 110111
 xor     00000010 10010100 01001001 01100100 11000100 00000000 00000000
 divisor       10 00001001 10000010 00111011 0110111
 xor     00000000 10011101 11001011 01011111 10101010 00000000 00000000
 divisor          10000010 01100000 10001110 11011011 1
 xor     00000000 00011111 10101011 11010001 01110001 10000000 00000000
 divisor             10000 01001100 00010001 11011011 0111
 xor     00000000 00001111 11100111 11000000 10101010 11110000 00000000
 divisor              1000 00100110 00001000 11101101 10111
 xor     00000000 00000111 11000001 11001000 01000111 01001000 00000000
 divisor               100 00010011 00000100 01110110 110111
 xor     00000000 00000011 11010010 11001100 00110001 10010100 00000000
 divisor                10 00001001 10000010 00111011 0110111
 xor     00000000 00000001 11011011 01001110 00001010 11111010 00000000
 divisor                 1 00000100 11000001 00011101 10110111
 xor     00000000 00000000 11011111 10001111 00010111 01001101 00000000
 divisor                   10000010 01100000 10001110 11011011 1
 xor     00000000 00000000 01011101 11101111 10011001 10010110 10000000
 divisor                    1000001 00110000 01000111 01101101 11
 xor     00000000 00000000 00011100 11011111 11011110 11111011 01000000
 divisor                      10000 01001100 00010001 11011011 0111
 xor     00000000 00000000 00001100 10010011 11001111 00100000 00110000
 divisor                       1000 00100110 00001000 11101101 10111
 xor     00000000 00000000 00000100 10110101 11000111 11001101 10001000
 divisor                        100 00010011 00000100 01110110 110111
 xor     00000000 00000000 00000000 10100110 11000011 10111011 01010100
 remain                             10100110 11000011 10111011 01010100
 invert                             01011001 00111100 01000100 10101011
 hex                                   A   9    C   3    2   2    5   D
 ```

 The final line says that the CRC-32 checksum of "Hi\n" is `0xD522_3C9A`. This
 can be verified by running the `/usr/bin/crc32` program:

 ```
 $ echo Hi | hd    /dev/stdin
 00000000  48 69 0a                                          |Hi.|
 00000003
 $ echo Hi | crc32 /dev/stdin
 d5223c9a
 ```


 ## Inversion

 An `A` value of `0b100100000` as a mathematical concept (whether as a binary
 number of as a polynomial) is unchanged by adding leading 0 bits. Thus, the CRC
 algorithm in its basic form will compute the same hash value for both some
 string `s` and another string that is multiple "\x00" bytes followed by `s`.

 It's easy for network errors or other corruption to introduce multiple "\x00"
 bytes, and a good checksum should be able detect that. To do so, CRC as used in
 practice inverts (applies a bitwise NOT to) the first `N` bits of the padded
 input, just before the divisor-xor loop of the algorithm.

 Similarly but independently of that, the same issue (in a more limited way) can
 occur with trailing "\x00" bytes. The same trick addresses that, this time
 inverting the last `N` bits, just after the divisor-xor loop.

 While the two inversions are notionally independent, and it would be possible
 to implement a CRC flavor that inverted only one side, inverting on both sides
 results in a nice decomposability property. Calculating the CRC hash of the
 concatenation of two strings, `s+t`, can be computed by first hashing `s`, then
 hashing `t` starting with an initial state equal to that hash instead of equal
 to zero. The second inversion of the `s` computation cancels out the first
 inversion of the `t` computation.


 # Fast Computation

 In theory, the mathematics of the CRC algorithm works in terms of bit streams.
 In practice, the computation of the CRC hash value works with byte streams: 8
 (or more) bits at a time. It is faster to do so because CPU and RAM
 fundamentally work with bytes (or words) instead of bits, and because the
 bit-by-bit loop (over 8 successive 0 or 1 input bits) can be implemented as a
 byte-by-byte loop involving a single lookup into a 256-entry table, yielding a
 32-bit value to XOR with the cumulative state. For well known polynomials, such
 as the IEEE and Castagnoli polynomials, these lookup tables can be calculated
 beforehand, and hard-coded at compile time. A uint32 `state` variable can be
 updated by the simple algorithm (with `invert` meaning `bitwise_not`):

 ```
 state = invert(state)
 for each input byte x {
     state = table[x ^ (state & 0xFF)] ^ (state >> 8)
 }
 state = invert(state)
 ```


 ## Multi-Byte Lookup Tables

 Better performance can be obtained by processing M bytes at a time, e.g. for an
 M of 4, 8 or 16. Even at only 4 bytes, a naive implementation would require a
 more-than-4-billion (256 × 256 × 256 × 256) entry lookup table, which is
 impractical. A cleverer algorithm can process 4 bytes at a time using (256 +
 256 + 256 + 256) entries. See [A Systematic Approach to Building High
 Performance, Software-based, CRC
 Generators](https://web.archive.org/web/20060515024705/http://www.intel.com/technology/comms/perfnet/download/CRC_generators.pdf)
 by Kounavis and Berry of Intel Corporation. [Stephan Brumme's CRC-32
 page](https://create.stephan-brumme.com/crc32/#slicing-by-16-overview) also has
 some more discussion and code examples.

 This slicing-by-M algorithm is still applicable even when the input isn't an
 exact multiple of M bytes. The bulk of the input is processed M bytes at a
 time, and the remainder is then processed 1 byte at a time.

 For M greater than 4, the first 4 bytes of each slice are treated in one way,
 since the state (a uint32) is 4 bytes long. The remaining M-4 bytes are treated
 a second way. See [the actual code](./common_crc32.wuffs) for details.


 ## SIMD Implementations

 Even better performance can be obtained through CPU-specific SIMD instructions.
 See [Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
 Instruction](https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf)
 by Gopal, Ozturk, Guilford, Wolrich, Feghali and Dixon of Intel Corporation and
 Karakoyunlu of the Worcester Polytechnic Institute.


 # Further Reading

 See a couple of Wikipedia articles:

 - [CRC](https://en.wikipedia.org/wiki/Cyclic_redundancy_check),
 - [Computation of
   CRCs](https://en.wikipedia.org/wiki/Computation_of_cyclic_redundancy_checks).
	# CRC-32

	CRC-32 (Cyclic Redundancy Check 32) is a checksum algorithm that hashes byte
	sequences to 32 bit values.

	The algorithm is based on polynomial division. In theory, a variety of
	polynomials can be used. In practice, only two are widely used. The IEEE
	polynomial is used by Bzip2, Ethernet (IEEE 802.3), Gzip, MPEG-2, PNG, SATA,
	Zip and other formats. The Castagnoli polynomial is used by Btrfs, Ext4, iSCSI,
	SCTP and other formats.


	# Polynomial Division

	In arithmetic, two numbers `A` and `B` can be divided to yield a quotient `Q`
	and a remainder `R`, so that `A = (B × Q) + R`. Two polynomials can similarly
	be divided, yielding quotient and remainder polynomials. This remains true when
	restricting polynomial coefficients to the [Galois Field
	GF(2)](https://en.wikipedia.org/wiki/GF(2)), equivalent to the integers modulo
	2, i.e. the integers 0 and 1. Addition and multiplication are equivalent to XOR
	and AND, so that `foo + foo = 0` and `foo + 0 = foo`. For example,

	```
	((x⁴ + x¹ + x⁰) × (x⁴ + x⁰)) + (x¹ + x⁰)
	= (x⁸ + x⁵ + x⁴) + (x⁴ + x¹ + x⁰) + (x¹ + x⁰)
	= x⁸ + x⁵ + (x⁴ + x⁴) + (x¹ + x¹) + (x⁰ + x⁰)
	= x⁸ + x⁵ + 0 + 0 + 0
	= x⁸ + x⁵
	```

	Therefore, dividing `A(x) = x⁸ + x⁵` by `B(x) = x⁴ + x¹ + x⁰` yields a
	remainder of `R(x) = x¹ + x⁰`. Such polynomials can be represented as binary
	numbers: `A` is `0b100100000`, `B` is `0b10011` and `R` is `0b11`.


	## Bitwise Operations

	[Polynomial division](https://en.wikipedia.org/wiki/Polynomial_long_division)
	over GF(2) can be efficiently implemented as a sequence of XORs and shifts.
	Extending the example above (with a divisor polynomial `B` of degree `N` equal
	to 4), an input `0b10010` is first padded with `N` trailing zeroes to give the
	initial state: the `A` value `0b100100000`.

	The algorithm loops, alternating between shifting a copy of the `B` polynomial
	so that its left-most 1 bit aligns with the left-most 1 bit of the state, and
	XOR-ing those two values, producing a new state whose left-most 1 bit is
	further to right. The loop continues until no such shifted `B` can be found, or
	in other words, all of the state's bits to the left of the right-most `N` bits
	are all zero. What remains is the remainder polynomial, `R`.

	```
	input 00010010
	pad 00010010 0000
	divisor 10011
	xor 00000001 0000
	divisor 1 0011
	xor 00000000 0011
	remain 0011
	```

	This `B` divisor polynomial, `0b10011`, is also known as the CRC-4-ITU
	polynomial. The `4` in "CRC-4-ITU" is the `N` in "an N bit CRC". The polynomial
	has `N+1` bits, and the hash value or remainder `0b0011` (what remains after
	dividing a longer padded-input polynomial by the shorter hash polynomial) has
	`N` bits.


	## Normal (MSB) and Reversed (LSB) Representation

	Consider that CRC-4-ITU polynomial again: (x⁴ + x¹ + x⁰) or `0b10011` in
	binary. The high bit of an `N`-degree polynomial is always 1, so an equivalent
	representation of that polynomial is `N=4, MSB, BITS=0b0011`. This is the
	"normal" or Most Significant Bit first order. Another equivalent representation
	is to visit the bits right-to-left: `N=4, LSB, BITS=0b1100`. This is the
	"reversed" or Least Significant Bit first order. The binary number `0b1100` in
	hexadecimal is `0xC`, so yet another equivalent representation of that
	polynomial is `N=4, LSB, BITS=0xC`. If `N` and `LSB`-ness is agreed beforehand,
	`0xC` is all you need to specify the polynomial.

	For 32 bit CRCs, the two popular polynomials (presented in `LSB` order) are
	`0xEDB8_8320` and `0x82F6_3B78`, also called the IEEE and Castagnoli
	polynomials, also called CRC-32 and CRC-32C. For example, the bit string
	representation of the IEEE polynomial `0xEDB8_8320`, un-reversed and with the
	implicit high bit, is `0b1_00000100_11000001_00011101_10110111 =
	0x1_04C1_1DB7`.


	# Worked Example

	A worked example for calculating the CRC-32 hash of the three byte input "Hi\n"
	(equivalently, "\x48\x69\x0A" but note in the work below's input that the bits
	within each byte are reversed to be LSB-first) proceeds similarly to the
	simpler worked example for the CRC-4-ITU hash, above, with two additional
	inversion steps, described below.

	Both the CRC-4-ITU and the CRC-32 worked examples are output by the
	`script/print-crc32-example.go` program.

	```
	input 00010010 10010110 01010000
	pad 00010010 10010110 01010000 00000000 00000000 00000000 00000000
	invert 11101101 01101001 10101111 11111111 00000000 00000000 00000000
	divisor 10000010 01100000 10001110 11011011 1
	xor 01101111 00001001 00100001 00100100 10000000 00000000 00000000
	divisor 1000001 00110000 01000111 01101101 11
	xor 00101110 00111001 01100110 01001001 01000000 00000000 00000000
	divisor 100000 10011000 00100011 10110110 111
	xor 00001110 10100001 01000101 11111111 10100000 00000000 00000000
	divisor 1000 00100110 00001000 11101101 10111
	xor 00000110 10000111 01001101 00010010 00011000 00000000 00000000
	divisor 100 00010011 00000100 01110110 110111
	xor 00000010 10010100 01001001 01100100 11000100 00000000 00000000
	divisor 10 00001001 10000010 00111011 0110111
	xor 00000000 10011101 11001011 01011111 10101010 00000000 00000000
	divisor 10000010 01100000 10001110 11011011 1
	xor 00000000 00011111 10101011 11010001 01110001 10000000 00000000
	divisor 10000 01001100 00010001 11011011 0111
	xor 00000000 00001111 11100111 11000000 10101010 11110000 00000000
	divisor 1000 00100110 00001000 11101101 10111
	xor 00000000 00000111 11000001 11001000 01000111 01001000 00000000
	divisor 100 00010011 00000100 01110110 110111
	xor 00000000 00000011 11010010 11001100 00110001 10010100 00000000
	divisor 10 00001001 10000010 00111011 0110111
	xor 00000000 00000001 11011011 01001110 00001010 11111010 00000000
	divisor 1 00000100 11000001 00011101 10110111
	xor 00000000 00000000 11011111 10001111 00010111 01001101 00000000
	divisor 10000010 01100000 10001110 11011011 1
	xor 00000000 00000000 01011101 11101111 10011001 10010110 10000000
	divisor 1000001 00110000 01000111 01101101 11
	xor 00000000 00000000 00011100 11011111 11011110 11111011 01000000
	divisor 10000 01001100 00010001 11011011 0111
	xor 00000000 00000000 00001100 10010011 11001111 00100000 00110000
	divisor 1000 00100110 00001000 11101101 10111
	xor 00000000 00000000 00000100 10110101 11000111 11001101 10001000
	divisor 100 00010011 00000100 01110110 110111
	xor 00000000 00000000 00000000 10100110 11000011 10111011 01010100
	remain 10100110 11000011 10111011 01010100
	invert 01011001 00111100 01000100 10101011
	hex A 9 C 3 2 2 5 D
	```

	The final line says that the CRC-32 checksum of "Hi\n" is `0xD522_3C9A`. This
	can be verified by running the `/usr/bin/crc32` program:

	```
	$ echo Hi \| hd /dev/stdin
	00000000 48 69 0a \|Hi.\|
	00000003
	$ echo Hi \| crc32 /dev/stdin
	d5223c9a
	```


	## Inversion

	An `A` value of `0b100100000` as a mathematical concept (whether as a binary
	number of as a polynomial) is unchanged by adding leading 0 bits. Thus, the CRC
	algorithm in its basic form will compute the same hash value for both some
	string `s` and another string that is multiple "\x00" bytes followed by `s`.

	It's easy for network errors or other corruption to introduce multiple "\x00"
	bytes, and a good checksum should be able detect that. To do so, CRC as used in
	practice inverts (applies a bitwise NOT to) the first `N` bits of the padded
	input, just before the divisor-xor loop of the algorithm.

	Similarly but independently of that, the same issue (in a more limited way) can
	occur with trailing "\x00" bytes. The same trick addresses that, this time
	inverting the last `N` bits, just after the divisor-xor loop.

	While the two inversions are notionally independent, and it would be possible
	to implement a CRC flavor that inverted only one side, inverting on both sides
	results in a nice decomposability property. Calculating the CRC hash of the
	concatenation of two strings, `s+t`, can be computed by first hashing `s`, then
	hashing `t` starting with an initial state equal to that hash instead of equal
	to zero. The second inversion of the `s` computation cancels out the first
	inversion of the `t` computation.


	# Fast Computation

	In theory, the mathematics of the CRC algorithm works in terms of bit streams.
	In practice, the computation of the CRC hash value works with byte streams: 8
	(or more) bits at a time. It is faster to do so because CPU and RAM
	fundamentally work with bytes (or words) instead of bits, and because the
	bit-by-bit loop (over 8 successive 0 or 1 input bits) can be implemented as a
	byte-by-byte loop involving a single lookup into a 256-entry table, yielding a
	32-bit value to XOR with the cumulative state. For well known polynomials, such
	as the IEEE and Castagnoli polynomials, these lookup tables can be calculated
	beforehand, and hard-coded at compile time. A uint32 `state` variable can be
	updated by the simple algorithm (with `invert` meaning `bitwise_not`):

	```
	state = invert(state)
	for each input byte x {
	state = table[x ^ (state & 0xFF)] ^ (state >> 8)
	}
	state = invert(state)
	```


	## Multi-Byte Lookup Tables

	Better performance can be obtained by processing M bytes at a time, e.g. for an
	M of 4, 8 or 16. Even at only 4 bytes, a naive implementation would require a
	more-than-4-billion (256 × 256 × 256 × 256) entry lookup table, which is
	impractical. A cleverer algorithm can process 4 bytes at a time using (256 +
	256 + 256 + 256) entries. See [A Systematic Approach to Building High
	Performance, Software-based, CRC
	Generators](https://web.archive.org/web/20060515024705/http://www.intel.com/technology/comms/perfnet/download/CRC_generators.pdf)
	by Kounavis and Berry of Intel Corporation. [Stephan Brumme's CRC-32
	page](https://create.stephan-brumme.com/crc32/#slicing-by-16-overview) also has
	some more discussion and code examples.

	This slicing-by-M algorithm is still applicable even when the input isn't an
	exact multiple of M bytes. The bulk of the input is processed M bytes at a
	time, and the remainder is then processed 1 byte at a time.

	For M greater than 4, the first 4 bytes of each slice are treated in one way,
	since the state (a uint32) is 4 bytes long. The remaining M-4 bytes are treated
	a second way. See [the actual code](./common_crc32.wuffs) for details.


	## SIMD Implementations

	Even better performance can be obtained through CPU-specific SIMD instructions.
	See [Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
	Instruction](https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf)
	by Gopal, Ozturk, Guilford, Wolrich, Feghali and Dixon of Intel Corporation and
	Karakoyunlu of the Worcester Polytechnic Institute.


	# Further Reading

	See a couple of Wikipedia articles:

	- [CRC](https://en.wikipedia.org/wiki/Cyclic_redundancy_check),
	- [Computation of
	CRCs](https://en.wikipedia.org/wiki/Computation_of_cyclic_redundancy_checks).