tree: 73ce93a9f4de2a67670b2dae40ed1e5ef0888937 [path history] [tgz]
  1. common_adler32.wuffs
  2. common_up_arm_neon.wuffs
  3. common_up_x86_sse42.wuffs
  4. README.md
std/adler32/README.md

Adler-32

Adler-32 is a checksum algorithm that hashes byte sequences to 32 bit values. It is named after its inventor, Mark Adler, who also co-invented the Gzip and Zlib compressed file formats. Amongst other differences, Gzip uses CRC-32 as its checksum and Zlib uses Adler-32.

The algorithm, described in RFC 1950, is simple. Conceptually, there are two unsigned integers s1 and s2 of infinite precision, initialized to 0 and 1. These two accumulators are updated for every input byte src[i]. At the end of the loop, s1 is 1 plus the sum of all source bytes and s2 is the sum of all (intermediate and final) s1 values:

var s1 = 1;
var s2 = 0;
for_each i in_the_range_of src {
    s1 = s1 + src[i];
    s2 = s2 + s1;
}
return ((s2 % 65521) << 16) | (s1 % 65521);

The final uint32_t hash value is composed of two 16-bit values: (s1 % 65521) in the low 16 bits and (s2 % 65521) in the high 16 bits. 65521 is the largest prime number less than (1 << 16).

Infinite precision arithmetic requires arbitrarily large amounts of memory. In practice, computing the Adler-32 hash instead uses a uint32_t typed s1 and s2, modifying the algorithm to be concious of overflow inside the loop:

uint32_t s1 = 1;
uint32_t s2 = 0;
for_each i in_the_range_of src {
    s1 = (s1 + src[i]) % 65521;
    s2 = (s2 + s1)     % 65521;
}
return (s2 << 16) | s1;

The loop can be split into two levels, so that the relatively expensive modulo operation can be hoisted out of the inner loop:

uint32_t s1 = 1;
uint32_t s2 = 0;
for_each_sub_slice s of_length_up_to M partitioning src {
    for_each i in_the_range_of s {
        s1 = s1 + s[i];
        s2 = s2 + s1;
    }
    s1 = s1 % 65521;
    s2 = s2 % 65521;
}
return (s2 << 16) | s1;

We just need to find the largest M such that the inner loop cannot overflow. The worst case scenario is that s1 and s2 both start the inner loop at 65520 and every subsequent src[i] byte equals 0xFF. A simple computation finds that the largest non-overflowing M is 5552.

In a happy coincidence, 5552 is an exact multiple of 16, which often works well with loop unrolling and with SIMD alignment.

Comparison with CRC-32

Adler-32 is a very simple hashing algorithm. While its output is nominally a uint32_t value, it isn't uniformly distributed across the entire uint32_t range. The [65521, 65535] range of each 16-bit half of an Adler-32 checksum is never touched.

While neither Adler-32 or CRC-32 are cryptographic hash functions, there is still a stark difference in the patterns (or lack of) in their hash values of the N-byte string consisting entirely of zeroes, as this Go program shows:

N  Adler-32    CRC-32      Input
0  0x00000001  0x00000000  ""
1  0x00010001  0xD202EF8D  "\x00"
2  0x00020001  0x41D912FF  "\x00\x00"
3  0x00030001  0xFF41D912  "\x00\x00\x00"
4  0x00040001  0x2144DF1C  "\x00\x00\x00\x00"
5  0x00050001  0xC622F71D  "\x00\x00\x00\x00\x00"
6  0x00060001  0xB1C2A1A3  "\x00\x00\x00\x00\x00\x00"
7  0x00070001  0x9D6CDF7E  "\x00\x00\x00\x00\x00\x00\x00"

Adler-32 is a simpler algorithm than CRC-32. At the time Adler-32 was invented, it had noticeably higher throughput. With modern SIMD implementations, that performance difference has largely disappeared.

Worked Example

A worked example for calculating the Adler-32 hash of the three byte input “Hi\n”, starting from the initial state (s1 = 1) and (s2 = 0):

src[i]  ((s2 << 16) | s1)
----    0x00000001
0x48    0x00490049
0x69    0x00FB00B2
0x0A    0x01B700BC