fold together clut<8> and clut<16>

As you might guess by now, this is somehow both smaller and faster.

Since there's now only one clut() and one call to it, it's much more
likely to be inlined, which means writing into *r, *g, *b is just as
cheap as using temporaries.
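
A minimal sketch of the shape of this change, not the actual code: the
table layout, the load_entry() helper, the bits and n parameters, and
the 1-D nearest-entry lookup are all simplifying assumptions made for
illustration. The point is only that a runtime bit depth replaces the
clut<8>/clut<16> template split, and that results go straight into
*r, *g, *b.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: read entry i as a float in [0,1].
// bits is a runtime value (8 or 16) instead of a template parameter.
// 16-bit entries are assumed to be stored big-endian.
static inline float load_entry(const uint8_t* table, int bits, int i) {
    assert(bits == 8 || bits == 16);
    if (bits == 8) {
        return table[i] / 255.0f;
    }
    const unsigned hi = table[2*i + 0];
    const unsigned lo = table[2*i + 1];
    return float((hi << 8) | lo) / 65535.0f;
}

// One clut() for both depths. It reads each channel through its pointer
// and writes the result back through the same pointer. Once inlined, the
// compiler can keep these values in registers, like local temporaries.
static inline void clut(const uint8_t* table, int bits, int n,
                        float* r, float* g, float* b) {
    float* channels[] = { r, g, b };
    for (float* c : channels) {
        float x = *c;
        if (x < 0.0f) { x = 0.0f; }
        if (x > 1.0f) { x = 1.0f; }
        // Nearest-entry lookup in a 1-D table of n entries (illustrative
        // only; a real CLUT interpolates over a multi-dimensional grid).
        const int i = int(x * float(n - 1) + 0.5f);
        *c = load_entry(table, bits, i);
    }
}
```

A caller would pass the same function either depth, for example
clut(table, 8, n, &r, &g, &b) or clut(table, 16, n, &r, &g, &b), so
there is a single definition and a single call site to inline.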

