make inner loop of clut() less branchy

Instead of separate lo, hi, and t arrays, store lo and hi together in
one index array, and store 1-t and t in a new weight array.  We can
then pick between lo with 1-t and hi with t using a couple of quick bit
operations, with no branching.
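
A minimal sketch of the branch-free selection, assuming dim <= 4 and a
layout where slots [0,4) hold the lo side and slots [4,8) hold the hi
side.  The names slot, slot_branchy, and kMaxDim are hypothetical and
only illustrate the idea; they are not taken from the change itself.

```c
#include <assert.h>

/* Assumed layout for one pixel, with dim <= 4:
 *   index[i]  = lo contribution of dimension i,  index[i+4]  = hi contribution
 *   weight[i] = 1-t for dimension i,             weight[i+4] = t
 * This layout is an illustrative assumption. */
enum { kMaxDim = 4 };

/* Bit i of combo says whether dimension i uses its lo (0) or hi (1) side.
 * Shifting and masking turns that bit into an offset of 0 or 4, so the
 * same slot number can be used for both the index and weight arrays. */
static int slot(int combo, int i) {
    assert(0 <= i && i < kMaxDim);
    return i + ((combo >> i) & 1) * kMaxDim;
}

/* Branchy reference with the same result, one conditional per dimension. */
static int slot_branchy(int combo, int i) {
    assert(0 <= i && i < kMaxDim);
    if (combo & (1 << i)) {
        return i + kMaxDim;
    }
    return i;
}
```

With this layout, index[slot(combo,i)] and weight[slot(combo,i)] always
refer to a matching pair: either lo with 1-t, or hi with t.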

Then, unroll the loop over i in [0,dim), handling i=0 unconditionally
and the remaining [1,dim) as a fall-through switch.
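
The unrolled shape might look roughly like the sketch below.  It
assumes 0 < dim <= 4 and the same lo-in-[0,4), hi-in-[4,8) layout;
corner() is a hypothetical helper written for illustration, using plain
int and float rather than whatever vector types the real code uses.

```c
#include <assert.h>

/* Hypothetical helper: combine the per-dimension contributions for one
 * corner of the interpolation cell.  Bit i of combo selects lo (0) or
 * hi (1) in dimension i. */
static void corner(const int index[8], const float weight[8],
                   int dim, int combo, int* ix, float* w) {
    assert(0 < dim && dim <= 4);

    /* Dimension 0 always exists, so handle it unconditionally. */
    *ix = index [0 + ((combo >> 0) & 1) * 4];
    *w  = weight[0 + ((combo >> 0) & 1) * 4];

    /* Enter at the highest remaining dimension and fall through to 1. */
    switch (dim - 1) {
        case 3: *ix += index [3 + ((combo >> 3) & 1) * 4];
                *w  *= weight[3 + ((combo >> 3) & 1) * 4];
                /* fall through */
        case 2: *ix += index [2 + ((combo >> 2) & 1) * 4];
                *w  *= weight[2 + ((combo >> 2) & 1) * 4];
                /* fall through */
        case 1: *ix += index [1 + ((combo >> 1) & 1) * 4];
                *w  *= weight[1 + ((combo >> 1) & 1) * 4];
                break;
        default: break;  /* dim == 1: nothing left to add. */
    }
}
```

A caller would loop combo over [0, 1<<dim) and accumulate w times the
table entry at ix; the weights across all combos sum to 1.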

This is noticeably faster on all platforms and a big win on 32-bit ARM,
in exchange for a tiny increase in code size.

Change-Id: Idee254ca6439724c75de40ac250742223c83435c
Reviewed-on: https://skia-review.googlesource.com/c/163163
Commit-Queue: Brian Osman <brianosman@google.com>
Auto-Submit: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
1 file changed