make inner loop of clut() less branchy

Instead of separate lo, hi, and t arrays, store lo and hi together in a
single index array, and store 1-t and t in a new weight array.  We can
then select between lo with 1-t and hi with t using a couple of quick
bit operations, with no branching.
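
A minimal sketch of the selection idea, not the code from this change.
It assumes an 8-slot layout where slot i holds the low-side value for
dimension i and slot i+4 holds the high-side value, and that bit i of a
corner counter (called combo here) picks the side.  ClutTables,
side_offset, pick_index, and pick_weight are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical layout (an assumption, not taken from the patch):
// slot i is the low side of dimension i, slot i+4 is its high side.
typedef struct {
    int32_t index [8];  // lo*stride in slots 0..3, hi*stride in slots 4..7
    float   weight[8];  // 1-t       in slots 0..3, t         in slots 4..7
} ClutTables;

// Bit i of combo says whether dimension i uses its low (0) or high (1) side.
// ((combo >> i) & 1) is 0 or 1, so shifting it left by 2 yields 0 or 4.
static unsigned side_offset(int i, unsigned combo) {
    assert(0 <= i && i < 4);
    return ((combo >> i) & 1u) << 2;
}

// Both lookups are plain array loads at a computed offset: no if, no ?:.
static int32_t pick_index(const ClutTables* tb, int i, unsigned combo) {
    return tb->index[(unsigned)i + side_offset(i, combo)];
}

static float pick_weight(const ClutTables* tb, int i, unsigned combo) {
    return tb->weight[(unsigned)i + side_offset(i, combo)];
}
```

With three separate arrays the same choice would need a test of the
combo bit for every dimension of every corner; here the bit only feeds
the address computation.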

Then, unroll the loop over i in [0,dim), handling i=0 unconditionally
and the remaining i in [1,dim) as a fall-through switch.
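
The unrolled shape might look roughly like the sketch below.  It
assumes 0 < dim <= 4 and the same hypothetical 8-slot index/weight
layout (low side in slot i, high side in slot i+4); the function name
combo_index_and_weight and its signature are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

// Accumulate the table offset and interpolation weight for one corner
// (combo) of the dim-dimensional cell.  Each (combo & N) term below is
// scaled so that it evaluates to 0 (low side) or 4 (high side).
static void combo_index_and_weight(const int32_t index[8], const float weight[8],
                                   int dim, unsigned combo,
                                   int32_t* ix_out, float* w_out) {
    assert(0 < dim && dim <= 4);

    // Dimension 0 always exists, so handle it without any test.
    int32_t ix = index [0 + (combo & 1) * 4];
    float   w  = weight[0 + (combo & 1) * 4];

    // Enter at the highest dimension in use and fall through to the rest.
    switch (dim - 1) {
        case 3: ix += index [3 + (combo & 8) / 2];
                w  *= weight[3 + (combo & 8) / 2];
                /* fall through */
        case 2: ix += index [2 + (combo & 4)];
                w  *= weight[2 + (combo & 4)];
                /* fall through */
        case 1: ix += index [1 + (combo & 2) * 2];
                w  *= weight[1 + (combo & 2) * 2];
                /* fall through */
        default: break;
    }

    *ix_out = ix;
    *w_out  = w;
}
```

The switch is taken once per corner rather than once per dimension, so
the per-dimension loop condition disappears from the inner loop.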

This is noticeably faster on all platforms and a big win on 32-bit ARM,
in exchange for a tiny increase in code size.

Change-Id: Idee254ca6439724c75de40ac250742223c83435c
Commit-Queue: Brian Osman <>
Auto-Submit: Mike Klein <>
Reviewed-by: Brian Osman <>
1 file changed