make inner loop of clut() less branchy
Instead of separate lo, hi, and t arrays, store lo and hi together as index,
and store 1-t and t in a new weight array. We can then figure out whether
we need to use lo and 1-t or hi and t using a couple of quick bit
operations with no branching.
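
A rough sketch of the branch-free selection, not the code in this change.
It assumes a single-channel float table and 1 <= dim <= 4; the names
interp_loop and example_2x2 are made up for illustration:

```c
#include <assert.h>

// Hypothetical sketch: multilinear interpolation over dim axes.
//   index[i][0]  = lo_i * stride_i      weight[i][0] = 1 - t_i
//   index[i][1]  = hi_i * stride_i      weight[i][1] = t_i
static float interp_loop(const float* table, int dim,
                         int index[4][2], float weight[4][2]) {
    assert(1 <= dim && dim <= 4);
    float sum = 0.0f;
    // Bit i of combo picks the lo (0) or hi (1) corner along axis i.
    for (int combo = 0; combo < (1 << dim); combo++) {
        int   ix = 0;
        float w  = 1.0f;
        for (int i = 0; i < dim; i++) {
            int bit = (combo >> i) & 1;  // a shift and a mask, no branch
            ix += index[i][bit];
            w  *= weight[i][bit];
        }
        sum += w * table[ix];
    }
    return sum;
}

// Usage: a 2x2 table sampled at (x,y), both in [0,1].
static float example_2x2(float x, float y) {
    static const float table[4] = {1, 2, 3, 4};   // laid out as table[y*2 + x]
    int   index [4][2] = { {0, 1}, {0, 2} };      // axis 0 stride 1, axis 1 stride 2
    float weight[4][2] = { {1 - x, x}, {1 - y, y} };
    return interp_loop(table, 2, index, weight);
}
```

The bit extracted from combo is used directly as the second array subscript,
so the same bit picks the matching corner and weight together.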
Then, unroll the i = [0,dim) loop, handling i=0 unconditionally
and the remaining [1,dim) as a fall-through switch.
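
The unrolled shape might look roughly like this. Again a sketch under
assumptions (single-channel float table, 1 <= dim <= 4, hypothetical names
interp_unrolled and example_2x2x2), not the code in this change:

```c
#include <assert.h>

// Hypothetical sketch: index[i][0] / index[i][1] hold the lo / hi table
// offsets for axis i, and weight[i][0] / weight[i][1] hold 1-t / t.
static float interp_unrolled(const float* table, int dim,
                             int index[4][2], float weight[4][2]) {
    assert(1 <= dim && dim <= 4);
    float sum = 0.0f;
    for (int combo = 0; combo < (1 << dim); combo++) {
        // Axis 0 always exists, so handle it unconditionally.
        int   ix = index [0][combo & 1];
        float w  = weight[0][combo & 1];

        // Axes 1..dim-1: enter at the highest axis and fall through to axis 1.
        switch (dim - 1) {
            case 3: ix += index [3][(combo >> 3) & 1];
                    w  *= weight[3][(combo >> 3) & 1];
                    /* fall through */
            case 2: ix += index [2][(combo >> 2) & 1];
                    w  *= weight[2][(combo >> 2) & 1];
                    /* fall through */
            case 1: ix += index [1][(combo >> 1) & 1];
                    w  *= weight[1][(combo >> 1) & 1];
                    break;
            default: break;  // dim == 1: axis 0 was the only axis
        }
        sum += w * table[ix];
    }
    return sum;
}

// Usage: a 2x2x2 table holding f(x,y,z) = x + 2y + 4z at its corners.
static float example_2x2x2(float x, float y, float z) {
    static const float table[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    int   index [4][2] = { {0, 1}, {0, 2}, {0, 4} };
    float weight[4][2] = { {1 - x, x}, {1 - y, y}, {1 - z, z} };
    return interp_unrolled(table, 3, index, weight);
}
```

The switch is entered once per combo, so the only data-dependent control
flow left in the inner loop is a single jump on dim.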
This is noticeably faster on all platforms and a big win on 32-bit ARM,
in exchange for a tiny increase in code size.
Commit-Queue: Brian Osman <email@example.com>
Auto-Submit: Mike Klein <firstname.lastname@example.org>
Reviewed-by: Brian Osman <email@example.com>
1 file changed