simplify load_888, impl load_161616

I may have cared too much too soon about making load_888() tight vector
code.  I want to regress it to simpler-to-read code for now, without any
of the RawBytes punning or shuffle() mess.  We can just put each byte
where it goes.
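
A minimal sketch of what "put each byte where it goes" might look like,
not the code in this change.  The name load_888_sketch, the lane count
kLanes, the std::array return type, and the choice to pack r, g, b into
the low 24 bits of each lane are all assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical lane count; the real code may process a different number
// of pixels per call.
constexpr std::size_t kLanes = 4;

// Hypothetical stand-in for load_888(): read kLanes tightly packed
// 3-byte pixels and place each byte directly into its slot of a 32-bit
// lane.  No type punning and no shuffles, just shifts and ORs.
std::array<uint32_t, kLanes> load_888_sketch(const uint8_t* src) {
    std::array<uint32_t, kLanes> px{};
    for (std::size_t i = 0; i < kLanes; i++) {
        px[i] = static_cast<uint32_t>(src[3*i + 0]) <<  0   // first byte
              | static_cast<uint32_t>(src[3*i + 1]) <<  8   // second byte
              | static_cast<uint32_t>(src[3*i + 2]) << 16;  // third byte
    }
    return px;
}
```

Only byte loads touch src here, so byte alignment is enough.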

This made it easier to think about and write an analogous load_161616().

I required the src pointer be 2-byte aligned for 161616, which seems
unlikely to ever be accidentally violated.  Other formats so far
actually only require byte alignment.  What's the right policy here?
Should pointers be aligned to each format's addressable components,
with 565 a special case since we can't address bits?  Or to full
pixels?  Maybe the answer will become clearer as I finish up _hhhh and
_ffff.
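
To make the 2-byte requirement concrete, here is a rough sketch under
stated assumptions, not the code in this change.  The names
load_161616_sketch, is_aligned, RGB16, and kPixels are hypothetical, the
channels are returned separately only for readability, and any
byte-order handling the real format needs is left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical number of pixels loaded per call.
constexpr std::size_t kPixels = 4;

// Hypothetical result type: one array per 16-bit channel.
struct RGB16 {
    std::array<uint16_t, kPixels> r, g, b;
};

// True if p is a multiple of `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical stand-in for load_161616().  The caller must pass a
// pointer to 2-byte-aligned 16-bit data; values are read in native byte
// order with no swapping.
RGB16 load_161616_sketch(const uint8_t* src) {
    assert(is_aligned(src, 2));  // component (not full-pixel) alignment
    const uint16_t* p = reinterpret_cast<const uint16_t*>(src);
    RGB16 px{};
    for (std::size_t i = 0; i < kPixels; i++) {
        px.r[i] = p[3*i + 0];
        px.g[i] = p[3*i + 1];
        px.b[i] = p[3*i + 2];
    }
    return px;
}
```

Under a component-alignment policy the check stays at 2 bytes; a
full-pixel policy has no natural power-of-two alignment for a 6-byte
pixel, which is part of why the question is open.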

Change-Id: I13b11cde11b46c811dbbc4b547e6aa377442ce09
2 files changed