restore vmull_u8() in color32()

vmull_u8() does u8 * u8 -> u16, 8 at a time.  This keeps the loop as
tight as possible in NEON, basically {load,mull,addhn,store,loop}.
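
For reference, a scalar sketch of the per-byte arithmetic behind the
mull/addhn pair. This is not the code in this change: mull_u8(),
addhn_u16() and blend_bytes() are hypothetical names that model one
lane of vmull_u8() and vaddhn_u16(), and the blend formula (color plus
src scaled by scale/256, rounded) is an assumption for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of one lane of vmull_u8(): u8 * u8 -> u16.
// The largest product, 255 * 255 = 65025, still fits in 16 bits.
static inline uint16_t mull_u8(uint8_t a, uint8_t b) {
    return static_cast<uint16_t>(a * b);
}

// Scalar model of one lane of vaddhn_u16(): add modulo 2^16,
// then keep only the high 8 bits of the sum.
static inline uint8_t addhn_u16(uint16_t x, uint16_t y) {
    return static_cast<uint8_t>(static_cast<uint16_t>(x + y) >> 8);
}

// Hypothetical blend (an assumption, not the real color32()):
//   dst = (color * 256 + 128 + src * scale) >> 8   per byte.
// The caller is assumed to pick color and scale so the 16-bit sum
// does not wrap.
void blend_bytes(uint8_t* dst, const uint8_t* src, size_t n,
                 uint8_t color, uint8_t scale) {
    // Hoisted out of the loop: color in the high byte plus a rounding bias.
    const uint16_t color_and_round =
        static_cast<uint16_t>((color << 8) + 128);
    for (size_t i = 0; i < n; i++) {
        // Loop body is just {load, mull, addhn, store}.
        dst[i] = addhn_u16(color_and_round, mull_u8(src[i], scale));
    }
}
```

With the color and rounding term precomputed, each iteration needs only
the multiply and the narrowing add, which is what keeps the NEON loop
short.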

Drop N to 4 pixels at a time to make this easier.  Depending on how
performance charts go, I may circle back to bring this back up to 8.

Bug: chromium:952502