restore vmull_u8() in color32()

vmull_u8() does u8 * u8 -> u16, 8 at a time.  This keeps the loop as
tight as possible in NEON, basically {load,mull,addhn,store,loop}.
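
A rough portable sketch of that loop body, not the actual color32() code:
mull_u8() and addhn_u16() below are hypothetical scalar stand-ins that model
the per-lane behavior of the NEON vmull_u8() and vaddhn_u16() intrinsics, and
the scale/bias operands are assumed purely for illustration.

```cpp
#include <array>
#include <cstdint>

using U8x8  = std::array<uint8_t, 8>;   // models uint8x8_t
using U16x8 = std::array<uint16_t, 8>;  // models uint16x8_t

// Scalar model of vmull_u8(): widening multiply, u8 * u8 -> u16, 8 lanes.
// The largest product, 255 * 255 = 65025, still fits in a u16.
static U16x8 mull_u8(const U8x8& a, const U8x8& b) {
    U16x8 r{};
    for (int i = 0; i < 8; i++) {
        r[i] = static_cast<uint16_t>(a[i] * b[i]);
    }
    return r;
}

// Scalar model of vaddhn_u16(): add two u16 lanes (wrapping mod 2^16) and
// keep only the high byte of each sum, i.e. (a + b) >> 8 narrowed to u8.
static U8x8 addhn_u16(const U16x8& a, const U16x8& b) {
    U8x8 r{};
    for (int i = 0; i < 8; i++) {
        uint16_t sum = static_cast<uint16_t>(a[i] + b[i]);
        r[i] = static_cast<uint8_t>(sum >> 8);
    }
    return r;
}

// One {mull,addhn} step over 8 channels: (src * scale + bias) >> 8.
// The load, store and loop parts are left to the caller in this sketch.
static U8x8 scale_and_bias(const U8x8& src, const U8x8& scale, const U16x8& bias) {
    return addhn_u16(mull_u8(src, scale), bias);
}
```

Eight u8 lanes cover two 4-channel pixels, so under these assumptions a
4-pixel step would call scale_and_bias() twice per iteration.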

Drop N to 4 pixels at a time to make this easier.  Depending on how
performance charts go, I may circle back to bring this back up to 8.

Bug: chromium:952502
Change-Id: I17ba6b60c0cc6c6da71b05a4af269d87d76672b5
Auto-Submit: Mike Klein <>
Reviewed-by: Michael Ludwig <>
Commit-Queue: Mike Klein <>
1 file changed