restore vmull_u8() in color32()

vmull_u8() does u8 * u8 -> u16, 8 at a time.  This keeps the loop as
tight as possible in NEON, basically {load,mull,addhn,store,loop}.
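
For illustration only, here is a scalar sketch of what the mull and addhn
steps compute per byte lane. It is not the code in this change: mull_u8(),
addhn_u16(), and blend_row() are hypothetical names, and the blend shape
(color in the high byte plus a rounding bias of 128) is an assumption about
how such a loop is typically arranged.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar stand-in for one lane of vmull_u8(): u8 * u8 -> u16.
// The product cannot overflow, since 255 * 255 = 65025 fits in 16 bits.
static inline uint16_t mull_u8(uint8_t a, uint8_t b) {
    return static_cast<uint16_t>(a * b);
}

// Scalar stand-in for one lane of vaddhn_u16(): add two u16 values
// (wrapping at 16 bits) and keep only the high 8 bits of the sum.
static inline uint8_t addhn_u16(uint16_t a, uint16_t b) {
    return static_cast<uint8_t>(static_cast<uint16_t>(a + b) >> 8);
}

// Hypothetical blend over a row of bytes (4 bytes per pixel):
//   dst = (src * scale + (color << 8) + 128) >> 8
// The bias is loop-invariant per channel, so the per-byte work is just
// load, mull, addhn, store.
static void blend_row(uint8_t* dst, const uint8_t* src, size_t bytes,
                      const uint8_t color[4], uint8_t scale) {
    for (size_t i = 0; i < bytes; i++) {
        uint16_t bias = static_cast<uint16_t>((color[i % 4] << 8) + 128);
        dst[i] = addhn_u16(mull_u8(src[i], scale), bias);
    }
}
```

With NEON, one vmull_u8() covers 8 of these byte lanes, so a 4-pixel
(16-byte) step needs two of them.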

Drop N to 4 pixels at a time to make this easier.  Depending on how
performance charts go, I may circle back to bring this back up to 8.

Bug: chromium:952502
Change-Id: I17ba6b60c0cc6c6da71b05a4af269d87d76672b5
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/208140
Auto-Submit: Mike Klein <mtklein@google.com>
Reviewed-by: Michael Ludwig <michaelludwig@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
1 file changed