skia: add NEON version of blend32_16_row

This adds a NEON implementation of blend32_16_row
for aarch32 and aarch64.
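
For readers unfamiliar with the routine, the kind of per-pixel work being
vectorized can be sketched in scalar C++ as below. This is only an
illustrative reference, not the code in this change: the names
blend_pixel_565 and blend32_16_row_ref are hypothetical, and the
0xAARRGGBB layout, the premultiplied source color, and the truncating
rounding are all assumptions made for the sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar reference, NOT the implementation in this change.
// Assumptions: src is a premultiplied 32-bit color laid out as 0xAARRGGBB
// (each color channel <= alpha), and dst holds RGB565 pixels.
static inline uint16_t blend_pixel_565(uint32_t src, uint16_t dst) {
    const unsigned a  = (src >> 24) & 0xFF;
    const unsigned sr = (src >> 16) & 0xFF;
    const unsigned sg = (src >> 8) & 0xFF;
    const unsigned sb = src & 0xFF;

    // Inverse alpha mapped from [0, 255] to [1, 256], so ">> 8" acts as
    // the divide and a fully transparent source leaves dst unchanged.
    const unsigned scale = 256 - a;

    // Unpack the 5-6-5 destination channels.
    const unsigned dr = (dst >> 11) & 0x1F;
    const unsigned dg = (dst >> 5) & 0x3F;
    const unsigned db = dst & 0x1F;

    // src-over: result = src + dst * (1 - alpha), at 565 precision.
    // With a premultiplied source these sums stay within 5/6/5 bits.
    const unsigned r = (sr >> 3) + ((dr * scale) >> 8);
    const unsigned g = (sg >> 2) + ((dg * scale) >> 8);
    const unsigned b = (sb >> 3) + ((db * scale) >> 8);

    return static_cast<uint16_t>((r << 11) | (g << 5) | b);
}

// Blend one constant 32-bit color over a row of 565 pixels.
void blend32_16_row_ref(uint32_t src, uint16_t dst[], int count) {
    assert(count >= 0);
    for (int i = 0; i < count; ++i) {
        dst[i] = blend_pixel_565(src, dst[i]);
    }
}
```

Because the source color is the same for every pixel in the row, its
unpacking and the scale factor are loop-invariant, which is what makes a
loop like this a natural fit for SIMD processing of several pixels at once.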

For performance, blend32_16_row is called in the following nanobench tests:
 - Xfermode_SrcOver
 - tablebench
 - rotated_rects_bw_alternating_transparent_and_opaque_srcover
 - rotated_rects_bw_changing_transparent_srcover
 - rotated_rects_bw_same_transparent_srcover
 - luma_colorfilter_large
 - luma_colorfilter_small
 - chart_bw

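I see a clear performance improvement in the following two tests in
particular: median times drop by roughly a quarter to a third for
Xfermode_SrcOver and by more than a third for luma_colorfilter_large.
The other tests look about the same.
I ran each test twice.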

1) Xfermode_SrcOver
<org>
 - D/skia    ( 2000):    3M        57      17.3µs  17.4µs  17.4µs  17.7µs  1%
  █▃▂▃▂▂▂▁▃▂      565     Xfermode_SrcOver
 - D/skia    ( 1915):    3M        70      13.5µs  16.9µs  16.7µs  18.8µs  9%
  ▆█▄▅█▁▅▅▆▄      565     Xfermode_SrcOver

<new>
 - D/skia    ( 2000):    3M        8       11.6µs  11.8µs  12.1µs  14.4µs  7%
  ▃█▁▁▂▁▁▁▂▂      565     Xfermode_SrcOver
 - D/skia    ( 2004):    3M        62      10.3µs  12.9µs  13µs    15.2µs  11%
  █▅▅▆▁▅▅▅▇▃      565     Xfermode_SrcOver

2) luma_colorfilter_large
<org>
 - D/skia    ( 2000):  159M        8       136µs   136µs   136µs   139µs   1%
  █▃▁▂▁▁▁▁▁▁      565     luma_colorfilter_large
 - D/skia    ( 1915):  158M        2       135µs   177µs   182µs   269µs   22%
  ▆▃█▁▁▃▃▃▃▃      565     luma_colorfilter_large

<new>
 - D/skia    ( 2000):  157M        5       84.2µs  85.3µs  87.5µs  110µs   9%
  █▁▂▁▁▁▁▁▁▁      565     luma_colorfilter_large
 - D/skia    ( 2004):  159M        6       84.7µs  110µs   112µs   144µs   18%
  █▄▇▁▁▄▃▄▄▆      565     luma_colorfilter_large

Review URL: https://codereview.chromium.org/847363002
4 files changed