Faster and more accurate blit_row_s32a_opaque for ARM

Change ARM implementation of alpha blending to work on 8 pixels at a
time (using NEON). Also improve the accuracy of alpha blending by using
a formula based on SkMulDiv255Round rather than SkPMSrcOver.
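
The accuracy difference can be illustrated per channel with a minimal
scalar sketch. It assumes premultiplied 8-bit channels. The helper names
below are hypothetical stand-ins, not the code in this patch:
src_over_approx models the shift-based blend (scale by 256 - alpha, then
>> 8), and src_over_rounded models a blend built on a rounded divide by
255.

```cpp
#include <cstdint>

// Rounded a*b/255 for a, b in [0, 255], with no division.
// Hypothetical helper modelled on the usual "mul div 255 round" trick.
static inline uint8_t mul_div_255_round(uint8_t a, uint8_t b) {
    unsigned prod = static_cast<unsigned>(a) * b + 128;
    return static_cast<uint8_t>((prod + (prod >> 8)) >> 8);
}

// Shift-based blend: treats (255 - alpha) + 1 as a 1..256 scale and
// truncates with >> 8. Cheap, but it rounds down.
static inline uint8_t src_over_approx(uint8_t src, uint8_t dst,
                                      uint8_t srcAlpha) {
    unsigned scale = 256u - srcAlpha;
    return static_cast<uint8_t>(src + ((dst * scale) >> 8));
}

// Rounded blend: dst * (255 - alpha) / 255, rounded to nearest.
static inline uint8_t src_over_rounded(uint8_t src, uint8_t dst,
                                       uint8_t srcAlpha) {
    return static_cast<uint8_t>(
        src + mul_div_255_round(dst, static_cast<uint8_t>(255 - srcAlpha)));
}
```

For example, with src = 0, dst = 200 and alpha = 100 the exact value is
200 * 155 / 255 = 121.57; the shift-based sketch yields 121 while the
rounded one yields 122.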

Note that a number of variations of this code were considered. Here are
some notes:

- A 16-pixels-at-a-time version was considered. It performs well for
  the case of extreme alpha (all-opaque or all-transparent pixels), but
  performs worse than the 8-pixel version when there are frequent
  transitions of alpha. Also, gcc 6.2.1 seems to have trouble with
  register pressure when using this version.

- If the branch to detect the fully-opaque or fully-transparent cases
  is removed, then the performance increases significantly for images
  which are all partially transparent (especially on ARM Cortex A72),
  but can significantly decrease for images that are almost fully
  opaque or fully transparent.
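
The trade-off in the notes above can be sketched in portable scalar C++.
This is only an illustration under assumptions, not the NEON code in
this patch: pixels are assumed to be 32-bit premultiplied with alpha in
the top byte, and the names blend_pixel and blit_row_chunked are
hypothetical. Each group of 8 pixels is checked once for the
all-transparent and all-opaque cases before falling back to the general
blend.

```cpp
#include <cstddef>
#include <cstdint>

// Rounded a*b/255 for a, b in [0, 255] (hypothetical helper).
static inline uint32_t mul_div_255_round(uint32_t a, uint32_t b) {
    uint32_t prod = a * b + 128;
    return (prod + (prod >> 8)) >> 8;
}

// General src-over blend of one premultiplied pixel, channel by channel.
static inline uint32_t blend_pixel(uint32_t src, uint32_t dst) {
    uint32_t inv = 255 - (src >> 24);
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xFF;
        uint32_t d = (dst >> shift) & 0xFF;
        out |= ((s + mul_div_255_round(d, inv)) & 0xFF) << shift;
    }
    return out;
}

// Blend a row, 8 pixels per step, with fast paths for extreme alpha.
void blit_row_chunked(uint32_t* dst, const uint32_t* src, size_t count) {
    while (count >= 8) {
        uint32_t andA = 0xFF, orA = 0;
        for (int i = 0; i < 8; ++i) {
            uint32_t a = src[i] >> 24;
            andA &= a;  // stays 0xFF only if every alpha is 0xFF
            orA |= a;   // stays 0 only if every alpha is 0
        }
        if (orA == 0) {
            // All transparent: dst is left unchanged.
        } else if (andA == 0xFF) {
            // All opaque: src simply replaces dst.
            for (int i = 0; i < 8; ++i) dst[i] = src[i];
        } else {
            for (int i = 0; i < 8; ++i) dst[i] = blend_pixel(src[i], dst[i]);
        }
        src += 8;
        dst += 8;
        count -= 8;
    }
    // Tail: fewer than 8 pixels left, blend them one at a time.
    for (; count > 0; --count, ++src, ++dst) *dst = blend_pixel(*src, *dst);
}
```

Dropping the two fast-path branches would leave only the general blend,
which corresponds to the variant that favours partially transparent
images at the expense of mostly opaque or mostly transparent ones.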

This implementation is a compromise between the effects described above.
This patch produces a ~10% improvement on the nanobench sub-scores
repeatTile_BGRA_8888_A, constXTile_MM_filter_trans, constXTile_CC_trans,
constXTile_RR_filter_trans when running on ARM Cortex A72. Improvements
of greater magnitude (20% to 30%) are observed when running on ARM
Cortex A53.

CQ_INCLUDE_TRYBOTS=skia.primary:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD
Change-Id: I1f0c9f549057613bbffd26e6651f3beeb0019af9
Bug: skia:
Reviewed-on: https://skia-review.googlesource.com/16520
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
1 file changed