first pass at using NEON

NEON has all sorts of handy instructions for us to play with:
  - deinterlacing loads and interlacing stores
  - vectorized bswap
  - f16<->f32 conversion
These should be faster and, if nothing else, they cut about 1.2KB of code.
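
As a rough illustration of the first two bullets (a sketch, not the code in
this CL): the helper name load_rgb_161616_swapped is hypothetical, and the
scalar branch exists only so the snippet also builds on non-NEON hosts.  It
assumes the channels always need a byte swap, i.e. big-endian data on a
little-endian host.

```cpp
#include <cstdint>
#if defined(__ARM_NEON)
    #include <arm_neon.h>
#endif

// Hypothetical helper (not from the CL): split 4 interleaved RGB pixels with
// 16-bit channels into planar r, g, b arrays, byte-swapping every channel.
static void load_rgb_161616_swapped(const uint16_t* src,
                                    uint16_t r[4], uint16_t g[4], uint16_t b[4]) {
#if defined(__ARM_NEON)
    // Deinterleaving load: val[0] holds the 4 reds, val[1] greens, val[2] blues.
    uint16x4x3_t v = vld3_u16(src);

    // Vectorized bswap: vrev16_u8 reverses the two bytes inside each 16-bit lane.
    vst1_u16(r, vreinterpret_u16_u8(vrev16_u8(vreinterpret_u8_u16(v.val[0]))));
    vst1_u16(g, vreinterpret_u16_u8(vrev16_u8(vreinterpret_u8_u16(v.val[1]))));
    vst1_u16(b, vreinterpret_u16_u8(vrev16_u8(vreinterpret_u8_u16(v.val[2]))));
#else
    // Portable fallback doing the same work one channel at a time.
    auto bswap16 = [](uint16_t x) { return (uint16_t)((x >> 8) | (x << 8)); };
    for (int i = 0; i < 4; i++) {
        r[i] = bswap16(src[3*i + 0]);
        g[i] = bswap16(src[3*i + 1]);
        b[i] = bswap16(src[3*i + 2]);
    }
#endif
}
```

The interlacing store direction would presumably be the mirror image, using
vst3_u16 on a uint16x4x3_t.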

f16 conversion support is indicated by a separate feature bit
(__ARM_FP & 2), which is always set on ARMv8 but not necessarily on
ARMv7.  One day I may need to add an ARMv7 bot...
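
A minimal sketch of how that bit might gate the conversion, under stated
assumptions: HAVE_NEON_F16 and half_to_float4 are hypothetical names, not
identifiers from this CL, and the scalar fallback is deliberately simplified
(normal values and zero only).

```cpp
#include <cstdint>
#include <cstring>

// Bit 1 of __ARM_FP (value 2) signals hardware half-precision support.
#if defined(__ARM_NEON) && defined(__ARM_FP) && (__ARM_FP & 2)
    #include <arm_neon.h>
    #define HAVE_NEON_F16 1   // hypothetical macro name
#else
    #define HAVE_NEON_F16 0
#endif

// Hypothetical helper: convert 4 IEEE half floats (stored as raw bits) to float.
static void half_to_float4(const uint16_t h[4], float f[4]) {
#if HAVE_NEON_F16
    // One instruction widens all four lanes from f16 to f32.
    float16x4_t v = vreinterpret_f16_u16(vld1_u16(h));
    vst1q_f32(f, vcvt_f32_f16(v));
#else
    // Simplified fallback: denormals flush to zero, inf/NaN are not handled.
    for (int i = 0; i < 4; i++) {
        uint32_t sign = (uint32_t)(h[i] & 0x8000u) << 16;
        uint32_t exp  = (h[i] >> 10) & 0x1fu;
        uint32_t man  = h[i] & 0x3ffu;
        // Rebias the exponent from 15 to 127 and left-align the mantissa.
        uint32_t bits = exp == 0 ? sign
                                 : sign | ((exp + 112u) << 23) | (man << 13);
        std::memcpy(&f[i], &bits, sizeof(bits));
    }
#endif
}
```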

The _888 and _8888 stages might be able to benefit from NEON too, but
they're a little more awkward because there's no uint8x4_t in NEON, so
I've left them out of this CL.  The _8888 stages might be hard to beat
as-is.  Not much we can do for 1010102 AFAIK.
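
To illustrate the awkwardness (my own sketch, not something this CL does):
with no 4-lane byte vector, getting 4 bytes into 4 float lanes means going
through a half-used uint8x8_t and widening twice.  bytes_to_float4 is a
hypothetical name, and the NEON path assumes a little-endian target.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__ARM_NEON)
    #include <arm_neon.h>
#endif

// Hypothetical helper: widen 4 bytes to 4 floats.
static void bytes_to_float4(const uint8_t src[4], float dst[4]) {
#if defined(__ARM_NEON)
    // Pack the 4 bytes into one 32-bit value; the smallest byte vector is
    // uint8x8_t, so its upper 4 lanes just hold a duplicate we ignore.
    uint32_t packed;
    std::memcpy(&packed, src, sizeof(packed));
    uint8x8_t  v8  = vreinterpret_u8_u32(vdup_n_u32(packed));

    uint16x8_t v16 = vmovl_u8(v8);                    // 8 x u8  -> 8 x u16
    uint32x4_t v32 = vmovl_u16(vget_low_u16(v16));    // low 4 u16 -> 4 x u32
    vst1q_f32(dst, vcvtq_f32_u32(v32));               // 4 x u32 -> 4 x f32
#else
    // Portable fallback so the sketch also builds without NEON.
    for (int i = 0; i < 4; i++) {
        dst[i] = (float)src[i];
    }
#endif
}
```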

Change-Id: I78f7c9992773d258702c9a4e7d9a46d7b85b9589
Reviewed-on: https://skia-review.googlesource.com/96868
Reviewed-by: Brian Osman <brianosman@google.com>
4 files changed