overhaul blit_row_s32a_opaque()

  - Remove branching on source alpha, which
    makes Skia susceptible to timing attacks.

  - Remove SSE4.1 variant, which is nearly identical
    to the SSE2 code once branching's removed.

  - Reroll SIMD loops back to their native vector
    size, leaving unrolling to the compiler.

  - Allow wider SIMD sets to cascade down into the
    narrower ones for the last few pixels instead of
    always hitting the scalar fallback.

  - Move code around, rewrite, refactor, etc. so it
    all reads more consistently.
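
  As a rough, portable illustration of the first and fourth
  points above (not the code in this change): the sketch below
  blends premultiplied 32-bit pixels with no branch on source
  alpha, then walks a row in 8-pixel chunks, 4-pixel chunks,
  and finally single pixels. The names srcover_px(),
  srcover_chunk() and blit_row_sketch() are hypothetical, the
  chunk widths are assumed, and the fixed-size scalar loops
  only stand in for real SIMD bodies.

```cpp
#include <cstdint>

// Hypothetical helper: src-over for one premultiplied 8888 pixel with
// alpha in the top byte. The same arithmetic runs whatever the source
// alpha is, so there is no data-dependent branch.
static inline uint32_t srcover_px(uint32_t src, uint32_t dst) {
    const uint32_t mask  = 0x00FF00FF;
    const uint32_t scale = 256 - (src >> 24);          // 1..256
    const uint32_t rb = ((dst & mask) * scale) >> 8;   // two channels at once
    const uint32_t ag = ((dst >> 8) & mask) * scale;   // the other two
    return src + ((rb & mask) | (ag & ~mask));
}

// Stand-in for a SIMD body that handles N pixels per iteration.
// A real version would use intrinsics; this fixed-trip-count loop
// just leaves vectorizing and unrolling to the compiler.
template <int N>
static inline void srcover_chunk(uint32_t* dst, const uint32_t* src) {
    for (int i = 0; i < N; i++) {
        dst[i] = srcover_px(src[i], dst[i]);
    }
}

// Cascade: the widest body runs first, the leftovers fall through to
// the narrower body, and only the last few pixels reach the scalar loop.
void blit_row_sketch(uint32_t* dst, const uint32_t* src, int len) {
    while (len >= 8) {   // assumed "wide" vector width
        srcover_chunk<8>(dst, src);
        dst += 8; src += 8; len -= 8;
    }
    while (len >= 4) {   // assumed "narrow" vector width
        srcover_chunk<4>(dst, src);
        dst += 4; src += 4; len -= 4;
    }
    while (len > 0) {    // scalar fallback for the final 0-3 pixels
        *dst = srcover_px(*src, *dst);
        dst++; src++; len--;
    }
}
```

  With a 15-pixel row, for example, this runs one 8-pixel
  chunk, one 4-pixel chunk, and three scalar pixels, rather
  than one 8-pixel chunk and seven scalar pixels.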

  blit_row_color32() has not changed at all here,
  just moved to the bottom of the file to prevent
  it from interrupting blit_row_s32a_opaque().

  With this change, I can no longer see the timing
  reconstructions in the Timing sample anywhere I've tested.

