handle any N in aarch64 JIT code

Add a tail loop to handle elements one at a time.

Just like in the interpreter, the only instructions
that need to change are the loads and stores:
16-byte accesses become 4-byte, and 4-byte accesses become 1-byte.
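
As a rough C++ sketch of that loop structure (not the JIT code itself;
square_all is a hypothetical helper, and float lanes are assumed because
the example program squares floats): the body loop moves 16 bytes at a
time, and the tail loop runs the same arithmetic but moves only 4 bytes.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper for illustration only, not a Skia API.
// Squares n floats in place: 4 at a time, then one at a time.
void square_all(float* ptr, size_t n) {
    // Body loop: 16-byte load, 4-lane multiply, 16-byte store.
    while (n >= 4) {
        float v[4];
        std::memcpy(v, ptr, 16);
        for (int i = 0; i < 4; i++) { v[i] *= v[i]; }
        std::memcpy(ptr, v, 16);
        ptr += 4;
        n   -= 4;
    }
    // Tail loop: same 4-lane multiply, but only 4 bytes are loaded
    // and stored, so memory past the last element is never touched.
    while (n > 0) {
        float v[4] = {0, 0, 0, 0};
        std::memcpy(v, ptr, 4);
        for (int i = 0; i < 4; i++) { v[i] *= v[i]; }
        std::memcpy(ptr, v, 4);
        ptr += 1;
        n   -= 1;
    }
}
```

With N=15 this would take three trips through the body loop and three
through the tail loop.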

With this, we can mark the interpreter as SkUNREACHABLE,
and it compiles away completely, saving a few KB.

Example profile for the SkVMTool float-squaring program
running N=15 over and over:

    Samples│
           │      skvm-jit-3663518994():
        42 │40:   cmp    x0, #0x4
           │44: ↓ b.lt   60
        51 │48:   ldr    q0, [x1]
       197 │4c:   mul    v0.4s, v0.4s, v0.4s
       135 │50:   str    q0, [x1]
           │54:   add    x1, x1, #0x10
        43 │58:   sub    x0, x0, #0x4
           │5c:   b.al   40
       150 │60: ↓ cbz    x0, 7c
        67 │64:   ldr    s0, [x1]
       130 │68:   mul    v0.4s, v0.4s, v0.4s
       135 │6c:   str    s0, [x1]
        18 │70:   add    x1, x1, #0x4
        17 │74:   sub    x0, x0, #0x1
        20 │78:   b.al   60
       124 │7c: ← ret

Change-Id: I153d7bc247942366a686e30a9cad60c935f754ed
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/227138
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Herb Derby <herb@google.com>
1 file changed