Neon/AArch64: Explicitly unroll quant loop w/Clang

The loop in jsimd_quantize_neon() is only executed twice and should be
unrolled for AArch64 targets.  GCC does that by default, but Clang 11
and later versions available at the time of this writing do not.  This
patch adds an unroll pragma when targetting AArch64 with Clang.  We do
not use the unroll pragma for AArch32 targets, because it causes the
Clang-generated assembly code to exhaust the available Neon registers
(32 x 64-bit) and spill to the stack.  (DRC: Referring to the discussion
in #570, this is likely due to compiler confusion that results in poor
register allocation.  It is possible to eliminate the spillage and
reduce the instruction count by loading the data on a just-in-time
basis, thus explicitly interleaving compute and I/O, but the performance
implications of that are currently unknown.)

The effects of unrolling the quantization loop are:
1) elimination of the loop control flow overhead and
2) enabling the use of LDP/STP instructions that work from a single
   base pointer, instead of using double the number of LDR/STR
   instructions, each requiring an address calculation.

Closes #570
diff --git a/simd/arm/jquanti-neon.c b/simd/arm/jquanti-neon.c
index a7eb6f1..d5d95d8 100644
--- a/simd/arm/jquanti-neon.c
+++ b/simd/arm/jquanti-neon.c
@@ -1,7 +1,7 @@
 /*
  * jquanti-neon.c - sample data conversion and quantization (Arm Neon)
  *
- * Copyright (C) 2020, Arm Limited.  All Rights Reserved.
+ * Copyright (C) 2020-2021, Arm Limited.  All Rights Reserved.
  *
  * This software is provided 'as-is', without any express or implied
  * warranty.  In no event will the authors be held liable for any damages
@@ -100,6 +100,9 @@
   DCTELEM *shift_ptr = divisors + 3 * DCTSIZE2;
   int i;
 
+#if defined(__clang__) && (defined(__aarch64__) || defined(_M_ARM64))
+#pragma unroll
+#endif
   for (i = 0; i < DCTSIZE; i += DCTSIZE / 2) {
     /* Load DCT coefficients. */
     int16x8_t row0 = vld1q_s16(workspace + (i + 0) * DCTSIZE);