Neon/AArch64: Explicitly unroll quant loop w/Clang

The loop in jsimd_quantize_neon() is only executed twice and should be unrolled for AArch64 targets. GCC does that by default, but Clang 11 and later versions available at the time of this writing do not. This patch adds an unroll pragma when targeting AArch64 with Clang.

We do not use the unroll pragma for AArch32 targets, because it causes the Clang-generated assembly code to exhaust the available Neon registers (32 x 64-bit) and spill to the stack. (DRC: Referring to the discussion in #570, this is likely due to compiler confusion that results in poor register allocation. It is possible to eliminate the spillage and reduce the instruction count by loading the data on a just-in-time basis, thus explicitly interleaving compute and I/O, but the performance implications of that are currently unknown.)

The effects of unrolling the quantization loop are:
1) elimination of the loop control flow overhead, and
2) enabling the use of LDP/STP instructions that work from a single base pointer, instead of using double the number of LDR/STR instructions, each requiring an address calculation.

Closes #570
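The guard described above can be tried in isolation with a minimal, portable sketch. The function name scale_block_sketch() and its scalar body are hypothetical stand-ins, not code from libjpeg-turbo; only the preprocessor condition, the pragma, and the two-iteration loop header mirror the patch. On any compiler other than Clang targeting AArch64, the guard compiles away and the loop is left to the optimizer's default heuristics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DCTSIZE 8  /* same value libjpeg-turbo uses for an 8x8 DCT block */

/* Hypothetical stand-in for the quantization loop: multiplies each element
 * of an 8x8 block by the matching element of a scale table.  The outer loop
 * runs exactly twice (i = 0 and i = 4), handling four rows per iteration,
 * which is the loop shape the patch asks Clang to unroll.
 */
static void scale_block_sketch(int16_t *block, const int16_t *scale)
{
  int i, j;

  assert(block != NULL && scale != NULL);

  /* Request unrolling only for Clang on AArch64, as in the patch.  Other
   * compilers and targets never see the pragma.
   */
#if defined(__clang__) && (defined(__aarch64__) || defined(_M_ARM64))
#pragma unroll
#endif
  for (i = 0; i < DCTSIZE; i += DCTSIZE / 2) {
    /* Process four consecutive rows (32 elements) starting at row i. */
    for (j = 0; j < (DCTSIZE / 2) * DCTSIZE; j++) {
      int idx = i * DCTSIZE + j;

      block[idx] = (int16_t)(block[idx] * scale[idx]);
    }
  }
}
```

Whether the pragma actually changes the generated code can be checked by comparing the assembly output (for example, from clang -O2 -S) with and without the guard. This scalar sketch is not expected to reproduce the LDP/STP pairing, which depends on the Neon loads and stores in the real function.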

commit: c5f269eb9665435271c05fbcaf8721fa58e9eafa
author: Jonathan Wright <jonathan.wright@arm.com> Fri Sep 03 11:52:40 2021 +0100
committer: DRC <information@libjpeg-turbo.org> Fri Feb 25 12:53:05 2022 -0600
tree: 7bade4ad954fa3a6f959cc375af2494b0c9ec230
parent: 98bc3eeb3abce0424f7063ecce06d214256015a5
diff --git a/simd/arm/jquanti-neon.c b/simd/arm/jquanti-neon.c
index a7eb6f1..d5d95d8 100644
--- a/simd/arm/jquanti-neon.c
+++ b/simd/arm/jquanti-neon.c
@@ -1,7 +1,7 @@
 /*
  * jquanti-neon.c - sample data conversion and quantization (Arm Neon)
  *
- * Copyright (C) 2020, Arm Limited.  All Rights Reserved.
+ * Copyright (C) 2020-2021, Arm Limited.  All Rights Reserved.
  *
  * This software is provided 'as-is', without any express or implied
  * warranty.  In no event will the authors be held liable for any damages
@@ -100,6 +100,9 @@
   DCTELEM *shift_ptr = divisors + 3 * DCTSIZE2;
   int i;
 
+#if defined(__clang__) && (defined(__aarch64__) || defined(_M_ARM64))
+#pragma unroll
+#endif
   for (i = 0; i < DCTSIZE; i += DCTSIZE / 2) {
     /* Load DCT coefficients. */
     int16x8_t row0 = vld1q_s16(workspace + (i + 0) * DCTSIZE);