"ARM"="Arm", "NEON"="Neon"

Refer to:
https://www.arm.com/company/policies/trademarks/arm-trademark-list/arm-trademark
https://www.arm.com/company/policies/trademarks/arm-trademark-list/neon-trademark

NOTE: These changes are applied only to change log entries for 2.0.x and
later, since the change log is a historical record and Arm's new
trademark policy did not go into effect until late 2017.
diff --git a/BUILDING.md b/BUILDING.md
index a4ae1e0..6828809 100644
--- a/BUILDING.md
+++ b/BUILDING.md
@@ -398,8 +398,8 @@
 Building libjpeg-turbo for iOS
 ------------------------------
 
-iOS platforms, such as the iPhone and iPad, use ARM processors, and all
-currently supported models include NEON instructions.  Thus, they can take
+iOS platforms, such as the iPhone and iPad, use Arm processors, and all
+currently supported models include Neon instructions.  Thus, they can take
 advantage of libjpeg-turbo's SIMD extensions to significantly accelerate JPEG
 compression/decompression.  This section describes how to build libjpeg-turbo
 for these platforms.
@@ -412,7 +412,7 @@
   it should be installed in your `PATH`.
 
 
-### ARMv7 (32-bit)
+### Armv7 (32-bit)
 
 **gas-preprocessor.pl required**
 
@@ -465,7 +465,7 @@
     make
 
 
-### ARMv7s (32-bit)
+### Armv7s (32-bit)
 
 **gas-preprocessor.pl required**
 
@@ -493,13 +493,13 @@
 
 #### Xcode 5 and later (Clang)
 
-Same as the ARMv7 build procedure for Xcode 5 and later, except replace the
+Same as the Armv7 build procedure for Xcode 5 and later, except replace the
 compiler flags as follows:
 
     export CFLAGS="-Wall -mfloat-abi=softfp -arch armv7s -miphoneos-version-min=6.0"
 
 
-### ARMv8 (64-bit)
+### Armv8 (64-bit)
 
 **gas-preprocessor.pl required if using Xcode < 6**
 
@@ -523,7 +523,7 @@
       [additional CMake flags] {source_directory}
     make
 
-Once built, lipo can be used to combine the ARMv7, v7s, and/or v8 variants into
+Once built, lipo can be used to combine the Armv7, v7s, and/or v8 variants into
 a universal library.
 
 
@@ -534,7 +534,7 @@
 [Android NDK](https://developer.android.com/tools/sdk/ndk).
 
 
-### ARMv7 (32-bit)
+### Armv7 (32-bit)
 
 The following is a general recipe script that can be modified for your specific
 needs.
@@ -559,7 +559,7 @@
     make
 
 
-### ARMv8 (64-bit)
+### Armv8 (64-bit)
 
 The following is a general recipe script that can be modified for your specific
 needs.
@@ -742,21 +742,21 @@
 
     make udmg
 
-This creates a Mac package/disk image that contains universal x86-64/i386/ARM
+This creates a Mac package/disk image that contains universal x86-64/i386/Arm
 binaries.  The following CMake variables control which architectures are
 included in the universal binaries.  Setting any of these variables to an empty
 string excludes that architecture from the package.
 
 * `OSX_32BIT_BUILD`: Directory containing an i386 (32-bit) Mac build of
   libjpeg-turbo (default: *{source_directory}*/osxx86)
-* `IOS_ARMV7_BUILD`: Directory containing an ARMv7 (32-bit) iOS build of
+* `IOS_ARMV7_BUILD`: Directory containing an Armv7 (32-bit) iOS build of
   libjpeg-turbo (default: *{source_directory}*/iosarmv7)
-* `IOS_ARMV7S_BUILD`: Directory containing an ARMv7s (32-bit) iOS build of
+* `IOS_ARMV7S_BUILD`: Directory containing an Armv7s (32-bit) iOS build of
   libjpeg-turbo (default: *{source_directory}*/iosarmv7s)
-* `IOS_ARMV8_BUILD`: Directory containing an ARMv8 (64-bit) iOS build of
+* `IOS_ARMV8_BUILD`: Directory containing an Armv8 (64-bit) iOS build of
   libjpeg-turbo (default: *{source_directory}*/iosarmv8)
 
-You should first use CMake to configure i386, ARMv7, ARMv7s, and/or ARMv8
+You should first use CMake to configure i386, Armv7, Armv7s, and/or Armv8
 sub-builds of libjpeg-turbo (see "Build Recipes" and "Building libjpeg-turbo
 for iOS" above) in build directories that match those specified in the
 aforementioned CMake variables.  Next, configure the primary build of
diff --git a/ChangeLog.md b/ChangeLog.md
index e496281..b04ba36 100644
--- a/ChangeLog.md
+++ b/ChangeLog.md
@@ -20,8 +20,8 @@
      - Fixed an issue whereby `jpeg_skip_scanlines()` always returned 0 when
 skipping past the end of an image.
 
-3. The ARM 64-bit (ARMv8) NEON SIMD extensions can now be built using MinGW
-toolchains targetting ARM64 (AArch64) Windows binaries.
+3. The Arm 64-bit (Armv8) Neon SIMD extensions can now be built using MinGW
+toolchains targetting Arm64 (AArch64) Windows binaries.
 
 4. Fixed unexpected visual artifacts that occurred when using
 `jpeg_crop_scanline()` and interblock smoothing while decompressing only the DC
@@ -94,7 +94,7 @@
 (unlike the decompressor) is not generally exposed to arbitrary data exploits,
 this issue did not likely pose a security risk.
 
-6. The ARM 64-bit (ARMv8) NEON SIMD assembly code now stores constants in a
+6. The Arm 64-bit (Armv8) Neon SIMD assembly code now stores constants in a
 separate read-only data section rather than in the text section, to support
 execute-only memory layouts.
 
@@ -380,7 +380,7 @@
 now produces bitwise-identical results to the unmerged algorithms.
 
 12. The SIMD function symbols for x86[-64]/ELF, MIPS/ELF, macOS/x86[-64] (if
-libjpeg-turbo is built with YASM), and iOS/ARM[64] builds are now private.
+libjpeg-turbo is built with YASM), and iOS/Arm[64] builds are now private.
 This prevents those symbols from being exposed in applications or shared
 libraries that link statically with libjpeg-turbo.
 
diff --git a/README.md b/README.md
index c88d3f5..1ff632e 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@
 ==========
 
 libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
-baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
+baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
 MIPS systems, as well as progressive JPEG compression on x86 and x86-64
 systems.  On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
 all else being equal.  On other types of systems, libjpeg-turbo can still
diff --git a/cmakescripts/BuildPackages.cmake b/cmakescripts/BuildPackages.cmake
index 27e4c9e..b98fa93 100644
--- a/cmakescripts/BuildPackages.cmake
+++ b/cmakescripts/BuildPackages.cmake
@@ -137,13 +137,13 @@
   "Directory containing 32-bit (i386) Mac build to include in universal binaries (default: ${DEFAULT_OSX_32BIT_BUILD})")
 set(DEFAULT_IOS_ARMV7_BUILD ${CMAKE_SOURCE_DIR}/iosarmv7)
 set(IOS_ARMV7_BUILD ${DEFAULT_IOS_ARMV7_BUILD} CACHE PATH
-  "Directory containing ARMv7 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7_BUILD})")
+  "Directory containing Armv7 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7_BUILD})")
 set(DEFAULT_IOS_ARMV7S_BUILD ${CMAKE_SOURCE_DIR}/iosarmv7s)
 set(IOS_ARMV7S_BUILD ${DEFAULT_IOS_ARMV7S_BUILD} CACHE PATH
-  "Directory containing ARMv7s iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7S_BUILD})")
+  "Directory containing Armv7s iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV7S_BUILD})")
 set(DEFAULT_IOS_ARMV8_BUILD ${CMAKE_SOURCE_DIR}/iosarmv8)
 set(IOS_ARMV8_BUILD ${DEFAULT_IOS_ARMV8_BUILD} CACHE PATH
-  "Directory containing ARMv8 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV8_BUILD})")
+  "Directory containing Armv8 iOS build to include in universal binaries (default: ${DEFAULT_IOS_ARMV8_BUILD})")
 
 set(OSX_APP_CERT_NAME "" CACHE STRING
   "Name of the Developer ID Application certificate (in the macOS keychain) that should be used to sign the libjpeg-turbo DMG.  Leave this blank to generate an unsigned DMG.")
diff --git a/jchuff.c b/jchuff.c
index cb05055..db85ce1 100644
--- a/jchuff.c
+++ b/jchuff.c
@@ -34,10 +34,10 @@
  * memory footprint by 64k, which is important for some mobile applications
  * that create many isolated instances of libjpeg-turbo (web browsers, for
  * instance.)  This may improve performance on some mobile platforms as well.
- * This feature is enabled by default only on ARM processors, because some x86
+ * This feature is enabled by default only on Arm processors, because some x86
  * chips have a slow implementation of bsr, and the use of clz/bsr cannot be
  * shown to have a significant performance impact even on the x86 chips that
- * have a fast implementation of it.  When building for ARMv6, you can
+ * have a fast implementation of it.  When building for Armv6, you can
  * explicitly disable the use of clz/bsr by adding -mthumb to the compiler
  * flags (this defines __thumb__).
  */
diff --git a/jcphuff.c b/jcphuff.c
index 8c4efaf..a8b94be 100644
--- a/jcphuff.c
+++ b/jcphuff.c
@@ -43,10 +43,10 @@
  * memory footprint by 64k, which is important for some mobile applications
  * that create many isolated instances of libjpeg-turbo (web browsers, for
  * instance.)  This may improve performance on some mobile platforms as well.
- * This feature is enabled by default only on ARM processors, because some x86
+ * This feature is enabled by default only on Arm processors, because some x86
  * chips have a slow implementation of bsr, and the use of clz/bsr cannot be
  * shown to have a significant performance impact even on the x86 chips that
- * have a fast implementation of it.  When building for ARMv6, you can
+ * have a fast implementation of it.  When building for Armv6, you can
  * explicitly disable the use of clz/bsr by adding -mthumb to the compiler
  * flags (this defines __thumb__).
  */
diff --git a/release/ReadMe.txt b/release/ReadMe.txt
index 0a08711..0d1888d 100644
--- a/release/ReadMe.txt
+++ b/release/ReadMe.txt
@@ -1,4 +1,4 @@
-libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86 and x86-64 systems.  On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal.  On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines.  In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs.
+libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86 and x86-64 systems.  On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal.  On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines.  In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs.
 
 libjpeg-turbo implements both the traditional libjpeg API as well as the less powerful but more straightforward TurboJPEG API.  libjpeg-turbo also features colorspace extensions that allow it to compress from/decompress to 32-bit and big-endian pixel buffers (RGBX, XBGR, etc.), as well as a full-featured Java interface.
 
diff --git a/release/deb-control.in b/release/deb-control.in
index c41c9a7..b82bdac 100644
--- a/release/deb-control.in
+++ b/release/deb-control.in
@@ -9,7 +9,7 @@
 Installed-Size: {__SIZE}
 Description: A SIMD-accelerated JPEG codec that provides both the libjpeg and TurboJPEG APIs
  libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
- baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
+ baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
  MIPS systems, as well as progressive JPEG compression on x86 and x86-64
  systems.  On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
  all else being equal.  On other types of systems, libjpeg-turbo can still
diff --git a/release/makemacpkg.in b/release/makemacpkg.in
index bbbfe6f..ae80bec 100755
--- a/release/makemacpkg.in
+++ b/release/makemacpkg.in
@@ -223,15 +223,15 @@
 }
 
 if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV7" != "" ]; then
-	install_ios $BUILDDIRARMV7 ARMv7 armv7 arm
+	install_ios $BUILDDIRARMV7 Armv7 armv7 arm
 fi
 
 if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV7S" != "" ]; then
-	install_ios $BUILDDIRARMV7S ARMv7s armv7s arm
+	install_ios $BUILDDIRARMV7S Armv7s armv7s arm
 fi
 
 if [ $UNIVERSAL = 1 -a "$BUILDDIRARMV8" != "" ]; then
-	install_ios $BUILDDIRARMV8 ARMv8 armv8 arm64
+	install_ios $BUILDDIRARMV8 Armv8 armv8 arm64
 fi
 
 install_name_tool -id $LIBDIR/$LIBJPEG_DSO_NAME $PKGROOT/$LIBDIR/$LIBJPEG_DSO_NAME
diff --git a/release/rpm.spec.in b/release/rpm.spec.in
index 83a1669..f8db764 100644
--- a/release/rpm.spec.in
+++ b/release/rpm.spec.in
@@ -52,7 +52,7 @@
 
 %description
 libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate
-baseline JPEG compression and decompression on x86, x86-64, ARM, PowerPC, and
+baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and
 MIPS systems, as well as progressive JPEG compression on x86 and x86-64
 systems.  On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg,
 all else being equal.  On other types of systems, libjpeg-turbo can still
diff --git a/simd/CMakeLists.txt b/simd/CMakeLists.txt
index 5c8009a..ba0bd13 100644
--- a/simd/CMakeLists.txt
+++ b/simd/CMakeLists.txt
@@ -205,7 +205,7 @@
 
 
 ###############################################################################
-# ARM (GAS)
+# Arm (GAS)
 ###############################################################################
 
 elseif(CPU_TYPE STREQUAL "arm64" OR CPU_TYPE STREQUAL "arm")
diff --git a/simd/arm/jsimd.c b/simd/arm/jsimd.c
index 45f9b04..709656c 100644
--- a/simd/arm/jsimd.c
+++ b/simd/arm/jsimd.c
@@ -13,7 +13,7 @@
  *
  * This file contains the interface between the "normal" portions
  * of the library and the SIMD implementations when running on a
- * 32-bit ARM architecture.
+ * 32-bit Arm architecture.
  */
 
 #define JPEG_INTERNALS
@@ -118,7 +118,7 @@
 #if defined(__ARM_NEON__)
   simd_support |= JSIMD_NEON;
 #elif defined(__linux__) || defined(ANDROID) || defined(__ANDROID__)
-  /* We still have a chance to use NEON regardless of globally used
+  /* We still have a chance to use Neon regardless of globally used
    * -mcpu/-mfpu options passed to gcc by performing runtime detection via
    * /proc/cpuinfo parsing on linux/android */
   while (!parse_proc_cpuinfo(bufsize)) {
diff --git a/simd/arm/jsimd_neon.S b/simd/arm/jsimd_neon.S
index af929fe..f8f0dad 100644
--- a/simd/arm/jsimd_neon.S
+++ b/simd/arm/jsimd_neon.S
@@ -1,5 +1,5 @@
 /*
- * ARMv7 NEON optimizations for libjpeg-turbo
+ * Armv7 Neon optimizations for libjpeg-turbo
  *
  * Copyright (C) 2009-2011, Nokia Corporation and/or its subsidiary(-ies).
  *                          All Rights Reserved.
@@ -229,7 +229,7 @@
     ROW7L           .req d30
     ROW7R           .req d31
 
-    /* Load and dequantize coefficients into NEON registers
+    /* Load and dequantize coefficients into Neon registers
      * with the following allocation:
      *       0 1 2 3 | 4 5 6 7
      *      ---------+--------
@@ -261,7 +261,7 @@
     vld1.16         {d0, d1, d2, d3}, [ip, :128]  /* load constants */
     add             ip, ip, #16
     vmul.s16        q15, q15, q3
-    vpush           {d8-d15}                      /* save NEON registers */
+    vpush           {d8-d15}                      /* save Neon registers */
     /* 1-D IDCT, pass 1, left 4x8 half */
     vadd.s16        d4, ROW7L, ROW3L
     vadd.s16        d5, ROW5L, ROW1L
@@ -507,7 +507,7 @@
     vqrshrn.s16     d17, q9, #2
     vqrshrn.s16     d18, q10, #2
     vqrshrn.s16     d19, q11, #2
-    vpop            {d8-d15}                      /* restore NEON registers */
+    vpop            {d8-d15}                      /* restore Neon registers */
     vqrshrn.s16     d20, q12, #2
       /* Transpose the final 8-bit samples and do signed->unsigned conversion */
       vtrn.16         q8, q9
@@ -688,7 +688,7 @@
  * function from jidctfst.c
  *
  * Normally 1-D AAN DCT needs 5 multiplications and 29 additions.
- * But in ARM NEON case some extra additions are required because VQDMULH
+ * But in Arm Neon case some extra additions are required because VQDMULH
  * instruction can't handle the constants larger than 1. So the expressions
  * like "x * 1.082392200" have to be converted to "x * 0.082392200 + x",
  * which introduces an extra addition. Overall, there are 6 extra additions
@@ -718,7 +718,7 @@
     TMP3            .req r2
     TMP4            .req ip
 
-    /* Load and dequantize coefficients into NEON registers
+    /* Load and dequantize coefficients into Neon registers
      * with the following allocation:
      *       0 1 2 3 | 4 5 6 7
      *      ---------+--------
@@ -749,7 +749,7 @@
     vmul.s16        q13, q13, q1
     vld1.16         {d0}, [ip, :64]  /* load constants */
     vmul.s16        q15, q15, q3
-    vpush           {d8-d13}         /* save NEON registers */
+    vpush           {d8-d13}         /* save Neon registers */
     /* 1-D IDCT, pass 1 */
     vsub.s16        q2, q10, q14
     vadd.s16        q14, q10, q14
@@ -842,7 +842,7 @@
     vadd.s16        q14, q5, q3
     vsub.s16        q9, q5, q3
     vsub.s16        q13, q10, q2
-    vpop            {d8-d13}      /* restore NEON registers */
+    vpop            {d8-d13}      /* restore Neon registers */
     vadd.s16        q10, q10, q2
     vsub.s16        q11, q12, q1
     vadd.s16        q12, q12, q1
@@ -913,7 +913,7 @@
  *
  * NOTE: jpeg-8 has an improved implementation of 4x4 inverse-DCT, which
  *       requires much less arithmetic operations and hence should be faster.
- *       The primary purpose of this particular NEON optimized function is
+ *       The primary purpose of this particular Neon optimized function is
  *       bit exact compatibility with jpeg-6b.
  *
  * TODO: a bit better instructions scheduling can be achieved by expanding
@@ -1016,7 +1016,7 @@
     adr             TMP4, jsimd_idct_4x4_neon_consts
     vld1.16         {d0, d1, d2, d3}, [TMP4, :128]
 
-    /* Load all COEF_BLOCK into NEON registers with the following allocation:
+    /* Load all COEF_BLOCK into Neon registers with the following allocation:
      *       0 1 2 3 | 4 5 6 7
      *      ---------+--------
      *   0 | d4      | d5
@@ -1126,7 +1126,7 @@
  *
  * NOTE: jpeg-8 has an improved implementation of 2x2 inverse-DCT, which
  *       requires much less arithmetic operations and hence should be faster.
- *       The primary purpose of this particular NEON optimized function is
+ *       The primary purpose of this particular Neon optimized function is
  *       bit exact compatibility with jpeg-6b.
  */
 
@@ -1173,7 +1173,7 @@
     adr             TMP2, jsimd_idct_2x2_neon_consts
     vld1.16         {d0}, [TMP2, :64]
 
-    /* Load all COEF_BLOCK into NEON registers with the following allocation:
+    /* Load all COEF_BLOCK into Neon registers with the following allocation:
      *       0 1 2 3 | 4 5 6 7
      *      ---------+--------
      *   0 | d4      | d5
@@ -1499,7 +1499,7 @@
     adr             ip, jsimd_ycc_\colorid\()_neon_consts
     vld1.16         {d0, d1, d2, d3}, [ip, :128]
 
-    /* Save ARM registers and handle input arguments */
+    /* Save Arm registers and handle input arguments */
     push            {r4, r5, r6, r7, r8, r9, r10, lr}
     ldr             NUM_ROWS, [sp, #(4 * 8)]
     ldr             INPUT_BUF0, [INPUT_BUF]
@@ -1507,7 +1507,7 @@
     ldr             INPUT_BUF2, [INPUT_BUF, #8]
     .unreq          INPUT_BUF
 
-    /* Save NEON registers */
+    /* Save Neon registers */
     vpush           {d8-d15}
 
     /* Initially set d10, d11, d12, d13 to 0xFF */
@@ -1814,7 +1814,7 @@
     adr             ip, jsimd_\colorid\()_ycc_neon_consts
     vld1.16         {d0, d1, d2, d3}, [ip, :128]
 
-    /* Save ARM registers and handle input arguments */
+    /* Save Arm registers and handle input arguments */
     push            {r4, r5, r6, r7, r8, r9, r10, lr}
     ldr             NUM_ROWS, [sp, #(4 * 8)]
     ldr             OUTPUT_BUF0, [OUTPUT_BUF]
@@ -1822,7 +1822,7 @@
     ldr             OUTPUT_BUF2, [OUTPUT_BUF, #8]
     .unreq          OUTPUT_BUF
 
-    /* Save NEON registers */
+    /* Save Neon registers */
     vpush           {d8-d15}
 
     /* Outer loop over scanlines */
@@ -2017,7 +2017,7 @@
     adr             TMP, jsimd_fdct_ifast_neon_consts
     vld1.16         {d0}, [TMP, :64]
 
-    /* Load all DATA into NEON registers with the following allocation:
+    /* Load all DATA into Neon registers with the following allocation:
      *       0 1 2 3 | 4 5 6 7
      *      ---------+--------
      *   0 | d16     | d17    | q8
@@ -2112,8 +2112,8 @@
  *
  * Note: the code uses 2 stage pipelining in order to improve instructions
  *       scheduling and eliminate stalls (this provides ~15% better
- *       performance for this function on both ARM Cortex-A8 and
- *       ARM Cortex-A9 when compared to the non-pipelined variant).
+ *       performance for this function on both Arm Cortex-A8 and
+ *       Arm Cortex-A9 when compared to the non-pipelined variant).
  *       The instructions which belong to the second stage use different
  *       indentation for better readiability.
  */
diff --git a/simd/arm64/jsimd.c b/simd/arm64/jsimd.c
index 0e6c7b9..808c0e3 100644
--- a/simd/arm64/jsimd.c
+++ b/simd/arm64/jsimd.c
@@ -12,7 +12,7 @@
  *
  * This file contains the interface between the "normal" portions
  * of the library and the SIMD implementations when running on a
- * 64-bit ARM architecture.
+ * 64-bit Arm architecture.
  */
 
 #define JPEG_INTERNALS
@@ -114,8 +114,8 @@
  */
 
 /*
- * ARMv8 architectures support NEON extensions by default.
- * It is no longer optional as it was with ARMv7.
+ * Armv8 architectures support Neon extensions by default.
+ * It is no longer optional as it was with Armv7.
  */
 
 
diff --git a/simd/arm64/jsimd_neon.S b/simd/arm64/jsimd_neon.S
index 70cef2c..3ed5f58 100644
--- a/simd/arm64/jsimd_neon.S
+++ b/simd/arm64/jsimd_neon.S
@@ -1,5 +1,5 @@
 /*
- * ARMv8 NEON optimizations for libjpeg-turbo
+ * Armv8 Neon optimizations for libjpeg-turbo
  *
  * Copyright (C) 2009-2011, Nokia Corporation and/or its subsidiary(-ies).
  *                          All Rights Reserved.
@@ -611,7 +611,7 @@
     shrn2           v5.8h, v15.4s, #16  /* wsptr[DCTSIZE*3] = (int)DESCALE(tmp13 + tmp0, CONST_BITS+PASS1_BITS+3) */
     shrn2           v6.8h, v17.4s, #16  /* wsptr[DCTSIZE*4] = (int)DESCALE(tmp13 - tmp0, CONST_BITS+PASS1_BITS+3) */
     movi            v0.16b, #(CENTERJSAMPLE)
-    /* Prepare pointers (dual-issue with NEON instructions) */
+    /* Prepare pointers (dual-issue with Neon instructions) */
       ldp             TMP1, TMP2, [OUTPUT_BUF], 16
     sqrshrn         v28.8b, v2.8h, #(CONST_BITS+PASS1_BITS+3-16)
       ldp             TMP3, TMP4, [OUTPUT_BUF], 16
@@ -992,7 +992,7 @@
  * function from jidctfst.c
  *
  * Normally 1-D AAN DCT needs 5 multiplications and 29 additions.
- * But in ARM NEON case some extra additions are required because VQDMULH
+ * But in Arm Neon case some extra additions are required because VQDMULH
  * instruction can't handle the constants larger than 1. So the expressions
  * like "x * 1.082392200" have to be converted to "x * 0.082392200 + x",
  * which introduces an extra addition. Overall, there are 6 extra additions
@@ -1024,7 +1024,7 @@
        instruction ensures that those bits are set to zero. */
     uxtw x3, w3
 
-    /* Load and dequantize coefficients into NEON registers
+    /* Load and dequantize coefficients into Neon registers
      * with the following allocation:
      *       0 1 2 3 | 4 5 6 7
      *      ---------+--------
@@ -1037,7 +1037,7 @@
      *   6 | d28     | d29     ( v22.8h )
      *   7 | d30     | d31     ( v23.8h )
      */
-    /* Save NEON registers used in fast IDCT */
+    /* Save Neon registers used in fast IDCT */
     get_symbol_loc  TMP5, Ljsimd_idct_ifast_neon_consts
     ld1             {v16.8h, v17.8h}, [COEF_BLOCK], 32
     ld1             {v0.8h, v1.8h}, [DCT_TABLE], 32
@@ -1142,7 +1142,7 @@
     add             v20.8h, v20.8h, v1.8h
     /* Descale to 8-bit and range limit */
     movi            v0.16b, #0x80
-      /* Prepare pointers (dual-issue with NEON instructions) */
+      /* Prepare pointers (dual-issue with Neon instructions) */
       ldp             TMP1, TMP2, [OUTPUT_BUF], 16
     sqshrn          v28.8b, v16.8h, #5
       ldp             TMP3, TMP4, [OUTPUT_BUF], 16
@@ -1221,7 +1221,7 @@
  *
  * NOTE: jpeg-8 has an improved implementation of 4x4 inverse-DCT, which
  *       requires much less arithmetic operations and hence should be faster.
- *       The primary purpose of this particular NEON optimized function is
+ *       The primary purpose of this particular Neon optimized function is
  *       bit exact compatibility with jpeg-6b.
  *
  * TODO: a bit better instructions scheduling can be achieved by expanding
@@ -1291,7 +1291,7 @@
        instruction ensures that those bits are set to zero. */
     uxtw x3, w3
 
-    /* Save all used NEON registers */
+    /* Save all used Neon registers */
     sub             sp, sp, 64
     mov             x9, sp
     /* Load constants (v3.4h is just used for padding) */
@@ -1300,7 +1300,7 @@
     st1             {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
     ld1             {v0.4h, v1.4h, v2.4h, v3.4h}, [TMP4]
 
-    /* Load all COEF_BLOCK into NEON registers with the following allocation:
+    /* Load all COEF_BLOCK into Neon registers with the following allocation:
      *       0 1 2 3 | 4 5 6 7
      *      ---------+--------
      *   0 | v4.4h   | v5.4h
@@ -1434,7 +1434,7 @@
  *
  * NOTE: jpeg-8 has an improved implementation of 2x2 inverse-DCT, which
  *       requires much less arithmetic operations and hence should be faster.
- *       The primary purpose of this particular NEON optimized function is
+ *       The primary purpose of this particular Neon optimized function is
  *       bit exact compatibility with jpeg-6b.
  */
 
@@ -1483,7 +1483,7 @@
     st1             {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
     ld1             {v14.4h}, [TMP2]
 
-    /* Load all COEF_BLOCK into NEON registers with the following allocation:
+    /* Load all COEF_BLOCK into Neon registers with the following allocation:
      *       0 1 2 3 | 4 5 6 7
      *      ---------+--------
      *   0 | v4.4h   | v5.4h
@@ -1857,7 +1857,7 @@
     /* Load constants to d1, d2, d3 (v0.4h is just used for padding) */
     get_symbol_loc  x15, Ljsimd_ycc_rgb_neon_consts
 
-    /* Save NEON registers */
+    /* Save Neon registers */
     st1             {v8.8b, v9.8b, v10.8b, v11.8b}, [x9], 32
     st1             {v12.8b, v13.8b, v14.8b, v15.8b}, [x9], 32
     ld1             {v0.4h, v1.4h}, [x15], 16
@@ -2142,7 +2142,7 @@
 .endm
 
 /* TODO: expand macros and interleave instructions if some in-order
- *       ARM64 processor actually can dual-issue LOAD/STORE with ALU */
+ *       AArch64 processor actually can dual-issue LOAD/STORE with ALU */
 .macro do_rgb_to_yuv_stage2_store_load_stage1 fast_ld3
     do_rgb_to_yuv_stage2
     do_load         \bpp, 8, \fast_ld3
@@ -2182,7 +2182,7 @@
     ldr             OUTPUT_BUF2, [OUTPUT_BUF, #16]
     .unreq          OUTPUT_BUF
 
-    /* Save NEON registers */
+    /* Save Neon registers */
     sub             sp, sp, #64
     mov             x9, sp
     st1             {v8.8b, v9.8b, v10.8b, v11.8b}, [x9], 32
@@ -2396,13 +2396,13 @@
     get_symbol_loc  TMP, Ljsimd_fdct_islow_neon_consts
     ld1             {v0.8h, v1.8h}, [TMP]
 
-    /* Save NEON registers */
+    /* Save Neon registers */
     sub             sp, sp, #64
     mov             x10, sp
     st1             {v8.8b, v9.8b, v10.8b, v11.8b}, [x10], 32
     st1             {v12.8b, v13.8b, v14.8b, v15.8b}, [x10], 32
 
-    /* Load all DATA into NEON registers with the following allocation:
+    /* Load all DATA into Neon registers with the following allocation:
      *       0 1 2 3 | 4 5 6 7
      *      ---------+--------
      *   0 | d16     | d17    | v16.8h
@@ -2629,7 +2629,7 @@
     st1             {v16.8h, v17.8h, v18.8h, v19.8h}, [DATA], 64
     st1             {v20.8h, v21.8h, v22.8h, v23.8h}, [DATA]
 
-    /* Restore NEON registers */
+    /* Restore Neon registers */
     ld1             {v8.8b, v9.8b, v10.8b, v11.8b}, [sp], 32
     ld1             {v12.8b, v13.8b, v14.8b, v15.8b}, [sp], 32
 
@@ -2681,7 +2681,7 @@
     get_symbol_loc  TMP, Ljsimd_fdct_ifast_neon_consts
     ld1             {v0.4h}, [TMP]
 
-    /* Load all DATA into NEON registers with the following allocation:
+    /* Load all DATA into Neon registers with the following allocation:
      *       0 1 2 3 | 4 5 6 7
      *      ---------+--------
      *   0 | d16     | d17    | v0.8h
@@ -3066,7 +3066,7 @@
 .endif
     sub             sp, sp, 272
     sub             BUFFER, BUFFER, #0x1    /* BUFFER=buffer-- */
-    /* Save ARM registers */
+    /* Save Arm registers */
     stp             x19, x20, [sp]
     get_symbol_loc  x15, Ljsimd_huff_encode_one_block_neon_consts
     ldr             PUT_BUFFER, [x0, #0x10]