Optimize find_optimal_selector_clusters_for_each_block

In find_optimal_selector_clusters_for_each_block, a noticeable amount of
time is spent computing color distances for pixels of different clusters.

Each block has 16 pixels and each pixel has only 4 colors to compare
against, so there are just 16 x 4 = 64 unique color deltas per block.
The cluster count, however, is typically ~200, and we were computing 16
deltas for each cluster, i.e. roughly 3200 distance computations per
block.

It's thus cheaper to precompute all 64 deltas ahead of time and just add
the right deltas up for each cluster.
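
As an illustration only, here is a minimal standalone sketch of the
idea. It is not the encoder code: Rgb, sq_dist, Selectors and both
best_cluster_* functions are hypothetical names, and plain squared RGB
distance stands in for color_distance.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Rgb { uint8_t r, g, b; };

// Hypothetical stand-in for color_distance: squared RGB distance.
inline uint32_t sq_dist(Rgb a, Rgb b) {
    const int dr = int(a.r) - int(b.r);
    const int dg = int(a.g) - int(b.g);
    const int db = int(a.b) - int(b.b);
    return uint32_t(dr * dr + dg * dg + db * db);
}

// One selector (0..3) per block pixel.
using Selectors = std::array<uint8_t, 16>;

// Baseline: 16 distance computations for every cluster.
std::size_t best_cluster_naive(const std::array<Rgb, 16>& pixels,
                               const std::array<Rgb, 4>& colors,
                               const std::vector<Selectors>& clusters) {
    uint64_t best_err = UINT64_MAX;
    std::size_t best = 0;
    for (std::size_t c = 0; c < clusters.size(); ++c) {
        uint64_t err = 0;
        // Stop accumulating once this cluster can no longer win.
        for (int i = 0; i < 16 && err <= best_err; ++i) {
            assert(clusters[c][i] < 4);
            err += sq_dist(pixels[i], colors[clusters[c][i]]);
        }
        if (err < best_err) { best_err = err; best = c; }
    }
    return best;
}

// Precomputed: 64 distance computations in total, then table lookups.
std::size_t best_cluster_table(const std::array<Rgb, 16>& pixels,
                               const std::array<Rgb, 4>& colors,
                               const std::vector<Selectors>& clusters) {
    uint32_t table[4][16];  // indexed as [sel][i]
    for (int sel = 0; sel < 4; ++sel)
        for (int i = 0; i < 16; ++i)
            table[sel][i] = sq_dist(pixels[i], colors[sel]);

    uint64_t best_err = UINT64_MAX;
    std::size_t best = 0;
    for (std::size_t c = 0; c < clusters.size(); ++c) {
        uint64_t err = 0;
        for (int i = 0; i < 16 && err <= best_err; ++i) {
            assert(clusters[c][i] < 4);
            err += table[clusters[c][i]][i];
        }
        if (err < best_err) { best_err = err; best = c; }
    }
    return best;
}
```

Both functions sum the same values in the same order, so they pick the
same cluster; only the number of distance computations differs.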

This reduces the time to encode a 2Kx2K image with a mip chain in a
single thread with SSE4.1 enabled from 7.8 seconds to 7.3 seconds
(roughly 6% faster); the resulting image is binary identical
before/after this change.
diff --git a/encoder/basisu_frontend.cpp b/encoder/basisu_frontend.cpp
index f7d47e9..701e9f4 100644
--- a/encoder/basisu_frontend.cpp
+++ b/encoder/basisu_frontend.cpp
@@ -1931,6 +1931,17 @@
 					color_rgba trial_block_colors[4];
 					blk.get_block_colors(trial_block_colors, 0);
 
+					// precompute errors for the i-th block pixel and selector sel: [sel][i]
+					uint32_t trial_errors[4][16];
+
+					for (int sel = 0; sel < 4; ++sel)
+					{
+						for (int i = 0; i < 16; ++i)
+						{
+							trial_errors[sel][i] = color_distance(m_params.m_perceptual, pBlock_pixels[i], trial_block_colors[sel], false);
+						}
+					}
+
 					uint64_t best_cluster_err = INT64_MAX;
 					uint32_t best_cluster_index = 0;
 
@@ -1984,7 +1995,7 @@
 							{
 								const uint32_t sel = unpacked_optimized_cluster_selectors[cluster_index * 16 + i];
 										
-								trial_err += color_distance(true, trial_block_colors[sel], pBlock_pixels[i], false);
+								trial_err += trial_errors[sel][i];
 								if (trial_err > best_cluster_err)
 									goto early_out;
 							}
@@ -2015,7 +2026,7 @@
 							{
 								const uint32_t sel = unpacked_optimized_cluster_selectors[cluster_index * 16 + i];
 
-								trial_err += color_distance(false, trial_block_colors[sel], pBlock_pixels[i], false);
+								trial_err += trial_errors[sel][i];
 								if (trial_err > best_cluster_err)
 									goto early_out2;
 							}