sketch single-source multi-target skcms_Transform()

On x86-64 targets that are not already compiled globally with AVX2, we
add an extra AVX2 slice of the core of skcms_Transform() and run it if
we detect CPU support at runtime.  In exchange for the improved
performance, we pay roughly 25 to 30K of extra code size.
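
For illustration only, the runtime dispatch described above might look
roughly like the sketch below.  All names here (baseline::run, hsw::run,
cpu_has_avx2, pick_core, transform_sketch) are hypothetical and are not
the identifiers used in this change.  The check is simplified to AVX2
alone via the GCC/Clang builtin __builtin_cpu_supports(), and both
slices are plain loops so the sketch builds anywhere.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the two slices of the transform core.  In a
// real build the hsw slice would be compiled for Haswell; here both are
// the same plain loop.
namespace baseline {
    static void run(const float* src, float* dst, std::size_t n) {
        for (std::size_t i = 0; i < n; i++) { dst[i] = src[i] * 2.0f; }
    }
}
namespace hsw {
    static void run(const float* src, float* dst, std::size_t n) {
        for (std::size_t i = 0; i < n; i++) { dst[i] = src[i] * 2.0f; }
    }
}

// True if the running CPU reports AVX2.  Simplifying assumption: a full
// check would presumably also want F16C and FMA.
static bool cpu_has_avx2() {
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;  // Unknown compiler or architecture: stay on baseline.
#endif
}

typedef void (*CoreFn)(const float*, float*, std::size_t);

// Detect once, then reuse the chosen slice on every later call.
static CoreFn pick_core() {
    static const CoreFn fn = cpu_has_avx2() ? hsw::run : baseline::run;
    return fn;
}

static void transform_sketch(const float* src, float* dst, std::size_t n) {
    pick_core()(src, dst, n);
}
```

Caching the choice in a function-local static keeps the detection cost
to the first call.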

No significant size or logic change in any other builds.

I am ambivalent about whether skcms_Transform() should do its own CPU
detection as written here, or take an argument with policy bits filled
out by the caller, e.g.

    struct skcms_TransformPolicy {
        // If true, allow skcms_Transform() to use code targeting
        // Haswell or later (i.e. -march=haswell) processors,
        // namely the AVX2, F16C, and FMA instruction sets.
        bool hsw_ok;
    };
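
A minimal sketch of how the caller-supplied alternative could be
consumed, assuming the transform gains a policy parameter.  The names
TransformPolicySketch, SliceSketch, and choose_slice are hypothetical
and not part of skcms; the struct only mirrors the hsw_ok field
proposed above.

```cpp
// Hypothetical mirror of the proposed policy struct.
struct TransformPolicySketch {
    bool hsw_ok;  // Caller asserts Haswell-class code may run.
};

enum class SliceSketch { Baseline, Haswell };

// Pick a slice from caller policy alone, with no CPU detection here.
// built_with_hsw_slice says whether this binary contains the extra slice.
static SliceSketch choose_slice(const TransformPolicySketch& policy,
                                bool built_with_hsw_slice) {
    return (policy.hsw_ok && built_with_hsw_slice) ? SliceSketch::Haswell
                                                   : SliceSketch::Baseline;
}
```

With this shape the caller owns detection and can force the baseline
path, for example in tests, by leaving hsw_ok false.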

Change-Id: Idfe3a28ad128bfb2fa48096cfc55d000255fbfbe
Reviewed-on: https://skia-review.googlesource.com/117500
Reviewed-by: Brian Osman <brianosman@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
5 files changed