refactor memory stages

While looking at wider PixelFormats I came up with a new stage
factorization that favors the wide stride runs even more, and makes it
easier to reason about how memory loads, stores, and format conversions
interact, especially for pixels wider than 4 bytes.

The main interesting logic is in load_<PixelFormat>_N, which can now
assume it's always loading N pixels.  When we don't have N pixels to
load, load_<PixelFormat>_1 makes it appear as if we do, with zeros above
the bottom lane.  Stores work the same way in reverse, storing N pixels
to a temporary stack buffer, then carefully copying out the one real
pixel.
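
A minimal sketch of that load/store split, not the code in this change:
it assumes N = 4 lanes and a 4-byte 8888-style format, and every name
here (N, BPP, load_8888_N, load_8888_1, store_8888_N, store_8888_1) is
hypothetical. Plain arrays and memcpy stand in for whatever vector types
the real stages use.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N   4   /* hypothetical lane count */
#define BPP 4   /* bytes per pixel for the assumed 8888-style format */

/* Always loads exactly N pixels; the caller guarantees N*BPP readable bytes. */
static void load_8888_N(const uint8_t* src, uint32_t px[N]) {
    assert(src && px);
    memcpy(px, src, N * BPP);
}

/* Loads one real pixel by staging it in a zeroed buffer, so load_8888_N
 * still sees N pixels, with zeros in every lane above the bottom one. */
static void load_8888_1(const uint8_t* src, uint32_t px[N]) {
    uint8_t tmp[N * BPP] = {0};
    memcpy(tmp, src, BPP);
    load_8888_N(tmp, px);
}

/* Always stores exactly N pixels; the caller guarantees N*BPP writable bytes. */
static void store_8888_N(const uint32_t px[N], uint8_t* dst) {
    assert(px && dst);
    memcpy(dst, px, N * BPP);
}

/* Stores N pixels to a stack buffer, then copies out only the one real
 * pixel, so nothing past dst[BPP-1] is written. */
static void store_8888_1(const uint32_t px[N], uint8_t* dst) {
    uint8_t tmp[N * BPP];
    store_8888_N(px, tmp);
    memcpy(dst, tmp, BPP);
}
```

The point of the split is that any format conversion only has to be
written once, against the N-pixel versions; the 1-pixel versions are
pure memory shuffling.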

This design also makes it easy to pivot to handling all <N pixels in one
pass instead of one at a time, should we find that necessary.
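
For illustration only, a one-pass tail could look roughly like this. It
is a sketch under the same assumptions (N = 4 lanes, 4-byte pixels), and
load_full, store_full, load_tail and store_tail are hypothetical names,
not functions from this change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define N   4   /* hypothetical lane count */
#define BPP 4   /* bytes per pixel for the assumed format */

/* Full-width load and store: always exactly N pixels. */
static void load_full(const uint8_t* src, uint32_t px[N]) {
    memcpy(px, src, N * BPP);
}
static void store_full(const uint32_t px[N], uint8_t* dst) {
    memcpy(dst, px, N * BPP);
}

/* Loads n pixels (0 < n < N) in one pass; lanes n..N-1 read as zero. */
static void load_tail(const uint8_t* src, size_t n, uint32_t px[N]) {
    assert(n > 0 && n < N);
    uint8_t tmp[N * BPP] = {0};
    memcpy(tmp, src, n * BPP);
    load_full(tmp, px);
}

/* Stores n pixels (0 < n < N) in one pass via a stack buffer, leaving
 * everything past the n real pixels in dst untouched. */
static void store_tail(const uint32_t px[N], size_t n, uint8_t* dst) {
    assert(n > 0 && n < N);
    uint8_t tmp[N * BPP];
    store_full(px, tmp);
    memcpy(dst, tmp, n * BPP);
}
```

Only the byte count of the two staging copies changes relative to the
one-pixel case; the N-pixel load and store are reused as they are.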

Change-Id: I8745418b9a5d0607b98b58d49287d8a3da8165a9
2 files changed
tree: 6973fc92b76a6b45b63747800e805578d7c6e3bc