q14 rethink

I've been thinking and rethinking and rethinking how best to use 16-bit
values like Q14 fixed-point in SkVM.  Here are some ways:

   A) don't... just use 32-bit values instead
   B) use 16x2-bit pairs to match the narrower 32-bit lane count
   C) double-pump 32-bit values to match the wider 16-bit lane count
   D) use native 16- and 32-bit values and let the backends sort it out

A) is how things work today, and C) is how SkRasterPipeline's lowp mode
works.  I gave B) and C) both a good fair shake, and they were both
already awkward to work with after writing just a few functions.  I
would not give up on them entirely, but they're no longer my favorites.

D) is subtle and my new favorite.  It's easiest to program with SkVM
when the values we're holding represent single values and the backend
handles any parallelism for us.  That suggests we add a simple 16-bit
Q14 type alongside the existing 32-bit I32 and F32 types, with active
conversions between them as normal, but no free no-op bit-punning.  D)
says we shouldn't have to choose between A-C) up front... each backend
can handle that itself.
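
A minimal scalar sketch of what "actively converted, not bit-punned"
could mean, assuming Q14 is a signed 16-bit value with 14 fractional
bits (so 1.0 is stored as 16384).  The struct and function names here
are hypothetical and are not SkVM's actual API:

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for a Q14 value: signed 16-bit, 14 fractional bits.
// Being its own struct, it cannot be silently reinterpreted as an int or float.
struct Q14 {
    int16_t bits;
};

// Active conversions: each one does real work (scale, round, or shift).
inline Q14 to_Q14(float x) {
    return Q14{ static_cast<int16_t>(std::lround(x * 16384.0f)) };
}
inline float to_F32(Q14 x) {
    return static_cast<float>(x.bits) * (1.0f / 16384.0f);
}
inline int32_t to_I32(Q14 x) {
    // Integer part.  Assumes an arithmetic right shift for negative values
    // (guaranteed since C++20, and what mainstream compilers do anyway).
    return x.bits >> 14;
}

// Multiply in 32 bits, round to nearest, then shift back down to Q14.
inline Q14 mul(Q14 a, Q14 b) {
    int32_t wide = static_cast<int32_t>(a.bits) * static_cast<int32_t>(b.bits);
    return Q14{ static_cast<int16_t>((wide + (1 << 13)) >> 14) };
}
```

For example, mul(to_Q14(0.5f), to_Q14(0.5f)) holds bits == 4096, which
to_F32() turns back into 0.25f.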

Under strategy D), it's entirely the backend's job to decide how to
represent each value, and how to vectorize them.  We don't need to
know as a user, and the backends can use the program itself to inform
how they vectorize.  16-bit values could live in xmm registers and
32-bit values in ymm, or the 16-bit values could go in the low half of a
ymm, or the even lanes of a ymm, or a full ymm and use two for 32-bit
values, etc. etc.  This all is a backend choice, not something we should
have to know about when writing a program using Q14/I32/F32.
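
To make the register options above concrete, here is the lane
arithmetic worked out as compile-time checks.  It relies only on xmm
being 128 bits and ymm being 256 bits; the helper name is made up for
illustration:

```cpp
#include <cstddef>

constexpr std::size_t kXmmBits = 128;
constexpr std::size_t kYmmBits = 256;

// Lanes available when values of `valueBits` are packed densely into `regBits`.
constexpr std::size_t dense_lanes(std::size_t regBits, std::size_t valueBits) {
    return regBits / valueBits;
}

// 32-bit values in one ymm: 8 lanes.
static_assert(dense_lanes(kYmmBits, 32) == 8, "");

// 16-bit values in an xmm: also 8 lanes, matching the 32-bit lane count.
static_assert(dense_lanes(kXmmBits, 16) == 8, "");

// 16-bit values in only the low half (or only the even lanes) of a ymm:
// half of the 16 dense slots are used, so again 8 lanes.
static_assert(dense_lanes(kYmmBits, 16) / 2 == 8, "");

// 16-bit values filling a whole ymm: 16 lanes, so each 32-bit value
// needs 16 / 8 = 2 ymm registers to keep up.
static_assert(dense_lanes(kYmmBits, 16) == 16, "");
static_assert(dense_lanes(kYmmBits, 16) / dense_lanes(kYmmBits, 32) == 2, "");
```

The first three layouts keep the 32-bit lane count and leave register
space unused for 16-bit values; the last one doubles the lane count and
pays for it on the 32-bit side.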

My next steps are to get Q14 operations tested and plumbed through the
JIT again, and to build out a blitter and a few effects using Q14 color
channels.  Then, independently, we can look at each backend and how to
vectorize them.  Some ideas:

    1) keep running at current vectorization, with half rate 16-bit ops
    2) pump up to 2x wider vectorization unconditionally to favor 16-bit
    3) pump up to 2x wider vectorization only when any 16-bit op is used

These choices can be made independently for each backend (JIT, LLVM,
interp), and I wouldn't be surprised to find that we'll want to do them
differently.  For instance, the interpreter is already running at 32x
vectorization... it may be that pumping it higher won't help anything.
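
A rough sketch of how ideas 1-3 might reduce to a per-backend lane-count
decision.  The enum, function, and parameter names are all hypothetical,
and a real backend would have more to weigh than this:

```cpp
// Hypothetical policy knob mirroring ideas 1), 2), and 3) above.
enum class Widening {
    kNever,        // 1) keep current vectorization; 16-bit ops run at half rate
    kAlways,       // 2) always go 2x wider to favor 16-bit ops
    kIfAny16Bit,   // 3) go 2x wider only if the program uses a 16-bit op
};

// Pick how many values a backend processes per loop iteration.
// `baseLanes` is the backend's current (32-bit oriented) vector width.
constexpr int lanes_for(Widening policy, int baseLanes, bool programUses16Bit) {
    switch (policy) {
        case Widening::kNever:      return baseLanes;
        case Widening::kAlways:     return 2 * baseLanes;
        case Widening::kIfAny16Bit: return programUses16Bit ? 2 * baseLanes
                                                            : baseLanes;
    }
    return baseLanes;  // unreachable for valid enum values
}
```

With a base width of 8, policy 3) yields 16 lanes for a program that
touches Q14 and stays at 8 for a pure I32/F32 program.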

Change-Id: Ib8ad2b1bf790e8c4e3acfb4818d4032f7628e8f8
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/319321
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Mike Reed <reed@google.com>
5 files changed