add small_memcpy()

The last CL, which started using non-constant-sized memcpy(), must have
thrown off the optimizer. The stores now do bizarre things like

  store 16 byte vector to stack
  load low 8 bytes back from stack to register A
  load high 8 bytes back from stack to register B
  store to dst+0 from register A
  store to dst+8 from register B

when they could do

  store 16 byte vector to dst+0

small_memcpy() uses __builtin_memcpy() when possible, and that
restores sane codegen.
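
For illustration only, here is a minimal sketch of the idea, not the code
from this CL. It assumes the fix works by routing common sizes to a
__builtin_memcpy() call with a compile-time-constant length, so the
compiler can emit a single load and store per copy. The name
small_memcpy_sketch, the SKETCH_MEMCPY macro, and the set of sizes
handled are all hypothetical.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sketch. GCC and Clang provide __builtin_memcpy();
// other compilers fall back to std::memcpy().
#if defined(__GNUC__) || defined(__clang__)
    #define SKETCH_MEMCPY(dst, src, n) __builtin_memcpy((dst), (src), (n))
#else
    #define SKETCH_MEMCPY(dst, src, n) std::memcpy((dst), (src), (n))
#endif

static inline void small_memcpy_sketch(void* dst, const void* src, std::size_t bytes) {
    // Each case passes a constant size, which the compiler can lower
    // to one fixed-width load and store instead of a generic copy.
    switch (bytes) {
        case  1: SKETCH_MEMCPY(dst, src,  1); return;
        case  2: SKETCH_MEMCPY(dst, src,  2); return;
        case  4: SKETCH_MEMCPY(dst, src,  4); return;
        case  8: SKETCH_MEMCPY(dst, src,  8); return;
        case 16: SKETCH_MEMCPY(dst, src, 16); return;
        default: break;
    }
    // Assumed fallback for any other size: a plain byte-by-byte copy.
    unsigned char*       d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t i = 0; i < bytes; i++) {
        d[i] = s[i];
    }
}
```

Whether a given compiler actually folds the 16-byte case into one vector
store depends on the target and optimization level, so check the
generated assembly rather than assuming it.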

Change-Id: I220d03228dc4b2652e988488a7772101d59994c1
Reviewed-on: https://skia-review.googlesource.com/96961
Reviewed-by: Brian Osman <brianosman@google.com>