Fix invalid memory access and optimise Blit_3or4_to_3or4__*

Fix invalid write at last pixel of the surface:
  when surface has no padding (pitch == w * bpp) and bpp is 3
  with Blit, no colorkey, and NO_ALPHA same or inverse rgb triplet

Optimise by using int32 access:

BGR24 -> ARGB8888 :  faster x1.897875   (362405 -> 190953)
RGB24 -> ABGR8888 :  faster x1.660416   (363304 -> 218803)

ABGR8888 -> RGB24 :  faster x1.686319   (334962 -> 198635)
ARGB8888 -> BGR24 :  faster x1.691868   (324524 -> 191814)
BGR24 -> RGB888 :  faster x1.678459   (326811 -> 194709)
BGR888 -> RGB24 :  faster x1.731772   (327724 -> 189242)
RGB24 -> BGR888 :  faster x1.690989   (328916 -> 194511)
RGB888 -> BGR24 :  faster x1.698333   (326175 -> 192056)
1 file changed