| Name String | |
| cl_intel_media_block_io | |
| Contributors | |
| Biju George, Intel | |
| Ben Ashbaugh, Intel | |
| Scott Pillow, Intel | |
| Contact | |
| Biju George, Intel (biju.george 'at' intel.com) | |
| Version | |
| Version 1, December 2, 2016 | |
| Number | |
| OpenCL Extension #51 | |
| Status | |
| First Draft | |
| Dependencies | |
| OpenCL 1.2 is required. | |
| The OpenCL Intel vendor extension cl_intel_subgroups is required. The media | |
| block read/write built-in functions are an extension of the subgroup | |
| functions defined in cl_intel_subgroups. | |
| This extension is written against revision 29 of the OpenCL 2.0 API | |
| specification, against revision 33 of the OpenCL 2.0 OpenCL C | |
| specification, and against revision 32 of the OpenCL 2.0 extension | |
| specification. | |
| Overview | |
| This extension augments the block read/write functionality available in the | |
| Intel vendor extensions cl_intel_subgroups and cl_intel_subgroups_short by | |
| the specification of additional built-in functions to facilitate the reading | |
| and writing of flexible 2D regions from images. This API allows for | |
| the explicit specification of the width and height of the image regions. | |
| While not required, this extension is most useful when the subgroup size is | |
| known at compile-time. The primary use case for this extension is to support | |
| the reading of the edge texels (or image elements) of neighboring macro- | |
| blocks as described in the Intel vendor extension cl_intel_device_side_avc_ | |
| motion_estimation. When using the built-in functions from cl_intel_device_ | |
| side_avc_motion_estimation the subgroup size is implicitly fixed to 16. In | |
| other use cases the subgroup size may be fixed using the cl_intel_required_ | |
| subgroup_size extension, if needed. | |
| New API Enums | |
| None. | |
| Terms and Definitions | |
| Texel: | |
| ------ | |
| This refers to an images element (or an image pixel). | |
| Byte: | |
| ----- | |
| A 8-bit unsigned integer (or cl_uchar). | |
| Word: | |
| ----- | |
| A 16-bit unsigned integer (or cl_ushort). | |
| Dword: | |
| ----- | |
| A 32-bit unsigned integer (or cl_uint). | |
| New OpenCL C built-in functions | |
| Append to Section 6.13.X "Sub Group Read and Write Functions" of the | |
| OpenCL 2.0 C Specification, which was added by the cl_intel_subgroups | |
| extension. | |
| These built-in functions must be encountered by all work items in a subgroup | |
| executing the kernel, otherwise the behavior is undefined (i.e. they can only | |
| be used only in convergent control flow where all the work-items in the sub- | |
| groups are enabled). | |
| The following restrictions in Table 6.X apply for allowed sizes for the | |
| explicit width and height parameters of the flexible media block read/write | |
| built-in functions. | |
| Table 6.X | |
| +-----------+--------------+ | |
| | width |maximum height| | |
| | (bytes) | (rows) | | |
| +-----------+--------------+ | |
| | 4 | 64 | | |
| +-----------+--------------+ | |
| | 8 | 32 | | |
| +-----------+--------------+ | |
| | 12,16 | 16 | | |
| +-----------+--------------+ | |
| |20,24,28,32| 8 | | |
| +-----------+--------------+ | |
| If OpenCL image kernel parameters are used as input arguments in calls | |
| to the flexible media block read/write built-in functions, then these image | |
| objects must exclusively be used only by flexible media block read/write | |
| built-in functions. If the same image needs to be used for other image | |
| operations within the same kernel, then an additional image parameter may be | |
| used that is bound to the same image object as for the flexible media block | |
| read/write built-in function call in the OpenCL host application for the | |
| kernel enqueue API call. | |
| Additionally for images that are read from or written to using the flexible | |
| media block read/write built-in functions, the images should be created with | |
| the "image_width" value in "cl_image_desc" such that image_width multiplied | |
| by the texel size is a multiple of 4. | |
| The following additional restrictions are imposed for reading 2D images | |
| created from buffers with the flexible media block read/write built-in | |
| functions. | |
| 1. The image row pitch is required to be a multiple of 64-bytes, in addition | |
| to the CL_DEVICE_IMAGE_PITCH_ALIGNMENT requirements. | |
| 2. If the buffer was created using CL_MEM_USE_HOST_PTR with the host | |
| application providing the storage bits for the memory object using | |
| "host_ptr", then "host_ptr" is required to be 16-byte aligned, in addition | |
| to the clCreateBuffer requirements. | |
| 3. The maximum height is further restricted to 16 or less. | |
| The behavior is undefined if the flexible media block read/write built-in | |
| functions are used to directly read/write planar YUV image. Instead they may | |
| be indirectly used to read/write a 2D image object representing a single | |
| plane from a planar YUV image object. Creating a 2D image object | |
| representing a single plane from a planar YUV image object is described in | |
| cl_intel_planar_yuv. | |
| _____________________________________________________________________________ | |
| Read operations | |
| --------------- | |
| Byte sized read operations | |
| ++++++++++++++++++++++++++ | |
| +---------------------------------+----------------------------------------+ | |
| |uchar |Reads a 2D region from an image. | | |
| |intel_sub_group_media_block_read_| | | |
| |uc( |The 2D source byte offset of the | | |
| | int2 src_byte_offset, |top-left corner and the width and height| | |
| | int width, |of the region are specified explicitly | | |
| | int height, |in the interface parameters. The source | | |
| | read_only image2d_t image ) |byte x-offset and width must be 4 byte | | |
| | |aligned. | | |
| +---------------------------------+ | | |
| |uchar2 |The width is specified in byte units | | |
| |intel_sub_group_media_block_read_|must be less than or equal to 32. The | | |
| |uc2( |width and height of the region must be | | |
| | int2 src_byte_offset, |compile-time constants. | | |
| | int width, | | | |
| | int height, |The read-in texels in the 2D region | | |
| | read_only image2d_t image ) |taken in row-major order are | | |
| | |re-organized as another 2D region with | | |
| +---------------------------------+the byte width equal to the subgroup | | |
| |uchar4 |size. Then each work-item reads each | | |
| |intel_sub_group_media_block_read_|byte column vector of the re-organized | | |
| |uc4( |rectangle, i.e. each column's subsequent| | |
| | int2 src_byte_offset, |data element's address is strided by the| | |
| | int width, |subgroup size. | | |
| | int height, | | | |
| | read_only image2d_t image ) |The max byte area of the region is | | |
| | |defined as the byte size of the return | | |
| +---------------------------------+type multiplied by the subgroup size. If| | |
| |uchar8 |the byte area of the region is less than| | |
| |intel_sub_group_media_block_read_|its max byte area, then corresponding | | |
| |uc8( |tail elements of some of the column | | |
| | int2 src_byte_offset, |vector are undefined. Conversely if the | | |
| | int width, |byte area of the region is more than the| | |
| | int height, |max byte area, then some corresponding | | |
| | read_only image2d_t image ) |of the tail elements of the region are | | |
| | |dropped. | | |
| +---------------------------------+ | | |
| |uchar16 |For out-of-bound reads, the read-in | | |
| |intel_sub_group_media_block_read_|texels are replicated from the nearest | | |
| |uc16( |edge for byte sized texels. The out-of- | | |
| | int2 src_byte_offset, |bound behavior is undefined for larger | | |
| | int width, |sized texels with the "_uc" | | |
| | int height, |builtin-functions. | | |
| | read_only image2d_t image ) | | | |
| +---------------------------------+----------------------------------------+ | |
| Word sized read operations | |
| ++++++++++++++++++++++++++ | |
| +---------------------------------+----------------------------------------+ | |
| |ushort |Reads a 2D region from an image. | | |
| |intel_sub_group_media_block_read_| | | |
| |us( |The 2D source byte offset of the | | |
| | int2 src_byte_offset, |top-left corner and the width and height| | |
| | int width, |of the region are specified explicitly | | |
| | int height, |in the interface parameters. The source | | |
| | read_only image2d_t image ) |byte x-offset and width must be 4 byte | | |
| | |aligned. | | |
| +---------------------------------+ | | |
| |ushort2 |The width is specified in word units and| | |
| |intel_sub_group_media_block_read_|must be less than or equal to 16. The | | |
| |us2( |width and height of the region must be | | |
| | int2 src_byte_offset, |compile-time constants. | | |
| | int width, | | | |
| | int height, |The read-in texels in the 2D region | | |
| | read_only image2d_t image ) |taken in row-major order are | | |
| | |re-organized as another 2D region with | | |
| +---------------------------------+the word width equal to the subgroup | | |
| |ushort4 |size. Then each work-item reads each | | |
| |intel_sub_group_media_block_read_|word column vector of the re-organized | | |
| |us4( |rectangle, i.e. each column's subsequent| | |
| | int2 src_byte_offset, |data element's address is strided by the| | |
| | int width, |subgroup size multipled by 2. | | |
| | int height, | | | |
| | read_only image2d_t image ) |The max word area of the region is | | |
| | |defined as the word size of the return | | |
| +---------------------------------+type multiplied by the subgroup size. If| | |
| |ushort8 |the word area of the region is less than| | |
| |intel_sub_group_media_block_read_|its max word area, then corresponding | | |
| |us8( |tail elements of some of the column | | |
| | int2 src_byte_offset, |vector are undefined. Conversely if the | | |
| | int width, |word area of the region is more than the| | |
| | int height, |max word area, then some corresponding | | |
| | read_only image2d_t image ) |of the tail elements of the region are | | |
| | |dropped. | | |
| +---------------------------------+ | | |
| |ushort16 |For out-of-bound reads, the read-in | | |
| |intel_sub_group_media_block_read_|texels are replicated from the nearest | | |
| |us16( |edge for byte and word sized texels. The| | |
| | int2 src_byte_offset, |out-of-bound behavior is undefined for | | |
| | int width, |larger sized texels with the "_us" | | |
| | int height, |builtin-functions. | | |
| | read_only image2d_t image ) | | | |
| +---------------------------------+----------------------------------------+ | |
| Double Word (DWORD) sized read operations | |
| +++++++++++++++++++++++++++++++++++++++++ | |
| +---------------------------------+----------------------------------------+ | |
| |uint |Reads a 2D region from an image. | | |
| |intel_sub_group_media_block_read_| | | |
| |ui( |The 2D source byte offset of the | | |
| | int2 src_byte_offset, |top-left corner and the width and height| | |
| | int width, |of the region are specified explicitly | | |
| | int height, |in the interface parameters. The source | | |
| | read_only image2d_t image ) |byte x-offset and width must be 4 byte | | |
| | |aligned. | | |
| +---------------------------------+ | | |
| |uint2 |The width is specified in dword units | | |
| |intel_sub_group_media_block_read_|and must be less than or equal to 8. The| | |
| |ui2( |width and height of the region must be | | |
| | int2 src_byte_offset, |compile-time constants. | | |
| | int width, | | | |
| | int height, |The read-in texels in the 2D region | | |
| | read_only image2d_t image ) |taken in row-major order are | | |
| | |re-organized as another 2D region with | | |
| +---------------------------------+the dword width equal to the subgroup | | |
| |uint4 |size. Then each work-item reads each | | |
| |intel_sub_group_media_block_read_|dword column vector of the re-organized | | |
| |ui4( |rectangle, i.e. each column's subsequent| | |
| | int2 src_byte_offset, |data elements's address is strided by | | |
| | int width, |the subgroup size multipled by 4. | | |
| | int height, | | | |
| | read_only image2d_t image ) |The max dword area of the region is the | | |
| | |dword size of the return type multiplied| | |
| +---------------------------------+by the subgroup size. If the dword area | | |
| |uint8 |of the region is less than its max dword| | |
| |intel_sub_group_media_block_read_|area, then corresponding tail elements | | |
| |ui8( |of some of the column vector are | | |
| | int2 src_byte_offset, |undefined. Conversely if the dword area | | |
| | int width, |of the region is more than the max dword| | |
| | int height, |area, then corresponding some of the | | |
| | read_only image2d_t image ) |tail elements of the region are dropped.| | |
| | | | | |
| | |For out-of-bound reads, the read-in | | |
| | |texels are replicated from the nearest | | |
| | |edge for byte, word and dword sized | | |
| | |texels. The out-of-bound behavior is | | |
| | |undefined for larger sized texels with | | |
| | |the "_ui" builtin-functions. | | |
| +---------------------------------+----------------------------------------+ | |
| Additional notes on out-of-bound reads: | |
| +++++++++++++++++++++++++++++++++++++++ | |
| 1. For an image with byte texels, the boundary byte is replicated. For | |
| example, for a boundary word B0B1B2B3, to replicate the left boundary | |
| byte texel, the out of bound dwords have the format of B0B0B0B0, and that | |
| for right boundary is B3B3B3B3. | |
| 2. For an image with word texels, boundary texel replication is on words. For | |
| example, for a boundary dword B0B1B2B3, to replicate the left boundary | |
| word texel, the out of bound dwords have the format of B0B1B0B1, and that | |
| for right boundary is B2B3B2B3. | |
| 3. For special images with (word texel) YUV packed format as described in | |
| the cl_intel_packed_yuv extension, there are two cases depending on the | |
| Y location: CL_YUYV_INTEL and CL_UYVY_INTEL. Boundary handling for | |
| CL_YVYU_INTEL is the same as that for CL_YUYV_INTEL. Similarly, boundary | |
| handling for CL_VYUY_INTEL is the same as that for UYVY. For a boundary | |
| dword Y0U0Y1V0, to replicate the left boundary, we get Y0U0Y0V0, and to | |
| replicate the right boundary, we get Y1U0Y1V0. For a boundary dword | |
| U0Y0V0Y1, to replicate the left boundary, we get U0Y0V0Y0, and to | |
| replicate the right boundary, we get U0Y1V0Y1. | |
| 4. For an image with dword texels, the boundary dword texel is replicated. | |
| 5. The behavior is undefined for images with greater than dword sized texels | |
| (such as CL_RGBA + CL_FLOAT). | |
| _____________________________________________________________________________ | |
| Write operations | |
| ---------------- | |
| Byte sized write operations | |
| +++++++++++++++++++++++++++ | |
| +----------------------------------+---------------------------------------+ | |
| |void |Writes to a 2D region of an image with | | |
| |intel_sub_group_media_block_write_|surface formats of byte sized texels. | | |
| |uc( | | | |
| | int2 src_byte_offset, |The 2D source byte offset of the | | |
| | int width, |top-left corner and the width and | | |
| | int height, |height of the region are specified | | |
| | uchar texel, |explicitly in the interface | | |
| | image2d_t image ) |parameters. The source byte x-offset | | |
| | |and width must be 4 byte aligned. | | |
| +----------------------------------+ | | |
| |intel_sub_group_media_block_write_|The width is specified in byte units | | |
| |uc2( |and must be less than or equal to | | |
| | int2 src_byte_offset, |32. The width and height of the region | | |
| | int width, |must be compile-time constants. | | |
| | int height, | | | |
| | uchar2 texels, |The 2D region that is written to is | | |
| | image2d_t image ) |logically re-organized taken in | | |
| | |row-major order as another 2D region | | |
| +----------------------------------+with the byte width equal to the | | |
| |void |subgroup size. Then each work-item | | |
| |intel_sub_group_media_block_write_|processes each byte column vector of | | |
| |uc4( |the logically re-organized rectangle, | | |
| | int2 src_byte_offset, |i.e. each column's subsequent data | | |
| | int width, |element's address is strided by the | | |
| | int height, |subgroup size. | | |
| | uchar4 texels, | | | |
| | image2d_t image ) |The max byte area of the region is | | |
| | |defined as the byte size of the return | | |
| +----------------------------------+type multiplied by the subgroup | | |
| |void |size. If the byte area of the region is| | |
| |intel_sub_group_media_block_write_|less than its max byte area, then | | |
| |uc8( |corresponding tail elements of some of | | |
| | int2 src_byte_offset, |the column vector will not be included | | |
| | int width, |in the written out region. Conversely | | |
| | int height, |if the byte area of the region is more | | |
| | uchar8 texels, |than the max byte area, then | | |
| | image2d_t image ) |corresponding some of the tail elements| | |
| | |of the region are dropped. | | |
| +----------------------------------+ | | |
| |void |Out-of-bound writes are dropped. | | |
| |intel_sub_group_media_block_write_| | | |
| |uc16( | | | |
| | int2 src_byte_offset, | | | |
| | int width, | | | |
| | int height, | | | |
| | uchar16 texels, | | | |
| | image2d_t image ) | | | |
| | | | | |
| +----------------------------------+---------------------------------------+ | |
| Word sized write operations | |
| +++++++++++++++++++++++++++ | |
| +----------------------------------+---------------------------------------+ | |
| |void |Writes to a 2D region of an image with | | |
| |intel_sub_group_media_block_write_|surface formats of word or byte sized | | |
| |us( |texels. | | |
| | int2 src_byte_offset, | | | |
| | int width, |The 2D source byte offset of the | | |
| | int height, |top-left corner and the width and | | |
| | ushort texel, |height of the region are specified | | |
| | image2d_t image ) |explicitly in the interface parameters.| | |
| | |The source byte x-offset and width must| | |
| +----------------------------------+be dword aligned. | | |
| |void | | | |
| |intel_sub_group_media_block_write_|The width is specified in word units | | |
| |us2( |and must be less than or equal to | | |
| | int2 src_byte_offset, |16. The width and height of the region | | |
| | int width, |must be compile-time constants. | | |
| | int height, | | | |
| | ushort2 texels, |The 2D region that is written to is | | |
| | image2d_t image ) |logically re-organized taken in | | |
| | |row-major order as another 2D region | | |
| +----------------------------------+with the word width equal to the | | |
| |void |subgroup size. Then each work-item | | |
| |intel_sub_group_media_block_write_|processes each column vector of the | | |
| |us4( |logically re-organized rectangle, | | |
| | int2 src_byte_offset, |i.e. each column's subsequent data | | |
| | int width, |element's address is strided by the | | |
| | int height, |subgroup size multipled by 2. | | |
| | ushort4 texels, | | | |
| | image2d_t image ) |The max word area of the region is | | |
| | |defined as the word size of the return | | |
| +----------------------------------+type multiplied by the subgroup | | |
| |void |size. If the word area of the region is| | |
| |intel_sub_group_media_block_write_|less than its max word area, then | | |
| |us8( |corresponding tail elements of some of | | |
| | int2 src_byte_offset, |the column vector will not be included | | |
| | int width, |in the written out region. Conversely | | |
| | int height, |if the word area of the region is more | | |
| | ushort8 texels, |than the max word area, then | | |
| | image2d_t image ) |corresponding some of the tail elements| | |
| | |of the region are dropped. | | |
| +----------------------------------+ | | |
| |void |Out-of-bound writes are dropped. | | |
| |intel_sub_group_media_block_write_| | | |
| |us16( | | | |
| | int2 src_byte_offset, | | | |
| | int width, | | | |
| | int height, | | | |
| | ushort16 texels, | | | |
| | image2d_t image ) | | | |
| | | | | |
| +----------------------------------+---------------------------------------+ | |
| Double word (DWORD) sized write operations | |
| ++++++++++++++++++++++++++++++++++++++++++ | |
| +----------------------------------+---------------------------------------+ | |
| |void |Writes to a 2D region of an image with | | |
| |intel_sub_group_media_block_write_|surface formats of dword, word or sized| | |
| |ui( |texels. | | |
| | int2 src_byte_offset, | | | |
| | int width, |The 2D source byte offset of the | | |
| | int height, |top-left corner and the width and | | |
| | uint texels, |height of the region are specified | | |
| | image2d_t image ) |explicitly in the interface | | |
| | |parameters. The source byte x-offset | | |
| +----------------------------------+and width must be 4 byte aligned. | | |
| |void | | | |
| |intel_sub_group_media_block_write_|The width is specified in dword units | | |
| |ui2( |and must be less than or equal to | | |
| | int2 src_byte_offset, |8. The width and height of the region | | |
| | int width, |must be compile-time constants. | | |
| | int height, | | | |
| | uint2 texels, |The 2D region that is written to is | | |
| | image2d_t image ) |logically re-organized taken in | | |
| | |row-major order as another 2D region | | |
| +----------------------------------+with the dword width equal to the | | |
| |void |subgroup size. Then each work-item | | |
| |intel_sub_group_media_block_write_|processes each column vector of the | | |
| |ui4( |logically re-organized rectangle, | | |
| | int2 src_byte_offset, |i.e. each column's subsequent data | | |
| | int width, |element's address is strided by the | | |
| | int height, |subgroup size multiplied by 4. | | |
| | uint4 texels, | | | |
| | image2d_t image ) |The max dword area of the region is | | |
| | |defined as the dword size of the return| | |
| +----------------------------------+type multiplied by the subgroup | | |
| |void |size. If the dword area of the region | | |
| |intel_sub_group_media_block_write_|is less than its max dword area, then | | |
| |ui8( |corresponding tail elements of some of | | |
| | int2 src_byte_offset, |the column vector will not be included | | |
| | int width, |in the written out region. Conversely | | |
| | int height, |if the dword area of the region is more| | |
| | uint8 texels, |than the max texel area, then | | |
| | image2d_t image ) |corresponding some of the tail elements| | |
| | |of the region are dropped | | |
| | | | | |
| | |Out-of-bound writes are dropped. | | |
| +----------------------------------+---------------------------------------+ | |
| _____________________________________________________________________________ | |
| Examples | |
| 1. Reading the vertical left edge of a macroblock in a kernel that use the | |
| device-side VME built-in functions. | |
| All images are 8-bit images with the image_channel_order and the | |
| image_data_type as CL_R and CL_UNORM_INT8 respectively. | |
| __kernel | |
| void vme_intra_estimation_kernel( | |
| __read_only image2d_t src_img, | |
| __read_only image2d_t ref_img, | |
| __read_only image2d_t src_luma_img, | |
| ... | |
| { | |
| ... | |
| // Read the left edge for a macro-block. | |
| int2 edgeCoord; | |
| edgeCoord.x = srcCoord.x - 4; | |
| edgeCoord.y = srcCoord.y; | |
| uint leftLumaEdgeDW = | |
| intel_sub_group_media_block_read_ui( | |
| edgeCoord, | |
| 1, // image region width of 1 dword | |
| 16, // image region height of 16 | |
| src_luma_image ); | |
| leftLumaEdge = as_uchar4( leftLumaEdgeDW ).s3; | |
| ... | |
| intel_sub_group_avc_sic_result_t result; | |
| result = | |
| intel_sub_group_avc_sic_evaluate_ipe( | |
| src_img, | |
| vme_sampler, | |
| payload ); | |
| ... | |
| } | |
| Image 2D region: Subgroup work-items: | |
| ++++++++++++++++ ++++++++++++++++++++ | |
| +-+ | |
| |0| | |
| +-+ | |
| |1| Subgroup local id: 0 1 2 3 4 5 6 7 8 9 A B C D E F | |
| +-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| |2| |0|1|2|3|4|5|6|7|8|9|A|B|C|D|E|F| | |
| +-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| |3| | |
| +-+ | |
| |4| | |
| +-+ | |
| |5| The max byte area of the image region is 1*16 (16). | |
| +-+ | |
| |6| The 1x16 dword image region is re-organized taken in row- | |
| +-+ major order as another 2D region within the subgroup with | |
| |7| the word width equal to the subgroup size of 16 - 16x1 | |
| +-+ region. | |
| |8| | |
| +-+ | |
| |9| | |
| +-+ | |
| |A| | |
| +-+ | |
| |B| | |
| +-+ | |
| |C| | |
| +-+ | |
| |D| | |
| +-+ | |
| |E| | |
| +-+ | |
| |F| | |
| +-+ | |
| 2. Reading a 16x2 word region from an image. | |
| All images are 8-bit images with the image_channel_order and the | |
| image_data_type as CL_R and CL_UNORM_INT8 respectively. | |
| __kernel __attribute__((intel_reqd_sub_group_size(8))) | |
| void vme_intra_estimation_kernel( | |
| __read_only image2d_t src_img, | |
| ... | |
| { | |
| ... | |
| // Read the 16x2 word region in a subgroup of size 8. | |
| int2 srcCoord; | |
| ... | |
| ushort4 texels = | |
| intel_sub_group_media_block_read_us4( | |
| srcCoord, | |
| 16, // image region width of 16 words | |
| 2, // image rgeion height of 16 | |
| src_image ); | |
| ... | |
| } | |
| Image 2D region: | |
| ++++++++++++++++ | |
| The max word area of the region is 4*8 (32). | |
| +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ | |
| |0 |1 |2 |3 |4 |5 |6 |7 |8 |9 |A |B |C |D |E |F | | |
| +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ | |
| |10|11|12|13|14|15|16|17|18|19|1A|1B|1C|1D|1E|1F| | |
| +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ | |
| Subgroup work-items: | |
| ++++++++++++++++++++ | |
| The 16x2 image region is re-organized taken in row-major order as another | |
| 2D region within the subgroup with the word width equal to the subgroup | |
| size of 8 - 8x4 region. Each work-item processes a data item strided by 8 | |
| words. | |
| Subgroup local id: 0 1 2 3 4 5 6 7 | |
| +--+--+--+--+--+--+--+--+ | |
| |0 |1 |2 |3 |4 |5 |6 |7 | | |
| +--+--+--+--+--+--+--+--+ | |
| |8 |9 |A |B |C |D |E |F | | |
| +--+--+--+--+--+--+--+--+ | |
| |10|11|12|13|14|15|16|17| | |
| +--+--+--+--+--+--+--+--+ | |
| |18|19|1A|1B|1C|1D|1E|1F| | |
| +--+--+--+--+--+--+--+--+ | |
| Revision History | |
| Version 1 (12/02/2016): First public revision. |