blob: 7c801d250986ab0cb3dc75becf62a2466bb6dc54 [file] [log] [blame]
Name String
cl_intel_media_block_io
Contributors
Biju George, Intel
Ben Ashbaugh, Intel
Scott Pillow, Intel
Contact
Biju George, Intel (biju.george 'at' intel.com)
Version
Version 1, December 2, 2016
Number
OpenCL Extension #51
Status
First Draft
Dependencies
OpenCL 1.2 is required.
The OpenCL Intel vendor extension cl_intel_subgroups is required. The media
block read/write built-in functions are an extension of the subgroup
functions defined in cl_intel_subgroups.
This extension is written against revision 29 of the OpenCL 2.0 API
specification, against revision 33 of the OpenCL 2.0 OpenCL C
specification, and against revision 32 of the OpenCL 2.0 extension
specification.
Overview
This extension augments the block read/write functionality available in the
Intel vendor extensions cl_intel_subgroups and cl_intel_subgroups_short by
the specification of additional built-in functions to facilitate the reading
and writing of flexible 2D regions from images. This API allows for
the explicit specification of the width and height of the image regions.
While not required, this extension is most useful when the subgroup size is
known at compile-time. The primary use case for this extension is to support
the reading of the edge texels (or image elements) of neighboring macro-
blocks as described in the Intel vendor extension cl_intel_device_side_avc_
motion_estimation. When using the built-in functions from cl_intel_device_
side_avc_motion_estimation the subgroup size is implicitly fixed to 16. In
other use cases the subgroup size may be fixed using the cl_intel_required_
subgroup_size extension, if needed.
New API Enums
None.
Terms and Definitions
Texel:
------
This refers to an images element (or an image pixel).
Byte:
-----
A 8-bit unsigned integer (or cl_uchar).
Word:
-----
A 16-bit unsigned integer (or cl_ushort).
Dword:
-----
A 32-bit unsigned integer (or cl_uint).
New OpenCL C built-in functions
Append to Section 6.13.X "Sub Group Read and Write Functions" of the
OpenCL 2.0 C Specification, which was added by the cl_intel_subgroups
extension.
These built-in functions must be encountered by all work items in a subgroup
executing the kernel, otherwise the behavior is undefined (i.e. they can only
be used only in convergent control flow where all the work-items in the sub-
groups are enabled).
The following restrictions in Table 6.X apply for allowed sizes for the
explicit width and height parameters of the flexible media block read/write
built-in functions.
Table 6.X
+-----------+--------------+
| width |maximum height|
| (bytes) | (rows) |
+-----------+--------------+
| 4 | 64 |
+-----------+--------------+
| 8 | 32 |
+-----------+--------------+
| 12,16 | 16 |
+-----------+--------------+
|20,24,28,32| 8 |
+-----------+--------------+
If OpenCL image kernel parameters are used as input arguments in calls
to the flexible media block read/write built-in functions, then these image
objects must exclusively be used only by flexible media block read/write
built-in functions. If the same image needs to be used for other image
operations within the same kernel, then an additional image parameter may be
used that is bound to the same image object as for the flexible media block
read/write built-in function call in the OpenCL host application for the
kernel enqueue API call.
Additionally for images that are read from or written to using the flexible
media block read/write built-in functions, the images should be created with
the "image_width" value in "cl_image_desc" such that image_width multiplied
by the texel size is a multiple of 4.
The following additional restrictions are imposed for reading 2D images
created from buffers with the flexible media block read/write built-in
functions.
1. The image row pitch is required to be a multiple of 64-bytes, in addition
to the CL_DEVICE_IMAGE_PITCH_ALIGNMENT requirements.
2. If the buffer was created using CL_MEM_USE_HOST_PTR with the host
application providing the storage bits for the memory object using
"host_ptr", then "host_ptr" is required to be 16-byte aligned, in addition
to the clCreateBuffer requirements.
3. The maximum height is further restricted to 16 or less.
The behavior is undefined if the flexible media block read/write built-in
functions are used to directly read/write planar YUV image. Instead they may
be indirectly used to read/write a 2D image object representing a single
plane from a planar YUV image object. Creating a 2D image object
representing a single plane from a planar YUV image object is described in
cl_intel_planar_yuv.
_____________________________________________________________________________
Read operations
---------------
Byte sized read operations
++++++++++++++++++++++++++
+---------------------------------+----------------------------------------+
|uchar |Reads a 2D region from an image. |
|intel_sub_group_media_block_read_| |
|uc( |The 2D source byte offset of the |
| int2 src_byte_offset, |top-left corner and the width and height|
| int width, |of the region are specified explicitly |
| int height, |in the interface parameters. The source |
| read_only image2d_t image ) |byte x-offset and width must be 4 byte |
| |aligned. |
+---------------------------------+ |
|uchar2 |The width is specified in byte units |
|intel_sub_group_media_block_read_|must be less than or equal to 32. The |
|uc2( |width and height of the region must be |
| int2 src_byte_offset, |compile-time constants. |
| int width, | |
| int height, |The read-in texels in the 2D region |
| read_only image2d_t image ) |taken in row-major order are |
| |re-organized as another 2D region with |
+---------------------------------+the byte width equal to the subgroup |
|uchar4 |size. Then each work-item reads each |
|intel_sub_group_media_block_read_|byte column vector of the re-organized |
|uc4( |rectangle, i.e. each column's subsequent|
| int2 src_byte_offset, |data element's address is strided by the|
| int width, |subgroup size. |
| int height, | |
| read_only image2d_t image ) |The max byte area of the region is |
| |defined as the byte size of the return |
+---------------------------------+type multiplied by the subgroup size. If|
|uchar8 |the byte area of the region is less than|
|intel_sub_group_media_block_read_|its max byte area, then corresponding |
|uc8( |tail elements of some of the column |
| int2 src_byte_offset, |vector are undefined. Conversely if the |
| int width, |byte area of the region is more than the|
| int height, |max byte area, then some corresponding |
| read_only image2d_t image ) |of the tail elements of the region are |
| |dropped. |
+---------------------------------+ |
|uchar16 |For out-of-bound reads, the read-in |
|intel_sub_group_media_block_read_|texels are replicated from the nearest |
|uc16( |edge for byte sized texels. The out-of- |
| int2 src_byte_offset, |bound behavior is undefined for larger |
| int width, |sized texels with the "_uc" |
| int height, |builtin-functions. |
| read_only image2d_t image ) | |
+---------------------------------+----------------------------------------+
Word sized read operations
++++++++++++++++++++++++++
+---------------------------------+----------------------------------------+
|ushort |Reads a 2D region from an image. |
|intel_sub_group_media_block_read_| |
|us( |The 2D source byte offset of the |
| int2 src_byte_offset, |top-left corner and the width and height|
| int width, |of the region are specified explicitly |
| int height, |in the interface parameters. The source |
| read_only image2d_t image ) |byte x-offset and width must be 4 byte |
| |aligned. |
+---------------------------------+ |
|ushort2 |The width is specified in word units and|
|intel_sub_group_media_block_read_|must be less than or equal to 16. The |
|us2( |width and height of the region must be |
| int2 src_byte_offset, |compile-time constants. |
| int width, | |
| int height, |The read-in texels in the 2D region |
| read_only image2d_t image ) |taken in row-major order are |
| |re-organized as another 2D region with |
+---------------------------------+the word width equal to the subgroup |
|ushort4 |size. Then each work-item reads each |
|intel_sub_group_media_block_read_|word column vector of the re-organized |
|us4( |rectangle, i.e. each column's subsequent|
| int2 src_byte_offset, |data element's address is strided by the|
| int width, |subgroup size multipled by 2. |
| int height, | |
| read_only image2d_t image ) |The max word area of the region is |
| |defined as the word size of the return |
+---------------------------------+type multiplied by the subgroup size. If|
|ushort8 |the word area of the region is less than|
|intel_sub_group_media_block_read_|its max word area, then corresponding |
|us8( |tail elements of some of the column |
| int2 src_byte_offset, |vector are undefined. Conversely if the |
| int width, |word area of the region is more than the|
| int height, |max word area, then some corresponding |
| read_only image2d_t image ) |of the tail elements of the region are |
| |dropped. |
+---------------------------------+ |
|ushort16 |For out-of-bound reads, the read-in |
|intel_sub_group_media_block_read_|texels are replicated from the nearest |
|us16( |edge for byte and word sized texels. The|
| int2 src_byte_offset, |out-of-bound behavior is undefined for |
| int width, |larger sized texels with the "_us" |
| int height, |builtin-functions. |
| read_only image2d_t image ) | |
+---------------------------------+----------------------------------------+
Double Word (DWORD) sized read operations
+++++++++++++++++++++++++++++++++++++++++
+---------------------------------+----------------------------------------+
|uint |Reads a 2D region from an image. |
|intel_sub_group_media_block_read_| |
|ui( |The 2D source byte offset of the |
| int2 src_byte_offset, |top-left corner and the width and height|
| int width, |of the region are specified explicitly |
| int height, |in the interface parameters. The source |
| read_only image2d_t image ) |byte x-offset and width must be 4 byte |
| |aligned. |
+---------------------------------+ |
|uint2 |The width is specified in dword units |
|intel_sub_group_media_block_read_|and must be less than or equal to 8. The|
|ui2( |width and height of the region must be |
| int2 src_byte_offset, |compile-time constants. |
| int width, | |
| int height, |The read-in texels in the 2D region |
| read_only image2d_t image ) |taken in row-major order are |
| |re-organized as another 2D region with |
+---------------------------------+the dword width equal to the subgroup |
|uint4 |size. Then each work-item reads each |
|intel_sub_group_media_block_read_|dword column vector of the re-organized |
|ui4( |rectangle, i.e. each column's subsequent|
| int2 src_byte_offset, |data elements's address is strided by |
| int width, |the subgroup size multipled by 4. |
| int height, | |
| read_only image2d_t image ) |The max dword area of the region is the |
| |dword size of the return type multiplied|
+---------------------------------+by the subgroup size. If the dword area |
|uint8 |of the region is less than its max dword|
|intel_sub_group_media_block_read_|area, then corresponding tail elements |
|ui8( |of some of the column vector are |
| int2 src_byte_offset, |undefined. Conversely if the dword area |
| int width, |of the region is more than the max dword|
| int height, |area, then corresponding some of the |
| read_only image2d_t image ) |tail elements of the region are dropped.|
| | |
| |For out-of-bound reads, the read-in |
| |texels are replicated from the nearest |
| |edge for byte, word and dword sized |
| |texels. The out-of-bound behavior is |
| |undefined for larger sized texels with |
| |the "_ui" builtin-functions. |
+---------------------------------+----------------------------------------+
Additional notes on out-of-bound reads:
+++++++++++++++++++++++++++++++++++++++
1. For an image with byte texels, the boundary byte is replicated. For
example, for a boundary word B0B1B2B3, to replicate the left boundary
byte texel, the out of bound dwords have the format of B0B0B0B0, and that
for right boundary is B3B3B3B3.
2. For an image with word texels, boundary texel replication is on words. For
example, for a boundary dword B0B1B2B3, to replicate the left boundary
word texel, the out of bound dwords have the format of B0B1B0B1, and that
for right boundary is B2B3B2B3.
3. For special images with (word texel) YUV packed format as described in
the cl_intel_packed_yuv extension, there are two cases depending on the
Y location: CL_YUYV_INTEL and CL_UYVY_INTEL. Boundary handling for
CL_YVYU_INTEL is the same as that for CL_YUYV_INTEL. Similarly, boundary
handling for CL_VYUY_INTEL is the same as that for UYVY. For a boundary
dword Y0U0Y1V0, to replicate the left boundary, we get Y0U0Y0V0, and to
replicate the right boundary, we get Y1U0Y1V0. For a boundary dword
U0Y0V0Y1, to replicate the left boundary, we get U0Y0V0Y0, and to
replicate the right boundary, we get U0Y1V0Y1.
4. For an image with dword texels, the boundary dword texel is replicated.
5. The behavior is undefined for images with greater than dword sized texels
(such as CL_RGBA + CL_FLOAT).
_____________________________________________________________________________
Write operations
----------------
Byte sized write operations
+++++++++++++++++++++++++++
+----------------------------------+---------------------------------------+
|void |Writes to a 2D region of an image with |
|intel_sub_group_media_block_write_|surface formats of byte sized texels. |
|uc( | |
| int2 src_byte_offset, |The 2D source byte offset of the |
| int width, |top-left corner and the width and |
| int height, |height of the region are specified |
| uchar texel, |explicitly in the interface |
| image2d_t image ) |parameters. The source byte x-offset |
| |and width must be 4 byte aligned. |
+----------------------------------+ |
|intel_sub_group_media_block_write_|The width is specified in byte units |
|uc2( |and must be less than or equal to |
| int2 src_byte_offset, |32. The width and height of the region |
| int width, |must be compile-time constants. |
| int height, | |
| uchar2 texels, |The 2D region that is written to is |
| image2d_t image ) |logically re-organized taken in |
| |row-major order as another 2D region |
+----------------------------------+with the byte width equal to the |
|void |subgroup size. Then each work-item |
|intel_sub_group_media_block_write_|processes each byte column vector of |
|uc4( |the logically re-organized rectangle, |
| int2 src_byte_offset, |i.e. each column's subsequent data |
| int width, |element's address is strided by the |
| int height, |subgroup size. |
| uchar4 texels, | |
| image2d_t image ) |The max byte area of the region is |
| |defined as the byte size of the return |
+----------------------------------+type multiplied by the subgroup |
|void |size. If the byte area of the region is|
|intel_sub_group_media_block_write_|less than its max byte area, then |
|uc8( |corresponding tail elements of some of |
| int2 src_byte_offset, |the column vector will not be included |
| int width, |in the written out region. Conversely |
| int height, |if the byte area of the region is more |
| uchar8 texels, |than the max byte area, then |
| image2d_t image ) |corresponding some of the tail elements|
| |of the region are dropped. |
+----------------------------------+ |
|void |Out-of-bound writes are dropped. |
|intel_sub_group_media_block_write_| |
|uc16( | |
| int2 src_byte_offset, | |
| int width, | |
| int height, | |
| uchar16 texels, | |
| image2d_t image ) | |
| | |
+----------------------------------+---------------------------------------+
Word sized write operations
+++++++++++++++++++++++++++
+----------------------------------+---------------------------------------+
|void |Writes to a 2D region of an image with |
|intel_sub_group_media_block_write_|surface formats of word or byte sized |
|us( |texels. |
| int2 src_byte_offset, | |
| int width, |The 2D source byte offset of the |
| int height, |top-left corner and the width and |
| ushort texel, |height of the region are specified |
| image2d_t image ) |explicitly in the interface parameters.|
| |The source byte x-offset and width must|
+----------------------------------+be dword aligned. |
|void | |
|intel_sub_group_media_block_write_|The width is specified in word units |
|us2( |and must be less than or equal to |
| int2 src_byte_offset, |16. The width and height of the region |
| int width, |must be compile-time constants. |
| int height, | |
| ushort2 texels, |The 2D region that is written to is |
| image2d_t image ) |logically re-organized taken in |
| |row-major order as another 2D region |
+----------------------------------+with the word width equal to the |
|void |subgroup size. Then each work-item |
|intel_sub_group_media_block_write_|processes each column vector of the |
|us4( |logically re-organized rectangle, |
| int2 src_byte_offset, |i.e. each column's subsequent data |
| int width, |element's address is strided by the |
| int height, |subgroup size multipled by 2. |
| ushort4 texels, | |
| image2d_t image ) |The max word area of the region is |
| |defined as the word size of the return |
+----------------------------------+type multiplied by the subgroup |
|void |size. If the word area of the region is|
|intel_sub_group_media_block_write_|less than its max word area, then |
|us8( |corresponding tail elements of some of |
| int2 src_byte_offset, |the column vector will not be included |
| int width, |in the written out region. Conversely |
| int height, |if the word area of the region is more |
| ushort8 texels, |than the max word area, then |
| image2d_t image ) |corresponding some of the tail elements|
| |of the region are dropped. |
+----------------------------------+ |
|void |Out-of-bound writes are dropped. |
|intel_sub_group_media_block_write_| |
|us16( | |
| int2 src_byte_offset, | |
| int width, | |
| int height, | |
| ushort16 texels, | |
| image2d_t image ) | |
| | |
+----------------------------------+---------------------------------------+
Double word (DWORD) sized write operations
++++++++++++++++++++++++++++++++++++++++++
+----------------------------------+---------------------------------------+
|void |Writes to a 2D region of an image with |
|intel_sub_group_media_block_write_|surface formats of dword, word or sized|
|ui( |texels. |
| int2 src_byte_offset, | |
| int width, |The 2D source byte offset of the |
| int height, |top-left corner and the width and |
| uint texels, |height of the region are specified |
| image2d_t image ) |explicitly in the interface |
| |parameters. The source byte x-offset |
+----------------------------------+and width must be 4 byte aligned. |
|void | |
|intel_sub_group_media_block_write_|The width is specified in dword units |
|ui2( |and must be less than or equal to |
| int2 src_byte_offset, |8. The width and height of the region |
| int width, |must be compile-time constants. |
| int height, | |
| uint2 texels, |The 2D region that is written to is |
| image2d_t image ) |logically re-organized taken in |
| |row-major order as another 2D region |
+----------------------------------+with the dword width equal to the |
|void |subgroup size. Then each work-item |
|intel_sub_group_media_block_write_|processes each column vector of the |
|ui4( |logically re-organized rectangle, |
| int2 src_byte_offset, |i.e. each column's subsequent data |
| int width, |element's address is strided by the |
| int height, |subgroup size multiplied by 4. |
| uint4 texels, | |
| image2d_t image ) |The max dword area of the region is |
| |defined as the dword size of the return|
+----------------------------------+type multiplied by the subgroup |
|void |size. If the dword area of the region |
|intel_sub_group_media_block_write_|is less than its max dword area, then |
|ui8( |corresponding tail elements of some of |
| int2 src_byte_offset, |the column vector will not be included |
| int width, |in the written out region. Conversely |
| int height, |if the dword area of the region is more|
| uint8 texels, |than the max texel area, then |
| image2d_t image ) |corresponding some of the tail elements|
| |of the region are dropped |
| | |
| |Out-of-bound writes are dropped. |
+----------------------------------+---------------------------------------+
_____________________________________________________________________________
Examples
1. Reading the vertical left edge of a macroblock in a kernel that use the
device-side VME built-in functions.
All images are 8-bit images with the image_channel_order and the
image_data_type as CL_R and CL_UNORM_INT8 respectively.
__kernel
void vme_intra_estimation_kernel(
__read_only image2d_t src_img,
__read_only image2d_t ref_img,
__read_only image2d_t src_luma_img,
...
{
...
// Read the left edge for a macro-block.
int2 edgeCoord;
edgeCoord.x = srcCoord.x - 4;
edgeCoord.y = srcCoord.y;
uint leftLumaEdgeDW =
intel_sub_group_media_block_read_ui(
edgeCoord,
1, // image region width of 1 dword
16, // image region height of 16
src_luma_image );
leftLumaEdge = as_uchar4( leftLumaEdgeDW ).s3;
...
intel_sub_group_avc_sic_result_t result;
result =
intel_sub_group_avc_sic_evaluate_ipe(
src_img,
vme_sampler,
payload );
...
}
Image 2D region: Subgroup work-items:
++++++++++++++++ ++++++++++++++++++++
+-+
|0|
+-+
|1| Subgroup local id: 0 1 2 3 4 5 6 7 8 9 A B C D E F
+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|2| |0|1|2|3|4|5|6|7|8|9|A|B|C|D|E|F|
+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|3|
+-+
|4|
+-+
|5| The max byte area of the image region is 1*16 (16).
+-+
|6| The 1x16 dword image region is re-organized taken in row-
+-+ major order as another 2D region within the subgroup with
|7| the word width equal to the subgroup size of 16 - 16x1
+-+ region.
|8|
+-+
|9|
+-+
|A|
+-+
|B|
+-+
|C|
+-+
|D|
+-+
|E|
+-+
|F|
+-+
2. Reading a 16x2 word region from an image.
All images are 8-bit images with the image_channel_order and the
image_data_type as CL_R and CL_UNORM_INT8 respectively.
__kernel __attribute__((intel_reqd_sub_group_size(8)))
void vme_intra_estimation_kernel(
__read_only image2d_t src_img,
...
{
...
// Read the 16x2 word region in a subgroup of size 8.
int2 srcCoord;
...
ushort4 texels =
intel_sub_group_media_block_read_us4(
srcCoord,
16, // image region width of 16 words
2, // image rgeion height of 16
src_image );
...
}
Image 2D region:
++++++++++++++++
The max word area of the region is 4*8 (32).
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|0 |1 |2 |3 |4 |5 |6 |7 |8 |9 |A |B |C |D |E |F |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|10|11|12|13|14|15|16|17|18|19|1A|1B|1C|1D|1E|1F|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
Subgroup work-items:
++++++++++++++++++++
The 16x2 image region is re-organized taken in row-major order as another
2D region within the subgroup with the word width equal to the subgroup
size of 8 - 8x4 region. Each work-item processes a data item strided by 8
words.
Subgroup local id: 0 1 2 3 4 5 6 7
+--+--+--+--+--+--+--+--+
|0 |1 |2 |3 |4 |5 |6 |7 |
+--+--+--+--+--+--+--+--+
|8 |9 |A |B |C |D |E |F |
+--+--+--+--+--+--+--+--+
|10|11|12|13|14|15|16|17|
+--+--+--+--+--+--+--+--+
|18|19|1A|1B|1C|1D|1E|1F|
+--+--+--+--+--+--+--+--+
Revision History
Version 1 (12/02/2016): First public revision.