| Name String |
| |
| cl_intel_subgroups |
| |
| Contributors |
| |
| Ben Ashbaugh, Intel |
| Allen Hux, Intel |
| Pranayini Gudali, Intel |
| Dawid Dominiak, Intel |
| Biju George, Intel |
| |
| Contact |
| |
| Ben Ashbaugh, Intel (ben.ashbaugh 'at' intel.com) |
| |
| Version |
| |
| Version 4, August 28, 2016 |
| |
| Number |
| |
| OpenCL Extension #35 |
| |
| Status |
| |
| Final Draft |
| |
| Dependencies |
| |
| OpenCL 1.2 is required. Some features (get_enqueued_num_sub_groups() and |
| the sub_group_barrier() function that accepts a memory scope) require OpenCL |
| 2.0. |
| |
| This extension is written against revision 24 of the OpenCL 2.0 API |
| specification, against revision 24 of the OpenCL 2.0 OpenCL C specification, |
| and against revision 24 of the OpenCL 2.0 extension specification. |
| |
| Overview |
| |
| The goal of this extension is to allow programmers to improve the performance |
| of their applications by taking advantage of the fact that some work items in a |
| work group execute together as a group (a "subgroup"), and that work items in a |
| subgroup can take advantage of hardware features that are not available to work |
| items in a work group. Specifically, this extension is designed to allow work |
| items in a subgroup to share data without the use of local memory and work group |
| barriers, and to utilize specialized hardware to load and store blocks of data. |
| |
| There is a large amount of overlap between the functionality in this extension |
| and the functionality in the Khronos OpenCL 2.0 "cl_khr_subgroups" extension, so |
| this extension reuses many of the names, concepts, and functions already described |
| in the cl_khr_subgroups extension. The key differences between the Intel |
| subgroups extension and the Khronos subgroups extension are: |
| |
| * The Khronos subgroups extension requires OpenCL 2.0, but the Intel subgroups |
| extension may be available on OpenCL 1.2 devices. |
| |
| * The Khronos subgroups extension guarantees that subgroups in a work group |
| will make independent forward progress, but the Intel extension does not |
| guarantee that subgroups in a work group will make independent forward |
| progress. |
| |
| * The Intel extension adds a rich set of subgroup "shuffle" functions to |
| allow work items within a subgroup to exchange data without the use |
| of local memory and work group barriers. |
| |
| * The Intel extension adds a set of subgroup "block read and write" functions |
| to take advantage of specialized hardware to read or write blocks of data |
| from or to buffers or images. |
| |
| * The Intel subgroups extension does not include the subgroup pipes functions |
| that are included as part of the Khronos subgroups extension. |
| |
| * The Intel subgroups extension does not include the device-side kernel query |
| functions for subgroups that are included as part of the Khronos subgroups |
| extension. |
| |
| New API Functions |
| |
| This function is copied unchanged from the Khronos subgroups extension: |
| |
| cl_int clGetKernelSubGroupInfoKHR( |
| cl_kernel kernel, |
| cl_device_id device, |
| cl_kernel_sub_group_info param_name, |
| size_t input_value_size, |
| const void* input_value, |
| size_t param_value_size, |
| void* param_value, |
| size_t* param_value_size_ret) |
| |
| New API Enums |
| |
| These enums are copied unchanged from the Khronos subgroups extension: |
| |
| Accepted as the <param_name> parameter of clGetKernelSubGroupInfoKHR. |
| |
| CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE_KHR 0x2033 |
| CL_KERNEL_SUB_GROUP_COUNT_FOR_NDRANGE_KHR 0x2034 |
| |
| New OpenCL C Functions |
| |
| These built-in functions are copied unchanged from the Khronos subgroups |
| extension: |
| |
| uint get_sub_group_size( void ); |
| uint get_max_sub_group_size( void ); |
| uint get_num_sub_groups( void ); |
| |
| uint get_sub_group_id( void ); |
| uint get_sub_group_local_id( void ); |
| |
| void sub_group_barrier( cl_mem_fence_flags flags ); |
| |
| int sub_group_all( int predicate ); |
| int sub_group_any( int predicate ); |
| |
| If OpenCL 2.0 is supported: |
| |
| uint get_enqueued_num_sub_groups( void ); |
| void sub_group_barrier( cl_mem_fence_flags flags, memory_scope scope ); |
| |
| For the sub_group_broadcast functions, <gentype> is <int>, <uint>, |
| <long>, <ulong>, or <float>. |
| |
| If cl_khr_fp16 is supported, <gentype> also includes <half>. |
| If cl_khr_fp64 or doubles are supported, <gentype> also includes <double>. |
| |
| <gentype> sub_group_broadcast( <gentype> x, uint sub_group_local_id ); |
| |
| For the sub_group_reduce, sub_group_scan_exclusive, and |
| sub_group_scan_inclusive functions, <gentype> is <int>, <uint>, <long>, |
| <ulong>, or <float>. <op> is <add>, <min>, or <max>. |
| |
| If cl_khr_fp16 is supported, <gentype> also includes <half>. |
| If cl_khr_fp64 or doubles are supported, <gentype> also includes <double>. |
| |
| <gentype> sub_group_reduce_<op>( <gentype> x ); |
| <gentype> sub_group_scan_exclusive_<op>( <gentype> x ); |
| <gentype> sub_group_scan_inclusive_<op>( <gentype> x ); |
| |
| These built-in functions are unique to the Intel subgroups extension and are not |
| part of the Khronos subgroups extension: |
| |
| For the intel_sub_group_shuffle, intel_sub_group_shuffle_down, |
| intel_sub_group_shuffle_up, and intel_sub_group_shuffle_xor functions, |
| <gentype> is <float>, <float2>, <float4>, <float8>, <float16>, <int>, <int2>, |
| <int4>, <int8>, <int16>, <uint>, <uint2>, <uint4>, <uint8>, <uint16>, <long>, |
| or <ulong>. |
| |
| If cl_khr_fp16 is supported, <gentype> also includes <half>. |
| If cl_khr_fp64 or doubles are supported, <gentype> also includes <double>. |
| |
| <gentype> intel_sub_group_shuffle( <gentype> data, uint c ); |
| <gentype> intel_sub_group_shuffle_down( |
| <gentype> current, <gentype> next, uint delta ); |
| <gentype> intel_sub_group_shuffle_up( |
| <gentype> previous, <gentype> current, uint delta ); |
| <gentype> intel_sub_group_shuffle_xor( <gentype> data, uint value ); |
| |
| |
| uint intel_sub_group_block_read( const __global uint* p ); |
| uint2 intel_sub_group_block_read2( const __global uint* p ); |
| uint4 intel_sub_group_block_read4( const __global uint* p ); |
| uint8 intel_sub_group_block_read8( const __global uint* p ); |
| |
| uint intel_sub_group_block_read( image2d_t image, int2 byte_coord ); |
| uint2 intel_sub_group_block_read2( image2d_t image, int2 byte_coord ); |
| uint4 intel_sub_group_block_read4( image2d_t image, int2 byte_coord ); |
| uint8 intel_sub_group_block_read8( image2d_t image, int2 byte_coord ); |
| |
| void intel_sub_group_block_write( __global uint* p, uint data ); |
| void intel_sub_group_block_write2( __global uint* p, uint2 data ); |
| void intel_sub_group_block_write4( __global uint* p, uint4 data ); |
| void intel_sub_group_block_write8( __global uint* p, uint8 data ); |
| |
| void intel_sub_group_block_write( image2d_t image, int2 byte_coord, uint data ); |
| void intel_sub_group_block_write2( image2d_t image, int2 byte_coord, uint2 data ); |
| void intel_sub_group_block_write4( image2d_t image, int2 byte_coord, uint4 data ); |
| void intel_sub_group_block_write8( image2d_t image, int2 byte_coord, uint8 data ); |
| |
| New OpenCL C Enums |
| |
| This enum is copied unchanged from the Khronos subgroups extension: |
| |
| Add the following new value to the enumerated type <memory_scope>: |
| |
| memory_scope_sub_group |
| |
| Modifications to Section 2 - "Glossary" of the OpenCL 2.0 API Specification |
| |
| Add memory_scope_sub_group to the description of Memory Scopes: |
| |
| "Memory Scopes: Memory scopes define a hierarchy of visibilities when analyzing the |
| ordering constraints of memory operations. They are defined by the values of the |
| memory_scope enumeration constant. Current values are memory_scope_work_item (memory |
| constraints only apply to a single work item and in practice only apply to image |
| operations), memory_scope_sub_group (memory-ordering constraints only apply to work |
| items executing in a subgroup), memory_scope_work_group ..." |
| |
| Add memory_scope_sub_group to the description of Scope inclusion: |
| |
| "Scope inclusion: Two actions A and B are defined to have an inclusive scope if they |
| have the same scope P such that: (1) if P is memory_scope_sub_group, and A and B are |
| executed by work items within the same subgroup, or (2) if P is memory_scope_work_group, |
| and A and B are executed by work items within the same workgroup ..." |
| |
| Change the description for Subgroups to: |
| |
| "Subgroup: Subgroups are an implementation-dependent grouping of work items within a |
| work group. The size and number of subgroups is implementation-defined and not |
| exposed in the core OpenCL 2.0 feature set. Subgroups execute concurrently within |
| a work group, but are not guaranteed to make independent forward progress. |
| Subgroups may synchronize internally using subgroup barrier operations without |
| synchronizing with other subgroups." |
| |
| Modifications to Section 3.2.1 - "Execution Model: Mapping Work Items Onto an NDRange" of |
| the OpenCL 2.0 API Specification |
| |
| Change the paragraph describing subgroups to: |
| |
| "An implementation of OpenCL may divide each work group into one or more subgroups. |
| The size and number of subgroups is implementation-defined and not exposed in the |
| core OpenCL 2.0 feature set." |
| |
| Modifications to Section 3.2.2 - "Execution Model: Execution of Kernel Instances" of the |
| OpenCL 2.0 API Specification |
| |
| Remove the last paragraph describing subgroups and independent forward progress. |
| |
| Additions to Section 3.2 - "Execution Model" of the OpenCL 2.0 API Specification |
| |
| This text is largely the same as the text in the Khronos subgroups extension. |
| Only the sentence about independent forward progress has been modified. |
| |
| "Within a work group, work items may be divided into subgroups in an implementation- |
| defined fashion. The mapping of work items to subgroups is implementation-defined |
| and may be queried at runtime. While subgroups may be used in multi-dimensional |
| work groups, each subgroup is 1-dimensional and any given work item may query which |
| subgroup it is a member of. |
| |
| Work items are mapped into subgroups through a combination of compile-time decisions |
| and the parameters of the dispatch. The mapping to subgroups is invariant for the |
| duration of a kernel's execution, across dispatches of a given kernel with the same |
| launch parameters, and from one work group to another within the dispatch (excluding |
| the trailing edge work groups in the presence of non-uniform work group sizes). In |
| addition, all subgroups within a work group will be the same size, apart from the |
| subgroup with the maximum index, which may be smaller if the size of the work group |
| is not evenly divisible by the size of the subgroups. |
| |
| Subgroups execute concurrently within a given work group. Similar to work items |
| within a work group, subgroups executing within a work group are not guaranteed to make |
| independent forward progress. Work items in a subgroup can internally synchronize |
| using subgroup barrier operations without synchronizing with other subgroups." |
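| |
| As an illustration of the size rule above, the following plain C sketch models how a |
| work group might be partitioned into subgroups. It is host-side arithmetic, not |
| OpenCL C, and the names model_num_sub_groups and model_sub_group_size are hypothetical. |
| It assumes the implementation has already chosen a subgroup size; how that size is |
| chosen remains implementation-defined. |

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of subgroups in a work group containing
 * work_group_size work items, for a given subgroup size. This is a
 * ceiling division. */
size_t model_num_sub_groups(size_t work_group_size, size_t sub_group_size)
{
    assert(sub_group_size > 0);
    return (work_group_size + sub_group_size - 1) / sub_group_size;
}

/* Hypothetical helper: size of the subgroup with index sub_group_id.
 * Every subgroup is full except possibly the one with the maximum index. */
size_t model_sub_group_size(size_t work_group_size, size_t sub_group_size,
                            size_t sub_group_id)
{
    size_t count = model_num_sub_groups(work_group_size, sub_group_size);
    size_t remainder = work_group_size % sub_group_size;

    assert(sub_group_id < count);
    if (sub_group_id == count - 1 && remainder != 0)
        return remainder;
    return sub_group_size;
}
```

| For example, a work group of 20 work items with a subgroup size of 8 yields three |
| subgroups of sizes 8, 8, and 4. |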
| |
| Additions to Section 3.3.4 - "Memory Model: Memory Consistency Model" |
| |
| Add memory_scope_sub_group to the bulleted descriptions of memory scopes: |
| |
| " * memory_scope_sub_group: memory-ordering constraints only apply to work items |
| executing within a single subgroup. |
| * memory_scope_work_group: ..." |
| |
| In the paragraph after the bulleted descriptions of memory scopes, include |
| memory_scope_sub_group as a valid memory scope for local memory: |
| |
| "... For local memory, memory_scope_sub_group and memory_scope_work_group are valid, |
| and may constrain visibility to the subgroup or workgroup." |
| |
| Additions to Section 3.3.5 - "Memory Model: Overview of atomic and fence operations" |
| |
| Add memory_scope_sub_group to the definition of inclusive scope: |
| |
| " * P is memory_scope_sub_group and A and B are executed by work items within the same |
| subgroup. |
| * P is memory_scope_work_group ..." |
| |
| Additions to Section 5.9.3 - "Kernel Object Queries" of the OpenCL 2.0 API Specification |
| |
| This addition is copied unchanged from the Khronos subgroups extension: |
| |
| "The function |
| |
| cl_int clGetKernelSubGroupInfoKHR( |
| cl_kernel kernel, |
| cl_device_id device, |
| cl_kernel_sub_group_info param_name, |
| size_t input_value_size, |
| const void* input_value, |
| size_t param_value_size, |
| void* param_value, |
| size_t* param_value_size_ret) |
| |
| returns information about the kernel object. |
| |
| <kernel> specifies the kernel object being queried. |
| |
| <device> identifies a specific device in the list of devices associated with <kernel>. |
| The list of devices is the list of devices in the OpenCL context that is associated |
| with <kernel>. If the list of devices associated with <kernel> is a single device, |
| <device> can be a NULL value. |
| |
| <param_name> specifies the information to query. The list of supported <param_name> |
| types and the information returned in <param_value> by clGetKernelSubGroupInfoKHR is |
| described in the table below. |
| |
| <input_value_size> is used to specify the size in bytes of memory pointed to by |
| <input_value>. This size must be equal to the size of the input type as described |
| in the table below. |
| |
| <input_value> is a pointer to memory where the appropriate parameterization of the |
| query is passed from. If <input_value> is NULL it is ignored. |
| |
| <param_value_size> is used to specify the size in bytes of memory pointed to by |
| <param_value>. This size must be greater than or equal to the size of the return type |
| as described in the table below. |
| |
| <param_value_size_ret> returns the actual size in bytes of data copied to <param_value>. |
| If <param_value_size_ret> is NULL it is ignored. |
| |
| -------------------------------------------------------------------------------------- |
| cl_kernel_sub_group_info Input Type Return Type Description |
| ------------------------ ---------- ----------- ----------------------------------- |
| CL_KERNEL_MAX_SUB_GROUP_ size_t* size_t Returns the maximum subgroup size |
| SIZE_FOR_NDRANGE for this kernel. All subgroups must |
| be the same size, while the last |
| subgroup in any work group (i.e. the |
| subgroup with the maximum index) |
| could be the same or smaller size. |
| |
| The <input_value> must be an array |
| of size_t values corresponding to |
| the local work size parameter of the |
| intended dispatch. The number of |
| dimensions in the NDRange will be |
| inferred from the value specified |
| for <input_value_size>. |
| |
| CL_KERNEL_SUB_GROUP_ size_t* size_t Returns the number of subgroups that |
| COUNT_FOR_NDRANGE will be present in each work group |
| for a given local work size. All |
| work groups, apart from the last |
| work group in each dimension in the |
| presence of non-uniform work group |
| sizes, will have the same number of |
| subgroups. |
| |
| The <input_value> must be an array |
| of size_t values corresponding to |
| the local work size parameter of the |
| intended dispatch. The number of |
| dimensions in the NDRange will be |
| inferred from the value specified |
| for <input_value_size>. |
| -------------------------------------------------------------------------------------- |
| |
| clGetKernelSubGroupInfoKHR returns CL_SUCCESS if the function executed successfully. |
| Otherwise, it returns one of the following errors: |
| |
| * CL_INVALID_DEVICE if <device> is not in the list of devices associated with <kernel>, |
| or if <device> is NULL but there is more than one device associated with <kernel>. |
| |
| * CL_INVALID_VALUE if <param_name> is not valid, or if the size in bytes specified by |
| <param_value_size> is less than the size of the return type as described in the |
| table above and <param_value> is not NULL. |
| |
| * CL_INVALID_VALUE if <param_name> is CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE and the |
| size in bytes specified by <input_value_size> is not valid or if <input_value> is |
| NULL. |
| |
| * CL_INVALID_KERNEL if <kernel> is not a valid kernel object. |
| |
| * CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the |
| OpenCL implementation on the device. |
| |
| * CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by |
| the OpenCL implementation on the host." |
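| |
| The relationship between <input_value_size> and the number of NDRange dimensions can |
| be sketched in plain C as follows. This is a model of the rule described in the table |
| above, not the implementation of clGetKernelSubGroupInfoKHR. The helper names |
| model_infer_work_dim and model_work_group_size are hypothetical, and the upper bound |
| of three dimensions is an assumption of this sketch. |

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the OpenCL API: returns the number of
 * NDRange dimensions implied by input_value_size, or 0 when the size does
 * not describe an array of 1 to 3 size_t values. The bound of 3 is an
 * assumption of this sketch. */
unsigned model_infer_work_dim(size_t input_value_size)
{
    size_t dims;

    if (input_value_size == 0 || input_value_size % sizeof(size_t) != 0)
        return 0;
    dims = input_value_size / sizeof(size_t);
    return dims <= 3 ? (unsigned)dims : 0;
}

/* Hypothetical helper: total number of work items in one work group for
 * the local work size array that would be passed as input_value. */
size_t model_work_group_size(const size_t *local_work_size, unsigned work_dim)
{
    size_t total = 1;
    unsigned i;

    assert(local_work_size != NULL && work_dim >= 1);
    for (i = 0; i < work_dim; ++i)
        total *= local_work_size[i];
    return total;
}
```

| Under this model, a host that intends to dispatch with a two-dimensional local work |
| size would pass 2 * sizeof(size_t) as <input_value_size>. |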
| |
| Additions to Section 6.13.1 - "Work Item Functions" of the OpenCL 2.0 C Specification |
| |
| These additions are copied unchanged from the Khronos subgroups extension: |
| |
| "-------------------------------------------------------------------------------------- |
| Function Description |
| ------------------------------------- ----------------------------------------------- |
| uint get_sub_group_size( void ) Returns the number of work items in the |
| subgroup. This value is no more than the |
| maximum subgroup size and is implementation- |
| defined based on a combination of the compiled |
| kernel and the dispatch dimensions. |
| This will be a constant value for the lifetime |
| of the subgroup. |
| |
| uint get_max_sub_group_size( void ) Returns the maximum size of a subgroup with the |
| dispatch. This value will be invariant for a |
| given set of dispatch dimensions and a kernel |
| object compiled for a given device. |
| |
| uint get_num_sub_groups( void ) Returns the number of subgroups that the current |
| work group is divided into. |
| |
| This number will be constant for the duration of |
| a work group's execution. If the kernel is |
| executed with a non-uniform work group size in |
| any dimension, calls to this built-in may return |
| different values for some work groups than for |
| other work groups. |
| |
| uint get_sub_group_id( void ) Returns the subgroup ID, which is a number from |
| zero to get_num_sub_groups - 1. |
| |
| For clEnqueueTask, this returns zero. |
| |
| uint get_sub_group_local_id( void ) Returns the unique work item ID within the |
| current subgroup. The mapping from get_local_id |
| to get_sub_group_local_id will be invariant for |
| the lifetime of the work group. |
| |
| --------------------------------------------------------------------------------------" |
| |
| If OpenCL 2.0 is supported: |
| |
| "-------------------------------------------------------------------------------------- |
| Function Description |
| ---------------------------------------- -------------------------------------------- |
| uint get_enqueued_num_sub_groups( void ) Returns the same value as that returned by |
| get_num_sub_groups if the kernel is executed |
| with a uniform work group size. This value |
| will be constant for the entire NDRange. |
| |
| If the kernel is executed with a non-uniform |
| work group size, returns the number of |
| subgroups in a work group that makes up the |
| uniform region of the global NDRange. |
| --------------------------------------------------------------------------------------" |
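| |
| The difference between get_num_sub_groups and get_enqueued_num_sub_groups can be |
| illustrated with a plain C model of a one-dimensional NDRange. This is a sketch, not |
| OpenCL C: the function names are hypothetical, and it assumes that every work group, |
| including the trailing one, is partitioned using the same subgroup size. |

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: ceiling division. */
static size_t model_ceil_div(size_t a, size_t b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Models get_enqueued_num_sub_groups(): the subgroup count of a work group
 * in the uniform region, which has the full enqueued local size. */
size_t model_enqueued_num_sub_groups(size_t enqueued_local_size,
                                     size_t sub_group_size)
{
    return model_ceil_div(enqueued_local_size, sub_group_size);
}

/* Models get_num_sub_groups() for the work group with index work_group_id
 * in a 1D NDRange. The trailing work group may hold fewer work items when
 * global_size is not a multiple of enqueued_local_size. */
size_t model_num_sub_groups_1d(size_t global_size, size_t enqueued_local_size,
                               size_t sub_group_size, size_t work_group_id)
{
    size_t num_groups = model_ceil_div(global_size, enqueued_local_size);
    size_t remaining, local_size;

    assert(work_group_id < num_groups);
    remaining = global_size - work_group_id * enqueued_local_size;
    local_size = remaining < enqueued_local_size ? remaining
                                                 : enqueued_local_size;
    return model_ceil_div(local_size, sub_group_size);
}
```

| For a global size of 100, an enqueued local size of 32, and a subgroup size of 8, the |
| model reports 4 enqueued subgroups, 4 subgroups in work groups 0 through 2, and 1 |
| subgroup in the trailing work group, which holds only 4 work items. |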
| |
| Additions to Section 6.13.8 - "Synchronization Functions" of the OpenCL 2.0 C Specification |
| |
| These additions are mostly unchanged from the Khronos subgroups extension. There is |
| no new functionality, only minor edits for clarity: |
| |
| "-------------------------------------------------------------------------------------- |
| Function Description |
| ---------------------------------------- -------------------------------------------- |
| void sub_group_barrier( All work items in a subgroup executing the |
| cl_mem_fence_flags flags ) kernel on a processor must execute this |
| function before any are allowed to continue |
| execution beyond the subgroup barrier. This |
| function must be encountered by all work |
| items in a subgroup executing the kernel. |
| These rules apply to NDRanges implemented |
| with uniform and non-uniform work groups. |
| |
| If sub_group_barrier is inside a conditional |
| statement then all work items within the |
| subgroup must enter the conditional if |
| any work item in the subgroup enters the |
| conditional statement and executes the |
| sub_group_barrier. |
| |
| If sub_group_barrier is inside a loop, all |
| work items within the subgroup must execute |
| the sub_group_barrier for each iteration of |
| the loop before any are allowed to continue |
| execution beyond the sub_group_barrier. |
| |
| The sub_group_barrier function also queues a |
| memory fence (reads and writes) to ensure |
| correct ordering of memory operations to |
| local or global memory. |
| |
| The flags argument specifies the memory |
| address space and can be set to a |
| combination of the following values: |
| |
| CLK_LOCAL_MEM_FENCE - The sub_group_barrier |
| function will either flush any variables |
| stored in local memory or queue a memory |
| fence to ensure correct ordering of memory |
| operations to local memory. |
| |
| CLK_GLOBAL_MEM_FENCE - The sub_group_barrier |
| function will queue a memory fence to ensure |
| correct ordering of memory operations to |
| global memory. This can be useful when work |
| items, for example, write to buffer objects |
| and then want to read the updated data from |
| these buffer objects. |
| --------------------------------------------------------------------------------------" |
| |
| If OpenCL 2.0 is supported, add the following to the table above: |
| |
| "-------------------------------------------------------------------------------------- |
| Function Description |
| ---------------------------------------- -------------------------------------------- |
| void sub_group_barrier( ... |
| cl_mem_fence_flags flags, The sub_group_barrier function also supports |
| memory_scope scope ) a variant that specifies the memory scope. |
| For the sub_group_barrier variant that does |
| not take a memory scope, the scope is |
| memory_scope_sub_group. |
| |
| The scope argument specifies whether the |
| memory accesses of work items in the |
| subgroup to memory address space(s) |
| identified by flags become visible to all |
| work items in the subgroup, the work group, |
| the device, or all SVM devices. |
| ... |
| CLK_IMAGE_MEM_FENCE - The sub_group_barrier |
| function will queue a memory fence to ensure |
| correct ordering of memory operations to |
| image objects. This can be useful when work |
| items, for example, write to image objects |
| and then want to read the updated data from |
| these image objects. |
| --------------------------------------------------------------------------------------" |
| |
| Additions to Section 6.13.11 - "Atomic Functions" of the OpenCL 2.0 C Specification |
| |
| Modify the bullet describing behavior for functions that do not have a memory_scope |
| argument to say: |
| |
| " * The subgroup functions that do not have a memory_scope argument have the same |
| semantics as the corresponding functions with the memory_scope argument set to |
| memory_scope_sub_group. Other functions that do not have a memory_scope |
| argument have the same semantics as the corresponding functions with the |
| memory_scope argument set to memory_scope_device." |
| |
| This addition is copied unchanged from the Khronos subgroups extension: |
| |
| Add the following new value to the enumerated type <memory_scope> defined in Section |
| 6.13.11.4: |
| |
| "<memory_scope_sub_group> |
| |
| The <memory_scope_sub_group> specifies that the memory ordering constraints given by |
| <memory_order> apply to work items in a subgroup. This memory scope can be used when |
| performing atomic operations to global or local memory." |
| |
| Additions to Section 6.13.15 - "Work Group Functions" of the OpenCL 2.0 C Specification |
| |
| These additions are copied from the Khronos subgroups extension: |
| |
| "The OpenCL C programming language implements the following built-in functions that |
| operate on a subgroup level. These built-in functions must be encountered by all work |
| items in a subgroup executing the kernel. We use the generic term <gentype> to indicate |
| the built-in data types <int>, <uint>, <long>, <ulong>, or <float> as the type for the |
| arguments. |
| |
| If cl_khr_fp16 is supported, <gentype> also includes <half>. |
| If cl_khr_fp64 or doubles are supported, <gentype> also includes <double>. |
| |
| -------------------------------------------------------------------------------------- |
| Function Description |
| ---------------------------------------- -------------------------------------------- |
| int sub_group_all( int predicate ) Evaluates predicate for all work items in |
| the subgroup and returns a non-zero value |
| if predicate evaluates to non-zero for all |
| work items in the subgroup. |
| |
| int sub_group_any( int predicate ) Evaluates predicate for all work items in |
| the subgroup and returns a non-zero value if |
| predicate evaluates to non-zero for any work |
| item in the subgroup. |
| |
| <gentype> sub_group_broadcast( Broadcasts the value of x for the work item |
| <gentype> x, identified by sub_group_local_id (value |
| uint sub_group_local_id ) returned by get_sub_group_local_id) to all |
| work items in the subgroup. |
| sub_group_local_id must be the same value |
| for all work items in the subgroup. |
| |
| <gentype> sub_group_reduce_<op>( Returns the result of the reduction operation |
| <gentype> x ) specified by <op> for all values x specified |
| by work items in a subgroup. |
| |
| <gentype> sub_group_scan_exclusive_<op>( Does an exclusive scan operation specified by |
| <gentype> x ) <op> of all values specified by work items |
| in a subgroup. The scan results are |
| returned for each work item. |
| |
| The scan order is defined by increasing |
| sub_group_local_id within the subgroup. |
| |
| <gentype> sub_group_scan_inclusive_<op>( Does an inclusive scan operation specified by |
| <gentype> x ) <op> of all values specified by work items |
| in a subgroup. The scan results are |
| returned for each work item. |
| |
| The scan order is defined by increasing |
| sub_group_local_id within the subgroup. |
| --------------------------------------------------------------------------------------" |
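| |
| The results that each work item would observe from these functions can be modeled in |
| plain C. The sketch below is a host-side reference model for the <add> operation on |
| <int> only, with hypothetical function names; it is not OpenCL C. The exclusive scan |
| starts from the identity value (zero for add), which follows the scan definition used |
| for the OpenCL 2.0 work group functions and is assumed here to carry over to subgroups. |

```c
#include <assert.h>
#include <stddef.h>

/* In every function, element i of the input array is the value supplied by
 * the work item whose sub_group_local_id is i, and n is the number of work
 * items in the subgroup. */

/* Non-zero only if the predicate is non-zero for every work item. */
int model_sub_group_all(const int *predicate, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i)
        if (predicate[i] == 0)
            return 0;
    return 1;
}

/* Non-zero if the predicate is non-zero for at least one work item. */
int model_sub_group_any(const int *predicate, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i)
        if (predicate[i] != 0)
            return 1;
    return 0;
}

/* Every work item receives the same sum over the whole subgroup. */
int model_sub_group_reduce_add(const int *x, size_t n)
{
    int sum = 0;
    size_t i;
    for (i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}

/* Result seen by work item id: sum of x[0] through x[id], inclusive. */
int model_sub_group_scan_inclusive_add(const int *x, size_t n, size_t id)
{
    int sum = 0;
    size_t i;
    assert(id < n);
    for (i = 0; i <= id; ++i)
        sum += x[i];
    return sum;
}

/* Result seen by work item id: sum of x[0] through x[id - 1], so work
 * item 0 receives the identity value 0. */
int model_sub_group_scan_exclusive_add(const int *x, size_t n, size_t id)
{
    int sum = 0;
    size_t i;
    assert(id < n);
    for (i = 0; i < id; ++i)
        sum += x[i];
    return sum;
}
```

| For the inputs 1, 2, 3, 4 in order of increasing sub_group_local_id, the reduction is |
| 10 for every work item, the inclusive scan yields 1, 3, 6, 10, and the exclusive scan |
| yields 0, 1, 3, 6. |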
| |
| Add a new Section 6.13.X - "Sub Group Shuffle Functions" to the OpenCL 2.0 C Specification |
| |
| These additions are unique to the Intel subgroups extension and are not part of the |
| Khronos subgroups extension: |
| |
| "The OpenCL C programming language implements the following subgroup shuffle built-in |
| functions to allow data to be exchanged among work items in a subgroup. These |
| built-in functions need not be encountered by all work items in a subgroup executing |
| the kernel, however, data may only be shuffled among work items encountering the |
| subgroup shuffle function. Shuffling data from a work item that does not encounter |
| the subgroup shuffle function will produce undefined results. |
| |
| For these functions, <gentype> is <float>, <float2>, <float4>, <float8>, <float16>, |
| <int>, <int2>, <int4>, <int8>, <int16>, <uint>, <uint2>, <uint4>, <uint8>, <uint16>, |
| <long>, or <ulong>. |
| |
| If cl_khr_fp16 is supported, <gentype> also includes <half>. |
| If cl_khr_fp64 or doubles are supported, <gentype> also includes <double>. |
| |
| -------------------------------------------------------------------------------------- |
| Function Description |
| ---------------------------------------- -------------------------------------------- |
| <gentype> intel_sub_group_shuffle( Allows data to be arbitrarily transferred |
| <gentype> data, between work items in a subgroup. The data |
| uint sub_group_local_id ) that is returned for this work item is the |
| value of data for the work item identified |
| by sub_group_local_id. |
| |
| sub_group_local_id need not be the same |
| value for all work items in the subgroup. |
| There is no defined behavior for out-of- |
| range sub_group_local_ids. |
| |
| <gentype> intel_sub_group_shuffle_down( Allows data to be transferred from a work |
| <gentype> current, item in the subgroup with a higher |
| <gentype> next, sub_group_local_id down to a work item in |
| uint delta ) the subgroup with a lower sub_group_local_id. |
| |
| There are two data sources to this built-in |
| function: current and next. To determine the |
| result of this built-in function, first let |
| the unsigned shuffle index be equivalent to |
| the sum of this work item's sub_group_local_id |
| plus the specified delta: |
| |
| If the shuffle index is less than the |
| max_sub_group_size, the result of this built-in |
| function is the value of the current data |
| source for the work item with |
| sub_group_local_id equal to the shuffle index. |
| |
| If the shuffle index is greater than or equal to the |
| max_sub_group_size but less than twice the |
| max_sub_group_size, the result of this |
| built-in function is the value of the next |
| data source for the work item with |
| sub_group_local_id equal to the shuffle index |
| minus the max_sub_group_size. |
| |
| All other values of the shuffle index are |
| considered to be out-of-range. There is no |
| defined behavior for out-of-range indices. |
| |
| delta need not be the same value for all work |
| items in the subgroup. |
| |
| <gentype> intel_sub_group_shuffle_up( Allows data to be transferred from a work |
| <gentype> previous, item in the subgroup with a lower |
| <gentype> current, sub_group_local_id up to a work item in the |
| uint delta ) subgroup with a higher sub_group_local_id. |
| |
| There are two data sources to this built-in |
| function: previous and current. To determine |
| the result of this built-in function, first |
| let the signed shuffle index be equivalent to |
| this work item's sub_group_local_id minus the |
| specified delta: |
| |
| If the shuffle index is greater than or equal |
| to zero and less than the max_sub_group_size, |
| the result of this built-in function is the |
| value of the current data source for the work |
| item with sub_group_local_id equal to the |
| shuffle index. |
| |
| If the shuffle index is less than zero but |
| greater than or equal to the negative |
| max_sub_group_size, the result of this |
| built-in function is the value of the previous |
| data source for the work item with |
| sub_group_local_id equal to the shuffle index |
| plus the max_sub_group_size. |
| |
| All other values of the shuffle index are |
| considered to be out-of-range. There is no |
| defined behavior for out-of-range indices. |
| |
| delta need not be the same value for all work |
| items in the subgroup. |
| |
| <gentype> intel_sub_group_shuffle_xor( Allows data to be transferred between work |
| <gentype> data, items in a subgroup as a function of the work |
| uint value ) item's sub_group_local_id. The data that is |
| returned for this work item is the value of |
| data for the work item with sub_group_local_id |
| equal to this work item's sub_group_local_id |
| XOR'd with the specified value. If the result |
| of the XOR is greater than or equal to |
| max_sub_group_size then it is considered |
| out-of-range. |
| |
| value need not be the same for all work items |
| in the subgroup. There is no defined behavior |
| for out-of-range indices. |
| --------------------------------------------------------------------------------------" |
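| |
| The index rules for the shuffle functions can be checked against a small plain C |
| model. The sketch below is host-side code, not OpenCL C, and its function names are |
| hypothetical. Each array holds one <int> per work item, indexed by sub_group_local_id. |
| The model assumes that the subgroup is full (its size equals max_sub_group_size) and |
| that every work item encounters the call. Because out-of-range indices have no defined |
| behavior in the extension, the model simply asserts on them, and for the XOR variant |
| it treats any index at or beyond max_sub_group_size as out-of-range. |

```c
#include <assert.h>

/* Result of the shuffle-down rule for the work item with local ID id. */
int model_shuffle_down(const int *current, const int *next,
                       unsigned max_sub_group_size, unsigned id,
                       unsigned delta)
{
    unsigned index = id + delta; /* unsigned shuffle index */

    assert(index < 2u * max_sub_group_size); /* otherwise out-of-range */
    if (index < max_sub_group_size)
        return current[index];
    return next[index - max_sub_group_size];
}

/* Result of the shuffle-up rule for the work item with local ID id. */
int model_shuffle_up(const int *previous, const int *current,
                     unsigned max_sub_group_size, unsigned id,
                     unsigned delta)
{
    long long size = (long long)max_sub_group_size;
    long long index = (long long)id - (long long)delta; /* signed index */

    assert(index >= -size && index < size); /* otherwise out-of-range */
    if (index >= 0)
        return current[index];
    return previous[index + size];
}

/* Result of the shuffle-xor rule for the work item with local ID id. */
int model_shuffle_xor(const int *data, unsigned max_sub_group_size,
                      unsigned id, unsigned value)
{
    unsigned index = id ^ value;

    assert(index < max_sub_group_size); /* otherwise out-of-range */
    return data[index];
}
```

| With max_sub_group_size equal to 4, a work item with sub_group_local_id 3 and a delta |
| of 2 has a shuffle-down index of 5, so it receives the next value of the work item with |
| sub_group_local_id 1. Likewise, a work item with sub_group_local_id 0 and a delta of 1 |
| has a shuffle-up index of -1, so it receives the previous value of the work item with |
| sub_group_local_id 3. |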
| |
| Add a new Section 6.13.X - "Sub Group Read and Write Functions" to the OpenCL 2.0 C |
| Specification |
| |
| These additions are unique to the Intel subgroups extension and are not part of the |
| Khronos subgroups extension: |
| |
| "The OpenCL C programming language implements the following built-in functions to allow |
| data to be read or written as a block by all work items in a subgroup. These built-in |
| functions must be encountered by all work items in a subgroup executing the kernel. |
| Furthermore, since these are block operations, the pointer, image, and coordinate |
| arguments to these built-in functions must be the same for all work items in the |
| subgroup (when applicable, only the data argument may be different). |
| |
| -------------------------------------------------------------------------------------- |
| Function                                  Description |
| ----------------------------------------  -------------------------------------------- |
| uint intel_sub_group_block_read(          Reads 1, 2, 4, or 8 uints of data for each |
|   const __global uint* p )                work item in the subgroup from the specified |
| uint2 intel_sub_group_block_read2(        pointer as a block operation. |
|   const __global uint* p )                The data is read strided, so the first |
| uint4 intel_sub_group_block_read4(        value read is: |
|   const __global uint* p )                  p[ sub_group_local_id ] |
| uint8 intel_sub_group_block_read8(        and the second value read is: |
|   const __global uint* p )                  p[ sub_group_local_id + max_sub_group_size ] |
|                                           etc. |
| |
|                                           There is no defined out-of-range behavior |
|                                           for these functions. |
| |
| uint intel_sub_group_block_read(          Reads 1, 2, 4, or 8 uints of data for each |
|   image2d_t image,                        work item in the subgroup from the specified |
|   int2 byte_coord )                       image at the specified coordinate as a block |
| uint2 intel_sub_group_block_read2(        operation. Note that the coordinate is a |
|   image2d_t image,                        byte coordinate, not an image element |
|   int2 byte_coord )                       coordinate. Also note that the image data |
| uint4 intel_sub_group_block_read4(        is read without format conversion, so each |
|   image2d_t image,                        work item may read multiple image elements |
|   int2 byte_coord )                       (for images with element size smaller than |
| uint8 intel_sub_group_block_read8(        32 bits). |
|   image2d_t image, |
|   int2 byte_coord )                       The data is read row-by-row, so the first |
|                                           value read is from the row specified in the |
|                                           y-component of the provided byte_coord, the |
|                                           second value is read from the row at the |
|                                           y-component of the provided byte_coord plus |
|                                           one, etc. |
| |
|                                           Please see the note below describing |
|                                           out-of-bounds behavior for the subgroup |
|                                           image block read functions. |
| |
| void intel_sub_group_block_write(         Writes 1, 2, 4, or 8 uints of data for each |
|   __global uint* p, uint data )           work item in the subgroup to the specified |
| void intel_sub_group_block_write2(        pointer as a block operation. |
|   __global uint* p, uint2 data )          The data is written strided, so the first |
| void intel_sub_group_block_write4(        value is written to: |
|   __global uint* p, uint4 data )            p[ sub_group_local_id ] |
| void intel_sub_group_block_write8(        and the second value is written to: |
|   __global uint* p, uint8 data )            p[ sub_group_local_id + max_sub_group_size ] |
|                                           etc. |
| |
|                                           There is no defined out-of-range behavior |
|                                           for these functions. |
| |
| void intel_sub_group_block_write(         Writes 1, 2, 4, or 8 uints of data for each |
|   image2d_t image,                        work item in the subgroup to the specified |
|   int2 byte_coord, uint data )            image at the specified coordinate as a block |
| void intel_sub_group_block_write2(        operation. Note that the coordinate is a |
|   image2d_t image,                        byte coordinate, not an image element |
|   int2 byte_coord, uint2 data )           coordinate. Unlike the image block read |
| void intel_sub_group_block_write4(        functions, which may read from any arbitrary |
|   image2d_t image,                        byte offset, the x-component of the byte |
|   int2 byte_coord, uint4 data )           coordinate for the image block write |
| void intel_sub_group_block_write8(        functions must be a multiple of four; in |
|   image2d_t image,                        other words, the write must begin at a |
|   int2 byte_coord, uint8 data )           32-bit boundary. There is no restriction on |
|                                           the y-component of the coordinate. Also, note |
|                                           that the image data is written without format |
|                                           conversion, so each work item may write |
|                                           multiple image elements (for images with |
|                                           element size smaller than 32 bits). |
| |
|                                           The data is written row-by-row, so the first |
|                                           value is written to the row specified by |
|                                           the y-component of the provided byte_coord, |
|                                           the second value is written to the row at the |
|                                           y-component of the provided byte_coord plus |
|                                           one, etc. |
| |
|                                           Please see the note below describing |
|                                           out-of-bounds behavior for the subgroup |
|                                           image block write functions. |
| ------------------------------------------------------------------------------------- |
| |
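Non-normative example: the strided addressing of the buffer block read and write functions can be modeled on the host in plain C. This is a sketch under assumptions, not the built-ins themselves. The names block_element_index, model_block_read, and model_block_write are hypothetical, and extending the stride beyond the second value is this sketch's reading of "etc." in the table above. The image variants are not modeled.

```c
#include <stddef.h>

/*
 * Index into p of the n-th value (n = 0, 1, 2, ...) that one work item
 * reads or writes. The table gives the n = 0 and n = 1 cases; applying the
 * same stride to later values is an assumption of this sketch.
 */
static size_t block_element_index(size_t sub_group_local_id, size_t n,
                                  size_t max_sub_group_size)
{
    return sub_group_local_id + n * max_sub_group_size;
}

/* Values one work item would receive from a block read of count uints. */
static void model_block_read(const unsigned *p, size_t sub_group_local_id,
                             size_t max_sub_group_size, size_t count,
                             unsigned *out)
{
    for (size_t n = 0; n < count; ++n)
        out[n] = p[block_element_index(sub_group_local_id, n,
                                       max_sub_group_size)];
}

/* Locations one work item would fill during a block write of count uints. */
static void model_block_write(unsigned *p, size_t sub_group_local_id,
                              size_t max_sub_group_size, size_t count,
                              const unsigned *data)
{
    for (size_t n = 0; n < count; ++n)
        p[block_element_index(sub_group_local_id, n, max_sub_group_size)] =
            data[n];
}
```

With an assumed subgroup size of 8, the work item with sub_group_local_id 3 receives p[3], p[11], p[19], and p[27] from a four-wide read, so the subgroup as a whole covers 32 consecutive uints starting at p.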
| Note: The subgroup image block read and write built-ins do support bounds checking; |
| however, these built-ins bounds-check to the image width in units of uints, not in |
| units of image elements. This means: |
| |
| * If the image has an element size equal to the size of a uint (four bytes, for |
| example CL_RGBA + CL_UNORM_INT8), the image will be correctly bounds-checked. |
| In this case, out-of-bounds reads will return the edge image element (the |
| equivalent of CLK_ADDRESS_CLAMP_TO_EDGE), and out-of-bounds writes will be |
| ignored. |
| |
| * If the image has an element size less than the size of a uint (such as CL_R + |
| CL_UNSIGNED_INT8), the entire image is addressable; however, bounds checking |
| will occur too late. For this reason, extra care should be taken to avoid |
| out-of-bounds reads and writes, since out-of-bounds reads may return invalid data |
| and out-of-bounds writes may corrupt other images or buffers unpredictably. |
| |
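Non-normative example: for images whose element size is smaller than a uint, an application may prefer to guard block accesses itself instead of relying on the built-in bounds check. The helpers below are a sketch in plain C with hypothetical names (row_size_in_bytes, bounds_check_is_exact, span_fits_in_row). The number of bytes a subgroup touches in one row is not modeled here and must be supplied by the caller as span_bytes.

```c
#include <stddef.h>

#define UINT_SIZE_IN_BYTES 4 /* size of the uint used by the block built-ins */

/* Number of addressable bytes in one image row. */
static size_t row_size_in_bytes(size_t width_in_elements, size_t element_size)
{
    return width_in_elements * element_size;
}

/*
 * Returns 1 when bounds-checking in units of uints is the same as
 * bounds-checking in units of image elements, i.e. for four-byte elements.
 */
static int bounds_check_is_exact(size_t element_size)
{
    return element_size == UINT_SIZE_IN_BYTES;
}

/*
 * Manual guard for smaller element sizes: returns 1 when the byte range
 * [byte_x, byte_x + span_bytes) lies entirely inside one image row.
 * span_bytes is chosen by the caller (an assumption of this sketch).
 */
static int span_fits_in_row(size_t byte_x, size_t span_bytes,
                            size_t width_in_elements, size_t element_size)
{
    size_t row = row_size_in_bytes(width_in_elements, element_size);

    return byte_x <= row && span_bytes <= row - byte_x;
}
```

For a 640-element CL_R + CL_UNSIGNED_INT8 row, a 32-byte span starting at byte 608 fits exactly, while the same span starting at byte 612 does not and should be avoided by the kernel.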
| 6.13.X.1 - Restrictions |
| |
| The following restrictions apply to the subgroup buffer block read and write |
| functions: |
| |
| * The pointer 'p' must be 32-bit (4-byte) aligned for reads, and must be |
| 128-bit (16-byte) aligned for writes. |
| |
| * If the pointer 'p' is computed from a kernel argument that is a cl_mem |
| that was created with CL_MEM_USE_HOST_PTR, then the <host_ptr> must be |
| 32-bit (4-byte) aligned for reads, and must be 128-bit (16-byte) aligned |
| for writes. |
| |
| * If the pointer 'p' is computed from a kernel argument that is a cl_mem |
| that is a sub-buffer, then the <origin> defining the sub-buffer offset into |
| the <buffer> must be a multiple of 4 bytes for reads, and must be a multiple |
| of 16 bytes for writes, in addition to the CL_DEVICE_MEM_BASE_ADDR_ALIGN |
| requirements. Additionally, if the <buffer> that the sub-buffer is created |
| from was created with CL_MEM_USE_HOST_PTR, then the <host_ptr> for the |
| <buffer> must be 32-bit (4-byte) aligned for reads, and must be 128-bit |
| (16-byte) aligned for writes. |
| |
| * If the pointer 'p' is computed from an SVM pointer kernel argument, then the |
| SVM pointer kernel argument must be 32-bit (4-byte) aligned for reads, and |
| must be 128-bit (16-byte) aligned for writes. |
| |
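Non-normative example: a host application can check the buffer alignment restrictions above before creating a buffer or sub-buffer. This is a sketch in plain C with hypothetical helper names. It checks only the 4-byte and 16-byte rules from this section, not the CL_DEVICE_MEM_BASE_ADDR_ALIGN requirement, which must be queried from the device separately.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_READ_ALIGNMENT   4u /* bytes: 32-bit alignment for reads   */
#define BLOCK_WRITE_ALIGNMENT 16u /* bytes: 128-bit alignment for writes */

/* Returns 1 when a <host_ptr> or SVM pointer is usable for block reads. */
static int host_ptr_ok_for_block_read(const void *host_ptr)
{
    return (uintptr_t)host_ptr % BLOCK_READ_ALIGNMENT == 0;
}

/* Returns 1 when a <host_ptr> or SVM pointer is usable for block writes. */
static int host_ptr_ok_for_block_write(const void *host_ptr)
{
    return (uintptr_t)host_ptr % BLOCK_WRITE_ALIGNMENT == 0;
}

/* Returns 1 when a sub-buffer <origin> meets the rule for its intended use. */
static int sub_buffer_origin_ok(size_t origin, int used_for_writes)
{
    size_t alignment =
        used_for_writes ? BLOCK_WRITE_ALIGNMENT : BLOCK_READ_ALIGNMENT;

    return origin % alignment == 0;
}
```

A pointer that passes the write check also passes the read check, so a buffer used for both only needs the 16-byte test.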
| The following restrictions apply to the subgroup image block read and write |
| functions: |
| |
| * The behavior of the subgroup image block read and write built-ins is |
| undefined for images with an element size greater than four bytes |
| (such as CL_RGBA + CL_FLOAT). |
| |
| * When reading or writing a 2D image created from a buffer with the subgroup |
| block read and write built-ins, the image row pitch is required to be a |
| multiple of 64 bytes, in addition to the CL_DEVICE_IMAGE_PITCH_ALIGNMENT |
| requirements. |
| |
| * When reading or writing a 2D image created from a buffer with the subgroup |
| block read and write built-ins, if the buffer is a cl_mem that was created |
| with CL_MEM_USE_HOST_PTR, then the <host_ptr> must be 256-bit (32-byte) |
| aligned. |
| |
| * When reading or writing a 2D image created from a buffer with the subgroup |
| block read and write built-ins, if the buffer is a cl_mem that is a |
| sub-buffer, then the <origin> must be a multiple of 32 bytes. Additionally, |
| if the <buffer> that the sub-buffer is created from was created with |
| CL_MEM_USE_HOST_PTR, then the <host_ptr> for the <buffer> must be 256-bit |
| (32-byte) aligned." |
| |
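Non-normative example: the image restrictions above can be collected into one host-side check for a 2D image created from a buffer. This is a sketch in plain C. The struct image_from_buffer and the function image_block_access_ok are hypothetical, and the CL_DEVICE_IMAGE_PITCH_ALIGNMENT requirement is not checked because it must be queried from the device.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    MAX_ELEMENT_SIZE   = 4,  /* bytes: larger elements are undefined        */
    ROW_PITCH_MULTIPLE = 64, /* bytes: required multiple for the row pitch  */
    HOST_PTR_ALIGNMENT = 32, /* bytes: 256-bit alignment for <host_ptr>     */
    ORIGIN_MULTIPLE    = 32  /* bytes: required multiple for <origin>       */
};

/* Hypothetical description of a 2D image created from a buffer. */
struct image_from_buffer {
    size_t    element_size;      /* bytes per image element                 */
    size_t    row_pitch;         /* bytes per image row                     */
    uintptr_t host_ptr;          /* 0 if CL_MEM_USE_HOST_PTR was not used   */
    size_t    sub_buffer_origin; /* 0 if the buffer is not a sub-buffer     */
};

/* Returns 1 when every restriction checked by this sketch is satisfied. */
static int image_block_access_ok(const struct image_from_buffer *img)
{
    if (img->element_size > MAX_ELEMENT_SIZE)
        return 0;
    if (img->row_pitch % ROW_PITCH_MULTIPLE != 0)
        return 0;
    if (img->host_ptr % HOST_PTR_ALIGNMENT != 0)
        return 0;
    if (img->sub_buffer_origin % ORIGIN_MULTIPLE != 0)
        return 0;
    return 1;
}
```

Using 0 for an absent <host_ptr> or <origin> keeps the check uniform, since 0 satisfies every alignment rule.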
| Revision History |
| |
| Version 1 (2014/12/02): First public revision. |
| Version 2 (2015/03/12): Fixed minor formatting errors, added restriction for |
| subgroup image block read and write built-ins with large |
| image formats. |
| Version 3 (2016/02/12): Fixed a small bug in the shuffle up and shuffle down |
| descriptions. |
| Version 4 (2016/08/28): Added additional restrictions and programming notes for the |
| subgroup shuffle and block read built-ins. |