 Name String

     cl_intel_subgroups

 Contributors

     Ben Ashbaugh, Intel
     Allen Hux, Intel
     Pranayini Gudali, Intel
     Dawid Dominiak, Intel
     Biju George, Intel

 Contact

     Ben Ashbaugh, Intel (ben.ashbaugh 'at' intel.com)

 Version

     Version 4, August 28, 2016

 Number

     OpenCL Extension #35

 Status

     Final Draft

 Dependencies

     OpenCL 1.2 is required.  Some features (get_enqueued_num_sub_groups() and
     the sub_group_barrier() variant that accepts a memory scope) require OpenCL
     2.0.

     This extension is written against revision 24 of the OpenCL 2.0 API
     specification, against revision 24 of the OpenCL 2.0 OpenCL C specification,
     and against revision 24 of the OpenCL 2.0 extension specification.

 Overview

     The goal of this extension is to allow programmers to improve the performance
     of their applications by taking advantage of the fact that some work items in a
     work group execute together as a group (a "subgroup"), and that work items in a
     subgroup can take advantage of hardware features that are not available to work
     items in a work group.  Specifically, this extension is designed to allow work
     items in a subgroup to share data without the use of local memory and work group
     barriers, and to utilize specialized hardware to load and store blocks of data.

     There is a large amount of overlap between the functionality in this extension
     and the functionality in the Khronos OpenCL 2.0 "cl_khr_subgroups" extension, so
     this extension reuses many of the names, concepts, and functions already described
     in the cl_khr_subgroups extension.  The key differences between the Intel
     subgroups extension and the Khronos subgroups extension are:

         * The Khronos subgroups extension requires OpenCL 2.0, but the Intel subgroups
           extension may be available on OpenCL 1.2 devices.

         * The Khronos subgroups extension guarantees that subgroups in a work group
           will make independent forward progress, but the Intel extension does not
           guarantee that subgroups in a work group will make independent forward
           progress.

         * The Intel extension adds a rich set of subgroup "shuffle" functions to
           allow work items within a subgroup to interchange data without the use
           of local memory and work group barriers.

         * The Intel extension adds a set of subgroup "block read and write" functions
           to take advantage of specialized hardware to read or write blocks of data
           from or to buffers or images.

         * The Intel subgroups extension does not include the subgroup pipes functions
           that are included as part of the Khronos subgroups extension.

         * The Intel subgroups extension does not include the device-side kernel query
           functions for subgroups that are included as part of the Khronos subgroups
           extension.

 New API Functions

     This function is copied unchanged from the Khronos subgroups extension:

     cl_int clGetKernelSubGroupInfoKHR(
         cl_kernel kernel,
         cl_device_id device,
         cl_kernel_sub_group_info param_name,
         size_t input_value_size,
         const void* input_value,
         size_t param_value_size,
         void* param_value,
         size_t* param_value_size_ret)

 New API Enums

     These enums are copied unchanged from the Khronos subgroups extension:

     Accepted as the <param_name> parameter of clGetKernelSubGroupInfoKHR.

         CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE_KHR    0x2033
         CL_KERNEL_SUB_GROUP_COUNT_FOR_NDRANGE_KHR       0x2034

 New OpenCL C Functions

     These built-in functions are copied unchanged from the Khronos subgroups
     extension:

         uint    get_sub_group_size( void );
         uint    get_max_sub_group_size( void );
         uint    get_num_sub_groups( void );

         uint    get_sub_group_id( void );
         uint    get_sub_group_local_id( void );

         void    sub_group_barrier( cl_mem_fence_flags flags );

         int     sub_group_all( int predicate );
         int     sub_group_any( int predicate );

         If OpenCL 2.0 is supported:

         uint    get_enqueued_num_sub_groups( void );
         void    sub_group_barrier( cl_mem_fence_flags flags, memory_scope scope );

         For the sub_group_broadcast functions, <gentype> is <int>, <uint>,
         <long>, <ulong>, or <float>.

         If cl_khr_fp16 is supported, <gentype> also includes <half>.
         If cl_khr_fp64 or doubles are supported, <gentype> also includes <double>.

         <gentype> sub_group_broadcast( <gentype> x, uint sub_group_local_id );

         For the sub_group_reduce, sub_group_scan_exclusive, and
         sub_group_scan_inclusive functions, <gentype> is <int>, <uint>, <long>,
         <ulong>, or <float>.  <op> is <add>, <min>, or <max>.

         If cl_khr_fp16 is supported, <gentype> also includes <half>.
         If cl_khr_fp64 or doubles are supported, <gentype> also includes <double>.

         <gentype> sub_group_reduce_<op>( <gentype> x );
         <gentype> sub_group_scan_exclusive_<op>( <gentype> x );
         <gentype> sub_group_scan_inclusive_<op>( <gentype> x );

     These built-in functions are unique to the Intel subgroups extension and are not
     part of the Khronos subgroups extension:

         For the intel_sub_group_shuffle, intel_sub_group_shuffle_down,
         intel_sub_group_shuffle_up, and intel_sub_group_shuffle_xor functions,
         <gentype> is <float>, <float2>, <float4>, <float8>, <float16>, <int>, <int2>,
         <int4>, <int8>, <int16>, <uint>, <uint2>, <uint4>, <uint8>, <uint16>, <long>,
         or <ulong>.

         If cl_khr_fp16 is supported, <gentype> also includes <half>.
         If cl_khr_fp64 or doubles are supported, <gentype> also includes <double>.

         <gentype> intel_sub_group_shuffle( <gentype> data, uint c );
         <gentype> intel_sub_group_shuffle_down(
                       <gentype> current, <gentype> next, uint delta );
         <gentype> intel_sub_group_shuffle_up(
                       <gentype> previous, <gentype> current, uint delta );
         <gentype> intel_sub_group_shuffle_xor( <gentype> data, uint value );


         uint    intel_sub_group_block_read( const __global uint* p );
         uint2   intel_sub_group_block_read2( const __global uint* p );
         uint4   intel_sub_group_block_read4( const __global uint* p );
         uint8   intel_sub_group_block_read8( const __global uint* p );

         uint    intel_sub_group_block_read( image2d_t image, int2 byte_coord );
         uint2   intel_sub_group_block_read2( image2d_t image, int2 byte_coord );
         uint4   intel_sub_group_block_read4( image2d_t image, int2 byte_coord );
         uint8   intel_sub_group_block_read8( image2d_t image, int2 byte_coord );

         void    intel_sub_group_block_write( __global uint* p, uint data );
         void    intel_sub_group_block_write2( __global uint* p, uint2 data );
         void    intel_sub_group_block_write4( __global uint* p, uint4 data );
         void    intel_sub_group_block_write8( __global uint* p, uint8 data );

         void    intel_sub_group_block_write( image2d_t image, int2 byte_coord, uint data );
         void    intel_sub_group_block_write2( image2d_t image, int2 byte_coord, uint2 data );
         void    intel_sub_group_block_write4( image2d_t image, int2 byte_coord, uint4 data );
         void    intel_sub_group_block_write8( image2d_t image, int2 byte_coord, uint8 data );
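
     As an informal illustration of the shuffle prototypes above, the following
     host-side C sketch models a subgroup as an array indexed by
     sub_group_local_id.  It assumes that intel_sub_group_shuffle returns the
     value of data held by the work item whose sub_group_local_id is c, and
     that intel_sub_group_shuffle_xor returns the value held by the work item
     whose sub_group_local_id is the caller's ID XOR value.  The model_* names
     are hypothetical and are not part of this extension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of intel_sub_group_shuffle( data, c ) for a subgroup of
 * n work items: work item i receives the value held by work item c[i].
 * out must not alias data. */
void model_shuffle(const int *data, const unsigned *c, int *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        assert(c[i] < n);   /* the source must be a work item in the subgroup */
        out[i] = data[c[i]];
    }
}

/* Hypothetical model of intel_sub_group_shuffle_xor( data, value ): work item
 * i receives the value held by work item (i XOR value). */
void model_shuffle_xor(const int *data, unsigned value, int *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        size_t src = i ^ (size_t)value;
        assert(src < n);    /* the source must be a work item in the subgroup */
        out[i] = data[src];
    }
}
```

     With a value of 1, the XOR model swaps each even/odd pair of neighboring
     work items, which is the usual first step of a butterfly-style exchange.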

 New OpenCL C Enums

     This enum is copied unchanged from the Khronos subgroups extension:

     Add the following new value to the enumerated type <memory_scope>:

         memory_scope_sub_group

 Modifications to Section 2 - "Glossary" of the OpenCL 2.0 API Specification

     Add memory_scope_sub_group to the description of Memory Scopes:

    "Memory Scopes: Memory scopes define a hierarchy of visibilities when analyzing the
     ordering constraints of memory operations.  They are defined by the values of the
     memory_scope enumeration constant.  Current values are memory_scope_work_item (memory
     constraints only apply to a single work item and in practice only apply to image
     operations), memory_scope_sub_group (memory-ordering constraints only apply to work
     items executing in a subgroup), memory_scope_work_group ..."

     Add memory_scope_sub_group to the description of Scope inclusion:

    "Scope inclusion: Two actions A and B are defined to have an inclusive scope if they
     have the same scope P such that: (1) if P is memory_scope_sub_group, and A and B are
     executed by work items within the same subgroup, or (2) if P is memory_scope_work_group,
     and A and B are executed by work items within the same workgroup ..."

     Change the description for Subgroups to:

    "Subgroup: Subgroups are an implementation-dependent grouping of work items within a
     work group.  The size and number of subgroups is implementation-defined and not
     exposed in the core OpenCL 2.0 feature set.  Subgroups execute concurrently within
     a work group, but are not guaranteed to make independent forward progress.
     Subgroups may synchronize internally using subgroup barrier operations without
     synchronizing with other subgroups."

 Modifications to Section 3.2.1 - "Execution Model: Mapping Work Items Onto an NDRange" of
 the OpenCL 2.0 API Specification

     Change the paragraph describing subgroups to:

    "An implementation of OpenCL may divide each work group into one or more subgroups.
     The size and number of subgroups is implementation-defined and not exposed in the
     core OpenCL 2.0 feature set."

 Modifications to Section 3.2.2 - "Execution Model: Execution of Kernel Instances" of the
 OpenCL 2.0 API Specification

     Remove the last paragraph describing subgroups and independent forward progress.

 Additions to Section 3.2 - "Execution Model" of the OpenCL 2.0 API Specification

     This text is largely the same as the text in the Khronos subgroups extension.
     Only the sentence about independent forward progress has been modified.

    "Within a work group, work items may be divided into subgroups in an implementation-
     defined fashion.  The mapping of work items to subgroups is implementation-defined
     and may be queried at runtime.  While subgroups may be used in multi-dimensional
     work groups, each subgroup is 1-dimensional and any given work item may query which
     subgroup it is a member of.

     Work items are mapped into subgroups through a combination of compile-time decisions
     and the parameters of the dispatch.  The mapping to subgroups is invariant for the
     duration of a kernel's execution, across dispatches of a given kernel with the same
     launch parameters, and from one work group to another within the dispatch (excluding
     the trailing edge work groups in the presence of non-uniform work group sizes).  In
     addition, all subgroups within a work group will be the same size, apart from the
     subgroup with the maximum index, which may be smaller if the size of the work group
     is not evenly divisible by the size of the subgroups.

     Subgroups execute concurrently within a given work group.  Similar to work items
     within a work group, subgroups executing within a work group are not guaranteed to make
     independent forward progress.  Work items in a subgroup can internally synchronize
     using subgroup barrier operations without synchronizing with other subgroups."
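
     The size rule above (all subgroups in a work group are the same size,
     except possibly the subgroup with the maximum index) can be sketched in
     plain C as follows.  This is only a model: the actual subgroup size is
     chosen by the implementation, and the helper names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of subgroups in a work group of
 * work_group_size work items when full subgroups hold sub_group_size items. */
size_t model_num_sub_groups(size_t work_group_size, size_t sub_group_size)
{
    assert(sub_group_size > 0);
    /* Round up so that a trailing partial subgroup is counted. */
    return (work_group_size + sub_group_size - 1) / sub_group_size;
}

/* Hypothetical helper: size of the subgroup with index sub_group_id. */
size_t model_sub_group_size(size_t work_group_size, size_t sub_group_size,
                            size_t sub_group_id)
{
    size_t count = model_num_sub_groups(work_group_size, sub_group_size);
    assert(sub_group_id < count);
    if (sub_group_id + 1 < count)
        return sub_group_size;  /* every subgroup but the last is full */
    /* The subgroup with the maximum index takes whatever remains. */
    return work_group_size - (count - 1) * sub_group_size;
}
```

     For example, a work group of 40 work items with an assumed subgroup size
     of 16 is modeled as three subgroups of sizes 16, 16, and 8.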

 Additions to Section 3.3.4 - "Memory Model: Memory Consistency Model"

     Add memory_scope_sub_group to the bulleted descriptions of memory scopes:

    " * memory_scope_sub_group: memory-ordering constraints only apply to work items
        executing within a single subgroup.
      * memory_scope_work_group: ..."

     In the paragraph after the bulleted descriptions of memory scopes, include
     memory_scope_sub_group as a valid memory scope for local memory:

    "... For local memory, memory_scope_sub_group and memory_scope_work_group are valid,
     and may constrain visibility to the subgroup or workgroup."

 Additions to Section 3.3.5 - "Memory Model: Overview of atomic and fence operations"

     Add memory_scope_sub_group to the definition of inclusive scope:

    " * P is memory_scope_sub_group and A and B are executed by work items within the same
        subgroup.
      * P is memory_scope_work_group ..."

 Additions to Section 5.9.3 - "Kernel Object Queries" of the OpenCL 2.0 API Specification

     This addition is copied unchanged from the Khronos subgroups extension:

    "The function

         cl_int clGetKernelSubGroupInfoKHR(
             cl_kernel kernel,
             cl_device_id device,
             cl_kernel_sub_group_info param_name,
             size_t input_value_size,
             const void* input_value,
             size_t param_value_size,
             void* param_value,
             size_t* param_value_size_ret)

     returns information about the kernel object.

     <kernel> specifies the kernel object being queried.

     <device> identifies a specific device in the list of devices associated with <kernel>.
     The list of devices is the list of devices in the OpenCL context that is associated
     with <kernel>.  If the list of devices associated with <kernel> is a single device,
     <device> can be a NULL value.

     <param_name> specifies the information to query.  The list of supported <param_name>
     types and the information returned in <param_value> by clGetKernelSubGroupInfoKHR is
     described in the table below.

     <input_value_size> is used to specify the size in bytes of memory pointed to by
     <input_value>.  This size must be equal to the size of the input type as described
     in the table below.

     <input_value> is a pointer to memory where the appropriate parameterization of the
     query is passed from.  If <input_value> is NULL it is ignored.

     <param_value_size> is used to specify the size in bytes of memory pointed to by
     <param_value>.  This size must be greater than or equal to the size of the return type
     as described in the table below.

     <param_value_size_ret> returns the actual size in bytes of data copied to <param_value>.
     If <param_value_size_ret> is NULL it is ignored.

     --------------------------------------------------------------------------------------
     cl_kernel_sub_group_info  Input Type  Return Type  Description
     ------------------------  ----------  -----------  -----------------------------------
     CL_KERNEL_MAX_SUB_GROUP_  size_t*     size_t       Returns the maximum subgroup size
     SIZE_FOR_NDRANGE                                   for this kernel.  All subgroups must
                                                        be the same size, while the last
                                                        subgroup in any work group (i.e. the
                                                        subgroup with the maximum index)
                                                        could be the same or smaller size.

                                                        The <input_value> must be an array
                                                        of size_t values corresponding to
                                                        the local work size parameter of the
                                                        intended dispatch.  The number of
                                                        dimensions in the NDRange will be
                                                        inferred from the value specified
                                                        for <input_value_size>.

     CL_KERNEL_SUB_GROUP_      size_t*     size_t       Returns the number of subgroups that
     COUNT_FOR_NDRANGE                                  will be present in each work group
                                                        for a given local work size.  All
                                                        work groups, apart from the last
                                                        work group in each dimension in the
                                                        presence of non-uniform work group
                                                        sizes, will have the same number of
                                                        subgroups.

                                                        The <input_value> must be an array
                                                        of size_t values corresponding to
                                                        the local work size parameter of the
                                                        intended dispatch.  The number of
                                                        dimensions in the NDRange will be
                                                        inferred from the value specified
                                                        for <input_value_size>.
     --------------------------------------------------------------------------------------

     clGetKernelSubGroupInfoKHR returns CL_SUCCESS if the function executed successfully.
     Otherwise, it returns one of the following errors:

     * CL_INVALID_DEVICE if <device> is not in the list of devices associated with <kernel>,
       or if <device> is NULL but there is more than one device associated with <kernel>.

     * CL_INVALID_VALUE if <param_name> is not valid, or if the size in bytes specified by
       <param_value_size> is less than the size of the return type as described in the
       table above and <param_value> is not NULL.

     * CL_INVALID_VALUE if <param_name> is CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE and the
       size in bytes specified by <input_value_size> is not valid or if <input_value> is
       NULL.

     * CL_INVALID_KERNEL if <kernel> is not a valid kernel object.

     * CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the
       OpenCL implementation on the device.

     * CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by
       the OpenCL implementation on the host."
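
     The handling of <input_value> and <input_value_size> described above can
     be illustrated with a small host-side model written in plain C, so that it
     builds without the OpenCL headers.  Everything prefixed with model_ or
     MODEL_ is hypothetical.  In particular, assumed_sub_group_size stands in
     for the implementation-defined subgroup size that a real driver selects
     internally, and the sketch assumes that subgroups are packed over the
     total number of work items in the work group.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins that mirror the values of CL_SUCCESS and CL_INVALID_VALUE. */
enum { MODEL_SUCCESS = 0, MODEL_INVALID_VALUE = -30 };

/* Hypothetical model of a CL_KERNEL_SUB_GROUP_COUNT_FOR_NDRANGE query.
 * input_value is the local work size array of the intended dispatch. */
int model_sub_group_count_for_ndrange(size_t assumed_sub_group_size,
                                      size_t input_value_size,
                                      const size_t *input_value,
                                      size_t *count_ret)
{
    size_t dims, items = 1, i;

    assert(count_ret != NULL);          /* misuse of the model itself */
    assert(assumed_sub_group_size > 0);

    /* A NULL input or a size that is not a whole number of size_t
     * values is rejected. */
    if (input_value == NULL)
        return MODEL_INVALID_VALUE;
    if (input_value_size == 0 || input_value_size % sizeof(size_t) != 0)
        return MODEL_INVALID_VALUE;

    /* The number of NDRange dimensions is inferred from the byte size. */
    dims = input_value_size / sizeof(size_t);
    for (i = 0; i < dims; ++i)
        items *= input_value[i];

    /* Round up so that a trailing partial subgroup is counted. */
    *count_ret = (items + assumed_sub_group_size - 1) / assumed_sub_group_size;
    return MODEL_SUCCESS;
}
```

     Passing a two-element array {8, 5} with an assumed subgroup size of 16
     models a 2-dimensional local work size of 40 work items and yields a
     count of 3.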

 Additions to Section 6.13.1 - "Work Item Functions" of the OpenCL 2.0 C Specification

     These additions are copied unchanged from the Khronos subgroups extension:

    "--------------------------------------------------------------------------------------
     Function                               Description
     -------------------------------------  -----------------------------------------------
     uint get_sub_group_size( void )        Returns the number of work items in the
                                            subgroup.  This value is no more than the
                                            maximum subgroup size and is implementation-
                                            defined based on a combination of the compiled
                                            kernel and the dispatch dimensions.
                                            This will be a constant value for the lifetime
                                            of the subgroup.

     uint get_max_sub_group_size( void )    Returns the maximum size of a subgroup within
                                            the dispatch.  This value will be invariant for
                                            a given set of dispatch dimensions and a kernel
                                            object compiled for a given device.

     uint get_num_sub_groups( void )        Returns the number of subgroups that the current
                                            work group is divided into.

                                            This number will be constant for the duration of
                                            a work group's execution.  If the kernel is
                                            executed with a non-uniform work group size in
                                            any dimension, calls to this built-in may return
                                            different values for some work groups than for
                                            other work groups.

     uint get_sub_group_id( void )          Returns the subgroup ID, which is a number from
                                            zero to get_num_sub_groups - 1.

                                            For clEnqueueTask, this returns zero.

     uint get_sub_group_local_id( void )    Returns the unique work item ID within the
                                            current subgroup.  The mapping from get_local_id
                                            to get_sub_group_local_id will be invariant for
                                            the lifetime of the work group.

     --------------------------------------------------------------------------------------"
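
     The mapping from get_local_id to the subgroup IDs is implementation-
     defined.  Purely as an illustration, the plain C sketch below shows one
     possible mapping: the local ID is linearized with dimension 0 varying
     fastest, and the linear range is cut into consecutive runs of
     max_sub_group_size work items.  An implementation is free to use a
     different mapping, and all names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t sub_group_id;        /* models get_sub_group_id() */
    size_t sub_group_local_id;  /* models get_sub_group_local_id() */
} model_sub_group_ids;

/* Hypothetical mapping from a 3-D local ID to subgroup IDs. */
model_sub_group_ids model_map_local_id(const size_t local_id[3],
                                       const size_t local_size[3],
                                       size_t max_sub_group_size)
{
    model_sub_group_ids ids;
    size_t linear;

    assert(max_sub_group_size > 0);
    /* Linearize the local ID; dimension 0 varies fastest. */
    linear = (local_id[2] * local_size[1] + local_id[1]) * local_size[0]
             + local_id[0];
    ids.sub_group_id = linear / max_sub_group_size;
    ids.sub_group_local_id = linear % max_sub_group_size;
    return ids;
}
```

     Under this assumed mapping, local ID (3, 2, 0) in an 8 x 4 x 1 work group
     with a maximum subgroup size of 16 has linear index 19, so it lands in
     subgroup 1 with a sub_group_local_id of 3.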

     If OpenCL 2.0 is supported:

    "--------------------------------------------------------------------------------------
     Function                                  Description
     ----------------------------------------  --------------------------------------------
     uint get_enqueued_num_sub_groups( void )  Returns the same value as that returned by
                                               get_num_sub_groups if the kernel is executed
                                               with a uniform work group size.  This value
                                               will be constant for the entire NDRange.

                                               If the kernel is executed with a non-uniform
                                               work group size, returns the number of
                                               subgroups in a work group that makes up the
                                               uniform region of the global NDRange.
     --------------------------------------------------------------------------------------"
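
     The difference between get_num_sub_groups and get_enqueued_num_sub_groups
     for a non-uniform work group size can be sketched with a 1-dimensional
     model in plain C.  The sketch assumes a fixed subgroup size and that only
     the last work group in the dimension is smaller than the enqueued local
     size.  The model_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 1-D model: size of the work group with index group_id.
 * Every work group holds enqueued_local_size work items, except possibly
 * the last one, which holds the remainder. */
size_t model_work_group_size(size_t global_size, size_t enqueued_local_size,
                             size_t group_id)
{
    size_t full_groups, remainder;

    assert(enqueued_local_size > 0);
    full_groups = global_size / enqueued_local_size;
    remainder = global_size % enqueued_local_size;
    assert(group_id < full_groups + (remainder != 0));
    return group_id < full_groups ? enqueued_local_size : remainder;
}

/* Models get_num_sub_groups() as seen from work group group_id. */
size_t model_get_num_sub_groups(size_t global_size, size_t enqueued_local_size,
                                size_t group_id, size_t sub_group_size)
{
    size_t wg = model_work_group_size(global_size, enqueued_local_size, group_id);
    assert(sub_group_size > 0);
    return (wg + sub_group_size - 1) / sub_group_size;
}

/* Models get_enqueued_num_sub_groups(): it depends only on the enqueued
 * local size, so it is the same for every work group in the NDRange. */
size_t model_get_enqueued_num_sub_groups(size_t enqueued_local_size,
                                         size_t sub_group_size)
{
    assert(sub_group_size > 0);
    return (enqueued_local_size + sub_group_size - 1) / sub_group_size;
}
```

     With a global size of 100, an enqueued local size of 32, and an assumed
     subgroup size of 8, the model reports 4 subgroups for work groups 0
     through 2 and 1 subgroup for the trailing work group of 4 work items,
     while the enqueued count is 4 everywhere.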

 Additions to Section 6.13.8 - "Synchronization Functions" of the OpenCL 2.0 C Specification

     These additions are mostly unchanged from the Khronos subgroups extension.  There is
     no new functionality, only minor edits for clarity:

    "--------------------------------------------------------------------------------------
     Function                                  Description
     ----------------------------------------  --------------------------------------------
     void sub_group_barrier(                   All work items in a subgroup executing the
              cl_mem_fence_flags flags )       kernel on a processor must execute this
                                               function before any are allowed to continue
                                               execution beyond the subgroup barrier.  This
                                               function must be encountered by all work
                                               items in a subgroup executing the kernel.
                                               These rules apply to NDRanges implemented
                                               with uniform and non-uniform work groups.

                                               If sub_group_barrier is inside a conditional
                                               statement then all work items within the
                                               subgroup must enter the conditional if
                                               any work item in the subgroup enters the
                                               conditional statement and executes the
                                               sub_group_barrier.

                                               If sub_group_barrier is inside a loop, all
                                               work items within the subgroup must execute
                                               the sub_group_barrier for each iteration of
                                               the loop before any are allowed to continue
                                               execution beyond the sub_group_barrier.

                                               The sub_group_barrier function also queues a
                                               memory fence (reads and writes) to ensure
                                               correct ordering of memory operations to
                                               local or global memory.

                                               The flags argument specifies the memory
                                               address space and can be set to a
                                               combination of the following values:

                                               CLK_LOCAL_MEM_FENCE - The sub_group_barrier
                                               function will either flush any variables
                                               stored in local memory or queue a memory
                                               fence to ensure correct ordering of memory
                                               operations to local memory.

                                               CLK_GLOBAL_MEM_FENCE - The sub_group_barrier
                                               function will queue a memory fence to ensure
                                               correct ordering of memory operations to
                                               global memory.  This can be useful when work
                                               items, for example, write to buffer objects
                                               and then want to read the updated data from
                                               these buffer objects.
     --------------------------------------------------------------------------------------"

     If OpenCL 2.0 is supported, add the following to the table above:

    "--------------------------------------------------------------------------------------
     Function                                  Description
     ----------------------------------------  --------------------------------------------
     void sub_group_barrier(                   ...
              cl_mem_fence_flags flags,        The sub_group_barrier function also supports
              memory_scope scope )             a variant that specifies the memory scope.
                                               For the sub_group_barrier variant that does
                                               not take a memory scope, the scope is
                                               memory_scope_sub_group.

                                               The scope argument specifies whether the
                                               memory accesses of work items in the
                                               subgroup to memory address space(s)
                                               identified by flags become visible to all
                                               work items in the subgroup, the work group,
                                               the device, or all SVM devices.
                                               ...
                                               CLK_IMAGE_MEM_FENCE - The sub_group_barrier
                                               function will queue a memory fence to ensure
                                               correct ordering of memory operations to
                                               image objects.  This can be useful when work
                                               items, for example, write to image objects
                                               and then want to read the updated data from
                                               these image objects.
     --------------------------------------------------------------------------------------"

 Additions to Section 6.13.11 - "Atomic Functions" of the OpenCL 2.0 C Specification

     Modify the bullet describing behavior for functions that do not have a memory_scope
     argument to say:

    " * The subgroup functions that do not have a memory_scope argument have the same
        semantics as the corresponding functions with the memory_scope argument set to
        memory_scope_sub_group.  Other functions that do not have a memory_scope
        argument have the same semantics as the corresponding functions with the
        memory_scope argument set to memory_scope_device."

     This addition is copied unchanged from the Khronos subgroups extension:

     Add the following new value to the enumerated type <memory_scope> defined in Section
     6.13.11.4:

    "<memory_scope_sub_group>

     The <memory_scope_sub_group> specifies that the memory ordering constraints given by
     <memory_order> apply to work items in a subgroup.  This memory scope can be used when
     performing atomic operations to global or local memory."

 Additions to Section 6.13.15 - "Work Group Functions" of the OpenCL 2.0 C Specification

     These additions are copied from the Khronos subgroups extension:

    "The OpenCL C programming language implements the following built-in functions that
     operate on a subgroup level.  These built-in functions must be encountered by all work
     items in a subgroup executing the kernel.  We use the generic term <gentype> to indicate
     the built-in data types <int>, <uint>, <long>, <ulong>, or <float> as the type for the
     arguments.

     If cl_khr_fp16 is supported, <gentype> also includes <half>.
     If cl_khr_fp64 or doubles are supported, <gentype> also includes <double>.

     --------------------------------------------------------------------------------------
     Function                                  Description
     ----------------------------------------  --------------------------------------------
     int     sub_group_all( int predicate )    Evaluates predicate for all work items in
                                               the subgroup and returns a non-zero value
                                               if predicate evaluates to non-zero for all
                                               work items in the subgroup.

     int     sub_group_any( int predicate )    Evaluates predicate for all work items in
                                               the subgroup and returns a non-zero value if
                                               predicate evaluates to non-zero for any work
                                               item in the subgroup.

     <gentype> sub_group_broadcast(            Broadcasts the value of x for the work item
                   <gentype> x,                identified by sub_group_local_id (value
                   uint sub_group_local_id )   returned by get_sub_group_local_id) to all
                                               work items in the subgroup.
                                               sub_group_local_id must be the same value
                                               for all work items in the subgroup.

     <gentype> sub_group_reduce_<op>(          Returns the result of the reduction operation
                   <gentype> x )               specified by <op> for all values x specified
                                               by work items in a subgroup.

     <gentype> sub_group_scan_exclusive_<op>(  Does an exclusive scan operation specified by
                   <gentype> x )               <op> of all values specified by work items
                                               in a subgroup.  The scan results are
                                               returned for each work item.

                                               The scan order is defined by increasing
                                               sub_group_local_id within the subgroup.

     <gentype> sub_group_scan_inclusive_<op>(  Does an inclusive scan operation specified by
                   <gentype> x )               <op> of all values specified by work items
                                               in a subgroup.  The scan results are
                                               returned for each work item.

                                               The scan order is defined by increasing
                                               sub_group_local_id within the subgroup.
     --------------------------------------------------------------------------------------"
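     As an informal illustration of the reduce and scan semantics in the table above, the
     following host-side C sketch models the <op> = add case.  It is not part of the
     extension: the model_* names are hypothetical, x[i] stands for the value supplied by
     the work item whose sub_group_local_id is i, and the sketch assumes that the
     exclusive scan returns the identity value (0 for add) to the first work item.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side model: x[i] is the value supplied by the work
 * item whose sub_group_local_id is i; n is the subgroup size. */

/* Models sub_group_reduce_add: every work item receives the same sum. */
int model_reduce_add(const int *x, size_t n)
{
    assert(x != NULL);
    int sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}

/* Models sub_group_scan_inclusive_add: out[i] = x[0] + ... + x[i]. */
void model_scan_inclusive_add(const int *x, int *out, size_t n)
{
    assert(x != NULL && out != NULL);
    int running = 0;
    for (size_t i = 0; i < n; ++i) {
        running += x[i];
        out[i] = running;
    }
}

/* Models sub_group_scan_exclusive_add: out[i] = x[0] + ... + x[i-1].
 * Assumption: out[0] is the identity for add, which is 0. */
void model_scan_exclusive_add(const int *x, int *out, size_t n)
{
    assert(x != NULL && out != NULL);
    int running = 0;
    for (size_t i = 0; i < n; ++i) {
        out[i] = running;   /* sum of all values with a lower id */
        running += x[i];
    }
}
```

     For inputs 1, 2, 3, 4 this model gives a reduction of 10, an inclusive scan of
     1, 3, 6, 10, and an exclusive scan of 0, 1, 3, 6.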

 Add a new Section 6.13.X - "Sub Group Shuffle Functions" to the OpenCL 2.0 C Specification

     These additions are unique to the Intel subgroups extension and are not part of the
     Khronos subgroups extension:

    "The OpenCL C programming language implements the following subgroup shuffle built-in
     functions to allow data to be exchanged among work items in a subgroup.  These
     built-in functions need not be encountered by all work items in a subgroup executing
     the kernel; however, data may only be shuffled among work items encountering the
     subgroup shuffle function.  Shuffling data from a work item that does not encounter
     the subgroup shuffle function will produce undefined results.

     For these functions, <gentype> is <float>, <float2>, <float4>, <float8>, <float16>,
     <int>, <int2>, <int4>, <int8>, <int16>, <uint>, <uint2>, <uint4>, <uint8>, <uint16>,
     <long>, or <ulong>.

     If cl_khr_fp16 is supported, <gentype> also includes <half>.
     If cl_khr_fp64 or doubles are supported, <gentype> also includes <double>.

     --------------------------------------------------------------------------------------
     Function                                  Description
     ----------------------------------------  --------------------------------------------
     <gentype> intel_sub_group_shuffle(        Allows data to be arbitrarily transferred
                   <gentype> data,             between work items in a subgroup.  The data
                   uint sub_group_local_id )   that is returned for this work item is the
                                               value of data for the work item identified
                                               by sub_group_local_id.

                                               sub_group_local_id need not be the same
                                               value for all work items in the subgroup.
                                               There is no defined behavior for out-of-
                                               range sub_group_local_ids.

     <gentype> intel_sub_group_shuffle_down(   Allows data to be transferred from a work
                   <gentype> current,          item in the subgroup with a higher
                   <gentype> next,             sub_group_local_id down to a work item in
                   uint delta )                the subgroup with a lower sub_group_local_id.

                                               There are two data sources to this built-in
                                               function: current and next.  To determine the
                                               result of this built-in function, first let
                                               the unsigned shuffle index be equivalent to
                                               the sum of this work item's sub_group_local_id
                                               and the specified delta:

                                               If the shuffle index is less than the
                                               max_sub_group_size, the result of this built-in
                                               function is the value of the current data
                                               source for the work item with
                                               sub_group_local_id equal to the shuffle index.

                                               If the shuffle index is greater than or equal
                                               to the max_sub_group_size but less than twice
                                               the max_sub_group_size, the result of this
                                               built-in function is the value of the next
                                               data source for the work item with
                                               sub_group_local_id equal to the shuffle index
                                               minus the max_sub_group_size.

                                               All other values of the shuffle index are
                                               considered to be out-of-range.  There is no
                                               defined behavior for out-of-range indices.

                                               delta need not be the same value for all work
                                               items in the subgroup.

     <gentype> intel_sub_group_shuffle_up(     Allows data to be transferred from a work
                   <gentype> previous,         item in the subgroup with a lower
                   <gentype> current,          sub_group_local_id up to a work item in the
                   uint delta )                subgroup with a higher sub_group_local_id.

                                               There are two data sources to this built-in
                                               function: previous and current.  To determine
                                               the result of this built-in function, first
                                               let the signed shuffle index be equivalent to
                                               this work item's sub_group_local_id minus the
                                               specified delta:

                                               If the shuffle index is greater than or equal
                                               to zero and less than the max_sub_group_size,
                                               the result of this built-in function is the
                                               value of the current data source for the work
                                               item with sub_group_local_id equal to the
                                               shuffle index.

                                               If the shuffle index is less than zero but
                                               greater than or equal to the negative
                                               max_sub_group_size, the result of this
                                               built-in function is the value of the previous
                                               data source for the work item with
                                               sub_group_local_id equal to the shuffle index
                                               plus the max_sub_group_size.

                                               All other values of the shuffle index are
                                               considered to be out-of-range.  There is no
                                               defined behavior for out-of-range indices.

                                               delta need not be the same value for all work
                                               items in the subgroup.

     <gentype> intel_sub_group_shuffle_xor(    Allows data to be transferred between work
                   <gentype> data,             items in a subgroup as a function of the work
                   uint value )                item's sub_group_local_id.  The data that is
                                               returned for this work item is the value of
                                               data for the work item with sub_group_local_id
                                               equal to this work item's sub_group_local_id
                                               XOR'd with the specified value.  If the result
                                               of the XOR is greater than or equal to
                                               max_sub_group_size then it is considered
                                               out-of-range.

                                               value need not be the same for all work items
                                               in the subgroup.  There is no defined behavior
                                               for out-of-range indices.
     --------------------------------------------------------------------------------------"
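     The index rules for the shuffle built-ins above can be sketched as a host-side C
     model.  This is an informal illustration, not part of the extension: the shuffle_src
     type and the model_* functions are hypothetical names, max_size stands for
     max_sub_group_size, and the shuffle index is widened to 64 bits so that the
     arithmetic in the model itself cannot wrap.  The model only reports which data
     source and which sub_group_local_id a work item would read; it does not move data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: which data source is read, and from which
 * sub_group_local_id.  in_range is false for out-of-range indices, for
 * which no behavior is defined. */
typedef struct {
    bool     in_range;
    bool     from_current;  /* false selects 'next' (down) or 'previous' (up) */
    uint32_t src_id;
} shuffle_src;

static shuffle_src make_src(bool in_range, bool from_current, uint32_t src_id)
{
    shuffle_src r;
    r.in_range = in_range;
    r.from_current = from_current;
    r.src_id = src_id;
    return r;
}

/* Models the source selection of intel_sub_group_shuffle_down. */
shuffle_src model_shuffle_down(uint32_t id, uint32_t delta, uint32_t max_size)
{
    assert(max_size > 0 && id < max_size);
    uint64_t index = (uint64_t)id + delta;           /* unsigned shuffle index */
    if (index < max_size)
        return make_src(true, true, (uint32_t)index);
    if (index < 2u * (uint64_t)max_size)
        return make_src(true, false, (uint32_t)(index - max_size));
    return make_src(false, false, 0);
}

/* Models the source selection of intel_sub_group_shuffle_up. */
shuffle_src model_shuffle_up(uint32_t id, uint32_t delta, uint32_t max_size)
{
    assert(max_size > 0 && id < max_size);
    int64_t index = (int64_t)id - (int64_t)delta;    /* signed shuffle index */
    if (index >= 0 && index < (int64_t)max_size)
        return make_src(true, true, (uint32_t)index);
    if (index < 0 && index >= -(int64_t)max_size)
        return make_src(true, false, (uint32_t)(index + (int64_t)max_size));
    return make_src(false, false, 0);
}

/* Models the source selection of intel_sub_group_shuffle_xor.  An index
 * equal to max_size identifies no work item, so this sketch treats it as
 * out of range as well. */
shuffle_src model_shuffle_xor(uint32_t id, uint32_t value, uint32_t max_size)
{
    assert(max_size > 0 && id < max_size);
    uint32_t index = id ^ value;
    if (index < max_size)
        return make_src(true, true, index);
    return make_src(false, false, 0);
}
```

     For example, with max_size equal to 8, work item 6 shuffling down by 3 reads the
     next data source of work item 1, and work item 1 shuffling up by 3 reads the
     previous data source of work item 6.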

 Add a new Section 6.13.X - "Sub Group Read and Write Functions" to the OpenCL 2.0 C
 Specification

     These additions are unique to the Intel subgroups extension and are not part of the
     Khronos subgroups extension:

    "The OpenCL C programming language implements the following built-in functions to allow
     data to be read or written as a block by all work items in a subgroup.  These built-in
     functions must be encountered by all work items in a subgroup executing the kernel.
     Furthermore, since these are block operations, the pointer, image, and coordinate
     arguments to these built-in functions must be the same for all work items in the
     subgroup (when applicable, only the data argument may be different).

     --------------------------------------------------------------------------------------
     Function                                  Description
     ----------------------------------------  --------------------------------------------
     uint  intel_sub_group_block_read(         Reads 1, 2, 4, or 8 uints of data for each
               const __global uint* p )        work item in the subgroup from the specified
     uint2 intel_sub_group_block_read2(        pointer as a block operation.
               const __global uint* p )        The data is read strided, so the first
     uint4 intel_sub_group_block_read4(        value read is:
               const __global uint* p )          p[ sub_group_local_id ]
     uint8 intel_sub_group_block_read8(        and the second value read is:
               const __global uint* p )          p[ sub_group_local_id + max_sub_group_size ]
                                               etc.

                                               There is no defined out-of-range behavior
                                               for these functions.

     uint  intel_sub_group_block_read(         Reads 1, 2, 4, or 8 uints of data for each
               image2d_t image,                work item in the subgroup from the specified
               int2 byte_coord )               image at the specified coordinate as a block
     uint2 intel_sub_group_block_read2(        operation.  Note that the coordinate is a
               image2d_t image,                byte coordinate, not an image element
               int2 byte_coord )               coordinate.  Also note that the image data
     uint4 intel_sub_group_block_read4(        is read without format conversion, so each
               image2d_t image,                work item may read multiple image elements
               int2 byte_coord )               (for images with element size smaller than
     uint8 intel_sub_group_block_read8(        32 bits).
               image2d_t image,
               int2 byte_coord )               The data is read row-by-row, so the first
                                               value is read from the row given by the
                                               y-component of the provided byte_coord, the
                                               second value is read from the row given by
                                               the y-component plus one, etc.

                                               Please see the note below describing out-of-
                                               bounds behavior for the subgroup image block
                                               read functions.

     void  intel_sub_group_block_write(        Writes 1, 2, 4, or 8 uints of data for each
               __global uint* p, uint data )   work item in the subgroup to the specified
     void  intel_sub_group_block_write2(       pointer as a block operation.
               __global uint* p, uint2 data )  The data is written strided, so the first
     void  intel_sub_group_block_write4(       value is written to:
               __global uint* p, uint4 data )    p[ sub_group_local_id ]
     void  intel_sub_group_block_write8(       and the second value is written to:
               __global uint* p, uint8 data )    p[ sub_group_local_id + max_sub_group_size ]
                                               etc.

                                               There is no defined out-of-range behavior
                                               for these functions.

     void  intel_sub_group_block_write(        Writes 1, 2, 4, or 8 uints of data for each
               image2d_t image,                work item in the subgroup to the specified
               int2 byte_coord, uint data )    image at the specified coordinate as a block
     void  intel_sub_group_block_write2(       operation.  Note that the coordinate is a
               image2d_t image,                byte coordinate, not an image element
               int2 byte_coord, uint2 data )   coordinate.  Unlike the image block read
     void  intel_sub_group_block_write4(       function, which may read from any arbitrary
               image2d_t image,                byte offset, the x-component of the byte
               int2 byte_coord, uint4 data )   coordinate for the image block write
     void  intel_sub_group_block_write8(       functions must be a multiple of four; in
               image2d_t image,                other words, the write must begin at a
               int2 byte_coord, uint8 data )   32-bit boundary.  There is no restriction on
                                               the y-component of the coordinate.  Also, note
                                               that the image data is written without format
                                               conversion, so each work item may write
                                               multiple image elements (for images with
                                               element size smaller than 32 bits).

                                               The data is written row-by-row, so the first
                                               value is written to the row given by the
                                               y-component of the provided byte_coord, the
                                               second value is written to the row given by
                                               the y-component plus one, etc.

                                               Please see the note below describing out-of-
                                               bounds behavior for the subgroup image block
                                               write functions.
     --------------------------------------------------------------------------------------

     Note: The subgroup image block read and write built-ins do support bounds checking;
     however, these built-ins bounds-check to the image width in units of uints, not in
     units of image elements.  This means:

       * If the image has an element size equal to the size of a uint (four bytes, for
         example CL_RGBA + CL_UNORM_INT8), the image will be correctly bounds-checked.
         In this case, out-of-bounds reads will return the edge image element (the
         equivalent of CLK_ADDRESS_CLAMP_TO_EDGE), and out-of-bounds writes will be
         ignored.

       * If the image has element size less than the size of a uint (such as CL_R +
         CL_UNSIGNED_INT8), the entire image is addressable; however, bounds checking
         will occur too late.  For this reason, extra care should be taken to avoid
         out-of-bounds reads and writes, since out-of-bounds reads may return invalid
         data and out-of-bounds writes may corrupt other images or buffers unpredictably.

     6.13.X.1 - Restrictions

     The following restrictions apply to the subgroup buffer block read and write
     functions:

       * The pointer 'p' must be 32-bit (4-byte) aligned for reads, and must be
         128-bit (16-byte) aligned for writes.

       * If the pointer 'p' is computed from a kernel argument that is a cl_mem
         that was created with CL_MEM_USE_HOST_PTR, then the <host_ptr> must be
         32-bit (4-byte) aligned for reads, and must be 128-bit (16-byte) aligned
         for writes.

       * If the pointer 'p' is computed from a kernel argument that is a cl_mem
         that is a sub-buffer, then the <origin> defining the sub-buffer offset into
         the <buffer> must be a multiple of 4 bytes for reads, and must be a multiple
         of 16 bytes for writes, in addition to the CL_DEVICE_MEM_BASE_ADDR_ALIGN
         requirements.  Additionally, if the <buffer> that the sub-buffer is created
         from was created with CL_MEM_USE_HOST_PTR, then the <host_ptr> for the
         <buffer> must be 32-bit (4-byte) aligned for reads, and must be 128-bit
         (16-byte) aligned for writes.

       * If the pointer 'p' is computed from an SVM pointer kernel argument, then the
         SVM pointer kernel argument must be 32-bit (4-byte) aligned for reads, and
         must be 128-bit (16-byte) aligned for writes.

     The following restrictions apply to the subgroup image block read and write
     functions:

       * The behavior of the subgroup image block read and write built-ins is
         undefined for images with an element size greater than four bytes
         (such as CL_RGBA + CL_FLOAT).

       * When reading or writing a 2D image created from a buffer with the subgroup
         block read and write built-ins, the image row pitch is required to be a
         multiple of 64 bytes, in addition to the CL_DEVICE_IMAGE_PITCH_ALIGNMENT
         requirements.

       * When reading or writing a 2D image created from a buffer with the subgroup
         block read and write built-ins, if the buffer is a cl_mem that was created
         with CL_MEM_USE_HOST_PTR, then the <host_ptr> must be 256-bit (32-byte)
         aligned.

       * When reading or writing a 2D image created from a buffer with the subgroup
         block read and write built-ins, if the buffer is a cl_mem that is a
         sub-buffer, then the <origin> must be a multiple of 32 bytes.  Additionally,
         if the <buffer> that the sub-buffer is created from was created with
         CL_MEM_USE_HOST_PTR, then the <host_ptr> for the <buffer> must be 256-bit
         (32-byte) aligned."
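     The strided layout of the buffer block reads and the pointer alignment restrictions
     above can be sketched in host-side C as follows.  This is an informal model, not
     part of the extension: the model_* names are hypothetical, max_size stands for
     max_sub_group_size, and the index of the k-th value is assumed to continue the
     pattern given for the first two values (sub_group_local_id + k * max_sub_group_size).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Buffer block reads require a 4-byte aligned pointer. */
bool model_block_read_aligned(const void *p)
{
    return ((uintptr_t)p % 4u) == 0;
}

/* Buffer block writes require a 16-byte aligned pointer. */
bool model_block_write_aligned(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Index into p of the k-th uint accessed by the work item with the given
 * sub_group_local_id (assumed extension of the "etc." in the table). */
size_t model_block_index(uint32_t id, uint32_t k, uint32_t max_size)
{
    return (size_t)id + (size_t)k * max_size;
}

/* Models intel_sub_group_block_read, read2, read4, and read8 for a single
 * work item: fills out[0..n-1] with the n values that work item receives. */
void model_block_read(const uint32_t *p, uint32_t id, uint32_t max_size,
                      uint32_t n, uint32_t *out)
{
    assert(n == 1 || n == 2 || n == 4 || n == 8);
    assert(id < max_size);
    assert(model_block_read_aligned(p));
    for (uint32_t k = 0; k < n; ++k)
        out[k] = p[model_block_index(id, k, max_size)];
}
```

     With max_size equal to 4, work item 1 calling the two-value variant receives p[1]
     and p[5].  Out-of-range accesses are not modeled, since no out-of-range behavior is
     defined for the buffer block functions.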

 Revision History

     Version 1 (2014/12/02): First public revision.
     Version 2 (2015/03/12): Fixed minor formatting errors, added restriction for
                             subgroup image block read and write built-ins with large
                             image formats.
     Version 3 (2016/02/12): Fixed a small bug in the shuffle up and shuffle down
                             descriptions.
     Version 4 (2016/08/28): Added additional restrictions and programming notes for the
                             subgroup shuffle and block read built-ins.