
 Name

     NV_compute_program5

 Name Strings

     GL_NV_compute_program5

 Contact

     Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com)

 Status

     Complete

 Version

     Last Modified Date:         10/23/2012
     NVIDIA Revision:            2

 Number

     421

 Dependencies

     OpenGL 4.0 (Core or Compatibility Profile) is required.

     This extension is written against the OpenGL 4.2 Specification
     (Compatibility Profile).

     NV_gpu_program4 and NV_gpu_program5 are required.

     ARB_compute_shader is required.

     This specification interacts with NV_shader_atomic_float.

     This specification interacts with EXT_shader_image_load_store.

 Overview

     This extension builds on the ARB_compute_shader extension to provide new
     assembly compute program capability for OpenGL.  ARB_compute_shader adds
     the basic functionality, including the ability to dispatch compute work.
     This extension provides the ability to write a compute program in
     assembly, using the same basic syntax and capability set found in the
     NV_gpu_program4 and NV_gpu_program5 extensions.

 New Procedures and Functions

     None.

 New Tokens

     Accepted by the <cap> parameter of Disable, Enable, and IsEnabled,
     by the <pname> parameter of GetBooleanv, GetIntegerv, GetFloatv,
     and GetDoublev, and by the <target> parameter of ProgramStringARB,
     BindProgramARB, ProgramEnvParameter4[df][v]ARB,
     ProgramLocalParameter4[df][v]ARB, GetProgramEnvParameter[df]vARB,
     GetProgramLocalParameter[df]vARB, GetProgramivARB and
     GetProgramStringARB:

         COMPUTE_PROGRAM_NV                              0x90FB

     Accepted by the <target> parameter of ProgramBufferParametersfvNV,
     ProgramBufferParametersIivNV, and ProgramBufferParametersIuivNV,
     BindBufferRangeNV, BindBufferOffsetNV, BindBufferBaseNV, and BindBuffer
     and the <value> parameter of GetIntegerIndexedvEXT:

         COMPUTE_PROGRAM_PARAMETER_BUFFER_NV             0x90FC

     (Note:  Various enumerants from ARB_compute_shader will also be used by
      this extension.)

 Additions to Chapter 2 of the OpenGL 4.2 (Compatibility Profile) Specification
 (OpenGL Operation)

     Modify Section 2.X, GPU Programs, of NV_gpu_program4 (as modified by
     NV_gpu_program5)

     (insert after second paragraph)

     Compute Programs

     Compute programs are used to perform general purpose computations using a
     three-dimensional array of program invocations (threads).  The compute
     shader invocations are arranged into work groups specified by the
     mandatory GROUP_SIZE declaration, each of which comprises a fixed-size,
     three-dimensional array of program invocations.  One or more work groups
     are scheduled for execution using the DispatchCompute or
     DispatchComputeIndirect commands.

     Each work group scheduled for execution will launch a separate program
     invocation for each work group member.  While the program invocations in a
     work group are launched together, they run independently after launch.
     The BAR (barrier) instruction is available to synchronize program
     invocations; an invocation stops at each BAR instruction until all
     invocations in the work group have executed the BAR instruction.  Each
     work group has an optional shared memory allocation (specified by the
     SHARED_MEMORY declaration) that can be read or written by any invocations
     of the work group.

     Unlike other program types, compute program invocations have no inputs or
     outputs interfacing with the rest of the pipeline.  Compute programs may
     obtain inputs using mechanisms such as global loads, image loads, atomic
     counter reads, shader storage buffer reads, and program parameters.
     Built-in inputs are also provided to allow a compute shader invocation to
     determine its position in the work group, the position of its work group
     in the full dispatch, as well as the work group and full dispatch sizes.
     Compute program results are expected to be written to globally accessible
     memory using mechanisms such as global stores, image stores, atomic
     counters, and shader storage buffers.
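
     As an illustration of how an application might choose the number of work
     groups to schedule, the following C sketch rounds a problem size up to a
     whole number of fixed-size work groups along one dimension.  The helper
     name groups_needed is hypothetical and not part of this extension; the
     resulting counts would be the values passed to DispatchCompute.

```c
#include <assert.h>

/* Number of work groups needed to cover `items` invocations along one
 * dimension when each work group holds `group_size` invocations.
 * Rounds up so that a partially filled final group is still scheduled. */
static unsigned groups_needed(unsigned items, unsigned group_size)
{
    assert(group_size > 0);
    return items / group_size + (items % group_size != 0);
}
```

     For example, covering a 40x16 domain with a work group size of 8x4 gives
     groups_needed(40, 8) = 5 and groups_needed(16, 4) = 4 work groups.  When
     the problem size is not a multiple of the work group size, the final
     group contains invocations that fall outside the problem domain.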


     Modify Section 2.X.2, Program Grammar

     (replace third paragraph)

     Compute programs are required to begin with the header string "!!NVcp5.0".
     This header string identifies the subsequent program body as being a
     compute program and indicates that it should be parsed according to the
     base NV_gpu_program5 grammar plus the additions below.  Program string
     parsing begins with the character immediately following the header string.
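
     A minimal C sketch of the header check described above is shown below.
     The names kComputeHeader and compute_program_body are hypothetical; this
     illustrates only the rule that parsing begins immediately after the
     header string, not how any implementation actually parses programs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Header string required at the start of every compute program. */
static const char kComputeHeader[] = "!!NVcp5.0";

/* Returns a pointer to the first character after the header, or NULL if
 * the string does not begin with the compute program header. */
static const char *compute_program_body(const char *program)
{
    size_t header_len = sizeof kComputeHeader - 1; /* exclude the NUL */

    assert(program != NULL);
    if (strncmp(program, kComputeHeader, header_len) != 0)
        return NULL;
    return program + header_len;
}
```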

     (add the following grammar rules to the NV_gpu_program5 base grammar for
      compute programs)

     <declSequence>          ::= <declaration> <declSequence>

     <instruction>           ::= <SpecialInstruction>

     <opModifier>            ::= "CTA"

     <namingStatement>       ::= <SHARED_statement>

     <SHARED_statement>      ::= "SHARED" <establishName> <sharedSingleInit>
                               | "SHARED" <establishName> <optArraySize>
                                 <sharedMultipleInit>

     <sharedSingleInit>      ::= "=" <sharedUseDS>

     <sharedMultipleInit>    ::= "=" "{" <sharedItemList> "}"

     <sharedItemList>        ::= <sharedUseDM>
                               | <sharedUseDM> "," <sharedItemList>

     <sharedUseV>            ::= <sharedVarName> <optArrayMem>

     <sharedUseDS>           ::= <sharedBaseBinding> <arrayMemAbs>

     <sharedUseDM>           ::= <sharedUseDS>
                               | <sharedBaseBinding> <arrayRange>

     <sharedBaseBinding>     ::= "program" "." "sharedmem"

     <SpecialInstruction>    ::= "BAR"
                               | "ATOMS" <opModifiers> <instResult> ","
                                 <instOperandV> "," <sharedUseV>
                               | "LDS" <opModifiers> <instResult> ","
                                 <sharedUseV>
                               | "STS" <opModifiers> <instOperandV> ","
                                 <sharedUseV>

     <declaration>           ::= "GROUP_SIZE" <int>
                               | "GROUP_SIZE" <int> <int>
                               | "GROUP_SIZE" <int> <int> <int>
                               | "SHARED_MEMORY" <int>

     <attribBasic>           ::= "invocation" "." "localid"
                               | "invocation" "." "globalid"
                               | "invocation" "." "groupid"
                               | "invocation" "." "groupcount"
                               | "invocation" "." "groupsize"
                               | "invocation" "." "localindex"


     (add the following subsection to Section 2.X.3.2, Program Attribute
      Variables)

     Compute program attribute variables describe the attributes of the current
     program invocation.  Each DispatchCompute command produces a set of
     program invocations arranged as a one-, two-, or three-dimensional array.
     Figure X.1 illustrates a two-dimensional dispatch with a local work group
     size of 8x4, and a total dispatch of 5x4 local workgroups.  Each
     individual program invocation has a one-, two-, or three-dimensional
     global coordinate, which can be further decomposed into a work group
     offset (in units of fixed-size work groups) and a local offset relative
     to the origin of the invocation's work group.

                 +-------+-------+-------+-------+-------+
                 |       |       | work  |       |       |
                 |       |       | group |       |       |
                 |       |       | (2,3) |       |       |
          (0,12) +-------+-------+-------+-------+-------+
                 |       |       |       |       |       |
                 |       |       |       |       |       |
                 |       | *     |       |       |       |
           (0,8) +-------+-------+-------+-------+-------+
                 |       |       |       |       | work  |
                 |       |       |       |       | group |
                 |       |       |       |       | (4,1) |
           (0,4) +-------+-------+-------+-------+-------+
                 | work  |       |       |       |       |
                 | group |       |       |       |       |
                 | (0,0) |       |       |       |       |
                 +-------+-------+-------+-------+-------+
               (0,0)   (8,0)   (16,0)  (24,0)  (32,0)

       Figure X.1, Compute Dispatch.  The single invocation at the location
       labeled "*" has a location (invocation.globalid) of (10,9).  The offset
       relative to its local work group (invocation.localid) is (2,1).  Its
       local work group has an offset (invocation.groupid) of (1,2), in units
       of work groups.
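
     The decomposition shown in Figure X.1 can be sketched in C as follows.
     The type uvec3 and the function decompose_global_id are hypothetical
     names used only for illustration; the arithmetic assumes that the global
     coordinate equals the work group offset times the work group size plus
     the local offset, which matches the values given in the figure caption.

```c
#include <assert.h>

/* Three-component unsigned vector mirroring the x, y, and z components of
 * the invocation.* bindings (the undefined w component is omitted). */
typedef struct { unsigned x, y, z; } uvec3;

/* Splits a global coordinate into the work group offset (groupid) and the
 * offset within that work group (localid) for a given work group size. */
static void decompose_global_id(uvec3 global, uvec3 group_size,
                                uvec3 *group_id, uvec3 *local_id)
{
    assert(group_size.x > 0 && group_size.y > 0 && group_size.z > 0);

    group_id->x = global.x / group_size.x;
    group_id->y = global.y / group_size.y;
    group_id->z = global.z / group_size.z;

    local_id->x = global.x % group_size.x;
    local_id->y = global.y % group_size.y;
    local_id->z = global.z % group_size.z;
}
```

     With a work group size of (8,4,1), the global coordinate (10,9,0) yields
     a work group offset of (1,2,0) and a local offset of (2,1,0).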

     The set of available compute program attribute bindings is enumerated in
     Table X.1.  All bindings are considered four-component unsigned integer
     vectors with the value of the fourth component undefined.

       Attribute Binding          Components  Underlying State
       -------------------------  ----------  ------------------------------
       invocation.localid         (x,y,z,-)   offset relative to base of
                                              work group

       invocation.globalid        (x,y,z,-)   offset relative to the base
                                              of the dispatched work

       invocation.groupid         (x,y,z,-)   offset (in groups) of local work
                                              group

       invocation.groupcount      (x,y,z,-)   total local work group count

       invocation.groupsize       (x,y,z,-)   number of invocations in each
                                              dimension of the local work group

       invocation.localindex      (x,-,-,-)   one-dimensional (flattened) index
                                              in local workgroup

       Table X.1, Compute Program Attribute Bindings.

     If a compute attribute binding matches "invocation.localid", the "x", "y",
     and "z" components of the invocation attribute variable are filled with
     the "x", "y", "z" components, respectively, of the offset of the
     invocation relative to the base of its local workgroup.  The "w" component
     of the attribute is undefined.

     If a compute attribute binding matches "invocation.globalid", the "x",
     "y", and "z" components of the invocation attribute variable are filled
     with the "x", "y", "z" components, respectively, of the offset of the
     invocation relative to the full compute dispatch.  The "w" component of
     the attribute is undefined.

     If a compute attribute binding matches "invocation.groupid", the "x", "y",
     and "z" components of the invocation attribute variable are filled with
     the "x", "y", "z" components, respectively, of the offset of the local
     work group (in groups) relative to the full compute dispatch.  The "w"
     component of the attribute is undefined.

     If a compute attribute binding matches "invocation.groupcount", the "x",
     "y", and "z" components of the invocation attribute variable are filled
     with the "x", "y", and "z" dimensions, respectively, in local work groups
     of the full compute dispatch.  The "w" component of the attribute is
     undefined.

     If a compute attribute binding matches "invocation.groupsize", the "x",
     "y", and "z" components of the invocation attribute variable are filled
     with the "x", "y", and "z" dimensions, respectively, of the local work
     group, as specified by the GROUP_SIZE declaration.  The "w" component of
     the attribute is undefined.

     If a compute attribute binding matches "invocation.localindex", the "x"
     component of the invocation attribute variable is filled with a flattened
     one-dimensional index of the invocation, which is derived as:

       invocation.localid.z * invocation.groupsize.x * invocation.groupsize.y +
       invocation.localid.y * invocation.groupsize.x +
       invocation.localid.x

     The "y", "z", and "w" components of the attribute are undefined.
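
     The flattening formula above can be written as a small C helper.  The
     function name local_index and the array-based parameters are
     hypothetical; index 0, 1, and 2 of each array stand for the "x", "y",
     and "z" components, respectively.

```c
#include <assert.h>

/* Flattened one-dimensional index of an invocation within its local work
 * group, following the invocation.localindex derivation:
 *   z * size.x * size.y + y * size.x + x                                 */
static unsigned local_index(const unsigned local_id[3],
                            const unsigned group_size[3])
{
    assert(local_id[0] < group_size[0]);
    assert(local_id[1] < group_size[1]);
    assert(local_id[2] < group_size[2]);

    return local_id[2] * group_size[0] * group_size[1]
         + local_id[1] * group_size[0]
         + local_id[0];
}
```

     For a work group size of (8,4,1), the invocation with a local offset of
     (2,1,0) has a flattened index of 1*8 + 2 = 10, and the last invocation
     (7,3,0) has a flattened index of 31.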

     For one-dimensional dispatches, the "y" components of
     "invocation.localid", "invocation.globalid", and "invocation.groupid" will
     be zero.  For one- and two- dimensional dispatches, the "z" components of
     "invocation.localid", "invocation.globalid", and "invocation.groupid" will
     be zero.  The same components of "invocation.groupcount" and
     "invocation.groupsize" will be one in these cases.


     (add the following subsection to section 2.X.3.5, Program Results.)

     Compute programs have no result variables; all shader results must be
     written to memory.


     Add New Section 2.X.3.Y, Compute Program Shared Memory, after Section
     2.X.3.6, Program Parameter Buffers

     Compute program shared memory variables are arrays of basic machine units
     from which data can be read or written using the LDS and STS instructions.
     Compute program shared memory also supports atomic memory operations using
     the ATOMS instruction.  The GL allocates a single block of shared memory
     for each local work group, whose size in basic machine units is specified
     by the "SHARED_MEMORY" statement.  The contents of compute program shared
     memory are undefined when program execution for the local work group
     begins and can be changed only by using the ATOMS or STS instructions.
     Compute program shared memory variables are shared between all invocations
     of a local work group.  Writes performed by one invocation will be visible
     for any reads of the same memory from any other invocation executed after
     the write.  Note that the order of reads and writes between different
     invocations in a local work group is largely undefined, although the BAR
     instruction can be used to introduce synchronization points for all
     invocations in a local work group.

     Shared memory variables may only be used as operands in the ATOMS, LDS,
     and STS instructions; they may not be used as results or operands
     in general instructions.  Shared memory variables must be declared
     explicitly via the <SHARED_statement> grammar rule.  Shared memory
     bindings cannot be used directly in executable instructions.

     Shared memory variables may be declared as arrays, but all
     bindings assigned to the array must use the same binding point(s) and must
     increase consecutively.

       Binding                        Components  Underlying State
       -----------------------------  ----------  -----------------------------
       program.sharedmem[a]           (x,x,x,x)   compute shared memory,
                                                    element a
       program.sharedmem[a..b]        (x,x,x,x)   compute shared memory,
                                                    elements a through b
       program.sharedmem              (x,x,x,x)   compute shared memory,
                                                    all elements

       Table X.3: Shared Memory Bindings.  <a> and <b> indicate individual
       elements of shared memory.

     If a shared memory binding matches "program.sharedmem[a]", the shared
     memory variable is associated with basic machine element <a> of compute
     shared memory.

     For shared memory declarations, "program.sharedmem[a..b]" is equivalent to
     specifying elements <a> through <b> of compute shared memory in order.

     For shared memory declarations, "program.sharedmem" is equivalent to
     specifying elements zero through <N>-1 of compute shared memory in order,
     where <N> is the total shared memory size declared by the "SHARED_MEMORY"
     statement.


     Modify Section 2.X.4, Program Execution Environment

     (add to the opcode table)

                   Modifiers
       Instruction F I C S H D  Out Inputs    Description
       ----------- - - - - - -  --- --------  --------------------------------
       ATOMS       - - X - - -  s   v,su      atomic transaction to shared mem
       BAR         - - - - - -  -   -         work group execution barrier
       LDS         - - X X - F  v   su        load from shared memory
       STS         - - - - - -  -   v,su      store to shared memory


     Modify Section 2.X.4.1, Program Instruction Modifiers

       Modifier  Description
       --------  -----------------------------------------------
       CTA       Memory barrier orders only memory transactions
                 relative to invocations within local work group

     (add to descriptions of opcode modifiers)

     For the MEMBAR (memory barrier) instruction, the "CTA" modifier specifies
     that memory transactions before and after the barrier are strongly ordered
     as observed by any other shader invocation in the local work group.


     Modify Section 2.X.4.5, Program Memory Access, from NV_gpu_program5

     (add to the end of the first paragraph) ... Additionally programs may load
     from or store to shared memory via the ATOMS (atomic shared memory
     operation), LDS (load from shared memory), and STS (store to shared
     memory) instructions.

     (modify miscellaneous other language referring to "buffer object memory"
     to instead refer to "buffer object and shared memory")

     (add hypothetical built-in functions SharedMemoryLoad() and
     SharedMemoryStore() that behave similarly to BufferMemoryLoad() and
     BufferMemoryStore(), except that they access local work group shared
     memory instead of buffer object memory)


     Add the following subsection to section 2.X.7, Program Declarations

     Section 2.X.7.Y, Compute Program Declarations

     Compute programs support two types of declaration statement, as described
     below.

     - Shader Thread Group Size (GROUP_SIZE)

     The GROUP_SIZE statement declares the number of shader threads in a one-,
     two-, or three-dimensional local work group.  The statement must have one
     to three unsigned integer arguments.  Each argument must be less than or
     equal to the value of the implementation-dependent limit
     MAX_COMPUTE_LOCAL_WORK_SIZE for its corresponding dimension (X, Y, or Z).
     A program will fail to load unless it contains exactly one GROUP_SIZE
     declaration.


     - Shared Memory Storage Size (SHARED_MEMORY)

     The SHARED_MEMORY statement declares the size of the shared memory, in
     basic machine units, available to the threads of each local work group.
     The SHARED_MEMORY statement is optional, but a program will fail to load
     if it includes multiple SHARED_MEMORY declarations, if it uses the ATOMS,
     LDS, or STS instructions without a SHARED_MEMORY declaration, if it uses
     these instructions with an offset that would access memory beyond the
     declared shared memory size, or if the declared shared memory size is
     greater than the implementation-dependent limit
     MAX_COMPUTE_SHARED_VARIABLE_SIZE.
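
     The load-time rules for the two declaration statements can be summarized
     by the following C sketch.  The compute_decls structure, its field names,
     and the function compute_decls_valid are hypothetical; they assume that a
     parser has already counted the declarations and recorded the highest
     shared memory offset that can be determined when the program is loaded.
     The limits are passed in as parameters because their values are
     implementation-dependent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the declarations and shared memory accesses
 * found while parsing one compute program. */
typedef struct {
    unsigned group_size_decls;      /* number of GROUP_SIZE statements     */
    unsigned group_size[3];         /* declared size; unused dims are 1    */
    unsigned shared_memory_decls;   /* number of SHARED_MEMORY statements  */
    unsigned shared_memory_size;    /* declared size, basic machine units  */
    bool     uses_shared_memory;    /* any ATOMS, LDS, or STS present      */
    unsigned max_shared_access_end; /* highest known offset + access size  */
} compute_decls;

/* Returns true if the declarations satisfy the load-time rules above. */
static bool compute_decls_valid(const compute_decls *d,
                                const unsigned max_local_work_size[3],
                                unsigned max_shared_variable_size)
{
    assert(d != NULL);

    /* Exactly one GROUP_SIZE declaration, each dimension within its limit. */
    if (d->group_size_decls != 1)
        return false;
    for (int i = 0; i < 3; ++i)
        if (d->group_size[i] > max_local_work_size[i])
            return false;

    /* SHARED_MEMORY is optional but may appear at most once. */
    if (d->shared_memory_decls > 1)
        return false;
    if (d->shared_memory_size > max_shared_variable_size)
        return false;

    /* Shared memory instructions need a declaration and in-range offsets. */
    if (d->uses_shared_memory) {
        if (d->shared_memory_decls == 0)
            return false;
        if (d->max_shared_access_end > d->shared_memory_size)
            return false;
    }
    return true;
}
```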


     (add the following subsection to section 2.X.8, Program Instruction Set.)

     Section 2.X.8.Z, ATOMS:  Atomic Memory Operation (Shared Memory)

     The ATOMS instruction performs an atomic memory operation by reading from
     shared memory specified by the second unsigned integer scalar operand,
     computing a new value based on the value read from memory and the first
     (vector) operand, and then writing the result back to the same memory
     address.  The memory transaction is atomic, guaranteeing that no other
     write to the memory accessed will occur between the time it is read and
     written by the ATOMS instruction.  The result of the ATOMS instruction is
     the scalar value read from memory.  The second operand used for the ATOMS
     instruction must correspond to a shared memory variable declared using the
     "SHARED" statement; a program will fail to load if any other type of
     operand is used for the second operand of an ATOMS instruction.

     The ATOMS instruction has two required instruction modifiers.  The atomic
     modifier specifies the type of operation to be performed.  The storage
     modifier specifies the size and data type of the operand read from memory
     and the base data type of the operation used to compute the value to be
     written to memory.

       atomic     storage
       modifier   modifiers            operation
       --------   ------------------   --------------------------------------
        ADD       U32, S32, U64, F32   compute a sum
        MIN       U32, S32             compute minimum
        MAX       U32, S32             compute maximum
        IWRAP     U32                  increment memory, wrapping at operand
        DWRAP     U32                  decrement memory, wrapping at operand
        AND       U32, S32             compute bit-wise AND
        OR        U32, S32             compute bit-wise OR
        XOR       U32, S32             compute bit-wise XOR
        EXCH      U32, S32, U64, F32   exchange memory with operand
        CSWAP     U32, S32, U64        compare-and-swap

      Table X.Y, Supported atomic and storage modifiers for the ATOMS
      instruction.

     Not all storage modifiers are supported by ATOMS, and the set of modifiers
     allowed for any given instruction depends on the atomic modifier
     specified.  Table X.Y enumerates the set of atomic modifiers supported by
     the ATOMS instruction, and the storage modifiers allowed for each.

       tmp0 = VectorLoad(op0);
       result = SharedMemoryLoad(op1, storageModifier);
       switch (atomicModifier) {
       case ADD:
         writeval = tmp0.x + result;
         break;
       case MIN:
         writeval = min(tmp0.x, result);
         break;
       case MAX:
         writeval = max(tmp0.x, result);
         break;
       case IWRAP:
         writeval = (result >= tmp0.x) ? 0 : result+1;
         break;
       case DWRAP:
         writeval = (result == 0 || result > tmp0.x) ? tmp0.x : result-1;
         break;
       case AND:
         writeval = tmp0.x & result;
         break;
       case OR:
         writeval = tmp0.x | result;
         break;
       case XOR:
         writeval = tmp0.x ^ result;
         break;
       case EXCH:
         writeval = tmp0.x;
         break;
       case CSWAP:
         if (result == tmp0.x) {
           writeval = tmp0.y;
         } else {
           return result;  // no memory store
         }
         break;
       }
       SharedMemoryStore(op1, writeval, storageModifier);

     ATOMS performs a scalar atomic operation.  The <y>, <z>, and <w>
     components of the result vector are undefined.

     ATOMS supports no base data type modifiers, but requires exactly one
     storage modifier.  The base data types of the result vector, and the first
     (vector) operand are derived from the storage modifier.  The second
     operand is always interpreted as a scalar unsigned integer.
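
     The wrapping behavior of the IWRAP and DWRAP atomic modifiers is the
     least obvious part of the pseudocode above, so the following C sketch
     works through it for the U32 storage modifier.  The function names are
     hypothetical, and the code is a single-threaded emulation of the
     read-modify-write step only; it makes no attempt to be atomic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulates the IWRAP read-modify-write on one 32-bit memory location:
 * returns the value read and stores 0 once the value reaches `wrap`,
 * otherwise stores the value plus one. */
static uint32_t atoms_iwrap_u32(uint32_t *mem, uint32_t wrap)
{
    uint32_t result;

    assert(mem != NULL);
    result = *mem;
    *mem = (result >= wrap) ? 0 : result + 1;
    return result;
}

/* Emulates the DWRAP read-modify-write on one 32-bit memory location:
 * returns the value read and stores `wrap` when the value is zero or
 * already above `wrap`, otherwise stores the value minus one. */
static uint32_t atoms_dwrap_u32(uint32_t *mem, uint32_t wrap)
{
    uint32_t result;

    assert(mem != NULL);
    result = *mem;
    *mem = (result == 0 || result > wrap) ? wrap : result - 1;
    return result;
}
```

     Starting from a memory value of 2 with an operand of 3, two IWRAP
     operations return 2 and then 3, leaving memory at 0.  A following DWRAP
     returns 0 and leaves memory at 3.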


     Section 2.X.8.Z, BAR:  Execution Barrier

     The BAR instruction synchronizes the execution of compute shader
     invocations within a local work group.  When a compute shader invocation
     executes the BAR instruction, it pauses until the same BAR instruction has
     been executed by all invocations in the current local work group.  Once
     all invocations have executed the BAR instruction, processing continues
     with the instruction following the BAR instruction.

     There is no compile-time restriction on the locations in a program where
     BAR is allowed.  However, BAR instructions are not allowed in divergent
     flow control; if any compute shader invocation in the work group executes
     the BAR instruction, all compute shader invocations must execute the
     instruction.  Results of executing a BAR instruction are undefined and can
     result in application hangs and/or program termination if the instruction
     is issued:

       * inside any IF/ELSE/ENDIF block where the results of the condition
         evaluated by the IF instruction are not identical across the work
         group;

       * inside any iteration of a REP/ENDREP block where at least one
         invocation in the work group has skipped to the next iteration using
         the CONT instruction, exited the loop using a BRK or RET instruction,
         or exited the loop due to having completed the requested number of
         loop iterations; or

       * inside any subroutine (including main) where at least one invocation
         in the work group has exited the subroutine using the RET instruction.

     BAR has no operands and generates no result.
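
     To illustrate where barriers would typically be placed, the following C
     sketch sequentially emulates a one-dimensional work group that sums its
     inputs through shared memory.  It is an illustration only and assumes a
     hypothetical work group size of 8.  Each loop over "id" stands in for
     the code that every invocation runs between two barriers, and finishing
     that loop plays the role of the BAR instruction.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_INVOCATIONS 8u /* hypothetical one-dimensional group size */

/* Sequential emulation of a work group tree reduction.  Returns the sum
 * of the GROUP_INVOCATIONS input values. */
static unsigned emulate_group_sum(const unsigned input[GROUP_INVOCATIONS])
{
    unsigned shared[GROUP_INVOCATIONS]; /* stands in for shared memory */

    assert(input != NULL);

    /* Phase 1: every invocation stores its own value (an STS). */
    for (unsigned id = 0; id < GROUP_INVOCATIONS; ++id)
        shared[id] = input[id];
    /* A BAR here would ensure all stores happen before any load below. */

    /* Later phases: each step halves the number of active invocations. */
    for (unsigned stride = GROUP_INVOCATIONS / 2; stride > 0; stride /= 2) {
        for (unsigned id = 0; id < stride; ++id)
            shared[id] += shared[id + stride]; /* LDS, add, STS */
        /* A BAR here separates this step's stores from the next loads. */
    }
    return shared[0];
}
```

     In a real program, every invocation would run the same number of outer
     loop iterations, and only the load, add, and store would be conditional
     on the invocation's index, so that the BAR itself is never executed in
     divergent flow control.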


     Section 2.X.8.Z, LDS:  Load from Shared Memory

     The LDS instruction generates a result vector by fetching data from the
     shared memory for the current local work group identified by the first
     operand, as described in Section 2.X.4.5.  The single operand for the LDS
     instruction must correspond to a shader shared memory variable declared
     using the "SHARED" statement; a program will fail to load if any other
     type of operand is used in an LDS instruction.

       result = SharedMemoryLoad(op0, storageModifier);

     LDS supports no base data type modifiers, but requires exactly one storage
     modifier.  The base data type of the result vector is derived from the
     storage modifier.


     Replace Section 2.X.8.Z, MEMBAR:  Memory Barrier, as added by
     EXT_shader_image_load_store

     The MEMBAR instruction synchronizes memory transactions to ensure that
     memory transactions resulting from any instruction executed by the thread
     prior to the MEMBAR instruction complete prior to any memory transactions
     issued after the instruction, as observed by other shader invocations.

     The MEMBAR instruction has one optional instruction modifier.  If the CTA
     instruction modifier is specified, memory transactions before and after
     the barrier will be strongly ordered as observed by other shader
     invocations in the same local work group.  However, it does not order
     transactions as viewed by any other shader.  With the CTA modifier,
     shaders not in the local work group may observe the results of memory
     transactions issued after the MEMBAR instruction before those issued
     before the MEMBAR instruction.  If the CTA instruction modifier is not
     specified, all shader invocations will see the results of any memory
     transaction issued before the MEMBAR instruction before those issued after
     the MEMBAR instruction.

     MEMBAR has no operands and generates no result.


     Section 2.X.8.Z, STS:  Store to Shared Memory

     The STS instruction writes the contents of the first vector operand to
     shared memory for the current local work group identified by the second
     operand, as described in Section 2.X.4.5.  This instruction generates no
     result.  The second operand for the STS instruction must correspond to a
     shared memory variable declared using the "SHARED" statement; a program
     will fail to load if any other type of operand is used in an STS
     instruction.

       tmp0 = VectorLoad(op0);
       SharedMemoryStore(op1, tmp0, storageModifier);

     STS supports no base data type modifiers, but requires exactly one storage
     modifier.  The base data type of the vector components of the first
     operand is derived from the storage modifier.


 Additions to Chapter 3 of the OpenGL 4.2 (Compatibility Profile) Specification
 (Rasterization)

     None.

 Additions to Chapter 4 of the OpenGL 4.2 (Compatibility Profile) Specification
 (Per-Fragment Operations and the Frame Buffer)

     None.

 Additions to Chapter 5 of the OpenGL 4.2 (Compatibility Profile) Specification
 (Special Functions)

     None.

 Additions to Chapter 6 of the OpenGL 4.2 (Compatibility Profile) Specification
 (State and State Requests)

     None.

 Additions to the AGL/GLX/WGL Specifications

     None.

 GLX Protocol

     None.

 Dependencies on NV_shader_atomic_float

     If NV_shader_atomic_float is not supported, the ADD and EXCH atomic
     operations in the ATOMS instruction do not support the "F32" storage
     modifier.

 Dependencies on EXT_shader_image_load_store

     If EXT_shader_image_load_store is not supported, language describing the
     "CTA" instruction modifier and modifying the MEMBAR instruction (as added
     by EXT_shader_image_load_store) should be removed.

 Errors

     None.

 New State

     (Modify ARB_vertex_program, Table X.6 -- Program State)

                                                       Initial
     Get Value                    Type    Get Command  Value   Description               Sec.    Attribute
     ---------                    ------- -----------  ------- ------------------------  ------  ---------
     COMPUTE_PROGRAM_PARAMETER_   Z+      GetIntegerv  0       Active compute program    2.14.1  -
       BUFFER_NV                                               buffer object binding
     COMPUTE_PROGRAM_PARAMETER_   nxZ+    GetInteger-  0       Buffer objects bound for  2.14.1  -
       BUFFER_NV                          IndexedvEXT          compute program use

     Also shares buffer bindings and other state with the ARB_compute_shader
     extension.

 New Implementation Dependent State

     None, but shares implementation-dependent state with the
     ARB_compute_shader extension.

 Issues

     None.

 Revision History

     Rev.    Date    Author    Changes
     ----  --------  --------  --------------------------------------------
      2    10/23/12  pbrown    Remove the restriction forbidding the use of BAR
                               inside potentially divergent flow control.
                               Instead, we will allow BAR to be executed
                               anywhere, but specify undefined results
                               (including hangs or program termination) if the
                               flow control is divergent (bug 9367).

      1              pbrown    Internal spec development.