 Name

     NV_shader_thread_group

 Name Strings

     GL_NV_shader_thread_group

 Contributors

     Jeannot Breton, NVIDIA
     Pat Brown, NVIDIA
     Eric Werness, NVIDIA
     Mark Kilgard, NVIDIA

 Contact

     Jeannot Breton, NVIDIA Corporation (jbreton 'at' nvidia.com)

 Status

     Shipping.

 Version

     Last Modified Date:         7/21/2015
     NVIDIA Revision:            4

 Number

     OpenGL Extension #447

 Dependencies

     This extension is written against the OpenGL 4.3 (Compatibility Profile)
     Specification.

     This extension is written against version 4.30 (revision 07) of the OpenGL
     Shading Language Specification.

     OpenGL 4.3 and GLSL 4.3 are required.

     This extension interacts with NV_gpu_program5

     This extension interacts with NV_compute_program5

     This extension interacts with NV_tessellation_program5

 Overview

     Implementations of the OpenGL Shading Language may, but are not required
     to, run multiple shader threads for a single stage as a SIMD thread group,
     where individual execution threads are assigned to thread groups in an
     undefined, implementation-dependent order.  This extension provides a set
     of new features to the OpenGL Shading Language to query thread states and
     to share data between fragments within a 2x2 pixel quad.

     More specifically, the following functionality is added:

     *   New uniform variables and tokens to query the number of threads in a
         warp, the number of warps running on an SM, and the number of SMs on
         the GPU.

     *   New shader inputs to query the thread id, the warp id and the SM id.

     *   New shader inputs to query if a fragment shader thread is a helper
         thread.

     *   New shader built-in functions to query the state of a Boolean condition
         over all threads in a thread group.

     *   New shader built-in functions to query which threads are active within
         a thread group.

     *   New fragment shader built-in functions to share data between fragments
         within a 2x2 pixel quad.

     Shaders using the new functionality provided by this extension should
     enable it via the construct

         #extension GL_NV_shader_thread_group : require     (or enable)

     This extension also specifies some modifications to the program assembly
     language to support the thread state query and thread data sharing
     functionalities.

     Note that in this extension specification warp and thread group have the
     same meaning.  A warp is a group of threads that get executed in lockstep.
     Each thread in a warp executes the same instruction of a program, but on
     different data.

 New Procedures and Functions

     None


 New Tokens

     Accepted by the <pname> parameter of GetBooleanv, GetIntegerv,
     GetFloatv, and GetDoublev:

         WARP_SIZE_NV                                    0x9339
         WARPS_PER_SM_NV                                 0x933A
         SM_COUNT_NV                                     0x933B


 Modifications to The OpenGL Shading Language Specification, Version 4.30
 (Revision 07)

     Including the following line in a shader can be used to control the
     language features described in this extension:

       #extension GL_NV_shader_thread_group : <behavior>

     where <behavior> is as specified in section 3.3.

     New preprocessor #defines are added to the OpenGL Shading Language:

       #define GL_NV_shader_thread_group         1

     Modify Section 7.1, Built-In Language Variables, p. 110

     (Add to the list of built-in variables for the compute, vertex, geometry,
      tessellation control, tessellation evaluation and fragment languages)

         in uint  gl_ThreadInWarpNV;
         in uint  gl_ThreadEqMaskNV;
         in uint  gl_ThreadGeMaskNV;
         in uint  gl_ThreadGtMaskNV;
         in uint  gl_ThreadLeMaskNV;
         in uint  gl_ThreadLtMaskNV;
         in uint  gl_WarpIDNV;
         in uint  gl_SMIDNV;

     (Add to the list of built-in variables for the fragment language)

         in bool  gl_HelperThreadNV;

     (Add these paragraphs at the end of this section)

     The variable gl_ThreadInWarpNV holds the id of the thread within the
     thread group (or warp).  This variable is in the range 0 to
     gl_WarpSizeNV-1, where gl_WarpSizeNV is the total number of threads in a
     warp.

     The variable gl_ThreadEqMaskNV is a bitfield in which the bit equal to the
     current thread id is set.  The variable gl_ThreadGeMaskNV is a bitfield in
     which bits greater than or equal to the current thread id are set.  The
     variable gl_ThreadGtMaskNV is a bitfield in which bits greater than the
     current thread id are set.  The variable gl_ThreadLeMaskNV is a bitfield
     in which bits lower than or equal to the current thread id are set.  The
     variable gl_ThreadLtMaskNV is a bitfield in which bits lower than the
     current thread id are set.

     The values of gl_ThreadEqMaskNV, gl_ThreadGeMaskNV, gl_ThreadGtMaskNV,
     gl_ThreadLeMaskNV and gl_ThreadLtMaskNV are derived from the value of
     gl_ThreadInWarpNV using simple bit-shift arithmetic; they do not take
     into account the value of the thread group active mask.  For example, if
     the application wants a bitfield in which bits lower than or equal to the
     current thread id are set only for active threads, the value of
     gl_ThreadLeMaskNV will need to be ANDed with the thread group active mask.
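
     The bit-shift arithmetic behind the five mask variables can be sketched
     on the host as follows.  This is only an illustrative C model, not part
     of the extension: the names ThreadMasks, thread_masks and le_mask_active
     are hypothetical, and a 32-thread warp (thread id 0 to 31) is assumed so
     that each mask fits a 32-bit uint.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side model of the five gl_Thread*MaskNV variables.
   Assumption: the warp has 32 threads, so thread_id is in 0..31. */
typedef struct {
    uint32_t eq, ge, gt, le, lt;
} ThreadMasks;

static ThreadMasks thread_masks(uint32_t thread_id)
{
    ThreadMasks m;
    assert(thread_id < 32u);
    m.eq = UINT32_C(1) << thread_id;  /* only the bit of this thread     */
    m.lt = m.eq - 1u;                 /* bits below the thread id        */
    m.le = m.lt | m.eq;               /* bits at or below the thread id  */
    m.gt = ~m.le;                     /* bits above the thread id        */
    m.ge = ~m.lt;                     /* bits at or above the thread id  */
    return m;
}

/* The masks ignore the active mask, so restrict the result explicitly
   when only active threads should be counted. */
static uint32_t le_mask_active(uint32_t thread_id, uint32_t active_mask)
{
    return thread_masks(thread_id).le & active_mask;
}
```

     For thread id 5 the model yields an "le" mask of 0x3F; with an active
     mask of 0x2A, only bits 1, 3 and 5 survive the AND.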

     The variable gl_WarpIDNV holds the warp id of the executing thread.  This
     variable is in the range 0 to gl_WarpsPerSMNV-1, where gl_WarpsPerSMNV is
     the maximum number of warps executing on an SM.

     The variable gl_SMIDNV holds the SM id of the executing thread.  This
     variable is in the range 0 to gl_SMCountNV-1, where gl_SMCountNV is the
     number of SMs on the GPU.

     The variable gl_HelperThreadNV specifies if the current thread is a helper
     thread.  In implementations supporting this extension, fragment shader
     invocations may be arranged in SIMD thread groups of 2x2 fragments called
     "quads".  When a fragment shader instruction is executed on a quad, it's
     possible that some fragments within the quad will execute the instruction
     even if they are not covered by the primitive.  Those threads are called
     helper threads.  Their outputs will be discarded and they will not execute
     global store functions, but the intermediate values they compute can still
     be used by thread group sharing functions or by fragment derivative
     functions like dFdx and dFdy.


     Modify Section 7.4, Built-In Uniform State, p. 125

     (Add to the list of built-in uniform variable declarations)

         uniform uint  gl_WarpSizeNV;
         uniform uint  gl_WarpsPerSMNV;
         uniform uint  gl_SMCountNV;

     (Add this paragraph at the end of this section)

     The variable gl_WarpSizeNV is the total number of threads in a warp.  The
     variable gl_WarpsPerSMNV is the maximum number of warps executing on an
     SM.  The variable gl_SMCountNV is the number of SMs on the GPU.
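
     The ranges given for gl_ThreadInWarpNV, gl_WarpIDNV and gl_SMIDNV mean
     that the three ids can be flattened into a single index below
     gl_SMCountNV * gl_WarpsPerSMNV * gl_WarpSizeNV.  The C model below is a
     sketch of that arithmetic only.  The function name resident_thread_slot
     is hypothetical, and the sketch assumes that the id triple is unique
     among simultaneously resident threads, which this specification does
     not state explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: flatten (SM id, warp id, thread id) into one index.
   The three limits stand in for gl_SMCountNV, gl_WarpsPerSMNV and
   gl_WarpSizeNV; the caller supplies whatever the implementation reports. */
static uint32_t resident_thread_slot(uint32_t sm_id,
                                     uint32_t warp_id,
                                     uint32_t thread_in_warp,
                                     uint32_t sm_count,
                                     uint32_t warps_per_sm,
                                     uint32_t warp_size)
{
    /* Each id must lie inside the range documented for it. */
    assert(sm_id < sm_count);
    assert(warp_id < warps_per_sm);
    assert(thread_in_warp < warp_size);
    return (sm_id * warps_per_sm + warp_id) * warp_size + thread_in_warp;
}
```

     With made-up limits of 16 SMs, 64 warps per SM and 32 threads per warp,
     the largest id triple (15, 63, 31) maps to index 32767, one less than
     the product of the three limits.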


     Modify Section 8.3, Common Functions, p. 133

     (add a function to query which threads are active within a thread group)

     Syntax:

       uint  activeThreadsNV(void)

     In the value returned by activeThreadsNV(), bit <N> is set to 1 if the
     corresponding thread in the SIMD thread group is executing the call to
     activeThreadsNV() and 0 otherwise.  A bit in the return value may be set
     to zero due to conditional flow control (e.g., returning from a function,
     executing the "else" part of an "if" statement) or because the SIMD
     thread group was dispatched without a full collection of threads.

     (add a function to query the state of a Boolean condition over all the
     threads in a thread group)

     Syntax:

       uint  ballotThreadNV(bool value)

     The function ballotThreadNV() computes a 32-bit bitfield.  It looks at the
     condition <value> for each active thread of a thread group and sets to 1
     each bit for which the condition in the corresponding thread is true.  Bits
     for threads with a false condition are set to 0.  Bits for inactive threads
     are also set to 0.  The active thread mask can be queried by calling the
     function activeThreadsNV().
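
     The ballot rule above can be modeled on the host with a simple loop.
     The C sketch below is illustrative only: ballot_model and
     rank_among_voters are hypothetical names, the per-thread conditions and
     the active mask are passed in explicitly, and a 32-thread warp is
     assumed.  The second function shows one possible use of a ballot
     result, combining it with a "lower than" mask to count the voting
     threads that precede a given thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { MODEL_WARP_SIZE = 32 };  /* assumption: one result bit per thread */

/* Model of ballotThreadNV(): bit n of the result is 1 only when thread n is
   active (bit n of active_mask is set) and its condition is true. */
static uint32_t ballot_model(const bool cond[MODEL_WARP_SIZE],
                             uint32_t active_mask)
{
    uint32_t result = 0;
    for (int n = 0; n < MODEL_WARP_SIZE; ++n) {
        bool active = ((active_mask >> n) & 1u) != 0;
        if (active && cond[n])
            result |= UINT32_C(1) << n;
    }
    return result;
}

/* Illustrative use, not defined by the extension: the number of threads
   with a lower id that voted true, i.e. popcount(ballot & lt_mask). */
static unsigned rank_among_voters(uint32_t ballot, unsigned thread_id)
{
    uint32_t below;
    unsigned count = 0;
    assert(thread_id < (unsigned)MODEL_WARP_SIZE);
    below = ballot & ((UINT32_C(1) << thread_id) - 1u);
    while (below != 0) {
        below &= below - 1u;  /* clear the lowest set bit */
        ++count;
    }
    return count;
}
```

     If threads 0, 3 and 7 vote true and all threads are active, the model
     returns 0x89; if only threads 0 to 3 are active, it returns 0x09.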

     (add a function to share data between fragments in a quad)

     Syntax:

         float  quadSwizzle0NV(float swizzledValue, [float unswizzledValue])
         vec2   quadSwizzle0NV(vec2  swizzledValue, [vec2  unswizzledValue])
         vec3   quadSwizzle0NV(vec3  swizzledValue, [vec3  unswizzledValue])
         vec4   quadSwizzle0NV(vec4  swizzledValue, [vec4  unswizzledValue])

         float  quadSwizzle1NV(float swizzledValue, [float unswizzledValue])
         vec2   quadSwizzle1NV(vec2  swizzledValue, [vec2  unswizzledValue])
         vec3   quadSwizzle1NV(vec3  swizzledValue, [vec3  unswizzledValue])
         vec4   quadSwizzle1NV(vec4  swizzledValue, [vec4  unswizzledValue])

         float  quadSwizzle2NV(float swizzledValue, [float unswizzledValue])
         vec2   quadSwizzle2NV(vec2  swizzledValue, [vec2  unswizzledValue])
         vec3   quadSwizzle2NV(vec3  swizzledValue, [vec3  unswizzledValue])
         vec4   quadSwizzle2NV(vec4  swizzledValue, [vec4  unswizzledValue])

         float  quadSwizzle3NV(float swizzledValue, [float unswizzledValue])
         vec2   quadSwizzle3NV(vec2  swizzledValue, [vec2  unswizzledValue])
         vec3   quadSwizzle3NV(vec3  swizzledValue, [vec3  unswizzledValue])
         vec4   quadSwizzle3NV(vec4  swizzledValue, [vec4  unswizzledValue])

         float  quadSwizzleXNV(float swizzledValue, [float unswizzledValue])
         vec2   quadSwizzleXNV(vec2  swizzledValue, [vec2  unswizzledValue])
         vec3   quadSwizzleXNV(vec3  swizzledValue, [vec3  unswizzledValue])
         vec4   quadSwizzleXNV(vec4  swizzledValue, [vec4  unswizzledValue])

         float  quadSwizzleYNV(float swizzledValue, [float unswizzledValue])
         vec2   quadSwizzleYNV(vec2  swizzledValue, [vec2  unswizzledValue])
         vec3   quadSwizzleYNV(vec3  swizzledValue, [vec3  unswizzledValue])
         vec4   quadSwizzleYNV(vec4  swizzledValue, [vec4  unswizzledValue])

     In implementations supporting this extension, if a primitive covers a
     fragment at (x,y), its fragment shader invocation will be arranged in a
     SIMD thread group with fragment shader invocations corresponding to three
     neighboring pixels.  These four invocations are arranged in a 2x2 grid,
     called a "quad".  If the neighbors of a fragment are not covered by the
     primitive, fragment shader invocations will still be generated.  The
     implementation may compute differences between values in these threads to
     estimate derivatives for dFdx(), dFdy(), and for texture lookups with
     automatic LOD calculations.

     Fragments may have different locations in the quads based on the type of
     render target.

     When rendering to a window, fragments within a quad follow this pattern:

         ---------------------------------------------------
         | gl_ThreadInWarpNV 4N+0 | gl_ThreadInWarpNV 4N+1 |
         |     pixel (X+0,Y+1)    |     pixel (X+1,Y+1)    |
         ---------------------------------------------------
         | gl_ThreadInWarpNV 4N+2 | gl_ThreadInWarpNV 4N+3 |
         |     pixel (X+0,Y+0)    |     pixel (X+1,Y+0)    |
         ---------------------------------------------------


     When rendering to a framebuffer object, fragments within a quad follow this
     pattern:

         ---------------------------------------------------
         | gl_ThreadInWarpNV 4N+2 | gl_ThreadInWarpNV 4N+3 |
         |     pixel (X+0,Y+1)    |     pixel (X+1,Y+1)    |
         ---------------------------------------------------
         | gl_ThreadInWarpNV 4N+0 | gl_ThreadInWarpNV 4N+1 |
         |     pixel (X+0,Y+0)    |     pixel (X+1,Y+0)    |
         ---------------------------------------------------
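
     The two layouts above differ only in the vertical order of the rows.
     The following C sketch, with the hypothetical names QuadOffset and
     quad_pixel_offset, maps a gl_ThreadInWarpNV value to the pixel offset
     (dx, dy) relative to (X, Y) as drawn in the two diagrams.  It models the
     diagrams only and makes no claim about other render target types.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int dx;  /* 0 or 1: offset from X */
    int dy;  /* 0 or 1: offset from Y */
} QuadOffset;

/* Hypothetical helper: position of a thread inside its 2x2 quad.
   render_to_fbo selects the framebuffer object layout; otherwise the
   window layout is used. */
static QuadOffset quad_pixel_offset(unsigned thread_in_warp, bool render_to_fbo)
{
    unsigned lane = thread_in_warp & 3u;  /* 4N+k -> k */
    unsigned row  = lane >> 1;            /* 0 for k = 0,1; 1 for k = 2,3 */
    QuadOffset o;
    o.dx = (int)(lane & 1u);
    /* Window: threads 4N+0 and 4N+1 sit on row Y+1.
       FBO:    threads 4N+0 and 4N+1 sit on row Y+0. */
    o.dy = render_to_fbo ? (int)row : 1 - (int)row;
    assert(o.dy == 0 || o.dy == 1);
    return o;
}
```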

     There are six quadSwizzle functions that allow fragments within a quad to
     exchange data.  All of these functions read a floating-point operand
     <swizzledValue>, which can come from any fragment in the quad.  An
     optional second floating-point operand <unswizzledValue>, which comes
     from the current fragment, can be added to <swizzledValue>.  The
     functions differ only in which fragment of the 2x2 pixel quad supplies
     the <swizzledValue> operand.

     quadSwizzle0NV will read the <swizzledValue> operand from the fragment 0:

         result[thread N] = swizzledValue[thread 0] + unswizzledValue[thread N]


     quadSwizzle1NV will read the <swizzledValue> operand from the fragment 1:

         result[thread N] = swizzledValue[thread 1] + unswizzledValue[thread N]


     quadSwizzle2NV will read the <swizzledValue> operand from the fragment 2:

         result[thread N] = swizzledValue[thread 2] + unswizzledValue[thread N]


     quadSwizzle3NV will read the <swizzledValue> operand from the fragment 3:

         result[thread N] = swizzledValue[thread 3] + unswizzledValue[thread N]


     quadSwizzleXNV will read the <swizzledValue> operand for each fragment
     from its neighbor in X:

         result[thread 0] = swizzledValue[thread 1] + unswizzledValue[thread 0]
         result[thread 1] = swizzledValue[thread 0] + unswizzledValue[thread 1]
         result[thread 2] = swizzledValue[thread 3] + unswizzledValue[thread 2]
         result[thread 3] = swizzledValue[thread 2] + unswizzledValue[thread 3]


     quadSwizzleYNV will read the <swizzledValue> operand for each fragment
     from its neighbor in Y:

         result[thread 0] = swizzledValue[thread 2] + unswizzledValue[thread 0]
         result[thread 1] = swizzledValue[thread 3] + unswizzledValue[thread 1]
         result[thread 2] = swizzledValue[thread 0] + unswizzledValue[thread 2]
         result[thread 3] = swizzledValue[thread 1] + unswizzledValue[thread 3]


     If any thread in a 2x2 pixel quad is inactive, the quad is divergent.  In
     this case quadSwizzle will return 0 for all fragments in the quad.
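
     The six result equations and the divergence rule can be collected into
     one host-side reference model.  The C sketch below is illustrative
     only: QuadSwizzleMode and quad_swizzle_model are hypothetical names,
     the four per-thread values of a quad are passed as arrays, and an
     omitted <unswizzledValue> is modeled as adding 0, which is an assumption
     about the optional operand.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    QSWZ_0 = 0, QSWZ_1, QSWZ_2, QSWZ_3,  /* read from a fixed fragment */
    QSWZ_X,                              /* read from the neighbor in X */
    QSWZ_Y                               /* read from the neighbor in Y */
} QuadSwizzleMode;

/* Hypothetical model of the quadSwizzle*NV functions for one quad.
   active_mask holds one bit per thread of the quad (bits 0..3).
   unswizzled may be NULL to model the omitted optional operand.
   result must not alias swizzled. */
static void quad_swizzle_model(QuadSwizzleMode mode,
                               const float swizzled[4],
                               const float *unswizzled,
                               unsigned active_mask,
                               float result[4])
{
    int divergent = (active_mask & 0xFu) != 0xFu;
    assert(swizzled != NULL && result != NULL);
    for (int n = 0; n < 4; ++n) {
        int src;
        float add = unswizzled ? unswizzled[n] : 0.0f;
        switch (mode) {
        case QSWZ_X: src = n ^ 1; break;      /* 0<->1, 2<->3 */
        case QSWZ_Y: src = n ^ 2; break;      /* 0<->2, 1<->3 */
        default:     src = (int)mode; break;  /* fixed fragment 0..3 */
        }
        /* A divergent quad returns 0 for every fragment. */
        result[n] = divergent ? 0.0f : swizzled[src] + add;
    }
}
```

     With swizzled values {1, 2, 4, 8} and unswizzled values
     {10, 20, 30, 40}, the X mode gives {12, 21, 38, 44} and the Y mode
     gives {14, 28, 31, 42}.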


 Dependencies on NV_gpu_program5

     If NV_gpu_program5 is supported and "OPTION NV_shader_thread_group" is
     specified in an assembly program, the following edits are made to extend
     the assembly programming model documented in the NV_gpu_program4 extension
     and extended by NV_gpu_program5.

     If NV_gpu_program5 is not supported, or if "OPTION NV_shader_thread_group"
     is not specified in an assembly program, the contents of this dependencies
     section should be ignored.

     Modify Section 2.X.2, Program Grammar

     (add the following rules to the NV_gpu_program4 and
      NV_gpu_program5 base grammars)

     <VECTORop>              ::= "TGBALLOT"

     <stateSingleItem>       ::= "state" "." <stateThreadItem>

     <stateThreadItem>       ::= "thread" "." <stateThreadProperty>

     <stateThreadProperty>   ::= "warpsize"
                               | "warpspersm"
                               | "smcount"

     (add/change the following rules to the NV_fragment_program4 and
      NV_gpu_program5 base grammars)

     <VECTORop>              ::= "QSWZ0"
                               | "QSWZ1"
                               | "QSWZ2"
                               | "QSWZ3"
                               | "QSWZX"
                               | "QSWZY"

     <attribBasic>           ::= <fragPrefix> "threadid"
                               | <fragPrefix> "threadeqmask"
                               | <fragPrefix> "threadltmask"
                               | <fragPrefix> "threadlemask"
                               | <fragPrefix> "threadgtmask"
                               | <fragPrefix> "threadgemask"
                               | <fragPrefix> "warpid"
                               | <fragPrefix> "smid"
                               | <fragPrefix> "helperthread"

     (add/change the following rules to the NV_vertex_program4 and
      NV_gpu_program5 base grammars)

     <attribBasic>           ::= <vtxPrefix> "threadid"
                               | <vtxPrefix> "threadeqmask"
                               | <vtxPrefix> "threadltmask"
                               | <vtxPrefix> "threadlemask"
                               | <vtxPrefix> "threadgtmask"
                               | <vtxPrefix> "threadgemask"
                               | <vtxPrefix> "warpid"
                               | <vtxPrefix> "smid"

     (add/change the following rules to the NV_geometry_program4 and
      NV_gpu_program5 base grammars)

     <attribBasic>           ::= <primPrefix> "threadid"
                               | <primPrefix> "threadeqmask"
                               | <primPrefix> "threadltmask"
                               | <primPrefix> "threadlemask"
                               | <primPrefix> "threadgtmask"
                               | <primPrefix> "threadgemask"
                               | <primPrefix> "warpid"
                               | <primPrefix> "smid"

     Modify Section 2.X.3.2 of the NV_gpu_program4 specification, Program
     Attribute Variables.

     (Add the table entries and relevant text describing the fragment program
      input variables used to query thread states.)

       Fragment Attribute Binding  Components  Underlying State
       --------------------------  ----------  ----------------------------
       ...
       fragment.threadid           (id,-,-,-)  id of the current thread
       fragment.threadeqmask       (m,-,-,-)   mask with the current thread
       fragment.threadltmask       (m,-,-,-)   mask with lower thread
       fragment.threadlemask       (m,-,-,-)   mask with lower or equal thread
       fragment.threadgtmask       (m,-,-,-)   mask with greater thread
       fragment.threadgemask       (m,-,-,-)   mask with greater or equal thread
       fragment.warpid             (id,-,-,-)  warp id of the current thread
       fragment.smid               (id,-,-,-)  SM id of the current thread
       fragment.helperthread       (k,-,-,-)   current thread is a helper thread
       ...

     If a fragment attribute binding matches "fragment.threadid", the "x"
     component is filled with the thread id of the current thread.  The thread
     id is an unsigned integer in the range 0 to 31.

     If a fragment attribute binding matches "fragment.threadeqmask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which the
     bit equal to the current thread id is set.

     If a fragment attribute binding matches "fragment.threadltmask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     lower than the current thread id are set.

     If a fragment attribute binding matches "fragment.threadlemask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     lower than or equal to the current thread id are set.

     If a fragment attribute binding matches "fragment.threadgtmask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     greater than the current thread id are set.

     If a fragment attribute binding matches "fragment.threadgemask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     greater than or equal to the current thread id are set.

     If a fragment attribute binding matches "fragment.warpid", the "x"
     component is filled with the warp id of the current thread.  The warp id is
     an unsigned integer; the range of this value is hardware-dependent.

     If a fragment attribute binding matches "fragment.smid", the "x" component
     is filled with the SM id of the current thread.  The SM id is an unsigned
     integer; the range of this value is hardware-dependent.

     If a fragment attribute binding matches "fragment.helperthread", the "x"
     component is an integer value equal to -1 when the current thread is a
     helper thread and 0 otherwise.  In implementations supporting this
     extension, fragment program invocations may be arranged in SIMD thread
     groups of 2x2 fragments called "quads".  When a fragment program instruction
     is executed on a quad, it's possible that some fragments within the quad
     will execute the instruction even if they are not covered by the primitive.
     Those threads are called helper threads.  Their outputs will be discarded
     and they will not execute global store instructions, but the intermediate
     values they compute can still be used by thread group sharing instructions
     or by fragment derivative instructions like DDX and DDY.

     (Add the table entries and relevant text describing the vertex program
      attribute variables used to query thread states.)

       Vertex Attribute Binding  Components  Underlying State
       ------------------------  ----------  ----------------------------
       ...
       vertex.threadid           (id,-,-,-)  id of the current thread
       vertex.threadeqmask       (m,-,-,-)   mask with the current thread
       vertex.threadltmask       (m,-,-,-)   mask with lower thread
       vertex.threadlemask       (m,-,-,-)   mask with lower or equal thread
       vertex.threadgtmask       (m,-,-,-)   mask with greater thread
       vertex.threadgemask       (m,-,-,-)   mask with greater or equal thread
       vertex.warpid             (id,-,-,-)  warp id of the current thread
       vertex.smid               (id,-,-,-)  SM id of the current thread
       ...

     If a vertex attribute binding matches "vertex.threadid", the "x" component
     is filled with the thread id of the current thread.  The thread id is an
     unsigned integer in the range 0 to 31.

     If a vertex attribute binding matches "vertex.threadeqmask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which the
     bit equal to the current thread id is set.

     If a vertex attribute binding matches "vertex.threadltmask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     lower than the current thread id are set.

     If a vertex attribute binding matches "vertex.threadlemask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     lower than or equal to the current thread id are set.

     If a vertex attribute binding matches "vertex.threadgtmask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     greater than the current thread id are set.

     If a vertex attribute binding matches "vertex.threadgemask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     greater than or equal to the current thread id are set.

     If a vertex attribute binding matches "vertex.warpid", the "x" component is
     filled with the warp id of the current thread.  The warp id is an unsigned
     integer; the range of this value is hardware-dependent.

     If a vertex attribute binding matches "vertex.smid", the "x" component
     is filled with the SM id of the current thread.  The SM id is an unsigned
     integer; the range of this value is hardware-dependent.


     (Add the table entries and relevant text describing the geometry program
      attribute variables used to query thread states.)

       Geometry Attribute Binding  Components  Underlying State
       --------------------------  ----------  ----------------------------
       ...
       primitive.threadid          (id,-,-,-)  id of the current thread
       primitive.threadeqmask      (m,-,-,-)   mask with the current thread
       primitive.threadltmask      (m,-,-,-)   mask with lower thread
       primitive.threadlemask      (m,-,-,-)   mask with lower or equal thread
       primitive.threadgtmask      (m,-,-,-)   mask with greater thread
       primitive.threadgemask      (m,-,-,-)   mask with greater or equal thread
       primitive.warpid            (id,-,-,-)  warp id of the current thread
       primitive.smid              (id,-,-,-)  SM id of the current thread
       ...

     If a geometry attribute binding matches "primitive.threadid", the "x"
     component is filled with the thread id of the current thread.  The thread
     id is an unsigned integer in the range 0 to 31.

     If a geometry attribute binding matches "primitive.threadeqmask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which the
     bit equal to the current thread id is set.

     If a geometry attribute binding matches "primitive.threadltmask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     lower than the current thread id are set.

     If a geometry attribute binding matches "primitive.threadlemask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     lower than or equal to the current thread id are set.

     If a geometry attribute binding matches "primitive.threadgtmask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     greater than the current thread id are set.

     If a geometry attribute binding matches "primitive.threadgemask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     greater than or equal to the current thread id are set.

     If a geometry attribute binding matches "primitive.warpid", the "x"
     component is filled with the warp id of the current thread.  The warp id is
     an unsigned integer; the range of this value is hardware-dependent.

     If a geometry attribute binding matches "primitive.smid", the "x" component
     is filled with the SM id of the current thread.  The SM id is an unsigned
     integer; the range of this value is hardware-dependent.


     (add the following subsection to section 2.X.3.3, Parameters)

     Thread Group Property Bindings

       Binding                        Components  Underlying State
       -----------------------------  ----------  ----------------------------
       state.thread.warpsize          (x,-,-,-)   total number of threads in a
                                                  warp
       state.thread.warpspersm        (x,-,-,-)   maximum number of warps
                                                  executing on an SM
       state.thread.smcount           (x,-,-,-)   number of SMs on the GPU

     If a program parameter binding matches "state.thread.warpsize", the "x"
     component of the program parameter variable is filled with an integer value
     indicating the total number of threads in a warp.  The "y", "z", and "w"
     components are undefined.

     If a program parameter binding matches "state.thread.warpspersm", the "x"
     component of the program parameter variable is filled with an integer value
     indicating the maximum number of warps executing on an SM.  The "y", "z",
     and "w" components are undefined.

     If a program parameter binding matches "state.thread.smcount", the "x"
     component of the program parameter variable is filled with an integer value
     indicating the number of SMs on the GPU.  The "y", "z", and "w" components
     are undefined.


     Modify Section 2.X.4, Program Execution Environment

     (Add the table entries and relevant text describing the program
      instruction to query thread conditions.)

       Instr-      Modifiers
       uction   V  F I C S H D  Out Inputs    Description
       -------  -- - - - - - -  --- --------  --------------------------------
       ...
       TGBALLOT 50 X X X X - - F  vu  v        query a boolean in thread group
       ...


     (Add the table entries and relevant text describing the fragment program
      instructions to exchange data between threads.)

       Instr-      Modifiers
       uction   V  F I C S H D  Out Inputs    Description
       -------  -- - - - - - -  --- --------  --------------------------------
       ...
       QSWZ0    50 X - - - - - F  v   v,v      add fragment 0 in a quad
       QSWZ1    50 X - - - - - F  v   v,v      add fragment 1 in a quad
       QSWZ2    50 X - - - - - F  v   v,v      add fragment 2 in a quad
       QSWZ3    50 X - - - - - F  v   v,v      add fragment 3 in a quad
       QSWZX    50 X - - - - - F  v   v,v      add fragments horizontally
       QSWZY    50 X - - - - - F  v   v,v      add fragments vertically
       ...


     (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension,
      as extended by NV_gpu_program5)

     + Shader thread group (NV_shader_thread_group)

     If a fragment program specifies the "NV_shader_thread_group" option, it
     may use the "fragment.threadid", "fragment.threadeqmask",
     "fragment.threadltmask", "fragment.threadlemask", "fragment.threadgtmask",
     "fragment.threadgemask", "fragment.warpid", "fragment.smid",
     "fragment.helperthread", "state.thread.warpsize", "state.thread.warpspersm"
     and "state.thread.smcount" bindings.  It may also use the "TGBALLOT",
     "QSWZ0", "QSWZ1", "QSWZ2", "QSWZ3", "QSWZX" and "QSWZY" instructions.  If
     this option is not specified, a program will fail to compile if it uses
     those instructions or bindings.

     If a vertex program specifies the "NV_shader_thread_group" option, it may
     use the "vertex.threadid", "vertex.threadeqmask", "vertex.threadltmask",
     "vertex.threadlemask", "vertex.threadgtmask", "vertex.threadgemask",
     "vertex.warpid", "vertex.smid", "state.thread.warpsize",
     "state.thread.warpspersm" and "state.thread.smcount" bindings.  It may also
     use the "TGBALLOT" instruction.  If this option is not specified, a program
     will fail to compile if it uses those instructions or bindings.

     If a geometry program specifies the "NV_shader_thread_group" option, it
     may use the "primitive.threadid", "primitive.threadeqmask",
     "primitive.threadltmask", "primitive.threadlemask",
     "primitive.threadgtmask", "primitive.threadgemask", "primitive.warpid",
     "primitive.smid", "state.thread.warpsize", "state.thread.warpspersm" and
     "state.thread.smcount" bindings.  It may also use the "TGBALLOT"
     instruction.  If this option is not specified, a program will fail to
     compile if it uses those instructions or bindings.

     Section 2.X.8.Z, QSWZ0:  add fragment 0 data to all fragments in a quad

     The QSWZ0 instruction produces a floating point result by adding the
     first operand, a floating point value from fragment 0, to the second
     operand, another floating point value from the current fragment.

     quadSwizzle0NV is the GLSL function that implements the same functionality
     as the QSWZ0 assembly instruction.  Section 8.3 of the OpenGL Shading
     Language Specification has more detail about the implementation of
     quadSwizzle0NV.  This additional information also applies to QSWZ0.


     Section 2.X.8.Z, QSWZ1:  add fragment 1 data to all fragments in a quad

     The QSWZ1 instruction produces a floating point result by adding the
     first operand, a floating point value from fragment 1, to the second
     operand, another floating point value from the current fragment.

     quadSwizzle1NV is the GLSL function that implements the same functionality
     as the QSWZ1 assembly instruction.  Section 8.3 of the OpenGL Shading
     Language Specification has more detail about the implementation of
     quadSwizzle1NV.  This additional information also applies to QSWZ1.


     Section 2.X.8.Z, QSWZ2:  add fragment 2 data to all fragment in a quad

     The QSWZ2 instruction produces a floating point result by adding the
     first operand, a floating point value from fragment 2, to the second
     operand, another floating point value from the current fragment.

     quadSwizzle2NV is the GLSL function that implements the same functionality
     as the QSWZ2 assembly instruction.  Section 8.3 of the OpenGL Shading
     Language Specification has more detail about the implementation of
     quadSwizzle2NV.  This additional information also applies to QSWZ2.


     Section 2.X.8.Z, QSWZ3:  add fragment 3 data to all fragments in a quad

     The QSWZ3 instruction produces a floating point result by adding the
     first operand, a floating point value from fragment 3, to the second
     operand, another floating point value from the current fragment.

     quadSwizzle3NV is the GLSL function that implements the same functionality
     as the QSWZ3 assembly instruction.  Section 8.3 of the OpenGL Shading
     Language Specification has more detail about the implementation of
     quadSwizzle3NV.  This additional information also applies to QSWZ3.
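
     The behavior of QSWZ0 through QSWZ3 can be illustrated with a CPU-side
     emulation of a single quad.  This is a minimal sketch, not driver code:
     the function name qswz_fixed and the array-based representation of the
     four fragments are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Emulates QSWZ<n> (n = 0..3) for one 2x2 quad.
 *   swizzled[i]   : first operand as held by fragment i of the quad
 *   unswizzled[i] : second operand as held by fragment i of the quad
 *   out[i]        : result observed by fragment i
 * Every fragment adds fragment n's first operand to its own second
 * operand. */
void qswz_fixed(int n, const float swizzled[4], const float unswizzled[4],
                float out[4])
{
    assert(n >= 0 && n < 4);
    for (int i = 0; i < 4; ++i) {
        out[i] = swizzled[n] + unswizzled[i];
    }
}
```

     Passing zero as the second operand in every fragment turns the
     instruction into a plain broadcast of fragment n's value.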


     Section 2.X.8.Z, QSWZX:  add fragments in a quad horizontally

     The QSWZX instruction produces a floating point result by adding the
     first operand, a floating point value from the current fragment's
     horizontal (X) neighbor in the quad, to the second operand, another
     floating point value from the current fragment.

     quadSwizzleXNV is the GLSL function that implements the same functionality
     as the QSWZX assembly instruction.  Section 8.3 of the OpenGL Shading
     Language Specification has more detail about the implementation of
     quadSwizzleXNV.  This additional information also applies to QSWZX.


     Section 2.X.8.Z, QSWZY:  add fragments in a quad vertically

     The QSWZY instruction produces a floating point result by adding the
     first operand, a floating point value from the current fragment's
     vertical (Y) neighbor in the quad, to the second operand, another
     floating point value from the current fragment.

     quadSwizzleYNV is the GLSL function that implements the same functionality
     as the QSWZY assembly instruction.  Section 8.3 of the OpenGL Shading
     Language Specification has more detail about the implementation of
     quadSwizzleYNV.  This additional information also applies to QSWZY.
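
     The neighbor-based variants can be sketched the same way.  The emulation
     below assumes that bit 0 of a fragment's index within the quad selects
     its column and bit 1 selects its row, so the X neighbor is (i ^ 1) and
     the Y neighbor is (i ^ 2); that indexing, and the names qswz_x, qswz_y
     and quad_sum, are assumptions made for illustration only.

```c
#include <assert.h>

/* QSWZX: each fragment adds its horizontal neighbor's first operand
 * to its own second operand. */
void qswz_x(const float swizzled[4], const float unswizzled[4], float out[4])
{
    for (int i = 0; i < 4; ++i) {
        out[i] = swizzled[i ^ 1] + unswizzled[i];
    }
}

/* QSWZY: each fragment adds its vertical neighbor's first operand
 * to its own second operand. */
void qswz_y(const float swizzled[4], const float unswizzled[4], float out[4])
{
    for (int i = 0; i < 4; ++i) {
        out[i] = swizzled[i ^ 2] + unswizzled[i];
    }
}

/* Sums a value over the whole quad in two steps: a horizontal add
 * followed by a vertical add.  Every fragment ends up with the total. */
void quad_sum(const float v[4], float out[4])
{
    float tmp[4];
    assert(v != 0 && out != 0);
    qswz_x(v, v, tmp);
    qswz_y(tmp, tmp, out);
}
```

     Chaining the two instructions, as quad_sum does, is one way to reduce a
     value across all four fragments without any memory traffic.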


     Section 2.X.8.Z, TGBALLOT:  query a boolean condition over a thread group

     The TGBALLOT instruction produces a result vector by reading a vector
     operand for each active thread in the current thread group and comparing
     each component to zero.  Each result vector component contains an integer
     bitmask (described below) in which a bit is set if the matching component
     of the operand vector is non-zero for the corresponding thread, and
     cleared otherwise.

     Sometimes, when the instruction is in a conditional control flow block or
     when it's not possible to completely fill a thread group, only a subset of
     the threads in the thread group will be active and will execute the
     TGBALLOT instruction.  Each bit in the bitfield corresponding to an
     inactive thread will be set to 0.  It's possible to query the active
     thread mask by calling TGBALLOT with 1 as the first operand.

       tmp = VectorLoad(op0);
       result = { 0, 0, 0, 0 };
       for (all active threads) {
         if ([thread]tmp.x != 0) result.x |= 1 << thread;
         if ([thread]tmp.y != 0) result.y |= 1 << thread;
         if ([thread]tmp.z != 0) result.z |= 1 << thread;
         if ([thread]tmp.w != 0) result.w |= 1 << thread;
       }
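
     The pseudocode above can be mirrored by a CPU-side emulation of a single
     result component.  This is a sketch under stated assumptions: the thread
     group is taken to hold 32 threads, and the function name
     tgballot_component and the explicit active_mask parameter are
     hypothetical stand-ins for state that the hardware tracks implicitly.

```c
#include <assert.h>
#include <stdint.h>

#define EMU_WARP_SIZE 32  /* assumption: 32 threads per thread group */

/* Emulates one component of TGBALLOT.
 *   values[t]   : the operand component held by thread t
 *   active_mask : bit t is set if thread t is active
 * Returns a bitmask in which bit t is set when thread t is active and
 * values[t] is non-zero.  Bits of inactive threads stay 0. */
uint32_t tgballot_component(const int32_t values[EMU_WARP_SIZE],
                            uint32_t active_mask)
{
    uint32_t result = 0;
    assert(values != 0);
    for (unsigned t = 0; t < EMU_WARP_SIZE; ++t) {
        int active = (int)((active_mask >> t) & 1u);
        if (active && values[t] != 0) {
            result |= UINT32_C(1) << t;
        }
    }
    return result;
}
```

     With every operand set to 1, the result equals active_mask, which is the
     active-thread query described above.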

 Dependencies on NV_tessellation_program5

     If NV_tessellation_program5 is supported and
     "OPTION NV_shader_thread_group" is specified in an assembly program, the
     following edits are made to extend the assembly programming model
     documented in the NV_gpu_program4 extension and extended by NV_gpu_program5
     and NV_tessellation_program5.

     If NV_tessellation_program5 is not supported, or if
     "OPTION NV_shader_thread_group" is not specified in an assembly program,
     the contents of this dependencies section should be ignored.


     Modify Section 2.X.2, Program Grammar

     (add/change the following rules to the NV_gpu_program5 base grammars for
      tessellation control programs)

     <attribBasic>           ::= <primPrefix> "threadid"
                               | <primPrefix> "threadeqmask"
                               | <primPrefix> "threadltmask"
                               | <primPrefix> "threadlemask"
                               | <primPrefix> "threadgtmask"
                               | <primPrefix> "threadgemask"
                               | <primPrefix> "warpid"
                               | <primPrefix> "smid"

     (add/change the following rules to the NV_gpu_program5 base grammars for
      tessellation evaluation programs)

     <attribBasic>           ::= <primPrefix> "threadid"
                               | <primPrefix> "threadeqmask"
                               | <primPrefix> "threadltmask"
                               | <primPrefix> "threadlemask"
                               | <primPrefix> "threadgtmask"
                               | <primPrefix> "threadgemask"
                               | <primPrefix> "warpid"
                               | <primPrefix> "smid"


     Modify Section 2.X.3.2 of the NV_tessellation_program5 specification,
     Program Attribute Variables.

     (Add the table entries and relevant text describing the tessellation
      control and evaluation program attribute variables used to query thread
      states.)


       Primitive Binding Suffix    Components  Underlying State
       --------------------------  ----------  ----------------------------
       ...
       primitive.threadid         (id,-,-,-)  id of the current thread
       primitive.threadeqmask     (m,-,-,-)   mask with the current thread
       primitive.threadltmask     (m,-,-,-)   mask with lower thread
       primitive.threadlemask     (m,-,-,-)   mask with lower or equal thread
       primitive.threadgtmask     (m,-,-,-)   mask with greater thread
       primitive.threadgemask     (m,-,-,-)   mask with greater or equal thread
       primitive.warpid           (id,-,-,-)  warp id of the current thread
       primitive.smid             (id,-,-,-)  SM id of the current thread
       ...

     If an attribute binding matches "primitive.threadid", the "x" component
     is filled with the thread id of the current thread.  The thread id is an
     unsigned integer in the range 0 to 31.

     If an attribute binding matches "primitive.threadeqmask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which the
     bit equal to the current thread id is set.

     If an attribute binding matches "primitive.threadltmask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     lower than the current thread id are set.

     If an attribute binding matches "primitive.threadlemask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     lower than or equal to the current thread id are set.

     If an attribute binding matches "primitive.threadgtmask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     greater than the current thread id are set.

     If an attribute binding matches "primitive.threadgemask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     greater than or equal to the current thread id are set.

     If an attribute binding matches "primitive.warpid", the "x" component is
     filled with the warp id of the current thread.  The warp id is an unsigned
     integer; the range of this value is hardware-dependent.

     If an attribute binding matches "primitive.smid", the "x" component is
     filled with the SM id of the current thread.  The SM id is an unsigned
     integer; the range of this value is hardware-dependent.
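
     The relationship between the thread id and the five mask bindings can be
     written out explicitly.  The following is a minimal sketch that derives
     each mask from a thread id in the range 0 to 31; the ThreadMasks
     structure and the thread_masks function are hypothetical names used only
     for this illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t eq;  /* only the bit at thread_id        */
    uint32_t lt;  /* bits below thread_id             */
    uint32_t le;  /* bits below or at thread_id       */
    uint32_t gt;  /* bits above thread_id             */
    uint32_t ge;  /* bits at or above thread_id       */
} ThreadMasks;

/* Derives the five thread masks for a thread id in [0, 31]. */
ThreadMasks thread_masks(unsigned thread_id)
{
    ThreadMasks m;
    assert(thread_id < 32);
    m.eq = UINT32_C(1) << thread_id;
    m.lt = m.eq - 1u;
    m.le = m.lt | m.eq;
    m.gt = (uint32_t)~m.le;
    m.ge = (uint32_t)~m.lt;
    return m;
}
```

     For example, thread id 5 yields an eq mask of 0x00000020, an lt mask of
     0x0000001F and a ge mask of 0xFFFFFFE0.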

     (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension,
      as extended by NV_gpu_program5 and NV_tessellation_program5)

     + Shader thread group (NV_shader_thread_group)

     If a program specifies the "NV_shader_thread_group" option, it may use
     the "primitive.threadid", "primitive.threadeqmask",
     "primitive.threadltmask", "primitive.threadlemask",
     "primitive.threadgtmask", "primitive.threadgemask", "primitive.warpid",
     "primitive.smid", "state.thread.warpsize", "state.thread.warpspersm" and
     "state.thread.smcount" bindings.  It may also use the "TGBALLOT"
     instruction.  If this option is not specified, a program will fail to
     compile if it uses those instructions or bindings.


 Dependencies on NV_compute_program5

     If NV_compute_program5 is supported and "OPTION NV_shader_thread_group" is
     specified in an assembly program, the following edits are made to extend
     the assembly programming model documented in the NV_gpu_program4 extension
     and extended by NV_gpu_program5 and NV_compute_program5.

     If NV_compute_program5 is not supported, or if
     "OPTION NV_shader_thread_group" is not specified in an assembly program,
     the contents of this dependencies section should be ignored.

     Section 2.X.2, Program Grammar

     (add the following rules to the grammar)

     <attribBasic>           ::= "invocation" "." "threadid"
                               | "invocation" "." "threadeqmask"
                               | "invocation" "." "threadltmask"
                               | "invocation" "." "threadlemask"
                               | "invocation" "." "threadgtmask"
                               | "invocation" "." "threadgemask"
                               | "invocation" "." "warpid"
                               | "invocation" "." "smid"

     Modify Section 2.X.3.2 of the NV_compute_program5 specification, Program
     Attribute Variables.

     (Add the table entries and relevant text describing the compute program
      input variables used to query thread states.)

       Attribute Binding           Components  Underlying State
       --------------------------  ----------  ----------------------------
       ...
       invocation.threadid         (id,-,-,-)  id of the current thread
       invocation.threadeqmask     (m,-,-,-)   mask with the current thread
       invocation.threadltmask     (m,-,-,-)   mask with lower thread
       invocation.threadlemask     (m,-,-,-)   mask with lower or equal thread
       invocation.threadgtmask     (m,-,-,-)   mask with greater thread
       invocation.threadgemask     (m,-,-,-)   mask with greater or equal thread
       invocation.warpid           (id,-,-,-)  warp id of the current thread
       invocation.smid             (id,-,-,-)  SM id of the current thread
       ...

     If a compute attribute binding matches "invocation.threadid", the "x"
     component is filled with the thread id of the current thread.  The thread
     id is an unsigned integer in the range 0 to 31.

     If a compute attribute binding matches "invocation.threadeqmask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which the
     bit equal to the current thread id is set.

     If a compute attribute binding matches "invocation.threadltmask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     lower than the current thread id are set.

     If a compute attribute binding matches "invocation.threadlemask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     lower than or equal to the current thread id are set.

     If a compute attribute binding matches "invocation.threadgtmask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     greater than the current thread id are set.

     If a compute attribute binding matches "invocation.threadgemask", the "x"
     component is filled with a 32-bit unsigned integer bitfield in which bits
     greater than or equal to the current thread id are set.

     If a compute attribute binding matches "invocation.warpid", the "x"
     component is filled with the warp id of the current thread.  The warp id is
     an unsigned integer; the range of this value is hardware-dependent.

     If a compute attribute binding matches "invocation.smid", the "x" component
     is filled with the SM id of the current thread.  The SM id is an unsigned
     integer; the range of this value is hardware-dependent.
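
     One possible use of these bindings, offered here only as an illustration
     and not as something this specification mandates, is to combine a
     TGBALLOT result with the "threadltmask" value to give each thread that
     passed a condition a dense output slot.  The sketch below emulates that
     computation on the CPU; the function name compaction_slot is
     hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Returns how many threads with a lower id than thread_id have their bit
 * set in ballot.  For a thread whose own bit is set, this is its index
 * among the passing threads of the group. */
unsigned compaction_slot(uint32_t ballot, unsigned thread_id)
{
    uint32_t lt_mask;
    uint32_t lower;
    unsigned count = 0;

    assert(thread_id < 32);
    lt_mask = (UINT32_C(1) << thread_id) - 1u;  /* same value as threadltmask */
    lower = ballot & lt_mask;

    /* Portable population count: clear the lowest set bit each pass. */
    while (lower != 0) {
        lower &= lower - 1u;
        ++count;
    }
    return count;
}
```

     With a ballot of 0x16 (threads 1, 2 and 4 passing), thread 1 receives
     slot 0, thread 2 receives slot 1 and thread 4 receives slot 2.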

     (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension,
      as extended by NV_gpu_program5 and NV_compute_program5)


     + Shader thread group (NV_shader_thread_group)

     If a program specifies the "NV_shader_thread_group" option, it may use the
     "invocation.threadid", "invocation.threadeqmask",
     "invocation.threadltmask", "invocation.threadlemask",
     "invocation.threadgtmask", "invocation.threadgemask", "invocation.warpid",
     "invocation.smid", "state.thread.warpsize", "state.thread.warpspersm" and
     "state.thread.smcount" bindings.  It may also use the "TGBALLOT"
     instruction.  If this option is not specified, a program will fail to
     compile if it uses those instructions or bindings.


 Errors

     None.

 New State

     None.

 New Implementation Dependent State

                                                              Minimum
     Get Value                         Type  Get Command       Value   Description           Sec.   Attrib
     --------------------------------  ----  ---------------  -------  --------------------- ------ ------
     WARP_SIZE_NV                       Z+   GetIntegerv        1       total number of      2.X.3.3  -
                                                                        threads in a warp.

     WARPS_PER_SM_NV                    Z+   GetIntegerv        1       maximum number of    2.X.3.3  -
                                                                        warps executing on
                                                                        an SM.

     SM_COUNT_NV                        Z+   GetIntegerv        1       number of SMs on     2.X.3.3  -
                                                                        the GPU.
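
     The three values above can be combined into an upper bound on the number
     of threads resident on the GPU at once.  The sketch below is written
     against a function pointer shaped like GetIntegerv so that it does not
     need an OpenGL context to compile.  The numeric token values are assumed
     to match the Khronos glext.h header and should be checked against the
     New Tokens section; GetIntegervFn, max_resident_threads and
     fake_get_integerv are hypothetical names, and the values returned by the
     stand-in getter are made up for the example.

```c
#include <assert.h>

/* Assumed token values; verify against the New Tokens section. */
#define WARP_SIZE_NV     0x9339
#define WARPS_PER_SM_NV  0x933A
#define SM_COUNT_NV      0x933B

/* Same parameter shape as GetIntegerv (enum, pointer to integer). */
typedef void (*GetIntegervFn)(unsigned int pname, int *data);

/* Upper bound on concurrently resident threads:
 * threads per warp * warps per SM * SMs on the GPU. */
long max_resident_threads(GetIntegervFn get)
{
    int warp_size = 0;
    int warps_per_sm = 0;
    int sm_count = 0;

    assert(get != 0);
    get(WARP_SIZE_NV, &warp_size);
    get(WARPS_PER_SM_NV, &warps_per_sm);
    get(SM_COUNT_NV, &sm_count);
    return (long)warp_size * warps_per_sm * sm_count;
}

/* Hypothetical stand-in for GetIntegerv with made-up values. */
void fake_get_integerv(unsigned int pname, int *data)
{
    switch (pname) {
    case WARP_SIZE_NV:    *data = 32; break;
    case WARPS_PER_SM_NV: *data = 64; break;
    case SM_COUNT_NV:     *data = 16; break;
    default:              break;
    }
}
```

     In an application, the real GetIntegerv entry point would take the place
     of fake_get_integerv, subject to the platform's calling convention.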


 Issues

     None.


 Revision History

     Rev.    Date    Author    Changes
     ----  --------  --------  -----------------------------------------
      4     7/21/15  jbreton    Update the layout of threads within a quad for
                                window and framebuffer object rendering.
      3     2/14/14  jbreton    Rename the extension from NVX to NV.
      2      9/4/13  jbreton    Add helperThread attribute binding.
      1    12/19/12  jbreton    Internal revisions.