
 Name

     NV_parameter_buffer_object2

 Name Strings

     GL_NV_parameter_buffer_object2

 Contact

     Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com)

 Status

     Shipping (July 2009, Release 190)

 Version

     Last Modified Date:         09/09/09
     NVIDIA Revision:            2

 Number

     378

 Dependencies

     OpenGL 2.0 is required.

     NV_gpu_program4 is required.

     NV_parameter_buffer_object is required.

     This extension is written against the NV_gpu_program4 specification.

     NV_shader_buffer_load trivially affects the definition of this extension.

 Overview

     This extension builds on the NV_parameter_buffer_object extension to
     provide additional flexibility in sourcing data from buffer objects.

     The original NV_parameter_buffer_object (PaBO) extension provided the
     ability to bind buffer objects to a set of numbered binding points and
     access them in assembly programs as though they were arrays of 32-bit
     scalars (via the BUFFER variable type) or arrays of four-component vectors
     with 32-bit scalar components (via the BUFFER4 variable type).  However,
     the functionality it provided had some significant limits on flexibility.
     Since any given buffer binding point could be used either as a BUFFER or
     BUFFER4, but not both, programs couldn't do both 32- and 128-bit fetches
     from a single binding point.  Additionally, no support was provided for
     8-, 16-, or 64-bit fetches, though they could be emulated using larger
     loads, with bitfield operations and/or write masking to put components in
     the right places.  Indexing was supported, but strides were limited to 4-
     and 16-byte multiples, depending on whether BUFFER or BUFFER4 was used.

     This new extension provides the buffer variable declaration type CBUFFER
     to specify a buffer that is treated as an array of bytes, rather than an
     array of words or vectors.  The LDC instruction allows programs to extract
     a vector of data from a CBUFFER variable, using a size and component count
     specified in the opcode modifier.  1-, 2-, and 4-component fetches are
     supported.  The LDC instruction supports byte offsets using normal array
     indexing mechanisms; both run-time and immediate offsets are supported.
     Offsets used for a buffer object fetch are required to be aligned to the
     size of the fetch (1, 2, 4, 8, or 16 bytes).

 New Procedures and Functions

     None.

 New Tokens

     None.

 Additions to Chapter 2 of the OpenGL 3.0 Specification (OpenGL Operation)

     (All modifications are relative to Section 2.X, GPU Programs, from the
      NV_gpu_program4 specification.)

     Modify Section 2.X.2, Program Grammar

     (add after the long list of grammar rules) If a program specifies the
     NV_parameter_buffer_object2 program option, the following rules are added
     to the NV_gpu_program4 base program grammar:

     <VECTORop>              ::= "LDC"

     <opModifier>            ::= "F32";
                               | "F32X2";
                               | "F32X4";
                               | "S8";
                               | "S16";
                               | "S32";
                               | "S32X2";
                               | "S32X4";
                               | "U8";
                               | "U16";
                               | "U32";
                               | "U32X2";
                               | "U32X4";

     <bufferDeclType>        ::= "CBUFFER"


     Modify Section 2.X.3.6, Program Parameter Buffers

     (modify the paragraph describing the different types of parameter buffer
     variable declarations to include support for "CBUFFER".)

     Program parameter buffer variables are treated as an array of
     single-component words if the <bufferDeclType> grammar rule matches
     "BUFFER" or as an array of four-component vectors if it matches "BUFFER4".
     Program parameter buffers may also be declared as an array of basic
     machine units from which data can be extracted using the LDC (load
     constant) instruction, if <bufferDeclType> matches "CBUFFER".  Parameter
     buffer variables declared using "CBUFFER" may not be used as an operand in
     any instruction other than LDC, while "BUFFER" and "BUFFER4" variables may
     not be used with LDC.  A program will fail to load if a variable declared
     as "BUFFER" and another variable declared as "BUFFER4" use the same buffer
     binding point.  There is no limitation on the use of "CBUFFER" variables
     in conjunction with "BUFFER" or "BUFFER4" variables using the same buffer
     binding point.
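
     As a non-normative illustration, the load-time rule above can be
     sketched in C.  The BufferDecl structure and the function name are
     hypothetical and are not part of this extension; they only model the
     declaration type and binding point <a> of each buffer variable.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a parameter buffer declaration; these names are
 * illustrative and do not come from the extension. */
typedef enum { DECL_BUFFER, DECL_BUFFER4, DECL_CBUFFER } BufferDeclType;

typedef struct {
    BufferDeclType type;
    unsigned bindingPoint;   /* the <a> in program.buffer[a] */
} BufferDecl;

/* Returns false if one declaration uses BUFFER and another uses BUFFER4 on
 * the same binding point.  CBUFFER declarations never conflict. */
static bool DeclarationsAreCompatible(const BufferDecl *decls, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        for (size_t j = i + 1; j < count; j++) {
            if (decls[i].bindingPoint != decls[j].bindingPoint)
                continue;
            bool mixed =
                (decls[i].type == DECL_BUFFER && decls[j].type == DECL_BUFFER4) ||
                (decls[i].type == DECL_BUFFER4 && decls[j].type == DECL_BUFFER);
            if (mixed)
                return false;   /* the program would fail to load */
        }
    }
    return true;
}
```

     Under this sketch, a BUFFER and a CBUFFER variable sharing binding
     point 0 are accepted, while adding a BUFFER4 variable on the same
     binding point is rejected.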

     (modify/restructure the paragraph describing basic program parameter
      bindings to handle the byte bindings provided by "CBUFFER" variables)

     If a program parameter buffer binding matches "program.buffer[a][b]", the
     program parameter variable corresponds to element <b> of the buffer object
     bound to binding point <a>.  Each element of the bound buffer object is
     treated as:

       * a single basic machine unit of data, if the variable is declared using
         "CBUFFER";

       * a single word of data that can hold an integer or floating-point
         value, if the variable is declared as "BUFFER"; or

       * four words of data that can hold integer or floating-point values, if
         the variable is declared as "BUFFER4".

     When a binding corresponding to a "BUFFER" variable is used as an operand,
     the selected word is broadcast to all four components of the variable.
     When a binding corresponding to a "BUFFER4" variable is used as an
     operand, the four components of the selected buffer element are loaded
     into the variable.  A binding corresponding to a "CBUFFER" variable may be
     used only in the LDC instruction, and will be used there as a pointer to
     extract operand values from buffer memory.  If no buffer object is bound
     to binding point <a>, or the bound buffer object is not large enough to
     hold element <b>, the values used are undefined.  The binding point <a>
     must be a nonnegative integer constant.


     Modify Section 2.X.4, Program Execution Environment

     (Add to the set of opcodes in Table X.13)

                   Modifiers
       Instruction F I C S H D  Out Inputs    Description
       ----------- - - - - - -  --- --------  --------------------------------
       LDC         X X X X - F  v   v         load from constant buffer


     Modify Section 2.X.4.1, Program Instruction Modifiers

     (Add to Table X.14, Instruction Modifiers, and to the corresponding
     description following the table)

       Modifier  Description
       --------  -----------------------------------------------
       F32       Access one 32-bit floating-point value
       F32X2     Access two 32-bit floating-point values
       F32X4     Access four 32-bit floating-point values
       S8        Access one 8-bit signed integer value
       S16       Access one 16-bit signed integer value
       S32       Access one 32-bit signed integer value
       S32X2     Access two 32-bit signed integer values
       S32X4     Access four 32-bit signed integer values
       U8        Access one 8-bit unsigned integer value
       U16       Access one 16-bit unsigned integer value
       U32       Access one 32-bit unsigned integer value
       U32X2     Access two 32-bit unsigned integer values
       U32X4     Access four 32-bit unsigned integer values

     For memory load operations, the "F32", "F32X2", "F32X4", "S8", "S16",
     "S32", "S32X2", "S32X4", "U8", "U16", "U32", "U32X2", and "U32X4" storage
     modifiers control how data are loaded from memory.  Storage modifiers are
     supported by the LDC and LOAD instructions and are covered in more detail
     in the descriptions of these instructions.  These instructions must
     specify exactly one of these modifiers, and may not specify any of the
     base data type modifiers (F,U,S) described above.  The base data type of
     the result vector of a LOAD or LDC instruction is trivially derived from
     the storage modifier.


     Add New Section 2.X.4.5, Program Memory Access

     Programs may load from buffer object memory via the LDC (load constant)
     and LOAD (global load) instructions.

     Load instructions read 8, 16, 32, 64, or 128 bits of data from a source
     address to produce a four-component vector, according to the storage
     modifier specified with the instruction.  The storage modifier has three
     parts:

       - a base data type, "F", "S", or "U", specifying that the instruction
         fetches floating-point, signed integer, or unsigned integer values,
         respectively;

       - a component size, specifying that the components fetched by the
         instruction have 8, 16, or 32 bits; and

       - an optional component count, where "X2" and "X4" indicate that two or
         four components be fetched, and no count indicates a single component
         fetch.

     When the storage modifier specifies that fewer than four components should
     be fetched, remaining components are filled with zeroes.  When performing
     a global load (LOAD), the GPU address is specified as an instruction
     operand.  When performing a constant buffer load (LDC), the GPU address is
     derived by adding the base address of the bound buffer object to an offset
     specified as an instruction operand.  Given a GPU address <address> and a
     storage modifier <modifier>, the memory load can be described by the
     following code:

       result_t_vec BufferMemoryLoad(char *address, OpModifier modifier)
       {
         result_t_vec result = { 0, 0, 0, 0 };
         switch (modifier) {
         case F32:
             result.x = ((float32_t *)address)[0];
             break;
         case F32X2:
             result.x = ((float32_t *)address)[0];
             result.y = ((float32_t *)address)[1];
             break;
         case F32X4:
             result.x = ((float32_t *)address)[0];
             result.y = ((float32_t *)address)[1];
             result.z = ((float32_t *)address)[2];
             result.w = ((float32_t *)address)[3];
             break;
         case S8:
             result.x = ((int8_t *)address)[0];
             break;
         case S16:
             result.x = ((int16_t *)address)[0];
             break;
         case S32:
             result.x = ((int32_t *)address)[0];
             break;
         case S32X2:
             result.x = ((int32_t *)address)[0];
             result.y = ((int32_t *)address)[1];
             break;
         case S32X4:
             result.x = ((int32_t *)address)[0];
             result.y = ((int32_t *)address)[1];
             result.z = ((int32_t *)address)[2];
             result.w = ((int32_t *)address)[3];
             break;
         case U8:
             result.x = ((uint8_t *)address)[0];
             break;
         case U16:
             result.x = ((uint16_t *)address)[0];
             break;
         case U32:
             result.x = ((uint32_t *)address)[0];
             break;
         case U32X2:
             result.x = ((uint32_t *)address)[0];
             result.y = ((uint32_t *)address)[1];
             break;
         case U32X4:
             result.x = ((uint32_t *)address)[0];
             result.y = ((uint32_t *)address)[1];
             result.z = ((uint32_t *)address)[2];
             result.w = ((uint32_t *)address)[3];
             break;
         }
         return result;
       }

     The offset used for the constant buffer loads must be aligned to the fetch
     size corresponding to the storage opcode modifier.  For S8 and U8, the
     offset has no alignment requirements.  For S16 and U16, the offset must be
     a multiple of two basic machine units.  For F32, S32, and U32, the offset
     must be a multiple of four.  For F32X2, S32X2, and U32X2, the offset must
     be a multiple of eight.  For F32X4, S32X4, and U32X4, the offset must be a
     multiple of sixteen.  If an offset is not correctly aligned, the values
     returned by a constant buffer load will be undefined.
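
     The alignment rules above amount to requiring that the offset be a
     multiple of the total fetch size.  A minimal, non-normative sketch in C
     follows; the function names are hypothetical, and the OpModifier
     enumerants simply mirror the storage modifier names used in this
     section.

```c
#include <stdbool.h>
#include <stdint.h>

/* Enumerants mirror the storage modifiers; the type itself is illustrative. */
typedef enum {
    F32, F32X2, F32X4,
    S8, S16, S32, S32X2, S32X4,
    U8, U16, U32, U32X2, U32X4
} OpModifier;

/* Total number of basic machine units (bytes) read for a storage modifier:
 * component size in bytes multiplied by the component count. */
static uint32_t FetchSizeInBytes(OpModifier modifier)
{
    switch (modifier) {
    case S8:    case U8:                 return 1;
    case S16:   case U16:                return 2;
    case F32:   case S32:   case U32:    return 4;
    case F32X2: case S32X2: case U32X2:  return 8;
    case F32X4: case S32X4: case U32X4:  return 16;
    }
    return 0;  /* not reached for valid modifiers */
}

/* True if <offset> satisfies the alignment requirement for <modifier>. */
static bool OffsetIsAligned(uint32_t offset, OpModifier modifier)
{
    uint32_t size = FetchSizeInBytes(modifier);
    return size != 0 && offset % size == 0;
}
```

     For example, OffsetIsAligned(40, U32X2) is true, while
     OffsetIsAligned(12, F32X4) is false, so an F32X4 load at byte offset 12
     would return undefined values.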


     Modify Section 2.X.6, Program Options

     + Extended Parameter Buffer Object Support (NV_parameter_buffer_object2)

     If a program specifies the "NV_parameter_buffer_object2" option, it may
     use the CBUFFER statement to declare program parameter buffer variables
     and the LDC instruction to load data from parameter buffer variables using
     arbitrary offsets.


     Modify Section 2.X.8, Program Instruction Set

     Section 2.X.8.Z, LDC:  Load from Constant Buffer

     The LDC instruction loads a vector operand from a buffer object to yield a
     result vector.  The operand used for the LDC instruction must correspond
     to a parameter buffer variable declared using the "CBUFFER" statement; a
     program will fail to load if any other type of operand is used in an LDC
     instruction.

       result = BufferMemoryLoad(&op0, storageModifier);

     A base operand vector is fetched from memory as described in Section
     2.X.4.5, with the GPU address derived from the binding corresponding to
     the operand.  A final operand vector is derived from the base operand
     vector by applying swizzle, negation, and absolute value operand modifiers
     as described in Section 2.X.4.2.

     The amount of memory in any given buffer object binding accessible by the
     LDC instruction may be limited.  If any component fetched by the LDC
     instruction extends 4*<n> or more basic machine units from the beginning
     of the buffer object binding, where <n> is the implementation-dependent
     constant MAX_PROGRAM_PARAMETER_BUFFER_SIZE_NV, the value fetched for that
     component will be undefined.
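
     A non-normative sketch of this limit in C follows.  It assumes the
     reading that a component is defined only if all of its bytes lie within
     the first 4*<n> basic machine units of the binding; the function and
     parameter names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* A component occupying bytes [componentOffset, componentOffset +
 * componentSize) relative to the start of the buffer binding is treated as
 * defined only if it lies entirely within the first 4 * n bytes, where n is
 * the value of MAX_PROGRAM_PARAMETER_BUFFER_SIZE_NV.  This interpretation
 * of the limit is an assumption of the sketch. */
static bool ComponentIsWithinLimit(uint64_t componentOffset,
                                   uint32_t componentSize,
                                   uint32_t maxParameterBufferSize)
{
    uint64_t limit = 4ull * maxParameterBufferSize;
    return componentOffset + componentSize <= limit;
}
```

     With <n> equal to 16384 (the value cited in issue (2) for the GPUs
     shipping at the time), a 32-bit component at byte offset 65532 is within
     the limit, while a component starting at byte offset 65536 is not.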

     LDC supports no base data type modifiers, but requires exactly one storage
     modifier.  The base data types of the operand and result vectors are
     derived from the storage modifier.


 Additions to Chapter 3 of the OpenGL 3.0 Specification (Rasterization)

     None.

 Additions to Chapter 4 of the OpenGL 3.0 Specification (Per-Fragment
 Operations and the Frame Buffer)

     None.

 Additions to Chapter 5 of the OpenGL 3.0 Specification (Special Functions)

     None.

 Additions to Chapter 6 of the OpenGL 3.0 Specification (State and
 State Requests)

     None.

 Additions to Appendix A of the OpenGL 3.0 Specification (Invariance)

     None.

 Additions to the AGL/GLX/WGL Specifications

     None.

 Errors

     No new errors.

 Dependencies on NV_shader_buffer_load

     If NV_shader_buffer_load (or equivalent functionality) is not supported,
     references to the "LOAD" opcode in the description of the opcode modifiers
     for "LDC" should be removed.

 New State

     None.

 New Implementation Dependent State

     None.

 Issues

     (1) What sort of alignment requirements, if any, should be imposed on the
         operand provided to the LDC instruction?

       RESOLVED:  The offset of the operand must be aligned according to the
       size of the fetch.  For 1-, 2-, and 4-component fetches, the offset must
       be a multiple of <N>, 2*<N>, and 4*<N>, where <N> is the size in bytes
       of the components being fetched.

     (2) NV_parameter_buffer_object provides an implementation-dependent limit
         on the portion of a buffer object that may be fetched via BUFFER and
         BUFFER4 variables.  Should the same limits apply to the LDC
         instruction?

       RESOLVED:  Yes.  On currently shipping NVIDIA GPUs, the maximum program
       parameter buffer size is 16384 32-bit words, or 64KB.  Buffers larger
       than 64KB may be used, but any fetches accessing memory beyond the first
       64KB of a buffer binding will return undefined values.

     (3) Should we support fetches of 3-component vectors?  If so, what should
         be the minimum alignment for the specified offset?

       RESOLVED:  No, we'll leave 3-component vectors out of this extension.
       This limitation can be worked around either by doing three separate
       single-component fetches or a four-component fetch with an appropriate
       write mask.  The former approach supports indexing in a tightly packed
       array of 3-component vectors; the latter would require that array
       elements be padded to four components.
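
       The two workarounds differ only in how byte offsets are computed.  A
       minimal sketch in C, assuming 32-bit components; the function names
       are hypothetical:

```c
#include <stdint.h>

/* Byte offsets for the three single-component fetches (e.g. LDC.F32) that
 * read element <index> of a tightly packed array of 3-component vectors.
 * Each offset is a multiple of four, as required for 32-bit fetches. */
static void PackedVec3Offsets(uint32_t index, uint32_t offsets[3])
{
    const uint32_t stride = 12;   /* three 4-byte components, no padding */
    for (uint32_t c = 0; c < 3; c++)
        offsets[c] = index * stride + 4 * c;
}

/* Byte offset for the single four-component fetch (e.g. LDC.F32X4) that
 * reads element <index> of an array padded to four components per element.
 * The offset is always a multiple of sixteen. */
static uint32_t PaddedVec3Offset(uint32_t index)
{
    return index * 16;
}
```

       Element 1 of the packed array starts at byte 12, which is not a
       multiple of sixteen, so a four-component fetch there would violate
       the alignment requirement; this is why the four-component approach
       needs padded elements.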

     (4) Should we support fetches of 8- and 16-bit components?

       RESOLVED:  Yes, we will support fetches of 8- and 16-bit signed and
       unsigned integers.

       Fetches of vectors of 8- and 16-bit integers are not supported but may
       be emulated by performing shift/mask operations on the results of 32-bit
       fetches.
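
       The shift/mask emulation can be sketched in C as follows.  The sketch
       assumes that the lowest-addressed byte of a 32-bit fetch lands in the
       least significant bits (little-endian order), which this extension
       does not state; the function names are hypothetical.

```c
#include <stdint.h>

/* Unpack four unsigned 8-bit values from one 32-bit word, such as the x
 * component produced by LDC.U32.  Assumes the lowest-addressed byte is in
 * bits 0..7 (little-endian); that byte order is an assumption. */
static void UnpackU8x4(uint32_t word, uint32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (word >> (8 * i)) & 0xFFu;
}

/* Unpack two signed 16-bit values from one 32-bit word, sign-extending
 * each to 32 bits.  Same byte-order assumption as above. */
static void UnpackS16x2(uint32_t word, int32_t out[2])
{
    for (int i = 0; i < 2; i++) {
        uint32_t v = (word >> (16 * i)) & 0xFFFFu;
        /* Flip the sign bit, then subtract it back to sign-extend. */
        out[i] = (int32_t)(v ^ 0x8000u) - 0x8000;
    }
}
```

       For example, the word 0x44332211 unpacks to the bytes 0x11, 0x22,
       0x33, and 0x44, and the word 0xFFFE0003 unpacks to the signed 16-bit
       values 3 and -2.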

       Fetches of 16-bit floating-point values, or floating-point vectors
       thereof, are not supported.  A single fp16 fetch may be emulated using a
       16-bit unsigned integer fetch and the UP2H instruction to convert the 16
       LSBs of the fetch to a floating-point value.  The encoding of 16-bit
       floating-point values is described in section 2.1.2 of the OpenGL 3.0
       specification.
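
       For reference, the decoding of a 16-bit floating-point value (1 sign
       bit, 5 exponent bits, 10 mantissa bits) can be sketched in C.  This
       illustrates the value encoded in the 16 LSBs of a U16 fetch; it is
       not a definition of the UP2H instruction, and the function names are
       hypothetical.

```c
#include <math.h>    /* used only for the INFINITY and NAN macros */
#include <stdint.h>

/* Returns 2^exponent for the small exponents needed here. */
static float PowerOfTwo(int exponent)
{
    float result = 1.0f;
    for (; exponent > 0; exponent--) result *= 2.0f;
    for (; exponent < 0; exponent++) result *= 0.5f;
    return result;
}

/* Decode the 16 LSBs of a U16 fetch as a 16-bit floating-point value. */
static float HalfBitsToFloat(uint16_t bits)
{
    uint32_t sign     = (bits >> 15) & 0x1u;
    uint32_t exponent = (bits >> 10) & 0x1Fu;
    uint32_t mantissa = bits & 0x3FFu;
    float magnitude;

    if (exponent == 0)            /* zero or denormal: m * 2^-24 */
        magnitude = (float)mantissa * PowerOfTwo(-24);
    else if (exponent == 31)      /* infinity or NaN */
        magnitude = mantissa ? NAN : INFINITY;
    else                          /* normal: (1 + m/1024) * 2^(e - 15) */
        magnitude = (float)(mantissa + 1024u) * PowerOfTwo((int)exponent - 25);

    return sign ? -magnitude : magnitude;
}
```

       Under this sketch, the bit patterns 0x3C00, 0x3800, and 0xC000 decode
       to 1.0, 0.5, and -2.0, respectively.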

     (5) Should we support fetches of 64-bit components?

       RESOLVED:  No; the instruction set provided by NV_gpu_program4 does not
       support 64-bit components anywhere.  If future instructions support
       64-bit components, this restriction should be removed.

     (6) How should the operands of the LDC instruction be specified?

       RESOLVED:  We will create a new type of buffer variable ("CBUFFER"),
       which defines an array of bytes to be fetched from.  The type of fetch
       to perform is specified by a storage modifier (as in
       NV_shader_buffer_load).  An offset relative to the buffer binding (in
       bytes) may be specified using normal array indexing syntax, and an index
       computed at run-time is supported.

       Some examples:

         CBUFFER buffer[] = { program.buffer[0] };
         TEMP      i;
         MOV.S     i, 32;                  # computed offset of 32B
         LDC.F32   result, buffer[12];     # (x,0,0,0) from bytes 12..15
         LDC.F32X4 result, buffer[16];     # (x,y,z,w) from bytes 16..31
         LDC.U8    result, buffer[i.x+3];  # (x,0,0,0) from byte 35
         LDC.S32   result, buffer[i.x+12]; # (x,0,0,0) from bytes 44..47
         LDC.U32X2 result, buffer[i.x+8];  # (x,y,0,0) from bytes 40..47
         LDC.S16   result, buffer[i.x+2];  # (x,0,0,0) from bytes 34..35

       We chose to provide the new buffer variable type (CBUFFER) rather than
       reusing BUFFER or BUFFER4.  For CBUFFER variables, "buffer[12]"
       unambiguously specifies a 12-byte offset.  For BUFFER or BUFFER4
       variables, an operand of "buffer[12]" already has an existing meaning,
       implying an offset of 12 words or vectors, which would be 48 or 192
       bytes, respectively.  Because we want to be able to fetch 8- and 16-bit
       units, having an offset multiplied by four doesn't make sense.  We could
       have had LDC simply ignore the type of binding and always interpret an
       index as a byte offset, but chose the new declaration type to avoid
       confusion.
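
       The difference in how an array index maps to a byte offset for each
       declaration type can be sketched in C.  The enumeration and function
       names are hypothetical and serve only to illustrate the scaling
       described above.

```c
#include <stdint.h>

/* Hypothetical tags for the three declaration types. */
typedef enum { DECL_BUFFER, DECL_BUFFER4, DECL_CBUFFER } BufferDeclType;

/* Byte offset, relative to the start of the binding, selected by an array
 * index for each declaration type. */
static uint32_t IndexToByteOffset(BufferDeclType type, uint32_t index)
{
    switch (type) {
    case DECL_CBUFFER: return index;        /* array of bytes */
    case DECL_BUFFER:  return index * 4;    /* array of 32-bit words */
    case DECL_BUFFER4: return index * 16;   /* array of 4-component vectors */
    }
    return 0;  /* not reached for valid declaration types */
}
```

       An index of 12 therefore selects byte 12 for CBUFFER, byte 48 for
       BUFFER, and byte 192 for BUFFER4.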

       We also considered an approach where the buffer and offset were
       specified in separate operands.  That would be similar to texture, where
       the coordinates and texture are specified separately.  The first operand
       would have been interpreted as an unsigned scalar specifying a byte
       offset, the second operand would have specified a buffer variable
       binding, and a pointer would be obtained by adding the two
       operands. This would have looked something like:

         BUFFER buffer[] = { program.buffer[0] };
         LDC.S32X2 result, offset.x, buffer;

       We chose not to implement this approach mainly because this syntax would
       require specifying a new type of instruction; the syntax we adopted
       simply reuses existing vector operand and indexing mechanisms.
       Additionally, the syntax in this extension provides immediate offsets
       for "free", which the offset-buffer syntax would not support directly
       without additional new syntax.  For example, to load a structure with a
       pair of two-component vectors using offset-buffer syntax, you would have
       to do something like:

         BUFFER buffer[] = { program.buffer[0] };
         TEMP offset;
         LDC.S32X2 result1, offset.x, buffer;
         ADD.U offset.x, offset.x, 8;            # bump offset to second vector
         LDC.S32X2 result2, offset.x, buffer;

     (7) How should the fetches in the LDC instruction interact with other
         operand modifiers (swizzle, absolute value, negation)?  With result
         modifiers (condition codes, saturation)?

       RESOLVED:  These features will be orthogonal.  When any of these
       modifiers are specified, the base data type to which they apply comes
       from the storage modifier of the LDC instruction.

       The LDC instruction is defined to produce a "base operand vector" from a
       memory fetch.  This isn't particularly different from normal operands,
       where a base operand vector is derived from the binding corresponding to
       the operand.  In both cases, the components of this vector are swizzled
       and have optional absolute value and negation operations performed to
       produce a final vector operand, as is the case with other vector
       operands.

       If condition code operations or saturation are specified for the result
       vector, these operations are performed using the appropriate data types.

     (8) What happens if a non-zero base offset is specified for a CBUFFER
         variable?

       RESOLVED:  A subset of the bytes in a buffer object can be specified
       using range syntax like the following:

         CBUFFER buffer[] = { program.buffer[0][16..31] };

       The sub-range need not start at the beginning of the buffer object; in
       the example above, it starts 16 bytes into the buffer.  When accessing a
       parameter buffer variable corresponding to such a sub-range, an array
       index is relative to the base of the sub-range.  So the offset of the
       sub-range is effectively added to the index used for the LDC operand:

         LDC.F32   result, buffer[12];     # (x,0,0,0) from bytes 28..31

     (9) What happens if a non-array CBUFFER variable is used?

       RESOLVED:  A non-array variable may be used with LDC.  However, array
       indexing isn't supported with non-array variables, so all LDC loads
       using that variable will fetch using the same base address.

         CBUFFER bufferElement = program.buffer[0][32];
         LDC.U8    result, bufferElement;     # (x,0,0,0) from byte 32
         LDC.S16   result, bufferElement;     # (x,0,0,0) from bytes 32..33
         LDC.F32   result, bufferElement;     # (x,0,0,0) from bytes 32..35
         LDC.F32X4 result, bufferElement;     # (x,y,z,w) from bytes 32..47

     (10) Should single-component fetches from LDC smear their results across
          all four components of the result vector, to allow packing multiple
          non-vectors into a single vector?

       RESOLVED:  No.  However, swizzle suffixes on the operand will provide
       this capability for free.  For example, let's say you wanted to fetch
       four scalars from a buffer and pack the results into a single temporary
       vector.  The swizzle syntax lets you do this by smearing the real
       component (always fetched in "x") into the other components:

         CBUFFER buffer[] = { program.buffer[0] };
         LDC.F32 temp.x, buffer[16];
         LDC.F32 temp.y, buffer[28].x;
         LDC.F32 temp.z, buffer[32].x;
         LDC.F32 temp.w, buffer[40].x;
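
       The interaction between the single-component fetch, the ".x" swizzle,
       and the result write mask can be modeled in C.  This is an
       illustrative sketch only; the Vec4 type and the function names are
       hypothetical.

```c
typedef struct { float x, y, z, w; } Vec4;

/* Base operand vector of a single-component fetch such as LDC.F32: the
 * fetched value in x, zeroes in the remaining components. */
static Vec4 LoadF32(float value)
{
    Vec4 v = { value, 0.0f, 0.0f, 0.0f };
    return v;
}

/* The ".x" swizzle replicates the x component into all four components. */
static Vec4 SwizzleX(Vec4 v)
{
    Vec4 r = { v.x, v.x, v.x, v.x };
    return r;
}

/* Write only the components selected by the mask (bit 0 = x, bit 1 = y,
 * bit 2 = z, bit 3 = w), leaving the others unchanged. */
static void MaskedWrite(Vec4 *dst, Vec4 src, unsigned mask)
{
    if (mask & 1u) dst->x = src.x;
    if (mask & 2u) dst->y = src.y;
    if (mask & 4u) dst->z = src.z;
    if (mask & 8u) dst->w = src.w;
}
```

       In this model, MaskedWrite(&temp, SwizzleX(LoadF32(v)), 2u) stores
       <v> in temp.y, whereas omitting the swizzle would store the zero that
       fills the y component of the base operand vector.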


 Revision History

     Rev.    Date    Author    Changes
     ----  --------  --------  -----------------------------------------
      1              pbrown    Internal revisions.
      2    09/09/09  mjk       Assigned number