Name Strings
Pat Brown, NVIDIA Corporation (pbrown 'at'
Shipping (July 2009, Release 190)
Last Modified Date: 09/09/09
NVIDIA Revision: 2
OpenGL 2.0 is required.
NV_gpu_program4 is required.
NV_parameter_buffer_object is required.
This extension is written against the NV_gpu_program4 specification.
NV_shader_buffer_load trivially affects the definition of this extension.
This extension builds on the NV_parameter_buffer_object extension to
provide additional flexibility in sourcing data from buffer objects.
The original NV_parameter_buffer_object (PaBO) extension provided the
ability to bind buffer objects to a set of numbered binding points and
access them in assembly programs as though they were arrays of 32-bit
scalars (via the BUFFER variable type) or arrays of four-component vectors
with 32-bit scalar components (via the BUFFER4 variable type). However,
the functionality it provided had some significant limits on flexibility.
Since any given buffer binding point could be used either as a BUFFER or
BUFFER4, but not both, programs couldn't do both 32- and 128-bit fetches
from a single binding point. Additionally, no support was provided for
8-, 16-, or 64-bit fetches, though they could be emulated using larger
loads, with bitfield operations and/or write masking to put components in
the right places. Indexing was supported, but strides were limited to 4-
and 16-byte multiples, depending on whether BUFFER or BUFFER4 was used.
This new extension provides the buffer variable declaration type CBUFFER
to specify a buffer that is treated as an array of bytes, rather than an
array of words or vectors. The LDC instruction allows programs to extract
a vector of data from a CBUFFER variable, using a size and component count
specified in the opcode modifier. 1-, 2-, and 4-component fetches are
supported. The LDC instruction supports byte offsets using normal array
indexing mechanisms; both run-time and immediate offsets are supported.
Offsets used for a buffer object fetch are required to be aligned to the
size of the fetch (1, 2, 4, 8, or 16 bytes).
New Procedures and Functions
New Tokens
Additions to Chapter 2 of the OpenGL 3.0 Specification (OpenGL Operation)
(All modifications are relative to Section 2.X, GPU Programs, from the
NV_gpu_program4 specification.)
Modify Section 2.X.2, Program Grammar
(add after the long list of grammar rules) If a program specifies the
NV_parameter_buffer_object2 program option, the following rules are added
to the NV_gpu_program4 base program grammar:
<VECTORop> ::= "LDC"
<opModifier> ::= "F32";
| "F32X2";
| "F32X4";
| "S8";
| "S16";
| "S32";
| "S32X2";
| "S32X4";
| "U8";
| "U16";
| "U32";
| "U32X2";
| "U32X4";
<bufferDeclType> ::= "CBUFFER"
Modify Section 2.X.3.6, Program Parameter Buffers
(modify the paragraph describing the different types of parameter buffer
variable declarations to include support for "CBUFFER".)
Program parameter buffer variables are treated as an array of
single-component words if the <bufferDeclType> grammar rule matches
"BUFFER" or as an array of four-component vectors if it matches "BUFFER4".
Program parameter buffers may also be declared as an array of basic
machine units from which data can be extracted using the LDC (load
constant) instruction, if <bufferDeclType> matches "CBUFFER". Parameter
buffer variables declared using "CBUFFER" may not be used as an operand in
any instruction other than LDC, while "BUFFER" and "BUFFER4" variables may
not be used with LDC. A program will fail to load if a variable declared
as "BUFFER" and another variable declared as "BUFFER4" use the same buffer
binding point. There is no limitation on the use of "CBUFFER" variables
in conjunction with "BUFFER" or "BUFFER4" variables using the same buffer
binding point.
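The binding-point rule above can be sketched as a small load-time check. This is only an illustration: the type and function names (DeclTypeSketch, BufferVarSketch, buffer_decls_can_load) are hypothetical and do not appear in the specification or in any driver interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch; none of them come from the spec. */
typedef enum { DECL_BUFFER, DECL_BUFFER4, DECL_CBUFFER } DeclTypeSketch;

typedef struct {
    DeclTypeSketch decl;   /* declaration keyword used for the variable */
    unsigned binding;      /* <a> in program.buffer[a] */
} BufferVarSketch;

/* Returns 1 if the declarations may coexist, or 0 if a BUFFER variable and
 * a BUFFER4 variable share a binding point (the program fails to load).
 * CBUFFER variables never conflict with anything. */
int buffer_decls_can_load(const BufferVarSketch *vars, size_t count)
{
    assert(vars != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        for (size_t j = i + 1; j < count; j++) {
            int mixed =
                (vars[i].decl == DECL_BUFFER && vars[j].decl == DECL_BUFFER4) ||
                (vars[i].decl == DECL_BUFFER4 && vars[j].decl == DECL_BUFFER);
            if (mixed && vars[i].binding == vars[j].binding)
                return 0;
        }
    }
    return 1;
}
```

Under this sketch, a BUFFER and a CBUFFER variable on binding point 0 pass the check, while a BUFFER and a BUFFER4 variable on binding point 0 do not.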
(modify/restructure the paragraph describing basic program parameter
bindings to handle the byte bindings provided by "CBUFFER" variables)
If a program parameter buffer binding matches "program.buffer[a][b]", the
program parameter variable corresponds to element <b> of the buffer object
bound to binding point <a>. Each element of the bound buffer object is
treated as:
* a single basic machine unit of data, if the variable is declared as
"CBUFFER";
* a single word of data that can hold an integer or floating-point
value, if the variable is declared as "BUFFER"; or
* four words of data that can hold integer or floating-point values, if
the variable is declared as "BUFFER4".
When a binding corresponding to a "BUFFER" variable is used as an operand,
the selected word is broadcast to all four components of the variable.
When a binding corresponding to a "BUFFER4" variable is used as an
operand, the four components of the selected buffer element are loaded
into the variable. A binding corresponding to a "CBUFFER" variable may be
used only in the LDC instruction, and will be used there as a pointer to
extract operand values from buffer memory. If no buffer object is bound
to binding point <a>, or the bound buffer object is not large enough to
hold element <b>, the values used are undefined. The binding point <a>
must be a nonnegative integer constant.
Modify Section 2.X.4, Program Execution Environment
(Add to the set of opcodes in Table X.13)
Instruction  F I C S H D  Out  Inputs  Description
-----------  - - - - - -  ---  ------  --------------------------------
LDC          X X X X - F  v    v       load from constant buffer
Modify Section 2.X.4.1, Program Instruction Modifiers
(Add to Table X.14, Instruction Modifiers, and to the corresponding
description following the table)
Modifier Description
-------- -----------------------------------------------
F32 Access one 32-bit floating-point value
F32X2 Access two 32-bit floating-point values
F32X4 Access four 32-bit floating-point values
S8 Access one 8-bit signed integer value
S16 Access one 16-bit signed integer value
S32 Access one 32-bit signed integer value
S32X2 Access two 32-bit signed integer values
S32X4 Access four 32-bit signed integer values
U8 Access one 8-bit unsigned integer value
U16 Access one 16-bit unsigned integer value
U32 Access one 32-bit unsigned integer value
U32X2 Access two 32-bit unsigned integer values
U32X4 Access four 32-bit unsigned integer values
For memory load operations, the "F32", "F32X2", "F32X4", "S8", "S16",
"S32", "S32X2", "S32X4", "U8", "U16", "U32", "U32X2", and "U32X4" storage
modifiers control how data are loaded from memory. Storage modifiers are
supported by the LDC and LOAD instructions and are covered in more detail
in the descriptions of these instructions. These instructions must
specify exactly one of these modifiers, and may not specify any of the
base data type modifiers (F,U,S) described above. The base data type of
the result vector of a LOAD or LDC instruction is trivially derived from
the storage modifier.
Add New Section 2.X.4.5, Program Memory Access
Programs may load from buffer object memory via the LDC (load constant)
and LOAD (global load) instructions.
Load instructions read 8, 16, 32, 64, or 128 bits of data from a source
address to produce a four-component vector, according to the storage
modifier specified with the instruction. The storage modifier has three
parts:
- a base data type, "F", "S", or "U", specifying that the instruction
fetches floating-point, signed integer, or unsigned integer values,
respectively;
- a component size, specifying that the components fetched by the
instruction have 8, 16, or 32 bits; and
- an optional component count, where "X2" and "X4" indicate that two or
four components be fetched, and no count indicates a single component
fetch.
When the storage modifier specifies that fewer than four components should
be fetched, remaining components are filled with zeroes. When performing
a global load (LOAD), the GPU address is specified as an instruction
operand. When performing a constant buffer load (LDC), the GPU address is
derived by adding the base address of the bound buffer object to an offset
specified as an instruction operand. Given a GPU address <address> and a
storage modifier <modifier>, the memory load can be described by the
following code:
result_t_vec BufferMemoryLoad(char *address, OpModifier modifier)
{
    /* Components that are not fetched keep this zero fill. */
    result_t_vec result = { 0, 0, 0, 0 };
    switch (modifier) {
    case F32:
        result.x = ((float32_t *)address)[0];
        break;
    case F32X2:
        result.x = ((float32_t *)address)[0];
        result.y = ((float32_t *)address)[1];
        break;
    case F32X4:
        result.x = ((float32_t *)address)[0];
        result.y = ((float32_t *)address)[1];
        result.z = ((float32_t *)address)[2];
        result.w = ((float32_t *)address)[3];
        break;
    case S8:
        result.x = ((int8_t *)address)[0];
        break;
    case S16:
        result.x = ((int16_t *)address)[0];
        break;
    case S32:
        result.x = ((int32_t *)address)[0];
        break;
    case S32X2:
        result.x = ((int32_t *)address)[0];
        result.y = ((int32_t *)address)[1];
        break;
    case S32X4:
        result.x = ((int32_t *)address)[0];
        result.y = ((int32_t *)address)[1];
        result.z = ((int32_t *)address)[2];
        result.w = ((int32_t *)address)[3];
        break;
    case U8:
        result.x = ((uint8_t *)address)[0];
        break;
    case U16:
        result.x = ((uint16_t *)address)[0];
        break;
    case U32:
        result.x = ((uint32_t *)address)[0];
        break;
    case U32X2:
        result.x = ((uint32_t *)address)[0];
        result.y = ((uint32_t *)address)[1];
        break;
    case U32X4:
        result.x = ((uint32_t *)address)[0];
        result.y = ((uint32_t *)address)[1];
        result.z = ((uint32_t *)address)[2];
        result.w = ((uint32_t *)address)[3];
        break;
    }
    return result;
}
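As an illustration of the zero fill and of how narrow integer loads widen, the following compilable sketch reimplements three of the cases on the CPU. It assumes an integer-only result vector with 32-bit components; IVec4Sketch and the load_*_sketch functions are hypothetical names, and the reading that S8 sign-extends while U8 zero-extends is inferred from the casts in the pseudocode rather than stated separately by the specification.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical integer-only stand-in for the spec's result_t_vec. */
typedef struct { int32_t x, y, z, w; } IVec4Sketch;

/* S8: one signed byte, sign-extended into x; y, z, and w stay zero. */
IVec4Sketch load_s8_sketch(const unsigned char *address)
{
    IVec4Sketch result = { 0, 0, 0, 0 };
    int8_t v;
    assert(address != NULL);
    memcpy(&v, address, sizeof v);   /* memcpy avoids aliasing problems */
    result.x = v;
    return result;
}

/* U8: one unsigned byte, zero-extended into x. */
IVec4Sketch load_u8_sketch(const unsigned char *address)
{
    IVec4Sketch result = { 0, 0, 0, 0 };
    uint8_t v;
    assert(address != NULL);
    memcpy(&v, address, sizeof v);
    result.x = v;
    return result;
}

/* S32X2: two signed words into x and y; z and w stay zero. */
IVec4Sketch load_s32x2_sketch(const unsigned char *address)
{
    IVec4Sketch result = { 0, 0, 0, 0 };
    int32_t v[2];
    assert(address != NULL);
    memcpy(v, address, sizeof v);
    result.x = v[0];
    result.y = v[1];
    return result;
}
```

With a source byte of 0xFF, the S8 sketch yields x = -1 and the U8 sketch yields x = 255; in both cases y, z, and w remain zero.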
The offset used for the constant buffer loads must be aligned to the fetch
size corresponding to the storage opcode modifier. For S8 and U8, the
offset has no alignment requirements. For S16 and U16, the offset must be
a multiple of two basic machine units. For F32, S32, and U32, the offset
must be a multiple of four. For F32X2, S32X2, and U32X2, the offset must
be a multiple of eight. For F32X4, S32X4, and U32X4, the offset must be a
multiple of sixteen. If an offset is not correctly aligned, the values
returned by a constant buffer load will be undefined.
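The alignment rules above reduce to "the offset must be a multiple of the total fetch size". A minimal sketch of that check follows; the enum and function names (StorageModifierSketch, ldc_fetch_size, ldc_offset_is_aligned) are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enum mirroring the storage modifiers listed in the spec. */
typedef enum {
    MOD_F32, MOD_F32X2, MOD_F32X4,
    MOD_S8, MOD_S16, MOD_S32, MOD_S32X2, MOD_S32X4,
    MOD_U8, MOD_U16, MOD_U32, MOD_U32X2, MOD_U32X4
} StorageModifierSketch;

/* Total fetch size in bytes: component size times component count. */
size_t ldc_fetch_size(StorageModifierSketch m)
{
    switch (m) {
    case MOD_S8:    case MOD_U8:                    return 1;
    case MOD_S16:   case MOD_U16:                   return 2;
    case MOD_F32:   case MOD_S32:   case MOD_U32:   return 4;
    case MOD_F32X2: case MOD_S32X2: case MOD_U32X2: return 8;
    case MOD_F32X4: case MOD_S32X4: case MOD_U32X4: return 16;
    }
    return 0; /* not a storage modifier */
}

/* Returns 1 if a byte offset satisfies the alignment rule for modifier m. */
int ldc_offset_is_aligned(size_t offset, StorageModifierSketch m)
{
    size_t size = ldc_fetch_size(m);
    assert(size != 0);
    return offset % size == 0;
}
```

For example, offset 40 is acceptable for U32X2 (a multiple of eight) but offset 44 is not, even though 44 is a valid offset for U32.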
Modify Section 2.X.6, Program Options
+ Extended Parameter Buffer Object Support (NV_parameter_buffer_object2)
If a program specifies the "NV_parameter_buffer_object2" option, it may
use the CBUFFER statement to declare program parameter buffer variables
and the LDC instruction to load data from parameter buffer variables using
arbitrary offsets.
Modify Section 2.X.8, Program Instruction Set
Section 2.X.8.Z, LDC: Load from Constant Buffer
The LDC instruction loads a vector operand from a buffer object to yield a
result vector. The operand used for the LDC instruction must correspond
to a parameter buffer variable declared using the "CBUFFER" statement; a
program will fail to load if any other type of operand is used in an LDC
instruction.
  result = BufferMemoryLoad(&op0, storageModifier);
A base operand vector is fetched from memory as described in Section
2.X.4.5, with the GPU address derived from the binding corresponding to
the operand. A final operand vector is derived from the base operand
vector by applying swizzle, negation, and absolute value operand modifiers
as described in Section 2.X.4.2.
The amount of memory in any given buffer object binding accessible by the
LDC instruction may be limited. If any component fetched by the LDC
instruction extends 4*<n> or more basic machine units from the beginning
of the buffer object binding, where <n> is the implementation-dependent
constant MAX_PROGRAM_PARAMETER_BUFFER_SIZE_NV, the value fetched for that
component will be undefined.
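The size limit can be sketched as a per-component range check. This example assumes one particular reading of the rule: a component is treated as accessible only if it lies entirely within the first 4*<n> basic machine units of the binding. The function name ldc_component_in_range and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* offset:     byte offset of the fetch within the buffer binding
 * comp_index: which component of the fetch (0 for x, 1 for y, ...)
 * comp_size:  size of one component in bytes (1, 2, or 4)
 * max_words:  value of MAX_PROGRAM_PARAMETER_BUFFER_SIZE_NV
 * Returns 1 if the component is defined under the assumed reading. */
int ldc_component_in_range(size_t offset, size_t comp_index,
                           size_t comp_size, size_t max_words)
{
    size_t end; /* one past the last byte of this component */
    assert(comp_size == 1 || comp_size == 2 || comp_size == 4);
    end = offset + (comp_index + 1) * comp_size;
    return end <= 4 * max_words;
}
```

With a limit of 16384 words, a 32-bit fetch at offset 65532 is in range under this reading, while any fetch at offset 65536 or beyond is not.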
LDC supports no base data type modifiers, but requires exactly one storage
modifier. The base data types of the operand and result vectors are
derived from the storage modifier.
Additions to Chapter 3 of the OpenGL 3.0 Specification (Rasterization)
Additions to Chapter 4 of the OpenGL 3.0 Specification (Per-Fragment
Operations and the Frame Buffer)
Additions to Chapter 5 of the OpenGL 3.0 Specification (Special Functions)
Additions to Chapter 6 of the OpenGL 3.0 Specification (State and
State Requests)
Additions to Appendix A of the OpenGL 3.0 Specification (Invariance)
Additions to the AGL/GLX/WGL Specifications
No new errors.
Dependencies on NV_shader_buffer_load
If NV_shader_buffer_load (or equivalent functionality) is not supported,
references to the "LOAD" opcode in the description of the opcode modifiers
for "LDC" should be removed.
New State
New Implementation Dependent State
(1) What sort of alignment requirements, if any, should be imposed on the
operand provided to the LDC instruction?
RESOLVED: The offset of the operand must be aligned according to the
size of the fetch. For 1-, 2-, and 4-component fetches, the offset must
be a multiple of <N>, 2*<N>, and 4*<N>, where <N> is the size in bytes
of the components being fetched.
(2) NV_parameter_buffer_object provides an implementation-dependent limit
on the portion of a buffer object that may be fetched via BUFFER and
BUFFER4 variables. Should the same limits apply to the LDC
instruction?
RESOLVED: Yes. On currently shipping NVIDIA GPUs, the maximum program
parameter buffer size is 16384 32-bit words, or 64KB. Buffers larger
than 64KB may be used, but any fetches accessing memory beyond the first
64KB of a buffer binding will return undefined values.
(3) Should we support fetches of 3-component vectors? If so, what should
be the minimum alignment for the specified offset?
RESOLVED: No, we'll leave 3-component vectors out of this extension.
This limitation can be worked around either by doing three separate
single-component fetches or a four-component fetch with an appropriate
write mask. The former approach supports indexing in a tightly packed
array of 3-component vectors; the latter would require that array
elements be padded to four components.
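The offset arithmetic behind the two workarounds can be sketched as follows, assuming 32-bit components. The function names are hypothetical. A tightly packed element has a 12-byte stride, so only 4-byte-aligned single-component fetches are guaranteed to be legal; a four-component fetch needs a 16-byte stride.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of component c (0..2) of element i in a tightly packed array
 * of 3-component, 32-bit vectors. Intended for three separate
 * single-component fetches (for example, LDC.F32). */
size_t packed_vec3_component_offset(size_t i, size_t c)
{
    assert(c < 3);
    return i * 12 + c * 4;
}

/* Byte offset of element i when each element is padded to four components.
 * Intended for one four-component fetch with a write mask. */
size_t padded_vec3_offset(size_t i)
{
    return i * 16;
}
```

Element 1 of the packed array starts at byte 12, which is a multiple of four but not of sixteen, so it cannot be read with a single X4 fetch; the padded layout places the same element at byte 16.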
(4) Should we support fetches of 8- and 16-bit components?
RESOLVED: Yes, we will support fetches of 8- and 16-bit signed and
unsigned integers.
Fetches of vectors of 8- and 16-bit integers are not supported but may
be emulated by performing shift/mask operations on the results of 32-bit
fetches.
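The shift/mask emulation of an 8-bit vector can be sketched on the CPU as follows. The sketch assumes the byte at the lowest address occupies the low-order bits of the fetched 32-bit word (little-endian layout); the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unpack four unsigned 8-bit values from one 32-bit word. */
void unpack_u8x4_sketch(uint32_t word, uint32_t out[4])
{
    assert(out != NULL);
    for (int i = 0; i < 4; i++)
        out[i] = (word >> (8 * i)) & 0xFFu;
}

/* Unpack four signed 8-bit values, sign-extending each to 32 bits. */
void unpack_s8x4_sketch(uint32_t word, int32_t out[4])
{
    assert(out != NULL);
    for (int i = 0; i < 4; i++) {
        int32_t byte = (int32_t)((word >> (8 * i)) & 0xFFu);
        out[i] = (byte & 0x80) ? byte - 256 : byte;
    }
}
```

For the word 0x80FF0102, the unsigned unpack yields (2, 1, 255, 128) and the signed unpack yields (2, 1, -1, -128).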
Fetches of 16-bit floating-point values, or floating-point vectors
thereof, are not supported. A single fp16 fetch may be emulated using a
16-bit unsigned integer fetch and the UP2H instruction to convert the 16
LSBs of the fetch to a floating-point value. The encoding of 16-bit
floating-point values is described in section 2.1.2 of the OpenGL 3.0
specification.
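For reference, the conversion applied to the 16 LSBs can be modeled on the CPU as below. This is a sketch of the standard 16-bit floating-point encoding (1 sign bit, 5 exponent bits, 10 mantissa bits), not of the hardware; the function name half_bits_to_float is hypothetical, and the sketch assumes a 32-bit IEEE 754 float.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Convert the bit pattern of a 16-bit float to a 32-bit float. */
float half_bits_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    float f;

    assert(sizeof(float) == sizeof(uint32_t));
    if (exp == 0x1Fu) {
        /* Infinity or NaN: all-ones exponent, mantissa carried over. */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {
        /* Normalized: rebias the exponent from 15 to 127. */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man != 0) {
        /* Denormal: shift until the implicit leading one appears. */
        exp = 113u;
        while ((man & 0x400u) == 0) {
            man <<= 1;
            exp--;
        }
        bits = sign | (exp << 23) | ((man & 0x3FFu) << 13);
    } else {
        bits = sign; /* signed zero */
    }
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

In assembly, the equivalent sequence would be a U16 fetch followed by UP2H, as described above; the C version only shows what value a given 16-bit pattern represents.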
(5) Should we support fetches of 64-bit components?
RESOLVED: No; the instruction set provided by NV_gpu_program4 does not
support 64-bit components anywhere. If future instructions support
64-bit components, this restriction should be removed.
(6) How should the operands of the LDC instruction be specified?
RESOLVED: We will create a new type of buffer variable ("CBUFFER"),
which defines an array of bytes to be fetched from. The type of fetch
to perform is specified by a storage modifier (as in
NV_shader_buffer_load). An offset relative to the buffer binding (in
bytes) may be specified using normal array indexing syntax, and an index
computed at run-time is supported.
Some examples:
CBUFFER buffer[] = { program.buffer[0] };
MOV.S i, 32; # computed offset of 32B
LDC.F32 result, buffer[12]; # (x,0,0,0) from bytes 12..15
LDC.F32X4 result, buffer[16]; # (x,y,z,w) from bytes 16..31
LDC.U8 result, buffer[i.x+3]; # (x,0,0,0) from byte 35
LDC.S32 result, buffer[i.x+12]; # (x,0,0,0) from bytes 44..47
LDC.U32X2 result, buffer[i.x+8]; # (x,y,0,0) from bytes 40..47
LDC.S16 result, buffer[i.x+2]; # (x,0,0,0) from bytes 34..35
We chose to provide the new buffer variable type (CBUFFER) rather than
reusing BUFFER or BUFFER4. For CBUFFER variables, "buffer[12]"
unambiguously specifies a 12-byte offset. For BUFFER or BUFFER4
variables, an operand of "buffer[12]" already has an existing meaning,
implying an offset of 12 words or vectors, which would be 48 or 192
bytes, respectively. Because we want to be able to fetch 8- and 16-bit
units, having an offset multiplied by four doesn't make sense. We could
have had LDC simply ignore the type of binding and always interpret an
index as a byte offset, but chose the new declaration type to avoid
confusion.
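The different meanings of the same index can be made concrete with a small sketch that maps an array index to a byte offset for each declaration type. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags for the three declaration keywords. */
typedef enum { KIND_CBUFFER, KIND_BUFFER, KIND_BUFFER4 } BufferKindSketch;

/* Byte offset implied by "buffer[index]" for each declaration type. */
size_t index_to_byte_offset(BufferKindSketch kind, size_t index)
{
    switch (kind) {
    case KIND_CBUFFER: return index;        /* array of bytes */
    case KIND_BUFFER:  return index * 4;    /* array of 32-bit words */
    case KIND_BUFFER4: return index * 16;   /* array of 4-word vectors */
    }
    assert(!"unknown declaration type");
    return 0;
}
```

An index of 12 therefore corresponds to byte 12, 48, or 192, depending on the declaration type.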
We also considered an approach where the buffer and offset were
specified in separate operands. That would be similar to texture, where
the coordinates and texture are specified separately. The first operand
would have been interpreted as an unsigned scalar specifying a byte
offset, the second operand would have specified a buffer variable
binding, and a pointer would be obtained by adding the two
operands. This would have looked something like:
BUFFER buffer[] = { program.buffer[0] };
LDC.S32X2 result, offset.x, buffer;
We chose not to implement this approach mainly because this syntax would
require specifying a new type of instruction; the syntax we adopted
simply reuses existing vector operand and indexing mechanisms.
Additionally, the syntax in this extension provides immediate offsets
for "free", which the offset-buffer syntax would not support directly
without additional new syntax. For example, to load a structure with a
pair of two-component vectors using offset-buffer syntax, you would have
to do something like:
BUFFER buffer[] = { program.buffer[0] };
TEMP offset;
LDC.S32X2 result1, offset.x, buffer;
ADD.U offset.x, offset.x, 8; # bump offset to second vector
LDC.S32X2 result2, offset.x, buffer;
(7) How should the fetches in the LDC instruction interact with other
operand modifiers (swizzle, absolute value, negation)? With result
modifiers (condition codes, saturation)?
RESOLVED: These features will be orthogonal. When any of these
modifiers are specified, the base data type to which they apply comes
from the storage modifier of the LDC instruction.
The LDC instruction is defined to produce a "base operand vector" from a
memory fetch. This isn't particularly different from normal operands,
where a base operand vector is derived from the binding corresponding to
the operand. In both cases, the components of this vector are swizzled
and have optional absolute value and negation operations performed to
produce a final vector operand, as is the case with other vector
operands.
If condition code operations or saturation are specified for the result
vector, these operations are performed using the appropriate data types.
(8) What happens if a non-zero base offset is specified for a CBUFFER
variable?
RESOLVED: A subset of the bytes in a buffer object can be specified
using range syntax like the following:
CBUFFER buffer[] = { program.buffer[0][16..31] };
The sub-range need not start at the beginning of the buffer object; in
the example above, it starts 16 bytes into the buffer. When accessing a
parameter buffer variable corresponding to such a sub-range, an array
index is relative to the base of the sub-range. So the offset of the
sub-range is effectively added to the index used for the LDC operand:
LDC.F32 result, buffer[12]; # (x,0,0,0) from bytes 28..31
(9) What happens if a non-array CBUFFER variable is used?
RESOLVED: A non-array variable may be used with LDC. However, array
indexing isn't supported with non-array variables, so all LDC loads
using that variable will fetch using the same base address.
CBUFFER buffer = program.buffer[0][32];
LDC.U8 result, buffer; # (x,0,0,0) from byte 32
LDC.S16 result, buffer; # (x,0,0,0) from bytes 32..33
LDC.F32 result, buffer; # (x,0,0,0) from bytes 32..35
LDC.F32X4 result, buffer; # (x,y,z,w) from bytes 32..47
(10) Should single-component fetches from LDC smear their results across
all four components of the result vector, to allow packing multiple
non-vectors into a single vector?
RESOLVED: No. However, swizzle suffixes on the operand will provide
this capability for free. For example, let's say you wanted to fetch
four scalars from a buffer and pack the results into a single temporary
vector. The swizzle syntax lets you do this by smearing the real
component (always fetched in "x") into the other components:
CBUFFER buffer[] = { program.buffer[0] };
LDC.F32 temp.x, buffer[16];
LDC.F32 temp.y, buffer[28].x;
LDC.F32 temp.z, buffer[32].x;
LDC.F32 temp.w, buffer[40].x;
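The interaction between the ".x" operand swizzle and the result write mask can be modeled with a short sketch. The types and functions here (Vec4Sketch, swizzle_sketch, write_masked_sketch) are hypothetical and only imitate the operand and result handling described above.

```c
#include <assert.h>

typedef struct { float c[4]; } Vec4Sketch;  /* c[0..3] = x, y, z, w */

/* Apply an operand swizzle: sel[i] names the source component
 * (0 = x ... 3 = w) routed to component i of the operand. */
Vec4Sketch swizzle_sketch(Vec4Sketch v, const int sel[4])
{
    Vec4Sketch r;
    for (int i = 0; i < 4; i++) {
        assert(sel[i] >= 0 && sel[i] < 4);
        r.c[i] = v.c[sel[i]];
    }
    return r;
}

/* Write only the components selected by the mask (bit i = component i). */
void write_masked_sketch(Vec4Sketch *dst, Vec4Sketch src, unsigned mask)
{
    for (int i = 0; i < 4; i++)
        if (mask & (1u << i))
            dst->c[i] = src.c[i];
}
```

A single-component fetch produces (x,0,0,0). Writing that to temp.y without a swizzle stores the zero from the y component; with the ".x" swizzle the operand becomes (x,x,x,x), so the masked write stores the fetched value. The write to temp.x needs no swizzle because the fetched value is already in x.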
Revision History
Rev. Date Author Changes
---- -------- -------- -----------------------------------------
1 pbrown Internal revisions.
2 09/09/09 mjk Assigned number