Name
NV_gpu_program5
Name Strings
GL_NV_gpu_program5
GL_NV_gpu_program_fp64
Contact
Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com)
Status
Shipping.
Version
Last Modified Date: 09/11/2014
NVIDIA Revision: 7
Number
388
Dependencies
OpenGL 2.0 is required.
This extension is written against the OpenGL 3.0 specification.
NV_gpu_program4 and NV_gpu_program4_1 are required.
NV_shader_buffer_load is required.
NV_shader_buffer_store is required.
This extension is written against and interacts with the NV_gpu_program4,
NV_vertex_program4, NV_geometry_program4, and NV_fragment_program4
specifications.
This extension interacts with NV_tessellation_program5.
This extension interacts with ARB_transform_feedback3.
This extension interacts trivially with NV_shader_buffer_load.
This extension interacts trivially with NV_shader_buffer_store.
This extension interacts trivially with NV_parameter_buffer_object2.
This extension interacts trivially with OpenGL 3.3, ARB_texture_swizzle,
and EXT_texture_swizzle.
This extension interacts trivially with ARB_blend_func_extended.
This extension interacts trivially with EXT_shader_image_load_store.
This extension interacts trivially with ARB_shader_subroutine.
If the 64-bit floating-point portion of this extension is not supported,
"GL_NV_gpu_program_fp64" will not be found in the extension string.
Overview
This specification documents the common instruction set and basic
functionality provided by NVIDIA's 5th generation of assembly instruction
sets supporting programmable graphics pipeline stages.
The instruction set builds upon the basic framework provided by the
ARB_vertex_program and ARB_fragment_program extensions to expose
considerably more capable hardware. In addition to new capabilities for
vertex and fragment programs, this extension provides new functionality
for geometry programs as originally described in the NV_geometry_program4
specification, and serves as the basis for the new tessellation control
and evaluation programs described in the NV_tessellation_program5
extension.
Programs using the functionality provided by this extension should begin
with the program headers "!!NVvp5.0" (vertex programs), "!!NVtcp5.0"
(tessellation control programs), "!!NVtep5.0" (tessellation evaluation
programs), "!!NVgp5.0" (geometry programs), and "!!NVfp5.0" (fragment
programs).
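For illustration, a minimal fragment program using one of these headers
might look like the following. This is a hypothetical sketch; the texture
unit and attribute bindings are arbitrary.

    !!NVfp5.0
    # Sample a 2D texture bound to image unit 0 and write it out unchanged.
    TEMP color;
    TEX color, fragment.texcoord[0], texture[0], 2D;
    MOV result.color, color;
    END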
This extension provides a variety of new features, including:
* support for 64-bit integer operations;
* the ability to dynamically index into an array of texture units or
program parameter buffers;
* extending texel offset support to allow loading texel offsets from
regular integer operands computed at run-time, instead of requiring
that the offsets be constants encoded in texture instructions;
* extending TXG (texture gather) support to return the 2x2 footprint
from any component of the texture image instead of always returning
the first (x) component;
* extending TXG to support shadow comparisons in conjunction with a
depth texture, via the SHADOW* targets;
* further extending texture gather support to provide a new opcode
(TXGO) that applies a separate texel offset vector to each of the four
samples returned by the instruction;
* bit manipulation instructions, including ones to find the position of
the most or least significant set bit, bitfield insertion and
extraction, and bit reversal;
* a general data conversion instruction (CVT) supporting conversion
between any two data types supported by this extension; and
* new instructions to compute the composite of a set of boolean
conditions across a group of shader threads.
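As a sketch of the bit-manipulation instructions listed above (register
names are arbitrary; exact operand conventions are given in the
instruction descriptions below):

    TEMP R0, R1;
    BTC.U  R0.x, R1;    # count of set bits
    BTFL.U R0.y, R1;    # position of least significant set bit
    BTFM.U R0.z, R1;    # position of most significant set bit
    BFR.U  R0.w, R1;    # reverse bit order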
This extension also provides some new capabilities for individual program
types, including:
* support for instanced geometry programs, where a geometry program may
be run multiple times for each primitive;
* support for emitting vertices in a geometry program where each vertex
emitted may be directed at a specified vertex stream and captured
using the ARB_transform_feedback3 extension;
* support for interpolating an attribute at a programmable offset
relative to the pixel center (IPAO), at a programmable sample number
(IPAS), or at the fragment's centroid location (IPAC) in a fragment
program;
* support for reading a mask of covered samples in a fragment program;
* support for reading a point sprite coordinate directly in a fragment
program, without overriding a texture coordinate;
* support for reading patch primitives and per-patch attributes
(introduced by ARB_tessellation_shader) in a geometry program; and
* support for multiple output vectors for a single color output in a
fragment program (as used by ARB_blend_func_extended).
This extension also provides optional support for 64-bit-per-component
variables and 64-bit floating-point arithmetic. These features are
supported if and only if "NV_gpu_program_fp64" is found in the extension
string.
This extension incorporates the memory access operations from the
NV_shader_buffer_load and NV_parameter_buffer_object2 extensions,
originally built as add-ons to NV_gpu_program4. It also provides the
following new capabilities:
* support for the features without requiring a separate OPTION keyword;
* support for indexing into an array of constant buffers using the LDC
opcode added by NV_parameter_buffer_object2;
* support for storing into buffer objects at a specified GPU address
using the STORE opcode, and allowing applications to create READ_WRITE
and WRITE_ONLY mappings when making a buffer object resident using the
API mechanisms in the NV_shader_buffer_store extension;
* storage instruction modifiers to allow loading and storing 64-bit
component values;
* support for atomic memory transactions using the ATOM opcode, where
the instruction atomically reads the memory pointed to by a pointer,
performs a specified computation, stores the results of that
computation, and returns the original value read;
* support for memory barrier transactions using the MEMBAR opcode, which
ensures that all memory stores issued prior to the opcode complete
prior to any subsequent memory transactions; and
* a fragment program option to specify that depth and stencil tests are
performed prior to fragment program execution.
Additionally, the assembly program languages supported by this extension
include support for reading, writing, and performing atomic memory
operations on texture image data using the opcodes and mechanisms
documented in the "Dependencies on NV_gpu_program5" section of the
EXT_shader_image_load_store extension.
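The memory opcodes above might be used as in the following sketch,
assuming the components of <addr> already hold GPU addresses obtained via
the NV_shader_buffer_load API:

    TEMP addr, val, old;
    LOAD.F32X4  val, addr.x;          # read four floats from addr.x
    STORE.F32X4 val, addr.y;          # write them back to addr.y
    ATOM.ADD.U32 old.x, val, addr.z;  # atomic add at addr.z; returns old value
    MEMBAR;                           # complete prior stores before continuing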
New Procedures and Functions
None.
New Tokens
Accepted by the <pname> parameter of GetBooleanv, GetIntegerv,
GetFloatv, and GetDoublev:
MAX_GEOMETRY_PROGRAM_INVOCATIONS_NV 0x8E5A
MIN_FRAGMENT_INTERPOLATION_OFFSET_NV 0x8E5B
MAX_FRAGMENT_INTERPOLATION_OFFSET_NV 0x8E5C
FRAGMENT_PROGRAM_INTERPOLATION_OFFSET_BITS_NV 0x8E5D
MIN_PROGRAM_TEXTURE_GATHER_OFFSET_NV 0x8E5E
MAX_PROGRAM_TEXTURE_GATHER_OFFSET_NV 0x8E5F
Additions to Chapter 2 of the OpenGL 3.0 Specification (OpenGL Operation)
Modify Section 2.X.2 of NV_fragment_program4, Program Grammar
(modify the section, updating the program header string for the extended
instruction set)
Fragment programs are required to begin with the header string
"!!NVfp5.0". This header string identifies the subsequent program body as
being a fragment program and indicates that it should be parsed according
to the base NV_gpu_program5 grammar plus the additions below. Program
string parsing begins with the character immediately following the header
string.
(add/change the following rules to the NV_fragment_program4 and
NV_gpu_program5 base grammars)
<SpecialInstruction> ::= "IPAC" <opModifiers> <instResult> ","
<instOperandV>
| "IPAO" <opModifiers> <instResult> ","
<instOperandV> "," <instOperandV>
| "IPAS" <opModifiers> <instResult> ","
<instOperandV> "," <instOperandS>
<interpModifier> ::= "SAMPLE"
<attribBasic> ::= <fragPrefix> "sampleid"
| <fragPrefix> "samplemask"
| <fragPrefix> "pointcoord"
<resultBasic> ::= <resPrefix> "color" <resultOptColorNum>
<resultOptColorType>
| <resPrefix> "samplemask"
<resultOptColorType> ::= ""
| "." <colorType>
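For example, the three interpolation opcodes added to the fragment grammar
might be used as follows (a sketch; <offset> and <snum> are arbitrary
temporaries assumed to hold an offset vector and a sample number):

    TEMP attr, offset, snum;
    IPAC attr, fragment.texcoord[0];           # interpolate at centroid
    IPAO attr, fragment.texcoord[0], offset;   # interpolate at center + offset
    IPAS attr, fragment.texcoord[0], snum.x;   # interpolate at sample snum.x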
Modify Section 2.X.2 of NV_geometry_program4, Program Grammar
(modify the section, updating the program header string for the extended
instruction set)
Geometry programs are required to begin with the header string
"!!NVgp5.0". This header string identifies the subsequent program body as
being a geometry program and indicates that it should be parsed according
to the base NV_gpu_program5 grammar plus the additions below. Program
string parsing begins with the character immediately following the header
string.
(add the following rules to the NV_geometry_program4 and NV_gpu_program5
base grammars)
<declaration> ::= "INVOCATIONS" <int>
<declPrimInType> ::= "PATCHES"
<SpecialInstruction> ::= "EMITS" <instOperandS>
<attribBasic> ::= <primPrefix> "invocation"
| <primPrefix> "vertexcount"
| <attribTessOuter> <optArrayMemAbs>
| <attribTessInner> <optArrayMemAbs>
| <attribPatchGeneric> <optArrayMemAbs>
<attribMulti> ::= <attribTessOuter> <arrayRange>
| <attribTessInner> <arrayRange>
| <attribPatchGeneric> <arrayRange>
<attribTessOuter> ::= <primPrefix> "." "tessouter"
<attribTessInner> ::= <primPrefix> "." "tessinner"
<attribPatchGeneric> ::= <primPrefix> "." "patch" "." "attrib"
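A geometry program using these additions might begin as in the following
sketch (the declarations follow NV_geometry_program4 conventions; the
stream number in <stream.x> is assumed to have been computed earlier):

    !!NVgp5.0
    PRIMITIVE_IN TRIANGLES;
    PRIMITIVE_OUT POINTS;
    VERTICES_OUT 3;
    INVOCATIONS 2;                       # run twice per input primitive
    TEMP R0, stream;
    MOV R0.x, primitive.invocation.x;    # 0 on the first pass, 1 on the second
    ...
    EMITS stream.x;                      # emit a vertex to the selected stream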
Modify Section 2.X.2 of NV_vertex_program4, Program Grammar
(modify the section, updating the program header string for the extended
instruction set)
Vertex programs are required to begin with the header string "!!NVvp5.0".
This header string identifies the subsequent program body as being a
vertex program and indicates that it should be parsed according to the
base NV_gpu_program5 grammar plus the additions below. Program string
parsing begins with the character immediately following the header string.
Modify Section 2.X.2 of NV_gpu_program4, Program Grammar
(add the following grammar rules to the NV_gpu_program4 base grammar;
additional grammar rules usable for assembly programs are documented in
the EXT_shader_image_load_store and ARB_shader_subroutine specifications)
<instruction> ::= <MemInstruction>
<MemInstruction> ::= <ATOMop_instruction>
| <STOREop_instruction>
| <MEMBARop_instruction>
<VECTORop> ::= "BFR"
| "BTC"
| "BTFL"
| "BTFM"
| "PK64"
| "LDC"
| "CVT"
| "TGALL"
| "TGANY"
| "TGEQ"
| "UP64"
<SCALARop> ::= "LOAD"
<BINop> ::= "BFE"
<TRIop> ::= "BFI"
<TEXop_instruction> ::= <TEXop> <opModifiers> <instResult> ","
<instOperandV> "," <instOperandV> ","
<texAccess>
<TEXop> ::= "TXG"
| "LOD"
<TXDop> ::= "TXGO"
<ATOMop_instruction> ::= <ATOMop> <opModifiers> <instResult> ","
<instOperandV> "," <instOperandS>
<ATOMop> ::= "ATOM"
<STOREop_instruction> ::= <STOREop> <opModifiers> <instOperandV> ","
<instOperandS>
<STOREop> ::= "STORE"
<MEMBARop_instruction> ::= <MEMBARop> <opModifiers>
<MEMBARop> ::= "MEMBAR"
<opModifier> ::= "F16"
| "F32"
| "F64"
| "F32X2"
| "F32X4"
| "F64X2"
| "F64X4"
| "S8"
| "S16"
| "S32"
| "S32X2"
| "S32X4"
| "S64"
| "S64X2"
| "S64X4"
| "U8"
| "U16"
| "U32"
| "U32X2"
| "U32X4"
| "U64"
| "U64X2"
| "U64X4"
| "ADD"
| "MIN"
| "MAX"
| "IWRAP"
| "DWRAP"
| "AND"
| "OR"
| "XOR"
| "EXCH"
| "CSWAP"
| "COH"
| "ROUND"
| "CEIL"
| "FLR"
| "TRUNC"
| "PREC"
| "VOL"
<texAccess> ::= <textureUseS> "," <texTarget> <optTexOffset>
| <textureUseV> "," <texTarget> <optTexOffset>
<texTarget> ::= "ARRAYCUBE"
| "SHADOWARRAYCUBE"
<optTexOffset> ::= /* empty */
| <texOffset>
<texOffset> ::= "offset" "(" <instOperandV> ")"
<namingStatement> ::= <TEXTURE_statement>
<BUFFER_statement> ::= <bufferDeclType> <establishName>
<optArraySize> <optArraySize> "="
<bufferMultInit>
<bufferDeclType> ::= "CBUFFER"
<TEXTURE_statement> ::= "TEXTURE" <establishName> <texSingleInit>
| "TEXTURE" <establishName> <optArraySize>
<texMultipleInit>
<texSingleInit> ::= "=" <textureUseDS>
<texMultipleInit> ::= "=" "{" <texItemList> "}"
<texItemList> ::= <textureUseDM>
| <textureUseDM> "," <texItemList>
<bufferBinding> ::= "program" "." "buffer" <arrayRange>
<textureUseS> ::= <textureUseV> <texImageUnitComp>
<textureUseV> ::= <texImageUnit>
| <texVarName> <optArrayMem>
<textureUseDS> ::= "texture" <arrayMemAbs>
<textureUseDM> ::= <textureUseDS>
| "texture" <arrayRange>
<texImageUnitComp> ::= <scalarSuffix>
Modify Section 2.X.3.1, Program Variable Types
(IGNORE if GL_NV_gpu_program_fp64 is not found in the extension string.
Otherwise modify storage size modifiers to guarantee that "LONG"
variables are at least 64 bits in size.)
Explicitly declared variables may optionally have one storage size
modifier. Variables declared as "SHORT" will be represented using at least
16 bits per component. "SHORT" floating-point values will have at least 5
bits of exponent and 10 bits of mantissa. Variables declared as "LONG"
will be represented with at least 64 bits per component. "LONG"
floating-point values will have at least 11 bits of exponent and 52 bits
of mantissa. If no size modifier is provided, the GL will automatically
select component sizes. Implementations are not required to support more
than one component size, so "SHORT", "LONG", and the default could all
refer to the same component size. The "LONG" modifier is supported only
for declarations of temporary variables ("TEMP"), and attribute variables
("ATTRIB") in vertex programs. The "SHORT" modifier is supported only
for declarations of temporary variables and result variables ("OUTPUT").
Modify Section 2.X.3.2 of the NV_fragment_program4 specification, Program
Attribute Variables.
(Add a table entry and relevant text describing the fragment program
input sample mask variable.)
Fragment Attribute Binding Components Underlying State
-------------------------- ---------- ----------------------------
fragment.samplemask (m,-,-,-) fragment coverage mask
fragment.pointcoord (s,t,-,-) fragment point sprite coordinate
If a fragment attribute binding matches "fragment.samplemask", the "x"
component is filled with a coverage mask indicating the set of samples
covered by this fragment. The coverage mask is a bitfield, where bit <n>
is one if the sample number <n> is covered and zero otherwise. If
multisample buffers are not available (SAMPLE_BUFFERS is zero), bit zero
indicates if the center of the pixel corresponding to the fragment is
covered.
If a fragment attribute binding matches "fragment.pointcoord", the "x" and
"y" components are filled with the s and t point sprite coordinates
(section 3.3.1), respectively. The "z" and "w" components are undefined.
If the fragment is generated by any primitive other than a point, or if
point sprites are disabled, all four components of the binding are
undefined.
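A fragment program might read these bindings as in the following sketch:

    TEMP mask, pc;
    MOV mask.x, fragment.samplemask.x;   # bitfield of covered samples
    MOV pc.xy,  fragment.pointcoord;     # point sprite (s,t) coordinates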
Modify Section 2.X.3.2 of the NV_geometry_program4 specification, Program
Attribute Variables.
(Add a table entry and relevant text describing the geometry program
invocation attribute and per-patch attributes.)
Geometry Vertex Binding Components Description
----------------------------- ---------- ----------------------------
...
primitive.invocation (id,-,-,-) geometry program invocation
primitive.tessouter[n] (x,-,-,-) outer tess. level n
primitive.tessinner[n] (x,-,-,-) inner tess. level n
primitive.patch.attrib[n] (x,y,z,w) generic patch attribute n
primitive.tessouter[n..o] (x,-,-,-) outer tess. levels n to o
primitive.tessinner[n..o] (x,-,-,-) inner tess. levels n to o
primitive.patch.attrib[n..o] (x,y,z,w) generic patch attrib n to o
primitive.vertexcount (c,-,-,-) vertices in primitive
...
If a geometry attribute binding matches "primitive.invocation", the "x"
component is filled with an integer giving the number of previous
invocations of the geometry program on the primitive being processed. If
the geometry program is invoked only once per primitive (default), this
component will always be zero. If the program is invoked multiple times
(via the INVOCATIONS declaration), the component will be zero on the first
invocation, one on the second, and so forth. The "y", "z", and "w"
components of the variable are always undefined.
If an attribute binding matches "primitive.tessouter[n]", the "x"
component is filled with the per-patch outer tessellation level numbered
<n> of the input patch. <n> must be less than four. The "y", "z", and
"w" components are always undefined. A program will fail to load if this
attribute binding is used and the input primitive type is not PATCHES.
If an attribute binding matches "primitive.tessinner[n]", the "x"
component is filled with the per-patch inner tessellation level numbered
<n> of the input patch. <n> must be less than two. The "y", "z", and "w"
components are always undefined. A program will fail to load if this
attribute binding is used and the input primitive type is not PATCHES.
If an attribute binding matches "primitive.patch.attrib[n]", the "x", "y",
"z", and "w" components are filled with the corresponding components of
the per-patch generic attribute numbered <n> of the input patch. A
program will fail to load if this attribute binding is used and the input
primitive type is not PATCHES.
If an attribute binding matches "primitive.tessouter[n..o]",
"primitive.tessinner[n..o]", or "primitive.patch.attrib[n..o]", a sequence
of 1+<o>-<n> outer tessellation level, inner tessellation level, or
per-patch generic attribute bindings is created. For per-patch generic
attribute bindings, it is as though the sequence
"primitive.patch.attrib[n], primitive.patch.attrib[n+1], ...
primitive.patch.attrib[o]" were specified. These bindings are available
only in explicit declarations of array variables. A program will fail to
load if <n> is greater than <o> or the input primitive type is not
PATCHES.
If a geometry attribute binding matches "primitive.vertexcount", the "x"
component is filled with the number of vertices in the input primitive
being processed. The "y", "z", and "w" components of the variable are
always undefined.
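A geometry program reading patch inputs might declare them explicitly, as
in this sketch (array names are arbitrary):

    !!NVgp5.0
    PRIMITIVE_IN PATCHES;
    ATTRIB outer[] = { primitive.tessouter[0..3] };
    ATTRIB pattr[] = { primitive.patch.attrib[0..1] };
    TEMP R0;
    MOV R0.x, primitive.vertexcount.x;   # vertices in the input patch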
Modify Section 2.X.3.5, Program Results
(modify Table X.X)
Binding Components Description
----------------------------- ---------- ----------------------------
result.color[n].primary (r,g,b,a) primary color n (SRC_COLOR)
result.color[n].secondary (r,g,b,a) secondary color n (SRC1_COLOR)
Table X.X: Fragment Result Variable Bindings. Components labeled "*"
are unused. "[n]" is optional -- color <n> is used if specified; color
0 is used otherwise.
(add after third paragraph)
If a result variable binding matches "result.color[n].primary" or
"result.color[n].secondary" and the ARB_blend_func_extended option is
specified, updates to the "x", "y", "z", and "w" components of these color
result variables modify the "r", "g", "b", and "a" components of the
SRC_COLOR and SRC1_COLOR color outputs, respectively, for the fragment
output color numbered <n>. If the ARB_blend_func_extended program option
is not specified, the "result.color[n].primary" and
"result.color[n].secondary" bindings are unavailable.
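With the option enabled, a fragment program might write both outputs as
follows (sketch; the temporaries are arbitrary):

    !!NVfp5.0
    OPTION ARB_blend_func_extended;
    TEMP c0, c1;
    ...
    MOV result.color.primary,   c0;   # SRC_COLOR for blending
    MOV result.color.secondary, c1;   # SRC1_COLOR for blending
    END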
Modify Section 2.X.3.6, Program Parameter Buffers
(modify the description of parameter buffer arrays to require that all
bindings in an array declaration must use the same single buffer *or*
buffer range)
... Program parameter buffer variables may be declared as arrays, but all
bindings assigned to the array must use the same binding point or binding
point range, and must increase consecutively.
(add to the end of the section)
In explicit variable declarations, the bindings in Table X.12.1 of the
form "program.buffer[a..b]" may also be used, and indicate the variable
spans multiple buffer binding points. Such variables must be accessed as
arrays, with the first index specifying an offset into the range of
buffer object binding points. A buffer index of zero identifies binding
point <a>; an index of <b>-<a> identifies binding point <b>.  If such a
variable is declared as an array, a second index must be provided to
identify the individual array element.  A program will fail to load if
such bindings are used when <a> or <b> is negative or greater than or
equal to the number of buffer binding points supported for the program
type, or if <a> is greater than <b>. The bindings in Table X.12.1 may not
be used in implicit variable declarations.
Binding Components Underlying State
----------------------------- ---------- -----------------------------
program.buffer[a..b][c] (x,x,x,x) program parameter buffers a
through b, element c
program.buffer[a..b][c..d] (x,x,x,x) program parameter buffers a
through b, elements c
through d
program.buffer[a..b] (x,x,x,x) program parameter buffers a
through b, all elements
Table X.12.1: Program Parameter Buffer Array Bindings. <a> and <b>
indicate buffer numbers, <c> and <d> indicate individual elements.
When bindings beginning with "program.buffer[a..b]" are used in a variable
declaration, they behave identically to corresponding beginning with
"program.buffer[a]", except that the variable is filled with a separate
set of values for each buffer binding point from <a> to <b> inclusive.
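As an illustrative sketch (the exact declaration shape follows the
<BUFFER_statement> grammar rule above and is assumed here; names are
arbitrary), a variable spanning four buffer binding points might be
declared and then indexed dynamically with LDC:

    CBUFFER params[][] = { program.buffer[0..3] };
    TEMP R0, idx;
    LDC R0, params[idx.x][0];   # element 0 of the buffer selected by idx.x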
(add new section after Section 2.X.3.7, Program Condition Code Registers
and renumber subsequent sections accordingly)
Section 2.X.3.8, Program Texture Variables
Program texture variables are used as constants during program execution
and refer to the texture objects bound to one or more texture image units.
All texture variables have associated bindings and are read-only during
program execution. Texture variables retain their values across program
invocations, and the set of texture image units to which they refer is
constant. The texture object a variable refers to may be changed by
binding a new texture object to the appropriate target of the
corresponding texture image unit. Texture variables may only be used to
identify a texture object in texture instructions, and may not be used as
operands in any other instruction. Texture variables may be declared
explicitly via the <TEXTURE_statement> grammar rule, or implicitly by
using a texture image unit binding in an instruction.
Texture variables may be declared as arrays, but the list of
texture image units assigned to the array must increase consecutively.
Texture variables identify only a texture image unit; the corresponding
texture target (e.g., 1D, 2D, CUBE) and texture object are identified by
the <texTarget> grammar rule in instructions using the texture variable.
Binding Components Underlying State
--------------- ---------- ------------------------------------------
texture[a] x texture object bound to image unit a
texture[a..b] x texture objects bound to image units a
through b
Table X.12.2: Texture Image Unit Bindings. <a> and <b> indicate
texture image unit numbers.
If a texture binding matches "texture[a]", the texture variable is filled
with a single integer referring to texture image unit <a>.
If a texture binding matches "texture[a..b]", the texture variable is
filled with an array of integers referring to texture image units <a>
through <b>, inclusive.  A program will fail to load if <a> or <b> is
negative or greater than or equal to the number of texture image units
supported, or if <a> is greater than <b>.
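For example (a sketch; names are arbitrary), an explicitly declared
texture array can be indexed dynamically in a texture instruction:

    TEXTURE texs[4] = { texture[0..3] };
    TEMP R0, idx;
    TEX R0, fragment.texcoord[0], texs[idx.x], 2D;   # unit chosen at run time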
Modify Section 2.X.4, Program Execution Environment
(Update the instruction set table to include new columns to indicate the
first ISA supporting the instruction, and to indicate whether the
instruction supports 64-bit floating-point modifiers.)
Instr- Modifiers
uction V F I C S H D Out Inputs Description
------- -- - - - - - - --- -------- --------------------------------
ABS 40 6 6 X X X F v v absolute value
ADD 40 6 6 X X X F v v,v add
AND 40 - 6 X - - S v v,v bitwise and
ATOM 50 - - X - - - s v,su atomic memory transaction
BFE 50 - X X - - S v v,v bitfield extract
BFI 50 - X X - - S v v,v,v bitfield insert
BFR 50 - X X - - S v v bitfield reverse
BRK 40 - - - - - - - c break out of loop instruction
BTC 50 - X X - - S v v bit count
BTFL 50 - X X - - S v v find least significant bit
BTFM 50 - X X - - S v v find most significant bit
CAL 40 - - - - - - - c subroutine call
CEIL 40 6 6 X X X F v vf ceiling
CMP 40 6 6 X X X F v v,v,v compare
CONT 40 - - - - - - - c continue with next loop iteration
COS 40 X - X X X F s s cosine with reduction to [-PI,PI]
CVT 50 - - X X - F v v general data type conversion
DDX 40 X - X X X F v v derivative relative to X (fp-only)
DDY 40 X - X X X F v v derivative relative to Y (fp-only)
DIV 40 6 6 X X X F v v,s divide vector components by scalar
DP2 40 X - X X X F s v,v 2-component dot product
DP2A 40 X - X X X F s v,v,v 2-comp. dot product w/scalar add
DP3 40 X - X X X F s v,v 3-component dot product
DP4 40 X - X X X F s v,v 4-component dot product
DPH 40 X - X X X F s v,v homogeneous dot product
DST 40 X - X X X F v v,v distance vector
ELSE 40 - - - - - - - - start if test else block
EMIT 40 - - - - - - - - emit vertex stream 0 (gp-only)
EMITS 50 - X - - - S - s emit vertex to stream (gp-only)
ENDIF 40 - - - - - - - - end if test block
ENDPRIM 40 - - - - - - - - end of primitive (gp-only)
ENDREP 40 - - - - - - - - end of repeat block
EX2 40 X - X X X F s s exponential base 2
FLR 40 6 6 X X X F v vf floor
FRC 40 6 - X X X F v v fraction
I2F 40 - 6 X - - S vf v integer to float
IF 40 - - - - - - - c start of if test block
IPAC 50 X - X X - F v v interpolate at centroid (fp-only)
IPAO 50 X - X X - F v v,v interpolate w/offset (fp-only)
IPAS 50 X - X X - F v v,su interpolate at sample (fp-only)
KIL 40 X X - - X F - vc kill fragment
LDC 40 - - X X - F v v load from constant buffer
LG2 40 X - X X X F s s logarithm base 2
LIT 40 X - X X X F v v compute lighting coefficients
LOAD 40 - - X X - F v su global load
LOD 41 X - X X - F v vf,t compute texture LOD
LRP 40 X - X X X F v v,v,v linear interpolation
MAD 40 6 6 X X X F v v,v,v multiply and add
MAX 40 6 6 X X X F v v,v maximum
MEMBAR 50 - - - - - - - - memory barrier
MIN 40 6 6 X X X F v v,v minimum
MOD 40 - 6 X - - S v v,s modulus vector components by scalar
MOV 40 6 6 X X X F v v move
MUL 40 6 6 X X X F v v,v multiply
NOT 40 - 6 X - - S v v bitwise not
NRM 40 X - X X X F v v normalize 3-component vector
OR 40 - 6 X - - S v v,v bitwise or
PK2H 40 X X - - - F s vf pack two 16-bit floats
PK2US 40 X X - - - F s vf pack two floats as unsigned 16-bit
PK4B 40 X X - - - F s vf pack four floats as signed 8-bit
PK4UB 40 X X - - - F s vf pack four floats as unsigned 8-bit
PK64 50 X X - - - F v v pack 4x32-bit vectors to 2x64
POW 40 X - X X X F s s,s exponentiate
RCC 40 X - X X X F s s reciprocal (clamped)
RCP 40 6 - X X X F s s reciprocal
REP 40 6 6 - - X F - v start of repeat block
RET 40 - - - - - - - c subroutine return
RFL 40 X - X X X F v v,v reflection vector
ROUND 40 6 6 X X X F v vf round to nearest integer
RSQ 40 6 - X X X F s s reciprocal square root
SAD 40 - 6 X - - S vu v,v,vu sum of absolute differences
SCS 40 X - X X X F v s sine/cosine without reduction
SEQ 40 6 6 X X X F v v,v set on equal
SFL 40 6 6 X X X F v v,v set on false
SGE 40 6 6 X X X F v v,v set on greater than or equal
SGT 40 6 6 X X X F v v,v set on greater than
SHL 40 - 6 X - - S v v,s shift left
SHR 40 - 6 X - - S v v,s shift right
SIN 40 X - X X X F s s sine with reduction to [-PI,PI]
SLE 40 6 6 X X X F v v,v set on less than or equal
SLT 40 6 6 X X X F v v,v set on less than
SNE 40 6 6 X X X F v v,v set on not equal
SSG 40 6 - X X X F v v set sign
STORE 50 - - - - - - - v,su global store
STR 40 6 6 X X X F v v,v set on true
SUB 40 6 6 X X X F v v,v subtract
SWZ 40 X - X X X F v v extended swizzle
TEX 40 X X X X - F v vf,t texture sample
TGALL 50 X X X X - F v v test all non-zero in thread group
TGANY 50 X X X X - F v v test any non-zero in thread group
TGEQ 50 X X X X - F v v test all equal in thread group
TRUNC 40 6 6 X X X F v vf truncate (round toward zero)
TXB 40 X X X X - F v vf,t texture sample with bias
TXD 40 X X X X - F v vf,vf,vf,t texture sample w/partials
TXF 40 X X X X - F v vs,t texel fetch
TXFMS 40 X X X X - F v vs,t multisample texel fetch
TXG 41 X X X X - F v vf,t texture gather
TXGO 50 X X X X - F v vf,vs,vs,t texture gather w/per-texel offsets
TXL 40 X X X X - F v vf,t texture sample w/LOD
TXP 40 X X X X - F v vf,t texture sample w/projection
TXQ 40 - - - - - S vs vs,t texture info query
UP2H 40 X X X X - F vf s unpack two 16-bit floats
UP2US 40 X X X X - F vf s unpack two unsigned 16-bit integers
UP4B 40 X X X X - F vf s unpack four signed 8-bit integers
UP4UB 40 X X X X - F vf s unpack four unsigned 8-bit integers
UP64 50 X X X X - F v v unpack 2x64 vectors to 4x32
X2D 40 X - X X X F v v,v,v 2D coordinate transformation
XOR 40 - 6 X - - S v v,v exclusive or
XPD 40 X - X X X F v v,v cross product
Table X.13: Summary of NV_gpu_program5 instructions.
The "V" column indicates the first assembly language in the
NV_gpu_program4 family (if any) supporting the opcode. "41" and "50"
indicate NV_gpu_program4_1 and NV_gpu_program5, respectively.
The "Modifiers" columns specify the set of modifiers allowed for the
instruction:
F = floating-point data type modifiers
I = signed and unsigned integer data type modifiers
C = condition code update modifiers
S = clamping (saturation) modifiers
H = half-precision float data type suffix
D = default data type modifier (F, U, or S)
For the "F" and "I" columns, an "X" indicates support for both unsized
type modifiers and sized type modifiers with fewer than 64 bits. A "6"
indicates support for all modifiers, including 64-bit versions (when
supported).
The input and output columns describe the formats of the operands and
results of the instruction.
v: 4-component vector (data type is inherited from operation)
vf: 4-component vector (data type is always floating-point)
vs: 4-component vector (data type is always signed integer)
vu: 4-component vector (data type is always unsigned integer)
s: scalar (replicated if written to a vector destination;
data type is inherited from operation)
su: scalar (data type is always unsigned integer)
c: condition code test result (e.g., "EQ", "GT1.x")
vc: 4-component vector or condition code test
t: texture
Instructions labeled "fp-only" and "gp-only" are supported only for
fragment and geometry programs, respectively.
Modify Section 2.X.4.1, Program Instruction Modifiers
(Update the discussion of instruction precision modifiers. If
GL_NV_gpu_program_fp64 is not found in the extension string, the "F64"
instruction modifier described below is not supported.)
(add to Table X.14 of the NV_gpu_program4 specification.)
Modifier Description
-------- ---------------------------------------------------
F Floating-point operation
U Fixed-point operation, unsigned operands
S Fixed-point operation, signed operands
...
F32 Floating-point operation, 32-bit precision or
access one 32-bit floating-point value
F64 Floating-point operation, 64-bit precision or
access one 64-bit floating-point value
S32 Fixed-point operation, signed 32-bit operands or
access one 32-bit signed integer value
S64 Fixed-point operation, signed 64-bit operands or
access one 64-bit signed integer value
U32 Fixed-point operation, unsigned 32-bit operands or
access one 32-bit unsigned integer value
U64 Fixed-point operation, unsigned 64-bit operands or
access one 64-bit unsigned integer value
...
F32X2 Access two 32-bit floating-point values
F32X4 Access four 32-bit floating-point values
F64X2 Access two 64-bit floating-point values
F64X4 Access four 64-bit floating-point values
S8 Access one 8-bit signed integer value
S16 Access one 16-bit signed integer value
S32X2 Access two 32-bit signed integer values
S32X4 Access four 32-bit signed integer values
S64 Access one 64-bit signed integer value
S64X2 Access two 64-bit signed integer values
S64X4 Access four 64-bit signed integer values
U8 Access one 8-bit unsigned integer value
U16 Access one 16-bit unsigned integer value
U32 Access one 32-bit unsigned integer value
U32X2 Access two 32-bit unsigned integer values
U32X4 Access four 32-bit unsigned integer values
U64 Access one 64-bit unsigned integer value
U64X2 Access two 64-bit unsigned integer values
U64X4 Access four 64-bit unsigned integer values
ADD Perform add operation for ATOM
MIN Perform minimum operation for ATOM
MAX Perform maximum operation for ATOM
IWRAP Perform wrapping increment for ATOM
DWRAP Perform wrapping decrement for ATOM
AND Perform logical AND operation for ATOM
OR Perform logical OR operation for ATOM
XOR Perform logical XOR operation for ATOM
EXCH Perform exchange operation for ATOM
CSWAP Perform compare-and-swap operation for ATOM
COH Make LOAD and STORE operations use coherent caching
VOL Make LOAD and STORE operations treat memory as volatile
PREC Instruction results should be precise
ROUND Inexact conversion results round to nearest value (even)
CEIL Inexact conversion results round to larger value
FLR Inexact conversion results round to smaller value
TRUNC Inexact conversion results round to value closest to zero
"F", "U", and "S" modifiers are base data type modifiers and specify that
the instruction should operate on floating-point, unsigned integer, or
signed integer values, respectively. For example, "ADD.F", "ADD.U", and
"ADD.S" specify component-wise addition of floating-point, unsigned
integer, or signed integer vectors, respectively. While these modifiers
specify a data type, they do not specify an exact precision at which the
operation is performed. Floating-point and fixed-point operations will
typically be carried out at 32-bit precision, unless otherwise described
in the instruction documentation or overridden by the precision modifiers.
If all operands are represented with less than 32-bit precision (e.g.,
variables with the "SHORT" component size modifier), operations may be
carried out at a precision no less than the precision of the largest
operand used by the instruction. For some instructions, the data type of
some operands or of the result is fixed; in these cases, the data type
modifier specifies the data type of the remaining values.
Operands represented with fewer bits than used to perform the instruction
will be promoted to a larger data type. Signed integer operands will be
sign-extended, where the most significant bits are filled with ones if the
operand is negative and zero otherwise. Unsigned integer operands will be
zero-extended, where the most significant bits are always filled with
zeroes. Operands represented with more bits than used to perform the
instruction will be converted to lower precision. Floating-point
overflows result in IEEE infinity encodings; integer overflows result in
the truncation of the most significant bits.
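The promotion and truncation rules above can be sketched in C; the helper names below (sext8_to_32 and so on) are illustrative, not part of the spec:

```c
#include <stdint.h>

/* Sign extension: the most significant bits of the wider value are
   copies of the operand's sign bit. */
static int32_t sext8_to_32(int8_t v) { return (int32_t)v; }

/* Zero extension: the most significant bits are always zero. */
static uint32_t zext8_to_32(uint8_t v) { return (uint32_t)v; }

/* Integer narrowing truncates the most significant bits. */
static int8_t trunc32_to_8(int32_t v) { return (int8_t)(uint8_t)(v & 0xFF); }
```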
For arithmetic operations, the "F32", "F64", "U32", "U64", "S32", and
"S64" modifiers are precision-specific data type modifiers that specify
that floating-point, unsigned integer, or signed integer operations be
carried out with an internal precision of no less than 32 or 64 bits per
component, respectively. The "F64", "U64", and "S64" modifiers are
supported on only a subset of instructions, as documented in the
instruction table. The base data type of the instruction is trivially
derived from a precision-specific data type modifier, and an instruction
may not specify both base and precision-specific data type modifiers.
...
"SAT" and "SSAT" are clamping modifiers that generally specify that the
floating-point components of the instruction result should be clamped to
[0,1] or [-1,1], respectively, before updating the condition code and the
destination variable. If no clamping suffix is specified, unclamped
results will be used for condition code updates (if any) and destination
variable writes. Clamping modifiers are not supported on instructions
that do not produce floating-point results, with one exception.
...
For load and store operations, the "F32", "F32X2", "F32X4", "F64",
"F64X2", "F64X4", "S8", "S16", "S32", "S32X2", "S32X4", "S64", "S64X2",
"S64X4", "U8", "U16", "U32", "U32X2", "U32X4", "U64", "U64X2", and "U64X4"
storage modifiers control how data are loaded from or stored to memory.
Storage modifiers are supported by the ATOM, LDC, LOAD, and STORE
instructions and are covered in more detail in the descriptions of these
instructions. These instructions must specify exactly one of these
modifiers, and may not specify any of the base data type modifiers (F,U,S)
described above. The base data types of the result vector of a load
instruction or the first operand of a store instruction are trivially
derived from the storage modifier.
For atomic memory operations performed by the ATOM instruction, the "ADD",
"MIN", "MAX", "IWRAP", "DWRAP", "AND", "OR", "XOR", "EXCH", and "CSWAP"
modifiers specify the operation to perform on the memory being accessed,
and are described in more detail in the description of this instruction.
For load and store operations, the "COH" modifier controls whether the
operation uses a coherent level of the cache hierarchy, as described in
Section 2.X.4.5.
For load and store operations, the "VOL" modifier controls whether the
operation treats the memory being read or written as volatile.
Instructions modified with "VOL" will always read or write the underlying
memory, whether or not previous or subsequent loads and stores access the
same memory.
For arithmetic and logical operations, the "PREC" modifier controls
whether the instruction result should be treated as precise. For
instructions not qualified with ".PREC", the implementation may rearrange
the computations specified by the program instructions to execute more
efficiently, even if it may generate slightly different results in some
cases. For example, an implementation may combine a MUL instruction with
a dependent ADD instruction and generate code to execute a MAD
(multiply-add) instruction instead. The difference in rounding may
produce unacceptable artifacts for some algorithms. When ".PREC" is
specified, the instruction will be executed in a manner that always
generates the same result regardless of the program instructions that
precede or follow the instruction. Note that a ".PREC" modifier does not
affect the processing of any other instruction. For example, tagging an
instruction with ".PREC" does not mean that the instructions used to
generate the instruction's operands will be treated as precise unless
those instructions are also qualified with ".PREC".
For the CVT (data type conversion) instruction, the "F16", "F32", "F64",
"S8", "S16", "S32", "S64", "U8", "U16", "U32", and "U64" storage modifiers
specify the data type of the vector operand and the converted result. Two
storage modifiers must be provided, which specify the data type of the
result and the operand, respectively.
For the CVT (data type conversion) instruction, the "ROUND", "CEIL",
"FLR", and "TRUNC" modifiers specify how to round converted results that
are not directly representable using the data type of the result.
Modify Section 2.X.4.4, Program Texture Access
(Extend the language describing the operation of texel offsets to cover
the new capability to load texel offsets from a register. Otherwise,
this functionality is unchanged from previous extensions.)
<offset> is a 3-component signed integer vector, which can be specified
using constants embedded in the texture instruction according to the
<texOffsetImmed> grammar rule, or taken from a vector operand according to
the <texOffsetVar> grammar rule. The three components of the offset
vector are added to the computed <u>, <v>, and <w> texel locations prior
to sampling. When using a constant offset, one, two, or three components
may be specified in the instruction; if fewer than three are specified,
the remaining offset components are zero. If no offsets are specified,
all three components of the offset are treated as zero. A limited range
of offset values are supported; the minimum and maximum <texOffset> values
are implementation-dependent and given by MIN_PROGRAM_TEXEL_OFFSET_EXT and
MAX_PROGRAM_TEXEL_OFFSET_EXT, respectively. A program will fail to load:
* if the texture target specified in the instruction is 1D, ARRAY1D,
SHADOW1D, or SHADOWARRAY1D, and the second or third component of a
constant offset vector is non-zero;
* if the texture target specified in the instruction is 2D, RECT,
ARRAY2D, SHADOW2D, SHADOWRECT, or SHADOWARRAY2D, and the third
component of a constant offset vector is non-zero;
* if the texture target is CUBE, SHADOWCUBE, ARRAYCUBE, or
SHADOWARRAYCUBE, and any component of a constant offset vector is
non-zero -- texel offsets are not supported for cube map or buffer
textures;
* if any component of the constant offset vector of a TXGO instruction
is non-zero -- non-constant offsets are provided in separate operands;
* if any component of a constant offset vector is less than
MIN_PROGRAM_TEXEL_OFFSET_EXT or greater than
MAX_PROGRAM_TEXEL_OFFSET_EXT;
* if a TXD or TXGO instruction specifies a non-constant texel offset
according to the <texOffsetVar> grammar rule; or
* if any instruction specifies a non-constant texel offset according
to the <texOffsetVar> grammar rule and the texture target is CUBE,
SHADOWCUBE, ARRAYCUBE, or SHADOWARRAYCUBE.
The implementation-dependent minimum and maximum texel offset values also
apply to texel offsets taken from a vector operand, but out-of-bounds or
invalid component values will not prevent program loading since the
offsets may not be computed until the program is executed. Components of
the vector operand not needed for the texture target are ignored. The W
component of the offset vector is always ignored; the Z component of the
offset vector is ignored unless the target is 3D; the Y component is
ignored if the target is 1D, ARRAY1D, SHADOW1D, or SHADOWARRAY1D. If the
value of any non-ignored component of the vector operand is outside
implementation-dependent limits, the results of the texture lookup are
undefined. For all instructions except TXGO, the limits are
MIN_PROGRAM_TEXEL_OFFSET_EXT and MAX_PROGRAM_TEXEL_OFFSET_EXT. For the
TXGO instruction, the limits are MIN_PROGRAM_TEXTURE_GATHER_OFFSET_NV and
MAX_PROGRAM_TEXTURE_GATHER_OFFSET_NV.
(Modify language describing how the check for using multiple targets on a
single texture image unit works, to account for texture array variables
where a single instruction may access one of multiple textures and the
texture used is not known when the program is loaded.)
A program will fail to load if it attempts to sample from multiple texture
targets (including the SHADOW pseudo-targets) on the same texture image
unit. For example, a program containing any two of the following
instructions will fail to load:
TEX out, coord, texture[0], 1D;
TEX out, coord, texture[0], 2D;
TEX out, coord, texture[0], ARRAY2D;
TEX out, coord, texture[0], SHADOW2D;
TEX out, coord, texture[0], 3D;
For the purposes of this test, sampling using a texture variable declared
as an array is treated as though all texture image units bound to the
variable were accessed. A program containing the following
instructions would fail to load:
TEXTURE textures[] = { texture[0..3] };
TEX out, coord, textures[2], 2D; # acts as if all textures are used
TEX out, coord, texture[1], 3D;
(Add language describing texture gather component selection)
The TXG and TXGO instructions provide the ability to assemble a
four-component vector by taking the value of a single component of a
multi-component texture from each of four texels. The component selected
is identified by the <texImageUnitComp> grammar rule. Component selection
is not supported for any other instruction, and a program will fail to
load if <texImageUnitComp> is matched for any texture instruction other
than TXG or TXGO.
Add New Section 2.X.4.5, Program Memory Access
Programs may load from or store to buffer object memory via the ATOM
(atomic global memory operation), LDC (load constant), LOAD (global load),
and STORE (global store) instructions.
Load instructions read 8, 16, 32, 64, 128, or 256 bits of data from a
source address to produce a four-component vector, according to the
storage modifier specified with the instruction. The storage modifier has
three parts:
- a base data type, "F", "S", or "U", specifying that the instruction
fetches floating-point, signed integer, or unsigned integer values,
respectively;
- a component size, specifying that the components fetched by the
instruction have 8, 16, 32, or 64 bits; and
- an optional component count, where "X2" and "X4" indicate that two or
four components be fetched, and no count indicates a single component
fetch.
When the storage modifier specifies that fewer than four components should
be fetched, remaining components are filled with zeroes. When performing
an atomic memory operation (ATOM) or a global load (LOAD), the GPU address
is specified as an instruction operand. When performing a constant buffer
load (LDC), the GPU address is derived by adding the base address of the
bound buffer object to an offset specified as an instruction operand.
Given a GPU address <address> and a storage modifier <modifier>, the
memory load can be described by the following code:
result_t_vec BufferMemoryLoad(char *address, OpModifier modifier)
{
result_t_vec result = { 0, 0, 0, 0 };
switch (modifier) {
case F32:
result.x = ((float32_t *)address)[0];
break;
case F32X2:
result.x = ((float32_t *)address)[0];
result.y = ((float32_t *)address)[1];
break;
case F32X4:
result.x = ((float32_t *)address)[0];
result.y = ((float32_t *)address)[1];
result.z = ((float32_t *)address)[2];
result.w = ((float32_t *)address)[3];
break;
case F64:
result.x = ((float64_t *)address)[0];
break;
case F64X2:
result.x = ((float64_t *)address)[0];
result.y = ((float64_t *)address)[1];
break;
case F64X4:
result.x = ((float64_t *)address)[0];
result.y = ((float64_t *)address)[1];
result.z = ((float64_t *)address)[2];
result.w = ((float64_t *)address)[3];
break;
case S8:
result.x = ((int8_t *)address)[0];
break;
case S16:
result.x = ((int16_t *)address)[0];
break;
case S32:
result.x = ((int32_t *)address)[0];
break;
case S32X2:
result.x = ((int32_t *)address)[0];
result.y = ((int32_t *)address)[1];
break;
case S32X4:
result.x = ((int32_t *)address)[0];
result.y = ((int32_t *)address)[1];
result.z = ((int32_t *)address)[2];
result.w = ((int32_t *)address)[3];
break;
case S64:
result.x = ((int64_t *)address)[0];
break;
case S64X2:
result.x = ((int64_t *)address)[0];
result.y = ((int64_t *)address)[1];
break;
case S64X4:
result.x = ((int64_t *)address)[0];
result.y = ((int64_t *)address)[1];
result.z = ((int64_t *)address)[2];
result.w = ((int64_t *)address)[3];
break;
case U8:
result.x = ((uint8_t *)address)[0];
break;
case U16:
result.x = ((uint16_t *)address)[0];
break;
case U32:
result.x = ((uint32_t *)address)[0];
break;
case U32X2:
result.x = ((uint32_t *)address)[0];
result.y = ((uint32_t *)address)[1];
break;
case U32X4:
result.x = ((uint32_t *)address)[0];
result.y = ((uint32_t *)address)[1];
result.z = ((uint32_t *)address)[2];
result.w = ((uint32_t *)address)[3];
break;
case U64:
result.x = ((uint64_t *)address)[0];
break;
case U64X2:
result.x = ((uint64_t *)address)[0];
result.y = ((uint64_t *)address)[1];
break;
case U64X4:
result.x = ((uint64_t *)address)[0];
result.y = ((uint64_t *)address)[1];
result.z = ((uint64_t *)address)[2];
result.w = ((uint64_t *)address)[3];
break;
}
return result;
}
Store instructions write the contents of a four-component vector operand
into 8, 16, 32, 64, 128, or 256 bits, according to the storage modifier
specified with the instruction. The storage modifiers supported by stores
are identical to those supported for loads. Given a GPU address
<address>, a vector operand <operand> containing the data to be stored,
and a storage modifier <modifier>, the memory store can be described by
the following code:
void BufferMemoryStore(char *address, operand_t_vec operand,
OpModifier modifier)
{
switch (modifier) {
case F32:
((float32_t *)address)[0] = operand.x;
break;
case F32X2:
((float32_t *)address)[0] = operand.x;
((float32_t *)address)[1] = operand.y;
break;
case F32X4:
((float32_t *)address)[0] = operand.x;
((float32_t *)address)[1] = operand.y;
((float32_t *)address)[2] = operand.z;
((float32_t *)address)[3] = operand.w;
break;
case F64:
((float64_t *)address)[0] = operand.x;
break;
case F64X2:
((float64_t *)address)[0] = operand.x;
((float64_t *)address)[1] = operand.y;
break;
case F64X4:
((float64_t *)address)[0] = operand.x;
((float64_t *)address)[1] = operand.y;
((float64_t *)address)[2] = operand.z;
((float64_t *)address)[3] = operand.w;
break;
case S8:
((int8_t *)address)[0] = operand.x;
break;
case S16:
((int16_t *)address)[0] = operand.x;
break;
case S32:
((int32_t *)address)[0] = operand.x;
break;
case S32X2:
((int32_t *)address)[0] = operand.x;
((int32_t *)address)[1] = operand.y;
break;
case S32X4:
((int32_t *)address)[0] = operand.x;
((int32_t *)address)[1] = operand.y;
((int32_t *)address)[2] = operand.z;
((int32_t *)address)[3] = operand.w;
break;
case S64:
((int64_t *)address)[0] = operand.x;
break;
case S64X2:
((int64_t *)address)[0] = operand.x;
((int64_t *)address)[1] = operand.y;
break;
case S64X4:
((int64_t *)address)[0] = operand.x;
((int64_t *)address)[1] = operand.y;
((int64_t *)address)[2] = operand.z;
((int64_t *)address)[3] = operand.w;
break;
case U8:
((uint8_t *)address)[0] = operand.x;
break;
case U16:
((uint16_t *)address)[0] = operand.x;
break;
case U32:
((uint32_t *)address)[0] = operand.x;
break;
case U32X2:
((uint32_t *)address)[0] = operand.x;
((uint32_t *)address)[1] = operand.y;
break;
case U32X4:
((uint32_t *)address)[0] = operand.x;
((uint32_t *)address)[1] = operand.y;
((uint32_t *)address)[2] = operand.z;
((uint32_t *)address)[3] = operand.w;
break;
case U64:
((uint64_t *)address)[0] = operand.x;
break;
case U64X2:
((uint64_t *)address)[0] = operand.x;
((uint64_t *)address)[1] = operand.y;
break;
case U64X4:
((uint64_t *)address)[0] = operand.x;
((uint64_t *)address)[1] = operand.y;
((uint64_t *)address)[2] = operand.z;
((uint64_t *)address)[3] = operand.w;
break;
}
}
If a global load or store accesses a memory address that does not
correspond to a buffer object made resident by MakeBufferResidentNV, the
results of the operation are undefined and may produce a fault resulting
in application termination. If a load accesses a buffer object made
resident with an <access> parameter of WRITE_ONLY, or if a store accesses
a buffer object made resident with an <access> parameter of READ_ONLY, the
results of the operation are also undefined and may lead to application
termination.
The address used for global memory loads or stores, or the offset used
for constant buffer loads, must be aligned to the fetch size corresponding
to the storage opcode modifier. For S8 and U8, the offset has no alignment
requirements. For S16 and U16, the offset must be a multiple of two basic
machine units. For F32, S32, and U32, the offset must be a multiple of
four. For F32X2, F64, S32X2, S64, U32X2, and U64, the offset must be a
multiple of eight. For F32X4, F64X2, S32X4, S64X2, U32X4, and U64X2, the
offset must be a multiple of sixteen. For F64X4, S64X4, and U64X4, the
offset must be a multiple of thirty-two. If an offset is not correctly
aligned, the values returned by a buffer memory load will be undefined,
and the effects of a buffer memory store will also be undefined.
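The alignment rule above reduces to "the offset must be a multiple of the total fetch size"; a small hypothetical helper (not part of the spec's grammar) makes this concrete:

```c
#include <stdint.h>

/* Required alignment, in basic machine units, for a storage modifier
   with the given component size (in bits) and component count: the
   total fetch size.  E.g. F32X2 -> (32/8)*2 = 8; F64X4 -> 32. */
static unsigned required_alignment(unsigned component_bits, unsigned count)
{
    return (component_bits / 8) * count;
}

static int offset_is_aligned(uint64_t offset, unsigned component_bits,
                             unsigned count)
{
    return offset % required_alignment(component_bits, count) == 0;
}
```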
Global and image memory accesses in assembly programs are weakly ordered
and may require synchronization relative to other operations in the OpenGL
pipeline. The ordering and synchronization mechanisms described in
Section 2.14.X (of the EXT_shader_image_load_store extension
specification) for shaders using the OpenGL Shading Language apply equally
to loads, stores, and atomics performed in assembly programs.
Modify Section 2.X.6.Y of the NV_fragment_program4 specification
(add new option section)
+ Early Per-Fragment Tests (NV_early_fragment_tests)
If a fragment program specifies the "NV_early_fragment_tests" option, the
depth and stencil tests will be performed prior to fragment program
invocation, as described in Section 3.X.
Modify Section 2.X.7.Y of the NV_geometry_program4 specification
(Simply add the new input primitive type "PATCHES" to the list of tokens
allowed by the "PRIMITIVE_IN" declaration.)
- Input Primitive Type (PRIMITIVE_IN)
The PRIMITIVE_IN statement declares the type of primitives seen by a
geometry program. The single argument must be one of "POINTS", "LINES",
"LINES_ADJACENCY", "TRIANGLES", "TRIANGLES_ADJACENCY", or "PATCHES".
(Add a new optional program declaration to declare a geometry shader that
is run <N> times per primitive.)
Geometry programs support three types of mandatory declaration statements,
as described below. Each of the three must be included exactly once in
the geometry program.
...
Geometry programs also support one optional declaration statement.
- Program Invocation Count (INVOCATIONS)
The INVOCATIONS statement declares the number of times the geometry
program is run on each primitive processed. The single argument must be a
positive integer less than or equal to the value of the
implementation-dependent limit MAX_GEOMETRY_PROGRAM_INVOCATIONS_NV. Each
invocation of the geometry program will have the same inputs and outputs
except for the built-in input variable "primitive.invocation". This
variable will be an integer between 0 and <n>-1, where <n> is the declared
number of invocations. If omitted, the program invocation count is one.
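A hypothetical geometry program header using this declaration might begin as follows (illustrative sketch only; the instruction body is omitted):

    !!NVgp5.0
    PRIMITIVE_IN TRIANGLES;
    PRIMITIVE_OUT TRIANGLE_STRIP;
    VERTICES_OUT 3;
    INVOCATIONS 4;  # program runs four times per input triangle
    # primitive.invocation holds 0..3, selecting per-invocation behavior
    ...
    END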
Section 2.X.8.Z, ATOM: Atomic Global Memory Operation
The ATOM instruction performs an atomic global memory operation by reading
from memory at the address specified by the second unsigned integer scalar
operand, computing a new value based on the value read from memory and the
first (vector) operand, and then writing the result back to the same
memory address. The memory transaction is atomic, guaranteeing that no
other write to the memory accessed will occur between the time it is read
and written by the ATOM instruction. The result of the ATOM instruction
is the scalar value read from memory.
The ATOM instruction has two required instruction modifiers. The atomic
modifier specifies the type of operation to be performed. The storage
modifier specifies the size and data type of the operand read from memory
and the base data type of the operation used to compute the value to be
written to memory.
atomic storage
modifier modifiers operation
-------- ------------------ --------------------------------------
ADD U32, S32, U64 compute a sum
MIN U32, S32 compute minimum
MAX U32, S32 compute maximum
IWRAP U32 increment memory, wrapping at operand
DWRAP U32 decrement memory, wrapping at operand
AND U32, S32 compute bit-wise AND
OR U32, S32 compute bit-wise OR
XOR U32, S32 compute bit-wise XOR
EXCH U32, S32, U64 exchange memory with operand
CSWAP U32, S32, U64 compare-and-swap
Table X.Y, Supported atomic and storage modifiers for the ATOM
instruction.
Not all storage modifiers are supported by ATOM, and the set of modifiers
allowed for any given instruction depends on the atomic modifier
specified. Table X.Y enumerates the set of atomic modifiers supported by
the ATOM instruction, and the storage modifiers allowed for each.
tmp0 = VectorLoad(op0);
address = ScalarLoad(op1);
result = BufferMemoryLoad(address, storageModifier);
switch (atomicModifier) {
case ADD:
writeval = tmp0.x + result;
break;
case MIN:
writeval = min(tmp0.x, result);
break;
case MAX:
writeval = max(tmp0.x, result);
break;
case IWRAP:
writeval = (result >= tmp0.x) ? 0 : result+1;
break;
case DWRAP:
writeval = (result == 0 || result > tmp0.x) ? tmp0.x : result-1;
break;
case AND:
writeval = tmp0.x & result;
break;
case OR:
writeval = tmp0.x | result;
break;
case XOR:
writeval = tmp0.x ^ result;
break;
case EXCH:
break;
case CSWAP:
if (result == tmp0.x) {
writeval = tmp0.y;
} else {
return result; // no memory store
}
break;
}
BufferMemoryStore(address, writeval, storageModifier);
ATOM performs a scalar atomic operation. The <y>, <z>, and <w> components
of the result vector are undefined.
ATOM supports no base data type modifiers, but requires exactly one
storage modifier. The base data types of the result vector and of the
first (vector) operand are derived from the storage modifier. The second
operand is always interpreted as a scalar unsigned integer.
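The less familiar IWRAP and DWRAP update rules from the pseudocode above can be restated in C (a sketch of the update computation only; the real instruction performs it as a single atomic read-modify-write):

```c
#include <stdint.h>

/* ATOM.IWRAP.U32: increment the value in memory, wrapping to zero once
   it reaches the operand. */
static uint32_t iwrap(uint32_t mem, uint32_t operand)
{
    return (mem >= operand) ? 0 : mem + 1;
}

/* ATOM.DWRAP.U32: decrement the value in memory, wrapping back to the
   operand at zero (or when memory is already out of range). */
static uint32_t dwrap(uint32_t mem, uint32_t operand)
{
    return (mem == 0 || mem > operand) ? operand : mem - 1;
}
```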
Section 2.X.8.Z, BFE: Bitfield Extract
The BFE instruction performs a component-wise bitfield extraction of the
second vector operand to yield a result vector. For
each component, the number of bits extracted is given by the x component
of the first vector operand, and the bit number of the least significant
bit extracted is given by the y component of the first vector operand.
tmp0 = VectorLoad(op0);
tmp1 = VectorLoad(op1);
result.x = BitfieldExtract(tmp0.x, tmp0.y, tmp1.x);
result.y = BitfieldExtract(tmp0.x, tmp0.y, tmp1.y);
result.z = BitfieldExtract(tmp0.x, tmp0.y, tmp1.z);
result.w = BitfieldExtract(tmp0.x, tmp0.y, tmp1.w);
If the number of bits to extract is zero, zero is returned. The results
of bitfield extraction are undefined
* if the number of bits to extract or the starting offset is negative,
* if the sum of the number of bits to extract and the starting offset
is greater than the total number of bits in the operand/result, or
* if the starting offset is greater than or equal to the total number of
bits in the operand/result.
Type BitfieldExtract(Type bits, Type offset, Type value)
{
if (bits < 0 || offset < 0 || offset >= TotalBits(Type) ||
bits + offset > TotalBits(Type)) {
/* result undefined */
} else if (bits == 0) {
return 0;
} else {
return (value << (TotalBits(Type) - (bits+offset))) >>
(TotalBits(Type) - bits);
}
}
BFE supports only signed and unsigned integer data type modifiers. For
signed integer data types, the extracted value is sign-extended (i.e.,
filled with ones if the most significant bit extracted is one and filled
with zeroes otherwise). For unsigned integer data types, the extracted
value is zero-extended.
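A concrete 32-bit signed instance of the BitfieldExtract pseudocode (a sketch; it assumes arithmetic right shift on signed integers and in-range bits/offset values):

```c
#include <stdint.h>

/* BFE.S32 for one component: shift the field up so its top bit lands in
   the sign position, then arithmetic-shift down to sign-extend it. */
static int32_t bfe_s32(int bits, int offset, int32_t value)
{
    if (bits == 0)
        return 0;                      /* extracting zero bits yields 0 */
    uint32_t shifted = (uint32_t)value << (32 - (bits + offset));
    return (int32_t)shifted >> (32 - bits);
}
```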
Section 2.X.8.Z, BFI: Bitfield Insert
The BFI instruction performs a component-wise bitfield insertion of the
second vector operand into the third vector operand to yield a result
vector. For each component, the <n> least significant bits are extracted
from the corresponding component of the second vector operand, where <n>
is given by the x component of the first vector operand. Those bits are
merged into the corresponding component of the third vector operand,
replacing bits <b> through <b>+<n>-1, to produce the result. The bit
offset <b> is specified by the y component of the first operand.
tmp0 = VectorLoad(op0);
tmp1 = VectorLoad(op1);
tmp2 = VectorLoad(op2);
result.x = BitfieldInsert(tmp0.x, tmp0.y, tmp1.x, tmp2.x);
result.y = BitfieldInsert(tmp0.x, tmp0.y, tmp1.y, tmp2.y);
result.z = BitfieldInsert(tmp0.x, tmp0.y, tmp1.z, tmp2.z);
result.w = BitfieldInsert(tmp0.x, tmp0.y, tmp1.w, tmp2.w);
The results of bitfield insertion are undefined
* if the number of bits to insert or the starting offset is negative,
* if the sum of the number of bits to insert and the starting offset
is greater than the total number of bits in the operand/result, or
* if the starting offset is greater than or equal to the total number of
bits in the operand/result.
Type BitfieldInsert(Type bits, Type offset, Type src, Type dst)
{
if (bits < 0 || offset < 0 || offset >= TotalBits(Type) ||
bits + offset > TotalBits(Type)) {
/* result undefined */
} else if (bits == TotalBits(Type)) {
return src;
} else {
Type mask = ((1 << bits) - 1) << offset;
return ((src << offset) & mask) | (dst & (~mask));
}
}
BFI supports only signed and unsigned integer data type modifiers. If no
type modifier is specified, the operand and result vectors are treated as
signed integers.
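A 32-bit unsigned instance of the BitfieldInsert pseudocode (a sketch; assumes in-range bits/offset values):

```c
#include <stdint.h>

/* BFI.U32 for one component: take the low <bits> bits of src and merge
   them into dst starting at bit <offset>. */
static uint32_t bfi_u32(int bits, int offset, uint32_t src, uint32_t dst)
{
    if (bits == 32)
        return src;                    /* avoid the undefined 1u << 32 */
    uint32_t mask = ((1u << bits) - 1u) << offset;
    return ((src << offset) & mask) | (dst & ~mask);
}
```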
Section 2.X.8.Z, BFR: Bitfield Reverse
The BFR instruction performs a component-wise bit reversal of the single
vector operand to produce a result vector. Bit reversal is performed by
exchanging the most and least significant bits, the second-most and
second-least significant bits, and so on.
tmp0 = VectorLoad(op0);
result.x = BitReverse(tmp0.x);
result.y = BitReverse(tmp0.y);
result.z = BitReverse(tmp0.z);
result.w = BitReverse(tmp0.w);
BFR supports only signed and unsigned integer data type modifiers. If no
type modifier is specified, the operand and result vectors are treated as
signed integers.
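Per-component bit reversal as performed by BFR can be written directly (a straightforward sketch, not the hardware algorithm):

```c
#include <stdint.h>

/* Reverse the 32 bits of v: bit 0 swaps with bit 31, bit 1 with 30, etc. */
static uint32_t bit_reverse32(uint32_t v)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (v & 1u);   /* append v's lowest bit to r */
        v >>= 1;
    }
    return r;
}
```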
Section 2.X.8.Z, BTC: Bit Count
The BTC instruction performs a component-wise bit count of the single
source vector to yield a result vector. Each component of the result
vector contains the number of one bits in the corresponding component of
the source vector.
tmp0 = VectorLoad(op0);
result.x = BitCount(tmp0.x);
result.y = BitCount(tmp0.y);
result.z = BitCount(tmp0.z);
result.w = BitCount(tmp0.w);
BTC supports only signed and unsigned integer data type modifiers. If no
type modifier is specified, both operands and the result are treated as
signed integers.
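A per-component population count matching BTC (using Kernighan's clear-lowest-bit trick; a sketch only):

```c
#include <stdint.h>

/* Count the one bits in v; v &= v - 1 clears the lowest set bit, so the
   loop runs once per set bit. */
static uint32_t bit_count32(uint32_t v)
{
    uint32_t n = 0;
    while (v) {
        v &= v - 1;
        n++;
    }
    return n;
}
```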
Section 2.X.8.Z, BTFL: Find Least Significant Bit
The BTFL instruction searches for the least significant bit of each
component of the single source vector, yielding a result vector comprising
the bit number of the located bit for each component.
tmp0 = VectorLoad(op0);
result.x = FindLSB(tmp0.x);
result.y = FindLSB(tmp0.y);
result.z = FindLSB(tmp0.z);
result.w = FindLSB(tmp0.w);
BTFL supports only signed and unsigned integer data type modifiers. For
unsigned integer data types, the search will yield the bit number of the
least significant one bit in each component, or the maximum integer (all
bits are ones) if the source vector component is zero. For signed data
types, the search will yield the bit number of the least significant one
bit in each component, or -1 if the source vector component is zero. If
no type modifier is specified, both operands and the result are treated as
signed integers.
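For 32-bit components, the signed result -1 and the unsigned "maximum integer" result share the same all-ones bit pattern, so a single helper covers both zero-input cases (a sketch):

```c
#include <stdint.h>

/* BTFL for one 32-bit component: bit number of the least significant
   one bit, or -1 (the all-ones pattern) if v is zero. */
static int32_t find_lsb32(uint32_t v)
{
    if (v == 0)
        return -1;
    int32_t n = 0;
    while (!(v & 1u)) {
        v >>= 1;
        n++;
    }
    return n;
}
```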
Section 2.X.8.Z, BTFM: Find Most Significant Bit
The BTFM instruction searches for the most significant bit of each
component of the single source vector, yielding a result vector comprising
the bit number of the located bit for each component.
tmp0 = VectorLoad(op0);
result.x = FindMSB(tmp0.x);
result.y = FindMSB(tmp0.y);
result.z = FindMSB(tmp0.z);
result.w = FindMSB(tmp0.w);
BTFM supports only signed and unsigned integer data type modifiers. For
unsigned integer data types, the search will yield the bit number of the
most significant one bit in each component, or the maximum integer (all
bits are ones) if the source vector component is zero. For signed data
types, the search will yield the bit number of the most significant one
bit if the source value is positive, the bit number of the most
significant zero bit if the source value is negative, or -1 if the source
value is zero. If no type modifier is specified, both operands and the
result are treated as signed integers.
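The signed BTFM search amounts to finding the highest bit that differs from the sign bit (a sketch; note that an all-ones source has no most significant zero bit, so it also yields -1):

```c
#include <stdint.h>

/* BTFM.S32 for one component: highest one bit if v is positive, highest
   zero bit if v is negative (search ~v instead), or -1 if no such bit
   exists (v is 0 or -1). */
static int32_t find_msb_s32(int32_t v)
{
    uint32_t x = (v < 0) ? ~(uint32_t)v : (uint32_t)v;
    int32_t n = -1;
    while (x) {
        x >>= 1;
        n++;
    }
    return n;
}
```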
Section 2.X.8.Z, CVT: Data Type Conversion
The CVT instruction converts each component of the single source vector
from one specified data type to another to yield a result vector.
tmp0 = VectorLoad(op0);
result = DataTypeConvert(tmp0);
The CVT instruction requires two storage modifiers. The first specifies
the data type of the result components; the second specifies the data type
of the operand components. The supported storage modifiers are F16, F32,
F64, S8, S16, S32, S64, U8, U16, U32, and U64. A storage modifier of
"F16" indicates a source or destination that is treated as having a
floating-point type, but whose sixteen least significant bits describe a
16-bit floating-point value using the encoding provided in Section 2.1.2.
If the component size of the source register doesn't match the size of the
specified operand data type, the source register components are first
interpreted as a value with the same base data type as the operand and
converted to the operand data type. The operand components are then
converted to the result data type. Finally, if the component size of the
destination register doesn't match the specified result data type, the
result components are converted to values of the same base data type with
a size matching the result register's component size.
Data type conversion is performed by first converting the source
components to an infinite-precision value of the destination data type,
and then converting to the result data type. When converting between
floating-point and integer values, integer values are never interpreted as
being normalized to [0,1] or [-1,+1]. Converting the floating-point
special values -INF, +INF, and NaN to integers will yield undefined
results.
When converting from a non-integral floating-point value to an integer,
one of the two integers closest in value to the floating-point value is
chosen according to the rounding instruction modifier. If "CEIL" or "FLR"
is specified, the larger or smaller value, respectively, is chosen. If
"TRUNC" is specified, the value nearest to zero is chosen. If "ROUND" is
specified, if one integer is nearer in value to the original
floating-point value, it is chosen; otherwise, the even integer is chosen.
"ROUND" is used if no rounding modifier is specified.
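The four rounding modifiers behave as sketched below (an illustration of the rule stated above, not the hardware implementation):

```python
import math

def cvt_round(x, mode="ROUND"):
    # Float-to-int rounding per the CVT modifiers; "ROUND" picks the
    # nearest integer, with ties going to the even integer.
    if mode == "CEIL":
        return math.ceil(x)
    if mode == "FLR":
        return math.floor(x)
    if mode == "TRUNC":
        return math.trunc(x)
    return round(x)  # Python's round() is round-half-to-even
```

Note that under "ROUND", 2.5 rounds to 2 but 3.5 rounds to 4, since ties resolve to the even integer.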
When converting from the infinite-precision intermediate value to the
destination data type:
* Floating-point values not exactly representable in the destination
data type are rounded to one of the two nearest values in the destination
type according to the rounding modifier. Note that the results of
float-to-float conversion are not automatically rounded to integer
values, even if a rounding modifier such as CEIL or FLR is specified.
* Integer values are clamped to the closest value representable in the
result data type if the "SAT" (saturation) modifier is specified.
* Integer values drop the most significant bits if the "SAT" modifier is
not specified.
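The two integer narrowing behaviors in the bullets above can be sketched as follows; the 8-bit default width is chosen only for illustration:

```python
def cvt_to_int(value, bits=8, signed=True, sat=False):
    # Integer narrowing: clamp to the representable range with "SAT",
    # otherwise drop the most significant bits (modular wrap).
    if sat:
        lo = -(1 << (bits - 1)) if signed else 0
        hi = (1 << (bits - 1)) - 1 if signed else (1 << bits) - 1
        return max(lo, min(hi, value))
    masked = value & ((1 << bits) - 1)
    if signed and masked >= (1 << (bits - 1)):
        masked -= 1 << bits  # reinterpret as signed
    return masked
```

Converting 300 to a signed 8-bit value yields 127 with "SAT" but 44 (300 mod 256) without it.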
Negation and absolute value operators are not supported on the source
operand; a program using such operators will fail to compile.
CVT supports no data type modifiers; the type of the operand and result
vectors is fully specified by the required storage modifiers.
Section 2.X.8.Z, EMIT: Emit Vertex
(Modify the description of the EMIT opcode to deal with the interaction
with multiple vertex streams added by ARB_transform_feedback3. For more
information on vertex streams, see ARB_transform_feedback3.)
The EMIT instruction emits a new vertex to be added to the current output
primitive for vertex stream zero. The attributes of the emitted vertex
are given by the current values of the vertex result variables. After the
EMIT instruction completes, a new vertex is started and all result
variables become undefined.
Section 2.X.8.Z, EMITS: Emit Vertex to Stream
(Add new geometry program opcode; the EMITS instruction is not supported
for any other program types. For more information on vertex streams, see
ARB_transform_feedback3.)
The EMITS instruction emits a new vertex to be added to the current output
primitive for the vertex stream specified by the single signed integer
scalar operand. The attributes of the emitted vertex are given by the
current values of the vertex result variables. After the EMITS
instruction completes, a new vertex is started and all result variables
become undefined.
If the specified stream is negative or greater than or equal to the
implementation-dependent number of vertex streams
(MAX_VERTEX_STREAMS_NV), the results of the instruction are undefined.
Section 2.X.8.Z, IPAC: Interpolate at Centroid
The IPAC instruction generates a result vector by evaluating the fragment
attribute named by the single vector operand at the centroid location.
The result vector would be identical to the value obtained by a MOV
instruction if the attribute variable were declared using the CENTROID
modifier.
When interpolating an attribute variable with this instruction, the
CENTROID and SAMPLE attribute variable modifiers are ignored. The FLAT
and NOPERSPECTIVE variable modifiers operate normally.
tmp0 = Interpolate(op0, x_pixel + x_centroid, y_pixel + y_centroid);
result = tmp0;
IPAC supports only floating-point data type modifiers. A program will
fail to load if it contains an IPAC instruction whose single operand is
not a fragment program attribute variable or matches the "fragment.facing"
or "primitive.id" binding.
Section 2.X.8.Z, IPAO: Interpolate with Offset
The IPAO instruction generates a result vector by evaluating the fragment
attribute named by the single vector operand at an offset from the pixel
center given by the x and y components of the second vector operand. The
z and w components of the second vector operand are ignored. The (x,y)
position used for interpolating the attribute variable is obtained by
adding the (x,y) offsets in the second vector operand to the (x,y)
position of the pixel center.
The range of offsets supported by the IPAO instruction is
implementation-dependent. The position used to interpolate the attribute
variable is undefined if the x or y component of the second operand is
less than MIN_FRAGMENT_INTERPOLATION_OFFSET_NV or greater than
MAX_FRAGMENT_INTERPOLATION_OFFSET_NV. Additionally, the granularity of
offsets may be limited. The (x,y) value may be snapped to a fixed
sub-pixel grid with the number of subpixel bits given by
FRAGMENT_PROGRAM_INTERPOLATION_OFFSET_BITS_NV.
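One plausible snapping scheme is sketched below. The range limits and bit count mirror the minimum values from the implementation-dependent state table; the rounding direction used when snapping is itself implementation-dependent, so this is only an illustration:

```python
def snap_offset(off, bits=4, min_off=-0.5, max_off=0.5):
    # Outside [MIN, MAX]_FRAGMENT_INTERPOLATION_OFFSET_NV the sample
    # position is undefined; in range, the offset may be quantized to
    # a grid of 1/2**bits pixel.
    if off < min_off or off > max_off:
        return None  # undefined interpolation position
    scale = 1 << bits
    return round(off * scale) / scale
```

With 4 subpixel bits, an offset of 0.3 may snap to 5/16 = 0.3125.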
When interpolating an attribute variable with this instruction, the
CENTROID and SAMPLE attribute variable modifiers are ignored. The FLAT
and NOPERSPECTIVE variable modifiers operate normally.
tmp1 = VectorLoad(op1);
tmp0 = Interpolate(op0, x_pixel + tmp1.x, y_pixel + tmp1.y);
result = tmp0;
IPAO supports only floating-point data type modifiers. A program will
fail to load if it contains an IPAO instruction whose first operand is not
a fragment program attribute variable or matches the "fragment.facing" or
"primitive.id" binding.
Section 2.X.8.Z, IPAS: Interpolate at Sample Location
The IPAS instruction generates a result vector by evaluating the fragment
attribute named by the single vector operand at the location of the
pixel's sample whose sample number is given by the second integer scalar
operand. If multisample buffers are not available (SAMPLE_BUFFERS is
zero), the attribute will be evaluated at the pixel center. If the sample
number given by the second operand does not exist, the position used to
interpolate the attribute is undefined.
When interpolating an attribute variable with this instruction, the
CENTROID and SAMPLE attribute variable modifiers are ignored. The FLAT
and NOPERSPECTIVE variable modifiers operate normally.
sample = ScalarLoad(op1);
tmp1 = SampleOffset(sample);
tmp0 = Interpolate(op0, x_pixel + tmp1.x, y_pixel + tmp1.y);
result = tmp0;
IPAS supports only floating-point data type modifiers. A program will
fail to load if it contains an IPAS instruction whose first operand is not
a fragment program attribute variable or matches the "fragment.facing" or
"primitive.id" binding.
Section 2.X.8.Z, LDC: Load from Constant Buffer
The LDC instruction loads a vector operand from a buffer object to yield a
result vector. The operand used for the LDC instruction must correspond
to a parameter buffer variable declared using the "CBUFFER" statement; a
program will fail to load if any other type of operand is used in an LDC
instruction.
result = BufferMemoryLoad(&op0, storageModifier);
A base operand vector is fetched from memory as described in Section
2.X.4.5, with the GPU address derived from the binding corresponding to
the operand. A final operand vector is derived from the base operand
vector by applying swizzle, negation, and absolute value operand modifiers
as described in Section 2.X.4.2.
The amount of memory in any given buffer object binding accessible by the
LDC instruction may be limited. If any component fetched by the LDC
instruction extends 4*<n> or more basic machine units from the beginning
of the buffer object binding, where <n> is the implementation-dependent
constant MAX_PROGRAM_PARAMETER_BUFFER_SIZE_NV, the value fetched for that
component will be undefined.
LDC supports no base data type modifiers, but requires exactly one storage
modifier. The base data types of the operand and result vectors are
derived from the storage modifier.
Section 2.X.8.Z, LOAD: Global Load
The LOAD instruction generates a result vector by reading an address from
the single unsigned integer scalar operand and fetching data from buffer
object memory, as described in Section 2.X.4.5.
address = ScalarLoad(op0);
result = BufferMemoryLoad(address, storageModifier);
LOAD supports no base data type modifiers, but requires exactly one
storage modifier. The base data type of the result vector is derived from
the storage modifier. The single scalar operand is always interpreted as
an unsigned integer.
Section 2.X.8.Z, MEMBAR: Memory Barrier
The MEMBAR instruction synchronizes memory transactions to ensure that
memory transactions resulting from any instruction executed by the thread
prior to the MEMBAR instruction complete prior to any memory transactions
issued after the instruction.
MEMBAR has no operands and generates no result.
Section 2.X.8.Z, PK64: Pack 64-Bit Component
The PK64 instruction reads the four components of the single vector
operand as 32-bit values, packs the bit representations of these into a
pair of 64-bit values, and replicates those to produce a four-component
result vector. The "x" and "y" components of the operand are packed to
produce the "x" and "z" components of the result vector; the "z" and "w"
components of the operand are packed to produce the "y" and "w" components
of the result vector. The PK64 instruction can be reversed by the UP64
instruction below.
This instruction is intended to allow a program to reconstruct 64-bit
integer or floating-point values generated by the application but passed
to the GL as two 32-bit values taken from adjacent words in memory. The
ability to use this technique depends on how the 64-bit value is stored in
memory. For "little-endian" processors, the first 32-bit value holds the
least significant 32 bits of the 64-bit value. For "big-endian"
processors, the first 32-bit value holds the most significant 32 bits of
the 64-bit value. This reconstruction assumes that the first 32-bit word
comes from the x component of the operand and the second 32-bit word comes
from the y component. The method used to construct a 64-bit value from a
pair of 32-bit values depends on the processor type.
tmp = VectorLoad(op0);
if (underlying system is little-endian) {
result.x = RawBits(tmp.x) | (RawBits(tmp.y) << 32);
result.y = RawBits(tmp.z) | (RawBits(tmp.w) << 32);
result.z = RawBits(tmp.x) | (RawBits(tmp.y) << 32);
result.w = RawBits(tmp.z) | (RawBits(tmp.w) << 32);
} else {
result.x = RawBits(tmp.y) | (RawBits(tmp.x) << 32);
result.y = RawBits(tmp.w) | (RawBits(tmp.z) << 32);
result.z = RawBits(tmp.y) | (RawBits(tmp.x) << 32);
result.w = RawBits(tmp.w) | (RawBits(tmp.z) << 32);
}
PK64 supports integer and floating-point data type modifiers, which
specify the base data type of the operand and result. The single vector
operand is always treated as having 32-bit components, and the result is
treated as a vector with 64-bit components. The encoding performed by
PK64 can be reversed using the UP64 instruction.
A program will fail to load if it contains a PK64 instruction that writes
its results to a variable not declared as "LONG".
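The little-endian reconstruction above amounts to concatenating two 32-bit words, which can be verified on the host with a sketch like this:

```python
import struct

def pk64(x_bits, y_bits, little_endian=True):
    # Concatenate two 32-bit bit patterns into one 64-bit pattern;
    # which word supplies the low bits depends on host endianness.
    if little_endian:
        return (y_bits << 32) | x_bits
    return (x_bits << 32) | y_bits

# A double written by a little-endian CPU as two adjacent 32-bit
# words is reconstructed exactly from its low and high halves.
raw = struct.unpack("<Q", struct.pack("<d", 3.141592653589793))[0]
lo, hi = raw & 0xFFFFFFFF, raw >> 32
assert pk64(lo, hi) == raw
```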
Section 2.X.8.Z, STORE: Global Store
The STORE instruction reads an address from the second unsigned integer
scalar operand and writes the contents of the first vector operand to
buffer object memory at that address, as described in Section 2.X.4.5.
This instruction generates no result.
tmp0 = VectorLoad(op0);
address = ScalarLoad(op1);
BufferMemoryStore(address, tmp0, storageModifier);
STORE supports no base data type modifiers, but requires exactly one
storage modifier. The base data type of the vector components of the
first operand is derived from the storage modifier. The second operand is
always interpreted as an unsigned integer scalar.
Section 2.X.8.Z, TEX: Texture Sample
(Modify the instruction pseudo-code to account for the fact that texel
offsets no longer need to be immediate arguments.)
tmp = VectorLoad(op0);
if (instruction has variable texel offset) {
itmp = VectorLoad(op1);
} else {
itmp = instruction.texelOffset;
}
ddx = ComputePartialsX(tmp);
ddy = ComputePartialsY(tmp);
lambda = ComputeLOD(ddx, ddy);
result = TextureSample(tmp, lambda, ddx, ddy, itmp);
Section 2.X.8.Z, TGALL: Test for All Non-Zero in a Thread Group
The TGALL instruction produces a result vector by reading a vector operand
for each active thread in the current thread group and comparing each
component to zero. A result vector component contains a TRUE value
(described below) if the value of the corresponding component in the
operand vector is non-zero for all active threads, and a FALSE value
otherwise.
An implementation may choose to arrange program threads into thread
groups, and execute an instruction simultaneously for each thread in the
group. If the TGALL instruction is contained inside conditional flow
control blocks and not all threads in the group execute the instruction,
the operand values for threads not executing the instruction have no
bearing on the value returned. The method used to arrange threads into
groups is undefined.
tmp = VectorLoad(op0);
result = { TRUE, TRUE, TRUE, TRUE };
for (all active threads) {
if ([thread]tmp.x == 0) result.x = FALSE;
if ([thread]tmp.y == 0) result.y = FALSE;
if ([thread]tmp.z == 0) result.z = FALSE;
if ([thread]tmp.w == 0) result.w = FALSE;
}
TGALL supports all data type modifiers. For floating-point data types,
the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data
types, the TRUE value is -1 and the FALSE value is 0. For unsigned
integer data types, the TRUE value is the maximum integer value (all bits
are ones) and the FALSE value is zero.
Section 2.X.8.Z, TGANY: Test for Any Non-Zero in a Thread Group
The TGANY instruction produces a result vector by reading a vector operand
for each active thread in the current thread group and comparing each
component to zero. A result vector component contains a TRUE value
(described below) if the value of the corresponding component in the
operand vector is non-zero for any active thread, and a FALSE value
otherwise.
An implementation may choose to arrange program threads into thread
groups, and execute an instruction simultaneously for each thread in the
group. If the TGANY instruction is contained inside conditional flow
control blocks and not all threads in the group execute the instruction,
the operand values for threads not executing the instruction have no
bearing on the value returned. The method used to arrange threads into
groups is undefined.
tmp = VectorLoad(op0);
result = { FALSE, FALSE, FALSE, FALSE };
for (all active threads) {
if ([thread]tmp.x != 0) result.x = TRUE;
if ([thread]tmp.y != 0) result.y = TRUE;
if ([thread]tmp.z != 0) result.z = TRUE;
if ([thread]tmp.w != 0) result.w = TRUE;
}
TGANY supports all data type modifiers. For floating-point data types,
the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data
types, the TRUE value is -1 and the FALSE value is 0. For unsigned
integer data types, the TRUE value is the maximum integer value (all bits
are ones) and the FALSE value is zero.
Section 2.X.8.Z, TGEQ: Test for All Equal Values in a Thread Group
The TGEQ instruction produces a result vector by reading a vector operand
for each active thread in the current thread group and comparing each
component to zero. A result vector component contains a TRUE value
(described below) if the value of the corresponding component in the
operand vector is the same for all active threads, and a FALSE value
otherwise.
An implementation may choose to arrange program threads into thread
groups, and execute an instruction simultaneously for each thread in the
group. If the TGEQ instruction is contained inside conditional flow
control blocks and not all threads in the group execute the instruction,
the operand values for threads not executing the instruction have no
bearing on the value returned. The method used to arrange threads into
groups is undefined.
tmp = VectorLoad(op0);
tgall = { TRUE, TRUE, TRUE, TRUE };
tgany = { FALSE, FALSE, FALSE, FALSE };
for (all active threads) {
if ([thread]tmp.x == 0) tgall.x = FALSE; else tgany.x = TRUE;
if ([thread]tmp.y == 0) tgall.y = FALSE; else tgany.y = TRUE;
if ([thread]tmp.z == 0) tgall.z = FALSE; else tgany.z = TRUE;
if ([thread]tmp.w == 0) tgall.w = FALSE; else tgany.w = TRUE;
}
result.x = (tgall.x == tgany.x) ? TRUE : FALSE;
result.y = (tgall.y == tgany.y) ? TRUE : FALSE;
result.z = (tgall.z == tgany.z) ? TRUE : FALSE;
result.w = (tgall.w == tgany.w) ? TRUE : FALSE;
TGEQ supports all data type modifiers. For floating-point data types, the
TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data
types, the TRUE value is -1 and the FALSE value is 0. For unsigned
integer data types, the TRUE value is the maximum integer value (all bits
are ones) and the FALSE value is zero.
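The three thread-group votes can be simulated for one component as below. Note that, per the pseudocode, TGEQ tests whether all active threads agree on zero versus non-zero, not whether their values are bitwise identical:

```python
def tg_votes(values):
    # values: one component's value in each active thread of the group.
    tgall = all(v != 0 for v in values)   # TGALL: non-zero in all threads
    tgany = any(v != 0 for v in values)   # TGANY: non-zero in any thread
    tgeq = tgall == tgany                 # TGEQ: threads agree on zero-ness
    return tgall, tgany, tgeq
```

For example, `tg_votes([0, 7])` yields `(False, True, False)`: not all threads are non-zero, at least one is, and the threads disagree.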
Section 2.X.8.Z, TXB: Texture Sample with Bias
(Modify the instruction pseudo-code to account for the fact that texel
offsets no longer need to be immediate arguments.)
tmp = VectorLoad(op0);
if (instruction has variable texel offset) {
itmp = VectorLoad(op1);
} else {
itmp = instruction.texelOffset;
}
ddx = ComputePartialsX(tmp);
ddy = ComputePartialsY(tmp);
lambda = ComputeLOD(ddx, ddy);
result = TextureSample(tmp, lambda + tmp.w, ddx, ddy, itmp);
Section 2.X.8.Z, TXG: Texture Gather
(Update the TXG opcode description from NV_gpu_program4_1 specification.
This version adds two capabilities: any component of a multi-component
texture can be selected by tacking on a component name to the texture
variable passed to identify the texture unit, and depth compares are
supported if a SHADOW target is specified.)
The TXG instruction takes the four components of a single floating-point
vector operand as a texture coordinate, determines a set of four texels to
sample from the base level of detail of the specified texture image, and
returns one component from each texel in a four-component result vector.
To determine the four texels to sample, the minification and magnification
filters are ignored and the rules for LINEAR filter are applied to the
base level of the texture image to determine the texels T_i0_j1, T_i1_j1,
T_i1_j0, and T_i0_j0, as defined in equations 3.23 through 3.25. The
texels are then converted to texture source colors (Rs,Gs,Bs,As) according
to table 3.21, followed by application of the texture swizzle as described
in section 3.8.13. A four-component vector is returned by taking one of
the four components of the swizzled texture source colors from each of the
four selected texels. The component is selected using the
<texImageUnitComp> grammar rule, by adding a scalar suffix
(".x", ".y", ".z", ".w") to the identified texture; if no scalar suffix
is provided, the first component is selected.
TXG only operates on 2D, SHADOW2D, CUBE, SHADOWCUBE, ARRAY2D,
SHADOWARRAY2D, ARRAYCUBE, SHADOWARRAYCUBE, RECT, and SHADOWRECT texture
targets; a program will fail to compile if any other texture target is
used.
When using a "SHADOW" texture target, component selection is ignored.
Instead, depth comparisons are performed on the depth values for each of
the four selected texels, and 0/1 values are returned based on the results
of the comparison.
As with other texture accesses, the results of a texture gather operation
are undefined if the texture target in the instruction is incompatible
with the selected texture's base internal format and depth compare mode.
tmp = VectorLoad(op0);
ddx = (0,0,0);
ddy = (0,0,0);
lambda = 0;
if (instruction has variable texel offset) {
itmp = VectorLoad(op1);
} else {
itmp = instruction.texelOffset;
}
result.x = TextureSample_i0j1(tmp, lambda, ddx, ddy, itmp).<comp>;
result.y = TextureSample_i1j1(tmp, lambda, ddx, ddy, itmp).<comp>;
result.z = TextureSample_i1j0(tmp, lambda, ddx, ddy, itmp).<comp>;
result.w = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;
In this pseudocode, "<comp>" refers to the texel component selected by the
<texImageUnitComp> grammar rule, as described above.
TXG supports all three data type modifiers. The single operand is always
treated as a floating-point vector; the results are interpreted according
to the data type modifier.
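For a 2D target, the four gathered texels and their return order can be sketched by applying the LINEAR-rule footprint to a normalized coordinate (a simplification that ignores wrap modes and array layers):

```python
import math

def txg_footprint(s, t, width, height):
    # LINEAR-filter footprint (eqs. 3.23-3.25): i0 = floor(u - 0.5),
    # i1 = i0 + 1, and likewise for j. TXG returns one component from
    # each of these texels, in this order.
    u, v = s * width, t * height
    i0, j0 = math.floor(u - 0.5), math.floor(v - 0.5)
    i1, j1 = i0 + 1, j0 + 1
    return [(i0, j1), (i1, j1), (i1, j0), (i0, j0)]
```

Sampling the center of a 4x4 texture gathers the four texels surrounding (2.0, 2.0).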
Section 2.X.8.Z, TXGO: Texture Gather with Per-Texel Offsets
Like the TXG instruction, the TXGO instruction takes the four components
of its first floating-point vector operand as a texture coordinate,
determines a set of four texels to sample from the base level of detail of
the specified texture image, and returns one component from each texel in
a four-component result vector. The second and third vector operands are
taken as signed four-component integer vectors providing the x and y
components of the offsets, respectively, used to determine the location of
each of the four texels. To determine the four texels to sample, each of
the four independent offsets is used in conjunction with the specified
texture coordinate to select a texel. The minification and magnification
filters are ignored and the rules for LINEAR filtering are used to select
the texel T_i0_j0, as defined in equations 3.23 through 3.25, from the
base level of the texture image. The texels are then converted to texture
source colors (Rs,Gs,Bs,As) according to table 3.21, followed by
application of the texture swizzle as described in section 3.8.13. A
four-component vector is returned by taking one of the four components
of the swizzled texture source colors from each of the four selected
texels. The component is selected using the <texImageUnitComp> grammar
rule, by adding a scalar suffix (".x", ".y", ".z", ".w") to the identified
texture; if no scalar suffix is provided, the first component is selected.
TXGO only operates on 2D, SHADOW2D, ARRAY2D, SHADOWARRAY2D, RECT, and
SHADOWRECT texture targets; a program will fail to compile if any other
texture target is used.
When using a "SHADOW" texture target, component selection is ignored.
Instead, depth comparisons are performed on the depth values for each of
the four selected texels, and 0/1 values are returned based on the results
of the comparison.
As with other texture accesses, the results of a texture gather operation
are undefined if the texture target in the instruction is incompatible
with the selected texture's base internal format and depth compare mode.
tmp = VectorLoad(op0);
itmp1 = VectorLoad(op1);
itmp2 = VectorLoad(op2);
ddx = (0,0,0);
ddy = (0,0,0);
lambda = 0;
itmp = (itmp1.x, itmp2.x);
result.x = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;
itmp = (itmp1.y, itmp2.y);
result.y = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;
itmp = (itmp1.z, itmp2.z);
result.z = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;
itmp = (itmp1.w, itmp2.w);
result.w = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;
In this pseudocode, "<comp>" refers to the texel component selected by the
<texImageUnitComp> grammar rule, as described above.
If TEXTURE_WRAP_S or TEXTURE_WRAP_T is either CLAMP or MIRROR_CLAMP_EXT,
the results of the TXGO instruction are undefined.
Note: The TXG instruction is equivalent to the TXGO instruction with X
and Y offset vectors of (0,1,1,0) and (0,0,-1,-1), respectively.
TXGO supports all three data type modifiers. The first operand is always
treated as a floating-point vector and the second and third operands are
always treated as a signed integer vector; the results are interpreted
according to the data type modifier.
Section 2.X.8.Z, TXL: Texture Sample with LOD
(Modify the instruction pseudo-code to account for the fact that texel
offsets no longer need to be immediate arguments.)
tmp = VectorLoad(op0);
if (instruction has variable texel offset) {
itmp = VectorLoad(op1);
} else {
itmp = instruction.texelOffset;
}
ddx = (0,0,0);
ddy = (0,0,0);
result = TextureSample(tmp, tmp.w, ddx, ddy, itmp);
Section 2.X.8.Z, TXP: Texture Sample with Projection
(Modify the instruction pseudo-code to account for the fact that texel
offsets no longer need to be immediate arguments.)
tmp0 = VectorLoad(op0);
tmp0.x = tmp0.x / tmp0.w;
tmp0.y = tmp0.y / tmp0.w;
tmp0.z = tmp0.z / tmp0.w;
if (instruction has variable texel offset) {
itmp = VectorLoad(op1);
} else {
itmp = instruction.texelOffset;
}
ddx = ComputePartialsX(tmp0);
ddy = ComputePartialsY(tmp0);
lambda = ComputeLOD(ddx, ddy);
result = TextureSample(tmp0, lambda, ddx, ddy, itmp);
Section 2.X.8.Z, UP64: Unpack 64-bit Component
The UP64 instruction produces a vector result with 32-bit components by
unpacking the bits of the "x" and "y" components of a 64-bit vector
operand. The "x" component of the operand is unpacked to produce the "x"
and "y" components of the result vector; the "y" component is unpacked to
produce the "z" and "w" components of the result vector.
This instruction is intended to allow a program to pass 64-bit integer or
floating-point values to an application using two 32-bit values stored in
adjacent words in memory, which will be read by the application as single
64-bit values. The ability to use this technique depends on how the
64-bit value is stored in memory. For "little-endian" processors, the
first 32-bit value holds the least significant 32 bits of the 64-bit
value. For "big-endian" processors, the first 32-bit value
holds the most significant 32 bits of the 64-bit value. This
reconstruction assumes that the first 32-bit word comes from the "x"
component of the operand and the second 32-bit word comes from the "y"
component. The method used to unpack a 64-bit value into a pair of 32-bit
values depends on the processor type.
tmp = VectorLoad(op0);
if (underlying system is little-endian) {
result.x = (RawBits(tmp.x) >> 0) & 0xFFFFFFFF;
result.y = (RawBits(tmp.x) >> 32) & 0xFFFFFFFF;
result.z = (RawBits(tmp.y) >> 0) & 0xFFFFFFFF;
result.w = (RawBits(tmp.y) >> 32) & 0xFFFFFFFF;
} else {
result.x = (RawBits(tmp.x) >> 32) & 0xFFFFFFFF;
result.y = (RawBits(tmp.x) >> 0) & 0xFFFFFFFF;
result.z = (RawBits(tmp.y) >> 32) & 0xFFFFFFFF;
result.w = (RawBits(tmp.y) >> 0) & 0xFFFFFFFF;
}
UP64 supports integer and floating-point data type modifiers, which
specify the base data type of the operand and result. The single operand
vector always has 64-bit components. The result is treated as a vector
with 32-bit components. The encoding performed by UP64 can be reversed
using the PK64 instruction.
A program will fail to load if it contains a UP64 instruction whose
operand is a variable not declared as "LONG".
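The unpacking above is the inverse of PK64's concatenation, which a host-side round trip demonstrates (again an illustration, not the GPU instruction):

```python
import struct

def up64(bits64, little_endian=True):
    # Split a 64-bit bit pattern into two 32-bit words, "x" first;
    # which half comes first depends on host endianness.
    lo, hi = bits64 & 0xFFFFFFFF, (bits64 >> 32) & 0xFFFFFFFF
    return (lo, hi) if little_endian else (hi, lo)

# Round trip: a double's bit pattern survives unpacking into two
# words followed by little-endian reassembly.
raw = struct.unpack("<Q", struct.pack("<d", 2.718281828459045))[0]
x, y = up64(raw)
assert (y << 32) | x == raw
```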
Modify Section 2.14.6.1 of the NV_geometry_program4 specification,
Geometry Program Input Primitives
(add patches to the list of supported input primitive types)
The supported input primitive types are: ...
Patches (PATCHES)
Geometry programs that operate on patches are valid only for the
PATCHES_NV primitive type. There are a variable number of vertices
available for each program invocation, depending on the number of input
vertices in the primitive itself. For a patch with <n> vertices,
"vertex[0]" refers to the first vertex of the patch, and "vertex[<n>-1]"
refers to the last vertex.
Modify Section 2.14.6.2 of the NV_geometry_program4 specification,
Geometry Program Output Primitives
(Add a new paragraph limiting the use of the EMITS opcode to geometry
programs with a POINTS output primitive type at the end of the section.
This limitation may be removed in future specifications.)
Geometry programs may write to multiple vertex streams only if the
specified output primitive type is POINTS. A program will fail to load if
it contains an EMITS instruction and the output primitive type specified
by the PRIMITIVE_OUT declaration is not POINTS.
Modify Section 2.14.6.4 of the NV_geometry_program4 specification,
Geometry Program Output Limits
(Modify the limitation on the total number of components emitted by a
geometry program from NV_gpu_program4 to be per-invocation. If that
limit is 4096 and a program has 16 invocations, each of the 16 program
invocations can emit up to 4096 total components.)
There are two implementation-dependent limits that limit the total number
of vertices that each invocation of a program can emit. First, the vertex
limit may not exceed the value of MAX_PROGRAM_OUTPUT_VERTICES_NV. Second,
the product of the vertex limit and the number of result variable components
written by the program (PROGRAM_RESULT_COMPONENTS_NV, as described in
section 2.X.3.5 of NV_gpu_program4) may not exceed the value of
MAX_PROGRAM_TOTAL_OUTPUT_COMPONENTS_NV. A geometry program will fail to
load if its maximum vertex count or maximum total component count exceeds
the implementation-dependent limit. The limits may be queried by calling
GetProgramiv with a <target> of GEOMETRY_PROGRAM_NV. Note that the
maximum number of vertices that a geometry program can emit may be much
lower than MAX_PROGRAM_OUTPUT_VERTICES_NV if the program writes a large
number of result variable components. If a geometry program has multiple
invocations (via the "INVOCATIONS" declaration), the program will load
successfully as long as no single invocation exceeds the total component
count limit, even if the total output of all invocations combined exceeds
the limit.
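The interaction of the two limits reduces to simple arithmetic, sketched below with hypothetical limit values (real values are implementation-dependent and must be queried):

```python
def max_emittable_vertices(result_components,
                           max_output_vertices=1024,
                           max_total_components=4096):
    # Per invocation, the vertex count is bounded both by
    # MAX_PROGRAM_OUTPUT_VERTICES_NV and by the total-component
    # budget divided by the components written per vertex.
    return min(max_output_vertices,
               max_total_components // result_components)
```

A program writing 64 result components per vertex could thus emit at most 64 vertices under these hypothetical limits, far below the 1024-vertex cap.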
Additions to Chapter 3 of the OpenGL 3.0 Specification (Rasterization)
Modify Section 3.X, Early Per-Fragment Tests, as documented in the
EXT_shader_image_load_store specification
(add a new paragraph at the end of the section, describing how early fragment
tests work when assembly fragment programs are active)
If an assembly fragment program is active, early depth tests are
considered enabled if and only if the fragment program source included the
NV_early_fragment_tests option.
Add to Section 3.11.4.5 of ARB_fragment_program (Fragment Program):
Section 3.11.4.5.3, ARB_blend_func_extended Option
If a fragment program specifies the "ARB_blend_func_extended" option, dual
source color outputs as described in ARB_blend_func_extended are made
available through the use of the "result.color[n].primary" and
"result.color[n].secondary" result bindings, corresponding to SRC_COLOR
and SRC1_COLOR, respectively, for the fragment color output numbered <n>.
Additions to Chapter 4 of the OpenGL 3.0 Specification (Per-Fragment
Operations and the Frame Buffer)
Modify Section 4.4.3, Rendering When an Image of a Bound Texture Object
is Also Attached to the Framebuffer, p. 288
(Replace the complicated set of conditions with the following)
Specifically, the values of rendered fragments are undefined if any
shader stage fetches texels from a given mipmap level, cubemap face, and
array layer of a texture if that same mipmap level, cubemap face, and
array layer of the texture can be written to via fragment shader outputs,
even if the reads and writes are not in the same Draw call. However, an
application can insert MemoryBarrier(TEXTURE_FETCH_BARRIER_BIT_NV) between
Draw calls that have such read/write hazards in order to guarantee that
writes have completed and caches have been invalidated, as described in
section 2.20.X.
Additions to Chapter 5 of the OpenGL 3.0 Specification (Special Functions)
None.
Additions to Chapter 6 of the OpenGL 3.0 Specification (State and
State Requests)
None.
Additions to Appendix A of the OpenGL 3.0 Specification (Invariance)
None.
Additions to the AGL/GLX/WGL Specifications
None.
GLX Protocol
None.
Errors
None, other than new conditions by which a program string would fail to
load.
New State
None.
New Implementation Dependent State
Minimum
Get Value Type Get Command Value Description Sec. Attrib
-------------------------------- ---- --------------- ------- --------------------- ------ ------
MAX_GEOMETRY_PROGRAM_ Z+ GetIntegerv 32 Maximum number of GP 2.X.6.Y -
INVOCATIONS_NV invocations per prim.
MIN_FRAGMENT_INTERPOLATION_ R GetFloatv -0.5 Max. negative offset 2.X.8.Z -
OFFSET_NV for IPAO instruction.
MAX_FRAGMENT_INTERPOLATION_ R GetFloatv +0.5 Max. positive offset 2.X.8.Z -
OFFSET_NV for IPAO instruction.
FRAGMENT_PROGRAM_INTERPOLATION_ Z+ GetIntegerv 4 Subpixel bit count 2.X.8.Z -
OFFSET_BITS_NV for IPAO instruction
Dependencies on NV_gpu_program4, NV_vertex_program4, NV_geometry_program4, and
NV_fragment_program4
This extension is written against the NV_gpu_program4 family of
extensions, and introduces new instruction set features and inputs/outputs
described here. These features are available only if the extension is
supported and the appropriate program header string is used ("!!NVvp5.0"
for vertex programs, "!!NVgp5.0" for geometry programs, and "!!NVfp5.0"
for fragment programs.) When loading a program with an older header (e.g.,
"!!NVvp4.0"), the instruction set features described in this extension are
not available. The features in this extension build upon those documented
in full in NV_gpu_program4.
Dependencies on NV_tessellation_program5
This extension provides the basic assembly instruction set constructs for
tessellation programs. If this extension is supported, tessellation
control and evaluation programs are supported, as described in the
NV_tessellation_program5 specification. There is no separate extension
string for tessellation programs; such support is implied by this
extension.
Dependencies on ARB_transform_feedback3
The concept of multiple vertex streams emitted by a geometry shader is
introduced by ARB_transform_feedback3, as is the description of how they
operate and implementation-dependent limits on the number of streams.
This extension simply provides a mechanism to emit a vertex to more than
one stream. If ARB_transform_feedback3 is not supported, language
describing the EMITS opcode and the restriction on PRIMITIVE_OUT when
EMITS is used should be removed.
Dependencies on NV_shader_buffer_load
The programmability functionality provided by NV_shader_buffer_load is
also incorporated by this extension. Any assembly program using a program
header corresponding to this or any subsequent extension (e.g.,
"!!NVfp5.0") may use the LOAD opcode without needing to declare "OPTION
NV_shader_buffer_load".
NV_shader_buffer_load is required by this extension, which means that the
API mechanisms documented there allowing applications to make a buffer
resident and query its GPU address are available to any applications using
this extension.
In addition to the basic functionality in NV_shader_buffer_load, this
extension provides the ability to load 64-bit integers and floating-point
values using the "S64", "S64X2", "S64X4", "U64", "U64X2", "U64X4", "F64",
"F64X2", and "F64X4" opcode modifiers.
Dependencies on NV_shader_buffer_store
This extension provides assembly programmability support for
NV_shader_buffer_store, which provides the API mechanisms allowing buffer
objects to be stored to. NV_shader_buffer_store does not have a separate
extension string entry, and will always be supported if this extension is
present.
Dependencies on NV_parameter_buffer_object2
The programmability functionality provided by NV_parameter_buffer_object2
is also incorporated by this extension. Any assembly program using a
program header corresponding to this or any subsequent extension (e.g.,
"!!NVfp5.0") may use the LDC opcode without needing to declare "OPTION
NV_parameter_buffer_object2".
In addition to the basic functionality in NV_parameter_buffer_object2,
this extension provides the ability to load 64-bit integers and
floating-point values using the "S64", "S64X2", "S64X4", "U64", "U64X2",
"U64X4", "F64", "F64X2", and "F64X4" opcode modifiers.
Dependencies on OpenGL 3.3, ARB_texture_swizzle, and EXT_texture_swizzle
If OpenGL 3.3, ARB_texture_swizzle, and EXT_texture_swizzle are not
supported, remove the swizzling step from the definition of TXG and TXGO.
Dependencies on ARB_blend_func_extended
If ARB_blend_func_extended is not supported, references to the dual source
color output bindings (result.color.primary and result.color.secondary)
should be removed.
Dependencies on EXT_shader_image_load_store
EXT_shader_image_load_store provides OpenGL Shading Language mechanisms to
load/store to buffer and texture image memory, including spec language
describing memory access ordering and synchronization, a built-in function
(MemoryBarrierEXT) controlling synchronization of memory operations, and
spec language describing early fragment tests that can be enabled via GLSL
fragment shader source. These sections of the EXT_shader_image_load_store
specification apply equally to the assembly program memory accesses
provided by this extension. If EXT_shader_image_load_store is not
supported, the sections of that specification describing these features
should be considered to be added to this extension.
EXT_shader_image_load_store additionally provides and documents assembly
language support for image loads, stores, and atomics as described in the
"Dependencies on NV_gpu_program5" section of EXT_shader_image_load_store.
The features described there are automatically supported for all
NV_gpu_program5 assembly programs without requiring any additional
"OPTION" line.
Dependencies on ARB_shader_subroutine
ARB_shader_subroutine provides and documents assembly language support for
subroutines as described in the "Dependencies on NV_gpu_program5" section
of ARB_shader_subroutine. The features described there are automatically
supported for all NV_gpu_program5 assembly programs without requiring any
additional "OPTION" line.
Issues
(1) Are there any restrictions or performance concerns involving the
support for indexing textures or parameter buffers?
RESOLVED: There are no significant functional limitations. Textures
and parameter buffers accessed with an index must be declared as arrays,
so the assembler knows which textures might be accessed this way.
Additionally, accessing an array of textures or parameter buffers with
an out-of-bounds index will yield undefined results.
In particular, there is no limitation on the values used for indexing --
they are not required to be true constants and are not required to have
the same value for all vertices/fragments in a primitive. However,
using divergent texture or parameter buffer indices may have performance
concerns. We expect that GPU implementations of this extension will run
multiple program threads in parallel (SIMD). If different threads in a
thread group have different indices, it will be necessary to do lookups
in more than one texture at once. This is likely to result in some
thread serialization. We expect that indexed texture or parameter
buffer access where all indices in a thread group match will perform
identically to non-indexed accesses.
(2) Which texture instructions support programmable texel offsets, and
what offset limits apply?
RESOLVED: Most texture instructions (TEX, TXB, TXF, TXG, TXL, TXP)
support both constant texel offsets as provided by NV_gpu_program4 and
programmable texel offsets. TXD supports only constant offsets. TXGO
does not support non-zero constant or programmable offsets in the
texture portion of the instruction, but provides full support for
programmable offsets via the offset vectors passed in its second and
third operands.
For example,
TEX result, coord, texture[0], 2D, (-1,-1);
uses the NV_gpu_program4 mechanism to apply a constant texel offset of
(-1,-1) to the texture coordinates. With programmable offsets, the
following code applies the same offset.
TEMP offxy;
MOV offxy, {-1, -1};
TEX result, coord, texture[0], offset(offxy);
Of course, the programmable form allows the offsets to be computed in
the program and does not require constant values.
For most texture instructions, the range of allowable offsets is
[MIN_PROGRAM_TEXEL_OFFSET_EXT, MAX_PROGRAM_TEXEL_OFFSET_EXT] for both
constant and programmable texel offsets. Constant offsets can be
checked when the program is loaded, and out-of-bounds offsets cause the
program to fail to load. Programmable offsets can not have a
load-time range check; out-of-bounds offsets produce undefined results.
Additionally, the new TXGO instruction has a separate (likely larger)
allowable offset range, [MIN_PROGRAM_TEXTURE_GATHER_OFFSET_NV,
MAX_PROGRAM_TEXTURE_GATHER_OFFSET_NV], that applies to the offset
vectors passed in its second and third operand.
In the initial implementation of this extension, the range limits are
[-8,+7] for most instructions and [-32,+31] for TXGO.
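The range rules above can be summarized in a small C sketch of the
load-time check an assembler might perform on constant offsets; the
function and macro names here are hypothetical, and the limits are the
initial-implementation values quoted above.

```c
#include <stdbool.h>

/* Assumed initial-implementation limits from the text: [-8,+7] for
 * most texture instructions, [-32,+31] for TXGO offset vectors. */
#define MIN_TEXEL_OFFSET   (-8)
#define MAX_TEXEL_OFFSET     7
#define MIN_GATHER_OFFSET  (-32)
#define MAX_GATHER_OFFSET   31

static bool offset_in_range(int off, int lo, int hi)
{
    return off >= lo && off <= hi;
}

/* Returns true if a constant (x,y) offset would pass the load-time
 * check; is_gather selects the wider TXGO range. */
bool validate_const_offset(int x, int y, bool is_gather)
{
    int lo = is_gather ? MIN_GATHER_OFFSET : MIN_TEXEL_OFFSET;
    int hi = is_gather ? MAX_GATHER_OFFSET : MAX_TEXEL_OFFSET;
    return offset_in_range(x, lo, hi) && offset_in_range(y, lo, hi);
}
```

Programmable offsets get no such load-time check, which is why
out-of-bounds values at run time produce undefined results.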
(3) What is TXGO (texture gather with separate offsets) good for?
RESOLVED: TXGO allows for efficiently sampling a single-component
texture with a variety of offsets that need not be contiguous.
For example, a shadow mapping algorithm using a high-resolution shadow
map may have pixels whose footprint covers a large number of texels in
the shadow map. Such pixels could do a single lookup into a
lower-resolution texture (using mipmapping), but quality problems will
arise. Alternately, a shader could perform a large number of texture
lookups using either NEAREST or LINEAR filtering from the
high-resolution texture. NEAREST filtering will require a separate
lookup for each texel accessed; LINEAR filtering may require somewhat
fewer lookups, but all accesses cover a 2x2 portion of the texture. The
TXG instruction added to NV_gpu_program4_1 allows a 2x2 block of texels
to be returned in a single instruction in case the program wants to do
something other than linear filtering with the samples. The TXGO allows
a program to do semi-random sampling of the texture without requiring
that each sample cover a 2x2 block of texels. For example, the TXGO
instruction would allow a program to sample the four texels A, H, J, and O from the
4x4 block depicted below:
TXGO result, coord, {-1,+2,0,+1}, {-1,0,+1,+2}, texture[0], 2D;
The "equivalent" TXG instruction would only sample the four center
texels F, G, J, and K:
TXG result, coord, texture[0], 2D;
All sixteen texels of the footprint could be sampled with four TXG
instructions,
TXG result0, coord, texture[0], 2D, (-1,-1);
TXG result1, coord, texture[0], 2D, (-1,+1);
TXG result2, coord, texture[0], 2D, (+1,-1);
TXG result3, coord, texture[0], 2D, (+1,+1);
but accessing a smaller number of samples spread across the footprint
with fewer instructions may produce results that are good enough.
The figure here depicts a texture with texel (0,0) shown in the
upper-left corner. If you insist on a lower-left origin, please look at
this figure while standing on your head.
(0,0) +-+-+-+-+
|A|B|C|D|
+-+-+-+-+
|E|F|G|H|
+-+-+-+-+
|I|J|K|L|
+-+-+-+-+
|M|N|O|P|
+-+-+-+-+ (4,4)
(4) Why are the results of TXGO (texture gather with separate offsets)
undefined if the wrap mode is CLAMP or MIRROR_CLAMP_EXT?
RESOLVED: The CLAMP and MIRROR_CLAMP_EXT wrap modes are fairly
different from other wrap modes. After adding any instruction offsets,
the spec says to pre-clamp the (u,v) coordinates to [0,texture_size]
before generating the footprint. If such clamping occurs on one edge
for a normal texture filtering operation, the footprint ends up being
half border texels, half edge texels, and the clamping effectively
forces the interpolation weights used for texture filtering to 50/50.
We expect the TXG instruction to be used in cases where an application
may want to do custom filtering, and is in control of its own filtering
weights. Coordinate clamping as above will affect the footprint used
for filtering, but not the weights. In the NV_gpu_program4_1 spec, we
defined the TXG/CLAMP combination to simply return the "normal"
footprint produced after the pre-clamp operation above. Any adjustment
of weights due to clamping is the responsibility of the application. We
don't expect this to be a common operation, because CLAMP_TO_EDGE or
CLAMP_TO_BORDER are much more sensible wrap modes.
The hardware implementing TXGO is anticipated to extract all four
samples in a single pass. However, the spec language is defined for
simplicity to perform four separate "gather" operations with the four
provided offsets, extract a single sample from each, and combine the
four samples into a vector. This would require four separate pre-clamp
operations, which was deemed too costly to implement in hardware for a
wrap mode that doesn't work well with texture gather operations. Even
if such hardware were built, it still wouldn't obtain a footprint
resembling the half-border, half-edge footprint for simple TXGO offsets
-- that would require different per-texel clamping rules for the four
samples. We chose to leave the results of this operation undefined.
(5) Should double-precision floating-point support be required or
optional? If optional, how?
RESOLVED: Double-precision floating-point support will be optional in
case low-end GPUs supporting the remainder of these instruction features
choose to cut costs by removing the silicon necessary to implement
64-bit floating-point arithmetic.
(6) While this extension supports double-precision computation, how can
you provide high-precision inputs and outputs to the GPU programs?
RESOLVED: The underlying hardware implementing this extension does not
provide full support for 64-bit floats, even though DOUBLE is a standard
data type provided by the GL. For example, when specifying a vertex
array with a data type of DOUBLE, the vertex attribute components will
end up being converted to 32-bit floats (FLOAT) by the driver before
being passed to the hardware, and the extra precision in the original
64-bit float values will be lost.
For vertex attributes, the EXT_vertex_attrib_64bit and
NV_vertex_attrib_integer_64bit extensions provide the ability to specify
64-bit vertex attribute components using the VertexAttribL* and
VertexAttribLPointer APIs. Such attributes can be read in a vertex
program using a "LONG ATTRIB" declaration:
LONG ATTRIB vector64;
The LONG modifier can only be used for vertex program inputs; it can not
be used for inputs of any other program type or for outputs of any
program type.
For other cases, this extension provides the PK64 and UP64 instructions
that provide a mechanism to pass 64-bit components using consecutive
32-bit components. For example, a 3-component vector with 64-bit
components can be passed to a vertex shader using multiple vertex
attributes without using the VertexAttribL APIs with the following code:
/* Pass the X/Y components in vertex attribute 0 (X/Y/Z/W). Use
stride to skip over Z. */
glVertexAttribPointer(0, 4, GL_FLOAT, GL_FALSE, 3*sizeof(GLdouble),
(GLdouble *) buffer);
/* Pass the Z components in vertex attribute 1 (X/Y). Use stride to
skip over original X/Y components. */
glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, 3*sizeof(GLdouble),
(GLdouble *) buffer + 2);
In this example, the vertex program would use the PK64 instruction to
reconstruct the 64-bit value for each component as follows:
LONG TEMP reconstructed;
PK64 reconstructed.xy, vertex.attrib[0];
PK64 reconstructed.z, vertex.attrib[1];
A similar technique can be used to pass 64-bit values computed by a GPU
program, using transform feedback or writes to a color buffer. The UP64
instruction would be used to convert the 64-bit computed value into two
32-bit values, which would be written to adjacent components.
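The packing semantics can be mirrored on the CPU: UP64 splits a 64-bit
value into two 32-bit words, and PK64 reassembles it, with no
floating-point conversion in either direction. A minimal C sketch
(function names are illustrative, not part of the extension):

```c
#include <stdint.h>
#include <string.h>

/* CPU analogue of UP64: reinterpret a 64-bit value as two raw 32-bit
 * words (no float conversion occurs). */
void up64(double v, uint32_t halves[2])
{
    memcpy(halves, &v, sizeof v);
}

/* CPU analogue of PK64: reassemble the 64-bit value from two
 * consecutive 32-bit components. */
double pk64(const uint32_t halves[2])
{
    double v;
    memcpy(&v, halves, sizeof v);
    return v;
}
```

Because the halves are raw bit patterns rather than converted floats,
the round trip is exact -- which is why the vertex array example above
passes 32-bit components directly instead of relying on the driver's
DOUBLE-to-FLOAT conversion.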
Note also that the original hardware implementation of this extension
does not support interpolation of 64-bit floating-point values. If an
application desires to pass a 64-bit floating-point value from a vertex
or geometry program to a fragment program, and doesn't require
interpolation, the PK64/UP64 techniques can be combined. For example,
the vertex shader could unpack a 3-component vector with 64-bit
components into a four-component and a two-component 32-bit vector:
LONG TEMP result64;
RESULT result32[2] = { result.attrib[0..1] };
UP64 result32[0], result64.xyxy;
UP64 result32[1].xy, result64.z;
The fragment program would read and reconstruct using PK64:
LONG TEMP input64;
FLAT ATTRIB input32[3] = { fragment.attrib[0..1] };
PK64 input64.xy, input32[0];
PK64 input64.z, input32[1];
Note that such inputs must be declared as "FLAT" in the fragment program
to prevent the hardware from trying to do floating-point interpolation
on the separate 32-bit halves of the value being passed. Such
interpolation would produce complete garbage.
(7) What are instanced geometry programs useful for?
RESOLVED: Instanced geometry programs allow geometry programs that
perform regular operations to run more efficiently.
Consider a simple example of an algorithm that uses geometry programs to
render primitives to a cube map in a single pass. Without instanced
geometry programs, the geometry program to render triangles to the cube
map would do something like:
for (face = 0; face < 6; face++) {
for (vertex = 0; vertex < 3; vertex++) {
project vertex <vertex> onto face <face>, output position
compute/copy attributes of emitted <vertex> to outputs
output <face> to result.layer
emit the projected vertex
}
end the primitive (next triangle)
}
This algorithm would output 18 vertices per input triangle, three for
each cube face. The six triangles emitted would be rasterized, one per
face. Geometry programs that emit a large number of attributes have
often posed performance challenges, since all the attributes must be
stored somewhere until the emitted primitives are consumed by the rest
of the pipeline. Large storage
requirements may limit the number of threads that can be run in parallel
and reduce overall performance.
Instanced geometry programs allow this example to be restructured to run
with six separate threads, one per face. Each thread projects the
triangle to only a single face (identified by the invocation number) and
emits only 3 vertices. The reduced storage requirements allow more
geometry program threads to be run in parallel, with greater overall
efficiency.
Additionally, the total number of attributes that can be emitted by a
single geometry program invocation is limited. However, for instanced
geometry shaders, that limit applies to each of <N> program invocations
which allows for a larger total output. For example, if the GL
implementation supports only 1024 components of output per program
invocation, the 18-vertex algorithm above could emit no more than 56
components per vertex. The same algorithm implemented as a 3-vertex
6-invocation geometry program could theoretically allow for 341
components per vertex.
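The arithmetic in this example is a straight integer division of the
per-invocation output limit by the number of vertices emitted per
invocation; a one-line C sketch (the 1024-component limit is the assumed
figure from the example above):

```c
/* Components available per emitted vertex, given a per-invocation
 * output limit and the vertex count emitted by one invocation.
 * Integer division floors, matching the 56 and 341 figures above. */
int components_per_vertex(int limit, int vertices_per_invocation)
{
    return limit / vertices_per_invocation;
}
```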
(8) What are the special interpolation opcodes (IPAC, IPAO, IPAS) good
for, and how do they work?
RESOLVED: The interpolation opcodes allow programs to control the
frequency and location at which fragment inputs are sampled. Limited
control has been provided in previous extensions, but the support was
more limited. NV_gpu_program4 had an interpolation modifier (CENTROID)
that allowed attributes to be sampled inside the primitive, but that was
a per-attribute modifier -- you could only sample any given attribute at
one location. NV_gpu_program4_1 added a new interpolation modifier
(SAMPLE) that directed that fragment programs be run once per sample,
and that the specified attributes be interpolated at the sample
location. Per-sample interpolation can produce higher quality, but the
performance cost is significant since more fragment program invocations
are required.
This extension provides additional control over interpolation, and
allows programs to interpolate attributes at different locations without
necessarily requiring the performance hit of per-sample invocation.
The IPAC instruction allows an attribute to be sampled at the centroid
location, while still allowing the same attribute to be sampled
elsewhere. The IPAS instruction allows the attribute to be sampled at a
numbered sample location, as per-sample interpolation would do. Multiple
IPAS instructions with different sample numbers allows a program to
sample an attribute at multiple sample points in the pixel and then
combine the samples in a programmable manner, which may allow for higher
quality than simply interpolating at a single representative point in
the pixel. The IPAO instruction allows the attribute to be sampled at
an arbitrary (x,y) offset relative to the pixel center. The range of
supported (x,y) values is limited, and the limits in the initial
implementation are not large enough to permit sampling the attribute
outside the pixel.
Note that previous instruction sets allowed shaders to fake IPAC,
IPAS, and IPAO by a sequence such as:
TEMP ddx, ddy, offset, interp;
MOV interp, fragment.attrib[0]; # start with center
DDX ddx, fragment.attrib[0];
MAD interp, offset.x, ddx, interp; # add offset.x * dA/dx
DDY ddy, fragment.attrib[0];
MAD interp, offset.y, ddy, interp; # add offset.y * dA/dy
However, this method does not apply perspective correction. The quality
of the results may be unacceptable, particularly for primitives that are
nearly perpendicular to the screen.
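The DDX/DDY sequence above is effectively a first-order Taylor
extrapolation in screen space, which the following C sketch makes
explicit. For an attribute that is linear in (x,y) the result is exact;
under perspective the attribute is not linear in screen space, which is
the source of the quality problems just noted.

```c
/* First-order screen-space extrapolation, as performed by the
 * DDX/DDY/MAD sequence above:
 *   interp = A(center) + offset.x * dA/dx + offset.y * dA/dy
 * Exact for screen-space-linear attributes only; no perspective
 * correction is applied. */
float extrapolate(float center, float ddx, float ddy,
                  float off_x, float off_y)
{
    return center + off_x * ddx + off_y * ddy;
}
```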
The semantics of the first operand of these instructions is different
from normal assembly instructions. Operands are normally evaluated by
loading the value of the corresponding variable and applying any
swizzle/negation/absolute value modifier before the instruction is
executed. In the IPAC/IPAO/IPAS instructions, the value of the
attribute is evaluated by the instruction itself. Swizzles, negation,
and absolute value modifiers are still allowed, and are applied after
the attribute values are interpolated.
(9) When using a program that issues global stores (via the STORE
instruction), what amount of execution ordering is guaranteed? How
can an application ensure that writes executed in a shader have
completed and will be visible to other operations using the buffer
object in question?
RESOLVED: There are very few automatic guarantees for potential
write/read or write/write conflicts. Program invocations will generally
run in arbitrary order, and applications can't rely on
read/write order to match primitive order.
To get consistent results when buffers are read and written using
multiple pipeline stages, manual synchronization using the
MemoryBarrierEXT() API documented in EXT_shader_image_load_store or some
other synchronization primitive is necessary.
(10) Unlike most other shader features, the STORE opcode allows for
externally-visible side effects from executing a program. How does
this capability interact with other features of the GL?
RESOLVED: First, some GL implementations support a variety of "early Z"
optimizations designed to minimize unnecessary fragment processing work,
such as executing an expensive fragment program on a fragment that will
eventually fail the depth test. Such optimizations have been valid
because fragment programs had no side effects. That is no longer the
case, and such optimizations may not be employed if the fragment program
performs a global store. However, we provide a new "early depth and
stencil test" enable that allows applications to deterministically
control depth and stencil testing. If enabled, depth testing is always
performed prior to fragment program execution. Fragment programs will
never be run on fragments that fail any of these tests.
Second, we are permitting global stores in all program types; however,
the number of program invocations is not well-defined for some program
types. For example, a GL implementation may choose to combine multiple
instances of identical vertices (e.g., duplicate indices in
DrawElements, immediate-mode vertices with identical data) into one
single vertex program invocation, or it may run a vertex program on each
separately. Similarly, the tessellation primitive generator will
generate independent primitives with duplicated vertices, which may or
may not be combined for tessellation evaluation program execution.
Fragment program execution also has several issues described in more
detail below.
(11) What issues arise when running fragment programs doing global stores?
RESOLVED: The order of per-fragment operations in the existing OpenGL
3.0 specification can be fairly loose, because previously-defined
fragment programs, shaders, and fixed-function fragment processing had
no side effects. With side effects, the order of operations must be
defined more tightly. In particular, the pixel ownership and scissor
tests are specified to be performed prior to fragment program execution,
and we provide an option to perform depth and stencil tests early as
well.
OpenGL implementations sometimes run fragment programs on "helper"
pixels that have no coverage in order to be able to compute sane partial
derivatives for fragment program instructions (DDX, DDY) or automatic
level-of-detail calculation for texturing. In this approach,
derivatives are approximated by computing the difference in a quantity
computed for a given fragment at (x,y) and a fragment at a neighboring
pixel. When a fragment program is executed on a "helper" pixel, global
stores have no effect. Helper pixels aren't explicitly mentioned in the
spec body; instead, partial derivatives are obtained by magic.
If a fragment program contains a KIL instruction, compilers may not
reorder code where an ATOM or STORE execution is executed before a KIL
instruction that logically precedes it in flow control. Once a fragment
is killed, subsequent atomics or stores should never be executed.
Multisample rasterization poses several issues for fragment programs
with global stores. The number of times a fragment program is executed
for multisample rendering is not fully specified, which gives
implementations a number of different choices -- pure multisample (only
runs once), pure supersample (runs once per covered sample), or modes in
between. There are some ways for an application to indirectly control
the behavior -- for example, fragment programs specifying per-sample
attribute interpolation are guaranteed to run once per covered sample.
Note that when rendering to a multisample buffer, a pair of adjacent
triangles may cause a fragment program to be executed more than once at
a given (x,y) with different sets of samples covered. This can also
occur in the interior of a quadrilateral or polygon primitive.
Implementations are permitted to split quads and polygons with >3
vertices into triangles, creating interior edges that split a pixel.
(12) What happens if early fragment tests are enabled, the early depth
test passes, and a fragment program that computes a new depth value
is executed?
RESOLVED: The depth value produced by the fragment program has no
effect if early fragment tests are enabled. The depth value computed by
a fragment program is used only by the post-fragment program stencil and
depth tests, and those tests always have no effect when early depth
testing is enabled.
(13) How do early fragment tests interact with occlusion queries?
RESOLVED: When early fragment tests are enabled, sample counting for
occlusion queries also happens prior to fragment program execution.
Enabling early fragment tests can change the overall sample count,
because samples killed by alpha test and alpha to coverage will still be
counted if early fragment tests are enabled.
(14) What happens if a program performs a global store to a GPU address
corresponding to a read-only buffer mapping? What if it performs a
global read to a write-only mapping?
RESOLVED: Implementations may choose to implement full memory protection,
in which case accesses using the wrong type of memory mapping will fault
and lead to termination of the application.
However, full memory protection is not required in this extension --
implementations may choose to substitute a read-write mapping in place
of a read-only or write-only mapping. As a result, we specify the
result of such invalid loads and stores to be undefined.
Note that if a program erroneously writes to nominally read-only
mappings, the results may be weird. If the implementation substitutes a
read-write mapping, such invalid writes are likely to proceed normally.
However, if the application later makes a buffer object non-resident and
the memory manager of the GL implementation needs to move the buffer,
the GL may assume that the contents of the buffer have not been modified
and thus discard the new values written by the (invalid) global store
instructions.
(15) What performance considerations apply to atomics?
RESOLVED: Atomics can be useful for operations like locking, or for
maintaining counters. Note that high-performance GPUs may have hundreds
of program threads in flight at once, and may also have some SIMD
characteristics (where threads are grouped and run as a unit). Using
ATOM instructions with a single memory address to implement a critical
section will result in serial execution -- only one of the hundreds of
threads can execute code in the critical section at a time.
When a global operation would be done under a lock, it may be possible
to improve performance if the algorithm can be parallelized to have
multiple critical sections. For example, an application could allocate
an array of shared resources, each protected by its own lock, and use
the LSBs of the primitive ID or some function of the screen-space (x,y)
to determine which resource in the array to use.
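The lock-striping idea in the last paragraph reduces to selecting a
resource index from the low bits of some per-thread value. A minimal C
sketch, assuming a power-of-two lock count (both names here are
hypothetical):

```c
/* Assumed power-of-two number of independently locked resources, so
 * the mask below extracts the low bits of the primitive ID. */
#define NUM_LOCKS 16

/* Selects which of the NUM_LOCKS shared resources a given primitive
 * should use, spreading contention across the array. */
unsigned lock_bucket(unsigned primitive_id)
{
    return primitive_id & (NUM_LOCKS - 1);
}
```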
(16) The atomic instruction ATOM returns the old contents of memory into
the result register. Should we provide a version of this opcode
that doesn't return a value?
RESOLVED: No. In theory, atomics that don't return any values can
perform better (because the program may not need to allocate resources
to hold a result or wait for the result). However, a new opcode isn't
required to obtain this behavior -- a compiler can recognize that the
result of an ATOM instruction is written to a "dummy" temporary that
isn't read by subsequent instructions:
TEMP junk;
ATOM.ADD.U32 junk, address, 1;
The compiler can also recognize that the result will always be discarded
if a conditional write mask of "(FL)" is used.
ATOM.ADD.U32 not_junk (FL), address, 1;
(17) How do we ensure that memory access made by multiple program
invocations of possibly different types are coherent?
RESOLVED: Atomic instructions allow program invocations to coordinate
using shared global memory addresses. However, memory transactions,
including atomics, are not guaranteed to land in the order specified in
the program; they may be reordered by the compiler, cached in different
memory hierarchies, and stored in a distributed memory system where
later stores to one "partition" might be completed prior to earlier
stores to another. The MEMBAR instruction helps control memory
transaction ordering by ensuring that all memory transactions prior to
the barrier complete before any after the barrier. Additionally, the
".COH" modifier ensures that memory transactions using the modifier are
cached coherently and will be visible to other shader invocations.
(18) How do the TXG and TXGO opcodes work with sRGB textures?
RESOLVED. Gamma-correction is applied to the texture source color
before "gathering" and hence applies to all four components, unless
the texture swizzle of the selected component is ALPHA in which case
no gamma-correction is applied.
(19) How can render-to-texture algorithms take advantage of
MemoryBarrierEXT, nominally provided for global memory transactions?
RESOLVED: Many algorithms use RTT to ping-pong between two allocations,
using the result of one rendering pass as the input to the next.
Existing mechanisms require expensive FBO Binds, DrawBuffer changes, or
FBO attachment changes to safely swap the render target and texture. With
memory barriers, layered geometry shader rendering, and texture arrays,
an application can very cheaply ping-pong between two layers of a single
texture. i.e.
X = 0;
// Bind the array texture to a texture unit
// Attach the array texture to an FBO using FramebufferTextureARB
while (!done) {
// Stuff X in a constant, vertex attrib, etc.
Draw -
Texturing from layer X;
Writing gl_Layer = 1 - X in the geometry shader;
MemoryBarrierNV(TEXTURE_FETCH_BARRIER_BIT_NV);
X = 1 - X;
}
However, be warned that this requires geometry shaders and hence adds
the overhead that all geometry must pass through an additional program
stage, so an application using large amounts of geometry could become
geometry-limited or more shader-limited.
(20) What is the ".PREC" instruction modifier good for?
RESOLVED: ".PREC" provides invariance guarantees that are useful for
certain algorithms. Using ".PREC", it is possible to ensure that an
algorithm can be written to produce identical results on subtly
different inputs. For example, the order of vertices visible to a
geometry or tessellation shader used to subdivide primitive edges might
present an edge shared between two primitives in one direction for one
primitive and the other direction for the adjacent primitive. Even if
the weights are identical in the two cases, there may be cracking if the
computations are being done in an order-dependent manner. If the
position of a new vertex were evaluated with the code below using
limited-precision floating-point math, it's not necessarily the case
that we will get the same result for inputs (a,b,c) and (c,b,a) in the
following code:
ADD result, a, b;
ADD result, result, c;
There are two problems with this code: the rounding errors will be
different and the implementation is free to rearrange the computation
order. The code can be rewritten as follows with ".PREC" and a
symmetric evaluation order to ensure a precise result with the inputs
reversed:
ADD result, a, c;
ADD.PREC result, result, b;
Note that in this example, the first instruction doesn't need the
".PREC" qualifier because the second instruction requires that the
implementation compute <a>+<c>, which will be done reliably if <a> and
<c> are inputs. If <a> and <c> were results of other computations, the
first add and possibly the dependent computations may also need to be
tagged with ".PREC" to ensure reliable results.
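For instance, if <a> and <c> were themselves computed from other inputs,
the producing instructions might be tagged as well.  A sketch (the
operands <u>, <w>, and <scale> are illustrative only):

      MUL.PREC a, scale, u;        # dependent computations also tagged
      MUL.PREC c, scale, w;
      ADD.PREC result, a, c;       # symmetric in <a> and <c>
      ADD.PREC result, result, b;

Here, swapping the roles of (a, u) and (c, w) yields the same sequence
of operations, so the final result is reliably identical.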
The ".PREC" modifier will disable certain optimizations and thus carries
a performance cost.
(21) What are the TGALL, TGANY, TGEQ instructions good for?
RESOLVED: If an implementation performs SIMD thread execution,
divergent branching may result in reduced performance if the "if" and
"else" blocks of an "if" statement are executed sequentially. For
example, an algorithm may have both a "fast path" that performs a
computation quickly for a subset of all cases and a "slow path" that
handles all cases correctly, but more slowly. When performing SIMD
execution, code like the following:
SNE.S.CC cc.x, condition.x;
IF NE.x;
# do fast path
ELSE;
# do slow path
ENDIF;
may end up executing *both* the fast and slow paths for a SIMD thread
group if <condition> diverges, and may execute more slowly than simply
executing the slow path unconditionally. These instructions allow code
like:
# Condition code matches NE if and only if condition.x is non-zero
# for all threads.
TGALL.S.CC cc.x, condition.x;
IF NE.x;
# do fast path
ELSE;
# do slow path
ENDIF;
that executes the fast path if and only if it can be used for *all*
threads in the group. For thread groups where <condition> diverges,
this algorithm would unconditionally run the slow path, but would never
run both in sequence.
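Similarly, TGANY can be used to skip a computation entirely when no
thread in the group needs it.  A sketch along the lines of the example
above:

      # Condition code matches NE if and only if condition.x is non-zero
      # for at least one thread in the group.
      TGANY.S.CC cc.x, condition.x;
      IF NE.x;
        # computation needed by at least one thread; skipped entirely
        # when <condition> is zero for the whole group
      ENDIF;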
Revision History
Rev. Date Author Changes
---- -------- -------- -----------------------------------------
7 09/11/14 pbrown Minor typo fixes.
6 07/04/13 pbrown Add missing language describing the
<texImageUnitComp> grammar rule for component
selection in TXG and TXGO instructions.
5 09/23/10 pbrown Add missing constants for {MIN,MAX}_PROGRAM_
TEXTURE_GATHER_OFFSET_NV (same as ARB/core).
Add missing description for "su" in the opcode
table; fix a couple operand order bugs for
STORE.
4 06/22/10 pbrown Specify that the y/z/w component of the ATOM
results are undefined, as is the case with
ATOMIM from EXT_shader_image_load_store.
3 04/13/10 pbrown Remove F32 support from ATOM.ADD.
2 03/22/10 pbrown Various wording updates to the spec overview,
dependencies, issues, and body. Remove various
spec language that has been refactored into the
EXT_shader_image_load_store specification.
1 pbrown Internal revisions.