| Name |
| |
| NV_gpu_program5 |
| |
| Name Strings |
| |
| GL_NV_gpu_program5 |
| GL_NV_gpu_program_fp64 |
| |
| Contact |
| |
| Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com) |
| |
| Status |
| |
| Shipping. |
| |
| Version |
| |
| Last Modified Date: 09/11/2014 |
| NVIDIA Revision: 7 |
| |
| Number |
| |
| 388 |
| |
| Dependencies |
| |
| OpenGL 2.0 is required. |
| |
| This extension is written against the OpenGL 3.0 specification. |
| |
| NV_gpu_program4 and NV_gpu_program4_1 are required. |
| |
| NV_shader_buffer_load is required. |
| |
| NV_shader_buffer_store is required. |
| |
| This extension is written against and interacts with the NV_gpu_program4, |
| NV_vertex_program4, NV_geometry_program4, and NV_fragment_program4 |
| specifications. |
| |
| This extension interacts with NV_tessellation_program5. |
| |
| This extension interacts with ARB_transform_feedback3. |
| |
| This extension interacts trivially with NV_shader_buffer_load. |
| |
| This extension interacts trivially with NV_shader_buffer_store. |
| |
| This extension interacts trivially with NV_parameter_buffer_object2. |
| |
| This extension interacts trivially with OpenGL 3.3, ARB_texture_swizzle, |
| and EXT_texture_swizzle. |
| |
| This extension interacts trivially with ARB_blend_func_extended. |
| |
| This extension interacts trivially with EXT_shader_image_load_store. |
| |
| This extension interacts trivially with ARB_shader_subroutine. |
| |
| If the 64-bit floating-point portion of this extension is not supported, |
| "GL_NV_gpu_program_fp64" will not be found in the extension string. |
| |
| Overview |
| |
| This specification documents the common instruction set and basic |
| functionality provided by NVIDIA's 5th generation of assembly instruction |
| sets supporting programmable graphics pipeline stages. |
| |
| The instruction set builds upon the basic framework provided by the |
| ARB_vertex_program and ARB_fragment_program extensions to expose |
| considerably more capable hardware. In addition to new capabilities for |
| vertex and fragment programs, this extension provides new functionality |
| for geometry programs as originally described in the NV_geometry_program4 |
| specification, and serves as the basis for the new tessellation control |
| and evaluation programs described in the NV_tessellation_program5 |
| extension. |
| |
| Programs using the functionality provided by this extension should begin |
| with the program headers "!!NVvp5.0" (vertex programs), "!!NVtcp5.0" |
| (tessellation control programs), "!!NVtep5.0" (tessellation evaluation |
| programs), "!!NVgp5.0" (geometry programs), and "!!NVfp5.0" (fragment |
| programs). |
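The header strings above can be used to classify a program string before it is handed to the GL. The following is a minimal Python sketch, not part of the extension: the names `PROGRAM_HEADERS` and `split_program` are hypothetical, and the body used in the test is a placeholder rather than a complete program. It assumes, as the grammar sections state, that parsing begins with the character immediately following the header.

```python
# Header strings listed in the overview, mapped to a descriptive program type.
PROGRAM_HEADERS = {
    "!!NVvp5.0": "vertex",
    "!!NVtcp5.0": "tessellation control",
    "!!NVtep5.0": "tessellation evaluation",
    "!!NVgp5.0": "geometry",
    "!!NVfp5.0": "fragment",
}


def split_program(program):
    """Return (program_type, body) for a program string.

    The body starts at the character immediately following the header.
    Raises ValueError if no known header is present (hypothetical helper).
    """
    for header, program_type in PROGRAM_HEADERS.items():
        if program.startswith(header):
            return program_type, program[len(header):]
    raise ValueError("unrecognized program header")
```

A string that begins with an older header, or with leading whitespace, is rejected by this sketch; whether a given implementation tolerates either case is not addressed here.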
| |
| This extension provides a variety of new features, including: |
| |
| * support for 64-bit integer operations; |
| |
| * the ability to dynamically index into an array of texture units or |
| program parameter buffers; |
| |
| * extending texel offset support to allow loading texel offsets from |
| regular integer operands computed at run-time, instead of requiring |
| that the offsets be constants encoded in texture instructions; |
| |
| * extending TXG (texture gather) support to return the 2x2 footprint |
| from any component of the texture image instead of always returning |
| the first (x) component; |
| |
| * extending TXG to support shadow comparisons in conjunction with a |
| depth texture, via the SHADOW* targets; |
| |
| * further extending texture gather support to provide a new opcode |
| (TXGO) that applies a separate texel offset vector to each of the four |
| samples returned by the instruction; |
| |
| * bit manipulation instructions, including ones to find the position of |
| the most or least significant set bit, bitfield insertion and |
| extraction, and bit reversal; |
| |
| * a general data conversion instruction (CVT) supporting conversion |
| between any two data types supported by this extension; and |
| |
| * new instructions to compute the composite of a set of boolean |
| conditions across a group of shader threads. |
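As an illustration of the bit manipulation features in the list above, the following Python sketch models bit count, find-least-significant-bit, find-most-significant-bit, and bit reversal on a single unsigned 32-bit value. It is a reference model under stated assumptions, not the instruction definitions: the function names are hypothetical, vector operands and signed operands are not modeled, and returning -1 when no bit is set is an assumption made for this sketch.

```python
MASK32 = 0xFFFFFFFF  # the sketch models unsigned 32-bit scalars only


def bit_count(x):
    """Number of set bits in the low 32 bits of x."""
    return bin(x & MASK32).count("1")


def find_lsb(x):
    """Position of the least significant set bit, or -1 if none (assumption)."""
    x &= MASK32
    return (x & -x).bit_length() - 1


def find_msb(x):
    """Position of the most significant set bit, or -1 if none (assumption)."""
    return (x & MASK32).bit_length() - 1


def bit_reverse(x):
    """Reverse the order of the low 32 bits of x."""
    return int(format(x & MASK32, "032b")[::-1], 2)
```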
| |
| This extension also provides some new capabilities for individual program |
| types, including: |
| |
| * support for instanced geometry programs, where a geometry program may |
| be run multiple times for each primitive; |
| |
| * support for emitting vertices in a geometry program where each vertex |
| emitted may be directed at a specified vertex stream and captured |
| using the ARB_transform_feedback3 extension; |
| |
| * support for interpolating an attribute at a programmable offset |
| relative to the pixel center (IPAO), at a programmable sample number |
| (IPAS), or at the fragment's centroid location (IPAC) in a fragment |
| program; |
| |
| * support for reading a mask of covered samples in a fragment program; |
| |
| * support for reading a point sprite coordinate directly in a fragment |
| program, without overriding a texture coordinate; |
| |
| * support for reading patch primitives and per-patch attributes |
| (introduced by ARB_tessellation_shader) in a geometry program; and |
| |
| * support for multiple output vectors for a single color output in a |
| fragment program (as used by ARB_blend_func_extended). |
| |
| This extension also provides optional support for 64-bit-per-component |
| variables and 64-bit floating-point arithmetic. These features are |
| supported if and only if "GL_NV_gpu_program_fp64" is found in the |
| extension string. |
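Testing for the optional 64-bit floating-point support amounts to looking for the name string as a whole token in the space-separated extension string. The Python sketch below shows why a plain substring search is not enough. The helper name `has_extension` and the longer extension name in the test are hypothetical, chosen only to show the pitfall.

```python
def has_extension(extension_string, name):
    """Return True if `name` appears as a whole token in the extension string.

    Splitting on whitespace avoids a false match when `name` is merely a
    prefix of a longer extension name.
    """
    return name in extension_string.split()
```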
| |
| This extension incorporates the memory access operations from the |
| NV_shader_buffer_load and NV_parameter_buffer_object2 extensions, |
| originally built as add-ons to NV_gpu_program4. It also provides the |
| following new capabilities: |
| |
| * support for the features without requiring a separate OPTION keyword; |
| |
| * support for indexing into an array of constant buffers using the LDC |
| opcode added by NV_parameter_buffer_object2; |
| |
| * support for storing into buffer objects at a specified GPU address |
| using the STORE opcode, and allowing applications to create READ_WRITE |
| and WRITE_ONLY mappings when making a buffer object resident using the |
| API mechanisms in the NV_shader_buffer_store extension; |
| |
| * storage instruction modifiers to allow loading and storing 64-bit |
| component values; |
| |
| * support for atomic memory transactions using the ATOM opcode, where |
| the instruction atomically reads the memory pointed to by a pointer, |
| performs a specified computation, stores the results of that |
| computation, and returns the original value read; |
| |
| * support for memory barrier transactions using the MEMBAR opcode, which |
| ensures that all memory stores issued prior to the opcode complete |
| prior to any subsequent memory transactions; and |
| |
| * a fragment program option to specify that depth and stencil tests are |
| performed prior to fragment program execution. |
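The read-modify-write behavior described for ATOM in the list above can be modeled in a few lines of Python. This is a single-threaded sketch under stated assumptions: memory is a dictionary of address to unsigned 32-bit value, the names `ATOM_OPS` and `atom` are hypothetical, and the operation names are borrowed from the <opModifier> grammar. Their exact pairing with ATOM and their precise semantics are defined outside this overview, so the operations shown are assumptions.

```python
import operator

# Assumed computations, keyed by modifier names that appear in the grammar.
ATOM_OPS = {
    "ADD": operator.add,
    "MIN": min,
    "MAX": max,
    "AND": operator.and_,
    "OR": operator.or_,
    "XOR": operator.xor,
    "EXCH": lambda old, value: value,
}


def atom(memory, address, op, operand):
    """Model one atomic transaction on unsigned 32-bit memory.

    Reads the value at `address`, stores the computed result, and returns
    the original value read. Real hardware performs these steps atomically;
    this model simply has no concurrent writers.
    """
    old = memory[address]
    memory[address] = ATOM_OPS[op](old, operand) & 0xFFFFFFFF
    return old
```

The key point is the return value: the caller always receives the value that was in memory before the update, which is what makes patterns such as counters and slot allocation possible.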
| |
| Additionally, the assembly program languages supported by this extension |
| include support for reading, writing, and performing atomic memory |
| operations on texture image data using the opcodes and mechanisms |
| documented in the "Dependencies on NV_gpu_program5" section of the |
| EXT_shader_image_load_store extension. |
| |
| New Procedures and Functions |
| |
| None. |
| |
| New Tokens |
| |
| Accepted by the <pname> parameter of GetBooleanv, GetIntegerv, |
| GetFloatv, and GetDoublev: |
| |
| MAX_GEOMETRY_PROGRAM_INVOCATIONS_NV 0x8E5A |
| MIN_FRAGMENT_INTERPOLATION_OFFSET_NV 0x8E5B |
| MAX_FRAGMENT_INTERPOLATION_OFFSET_NV 0x8E5C |
| FRAGMENT_PROGRAM_INTERPOLATION_OFFSET_BITS_NV 0x8E5D |
| MIN_PROGRAM_TEXTURE_GATHER_OFFSET_NV 0x8E5E |
| MAX_PROGRAM_TEXTURE_GATHER_OFFSET_NV 0x8E5F |
| |
| |
| Additions to Chapter 2 of the OpenGL 3.0 Specification (OpenGL Operation) |
| |
| Modify Section 2.X.2 of NV_fragment_program4, Program Grammar |
| |
| (modify the section, updating the program header string for the extended |
| instruction set) |
| |
| Fragment programs are required to begin with the header string |
| "!!NVfp5.0". This header string identifies the subsequent program body as |
| being a fragment program and indicates that it should be parsed according |
| to the base NV_gpu_program5 grammar plus the additions below. Program |
| string parsing begins with the character immediately following the header |
| string. |
| |
| (add/change the following rules to the NV_fragment_program4 and |
| NV_gpu_program5 base grammars) |
| |
| <SpecialInstruction> ::= "IPAC" <opModifiers> <instResult> "," |
| <instOperandV> |
| | "IPAO" <opModifiers> <instResult> "," |
| <instOperandV> "," <instOperandV> |
| | "IPAS" <opModifiers> <instResult> "," |
| <instOperandV> "," <instOperandS> |
| |
| <interpModifier> ::= "SAMPLE" |
| |
| <attribBasic> ::= <fragPrefix> "sampleid" |
| | <fragPrefix> "samplemask" |
| | <fragPrefix> "pointcoord" |
| |
| <resultBasic> ::= <resPrefix> "color" <resultOptColorNum> |
| <resultOptColorType> |
| | <resPrefix> "samplemask" |
| |
| <resultOptColorType> ::= "" |
| | "." <colorType> |
| |
| |
| Modify Section 2.X.2 of NV_geometry_program4, Program Grammar |
| |
| (modify the section, updating the program header string for the extended |
| instruction set) |
| |
| Geometry programs are required to begin with the header string |
| "!!NVgp5.0". This header string identifies the subsequent program body as |
| being a geometry program and indicates that it should be parsed according |
| to the base NV_gpu_program5 grammar plus the additions below. Program |
| string parsing begins with the character immediately following the header |
| string. |
| |
| (add the following rules to the NV_geometry_program4 and NV_gpu_program5 |
| base grammars) |
| |
| <declaration> ::= "INVOCATIONS" <int> |
| |
| <declPrimInType> ::= "PATCHES" |
| |
| <SpecialInstruction> ::= "EMITS" <instOperandS> |
| |
| <attribBasic> ::= <primPrefix> "invocation" |
| | <primPrefix> "vertexcount" |
| | <attribTessOuter> <optArrayMemAbs> |
| | <attribTessInner> <optArrayMemAbs> |
| | <attribPatchGeneric> <optArrayMemAbs> |
| |
| <attribMulti> ::= <attribTessOuter> <arrayRange> |
| | <attribTessInner> <arrayRange> |
| | <attribPatchGeneric> <arrayRange> |
| |
| <attribTessOuter> ::= <primPrefix> "." "tessouter" |
| |
| <attribTessInner> ::= <primPrefix> "." "tessinner" |
| |
| <attribPatchGeneric> ::= <primPrefix> "." "patch" "." "attrib" |
| |
| |
| Modify Section 2.X.2 of NV_vertex_program4, Program Grammar |
| |
| (modify the section, updating the program header string for the extended |
| instruction set) |
| |
| Vertex programs are required to begin with the header string "!!NVvp5.0". |
| This header string identifies the subsequent program body as being a |
| vertex program and indicates that it should be parsed according to the |
| base NV_gpu_program5 grammar plus the additions below. Program string |
| parsing begins with the character immediately following the header string. |
| |
| |
| Modify Section 2.X.2 of NV_gpu_program4, Program Grammar |
| |
| (add the following grammar rules to the NV_gpu_program4 base grammar; |
| additional grammar rules usable for assembly programs are documented in |
| the EXT_shader_image_load_store and ARB_shader_subroutine specifications) |
| |
| <instruction> ::= <MemInstruction> |
| |
| <MemInstruction> ::= <ATOMop_instruction> |
| | <STOREop_instruction> |
| | <MEMBARop_instruction> |
| |
| <VECTORop> ::= "BFR" |
| | "BTC" |
| | "BTFL" |
| | "BTFM" |
| | "PK64" |
| | "LDC" |
| | "CVT" |
| | "TGALL" |
| | "TGANY" |
| | "TGEQ" |
| | "UP64" |
| |
| <SCALARop> ::= "LOAD" |
| |
| <BINop> ::= "BFE" |
| |
| <TRIop> ::= "BFI" |
| |
| <TEXop_instruction> ::= <TEXop> <opModifiers> <instResult> "," |
| <instOperandV> "," <instOperandV> "," |
| <texAccess> |
| |
| <TEXop> ::= "TXG" |
| | "LOD" |
| |
| <TXDop> ::= "TXGO" |
| |
| <ATOMop_instruction> ::= <ATOMop> <opModifiers> <instResult> "," |
| <instOperandV> "," <instOperandS> |
| |
| <ATOMop> ::= "ATOM" |
| |
| <STOREop_instruction> ::= <STOREop> <opModifiers> <instOperandV> "," |
| <instOperandS> |
| |
| <STOREop> ::= "STORE" |
| |
| <MEMBARop_instruction> ::= <MEMBARop> <opModifiers> |
| |
| <MEMBARop> ::= "MEMBAR" |
| |
| <opModifier> ::= "F16" |
| | "F32" |
| | "F64" |
| | "F32X2" |
| | "F32X4" |
| | "F64X2" |
| | "F64X4" |
| | "S8" |
| | "S16" |
| | "S32" |
| | "S32X2" |
| | "S32X4" |
| | "S64" |
| | "S64X2" |
| | "S64X4" |
| | "U8" |
| | "U16" |
| | "U32" |
| | "U32X2" |
| | "U32X4" |
| | "U64" |
| | "U64X2" |
| | "U64X4" |
| | "ADD" |
| | "MIN" |
| | "MAX" |
| | "IWRAP" |
| | "DWRAP" |
| | "AND" |
| | "OR" |
| | "XOR" |
| | "EXCH" |
| | "CSWAP" |
| | "COH" |
| | "ROUND" |
| | "CEIL" |
| | "FLR" |
| | "TRUNC" |
| | "PREC" |
| | "VOL" |
| |
| <texAccess> ::= <textureUseS> "," <texTarget> <optTexOffset> |
| | <textureUseV> "," <texTarget> <optTexOffset> |
| |
| <texTarget> ::= "ARRAYCUBE" |
| | "SHADOWARRAYCUBE" |
| |
| <optTexOffset> ::= /* empty */ |
| | <texOffset> |
| |
| <texOffset> ::= "offset" "(" <instOperandV> ")" |
| |
| <namingStatement> ::= <TEXTURE_statement> |
| |
| <BUFFER_statement> ::= <bufferDeclType> <establishName> |
| <optArraySize> <optArraySize> "=" |
| <bufferMultInit> |
| |
| <bufferDeclType> ::= "CBUFFER" |
| |
| <TEXTURE_statement> ::= "TEXTURE" <establishName> <texSingleInit> |
| | "TEXTURE" <establishName> <optArraySize> |
| <texMultipleInit> |
| |
| <texSingleInit> ::= "=" <textureUseDS> |
| |
| <texMultipleInit> ::= "=" "{" <texItemList> "}" |
| |
| <texItemList> ::= <textureUseDM> |
| | <textureUseDM> "," <texItemList> |
| |
| <bufferBinding> ::= "program" "." "buffer" <arrayRange> |
| |
| <textureUseS> ::= <textureUseV> <texImageUnitComp> |
| |
| <textureUseV> ::= <texImageUnit> |
| | <texVarName> <optArrayMem> |
| |
| <textureUseDS> ::= "texture" <arrayMemAbs> |
| |
| <textureUseDM> ::= <textureUseDS> |
| | "texture" <arrayRange> |
| |
| <texImageUnitComp> ::= <scalarSuffix> |
| |
| |
| Modify Section 2.X.3.1, Program Variable Types |
| |
| (IGNORE if GL_NV_gpu_program_fp64 is not found in the extension string. |
| Otherwise modify storage size modifiers to guarantee that "LONG" |
| variables are at least 64 bits in size.) |
| |
| Explicitly declared variables may optionally have one storage size |
| modifier. Variables declared as "SHORT" will be represented using at least |
| 16 bits per component. "SHORT" floating-point values will have at least 5 |
| bits of exponent and 10 bits of mantissa. Variables declared as "LONG" |
| will be represented with at least 64 bits per component. "LONG" |
| floating-point values will have at least 11 bits of exponent and 52 bits |
| of mantissa. If no size modifier is provided, the GL will automatically |
| select component sizes. Implementations are not required to support more |
| than one component size, so "SHORT", "LONG", and the default could all |
| refer to the same component size. The "LONG" modifier is supported only |
| for declarations of temporary variables ("TEMP"), and attribute variables |
| ("ATTRIB") in vertex programs. The "SHORT" modifier is supported only |
| for declarations of temporary variables and result variables ("OUTPUT"). |
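The minimum bit counts above match the exponent and mantissa widths of the IEEE 754 binary16 and binary64 formats. The Python sketch below splits a value into its sign, biased exponent, and mantissa fields for those two layouts. It is illustrative only: the specification states minimums, so an implementation may use wider representations, and the helper name `float_fields` is hypothetical.

```python
import struct

# (struct format, exponent bits, mantissa bits) for the minimum layouts.
SHORT_MIN = ("e", 5, 10)   # IEEE 754 binary16
LONG_MIN = ("d", 11, 52)   # IEEE 754 binary64


def float_fields(value, layout):
    """Return (sign, biased_exponent, mantissa) of `value` in `layout`."""
    fmt, exp_bits, man_bits = layout
    raw = int.from_bytes(struct.pack(">" + fmt, value), "big")
    mantissa = raw & ((1 << man_bits) - 1)
    exponent = (raw >> man_bits) & ((1 << exp_bits) - 1)
    sign = raw >> (man_bits + exp_bits)
    return sign, exponent, mantissa
```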
| |
| |
| Modify Section 2.X.3.2 of the NV_fragment_program4 specification, Program |
| Attribute Variables. |
| |
| (Add table entries and relevant text describing the fragment program |
| input sample mask and point sprite coordinate variables.) |
| |
| Fragment Attribute Binding Components Underlying State |
| -------------------------- ---------- ---------------------------- |
| fragment.samplemask (m,-,-,-) fragment coverage mask |
| fragment.pointcoord (s,t,-,-) fragment point sprite coordinate |
| |
| If a fragment attribute binding matches "fragment.samplemask", the "x" |
| component is filled with a coverage mask indicating the set of samples |
| covered by this fragment. The coverage mask is a bitfield, where bit <n> |
| is one if the sample number <n> is covered and zero otherwise. If |
| multisample buffers are not available (SAMPLE_BUFFERS is zero), bit zero |
| indicates if the center of the pixel corresponding to the fragment is |
| covered. |
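The coverage mask layout described above can be decoded as follows. This Python sketch is illustrative only; the function name `covered_samples` and the `num_samples` parameter are hypothetical and stand for however the application knows the sample count.

```python
def covered_samples(mask, num_samples):
    """Return the sample numbers whose bit is set in the coverage mask.

    Bit n of `mask` is one if sample number n is covered, zero otherwise.
    """
    return [n for n in range(num_samples) if (mask >> n) & 1]
```

With no multisample buffers, `num_samples` is one and the result is either `[0]` (pixel center covered) or the empty list.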
| |
| If a fragment attribute binding matches "fragment.pointcoord", the "x" and |
| "y" components are filled with the s and t point sprite coordinates |
| (section 3.3.1), respectively. The "z" and "w" components are undefined. |
| If the fragment is generated by any primitive other than a point, or if |
| point sprites are disabled, all four components of the binding are |
| undefined. |
| |
| Modify Section 2.X.3.2 of the NV_geometry_program4 specification, Program |
| Attribute Variables. |
| |
| (Add a table entry and relevant text describing the geometry program |
| invocation attribute and per-patch attributes.) |
| |
| Geometry Vertex Binding Components Description |
| ----------------------------- ---------- ---------------------------- |
| ... |
| primitive.invocation (id,-,-,-) geometry program invocation |
| primitive.tessouter[n] (x,-,-,-) outer tess. level n |
| primitive.tessinner[n] (x,-,-,-) inner tess. level n |
| primitive.patch.attrib[n] (x,y,z,w) generic patch attribute n |
| primitive.tessouter[n..o] (x,-,-,-) outer tess. levels n to o |
| primitive.tessinner[n..o] (x,-,-,-) inner tess. levels n to o |
| primitive.patch.attrib[n..o] (x,y,z,w) generic patch attrib n to o |
| primitive.vertexcount (c,-,-,-) vertices in primitive |
| |
| ... |
| |
| If a geometry attribute binding matches "primitive.invocation", the "x" |
| component is filled with an integer giving the number of previous |
| invocations of the geometry program on the primitive being processed. If |
| the geometry program is invoked only once per primitive (default), this |
| component will always be zero. If the program is invoked multiple times |
| (via the INVOCATIONS declaration), the component will be zero on the first |
| invocation, one on the second, and so forth. The "y", "z", and "w" |
| components of the variable are always undefined. |
| |
| If an attribute binding matches "primitive.tessouter[n]", the "x" |
| component is filled with the per-patch outer tessellation level numbered |
| <n> of the input patch. <n> must be less than four. The "y", "z", and |
| "w" components are always undefined. A program will fail to load if this |
| attribute binding is used and the input primitive type is not PATCHES. |
| |
| If an attribute binding matches "primitive.tessinner[n]", the "x" |
| component is filled with the per-patch inner tessellation level numbered |
| <n> of the input patch. <n> must be less than two. The "y", "z", and "w" |
| components are always undefined. A program will fail to load if this |
| attribute binding is used and the input primitive type is not PATCHES. |
| |
| If an attribute binding matches "primitive.patch.attrib[n]", the "x", "y", |
| "z", and "w" components are filled with the corresponding components of |
| the per-patch generic attribute numbered <n> of the input patch. A |
| program will fail to load if this attribute binding is used and the input |
| primitive type is not PATCHES. |
| |
| If an attribute binding matches "primitive.tessouter[n..o]", |
| "primitive.tessinner[n..o]", or "primitive.patch.attrib[n..o]", a sequence |
| of 1+<o>-<n> outer tessellation level, inner tessellation level, or |
| per-patch generic attribute bindings is created. For per-patch generic |
| attribute bindings, it is as though the sequence |
| "primitive.patch.attrib[n], primitive.patch.attrib[n+1], ... |
| primitive.patch.attrib[o]" were specified. These bindings are available |
| only in explicit declarations of array variables. A program will fail to |
| load if <n> is greater than <o> or the input primitive type is not |
| PATCHES. |
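The expansion of a range binding into 1+<o>-<n> individual bindings can be sketched in Python as shown below. The function name and the `input_primitive` parameter are hypothetical, and raising `ValueError` stands in for "the program fails to load".

```python
def expand_patch_attrib_range(n, o, input_primitive="PATCHES"):
    """Expand primitive.patch.attrib[n..o] into its individual bindings.

    Raises ValueError where the text says the program fails to load:
    when n is greater than o, or the input primitive type is not PATCHES.
    """
    if input_primitive != "PATCHES":
        raise ValueError("input primitive type is not PATCHES")
    if n > o:
        raise ValueError("range start is greater than range end")
    return ["primitive.patch.attrib[%d]" % i for i in range(n, o + 1)]
```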
| |
| If a geometry attribute binding matches "primitive.vertexcount", the "x" |
| component is filled with the number of vertices in the input primitive |
| being processed. The "y", "z", and "w" components of the variable are |
| always undefined. |
| |
| |
| Modify Section 2.X.3.5, Program Results |
| |
| (modify Table X.X) |
| |
| Binding Components Description |
| ----------------------------- ---------- ---------------------------- |
| result.color[n].primary (r,g,b,a) primary color n (SRC_COLOR) |
| result.color[n].secondary (r,g,b,a) secondary color n (SRC1_COLOR) |
| |
| Table X.X: Fragment Result Variable Bindings. Components labeled "*" |
| are unused. "[n]" is optional -- color <n> is used if specified; color |
| 0 is used otherwise. |
| |
| (add after third paragraph) |
| |
| If a result variable binding matches "result.color[n].primary" or |
| "result.color[n].secondary" and the ARB_blend_func_extended option is |
| specified, updates to the "x", "y", "z", and "w" components of these color |
| result variables modify the "r", "g", "b", and "a" components of the |
| SRC_COLOR and SRC1_COLOR color outputs, respectively, for the fragment |
| output color numbered <n>. If the ARB_blend_func_extended program option |
| is not specified, the "result.color[n].primary" and |
| "result.color[n].secondary" bindings are unavailable. |
| |
| |
| Modify Section 2.X.3.6, Program Parameter Buffers |
| |
| (modify the description of parameter buffer arrays to require that all |
| bindings in an array declaration must use the same single buffer *or* |
| buffer range) |
| |
| ... Program parameter buffer variables may be declared as arrays, but all |
| bindings assigned to the array must use the same binding point or binding |
| point range, and must increase consecutively. |
| |
| (add to the end of the section) |
| |
| In explicit variable declarations, the bindings in Table X.12.1 of the |
| form "program.buffer[a..b]" may also be used, and indicate the variable |
| spans multiple buffer binding points. Such variables must be accessed as |
| arrays, with the first index specifying an offset into the range of |
| buffer object binding points. A buffer index of zero identifies binding |
| point <a>; an index of <b>-<a> identifies binding point <b>. If such a |
| variable is declared as an array, a second index must be provided to |
| identify the individual array element. A program will fail to compile if |
| such bindings are used when <a> or <b> is negative or greater than or |
| equal to the number of buffer binding points supported for the program |
| type, or if <a> is greater than <b>. The bindings in Table X.12.1 may not |
| be used in implicit variable declarations. |
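The mapping from the first array index to a buffer binding point is a simple offset: index k selects binding point <a>+k, so index zero selects <a> and the last valid index selects <b>. The Python sketch below models that mapping together with the compile-time checks listed above. The function and parameter names are hypothetical, and rejecting an out-of-range index with `IndexError` is an assumption, since run-time behavior for such indices is not described in this section.

```python
def resolve_binding_point(a, b, index, num_binding_points):
    """Map the first index of a program.buffer[a..b] variable to a binding point.

    Raises ValueError for ranges the text says fail to compile, and
    IndexError for an index outside the range (assumption of this sketch).
    """
    for bound in (a, b):
        if bound < 0 or bound >= num_binding_points:
            raise ValueError("binding point out of range for program type")
    if a > b:
        raise ValueError("range start is greater than range end")
    if not 0 <= index <= b - a:
        raise IndexError("buffer index outside the declared range")
    return a + index
```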
| |
| Binding Components Underlying State |
| ----------------------------- ---------- ----------------------------- |
| program.buffer[a..b][c] (x,x,x,x) program parameter buffers a |
| through b, element c |
| program.buffer[a..b][c..d] (x,x,x,x) program parameter buffers a |
| through b, elements c |
| through d |
| program.buffer[a..b] (x,x,x,x) program parameter buffers a |
| through b, all elements |
| |
| Table X.12.1: Program Parameter Buffer Array Bindings. <a> and <b> |
| indicate buffer numbers, <c> and <d> indicate individual elements. |
| |
| When bindings beginning with "program.buffer[a..b]" are used in a variable |
| declaration, they behave identically to corresponding bindings beginning with |
| "program.buffer[a]", except that the variable is filled with a separate |
| set of values for each buffer binding point from <a> to <b> inclusive. |
| |
| (add new section after Section 2.X.3.7, Program Condition Code Registers |
| and renumber subsequent sections accordingly) |
| |
| Section 2.X.3.8, Program Texture Variables |
| |
| Program texture variables are used as constants during program execution |
| and refer to the texture objects bound to one or more texture image units. |
| All texture variables have associated bindings and are read-only during |
| program execution. Texture variables retain their values across program |
| invocations, and the set of texture image units to which they refer is |
| constant. The texture object a variable refers to may be changed by |
| binding a new texture object to the appropriate target of the |
| corresponding texture image unit. Texture variables may only be used to |
| identify a texture object in texture instructions, and may not be used as |
| operands in any other instruction. Texture variables may be declared |
| explicitly via the <TEXTURE_statement> grammar rule, or implicitly by |
| using a texture image unit binding in an instruction. |
| |
| Texture variables may be declared as arrays, but the list of |
| texture image units assigned to the array must increase consecutively. |
| |
| Texture variables identify only a texture image unit; the corresponding |
| texture target (e.g., 1D, 2D, CUBE) and texture object are identified by |
| the <texTarget> grammar rule in instructions using the texture variable. |
| |
| Binding Components Underlying State |
| --------------- ---------- ------------------------------------------ |
| texture[a] x texture object bound to image unit a |
| texture[a..b] x texture objects bound to image units a |
| through b |
| |
| Table X.12.2: Texture Image Unit Bindings. <a> and <b> indicate |
| texture image unit numbers. |
| |
| If a texture binding matches "texture[a]", the texture variable is filled |
| with a single integer referring to texture image unit <a>. |
| |
| If a texture binding matches "texture[a..b]", the texture variable is |
| filled with an array of integers referring to texture image units <a> |
| through <b>, inclusive. A program will fail to compile if <a> or <b> is |
| negative or greater than or equal to the number of texture image units |
| supported, or if <a> is greater than <b>. |
| |
| |
| Modify Section 2.X.4, Program Execution Environment |
| |
| (Update the instruction set table to include new columns to indicate the |
| first ISA supporting the instruction, and to indicate whether the |
| instruction supports 64-bit floating-point modifiers.) |
| |
| Instr- Modifiers |
| uction V F I C S H D Out Inputs Description |
| ------- -- - - - - - - --- -------- -------------------------------- |
| ABS 40 6 6 X X X F v v absolute value |
| ADD 40 6 6 X X X F v v,v add |
| AND 40 - 6 X - - S v v,v bitwise and |
| ATOM 50 - - X - - - s v,su atomic memory transaction |
| BFE 50 - X X - - S v v,v bitfield extract |
| BFI 50 - X X - - S v v,v,v bitfield insert |
| BFR 50 - X X - - S v v bitfield reverse |
| BRK 40 - - - - - - - c break out of loop instruction |
| BTC 50 - X X - - S v v bit count |
| BTFL 50 - X X - - S v v find least significant bit |
| BTFM 50 - X X - - S v v find most significant bit |
| CAL 40 - - - - - - - c subroutine call |
| CEIL 40 6 6 X X X F v vf ceiling |
| CMP 40 6 6 X X X F v v,v,v compare |
| CONT 40 - - - - - - - c continue with next loop iteration |
| COS 40 X - X X X F s s cosine with reduction to [-PI,PI] |
| CVT 50 - - X X - F v v general data type conversion |
| DDX 40 X - X X X F v v derivative relative to X (fp-only) |
| DDY 40 X - X X X F v v derivative relative to Y (fp-only) |
| DIV 40 6 6 X X X F v v,s divide vector components by scalar |
| DP2 40 X - X X X F s v,v 2-component dot product |
| DP2A 40 X - X X X F s v,v,v 2-comp. dot product w/scalar add |
| DP3 40 X - X X X F s v,v 3-component dot product |
| DP4 40 X - X X X F s v,v 4-component dot product |
| DPH 40 X - X X X F s v,v homogeneous dot product |
| DST 40 X - X X X F v v,v distance vector |
| ELSE 40 - - - - - - - - start if test else block |
| EMIT 40 - - - - - - - - emit vertex stream 0 (gp-only) |
| EMITS 50 - X - - - S - s emit vertex to stream (gp-only) |
| ENDIF 40 - - - - - - - - end if test block |
| ENDPRIM 40 - - - - - - - - end of primitive (gp-only) |
| ENDREP 40 - - - - - - - - end of repeat block |
| EX2 40 X - X X X F s s exponential base 2 |
| FLR 40 6 6 X X X F v vf floor |
| FRC 40 6 - X X X F v v fraction |
| I2F 40 - 6 X - - S vf v integer to float |
| IF 40 - - - - - - - c start of if test block |
| IPAC 50 X - X X - F v v interpolate at centroid (fp-only) |
| IPAO 50 X - X X - F v v,v interpolate w/offset (fp-only) |
| IPAS 50 X - X X - F v v,su interpolate at sample (fp-only) |
| KIL 40 X X - - X F - vc kill fragment |
| LDC 40 - - X X - F v v load from constant buffer |
| LG2 40 X - X X X F s s logarithm base 2 |
| LIT 40 X - X X X F v v compute lighting coefficients |
| LOAD 40 - - X X - F v su global load |
| LOD 41 X - X X - F v vf,t compute texture LOD |
| LRP 40 X - X X X F v v,v,v linear interpolation |
| MAD 40 6 6 X X X F v v,v,v multiply and add |
| MAX 40 6 6 X X X F v v,v maximum |
| MEMBAR 50 - - - - - - - - memory barrier |
| MIN 40 6 6 X X X F v v,v minimum |
| MOD 40 - 6 X - - S v v,s modulus vector components by scalar |
| MOV 40 6 6 X X X F v v move |
| MUL 40 6 6 X X X F v v,v multiply |
| NOT 40 - 6 X - - S v v bitwise not |
| NRM 40 X - X X X F v v normalize 3-component vector |
| OR 40 - 6 X - - S v v,v bitwise or |
| PK2H 40 X X - - - F s vf pack two 16-bit floats |
| PK2US 40 X X - - - F s vf pack two floats as unsigned 16-bit |
| PK4B 40 X X - - - F s vf pack four floats as signed 8-bit |
| PK4UB 40 X X - - - F s vf pack four floats as unsigned 8-bit |
| PK64 50 X X - - - F v v pack 4x32-bit vectors to 2x64 |
| POW 40 X - X X X F s s,s exponentiate |
| RCC 40 X - X X X F s s reciprocal (clamped) |
| RCP 40 6 - X X X F s s reciprocal |
| REP 40 6 6 - - X F - v start of repeat block |
| RET 40 - - - - - - - c subroutine return |
| RFL 40 X - X X X F v v,v reflection vector |
| ROUND 40 6 6 X X X F v vf round to nearest integer |
| RSQ 40 6 - X X X F s s reciprocal square root |
| SAD 40 - 6 X - - S vu v,v,vu sum of absolute differences |
| SCS 40 X - X X X F v s sine/cosine without reduction |
| SEQ 40 6 6 X X X F v v,v set on equal |
| SFL 40 6 6 X X X F v v,v set on false |
| SGE 40 6 6 X X X F v v,v set on greater than or equal |
| SGT 40 6 6 X X X F v v,v set on greater than |
| SHL 40 - 6 X - - S v v,s shift left |
| SHR 40 - 6 X - - S v v,s shift right |
| SIN 40 X - X X X F s s sine with reduction to [-PI,PI] |
| SLE 40 6 6 X X X F v v,v set on less than or equal |
| SLT 40 6 6 X X X F v v,v set on less than |
| SNE 40 6 6 X X X F v v,v set on not equal |
| SSG 40 6 - X X X F v v set sign |
| STORE 50 - - - - - - - v,su global store |
| STR 40 6 6 X X X F v v,v set on true |
| SUB 40 6 6 X X X F v v,v subtract |
| SWZ 40 X - X X X F v v extended swizzle |
| TEX 40 X X X X - F v vf,t texture sample |
| TGALL 50 X X X X - F v v test all non-zero in thread group |
| TGANY 50 X X X X - F v v test any non-zero in thread group |
| TGEQ 50 X X X X - F v v test all equal in thread group |
| TRUNC 40 6 6 X X X F v vf truncate (round toward zero) |
| TXB 40 X X X X - F v vf,t texture sample with bias |
| TXD 40 X X X X - F v vf,vf,vf,t texture sample w/partials |
| TXF 40 X X X X - F v vs,t texel fetch |
| TXFMS 40 X X X X - F v vs,t multisample texel fetch |
| TXG 41 X X X X - F v vf,t texture gather |
| TXGO 50 X X X X - F v vf,vs,vs,t texture gather w/per-texel offsets |
| TXL 40 X X X X - F v vf,t texture sample w/LOD |
| TXP 40 X X X X - F v vf,t texture sample w/projection |
| TXQ 40 - - - - - S vs vs,t texture info query |
| UP2H 40 X X X X - F vf s unpack two 16-bit floats |
| UP2US 40 X X X X - F vf s unpack two unsigned 16-bit integers |
| UP4B 40 X X X X - F vf s unpack four signed 8-bit integers |
| UP4UB 40 X X X X - F vf s unpack four unsigned 8-bit integers |
| UP64 50 X X X X - F v v unpack 2x64 vectors to 4x32 |
| X2D 40 X - X X X F v v,v,v 2D coordinate transformation |
| XOR 40 - 6 X - - S v v,v exclusive or |
| XPD 40 X - X X X F v v,v cross product |
| |
| Table X.13: Summary of NV_gpu_program5 instructions. |
| |
| The "V" column indicates the first assembly language in the |
| NV_gpu_program4 family (if any) supporting the opcode. "41" and "50" |
| indicate NV_gpu_program4_1 and NV_gpu_program5, respectively. |
| |
| The "Modifiers" columns specify the set of modifiers allowed for the |
| instruction: |
| |
| F = floating-point data type modifiers |
| I = signed and unsigned integer data type modifiers |
| C = condition code update modifiers |
| S = clamping (saturation) modifiers |
| H = half-precision float data type suffix |
| D = default data type modifier (F, U, or S) |
| |
| For the "F" and "I" columns, an "X" indicates support for both unsized |
| type modifiers and sized type modifiers with fewer than 64 bits. A "6" |
| indicates support for all modifiers, including 64-bit versions (when |
| supported). |
| |
| The input and output columns describe the formats of the operands and |
| results of the instruction. |
| |
| v: 4-component vector (data type is inherited from operation) |
| vf: 4-component vector (data type is always floating-point) |
| vs: 4-component vector (data type is always signed integer) |
| vu: 4-component vector (data type is always unsigned integer) |
| s: scalar (replicated if written to a vector destination; |
| data type is inherited from operation) |
| su: scalar (data type is always unsigned integer) |
| c: condition code test result (e.g., "EQ", "GT1.x") |
| vc: 4-component vector or condition code test |
| t: texture |
| |
| Instructions labeled "fp-only" and "gp-only" are supported only for |
| fragment and geometry programs, respectively. |
| |
| |
| Modify Section 2.X.4.1, Program Instruction Modifiers |
| |
| (Update the discussion of instruction precision modifiers. If |
| GL_NV_gpu_program_fp64 is not found in the extension string, the "F64" |
| instruction modifier described below is not supported.) |
| |
| (add to Table X.14 of the NV_gpu_program4 specification.) |
| |
| Modifier Description |
| -------- --------------------------------------------------- |
| F Floating-point operation |
| U Fixed-point operation, unsigned operands |
| S Fixed-point operation, signed operands |
| ... |
| F32 Floating-point operation, 32-bit precision or |
| access one 32-bit floating-point value |
| F64 Floating-point operation, 64-bit precision or |
| access one 64-bit floating-point value |
| S32 Fixed-point operation, signed 32-bit operands or |
| access one 32-bit signed integer value |
| S64 Fixed-point operation, signed 64-bit operands or |
| access one 64-bit signed integer value |
| U32 Fixed-point operation, unsigned 32-bit operands or |
| access one 32-bit unsigned integer value |
| U64 Fixed-point operation, unsigned 64-bit operands or |
| access one 64-bit unsigned integer value |
| ... |
| F32X2 Access two 32-bit floating-point values |
| F32X4 Access four 32-bit floating-point values |
| F64X2 Access two 64-bit floating-point values |
| F64X4 Access four 64-bit floating-point values |
| S8 Access one 8-bit signed integer value |
| S16 Access one 16-bit signed integer value |
| S32X2 Access two 32-bit signed integer values |
| S32X4 Access four 32-bit signed integer values |
| S64 Access one 64-bit signed integer value |
| S64X2 Access two 64-bit signed integer values |
| S64X4 Access four 64-bit signed integer values |
| U8 Access one 8-bit unsigned integer value |
| U16 Access one 16-bit unsigned integer value |
| U32 Access one 32-bit unsigned integer value |
| U32X2 Access two 32-bit unsigned integer values |
| U32X4 Access four 32-bit unsigned integer values |
| U64 Access one 64-bit unsigned integer value |
| U64X2 Access two 64-bit unsigned integer values |
| U64X4 Access four 64-bit unsigned integer values |
| |
| ADD Perform add operation for ATOM |
| MIN Perform minimum operation for ATOM |
| MAX Perform maximum operation for ATOM |
| IWRAP Perform wrapping increment for ATOM |
| DWRAP Perform wrapping decrement for ATOM |
| AND Perform logical AND operation for ATOM |
| OR Perform logical OR operation for ATOM |
| XOR Perform logical XOR operation for ATOM |
| EXCH Perform exchange operation for ATOM |
| CSWAP Perform compare-and-swap operation for ATOM |
| |
| COH Make LOAD and STORE operations use coherent caching |
| VOL Make LOAD and STORE operations treat memory as volatile |
| |
| PREC Instruction results should be precise |
| |
| ROUND Inexact conversion results round to nearest value (even) |
| CEIL Inexact conversion results round to larger value |
| FLR Inexact conversion results round to smaller value |
| TRUNC Inexact conversion results round to value closest to zero |
| |
| |
| "F", "U", and "S" modifiers are base data type modifiers and specify that |
| the instruction should operate on floating-point, unsigned integer, or |
| signed integer values, respectively. For example, "ADD.F", "ADD.U", and |
| "ADD.S" specify component-wise addition of floating-point, unsigned |
| integer, or signed integer vectors, respectively. While these modifiers |
| specify a data type, they do not specify an exact precision at which the |
| operation is performed. Floating-point and fixed-point operations will |
| typically be carried out at 32-bit precision, unless otherwise described |
| in the instruction documentation or overridden by the precision modifiers. |
| If all operands are represented with less than 32-bit precision (e.g., |
| variables with the "SHORT" component size modifier), operations may be |
| carried out at a precision no less than the precision of the largest |
| operand used by the instruction. For some instructions, the data type of |
| some operands or the result are fixed; in these cases, the data type |
| modifier specifies the data type of the remaining values. |
| |
| Operands represented with fewer bits than used to perform the instruction |
| will be promoted to a larger data type. Signed integer operands will be |
| sign-extended, where the most significant bits are filled with ones if the |
| operand is negative and zero otherwise. Unsigned integer operands will be |
| zero-extended, where the most significant bits are always filled with |
| zeroes. Operands represented with more bits than used to perform the |
| instruction will be converted to lower precision. Floating-point |
| overflows result in IEEE infinity encodings; integer overflows result in |
| the truncation of the most significant bits. |
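| |
| The promotion and conversion rules above can be sketched in C, assuming |
| fixed-width integer types stand in for program operands. The helper names |
| below are hypothetical and are not defined by this extension. |
| |
```c
#include <stdint.h>

/* Hypothetical helpers (not part of the extension) illustrating promotion. */

/* Signed promotion: the upper bits are copies of the sign bit. */
static int32_t promote_s16(int16_t v)
{
    return (int32_t)v;
}

/* Unsigned promotion: the upper bits are always zero. */
static uint32_t promote_u16(uint16_t v)
{
    return (uint32_t)v;
}

/* Integer conversion to lower precision drops the most significant bits. */
static uint16_t truncate_u32(uint32_t v)
{
    return (uint16_t)(v & 0xFFFFu);
}
```
| |
| In this sketch, promoting the 16-bit signed value -2 yields the 32-bit |
| pattern 0xFFFFFFFE, while promoting the 16-bit unsigned value 0xFFFE yields |
| 0x0000FFFE. |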
| |
| For arithmetic operations, the "F32", "F64", "U32", "U64", "S32", and |
| "S64" modifiers are precision-specific data type modifiers that specify |
| that floating-point, unsigned integer, or signed integer operations be |
| carried out with an internal precision of no less than 32 or 64 bits per |
| component, respectively. The "F64", "U64", and "S64" modifiers are |
| supported on only a subset of instructions, as documented in the |
| instruction table. The base data type of the instruction is trivially |
| derived from a precision-specific data type modifier, and an instruction |
| may not specify both base and precision-specific data type modifiers. |
| |
| ... |
| |
| "SAT" and "SSAT" are clamping modifiers that generally specify that the |
| floating-point components of the instruction result should be clamped to |
| [0,1] or [-1,1], respectively, before updating the condition code and the |
| destination variable. If no clamping suffix is specified, unclamped |
| results will be used for condition code updates (if any) and destination |
| variable writes. Clamping modifiers are not supported on instructions |
| that do not produce floating-point results, with one exception. |
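| |
| For reference, the two clamping ranges can be sketched in C as follows. |
| The function names are hypothetical. NaN inputs pass through unchanged in |
| this sketch, which is an assumption and not something specified here. |
| |
```c
/* Hypothetical helpers; the names are illustrative only. */

/* "SAT": clamp a floating-point result component to [0,1]. */
static float clamp_sat(float x)
{
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

/* "SSAT": clamp a floating-point result component to [-1,1]. */
static float clamp_ssat(float x)
{
    return x < -1.0f ? -1.0f : (x > 1.0f ? 1.0f : x);
}
```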
| |
| ... |
| |
| For load and store operations, the "F32", "F32X2", "F32X4", "F64", |
| "F64X2", "F64X4", "S8", "S16", "S32", "S32X2", "S32X4", "S64", "S64X2", |
| "S64X4", "U8", "U16", "U32", "U32X2", "U32X4", "U64", "U64X2", and "U64X4" |
| storage modifiers control how data are loaded from or stored to memory. |
| Storage modifiers are supported by the ATOM, LDC, LOAD, and STORE |
| instructions and are covered in more detail in the descriptions of these |
| instructions. These instructions must specify exactly one of these |
| modifiers, and may not specify any of the base data type modifiers (F,U,S) |
| described above. The base data types of the result vector of a load |
| instruction or the first operand of a store instruction are trivially |
| derived from the storage modifier. |
| |
| For atomic memory operations performed by the ATOM instruction, the "ADD", |
| "MIN", "MAX", "IWRAP", "DWRAP", "AND", "OR", "XOR", "EXCH", and "CSWAP" |
| modifiers specify the operation to perform on the memory being accessed, |
| and are described in more detail in the description of this instruction. |
| |
| For load and store operations, the "COH" modifier controls whether the |
| operation uses a coherent level of the cache hierarchy, as described in |
| Section 2.X.4.5. |
| |
| For load and store operations, the "VOL" modifier controls whether the |
| operation treats the memory being read or written as volatile. |
| Instructions modified with "VOL" will always read or write the underlying |
| memory, whether or not previous or subsequent loads and stores access the |
| same memory. |
| |
| For arithmetic and logical operations, the "PREC" modifier controls |
| whether the instruction result should be treated as precise. For |
| instructions not qualified with ".PREC", the implementation may rearrange |
| the computations specified by the program instructions to execute more |
| efficiently, even if it may generate slightly different results in some |
| cases. For example, an implementation may combine a MUL instruction with |
| a dependent ADD instruction and generate code to execute a MAD |
| (multiply-add) instruction instead. The difference in rounding may |
| produce unacceptable artifacts for some algorithms. When ".PREC" is |
| specified, the instruction will be executed in a manner that always |
| generates the same result regardless of the program instructions that |
| precede or follow the instruction. Note that a ".PREC" modifier does not |
| affect the processing of any other instruction. For example, tagging an |
| instruction with ".PREC" does not mean that the instructions used to |
| generate the instruction's operands will be treated as precise unless |
| those instructions are also qualified with ".PREC". |
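| |
| The MUL/ADD versus MAD difference described above can be reproduced in C. |
| This is only an illustrative sketch: the function names are hypothetical, |
| and the single-rounding MAD is emulated by evaluating in double precision, |
| which is exact for the inputs mentioned below but not for all inputs. |
| |
```c
/* Hypothetical illustration; these functions are not part of the extension. */

/* MUL followed by a dependent ADD: the product is rounded to 32 bits
 * before the addition.  The volatile store keeps the compiler from
 * contracting the two operations into a fused multiply-add. */
static float mul_then_add(float a, float b, float c)
{
    volatile float product = a * b;
    return product + c;
}

/* MAD with a single rounding, emulated in double precision.  The product
 * of two 32-bit floats is always exact in double; the sum is exact only
 * for suitably chosen inputs. */
static float fused_mad(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}
```
| |
| With a = b = 1 + 2^-12 and c = -(1 + 2^-11), mul_then_add() returns zero |
| because the rounded product equals -c, while fused_mad() returns 2^-24. |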
| |
| For the CVT (data type conversion) instruction, the "F16", "F32", "F64", |
| "S8", "S16", "S32", "S64", "U8", "U16", "U32", and "U64" storage modifiers |
| specify the data type of the vector operand and the converted result. Two |
| storage modifiers must be provided, which specify the data type of the |
| result and the operand, respectively. |
| |
| For the CVT (data type conversion) instruction, the "ROUND", "CEIL", |
| "FLR", and "TRUNC" modifiers specify how to round converted results that |
| are not directly representable using the data type of the result. |
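| |
| The four rounding behaviors can be sketched in C for a conversion from |
| 32-bit float to signed 32-bit integer. The enum and function names are |
| hypothetical, and the sketch assumes a finite operand whose magnitude is |
| below 2^24 so that every intermediate value is exactly representable. |
| |
```c
#include <stdint.h>

/* Hypothetical enum and helper; names are illustrative only. */
typedef enum { RND_ROUND, RND_CEIL, RND_FLR, RND_TRUNC } RoundMode;

/* Convert x to a signed 32-bit integer using the given rounding mode.
 * Assumes x is finite and |x| < 2^24. */
static int32_t cvt_s32_f32(float x, RoundMode mode)
{
    int32_t t  = (int32_t)x;                   /* C casts truncate toward zero */
    int32_t lo = (x < (float)t) ? t - 1 : t;   /* next smaller-or-equal integer */
    int32_t hi = (x > (float)t) ? t + 1 : t;   /* next larger-or-equal integer */
    float frac;

    switch (mode) {
    case RND_TRUNC:
        return t;
    case RND_FLR:
        return lo;
    case RND_CEIL:
        return hi;
    case RND_ROUND:
    default:
        frac = x - (float)lo;                  /* in [0, 1) */
        if (frac > 0.5f) return hi;
        if (frac < 0.5f) return lo;
        return (lo % 2 == 0) ? lo : hi;        /* tie: pick the even neighbor */
    }
}
```
| |
| For an operand of 2.5, the sketch yields 2 for ROUND, 3 for CEIL, 2 for |
| FLR, and 2 for TRUNC; for -2.5 it yields -2, -2, -3, and -2. |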
| |
| |
| Modify Section 2.X.4.4, Program Texture Access |
| |
| (Extend the language describing the operation of texel offsets to cover |
| the new capability to load texel offsets from a register. Otherwise, |
| this functionality is unchanged from previous extensions.) |
| |
| <offset> is a 3-component signed integer vector, which can be specified |
| using constants embedded in the texture instruction according to the |
| <texOffsetImmed> grammar rule, or taken from a vector operand according to |
| the <texOffsetVar> grammar rule. The three components of the offset |
| vector are added to the computed <u>, <v>, and <w> texel locations prior |
| to sampling. When using a constant offset, one, two, or three components |
| may be specified in the instruction; if fewer than three are specified, |
| the remaining offset components are zero. If no offsets are specified, |
| all three components of the offset are treated as zero. A limited range |
| of offset values is supported; the minimum and maximum <texOffset> values |
| are implementation-dependent and given by MIN_PROGRAM_TEXEL_OFFSET_EXT and |
| MAX_PROGRAM_TEXEL_OFFSET_EXT, respectively. A program will fail to load: |
| |
| * if the texture target specified in the instruction is 1D, ARRAY1D, |
| SHADOW1D, or SHADOWARRAY1D, and the second or third component of a |
| constant offset vector is non-zero; |
| |
| * if the texture target specified in the instruction is 2D, RECT, |
| ARRAY2D, SHADOW2D, SHADOWRECT, or SHADOWARRAY2D, and the third |
| component of a constant offset vector is non-zero; |
| |
| * if the texture target is CUBE, SHADOWCUBE, ARRAYCUBE, or |
| SHADOWARRAYCUBE, and any component of a constant offset vector is |
| non-zero -- texel offsets are not supported for cube map or buffer |
| textures; |
| |
| * if any component of the constant offset vector of a TXGO instruction |
| is non-zero -- non-constant offsets are provided in separate operands; |
| |
| * if any component of a constant offset vector is less than |
| MIN_PROGRAM_TEXEL_OFFSET_EXT or greater than |
| MAX_PROGRAM_TEXEL_OFFSET_EXT; |
| |
| * if a TXD or TXGO instruction specifies a non-constant texel offset |
| according to the <texOffsetVar> grammar rule; or |
| |
| * if any instruction specifies a non-constant texel offset according |
| to the <texOffsetVar> grammar rule and the texture target is CUBE, |
| SHADOWCUBE, ARRAYCUBE, or SHADOWARRAYCUBE. |
| |
| The implementation-dependent minimum and maximum texel offset values apply |
| to texel offsets taken from a vector operand, but out-of-bounds or |
| invalid component values will not prevent program loading since the |
| offsets may not be computed until the program is executed. Components of |
| the vector operand not needed for the texture target are ignored. The W |
| component of the offset vector is always ignored; the Z component of the |
| offset vector is ignored unless the target is 3D; the Y component is |
| ignored if the target is 1D, ARRAY1D, SHADOW1D, or SHADOWARRAY1D. If the |
| value of any non-ignored component of the vector operand is outside |
| implementation-dependent limits, the results of the texture lookup are |
| undefined. For all instructions except TXGO, the limits are |
| MIN_PROGRAM_TEXEL_OFFSET_EXT and MAX_PROGRAM_TEXEL_OFFSET_EXT. For the |
| TXGO instruction, the limits are MIN_PROGRAM_TEXTURE_GATHER_OFFSET_NV and |
| MAX_PROGRAM_TEXTURE_GATHER_OFFSET_NV. |
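| |
| The run-time rules for offsets taken from a vector operand can be sketched |
| in C. Everything named below is hypothetical: TargetDim groups texture |
| targets by how many offset components they use, and the limits are passed |
| in as parameters because they are implementation-dependent. |
| |
```c
/* Hypothetical grouping of texture targets by the offset components used:
 * TARGET_1D covers 1D, ARRAY1D, SHADOW1D, and SHADOWARRAY1D;
 * TARGET_3D covers 3D; TARGET_2D covers the remaining targets that
 * accept texel offsets. */
typedef enum { TARGET_1D, TARGET_2D, TARGET_3D } TargetDim;

static int in_range(int v, int lo, int hi)
{
    return v >= lo && v <= hi;
}

/* Returns 1 if every non-ignored component of the offset lies within
 * [min_off, max_off].  A return of 0 corresponds to the case where the
 * results of the texture lookup are undefined. */
static int var_offset_in_limits(int x, int y, int z, int w, TargetDim dim,
                                int min_off, int max_off)
{
    (void)w;                                   /* W is always ignored */
    if (!in_range(x, min_off, max_off))
        return 0;
    if (dim != TARGET_1D && !in_range(y, min_off, max_off))
        return 0;                              /* Y ignored for 1D-style targets */
    if (dim == TARGET_3D && !in_range(z, min_off, max_off))
        return 0;                              /* Z used only for 3D */
    return 1;
}
```
| |
| Taking -8 and 7 purely as example limits, an offset of (1, 100, 100, 100) |
| passes the check for a 1D target because only the X component is examined, |
| but fails it for a 2D target. |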
| |
| |
| (Modify language describing how the check for using multiple targets on a |
| single texture image unit works, to account for texture array variables |
| where a single instruction may access one of multiple textures and the |
| texture used is not known when the program is loaded.) |
| |
| A program will fail to load if it attempts to sample from multiple texture |
| targets (including the SHADOW pseudo-targets) on the same texture image |
| unit. For example, a program containing any two of the following |
| instructions will fail to load: |
| |
| TEX out, coord, texture[0], 1D; |
| TEX out, coord, texture[0], 2D; |
| TEX out, coord, texture[0], ARRAY2D; |
| TEX out, coord, texture[0], SHADOW2D; |
| TEX out, coord, texture[0], 3D; |
| |
| For the purposes of this test, sampling using a texture variable declared |
| as an array is treated as though all texture image units bound to the |
| variable were accessed. A program containing the following |
| instructions would fail to load: |
| |
| TEXTURE textures[] = { texture[0..3] }; |
| TEX out, coord, textures[2], 2D; # acts as if all textures are used |
| TEX out, coord, texture[1], 3D; |
| |
| (Add language describing texture gather component selection) |
| |
| The TXG and TXGO instructions provide the ability to assemble a |
| four-component vector by taking the value of a single component of a |
| multi-component texture from each of four texels. The component selected |
| is identified by the <texImageUnitComp> grammar rule. Component selection |
| is not supported for any other instruction, and a program will fail to |
| load if <texImageUnitComp> is matched for any texture instruction other |
| than TXG or TXGO. |
| |
| |
| Add New Section 2.X.4.5, Program Memory Access |
| |
| Programs may load from or store to buffer object memory via the ATOM |
| (atomic global memory operation), LDC (load constant), LOAD (global load), |
| and STORE (global store) instructions. |
| |
| Load instructions read 8, 16, 32, 64, 128, or 256 bits of data from a |
| source address to produce a four-component vector, according to the |
| storage modifier specified with the instruction. The storage modifier has |
| three parts: |
| |
| - a base data type, "F", "S", or "U", specifying that the instruction |
| fetches floating-point, signed integer, or unsigned integer values, |
| respectively; |
| |
| - a component size, specifying that the components fetched by the |
| instruction have 8, 16, 32, or 64 bits; and |
| |
| - an optional component count, where "X2" and "X4" indicate that two or |
| four components be fetched, and no count indicates a single component |
| fetch. |
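| |
| The three-part structure of a storage modifier can be made concrete with |
| a small C parser. This is a sketch only: the StorageModifier type and the |
| function names are hypothetical, and the accepted set is taken from the |
| storage modifiers listed for load and store operations. |
| |
```c
#include <string.h>

/* Hypothetical decoded form of a storage modifier. */
typedef struct {
    char base;    /* 'F', 'S', or 'U' */
    int  bits;    /* component size: 8, 16, 32, or 64 */
    int  count;   /* component count: 1, 2, or 4 */
} StorageModifier;

/* Parse a modifier such as "U8", "F32", or "S64X4".
 * Returns 1 on success, 0 if the string is not a listed storage modifier. */
static int parse_storage_modifier(const char *s, StorageModifier *out)
{
    const char *p;
    int bits = 0, count = 1;

    if (s[0] != 'F' && s[0] != 'S' && s[0] != 'U')
        return 0;
    p = s + 1;
    while (*p >= '0' && *p <= '9' && bits < 1000)
        bits = bits * 10 + (*p++ - '0');
    if (bits != 8 && bits != 16 && bits != 32 && bits != 64)
        return 0;
    if (strcmp(p, "X2") == 0)
        count = 2;
    else if (strcmp(p, "X4") == 0)
        count = 4;
    else if (*p != '\0')
        return 0;
    /* No 8- or 16-bit floating-point or multi-component modifiers are listed. */
    if (bits < 32 && (s[0] == 'F' || count != 1))
        return 0;
    out->base = s[0];
    out->bits = bits;
    out->count = count;
    return 1;
}

/* Total number of bits accessed, or 0 for an unrecognized modifier. */
static int storage_modifier_total_bits(const char *s)
{
    StorageModifier m;
    return parse_storage_modifier(s, &m) ? m.bits * m.count : 0;
}
```
| |
| Under these assumptions "U8" accesses 8 bits, "F32X4" accesses 128 bits, |
| and "S64X4" accesses 256 bits, matching the range of sizes given above. |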
| |
| When the storage modifier specifies that fewer than four components should |
| be fetched, remaining components are filled with zeroes. When performing |
| an atomic memory operation (ATOM) or a global load (LOAD), the GPU address |
| is specified as an instruction operand. When performing a constant buffer |
| load (LDC), the GPU address is derived by adding the base address of the |
| bound buffer object to an offset specified as an instruction operand. |
| Given a GPU address <address> and a storage modifier <modifier>, the |
| memory load can be described by the following code: |
| |
| result_t_vec BufferMemoryLoad(char *address, OpModifier modifier) |
| { |
| result_t_vec result = { 0, 0, 0, 0 }; |
| switch (modifier) { |
| case F32: |
| result.x = ((float32_t *)address)[0]; |
| break; |
| case F32X2: |
| result.x = ((float32_t *)address)[0]; |
| result.y = ((float32_t *)address)[1]; |
| break; |
| case F32X4: |
| result.x = ((float32_t *)address)[0]; |
| result.y = ((float32_t *)address)[1]; |
| result.z = ((float32_t *)address)[2]; |
| result.w = ((float32_t *)address)[3]; |
| break; |
| case F64: |
| result.x = ((float64_t *)address)[0]; |
| break; |
| case F64X2: |
| result.x = ((float64_t *)address)[0]; |
| result.y = ((float64_t *)address)[1]; |
| break; |
| case F64X4: |
| result.x = ((float64_t *)address)[0]; |
| result.y = ((float64_t *)address)[1]; |
| result.z = ((float64_t *)address)[2]; |
| result.w = ((float64_t *)address)[3]; |
| break; |
| case S8: |
| result.x = ((int8_t *)address)[0]; |
| break; |
| case S16: |
| result.x = ((int16_t *)address)[0]; |
| break; |
| case S32: |
| result.x = ((int32_t *)address)[0]; |
| break; |
| case S32X2: |
| result.x = ((int32_t *)address)[0]; |
| result.y = ((int32_t *)address)[1]; |
| break; |
| case S32X4: |
| result.x = ((int32_t *)address)[0]; |
| result.y = ((int32_t *)address)[1]; |
| result.z = ((int32_t *)address)[2]; |
| result.w = ((int32_t *)address)[3]; |
| break; |
| case S64: |
| result.x = ((int64_t *)address)[0]; |
| break; |
| case S64X2: |
| result.x = ((int64_t *)address)[0]; |
| result.y = ((int64_t *)address)[1]; |
| break; |
| case S64X4: |
| result.x = ((int64_t *)address)[0]; |
| result.y = ((int64_t *)address)[1]; |
| result.z = ((int64_t *)address)[2]; |
| result.w = ((int64_t *)address)[3]; |
| break; |
| case U8: |
| result.x = ((uint8_t *)address)[0]; |
| break; |
| case U16: |
| result.x = ((uint16_t *)address)[0]; |
| break; |
| case U32: |
| result.x = ((uint32_t *)address)[0]; |
| break; |
| case U32X2: |
| result.x = ((uint32_t *)address)[0]; |
| result.y = ((uint32_t *)address)[1]; |
| break; |
| case U32X4: |
| result.x = ((uint32_t *)address)[0]; |
| result.y = ((uint32_t *)address)[1]; |
| result.z = ((uint32_t *)address)[2]; |
| result.w = ((uint32_t *)address)[3]; |
| break; |
| case U64: |
| result.x = ((uint64_t *)address)[0]; |
| break; |
| case U64X2: |
| result.x = ((uint64_t *)address)[0]; |
| result.y = ((uint64_t *)address)[1]; |
| break; |
| case U64X4: |
| result.x = ((uint64_t *)address)[0]; |
| result.y = ((uint64_t *)address)[1]; |
| result.z = ((uint64_t *)address)[2]; |
| result.w = ((uint64_t *)address)[3]; |
| break; |
| } |
| return result; |
| } |
| |
| Store instructions write the contents of a four-component vector operand |
| into 8, 16, 32, 64, 128, or 256 bits, according to the storage modifier |
| specified with the instruction. The storage modifiers supported by stores |
| are identical to those supported for loads. Given a GPU address |
| <address>, a vector operand <operand> containing the data to be stored, |
| and a storage modifier <modifier>, the memory store can be described by |
| the following code: |
| |
| void BufferMemoryStore(char *address, operand_t_vec operand, |
| OpModifier modifier) |
| { |
| switch (modifier) { |
| case F32: |
| ((float32_t *)address)[0] = operand.x; |
| break; |
| case F32X2: |
| ((float32_t *)address)[0] = operand.x; |
| ((float32_t *)address)[1] = operand.y; |
| break; |
| case F32X4: |
| ((float32_t *)address)[0] = operand.x; |
| ((float32_t *)address)[1] = operand.y; |
| ((float32_t *)address)[2] = operand.z; |
| ((float32_t *)address)[3] = operand.w; |
| break; |
| case F64: |
| ((float64_t *)address)[0] = operand.x; |
| break; |
| case F64X2: |
| ((float64_t *)address)[0] = operand.x; |
| ((float64_t *)address)[1] = operand.y; |
| break; |
| case F64X4: |
| ((float64_t *)address)[0] = operand.x; |
| ((float64_t *)address)[1] = operand.y; |
| ((float64_t *)address)[2] = operand.z; |
| ((float64_t *)address)[3] = operand.w; |
| break; |
| case S8: |
| ((int8_t *)address)[0] = operand.x; |
| break; |
| case S16: |
| ((int16_t *)address)[0] = operand.x; |
| break; |
| case S32: |
| ((int32_t *)address)[0] = operand.x; |
| break; |
| case S32X2: |
| ((int32_t *)address)[0] = operand.x; |
| ((int32_t *)address)[1] = operand.y; |
| break; |
| case S32X4: |
| ((int32_t *)address)[0] = operand.x; |
| ((int32_t *)address)[1] = operand.y; |
| ((int32_t *)address)[2] = operand.z; |
| ((int32_t *)address)[3] = operand.w; |
| break; |
| case S64: |
| ((int64_t *)address)[0] = operand.x; |
| break; |
| case S64X2: |
| ((int64_t *)address)[0] = operand.x; |
| ((int64_t *)address)[1] = operand.y; |
| break; |
| case S64X4: |
| ((int64_t *)address)[0] = operand.x; |
| ((int64_t *)address)[1] = operand.y; |
| ((int64_t *)address)[2] = operand.z; |
| ((int64_t *)address)[3] = operand.w; |
| break; |
| case U8: |
| ((uint8_t *)address)[0] = operand.x; |
| break; |
| case U16: |
| ((uint16_t *)address)[0] = operand.x; |
| break; |
| case U32: |
| ((uint32_t *)address)[0] = operand.x; |
| break; |
| case U32X2: |
| ((uint32_t *)address)[0] = operand.x; |
| ((uint32_t *)address)[1] = operand.y; |
| break; |
| case U32X4: |
| ((uint32_t *)address)[0] = operand.x; |
| ((uint32_t *)address)[1] = operand.y; |
| ((uint32_t *)address)[2] = operand.z; |
| ((uint32_t *)address)[3] = operand.w; |
| break; |
| case U64: |
| ((uint64_t *)address)[0] = operand.x; |
| break; |
| case U64X2: |
| ((uint64_t *)address)[0] = operand.x; |
| ((uint64_t *)address)[1] = operand.y; |
| break; |
| case U64X4: |
| ((uint64_t *)address)[0] = operand.x; |
| ((uint64_t *)address)[1] = operand.y; |
| ((uint64_t *)address)[2] = operand.z; |
| ((uint64_t *)address)[3] = operand.w; |
| break; |
| } |
| } |
| |
| If a global load or store accesses a memory address that does not |
| correspond to a buffer object made resident by MakeBufferResidentNV, the |
| results of the operation are undefined and may produce a fault resulting |
| in application termination. If a load accesses a buffer object made |
| resident with an <access> parameter of WRITE_ONLY, or if a store accesses |
| a buffer object made resident with an <access> parameter of READ_ONLY, the |
| results of the operation are also undefined and may lead to application |
| termination. |
| |
| The address used for global memory loads or stores, or the offset used for |
| constant buffer loads must be aligned to the fetch size corresponding to |
| the storage opcode modifier. For S8 and U8, the offset has no alignment |
| requirements. For S16 and U16, the offset must be a multiple of two basic |
| machine units. For F32, S32, and U32, the offset must be a multiple of |
| four. For F32X2, F64, S32X2, S64, U32X2, and U64, the offset must be a |
| multiple of eight. For F32X4, F64X2, S32X4, S64X2, U32X4, and U64X2, the |
| offset must be a multiple of sixteen. For F64X4, S64X4, and U64X4, the |
| offset must be a multiple of thirty-two. If an offset is not correctly |
| aligned, the values returned by a buffer memory load will be undefined, |
| and the effects of a buffer memory store will also be undefined. |
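| |
| The alignment rule above reduces to "component size in basic machine |
| units times component count". A minimal C sketch follows; the function |
| names are hypothetical, and a basic machine unit is assumed to be one |
| 8-bit byte. |
| |
```c
#include <stdint.h>

/* Required alignment, in basic machine units, for a storage modifier with
 * the given component size (8, 16, 32, or 64 bits) and count (1, 2, or 4). */
static unsigned required_alignment(unsigned component_bits, unsigned count)
{
    return (component_bits / 8u) * count;
}

/* Returns 1 if the address or offset satisfies the alignment requirement. */
static int is_aligned(uint64_t address_or_offset,
                      unsigned component_bits, unsigned count)
{
    return address_or_offset % required_alignment(component_bits, count) == 0;
}
```
| |
| For example, an F32X4 access at offset 24 is misaligned in this sketch, |
| since 24 is not a multiple of sixteen. |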
| |
| Global and image memory accesses in assembly programs are weakly ordered |
| and may require synchronization relative to other operations in the OpenGL |
| pipeline. The ordering and synchronization mechanisms described in |
| Section 2.14.X (of the EXT_shader_image_load_store extension |
| specification) for shaders using the OpenGL Shading Language apply equally |
| to loads, stores, and atomics performed in assembly programs. |
| |
| |
| Modify Section 2.X.6.Y of the NV_fragment_program4 specification |
| |
| (add new option section) |
| |
| + Early Per-Fragment Tests (NV_early_fragment_tests) |
| |
| If a fragment program specifies the "NV_early_fragment_tests" option, the |
| depth and stencil tests will be performed prior to fragment program |
| invocation, as described in Section 3.X. |
| |
| |
| Modify Section 2.X.7.Y of the NV_geometry_program4 specification |
| |
| (Simply add the new input primitive type "PATCHES" to the list of tokens |
| allowed by the "PRIMITIVE_IN" declaration.) |
| |
| - Input Primitive Type (PRIMITIVE_IN) |
| |
| The PRIMITIVE_IN statement declares the type of primitives seen by a |
| geometry program. The single argument must be one of "POINTS", "LINES", |
| "LINES_ADJACENCY", "TRIANGLES", "TRIANGLES_ADJACENCY", or "PATCHES". |
| |
| |
| (Add a new optional program declaration to declare a geometry shader that |
| is run <N> times per primitive.) |
| |
| Geometry programs support three types of mandatory declaration statements, |
| as described below. Each of the three must be included exactly once in |
| the geometry program. |
| |
| ... |
| |
| Geometry programs also support one optional declaration statement. |
| |
| - Program Invocation Count (INVOCATIONS) |
| |
| The INVOCATIONS statement declares the number of times the geometry |
| program is run on each primitive processed. The single argument must be a |
| positive integer less than or equal to the value of the |
| implementation-dependent limit MAX_GEOMETRY_PROGRAM_INVOCATIONS_NV. Each |
| invocation of the geometry program will have the same inputs and outputs |
| except for the built-in input variable "primitive.invocation". This |
| variable will be an integer between 0 and <n>-1, where <n> is the declared |
| number of invocations. If omitted, the program invocation count is one. |
| |
| |
| Section 2.X.8.Z, ATOM: Atomic Global Memory Operation |
| |
| The ATOM instruction performs an atomic global memory operation by reading |
| from memory at the address specified by the second unsigned integer scalar |
| operand, computing a new value based on the value read from memory and the |
| first (vector) operand, and then writing the result back to the same |
| memory address. The memory transaction is atomic, guaranteeing that no |
| other write to the memory accessed will occur between the time it is read |
| and written by the ATOM instruction. The result of the ATOM instruction |
| is the scalar value read from memory. |
| |
| The ATOM instruction has two required instruction modifiers. The atomic |
| modifier specifies the type of operation to be performed. The storage |
| modifier specifies the size and data type of the operand read from memory |
| and the base data type of the operation used to compute the value to be |
| written to memory. |
| |
| atomic storage |
| modifier modifiers operation |
| -------- ------------------ -------------------------------------- |
| ADD U32, S32, U64 compute a sum |
| MIN U32, S32 compute minimum |
| MAX U32, S32 compute maximum |
| IWRAP U32 increment memory, wrapping at operand |
| DWRAP U32 decrement memory, wrapping at operand |
| AND U32, S32 compute bit-wise AND |
| OR U32, S32 compute bit-wise OR |
| XOR U32, S32 compute bit-wise XOR |
| EXCH U32, S32, U64 exchange memory with operand |
| CSWAP U32, S32, U64 compare-and-swap |
| |
| Table X.Y, Supported atomic and storage modifiers for the ATOM |
| instruction. |
| |
| Not all storage modifiers are supported by ATOM, and the set of modifiers |
| allowed for any given instruction depends on the atomic modifier |
| specified. Table X.Y enumerates the set of atomic modifiers supported by |
| the ATOM instruction, and the storage modifiers allowed for each. |
| |
| tmp0 = VectorLoad(op0); |
| address = ScalarLoad(op1); |
| result = BufferMemoryLoad(address, storageModifier); |
| switch (atomicModifier) { |
| case ADD: |
| writeval = tmp0.x + result; |
| break; |
| case MIN: |
| writeval = min(tmp0.x, result); |
| break; |
| case MAX: |
| writeval = max(tmp0.x, result); |
| break; |
| case IWRAP: |
| writeval = (result >= tmp0.x) ? 0 : result+1; |
| break; |
| case DWRAP: |
| writeval = (result == 0 || result > tmp0.x) ? tmp0.x : result-1; |
| break; |
| case AND: |
| writeval = tmp0.x & result; |
| break; |
| case OR: |
| writeval = tmp0.x | result; |
| break; |
| case XOR: |
| writeval = tmp0.x ^ result; |
| break; |
| case EXCH: |
| writeval = tmp0.x; |
| break; |
| case CSWAP: |
| if (result == tmp0.x) { |
| writeval = tmp0.y; |
| } else { |
| return result; // no memory store |
| } |
| break; |
| } |
| BufferMemoryStore(address, writeval, storageModifier); |
| |
| ATOM performs a scalar atomic operation. The <y>, <z>, and <w> components |
| of the result vector are undefined. |
| |
| ATOM supports no base data type modifiers, but requires exactly one |
| storage modifier. The base data types of the result vector, and the first |
| (vector) operand are derived from the storage modifier. The second |
| operand is always interpreted as a scalar unsigned integer. |
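| |
| The wrapping behavior of the IWRAP and DWRAP atomic modifiers can be seen |
| in isolation with a short C sketch. The function names are hypothetical, |
| and the code models only the computation of the new memory value for U32; |
| it does not model atomicity. |
| |
```c
#include <stdint.h>

/* Hypothetical helpers mirroring the IWRAP and DWRAP cases of the ATOM
 * pseudocode.  "mem" is the value read from memory and "limit" is the x
 * component of the first operand. */

/* IWRAP: increment, wrapping to zero once the limit is reached. */
static uint32_t iwrap_next(uint32_t mem, uint32_t limit)
{
    return (mem >= limit) ? 0u : mem + 1u;
}

/* DWRAP: decrement, wrapping to the limit at zero or when out of range. */
static uint32_t dwrap_next(uint32_t mem, uint32_t limit)
{
    return (mem == 0u || mem > limit) ? limit : mem - 1u;
}
```
| |
| With a limit of 3, repeated IWRAP operations starting from 0 store 1, 2, |
| 3, 0, and so on, while repeated DWRAP operations starting from 0 store 3, |
| 2, 1, 0, and so on. |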
| |
| |
| Section 2.X.8.Z, BFE: Bitfield Extract |
| |
| The BFE instruction performs a component-wise bit extraction of the second |
| vector operand to yield a result vector. For each component, the number of |
| bits extracted is given by the x component of the first vector operand, and |
| the bit number of the least significant bit extracted is given by the y |
| component of the first vector operand. |
| |
| tmp0 = VectorLoad(op0); |
| tmp1 = VectorLoad(op1); |
| result.x = BitfieldExtract(tmp0.x, tmp0.y, tmp1.x); |
| result.y = BitfieldExtract(tmp0.x, tmp0.y, tmp1.y); |
| result.z = BitfieldExtract(tmp0.x, tmp0.y, tmp1.z); |
| result.w = BitfieldExtract(tmp0.x, tmp0.y, tmp1.w); |
| |
| If the number of bits to extract is zero, zero is returned. The results |
| of bitfield extraction are undefined |
| |
| * if the number of bits to extract or the starting offset is negative, |
| * if the sum of the number of bits to extract and the starting offset |
| is greater than the total number of bits in the operand/result, or |
| * if the starting offset is greater than or equal to the total number of |
| bits in the operand/result. |
| |
| Type BitfieldExtract(Type bits, Type offset, Type value) |
| { |
| if (bits < 0 || offset < 0 || offset >= TotalBits(Type) || |
| bits + offset > TotalBits(Type)) { |
| /* result undefined */ |
| } else if (bits == 0) { |
| return 0; |
| } else { |
| return (value << (TotalBits(Type) - (bits+offset))) >> |
| (TotalBits(Type) - bits); |
| } |
| } |
| |
| BFE supports only signed and unsigned integer data type modifiers. For |
| signed integer data types, the extracted value is sign-extended (i.e., |
| filled with ones if the most significant bit extracted is one and filled |
| with zeroes otherwise). For unsigned integer data types, the extracted |
| value is zero-extended. |
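| |
| A concrete 32-bit rendering of the extraction in C may help. This is a |
| sketch under stated assumptions: the function names are hypothetical, |
| inputs for which the result is undefined return zero here (an arbitrary |
| choice), and the final conversion to int32_t assumes a two's complement |
| implementation. |
| |
```c
#include <stdint.h>

/* Unsigned extract: "bits" bits of "value" starting at bit "offset",
 * zero-extended.  Returns 0 for inputs with undefined results. */
static uint32_t bfe_u32(uint32_t bits, uint32_t offset, uint32_t value)
{
    if (offset >= 32u || bits > 32u - offset)
        return 0u;                       /* undefined; arbitrary choice */
    if (bits == 0u)
        return 0u;
    if (bits == 32u)
        return value;                    /* offset is necessarily zero */
    return (value >> offset) & ((1u << bits) - 1u);
}

/* Signed extract: as above, but the field is sign-extended. */
static int32_t bfe_s32(int32_t bits, int32_t offset, int32_t value)
{
    uint32_t field, sign;

    if (bits < 0 || offset < 0 || offset >= 32 || bits > 32 - offset)
        return 0;                        /* undefined; arbitrary choice */
    if (bits == 0)
        return 0;
    field = bfe_u32((uint32_t)bits, (uint32_t)offset, (uint32_t)value);
    if (bits < 32) {
        sign = 1u << (bits - 1);         /* most significant extracted bit */
        field = (field ^ sign) - sign;   /* replicate it into the upper bits */
    }
    return (int32_t)field;
}
```
| |
| Extracting four bits at offset 8 from 0x00000F00 gives 15 with the |
| unsigned helper and -1 with the signed helper. |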
| |
| |
| Section 2.X.8.Z, BFI: Bitfield Insert |
| |
| The BFI instruction performs a component-wise bitfield insertion of the |
| second vector operand into the third vector operand to yield a result |
| vector. For each component, the <n> least significant bits are extracted |
| from the corresponding component of the second vector operand, where <n> |
| is given by the x component of the first vector operand. Those bits are |
| merged into the corresponding component of the third vector operand, |
| replacing bits <b> through <b>+<n>-1, to produce the result. The bit |
| offset <b> is specified by the y component of the first operand. |
| |
| tmp0 = VectorLoad(op0); |
| tmp1 = VectorLoad(op1); |
| tmp2 = VectorLoad(op2); |
| result.x = BitfieldInsert(tmp0.x, tmp0.y, tmp1.x, tmp2.x); |
| result.y = BitfieldInsert(tmp0.x, tmp0.y, tmp1.y, tmp2.y); |
| result.z = BitfieldInsert(tmp0.x, tmp0.y, tmp1.z, tmp2.z); |
| result.w = BitfieldInsert(tmp0.x, tmp0.y, tmp1.w, tmp2.w); |
| |
| The results of bitfield insertion are undefined |
| |
| * if the number of bits to insert or the starting offset is negative, |
| * if the sum of the number of bits to insert and the starting offset |
| is greater than the total number of bits in the operand/result, or |
| * if the starting offset is greater than or equal to the total number of |
| bits in the operand/result. |
| |
| Type BitfieldInsert(Type bits, Type offset, Type src, Type dst) |
| { |
| if (bits < 0 || offset < 0 || offset >= TotalBits(Type) || |
| bits + offset > TotalBits(Type)) { |
| /* result undefined */ |
| } else if (bits == TotalBits(Type)) { |
| return src; |
| } else { |
| Type mask = ((1 << bits) - 1) << offset; |
| return ((src << offset) & mask) | (dst & (~mask)); |
| } |
| } |
| |
| BFI supports only signed and unsigned integer data type modifiers. If no |
| type modifier is specified, the operand and result vectors are treated as |
| signed integers. |
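| |
| The following Python sketch is one way to model BitfieldInsert for a |
| single component. It is illustrative only: the function name, the |
| "total_bits" parameter, the None return for undefined results, and the |
| choice to return the result as an unsigned bit pattern are assumptions. |
| |
```python
def bitfield_insert(bits, offset, src, dst, total_bits=32):
    """Model one BFI component; None stands for an undefined result."""
    if (bits < 0 or offset < 0 or offset >= total_bits
            or bits + offset > total_bits):
        return None  # result undefined
    word = (1 << total_bits) - 1
    # Python integers are unbounded, so bits == total_bits needs no
    # special case here; the mask simply covers the whole word.
    mask = (((1 << bits) - 1) << offset) & word
    return ((src << offset) & mask) | (dst & word & ~mask)
```
| |
| For instance, inserting the low 8 bits of 0xAB at offset 4 into |
| 0xFFFFFFFF replaces bits 4 through 11 and leaves the other bits intact. |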
| |
| |
| Section 2.X.8.Z, BFR: Bitfield Reverse |
| |
| The BFR instruction performs a component-wise bit reversal of the single |
| vector operand to produce a result vector. Bit reversal is performed by |
| exchanging the most and least significant bits, the second-most and |
| second-least significant bits, and so on. |
| |
| tmp0 = VectorLoad(op0); |
| result.x = BitReverse(tmp0.x); |
| result.y = BitReverse(tmp0.y); |
| result.z = BitReverse(tmp0.z); |
| result.w = BitReverse(tmp0.w); |
| |
| BFR supports only signed and unsigned integer data type modifiers. If no |
| type modifier is specified, the operand and result vectors are treated as |
| signed integers. |
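| |
| A minimal Python sketch of the bit reversal described above, assuming |
| 32-bit components by default. The function name and the "total_bits" |
| parameter are introduced for illustration only. |
| |
```python
def bit_reverse(value, total_bits=32):
    """Swap bit i with bit (total_bits - 1 - i) for every i."""
    value &= (1 << total_bits) - 1
    result = 0
    for _ in range(total_bits):
        # Shift the lowest remaining source bit into the result.
        result = (result << 1) | (value & 1)
        value >>= 1
    return result
```
| |
| Applying the sketch twice returns the original bit pattern. |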
| |
| |
| Section 2.X.8.Z, BTC: Bit Count |
| |
| The BTC instruction performs a component-wise bit count of the single |
| source vector to yield a result vector. Each component of the result |
| vector contains the number of one bits in the corresponding component of |
| the source vector. |
| |
| tmp0 = VectorLoad(op0); |
| result.x = BitCount(tmp0.x); |
| result.y = BitCount(tmp0.y); |
| result.z = BitCount(tmp0.z); |
| result.w = BitCount(tmp0.w); |
| |
| BTC supports only signed and unsigned integer data type modifiers. If no |
| type modifier is specified, the operand and the result are treated as |
| signed integers. |
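| |
| As a sketch only, the per-component bit count can be modeled in Python |
| as follows. The function name and the "total_bits" parameter are |
| assumptions; negative inputs are first reduced to their two's |
| complement bit pattern of the assumed width. |
| |
```python
def bit_count(value, total_bits=32):
    """Count the one bits in the total_bits-wide pattern of value."""
    pattern = value & ((1 << total_bits) - 1)
    return bin(pattern).count("1")
```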
| |
| |
| Section 2.X.8.Z, BTFL: Find Least Significant Bit |
| |
| The BTFL instruction searches for the least significant one bit of each |
| component of the single source vector, yielding a result vector comprising |
| the bit number of the located bit for each component. |
| |
| tmp0 = VectorLoad(op0); |
| result.x = FindLSB(tmp0.x); |
| result.y = FindLSB(tmp0.y); |
| result.z = FindLSB(tmp0.z); |
| result.w = FindLSB(tmp0.w); |
| |
| BTFL supports only signed and unsigned integer data type modifiers. For |
| unsigned integer data types, the search will yield the bit number of the |
| least significant one bit in each component, or the maximum integer (all |
| bits are ones) if the source vector component is zero. For signed data |
| types, the search will yield the bit number of the least significant one |
| bit in each component, or -1 if the source vector component is zero. If |
| no type modifier is specified, the operand and the result are treated as |
| signed integers. |
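| |
| The zero-input behavior described above can be sketched in Python as |
| follows. The function name and the "signed" and "total_bits" |
| parameters are assumptions made for this example. |
| |
```python
def find_lsb(value, signed=True, total_bits=32):
    """Bit number of the least significant one bit of value."""
    word = (1 << total_bits) - 1
    value &= word
    if value == 0:
        # -1 for signed types; all ones (maximum integer) for unsigned.
        return -1 if signed else word
    # value & -value isolates the lowest set bit.
    return (value & -value).bit_length() - 1
```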
| |
| |
| Section 2.X.8.Z, BTFM: Find Most Significant Bit |
| |
| The BTFM instruction searches for the most significant bit of each |
| component of the single source vector, yielding a result vector comprising |
| the bit number of the located bit for each component. |
| |
| tmp0 = VectorLoad(op0); |
| result.x = FindMSB(tmp0.x); |
| result.y = FindMSB(tmp0.y); |
| result.z = FindMSB(tmp0.z); |
| result.w = FindMSB(tmp0.w); |
| |
| BTFM supports only signed and unsigned integer data type modifiers. For |
| unsigned integer data types, the search will yield the bit number of the |
| most significant one bit in each component, or the maximum integer (all |
| bits are ones) if the source vector component is zero. For signed data |
| types, the search will yield the bit number of the most significant one |
| bit if the source value is positive, the bit number of the most |
| significant zero bit if the source value is negative, or -1 if the source |
| value is zero. If no type modifier is specified, the operand and the |
| result are treated as signed integers. |
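| |
| A Python sketch of the search described above, for illustration only. |
| The function name and parameters are assumptions. The text above does |
| not say what a signed input of -1 (no zero bits) produces; this sketch |
| returns -1 in that case, which is an assumption and not a statement of |
| the specified behavior. |
| |
```python
def find_msb(value, signed=True, total_bits=32):
    """Bit number of the most significant 'interesting' bit of value."""
    word = (1 << total_bits) - 1
    value &= word
    if signed and value >> (total_bits - 1):
        # Negative input: search for the most significant zero bit by
        # inverting and searching for a one bit instead.
        value = ~value & word
    if value == 0:
        # Signed zero gives -1 (and, by assumption, so does signed -1);
        # unsigned zero gives all ones.
        return -1 if signed else word
    return value.bit_length() - 1
```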
| |
| |
| Section 2.X.8.Z, CVT: Data Type Conversion |
| |
| The CVT instruction converts each component of the single source vector |
| from one specified data type to another to yield a result vector. |
| |
| tmp0 = VectorLoad(op0); |
| result = DataTypeConvert(tmp0); |
| |
| The CVT instruction requires two storage modifiers. The first specifies |
| the data type of the result components; the second specifies the data type |
| of the operand components. The supported storage modifiers are F16, F32, |
| F64, S8, S16, S32, S64, U8, U16, U32, and U64. A storage modifier of |
| "F16" indicates a source or destination that is treated as having a |
| floating-point type, but whose sixteen least significant bits describe a |
| 16-bit floating-point value using the encoding provided in Section 2.1.2. |
| |
| If the component size of the source register doesn't match the size of the |
| specified operand data type, the source register components are first |
| interpreted as a value with the same base data type as the operand and |
| converted to the operand data type. The operand components are then |
| converted to the result data type. Finally, if the component size of the |
| destination register doesn't match the specified result data type, the |
| result components are converted to values of the same base data type with |
| a size matching the result register's component size. |
| |
| Data type conversion is performed by first converting the source |
| components to an infinite-precision value of the destination data type, |
| and then converting to the result data type. When converting between |
| floating-point and integer values, integer values are never interpreted as |
| being normalized to [0,1] or [-1,+1]. Converting the floating-point |
| special values -INF, +INF, and NaN to integers will yield undefined |
| results. |
| |
| When converting from a non-integral floating-point value to an integer, |
| one of the two integers closest in value to the floating-point value is |
| chosen according to the rounding instruction modifier. If "CEIL" or "FLR" |
| is specified, the larger or smaller value, respectively, is chosen. If |
| "TRUNC" is specified, the value nearest to zero is chosen. If "ROUND" is |
| specified, if one integer is nearer in value to the original |
| floating-point value, it is chosen; otherwise, the even integer is chosen. |
| "ROUND" is used if no rounding modifier is specified. |
| |
| When converting from the infinite-precision intermediate value to the |
| destination data type: |
| |
| * Floating-point values not exactly representable in the destination |
| data type are rounded to one of the two nearest values in the |
| destination type according to the rounding modifier. Note that the |
| results of float-to-float conversion are not automatically rounded to |
| integer values, even if a rounding modifier such as CEIL or FLR is |
| specified. |
| |
| * Integer values are clamped to the closest value representable in the |
| result data type if the "SAT" (saturation) modifier is specified. |
| |
| * Integer values drop the most significant bits if the "SAT" modifier is |
| not specified. |
| |
| Negation and absolute value operators are not supported on the source |
| operand; a program using such operators will fail to compile. |
| |
| CVT supports no data type modifiers; the type of the operand and result |
| vectors is fully specified by the required storage modifiers. |
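| |
| The float-to-integer rules above (rounding modifiers, "SAT" clamping, |
| and dropping of most significant bits) can be sketched in Python as |
| shown below. This is an illustration under stated assumptions, not a |
| definition of CVT: the function name and parameters are hypothetical, |
| None stands for an undefined result, and Python floats stand in for |
| the source floating-point type. |
| |
```python
import math

def cvt_float_to_int(x, bits=32, signed=True, rounding="ROUND", sat=False):
    """Convert a float to an integer of the given width, CVT-style."""
    if math.isnan(x) or math.isinf(x):
        return None  # NaN and +/-INF give undefined results
    if rounding == "CEIL":
        n = math.ceil(x)
    elif rounding == "FLR":
        n = math.floor(x)
    elif rounding == "TRUNC":
        n = math.trunc(x)
    else:
        n = round(x)  # Python rounds halfway cases to the even integer
    lo = -(1 << (bits - 1)) if signed else 0
    hi = (1 << (bits - 1)) - 1 if signed else (1 << bits) - 1
    if sat:
        return max(lo, min(hi, n))  # clamp to the representable range
    n &= (1 << bits) - 1  # without SAT, drop the most significant bits
    if signed and n > hi:
        n -= 1 << bits  # reinterpret the kept bits as two's complement
    return n
```
| |
| In this sketch, converting 300.0 to an unsigned 8-bit integer gives 255 |
| with saturation and 44 without it, since only the low 8 bits of 300 are |
| kept. |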
| |
| |
| Section 2.X.8.Z, EMIT: Emit Vertex |
| |
| (Modify the description of the EMIT opcode to deal with the interaction |
| with multiple vertex streams added by ARB_transform_feedback3. For more |
| information on vertex streams, see ARB_transform_feedback3.) |
| |
| The EMIT instruction emits a new vertex to be added to the current output |
| primitive for vertex stream zero. The attributes of the emitted vertex |
| are given by the current values of the vertex result variables. After the |
| EMIT instruction completes, a new vertex is started and all result |
| variables become undefined. |
| |
| |
| Section 2.X.8.Z, EMITS: Emit Vertex to Stream |
| |
| (Add new geometry program opcode; the EMITS instruction is not supported |
| for any other program types. For more information on vertex streams, see |
| ARB_transform_feedback3.) |
| |
| The EMITS instruction emits a new vertex to be added to the current output |
| primitive for the vertex stream specified by the single signed integer |
| scalar operand. The attributes of the emitted vertex are given by the |
| current values of the vertex result variables. After the EMITS |
| instruction completes, a new vertex is started and all result variables |
| become undefined. |
| |
| If the specified stream is negative or greater than or equal to the |
| implementation-dependent number of vertex streams |
| (MAX_VERTEX_STREAMS_NV), the results of the instruction are undefined. |
| |
| |
| Section 2.X.8.Z, IPAC: Interpolate at Centroid |
| |
| The IPAC instruction generates a result vector by evaluating the fragment |
| attribute named by the single vector operand at the centroid location. |
| The result vector would be identical to the value obtained by a MOV |
| instruction if the attribute variable were declared using the CENTROID |
| modifier. |
| |
| When interpolating an attribute variable with this instruction, the |
| CENTROID and SAMPLE attribute variable modifiers are ignored. The FLAT |
| and NOPERSPECTIVE variable modifiers operate normally. |
| |
| tmp0 = Interpolate(op0, x_pixel + x_centroid, y_pixel + y_centroid); |
| result = tmp0; |
| |
| IPAC supports only floating-point data type modifiers. A program will |
| fail to load if it contains an IPAC instruction whose single operand is |
| not a fragment program attribute variable or matches the "fragment.facing" |
| or "primitive.id" binding. |
| |
| |
| Section 2.X.8.Z, IPAO: Interpolate with Offset |
| |
| The IPAO instruction generates a result vector by evaluating the fragment |
| attribute named by the first vector operand at an offset from the pixel |
| center given by the x and y components of the second vector operand. The |
| z and w components of the second vector operand are ignored. The (x,y) |
| position used for interpolating the attribute variable is obtained by |
| adding the (x,y) offsets in the second vector operand to the (x,y) |
| position of the pixel center. |
| |
| The range of offsets supported by the IPAO instruction is |
| implementation-dependent. The position used to interpolate the attribute |
| variable is undefined if the x or y component of the second operand is |
| less than MIN_FRAGMENT_INTERPOLATION_OFFSET_NV or greater than |
| MAX_FRAGMENT_INTERPOLATION_OFFSET_NV. Additionally, the granularity of |
| offsets may be limited. The (x,y) value may be snapped to a fixed |
| sub-pixel grid with the number of subpixel bits given by |
| FRAGMENT_PROGRAM_INTERPOLATION_OFFSET_BITS_NV. |
| |
| When interpolating an attribute variable with this instruction, the |
| CENTROID and SAMPLE attribute variable modifiers are ignored. The FLAT |
| and NOPERSPECTIVE variable modifiers operate normally. |
| |
| tmp1 = VectorLoad(op1); |
| tmp0 = Interpolate(op0, x_pixel + tmp1.x, y_pixel + tmp1.y); |
| result = tmp0; |
| |
| IPAO supports only floating-point data type modifiers. A program will |
| fail to load if it contains an IPAO instruction whose first operand is not |
| a fragment program attribute variable or matches the "fragment.facing" or |
| "primitive.id" binding. |
| |
| |
| Section 2.X.8.Z, IPAS: Interpolate at Sample Location |
| |
| The IPAS instruction generates a result vector by evaluating the fragment |
| attribute named by the first vector operand at the location of the |
| pixel's sample whose sample number is given by the second integer scalar |
| operand. If multisample buffers are not available (SAMPLE_BUFFERS is |
| zero), the attribute will be evaluated at the pixel center. If the sample |
| number given by the second operand does not exist, the position used to |
| interpolate the attribute is undefined. |
| |
| When interpolating an attribute variable with this instruction, the |
| CENTROID and SAMPLE attribute variable modifiers are ignored. The FLAT |
| and NOPERSPECTIVE variable modifiers operate normally. |
| |
| sample = ScalarLoad(op1); |
| tmp1 = SampleOffset(sample); |
| tmp0 = Interpolate(op0, x_pixel + tmp1.x, y_pixel + tmp1.y); |
| result = tmp0; |
| |
| IPAS supports only floating-point data type modifiers. A program will |
| fail to load if it contains an IPAS instruction whose first operand is not |
| a fragment program attribute variable or matches the "fragment.facing" or |
| "primitive.id" binding. |
| |
| |
| Section 2.X.8.Z, LDC: Load from Constant Buffer |
| |
| The LDC instruction loads a vector operand from a buffer object to yield a |
| result vector. The operand used for the LDC instruction must correspond |
| to a parameter buffer variable declared using the "CBUFFER" statement; a |
| program will fail to load if any other type of operand is used in an LDC |
| instruction. |
| |
| result = BufferMemoryLoad(&op0, storageModifier); |
| |
| A base operand vector is fetched from memory as described in Section |
| 2.X.4.5, with the GPU address derived from the binding corresponding to |
| the operand. A final operand vector is derived from the base operand |
| vector by applying swizzle, negation, and absolute value operand modifiers |
| as described in Section 2.X.4.2. |
| |
| The amount of memory in any given buffer object binding accessible by the |
| LDC instruction may be limited. If any component fetched by the LDC |
| instruction extends 4*<n> or more basic machine units from the beginning |
| of the buffer object binding, where <n> is the implementation-dependent |
| constant MAX_PROGRAM_PARAMETER_BUFFER_SIZE_NV, the value fetched for that |
| component will be undefined. |
| |
| LDC supports no base data type modifiers, but requires exactly one storage |
| modifier. The base data types of the operand and result vectors are |
| derived from the storage modifier. |
| |
| |
| Section 2.X.8.Z, LOAD: Global Load |
| |
| The LOAD instruction generates a result vector by reading an address from |
| the single unsigned integer scalar operand and fetching data from buffer |
| object memory, as described in Section 2.X.4.5. |
| |
| address = ScalarLoad(op0); |
| result = BufferMemoryLoad(address, storageModifier); |
| |
| LOAD supports no base data type modifiers, but requires exactly one |
| storage modifier. The base data type of the result vector is derived from |
| the storage modifier. The single scalar operand is always interpreted as |
| an unsigned integer. |
| |
| |
| Section 2.X.8.Z, MEMBAR: Memory Barrier |
| |
| The MEMBAR instruction synchronizes memory transactions to ensure that |
| memory transactions resulting from any instruction executed by the thread |
| prior to the MEMBAR instruction complete prior to any memory transactions |
| issued after the instruction. |
| |
| MEMBAR has no operands and generates no result. |
| |
| |
| Section 2.X.8.Z, PK64: Pack 64-Bit Component |
| |
| The PK64 instruction reads the four components of the single vector |
| operand as 32-bit values, packs the bit representations of these into a |
| pair of 64-bit values, and replicates those to produce a four-component |
| result vector. The "x" and "y" components of the operand are packed to |
| produce the "x" and "z" components of the result vector; the "z" and "w" |
| components of the operand are packed to produce the "y" and "w" components |
| of the result vector. The PK64 instruction can be reversed by the UP64 |
| instruction below. |
| |
| This instruction is intended to allow a program to reconstruct 64-bit |
| integer or floating-point values generated by the application but passed |
| to the GL as two 32-bit values taken from adjacent words in memory. The |
| ability to use this technique depends on how the 64-bit value is stored in |
| memory. For "little-endian" processors, the first 32-bit value holds the |
| least significant 32 bits of the 64-bit value. For "big-endian" |
| processors, the first 32-bit value holds the most significant 32 bits of |
| the 64-bit value. This reconstruction assumes that the first 32-bit word |
| comes from the x component of the operand and the second 32-bit word comes |
| from the y component. The method used to construct a 64-bit value from a |
| pair of 32-bit values depends on the processor type. |
| |
| tmp = VectorLoad(op0); |
| |
| if (underlying system is little-endian) { |
| result.x = RawBits(tmp.x) | (RawBits(tmp.y) << 32); |
| result.y = RawBits(tmp.z) | (RawBits(tmp.w) << 32); |
| result.z = RawBits(tmp.x) | (RawBits(tmp.y) << 32); |
| result.w = RawBits(tmp.z) | (RawBits(tmp.w) << 32); |
| } else { |
| result.x = RawBits(tmp.y) | (RawBits(tmp.x) << 32); |
| result.y = RawBits(tmp.w) | (RawBits(tmp.z) << 32); |
| result.z = RawBits(tmp.y) | (RawBits(tmp.x) << 32); |
| result.w = RawBits(tmp.w) | (RawBits(tmp.z) << 32); |
| } |
| |
| PK64 supports integer and floating-point data type modifiers, which |
| specify the base data type of the operand and result. The single vector |
| operand is always treated as having 32-bit components, and the result is |
| treated as a vector with 64-bit components. The encoding performed by |
| PK64 can be reversed using the UP64 instruction. |
| |
| A program will fail to load if it contains a PK64 instruction that writes |
| its results to a variable not declared as "LONG". |
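| |
| The packing pseudocode above can be mirrored in Python as a quick |
| check of the component layout. This is a sketch only: the function |
| name and the "little_endian" flag are assumptions, and the operands |
| are taken to be raw 32-bit patterns passed as integers. |
| |
```python
def pk64(x, y, z, w, little_endian=True):
    """Pack four 32-bit patterns into the PK64 result (x, y, z, w)."""
    m = 0xFFFFFFFF
    x, y, z, w = x & m, y & m, z & m, w & m
    if little_endian:
        first = x | (y << 32)   # x holds the low word, y the high word
        second = z | (w << 32)
    else:
        first = y | (x << 32)   # x holds the high word, y the low word
        second = w | (z << 32)
    # The two packed values are replicated across the four components.
    return (first, second, first, second)
```
| |
| With little-endian packing, operand components (0xDDCCBBAA, 0x11223344) |
| produce the 64-bit value 0x11223344DDCCBBAA in the x and z components |
| of the result. |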
| |
| |
| Section 2.X.8.Z, STORE: Global Store |
| |
| The STORE instruction reads an address from the second unsigned integer |
| scalar operand and writes the contents of the first vector operand to |
| buffer object memory at that address, as described in Section 2.X.4.5. |
| This instruction generates no result. |
| |
| tmp0 = VectorLoad(op0); |
| address = ScalarLoad(op1); |
| BufferMemoryStore(address, tmp0, storageModifier); |
| |
| STORE supports no base data type modifiers, but requires exactly one |
| storage modifier. The base data type of the vector components of the |
| first operand is derived from the storage modifier. The second operand is |
| always interpreted as an unsigned integer scalar. |
| |
| |
| Section 2.X.8.Z, TEX: Texture Sample |
| |
| (Modify the instruction pseudo-code to account for texel offsets no |
| longer needing to be immediate arguments.) |
| |
| tmp = VectorLoad(op0); |
| if (instruction has variable texel offset) { |
| itmp = VectorLoad(op1); |
| } else { |
| itmp = instruction.texelOffset; |
| } |
| ddx = ComputePartialsX(tmp); |
| ddy = ComputePartialsY(tmp); |
| lambda = ComputeLOD(ddx, ddy); |
| result = TextureSample(tmp, lambda, ddx, ddy, itmp); |
| |
| |
| Section 2.X.8.Z, TGALL: Test for All Non-Zero in a Thread Group |
| |
| The TGALL instruction produces a result vector by reading a vector operand |
| for each active thread in the current thread group and comparing each |
| component to zero. A result vector component contains a TRUE value |
| (described below) if the value of the corresponding component in the |
| operand vector is non-zero for all active threads, and a FALSE value |
| otherwise. |
| |
| An implementation may choose to arrange program threads into thread |
| groups, and execute an instruction simultaneously for each thread in the |
| group. If the TGALL instruction is contained inside conditional flow |
| control blocks and not all threads in the group execute the instruction, |
| the operand values for threads not executing the instruction have no |
| bearing on the value returned. The method used to arrange threads into |
| groups is undefined. |
| |
| tmp = VectorLoad(op0); |
| result = { TRUE, TRUE, TRUE, TRUE }; |
| for (all active threads) { |
| if ([thread]tmp.x == 0) result.x = FALSE; |
| if ([thread]tmp.y == 0) result.y = FALSE; |
| if ([thread]tmp.z == 0) result.z = FALSE; |
| if ([thread]tmp.w == 0) result.w = FALSE; |
| } |
| |
| TGALL supports all data type modifiers. For floating-point data types, |
| the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data |
| types, the TRUE value is -1 and the FALSE value is 0. For unsigned |
| integer data types, the TRUE value is the maximum integer value (all bits |
| are ones) and the FALSE value is zero. |
| |
| |
| Section 2.X.8.Z, TGANY: Test for Any Non-Zero in a Thread Group |
| |
| The TGANY instruction produces a result vector by reading a vector operand |
| for each active thread in the current thread group and comparing each |
| component to zero. A result vector component contains a TRUE value |
| (described below) if the value of the corresponding component in the |
| operand vector is non-zero for any active thread, and a FALSE value |
| otherwise. |
| |
| An implementation may choose to arrange program threads into thread |
| groups, and execute an instruction simultaneously for each thread in the |
| group. If the TGANY instruction is contained inside conditional flow |
| control blocks and not all threads in the group execute the instruction, |
| the operand values for threads not executing the instruction have no |
| bearing on the value returned. The method used to arrange threads into |
| groups is undefined. |
| |
| tmp = VectorLoad(op0); |
| result = { FALSE, FALSE, FALSE, FALSE }; |
| for (all active threads) { |
| if ([thread]tmp.x != 0) result.x = TRUE; |
| if ([thread]tmp.y != 0) result.y = TRUE; |
| if ([thread]tmp.z != 0) result.z = TRUE; |
| if ([thread]tmp.w != 0) result.w = TRUE; |
| } |
| |
| TGANY supports all data type modifiers. For floating-point data types, |
| the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data |
| types, the TRUE value is -1 and the FALSE value is 0. For unsigned |
| integer data types, the TRUE value is the maximum integer value (all bits |
| are ones) and the FALSE value is zero. |
| |
| |
| Section 2.X.8.Z, TGEQ: Test for All Equal Values in a Thread Group |
| |
| The TGEQ instruction produces a result vector by reading a vector operand |
| for each active thread in the current thread group and comparing each |
| component to zero. A result vector component contains a TRUE value |
| (described below) if the value of the corresponding component in the |
| operand vector is the same for all active threads, and a FALSE value |
| otherwise. |
| |
| An implementation may choose to arrange program threads into thread |
| groups, and execute an instruction simultaneously for each thread in the |
| group. If the TGEQ instruction is contained inside conditional flow |
| control blocks and not all threads in the group execute the instruction, |
| the operand values for threads not executing the instruction have no |
| bearing on the value returned. The method used to arrange threads into |
| groups is undefined. |
| |
| tmp = VectorLoad(op0); |
| tgall = { TRUE, TRUE, TRUE, TRUE }; |
| tgany = { FALSE, FALSE, FALSE, FALSE }; |
| for (all active threads) { |
| if ([thread]tmp.x == 0) tgall.x = FALSE; else tgany.x = TRUE; |
| if ([thread]tmp.y == 0) tgall.y = FALSE; else tgany.y = TRUE; |
| if ([thread]tmp.z == 0) tgall.z = FALSE; else tgany.z = TRUE; |
| if ([thread]tmp.w == 0) tgall.w = FALSE; else tgany.w = TRUE; |
| } |
| result.x = (tgall.x == tgany.x) ? TRUE : FALSE; |
| result.y = (tgall.y == tgany.y) ? TRUE : FALSE; |
| result.z = (tgall.z == tgany.z) ? TRUE : FALSE; |
| result.w = (tgall.w == tgany.w) ? TRUE : FALSE; |
| |
| TGEQ supports all data type modifiers. For floating-point data types, the |
| TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data |
| types, the TRUE value is -1 and the FALSE value is 0. For unsigned |
| integer data types, the TRUE value is the maximum integer value (all bits |
| are ones) and the FALSE value is zero. |
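| |
| The TGALL, TGANY, and TGEQ pseudocode can be exercised for a single |
| component with the Python sketch below. It is illustrative only: the |
| function names, the list-based model of a thread group with an |
| "active" mask, and the 32-bit width used for the unsigned TRUE value |
| are assumptions. |
| |
```python
def thread_group_votes(values, active):
    """Return (tgall, tgany, tgeq) for one component of a thread group."""
    # Threads not executing the instruction have no bearing on the result.
    live = [v for v, is_active in zip(values, active) if is_active]
    tgall = all(v != 0 for v in live)
    tgany = any(v != 0 for v in live)
    # As in the TGEQ pseudocode: TRUE when the two tests agree.
    tgeq = tgall == tgany
    return tgall, tgany, tgeq

def encode(flag, kind):
    """Encode TRUE/FALSE for a 'float', 'signed', or 'unsigned' result."""
    if kind == "float":
        return 1.0 if flag else 0.0
    if kind == "signed":
        return -1 if flag else 0
    return 0xFFFFFFFF if flag else 0  # assumes 32-bit unsigned components
```
| |
| In this model, a zero value held by an inactive thread does not prevent |
| the TGALL test from passing. |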
| |
| |
| Section 2.X.8.Z, TXB: Texture Sample with Bias |
| |
| (Modify the instruction pseudo-code to account for texel offsets no |
| longer needing to be immediate arguments.) |
| |
| tmp = VectorLoad(op0); |
| if (instruction has variable texel offset) { |
| itmp = VectorLoad(op1); |
| } else { |
| itmp = instruction.texelOffset; |
| } |
| ddx = ComputePartialsX(tmp); |
| ddy = ComputePartialsY(tmp); |
| lambda = ComputeLOD(ddx, ddy); |
| result = TextureSample(tmp, lambda + tmp.w, ddx, ddy, itmp); |
| |
| |
| Section 2.X.8.Z, TXG: Texture Gather |
| |
| (Update the TXG opcode description from NV_gpu_program4_1 specification. |
| This version adds two capabilities: any component of a multi-component |
| texture can be selected by tacking on a component name to the texture |
| variable passed to identify the texture unit, and depth compares are |
| supported if a SHADOW target is specified.) |
| |
| The TXG instruction takes the four components of a single floating-point |
| vector operand as a texture coordinate, determines a set of four texels to |
| sample from the base level of detail of the specified texture image, and |
| returns one component from each texel in a four-component result vector. |
| To determine the four texels to sample, the minification and magnification |
| filters are ignored and the rules for LINEAR filter are applied to the |
| base level of the texture image to determine the texels T_i0_j1, T_i1_j1, |
| T_i1_j0, and T_i0_j0, as defined in equations 3.23 through 3.25. The |
| texels are then converted to texture source colors (Rs,Gs,Bs,As) according |
| to table 3.21, followed by application of the texture swizzle as described |
| in section 3.8.13. A four-component vector is returned by taking one of |
| the four components of the swizzled texture source colors from each of the |
| four selected texels. The component is selected using the |
| <texImageUnitComp> grammar rule, by adding a scalar suffix |
| (".x", ".y", ".z", ".w") to the identified texture; if no scalar suffix |
| is provided, the first component is selected. |
| |
| TXG only operates on 2D, SHADOW2D, CUBE, SHADOWCUBE, ARRAY2D, |
| SHADOWARRAY2D, ARRAYCUBE, SHADOWARRAYCUBE, RECT, and SHADOWRECT texture |
| targets; a program will fail to compile if any other texture target is |
| used. |
| |
| When using a "SHADOW" texture target, component selection is ignored. |
| Instead, depth comparisons are performed on the depth values for each of |
| the four selected texels, and 0/1 values are returned based on the results |
| of the comparison. |
| |
| As with other texture accesses, the results of a texture gather operation |
| are undefined if the texture target in the instruction is incompatible |
| with the selected texture's base internal format and depth compare mode. |
| |
| tmp = VectorLoad(op0); |
| ddx = (0,0,0); |
| ddy = (0,0,0); |
| lambda = 0; |
| if (instruction has variable texel offset) { |
| itmp = VectorLoad(op1); |
| } else { |
| itmp = instruction.texelOffset; |
| } |
| result.x = TextureSample_i0j1(tmp, lambda, ddx, ddy, itmp).<comp>; |
| result.y = TextureSample_i1j1(tmp, lambda, ddx, ddy, itmp).<comp>; |
| result.z = TextureSample_i1j0(tmp, lambda, ddx, ddy, itmp).<comp>; |
| result.w = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>; |
| |
| In this pseudocode, "<comp>" refers to the texel component selected by the |
| <texImageUnitComp> grammar rule, as described above. |
| |
| TXG supports all three data type modifiers. The single operand is always |
| treated as a floating-point vector; the results are interpreted according |
| to the data type modifier. |
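| |
| To make the order of the gathered texels concrete, the Python sketch |
| below gathers from a small single-component 2D image. It rests on |
| several assumptions that are not taken from this specification: the |
| LINEAR footprint is computed as i0 = floor(u - 1/2) and i1 = i0 + 1 |
| (and likewise for j), coordinates are normalized, out-of-range indices |
| are clamped to the edge, and no texture swizzle, texel offset, or depth |
| comparison is applied. The function name is hypothetical. |
| |
```python
import math

def texture_gather(texels, s, t):
    """Gather four texels; texels[j][i] is the texel at column i, row j."""
    height, width = len(texels), len(texels[0])

    def clamp(index, size):
        return max(0, min(size - 1, index))

    i0 = math.floor(s * width - 0.5)
    j0 = math.floor(t * height - 0.5)
    i1, j1 = clamp(i0 + 1, width), clamp(j0 + 1, height)
    i0, j0 = clamp(i0, width), clamp(j0, height)
    # Result order follows the pseudocode: i0j1, i1j1, i1j0, i0j0.
    return (texels[j1][i0], texels[j1][i1], texels[j0][i1], texels[j0][i0])
```
| |
| For a 2x2 image with rows [0, 1] and [2, 3], gathering at the center |
| (0.5, 0.5) returns (2, 3, 1, 0) under these assumptions. |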
| |
| |
| Section 2.X.8.Z, TXGO: Texture Gather with Per-Texel Offsets |
| |
| Like the TXG instruction, the TXGO instruction takes the four components |
| of its first floating-point vector operand as a texture coordinate, |
| determines a set of four texels to sample from the base level of detail of |
| the specified texture image, and returns one component from each texel in |
| a four-component result vector. The second and third vector operands are |
| taken as signed four-component integer vectors providing the x and y |
| components of the offsets, respectively, used to determine the location of |
| each of the four texels. To determine the four texels to sample, each of |
| the four independent offsets is used in conjunction with the specified |
| texture coordinate to select a texel. The minification and magnification |
| filters are ignored and the rules for LINEAR filtering are used to select |
| the texel T_i0_j0, as defined in equations 3.23 through 3.25, from the |
| base level of the texture image. The texels are then converted to texture |
| source colors (Rs,Gs,Bs,As) according to table 3.21, followed by |
| application of the texture swizzle as described in section 3.8.13. A |
| four-component vector is returned by taking one of the four components |
| of the swizzled texture source colors from each of the four selected |
| texels. The component is selected using the <texImageUnitComp> grammar |
| rule, by adding a scalar suffix (".x", ".y", ".z", ".w") to the identified |
| texture; if no scalar suffix is provided, the first component is selected. |
| |
| TXGO only operates on 2D, SHADOW2D, ARRAY2D, SHADOWARRAY2D, RECT, and |
| SHADOWRECT texture targets; a program will fail to compile if any other |
| texture target is used. |
| |
| When using a "SHADOW" texture target, component selection is ignored. |
| Instead, depth comparisons are performed on the depth values for each of |
| the four selected texels, and 0/1 values are returned based on the results |
| of the comparison. |
| |
| As with other texture accesses, the results of a texture gather operation |
| are undefined if the texture target in the instruction is incompatible |
| with the selected texture's base internal format and depth compare mode. |
| |
| tmp = VectorLoad(op0); |
| itmp1 = VectorLoad(op1); |
| itmp2 = VectorLoad(op2); |
| ddx = (0,0,0); |
| ddy = (0,0,0); |
| lambda = 0; |
| itmp = (itmp1.x, itmp2.x); |
| result.x = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>; |
| itmp = (itmp1.y, itmp2.y); |
| result.y = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>; |
| itmp = (itmp1.z, itmp2.z); |
| result.z = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>; |
| itmp = (itmp1.w, itmp2.w); |
| result.w = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>; |
| |
| In this pseudocode, "<comp>" refers to the texel component selected by the |
| <texImageUnitComp> grammar rule, as described above. |
| |
| If TEXTURE_WRAP_S or TEXTURE_WRAP_T is either CLAMP or MIRROR_CLAMP_EXT, |
| the results of the TXGO instruction are undefined. |
| |
| Note: The TXG instruction is equivalent to the TXGO instruction with X |
| and Y offset vectors of (0,1,1,0) and (0,0,-1,-1), respectively. |
| |
| TXGO supports all three data type modifiers. The first operand is always |
| treated as a floating-point vector and the second and third operands are |
| always treated as signed integer vectors; the results are interpreted |
| according to the data type modifier. |
| |
| |
| Section 2.X.8.Z, TXL: Texture Sample with LOD |
| |
| (Modify the instruction pseudo-code to account for texel offsets no |
| longer needing to be immediate arguments.) |
| |
| tmp = VectorLoad(op0); |
| if (instruction has variable texel offset) { |
| itmp = VectorLoad(op1); |
| } else { |
| itmp = instruction.texelOffset; |
| } |
| ddx = (0,0,0); |
| ddy = (0,0,0); |
| result = TextureSample(tmp, tmp.w, ddx, ddy, itmp); |
| |
| |
| Section 2.X.8.Z, TXP: Texture Sample with Projection |
| |
| (Modify the instruction pseudo-code to account for texel offsets no |
| longer needing to be immediate arguments.) |
| |
| tmp0 = VectorLoad(op0); |
| tmp0.x = tmp0.x / tmp0.w; |
| tmp0.y = tmp0.y / tmp0.w; |
| tmp0.z = tmp0.z / tmp0.w; |
| if (instruction has variable texel offset) { |
| itmp = VectorLoad(op1); |
| } else { |
| itmp = instruction.texelOffset; |
| } |
| ddx = ComputePartialsX(tmp0); |
| ddy = ComputePartialsY(tmp0); |
| lambda = ComputeLOD(ddx, ddy); |
| result = TextureSample(tmp0, lambda, ddx, ddy, itmp); |
| |
| |
| Section 2.X.8.Z, UP64: Unpack 64-bit Component |
| |
| The UP64 instruction produces a vector result with 32-bit components by |
| unpacking the bits of the "x" and "y" components of a 64-bit vector |
| operand. The "x" component of the operand is unpacked to produce the "x" |
| and "y" components of the result vector; the "y" component is unpacked to |
| produce the "z" and "w" components of the result vector. |
| |
| This instruction is intended to allow a program to pass 64-bit integer or |
| floating-point values to an application using two 32-bit values stored in |
| adjacent words in memory, which will be read by the application as single |
| 64-bit values. The ability to use this technique depends on how the |
| 64-bit value is stored in memory. For "little-endian" processors, the |
| first 32-bit value would hold the least significant 32 bits of |
| the 64-bit value. For "big-endian" processors, the first 32-bit value |
| holds the most significant 32 bits of the 64-bit value. This |
| reconstruction assumes that the first 32-bit word comes from the "x" |
| component of the operand and the second 32-bit word comes from the "y" |
| component. The method used to unpack a 64-bit value into a pair of 32-bit |
| values depends on the processor type. |
| |
| tmp = VectorLoad(op0); |
| if (underlying system is little-endian) { |
| result.x = (RawBits(tmp.x) >> 0) & 0xFFFFFFFF; |
| result.y = (RawBits(tmp.x) >> 32) & 0xFFFFFFFF; |
| result.z = (RawBits(tmp.y) >> 0) & 0xFFFFFFFF; |
| result.w = (RawBits(tmp.y) >> 32) & 0xFFFFFFFF; |
| } else { |
| result.x = (RawBits(tmp.x) >> 32) & 0xFFFFFFFF; |
| result.y = (RawBits(tmp.x) >> 0) & 0xFFFFFFFF; |
| result.z = (RawBits(tmp.y) >> 32) & 0xFFFFFFFF; |
| result.w = (RawBits(tmp.y) >> 0) & 0xFFFFFFFF; |
| } |
| |
| UP64 supports integer and floating-point data type modifiers, which |
| specify the base data type of the operand and result. The single operand |
| vector always has 64-bit components. The result is treated as a vector |
| with 32-bit components. The encoding performed by UP64 can be reversed |
| using the PK64 instruction. |
| |
| A program will fail to load if it contains a UP64 instruction whose |
| operand is a variable not declared as "LONG". |
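| |
| The unpacking pseudocode above can be sketched in Python as follows. |
| This is an illustration only, not part of the instruction definition; |
| the helper names up64 and raw_bits are hypothetical, and the sketch |
| assumes that operand components are available as 64-bit raw bit |
| patterns. |
| |
```python
import struct


def raw_bits(value):
    """Raw 64-bit pattern of a double, standing in for RawBits()."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def up64(x_bits, y_bits, little_endian=True):
    """Split two 64-bit patterns into four 32-bit words (x, y, z, w)."""
    def split(bits):
        low = bits & 0xFFFFFFFF
        high = (bits >> 32) & 0xFFFFFFFF
        # Little-endian: low word first. Big-endian: high word first.
        return (low, high) if little_endian else (high, low)
    return split(x_bits) + split(y_bits)


# Unpack the doubles 1.0 and -2.0 on a little-endian system.
words_le = up64(raw_bits(1.0), raw_bits(-2.0), little_endian=True)

# Writing the four words to adjacent memory and reading them back as two
# 64-bit values recovers the original doubles.
restored_le = struct.unpack("<2d", struct.pack("<4I", *words_le))

# The big-endian path swaps the word order within each pair.
words_be = up64(raw_bits(1.0), raw_bits(-2.0), little_endian=False)
restored_be = struct.unpack(">2d", struct.pack(">4I", *words_be))
```
| |
| In both cases the word order matches the byte order of the reader, |
| which is why the round trip recovers the original values. |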
| |
| |
| Modify Section 2.14.6.1 of the NV_geometry_program4 specification, |
| Geometry Program Input Primitives |
| |
| (add patches to the list of supported input primitive types) |
| |
| The supported input primitive types are: ... |
| |
| Patches (PATCHES) |
| |
| Geometry programs that operate on patches are valid only for the |
| PATCHES_NV primitive type. There are a variable number of vertices |
| available for each program invocation, depending on the number of input |
| vertices in the primitive itself. For a patch with <n> vertices, |
| "vertex[0]" refers to the first vertex of the patch, and "vertex[<n>-1]" |
| refers to the last vertex. |
| |
| |
| Modify Section 2.14.6.2 of the NV_geometry_program4 specification, |
| Geometry Program Output Primitives |
| |
| (Add a new paragraph limiting the use of the EMITS opcode to geometry |
| programs with a POINTS output primitive type at the end of the section. |
| This limitation may be removed in future specifications.) |
| |
| Geometry programs may write to multiple vertex streams only if the |
| specified output primitive type is POINTS. A program will fail to load if |
| it contains an EMITS instruction and the output primitive type specified |
| by the PRIMITIVE_OUT declaration is not POINTS. |
| |
| Modify Section 2.14.6.4 of the NV_geometry_program4 specification, |
| Geometry Program Output Limits |
| |
| (Modify the limitation on the total number of components emitted by a |
| geometry program from NV_gpu_program4 to be per-invocation. If that |
| limit is 4096 and a program has 16 invocations, each of the 16 program |
| invocations can emit up to 4096 total components.) |
| |
| There are two implementation-dependent limits that limit the total number |
| of vertices that each invocation of a program can emit. First, the vertex |
| limit may not exceed the value of MAX_PROGRAM_OUTPUT_VERTICES_NV. Second, |
| the product of the vertex limit and the number of result variable components |
| written by the program (PROGRAM_RESULT_COMPONENTS_NV, as described in |
| section 2.X.3.5 of NV_gpu_program4) may not exceed the value of |
| MAX_PROGRAM_TOTAL_OUTPUT_COMPONENTS_NV. A geometry program will fail to |
| load if its maximum vertex count or maximum total component count exceeds |
| the implementation-dependent limit. The limits may be queried by calling |
| GetProgramiv with a <target> of GEOMETRY_PROGRAM_NV. Note that the |
| maximum number of vertices that a geometry program can emit may be much |
| lower than MAX_PROGRAM_OUTPUT_VERTICES_NV if the program writes a large |
| number of result variable components. If a geometry program has multiple |
| invocations (via the "INVOCATIONS" declaration), the program will load |
| successfully as long as no single invocation exceeds the total component |
| count limit, even if the total output of all invocations combined exceeds |
| the limit. |
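| |
| The two limits above can be sketched as a load-time check in Python. |
| This is a minimal illustration, not the assembler's actual logic; the |
| function names are hypothetical, and the limit values used below are |
| example numbers rather than values queried from an implementation. |
| |
```python
def geometry_program_loads(vertex_limit, result_components,
                           max_output_vertices, max_total_components):
    """Apply both per-invocation limits described in the text."""
    if vertex_limit > max_output_vertices:
        return False
    return vertex_limit * result_components <= max_total_components


def combined_output(vertex_limit, result_components, invocations):
    """Total components across all invocations (not limited at load)."""
    return vertex_limit * result_components * invocations


# Example numbers only: 1024 vertices and 4096 total components.
MAX_VERTICES = 1024
MAX_COMPONENTS = 4096

# 64 vertices * 64 components = 4096, exactly at the limit.
loads = geometry_program_loads(64, 64, MAX_VERTICES, MAX_COMPONENTS)

# With 16 invocations the combined output is 16 times larger, but the
# check is per invocation, so the program still loads.
total = combined_output(64, 64, 16)
```
| |
| A real application would obtain the two limits with GetProgramiv on |
| GEOMETRY_PROGRAM_NV, as described above, instead of hard-coding them. |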
| |
| |
| Additions to Chapter 3 of the OpenGL 3.0 Specification (Rasterization) |
| |
| Modify Section 3.X, Early Per-Fragment Tests, as documented in the |
| EXT_shader_image_load_store specification |
| |
| (add new paragraph at the end of a section, describing how early fragment |
| tests work when assembly fragment programs are active) |
| |
| If an assembly fragment program is active, early depth tests are |
| considered enabled if and only if the fragment program source included the |
| NV_early_fragment_tests option. |
| |
| |
| Add to Section 3.11.4.5 of ARB_fragment_program (Fragment Program): |
| |
| Section 3.11.4.5.3, ARB_blend_func_extended Option |
| |
| If a fragment program specifies the "ARB_blend_func_extended" option, dual |
| source color outputs as described in ARB_blend_func_extended are made |
| available through the use of the "result.color[n].primary" and |
| "result.color[n].secondary" result bindings, corresponding to SRC_COLOR |
| and SRC1_COLOR, respectively, for the fragment color output numbered <n>. |
| |
| |
| Additions to Chapter 4 of the OpenGL 3.0 Specification (Per-Fragment |
| Operations and the Frame Buffer) |
| |
| Modify Section 4.4.3, Rendering When an Image of a Bound Texture Object |
| is Also Attached to the Framebuffer, p. 288 |
| |
| (Replace the complicated set of conditions with the following) |
| |
| Specifically, the values of rendered fragments are undefined if any |
| shader stage fetches texels from a given mipmap level, cubemap face, and |
| array layer of a texture if that same mipmap level, cubemap face, and |
| array layer of the texture can be written to via fragment shader outputs, |
| even if the reads and writes are not in the same Draw call. However, an |
| application can insert MemoryBarrier(TEXTURE_FETCH_BARRIER_BIT_NV) between |
| Draw calls that have such read/write hazards in order to guarantee that |
| writes have completed and caches have been invalidated, as described in |
| section 2.20.X. |
| |
| |
| Additions to Chapter 5 of the OpenGL 3.0 Specification (Special Functions) |
| |
| None. |
| |
| Additions to Chapter 6 of the OpenGL 3.0 Specification (State and |
| State Requests) |
| |
| None. |
| |
| Additions to Appendix A of the OpenGL 3.0 Specification (Invariance) |
| |
| None. |
| |
| Additions to the AGL/GLX/WGL Specifications |
| |
| None. |
| |
| GLX Protocol |
| |
| None. |
| |
| Errors |
| |
| None, other than new conditions by which a program string would fail to |
| load. |
| |
| New State |
| |
| None. |
| |
| |
| New Implementation Dependent State |
| |
| Minimum |
| Get Value Type Get Command Value Description Sec. Attrib |
| -------------------------------- ---- --------------- ------- --------------------- ------ ------ |
| MAX_GEOMETRY_PROGRAM_ Z+ GetIntegerv 32 Maximum number of GP 2.X.6.Y - |
| INVOCATIONS_NV invocations per prim. |
| MIN_FRAGMENT_INTERPOLATION_ R GetFloatv -0.5 Max. negative offset 2.X.8.Z - |
| OFFSET_NV for IPAO instruction. |
| MAX_FRAGMENT_INTERPOLATION_ R GetFloatv +0.5 Max. positive offset 2.X.8.Z - |
| OFFSET_NV for IPAO instruction. |
| FRAGMENT_PROGRAM_INTERPOLATION_ Z+ GetIntegerv 4 Subpixel bit count 2.X.8.Z - |
| OFFSET_BITS_NV for IPAO instruction |
| |
| |
| Dependencies on NV_gpu_program4, NV_vertex_program4, NV_geometry_program4, and |
| NV_fragment_program4 |
| |
| This extension is written against the NV_gpu_program4 family of |
| extensions, and introduces new instruction set features and inputs/outputs |
| described here. These features are available only if the extension is |
| supported and the appropriate program header string is used ("!!NVvp5.0" |
| for vertex programs, "!!NVgp5.0" for geometry programs, and "!!NVfp5.0" |
| for fragment programs.) When loading a program with an older header (e.g., |
| "!!NVvp4.0"), the instruction set features described in this extension are |
| not available. The features in this extension build upon those documented |
| in full in NV_gpu_program4. |
| |
| Dependencies on NV_tessellation_program5 |
| |
| This extension provides the basic assembly instruction set constructs for |
| tessellation programs. If this extension is supported, tessellation |
| control and evaluation programs are supported, as described in the |
| NV_tessellation_program5 specification. There is no separate extension |
| string for tessellation programs; such support is implied by this |
| extension. |
| |
| Dependencies on ARB_transform_feedback3 |
| |
| The concept of multiple vertex streams emitted by a geometry shader is |
| introduced by ARB_transform_feedback3, as is the description of how they |
| operate and implementation-dependent limits on the number of streams. |
| This extension simply provides a mechanism to emit a vertex to more than |
| one stream. If ARB_transform_feedback3 is not supported, language |
| describing the EMITS opcode and the restriction on PRIMITIVE_OUT when |
| EMITS is used should be removed. |
| |
| Dependencies on NV_shader_buffer_load |
| |
| The programmability functionality provided by NV_shader_buffer_load is |
| also incorporated by this extension. Any assembly program using a program |
| header corresponding to this or any subsequent extension (e.g., |
| "!!NVfp5.0") may use the LOAD opcode without needing to declare "OPTION |
| NV_shader_buffer_load". |
| |
| NV_shader_buffer_load is required by this extension, which means that the |
| API mechanisms documented there allowing applications to make a buffer |
| resident and query its GPU address are available to any applications using |
| this extension. |
| |
| In addition to the basic functionality in NV_shader_buffer_load, this |
| extension provides the ability to load 64-bit integers and floating-point |
| values using the "S64", "S64X2", "S64X4", "U64", "U64X2", "U64X4", "F64", |
| "F64X2", and "F64X4" opcode modifiers. |
| |
| Dependencies on NV_shader_buffer_store |
| |
| This extension provides assembly programmability support for |
| NV_shader_buffer_store, which provides the API mechanisms allowing buffer |
| objects to be stored to. NV_shader_buffer_store does not have a separate |
| extension string entry, and will always be supported if this extension is |
| present. |
| |
| Dependencies on NV_parameter_buffer_object2 |
| |
| The programmability functionality provided by NV_parameter_buffer_object2 |
| is also incorporated by this extension. Any assembly program using a |
| program header corresponding to this or any subsequent extension (e.g., |
| "!!NVfp5.0") may use the LDC opcode without needing to declare "OPTION |
| NV_parameter_buffer_object2". |
| |
| In addition to the basic functionality in NV_parameter_buffer_object2, |
| this extension provides the ability to load 64-bit integers and |
| floating-point values using the "S64", "S64X2", "S64X4", "U64", "U64X2", |
| "U64X4", "F64", "F64X2", and "F64X4" opcode modifiers. |
| |
| Dependencies on OpenGL 3.3, ARB_texture_swizzle, and EXT_texture_swizzle |
| |
| If OpenGL 3.3, ARB_texture_swizzle, and EXT_texture_swizzle are not |
| supported, remove the swizzling step from the definition of TXG and TXGO. |
| |
| Dependencies on ARB_blend_func_extended |
| |
| If ARB_blend_func_extended is not supported, references to the dual source |
| color output bindings (result.color.primary and result.color.secondary) |
| should be removed. |
| |
| Dependencies on EXT_shader_image_load_store |
| |
| EXT_shader_image_load_store provides OpenGL Shading Language mechanisms to |
| load/store to buffer and texture image memory, including spec language |
| describing memory access ordering and synchronization, a built-in function |
| (MemoryBarrierEXT) controlling synchronization of memory operations, and |
| spec language describing early fragment tests that can be enabled via GLSL |
| fragment shader source. These sections of the EXT_shader_image_load_store |
| specification apply equally to the assembly program memory accesses |
| provided by this extension. If EXT_shader_image_load_store is not |
| supported, the sections of that specification describing these features |
| should be considered to be added to this extension. |
| |
| EXT_shader_image_load_store additionally provides and documents assembly |
| language support for image loads, stores, and atomics as described in the |
| "Dependencies on NV_gpu_program5" section of EXT_shader_image_load_store. |
| The features described there are automatically supported for all |
| NV_gpu_program5 assembly programs without requiring any additional |
| "OPTION" line. |
| |
| Dependencies on ARB_shader_subroutine |
| |
| ARB_shader_subroutine provides and documents assembly language support for |
| subroutines as described in the "Dependencies on NV_gpu_program5" section |
| of ARB_shader_subroutine. The features described there are automatically |
| supported for all NV_gpu_program5 assembly programs without requiring any |
| additional "OPTION" line. |
| |
| |
| Issues |
| |
| (1) Are there any restrictions or performance concerns involving the |
| support for indexing textures or parameter buffers? |
| |
| RESOLVED: There are no significant functional limitations. Textures |
| and parameter buffers accessed with an index must be declared as arrays, |
| so the assembler knows which textures might be accessed this way. |
| Additionally, accessing an array of textures or parameter buffers with |
| an out-of-bounds index will yield undefined results. |
| |
| In particular, there is no limitation on the values used for indexing -- |
| they are not required to be true constants and are not required to have |
| the same value for all vertices/fragments in a primitive. However, |
| using divergent texture or parameter buffer indices may have performance |
| concerns. We expect that GPU implementations of this extension will run |
| multiple program threads in parallel (SIMD). If different threads in a |
| thread group have different indices, it will be necessary to do lookups |
| in more than one texture at once. This is likely to result in some |
| thread serialization. We expect that indexed texture or parameter |
| buffer access where all indices in a thread group match will perform |
| identically to non-indexed accesses. |
| |
| (2) Which texture instructions support programmable texel offsets, and |
| what offset limits apply? |
| |
| RESOLVED: Most texture instructions (TEX, TXB, TXF, TXG, TXL, TXP) |
| support both constant texel offsets as provided by NV_gpu_program4 and |
| programmable texel offsets. TXD supports only constant offsets. TXGO |
| does not support non-zero or programmable offsets in the texture portion |
| of the instruction, but provides full support for programmable offsets |
| via two of the three vector arguments in the regular instruction. |
| |
| For example, |
| |
| TEX result, coord, texture[0], 2D, (-1,-1); |
| |
| uses the NV_gpu_program4 mechanism to apply a constant texel offset of |
| (-1,-1) to the texture coordinates. With programmable offsets, the |
| following code applies the same offset. |
| |
| TEMP offxy; |
| MOV offxy, {-1, -1}; |
| TEX result, coord, texture[0], 2D, offset(offxy); |
| |
| Of course, the programmable form allows the offsets to be computed in |
| the program and does not require constant values. |
| |
| For most texture instructions, the range of allowable offsets is |
| [MIN_PROGRAM_TEXEL_OFFSET_EXT, MAX_PROGRAM_TEXEL_OFFSET_EXT] for both |
| constant and programmable texel offsets. Constant offsets can be |
| checked when the program is loaded, and out-of-bounds offsets cause the |
| program to fail to load. Programmable offsets can not have a |
| load-time range check; out-of-bounds offsets produce undefined results. |
| |
| Additionally, the new TXGO instruction has a separate (likely larger) |
| allowable offset range, [MIN_PROGRAM_TEXTURE_GATHER_OFFSET_NV, |
| MAX_PROGRAM_TEXTURE_GATHER_OFFSET_NV], that applies to the offset |
| vectors passed in its second and third operands. |
| |
| In the initial implementation of this extension, the range limits are |
| [-8,+7] for most instructions and [-32,+31] for TXGO. |
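| |
| The load-time check on constant offsets can be sketched in Python as |
| follows. The function name is hypothetical, and the ranges are the |
| initial-implementation values quoted above; an application should |
| query the actual limits rather than assume them. |
| |
```python
# Ranges quoted in the text for the initial implementation.
STANDARD_RANGE = (-8, 7)     # most texture instructions
GATHER_RANGE = (-32, 31)     # TXGO offset vectors


def constant_offset_ok(offset, limits=STANDARD_RANGE):
    """True if every component of a constant offset is within limits."""
    low, high = limits
    return all(low <= component <= high for component in offset)


in_range = constant_offset_ok((-1, -1))
out_of_range = constant_offset_ok((8, 0))
gather_ok = constant_offset_ok((-32, 31), GATHER_RANGE)
```
| |
| No equivalent check exists for programmable offsets, since their |
| values are not known until the program runs. |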
| |
| (3) What is TXGO (texture gather with separate offsets) good for? |
| |
| RESOLVED: TXGO allows for efficiently sampling a single-component |
| texture with a variety of offsets that need not be contiguous. |
| |
| For example, a shadow mapping algorithm using a high-resolution shadow |
| map may have pixels whose footprint covers a large number of texels in |
| the shadow map. Such pixels could do a single lookup into a |
| lower-resolution texture (using mipmapping), but quality problems will |
| arise. Alternately, a shader could perform a large number of texture |
| lookups using either NEAREST or LINEAR filtering from the |
| high-resolution texture. NEAREST filtering will require a separate |
| lookup for each texel accessed; LINEAR filtering may require somewhat |
| fewer lookups, but all accesses cover a 2x2 portion of the texture. The |
| TXG instruction added to NV_gpu_program4_1 allows a 2x2 block of texels |
| to be returned in a single instruction in case the program wants to do |
| something other than linear filtering with the samples. The TXGO |
| instruction allows a program to do semi-random sampling of the texture |
| without requiring that each sample cover a 2x2 block of texels. For |
| example, the TXGO instruction would allow a program to sample the four |
| texels A, H, J, and O from the 4x4 block depicted below: |
| |
| TXGO result, coord, {-1,+2,0,+1}, {-1,0,+1,+2}, texture[0], 2D; |
| |
| The "equivalent" TXG instruction would only sample the four center |
| texels F, G, J, and K: |
| |
| TXG result, coord, texture[0], 2D; |
| |
| All sixteen texels of the footprint could be sampled with four TXG |
| instructions, |
| |
| TXG result0, coord, texture[0], 2D, (-1,-1); |
| TXG result1, coord, texture[0], 2D, (-1,+1); |
| TXG result2, coord, texture[0], 2D, (+1,-1); |
| TXG result3, coord, texture[0], 2D, (+1,+1); |
| |
| but accessing a smaller number of samples spread across the footprint |
| with fewer instructions may produce results that are good enough. |
| |
| The figure here depicts a texture with texel (0,0) shown in the |
| upper-left corner. If you insist on a lower-left origin, please look at |
| this figure while standing on your head. |
| |
| (0,0) +-+-+-+-+ |
| |A|B|C|D| |
| +-+-+-+-+ |
| |E|F|G|H| |
| +-+-+-+-+ |
| |I|J|K|L| |
| +-+-+-+-+ |
| |M|N|O|P| |
| +-+-+-+-+ (4,4) |
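| |
| The texel selections in this example can be checked with a short |
| Python sketch. It assumes that the unoffset 2x2 footprint for <coord> |
| is F, G, J, K and that F at (1,1) is its i0j0 texel; the function |
| names are hypothetical, and wrap modes, filtering, and the component |
| order of the result vector are not modeled. |
| |
```python
# GRID[y][x], with texel (0,0) in the upper-left corner as in the figure.
GRID = ["ABCD", "EFGH", "IJKL", "MNOP"]

BASE = (1, 1)  # assumed i0j0 texel (F) of the unoffset footprint


def txgo_texels(base, offsets_x, offsets_y):
    """One texel per (x, y) offset pair, relative to the base texel."""
    bx, by = base
    return [GRID[by + dy][bx + dx] for dx, dy in zip(offsets_x, offsets_y)]


def txg_footprint(base, offset=(0, 0)):
    """The 2x2 block of texels whose i0j0 texel is base + offset."""
    bx, by = base[0] + offset[0], base[1] + offset[1]
    return {GRID[by + j][bx + i] for j in (0, 1) for i in (0, 1)}


# TXGO with offset vectors {-1,+2,0,+1} and {-1,0,+1,+2}.
picked = txgo_texels(BASE, (-1, 2, 0, 1), (-1, 0, 1, 2))

# TXG without an offset covers the four center texels.
center = txg_footprint(BASE)

# Four TXG instructions with the offsets listed above cover the block.
covered = set()
for off in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    covered |= txg_footprint(BASE, off)
```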
| |
| (4) Why are the results of TXGO (texture gather with separate offsets) |
| undefined if the wrap mode is CLAMP or MIRROR_CLAMP_EXT? |
| |
| RESOLVED: The CLAMP and MIRROR_CLAMP_EXT wrap modes are fairly |
| different from other wrap modes. After adding any instruction offsets, |
| the spec says to pre-clamp the (u,v) coordinates to [0,texture_size] |
| before generating the footprint. If such clamping occurs on one edge |
| for a normal texture filtering operation, the footprint ends up being |
| half border texels, half edge texels, and the clamping effectively |
| forces the interpolation weights used for texture filtering to 50/50. |
| |
| We expect the TXG instruction to be used in cases where an application |
| may want to do custom filtering, and is in control of its own filtering |
| weights. Coordinate clamping as above will affect the footprint used |
| for filtering, but not the weights. In the NV_gpu_program4_1 spec, we |
| defined the TXG/CLAMP combination to simply return the "normal" |
| footprint produced after the pre-clamp operation above. Any adjustment |
| of weights due to clamping is the responsibility of the application. We |
| don't expect this to be a common operation, because CLAMP_TO_EDGE or |
| CLAMP_TO_BORDER are much more sensible wrap modes. |
| |
| The hardware implementing TXGO is anticipated to extract all four |
| samples in a single pass. However, the spec language is defined for |
| simplicity to perform four separate "gather" operations with the four |
| provided offsets, extract a single sample from each, and combine the |
| four samples into a vector. This would require four separate pre-clamp |
| operations, which was deemed too costly to implement in hardware for a |
| wrap mode that doesn't work well with texture gather operations. Even |
| if such hardware were built, it still wouldn't obtain a footprint |
| resembling the half-border, half-edge footprint for simple TXGO offsets |
| -- that would require different per-texel clamping rules for the four |
| samples. We chose to leave the results of this operation undefined. |
| |
| (5) Should double-precision floating-point support be required or |
| optional? If optional, how? |
| |
| RESOLVED: Double-precision floating-point support will be optional in |
| case low-end GPUs supporting the remainder of these instruction features |
| choose to cut costs by removing the silicon necessary to implement |
| 64-bit floating-point arithmetic. |
| |
| (6) While this extension supports double-precision computation, how can |
| you provide high-precision inputs and outputs to the GPU programs? |
| |
| RESOLVED: The underlying hardware implementing this extension does not |
| provide full support for 64-bit floats, even though DOUBLE is a standard |
| data type provided by the GL. For example, when specifying a vertex |
| array with a data type of DOUBLE, the vertex attribute components will |
| end up being converted to 32-bit floats (FLOAT) by the driver before |
| being passed to the hardware, and the extra precision in the original |
| 64-bit float values will be lost. |
| |
| For vertex attributes, the EXT_vertex_attrib_64bit and |
| NV_vertex_attrib_integer_64bit extensions provide the ability to specify |
| 64-bit vertex attribute components using the VertexAttribL* and |
| VertexAttribLPointer APIs. Such attributes can be read in a vertex |
| program using a "LONG ATTRIB" declaration: |
| |
| LONG ATTRIB vector64; |
| |
| The LONG modifier can only be used for vertex program inputs, and can not |
| be used for inputs of any other program type or outputs of any program |
| type. |
| |
| For other cases, this extension provides the PK64 and UP64 instructions |
| that provide a mechanism to pass 64-bit components using consecutive |
| 32-bit components. For example, a 3-component vector with 64-bit |
| components can be passed to a vertex shader using multiple vertex |
| attributes without using the VertexAttribL APIs with the following code: |
| |
| /* Pass the X/Y components in vertex attribute 0 (X/Y/Z/W). Use |
| stride to skip over Z. */ |
| glVertexAttribPointer(0, 4, GL_FLOAT, GL_FALSE, 3*sizeof(GLdouble), |
| (GLdouble *) buffer); |
| |
| /* Pass the Z components in vertex attribute 1 (X/Y). Use stride to |
| skip over original X/Y components. */ |
| glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, 3*sizeof(GLdouble), |
| (GLdouble *) buffer + 2); |
| |
| In this example, the vertex program would use the PK64 instruction to |
| reconstruct the 64-bit value for each component as follows: |
| |
| LONG TEMP reconstructed; |
| PK64 reconstructed.xy, vertex.attrib[0]; |
| PK64 reconstructed.z, vertex.attrib[1]; |
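| |
| The memory layout this example relies on can be sketched in Python. |
| The sketch assumes a little-endian system and tightly packed vertices |
| of three doubles (24 bytes); attribute_words and pk64 are hypothetical |
| helpers that mimic the attribute fetch and the PK64 instruction, and |
| the 32-bit components are handled as raw bit patterns. |
| |
```python
import struct

STRIDE = 3 * 8  # 3*sizeof(GLdouble), as in the calls above


def attribute_words(buffer, vertex, first_word, count):
    """Fetch `count` raw 32-bit words for one vertex of one attribute."""
    start = vertex * STRIDE + 4 * first_word
    return struct.unpack_from("<%dI" % count, buffer, start)


def pk64(low_word, high_word):
    """Little-endian PK64 sketch: two 32-bit words to one double."""
    return struct.unpack("<d", struct.pack("<2I", low_word, high_word))[0]


# Two vertices, each an (X, Y, Z) triple of doubles.
buffer = struct.pack("<6d", 1.5, -2.25, 3.125, 4.0, 5.0, 6.0)


def reconstruct(vertex):
    # Attribute 0 starts at the vertex base: X low/high, Y low/high.
    a0 = attribute_words(buffer, vertex, 0, 4)
    # Attribute 1 starts two doubles (four words) in: Z low/high.
    a1 = attribute_words(buffer, vertex, 4, 2)
    return (pk64(a0[0], a0[1]), pk64(a0[2], a0[3]), pk64(a1[0], a1[1]))


first = reconstruct(0)
second = reconstruct(1)
```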
| |
| A similar technique can be used to pass 64-bit values computed by a GPU |
| program, using transform feedback or writes to a color buffer. The UP64 |
| instruction would be used to convert the 64-bit computed value into two |
| 32-bit values, which would be written to adjacent components. |
| |
| Note also that the original hardware implementation of this extension |
| does not support interpolation of 64-bit floating-point values. If an |
| application desires to pass a 64-bit floating-point value from a vertex |
| or geometry program to a fragment program, and doesn't require |
| interpolation, the PK64/UP64 techniques can be combined. For example, |
| the vertex shader could unpack a 3-component vector with 64-bit |
| components into a four-component and a two-component 32-bit vector: |
| |
| LONG TEMP result64; |
| OUTPUT result32[2] = { result.attrib[0..1] }; |
| UP64 result32[0], result64.xyxy; |
| UP64 result32[1].xy, result64.z; |
| |
| The fragment program would read and reconstruct using PK64: |
| |
| LONG TEMP input64; |
| FLAT ATTRIB input32[2] = { fragment.attrib[0..1] }; |
| PK64 input64.xy, input32[0]; |
| PK64 input64.z, input32[1]; |
| |
| Note that such inputs must be declared as "FLAT" in the fragment program |
| to prevent the hardware from trying to do floating-point interpolation |
| on the separate 32-bit halves of the value being passed. Such |
| interpolation would produce complete garbage. |
| |
| (7) What are instanced geometry programs useful for? |
| |
| RESOLVED: Instanced geometry programs allow geometry programs that |
| perform regular operations to run more efficiently. |
| |
| Consider a simple example of an algorithm that uses geometry programs to |
| render primitives to a cube map in a single pass. Without instanced |
| geometry programs, the geometry program to render triangles to the cube |
| map would do something like: |
| |
| for (face = 0; face < 6; face++) { |
| for (vertex = 0; vertex < 3; vertex++) { |
| project vertex <vertex> onto face <face>, output position |
| compute/copy attributes of emitted <vertex> to outputs |
| output <face> to result.layer |
| emit the projected vertex |
| } |
| end the primitive (next triangle) |
| } |
| |
| This algorithm would output 18 vertices per input triangle, three for |
| each cube face. The six triangles emitted would be rasterized, one per |
| face. Geometry programs that emit a large number of attributes have |
| often posed performance challenges, since all the attributes must be |
| stored somewhere until the emitted primitives are completed. Large storage |
| requirements may limit the number of threads that can be run in parallel |
| and reduce overall performance. |
| |
| Instanced geometry programs allow this example to be restructured to run |
| with six separate threads, one per face. Each thread projects the |
| triangle to only a single face (identified by the invocation number) and |
| emits only 3 vertices. The reduced storage requirements allow more |
| geometry program threads to be run in parallel, with greater overall |
| efficiency. |
| |
| Additionally, the total number of attributes that can be emitted by a |
| single geometry program invocation is limited. However, for instanced |
| geometry shaders, that limit applies to each of <N> program invocations |
| which allows for a larger total output. For example, if the GL |
| implementation supports only 1024 components of output per program |
| invocation, the 18-vertex algorithm above could emit no more than 56 |
| components per vertex. The same algorithm implemented as a 3-vertex |
| 6-invocation geometry program could theoretically allow for 341 |
| components per vertex. |
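| |
| The arithmetic behind these figures is a simple integer division, |
| sketched here in Python. The function name is hypothetical, and the |
| 1024-component limit is the example value used above. |
| |
```python
def max_components_per_vertex(total_component_limit, vertices_emitted):
    """Largest per-vertex component count that fits within the limit."""
    return total_component_limit // vertices_emitted


# One invocation emitting all 18 vertices.
single_invocation = max_components_per_vertex(1024, 18)

# Six invocations emitting 3 vertices each; the limit applies to each.
per_invocation = max_components_per_vertex(1024, 3)
```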
| |
| (8) What are the special interpolation opcodes (IPAC, IPAO, IPAS) good |
| for, and how do they work? |
| |
| RESOLVED: The interpolation opcodes allow programs to control the |
| frequency and location at which fragment inputs are sampled. Previous |
| extensions provided some control, but it was more limited. |
| NV_gpu_program4 had an interpolation modifier (CENTROID) |
| that allowed attributes to be sampled inside the primitive, but that was |
| a per-attribute modifier -- you could only sample any given attribute at |
| one location. NV_gpu_program4_1 added a new interpolation modifier |
| (SAMPLE) that directed that fragment programs be run once per sample, |
| and that the specified attributes be interpolated at the sample |
| location. Per-sample interpolation can produce higher quality, but the |
| performance cost is significant since more fragment program invocations |
| are required. |
| |
| This extension provides additional control over interpolation, and |
| allows programs to interpolate attributes at different locations without |
| necessarily requiring the performance hit of per-sample invocation. |
| |
| The IPAC instruction allows an attribute to be sampled at the centroid |
| location, while still allowing the same attribute to be sampled |
| elsewhere. The IPAS instruction allows the attribute to be sampled at a |
| numbered sample location, as per-sample interpolation would do. Multiple |
| IPAS instructions with different sample numbers allow a program to |
| sample an attribute at multiple sample points in the pixel and then |
| combine the samples in a programmable manner, which may allow for higher |
| quality than simply interpolating at a single representative point in |
| the pixel. The IPAO instruction allows the attribute to be sampled at |
| an arbitrary (x,y) offset relative to the pixel center. The range of |
| supported (x,y) values is limited, and the limits in the initial |
| implementation are not large enough to permit sampling the attribute |
| outside the pixel. |
| |
| Note that previous instruction sets allowed shaders to fake IPAC, |
| IPAS, and IPAO by a sequence such as: |
| |
| TEMP ddx, ddy, offset, interp; |
| MOV interp, fragment.attrib[0]; # start with center |
| DDX ddx, fragment.attrib[0]; |
| MAD interp, offset.x, ddx, interp; # add offset.x * dA/dx |
| DDY ddy, fragment.attrib[0]; |
| MAD interp, offset.y, ddy, interp; # add offset.y * dA/dy |
| |
| However, this method does not apply perspective correction. The quality |
| of the results may be unacceptable, particularly for primitives that are |
| nearly perpendicular to the screen. |
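| |
| The loss of perspective correction can be illustrated with a small |
| Python sketch. It models a one-dimensional span with a normalized |
| screen coordinate x in [0,1], end-point attribute values a0 and a1, |
| and clip-space w values w0 and w1; all names are hypothetical, a |
| numeric derivative stands in for DDX, and the offset is exaggerated |
| relative to the sub-pixel IPAO range to make the error visible. |
| |
```python
def interpolate(x, a0, a1, w0, w1):
    """Perspective-correct interpolation of an attribute at x."""
    numerator = (1.0 - x) * a0 / w0 + x * a1 / w1
    denominator = (1.0 - x) / w0 + x / w1
    return numerator / denominator


def faked(x_center, offset, a0, a1, w0, w1, h=1e-4):
    """Center value plus offset times a screen-space derivative."""
    center = interpolate(x_center, a0, a1, w0, w1)
    ddx = (interpolate(x_center + h, a0, a1, w0, w1)
           - interpolate(x_center - h, a0, a1, w0, w1)) / (2.0 * h)
    return center + offset * ddx


# Equal w at both ends: interpolation is linear, so the fake is exact.
flat_error = abs(faked(0.5, 0.25, 0.0, 1.0, 1.0, 1.0)
                 - interpolate(0.75, 0.0, 1.0, 1.0, 1.0))

# Strongly varying w: the linear extrapolation misses the true value.
steep_error = abs(faked(0.5, 0.25, 0.0, 1.0, 1.0, 10.0)
                  - interpolate(0.75, 0.0, 1.0, 1.0, 10.0))
```
| |
| The error grows with the variation in w across the primitive, which |
| matches the observation above about primitives that are nearly |
| perpendicular to the screen. |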
| |
| The semantics of the first operand of these instructions is different |
| from normal assembly instructions. Operands are normally evaluated by |
| loading the value of the corresponding variable and applying any |
| swizzle/negation/absolute value modifier before the instruction is |
| executed. In the IPAC/IPAO/IPAS instructions, the value of the |
| attribute is evaluated by the instruction itself. Swizzles, negation, |
| and absolute value modifiers are still allowed, and are applied after |
| the attribute values are interpolated. |
| |
| (9) When using a program that issues global stores (via the STORE |
| instruction), what amount of execution ordering is guaranteed? How |
| can an application ensure that writes executed in a shader have |
| completed and will be visible to other operations using the buffer |
| object in question? |
| |
| RESOLVED: There are very few automatic guarantees for potential |
| write/read or write/write conflicts. Program invocations will |
| generally run in arbitrary order, and applications can't rely on |
| read/write order to match primitive order. |
| |
| To get consistent results when buffers are read and written using |
| multiple pipeline stages, manual synchronization using the |
| MemoryBarrierEXT() API documented in EXT_shader_image_load_store or some |
| other synchronization primitive is necessary. |
| |
| (10) Unlike most other shader features, the STORE opcode allows for |
| externally-visible side effects from executing a program. How does |
| this capability interact with other features of the GL? |
| |
| RESOLVED: First, some GL implementations support a variety of "early Z" |
| optimizations designed to minimize unnecessary fragment processing work, |
| such as executing an expensive fragment program on a fragment that will |
| eventually fail the depth test. Such optimizations have been valid |
| because fragment programs had no side effects. That is no longer the |
| case, and such optimizations may not be employed if the fragment program |
| performs a global store. However, we provide a new "early depth and |
| stencil test" enable that allows applications to deterministically |
| control depth and stencil testing. If enabled, depth and stencil tests |
| are always performed prior to fragment program execution. Fragment |
| programs will never be run on fragments that fail any of these tests. |
| |
| Second, we are permitting global stores in all program types; however, |
| the number of program invocations is not well-defined for some program |
| types. For example, a GL implementation may choose to combine multiple |
| instances of identical vertices (e.g., duplicate indices in |
| DrawElements, immediate-mode vertices with identical data) into one |
| single vertex program invocation, or it may run a vertex program on each |
| separately. Similarly, the tessellation primitive generator will |
| generate independent primitives with duplicated vertices, which may or |
| may not be combined for tessellation evaluation program execution. |
| Fragment program execution also has several issues described in more |
| detail below. |
| |
| (11) What issues arise when running fragment programs doing global stores? |
| |
| RESOLVED: The order of per-fragment operations in the existing OpenGL |
| 3.0 specification can be fairly loose, because previously-defined |
| fragment programs, shaders, and fixed-function fragment processing had |
| no side effects. With side effects, the order of operations must be |
| defined more tightly. In particular, the pixel ownership and scissor |
| tests are specified to be performed prior to fragment program execution, |
| and we provide an option to perform depth and stencil tests early as |
| well. |
| |
| OpenGL implementations sometimes run fragment programs on "helper" |
| pixels that have no coverage in order to be able to compute sane partial |
| derivatives for fragment program instructions (DDX, DDY) or automatic |
| level-of-detail calculation for texturing. In this approach, |
| derivatives are approximated by computing the difference in a quantity |
| computed for a given fragment at (x,y) and a fragment at a neighboring |
| pixel. When a fragment program is executed on a "helper" pixel, global |
| stores have no effect. Helper pixels aren't explicitly mentioned in the |
| spec body; instead, partial derivatives are obtained by magic. |
| |
| If a fragment program contains a KIL instruction, compilers may not |
| reorder code such that an ATOM or STORE instruction is executed before a |
| KIL instruction that logically precedes it in flow control. Once a fragment |
| is killed, subsequent atomics or stores should never be executed. |
| |
| Multisample rasterization poses several issues for fragment programs |
| with global stores. The number of times a fragment program is executed |
| for multisample rendering is not fully specified, which gives |
| implementations a number of different choices -- pure multisample (only |
| runs once), pure supersample (runs once per covered sample), or modes in |
| between. There are some ways for an application to indirectly control |
| the behavior -- for example, fragment programs specifying per-sample |
| attribute interpolation are guaranteed to run once per covered sample. |
| |
| Note that when rendering to a multisample buffer, a pair of adjacent |
| triangles may cause a fragment program to be executed more than once at |
| a given (x,y) with different sets of samples covered. This can also |
| occur in the interior of a quadrilateral or polygon primitive. |
| Implementations are permitted to split quads and polygons with >3 |
| vertices into triangles, creating interior edges that split a pixel. |
| |
| (12) What happens if early fragment tests are enabled, the early depth |
| test passes, and a fragment program that computes a new depth value |
| is executed? |
| |
| RESOLVED: The depth value produced by the fragment program has no |
| effect if early fragment tests are enabled. The depth value computed by |
| a fragment program is used only by the post-fragment program stencil and |
| depth tests, and those tests always have no effect when early depth |
| testing is enabled. |
| |
| (13) How do early fragment tests interact with occlusion queries? |
| |
| RESOLVED: When early fragment tests are enabled, sample counting for |
| occlusion queries also happens prior to fragment program execution. |
| Enabling early fragment tests can change the overall sample count, |
| because samples killed by alpha test and alpha to coverage will still be |
| counted if early fragment tests are enabled. |
| |
| (14) What happens if a program performs a global store to a GPU address |
| corresponding to a read-only buffer mapping? What if it performs a |
| global read to a write-only mapping? |
| |
| RESOLVED: Implementations may choose to implement full memory protection, |
| in which case accesses using the wrong type of memory mapping will fault |
| and lead to termination of the application. |
| |
| However, full memory protection is not required in this extension -- |
| implementations may choose to substitute a read-write mapping in place |
| of a read-only or write-only mapping. As a result, we specify the |
| result of such invalid loads and stores to be undefined. |
| |
| Note that if a program erroneously writes to nominally read-only |
| mappings, the results may be weird. If the implementation substitutes a |
| read-write mapping, such invalid writes are likely to proceed normally. |
| However, if the application later makes a buffer object non-resident and |
| the memory manager of the GL implementation needs to move the buffer, |
| the GL may assume that the contents of the buffer have not been modified |
| and thus discard the new values written by the (invalid) global store |
| instructions. |
| |
| (15) What performance considerations apply to atomics? |
| |
| RESOLVED: Atomics can be useful for operations like locking, or for |
| maintaining counters. Note that high-performance GPUs may have hundreds |
| of program threads in flight at once, and may also have some SIMD |
| characteristics (where threads are grouped and run as a unit). Using |
| ATOM instructions with a single memory address to implement a critical |
| section will result in serial execution -- only one of the hundreds of |
| threads can execute code in the critical section at a time. |
| |
| When a global operation would be done under a lock, it may be possible |
| to improve performance if the algorithm can be parallelized to have |
| multiple critical sections. For example, an application could allocate |
| an array of shared resources, each protected by its own lock, and use |
| the LSBs of the primitive ID or some function of the screen-space (x,y) |
| to determine which resource in the array to use. |
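| |
| A minimal sketch of the index selection described above, written in C |
| for illustration only. The stripe count, the function names, and the |
| striped_counter type are hypothetical. The code models only how a |
| resource is chosen and how per-resource results are combined afterward; |
| it does not model the ATOM-based lock itself. |

```c
#include <assert.h>

#define NUM_STRIPES 16u  /* hypothetical count; must be a power of two */

/* Choose a resource from the low-order bits of the primitive ID. */
unsigned stripe_from_primitive(unsigned primitive_id)
{
    return primitive_id & (NUM_STRIPES - 1u);
}

/* Choose a resource from screen-space (x,y). Every pixel in an aligned
 * 4x4 tile maps to a different one of the 16 resources. */
unsigned stripe_from_xy(unsigned x, unsigned y)
{
    return ((y & 3u) << 2) | (x & 3u);
}

/* One partial counter per resource, standing in for the shared array. */
typedef struct {
    unsigned long counts[NUM_STRIPES];
} striped_counter;

/* Add to a single resource; contention is limited to that resource. */
void striped_add(striped_counter *c, unsigned stripe, unsigned long value)
{
    assert(stripe < NUM_STRIPES);
    c->counts[stripe] += value;
}

/* Combine the per-resource partial results into one total. */
unsigned long striped_total(const striped_counter *c)
{
    unsigned long total = 0;
    for (unsigned i = 0; i < NUM_STRIPES; ++i)
        total += c->counts[i];
    return total;
}
```

| With 16 resources, at most one sixteenth of the threads would be |
| expected to contend for any one lock if the indices are evenly spread. |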
| |
| (16) The atomic instruction ATOM returns the old contents of memory into |
| the result register. Should we provide a version of this opcode |
| that doesn't return a value? |
| |
| RESOLVED: No. In theory, atomics that don't return any values can |
| perform better (because the program may not need to allocate resources |
| to hold a result or wait for the result). However, a new opcode isn't |
| required to obtain this behavior -- a compiler can recognize that the |
| result of an ATOM instruction is written to a "dummy" temporary that |
| isn't read by subsequent instructions: |
| |
| TEMP junk; |
| ATOM.ADD.U32 junk, address, 1; |
| |
| The compiler can also recognize that the result will always be discarded |
| if a conditional write mask of "(FL)" is used. |
| |
| ATOM.ADD.U32 not_junk (FL), address, 1; |
| |
| (17) How do we ensure that memory accesses made by multiple program |
| invocations of possibly different types are coherent? |
| |
| RESOLVED: Atomic instructions allow program invocations to coordinate |
| using shared global memory addresses. However, memory transactions, |
| including atomics, are not guaranteed to land in the order specified in |
| the program; they may be reordered by the compiler, cached in different |
| memory hierarchies, and stored in a distributed memory system where |
| later stores to one "partition" might be completed prior to earlier |
| stores to another. The MEMBAR instruction helps control memory |
| transaction ordering by ensuring that all memory transactions prior to |
| the barrier complete before any after the barrier. Additionally the |
| ".COH" modifier ensures that memory transactions using the modifier are |
| cached coherently and will be visible to other shader invocations. |
| |
| (18) How do the TXG and TXGO opcodes work with sRGB textures? |
| |
| RESOLVED: Gamma-correction is applied to the texture source color |
| before "gathering" and hence applies to all four components, unless |
| the texture swizzle of the selected component is ALPHA in which case |
| no gamma-correction is applied. |
| |
| (19) How can render-to-texture algorithms take advantage of |
| MemoryBarrierEXT, nominally provided for global memory transactions? |
| |
| RESOLVED: Many algorithms use RTT to ping-pong between two allocations, |
| using the result of one rendering pass as the input to the next. |
| Existing mechanisms require expensive FBO Binds, DrawBuffer changes, or |
| FBO attachment changes to safely swap the render target and texture. With |
| memory barriers, layered geometry shader rendering, and texture arrays, |
| an application can very cheaply ping-pong between two layers of a single |
| texture. For example: |
| |
| X = 0; |
| // Bind the array texture to a texture unit |
| // Attach the array texture to an FBO using FramebufferTextureARB |
| while (!done) { |
| // Stuff X in a constant, vertex attrib, etc. |
| Draw - |
| Texturing from layer X; |
| Writing gl_Layer = 1 - X in the geometry shader; |
| |
| MemoryBarrierEXT(TEXTURE_FETCH_BARRIER_BIT_EXT); |
| X = 1 - X; |
| } |
| |
| However, be warned that this requires geometry shaders and hence adds |
| the overhead that all geometry must pass through an additional program |
| stage, so an application using large amounts of geometry could become |
| geometry-limited or more shader-limited. |
| |
| (20) What is the ".PREC" instruction modifier good for? |
| |
| RESOLVED: ".PREC" provides some invariance guarantees that are useful for |
| certain algorithms. Using ".PREC", it is possible to ensure that an |
| algorithm can be written to produce identical results on subtly |
| different inputs. For example, the order of vertices visible to a |
| geometry or tessellation shader used to subdivide primitive edges might |
| present an edge shared between two primitives in one direction for one |
| primitive and the other direction for the adjacent primitive. Even if |
| the weights are identical in the two cases, there may be cracking if the |
| computations are being done in an order-dependent manner. If the |
| position of a new vertex were evaluated with limited-precision |
| floating-point math, it's not necessarily the case that we will get |
| the same result for inputs (a,b,c) and (c,b,a) in the following code: |
| |
| ADD result, a, b; |
| ADD result, result, c; |
| |
| There are two problems with this code: the rounding errors will be |
| different and the implementation is free to rearrange the computation |
| order. The code can be rewritten as follows with ".PREC" and a |
| symmetric evaluation order to ensure a precise result with the inputs |
| reversed: |
| |
| ADD result, a, c; |
| ADD.PREC result, result, b; |
| |
| Note that in this example, the first instruction doesn't need the |
| ".PREC" qualifier because the second instruction requires that the |
| implementation compute <a>+<c>, which will be done reliably if <a> and |
| <c> are inputs. If <a> and <c> were results of other computations, the |
| first add and possibly the dependent computations may also need to be |
| tagged with ".PREC" to ensure reliable results. |
| |
| The ".PREC" modifier will disable certain optimizations and thus carries |
| a performance cost. |
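| |
| The order dependence described above can be reproduced on a CPU. The C |
| sketch below is an illustration only: the function names are |
| hypothetical, and "volatile" is used merely to force each intermediate |
| to be rounded to single precision. It does not model the ".PREC" |
| modifier itself, only the rounding behavior that motivates it. |

```c
/* Left-to-right sum, (x + y) + z, mirroring the first code sequence. */
float sum_in_order(float x, float y, float z)
{
    volatile float t = x + y;   /* rounded to float before the next add */
    volatile float r = t + z;
    return r;
}

/* Symmetric sum, (x + z) + y, mirroring the rewritten sequence: the two
 * end inputs are added first, so swapping x and z cannot change the
 * result. */
float sum_symmetric(float x, float y, float z)
{
    volatile float t = x + z;
    volatile float r = t + y;
    return r;
}
```

| For the inputs (1e8, -1e8, 1), sum_in_order() returns 1 but returns 0 |
| when the inputs are reversed, because 1 - 1e8 rounds to -1e8 in single |
| precision. sum_symmetric() returns 0 for both orderings. |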
| |
| (21) What are the TGALL, TGANY, TGEQ instructions good for? |
| |
| RESOLVED: If an implementation performs SIMD thread execution, |
| divergent branching may result in reduced performance if the "if" and |
| "else" blocks of an "if" statement are executed sequentially. For |
| example, an algorithm may have both a "fast path" that performs a |
| computation quickly for a subset of all cases and a "slow path" that |
| performs a computation slowly but correctly. When performing SIMD |
| execution, code like the following: |
| |
| SNE.S.CC cc.x, condition.x; |
| IF NE.x; |
| # do fast path |
| ELSE; |
| # do slow path |
| ENDIF; |
| |
| may end up executing *both* the fast and slow paths for a SIMD thread |
| group if <condition> diverges, and may execute more slowly than simply |
| executing the slow path unconditionally. These instructions allow code |
| like: |
| |
| # Condition code matches NE if and only if condition.x is non-zero |
| # for all threads. |
| TGALL.S.CC cc.x, condition.x; |
| IF NE.x; |
| # do fast path |
| ELSE; |
| # do slow path |
| ENDIF; |
| |
| that executes the fast path if and only if it can be used for *all* |
| threads in the group. For thread groups where <condition> diverges, |
| this algorithm would unconditionally run the slow path, but would never |
| run both in sequence. |
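| |
| The trade-off can be modeled on a CPU with a small C sketch. It is an |
| illustration only: a thread group is represented as an array of |
| per-thread condition values, and all names are hypothetical. It counts |
| how many paths a group would execute in sequence with a per-thread |
| branch, and shows which single path a group-wide "all" vote selects. |

```c
#include <assert.h>
#include <stddef.h>

enum path { SLOW_PATH = 0, FAST_PATH = 1 };

/* Group-wide vote: nonzero only if the condition holds for every thread. */
int group_all(const int *cond, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; ++i)
        if (!cond[i])
            return 0;
    return 1;
}

/* Per-thread branch: the group executes each path that at least one
 * thread takes, one after the other. Returns the number of paths run. */
int paths_run_per_thread_branch(const int *cond, size_t n)
{
    int any_fast = 0, any_slow = 0;
    for (size_t i = 0; i < n; ++i) {
        if (cond[i])
            any_fast = 1;
        else
            any_slow = 1;
    }
    return any_fast + any_slow;
}

/* Vote-based branch: the whole group takes exactly one path. */
enum path path_with_group_vote(const int *cond, size_t n)
{
    return group_all(cond, n) ? FAST_PATH : SLOW_PATH;
}
```

| For a divergent group such as {1, 0, 1, 1}, the per-thread branch runs |
| two paths in sequence, while the vote selects only the slow path. |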
| |
| |
| Revision History |
| |
| Rev. Date Author Changes |
| ---- -------- -------- ----------------------------------------- |
| 7 09/11/14 pbrown Minor typo fixes. |
| |
| 6 07/04/13 pbrown Add missing language describing the |
| <texImageUnitComp> grammar rule for component |
| selection in TXG and TXGO instructions. |
| |
| 5 09/23/10 pbrown Add missing constants for {MIN,MAX}_PROGRAM_ |
| TEXTURE_GATHER_OFFSET_NV (same as ARB/core). |
| Add missing description for "su" in the opcode |
| table; fix a couple operand order bugs for |
| STORE. |
| |
| 4 06/22/10 pbrown Specify that the y/z/w component of the ATOM |
| results are undefined, as is the case with |
| ATOMIM from EXT_shader_image_load_store. |
| |
| 3 04/13/10 pbrown Remove F32 support from ATOM.ADD. |
| |
| 2 03/22/10 pbrown Various wording updates to the spec overview, |
| dependencies, issues, and body. Remove various |
| spec language that has been refactored into the |
| EXT_shader_image_load_store specification. |
| |
| 1 pbrown Internal revisions. |