| Name |
| |
| NV_fragment_shader_interlock |
| |
| Name Strings |
| |
| GL_NV_fragment_shader_interlock |
| |
| Contact |
| |
| Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com) |
| |
| Contributors |
| |
| Jeff Bolz, NVIDIA Corporation |
| Mathias Heyer, NVIDIA Corporation |
| |
| Status |
| |
| Shipping |
| |
| Version |
| |
| Last Modified Date: March 27, 2015 |
| NVIDIA Revision: 2 |
| |
| Number |
| |
| OpenGL Extension #468 |
| OpenGL ES Extension #230 |
| |
| Dependencies |
| |
| This extension is written against the OpenGL 4.3 Specification
| (Compatibility Profile, dated February 14, 2013) and the
| OpenGL ES 3.1.0 Specification (dated March 17, 2014).
| |
| This extension is written against the OpenGL Shading Language |
| Specification (version 4.30, revision 8) and the OpenGL ES Shading |
| Language Specification (version 3.10, revision 2). |
| |
| OpenGL 4.3 and GLSL 4.30 are required in an OpenGL implementation.
| OpenGL ES 3.1 and GLSL ES 3.10 are required in an OpenGL ES implementation.
| |
| This extension interacts with NV_shader_buffer_load and |
| NV_shader_buffer_store. |
| |
| This extension interacts with NV_gpu_program4 and NV_gpu_program5. |
| |
| This extension interacts with EXT_tessellation_shader. |
| |
| This extension interacts with OES_sample_shading.
| |
| This extension interacts with OES_shader_multisample_interpolation.
| |
| This extension interacts with OES_shader_image_atomic.
| |
| Overview |
| |
| In unextended OpenGL 4.3 or OpenGL ES 3.1, applications may produce a |
| large number of fragment shader invocations that perform loads and |
| stores to memory using image uniforms, atomic counter uniforms, |
| buffer variables, or pointers. The order in which loads and stores |
| to common addresses are performed by different fragment shader |
| invocations is largely undefined. For algorithms that use shader |
| writes and touch the same pixels more than once, one or more of the |
| following techniques may be required to ensure proper execution ordering: |
| |
| * inserting Finish or WaitSync commands to drain the pipeline between |
| different "passes" or "layers"; |
| |
| * using only atomic memory operations to write to shader memory (which |
| may be relatively slow and limits how memory may be updated); or |
| |
| * injecting spin loops into shaders to prevent multiple shader |
| invocations from touching the same memory concurrently. |
| |
| This extension provides new GLSL built-in functions |
| beginInvocationInterlockNV() and endInvocationInterlockNV() that delimit a |
| critical section of fragment shader code. For pairs of shader invocations |
| with "overlapping" coverage in a given pixel, the OpenGL implementation |
| will guarantee that the critical section of the fragment shader will be |
| executed for only one fragment at a time. |
| |
| There are four different interlock modes supported by this extension, |
| which are identified by layout qualifiers. The qualifiers
| "pixel_interlock_ordered" and "pixel_interlock_unordered" provide mutual
| exclusion in the critical section for any pair of fragments corresponding |
| to the same pixel. When using multisampling, the qualifiers |
| "sample_interlock_ordered" and "sample_interlock_unordered" only provide |
| mutual exclusion for pairs of fragments that both cover at least one |
| common sample in the same pixel; these are recommended for performance if |
| shaders use per-sample data structures. |
| |
| Additionally, when the "pixel_interlock_ordered" or |
| "sample_interlock_ordered" layout qualifier is used, the interlock also |
| guarantees that the critical section for multiple shader invocations with |
| "overlapping" coverage will be executed in the order in which the |
| primitives were processed by the GL. Such a guarantee is useful for |
| applications like blending in the fragment shader, where an application
| requires fragment values to be composited in the framebuffer in
| primitive order. |
| |
| This extension can be useful for algorithms that need to access per-pixel |
| data structures via shader loads and stores. Such algorithms using this |
| extension can access such data structures in the critical section without |
| worrying about other invocations for the same pixel accessing the data |
| structures concurrently. Additionally, the ordering guarantees are useful |
| for cases where the API ordering of fragments is meaningful. For example, |
| applications may be able to execute programmable blending operations in |
| the fragment shader, where the destination buffer is read via image loads |
| and the final value is written via image stores. |
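
The programmable blending case described above can be sketched as a fragment shader along the following lines. This is an illustrative sketch rather than normative text: the image name u_dst, its binding and format, and the input v_color are assumptions made for the example. The blend is a plain source-over computation of the same form as BlendFunc(SRC_ALPHA, ONE_MINUS_SRC_ALPHA).

```glsl
#version 430
#extension GL_NV_fragment_shader_interlock : require

// Critical sections for the same pixel run one at a time, in primitive order.
layout(pixel_interlock_ordered) in;

// Hypothetical image standing in for the blend destination. "coherent"
// makes stores from one invocation visible to loads in a later one.
layout(binding = 0, rgba8) coherent uniform image2D u_dst;

in vec4 v_color;  // hypothetical interpolated source color, straight alpha

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);

    beginInvocationInterlockNV();
    vec4 dst = imageLoad(u_dst, p);
    // Source-over blend: src * a + dst * (1 - a), applied to all channels.
    vec4 result = v_color * v_color.a + dst * (1.0 - v_color.a);
    imageStore(u_dst, p, result);
    endInvocationInterlockNV();
}
```

The shader declares no fragment outputs; all framebuffer-like updates go through the image store inside the critical section.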
| |
| New Procedures and Functions |
| |
| None. |
| |
| New Tokens |
| |
| None. |
| |
| Modifications to the OpenGL 4.3 Specification (Compatibility Profile) |
| |
| None. |
| |
| Modifications to the OpenGL Shading Language Specification, Version 4.30 |
| |
| Including the following line in a shader can be used to control the |
| language features described in this extension: |
| |
| #extension GL_NV_fragment_shader_interlock : <behavior> |
| |
| where <behavior> is as specified in section 3.3. |
| |
| New preprocessor #defines are added to the OpenGL Shading Language: |
| |
| #define GL_NV_fragment_shader_interlock 1 |
| |
| |
| Modify Section 4.4.1.3, Fragment Shader Inputs (p. 58) |
| |
| (add to the list of layout qualifiers containing "early_fragment_tests", |
| p. 59, and modify the surrounding language to reflect that multiple |
| layout qualifiers are supported on "in") |
| |
| layout-qualifier-id |
| pixel_interlock_ordered |
| pixel_interlock_unordered |
| sample_interlock_ordered |
| sample_interlock_unordered |
| |
| (add to the end of the section, p. 59) |
| |
| The identifiers "pixel_interlock_ordered", "pixel_interlock_unordered", |
| "sample_interlock_ordered", and "sample_interlock_unordered" control the |
| ordering of the execution of shader invocations between calls to the |
| built-in functions beginInvocationInterlockNV() and |
| endInvocationInterlockNV(), as described in section 8.13.3. A |
| compile or link error will be generated if more than one of these layout |
| qualifiers is specified in shader code. If a program containing a |
| fragment shader includes none of these layout qualifiers, it is as |
| though "pixel_interlock_ordered" were specified. |
| |
| Add to the end of Section 8.13, Fragment Processing Functions (p. 168) |
| |
| 8.13.3, Fragment Shader Execution Ordering Functions |
| |
| By default, fragment shader invocations are generally executed in |
| undefined order. Multiple fragment shader invocations may be executed |
| concurrently, including multiple invocations corresponding to a single |
| pixel. Additionally, fragment shader invocations for a single pixel might |
| not be processed in the order in which the primitives generating the |
| fragments were specified in the OpenGL API. |
| |
| The paired functions beginInvocationInterlockNV() and |
| endInvocationInterlockNV() allow shaders to specify a critical section, |
| inside which stronger execution ordering is guaranteed. When using the |
| "pixel_interlock_ordered" or "pixel_interlock_unordered" qualifier, |
| ordering guarantees are provided for any pair of fragment shader |
| invocations X and Y triggered by fragments A and B corresponding to the |
| same pixel. When using the "sample_interlock_ordered" or |
| "sample_interlock_unordered" qualifier, ordering guarantees are provided |
| for any pair of fragment shader invocations X and Y triggered by fragments |
| A and B that correspond to the same pixel, where at least one sample of |
| the pixel is covered by both fragments. No ordering guarantees are |
| provided for pairs of fragment shader invocations corresponding to |
| different pixels. Additionally, no ordering guarantees are provided for |
| pairs of fragment shader invocations corresponding to the same fragment. |
| When multisampling is enabled and the framebuffer has sample buffers, |
| multiple fragment shader invocations may result from a single fragment due |
| to the use of the "sample" auxiliary storage qualifier, OpenGL API
| commands forcing multiple shader invocations per fragment, or for other |
| implementation-dependent reasons. |
| |
| When using the "pixel_interlock_unordered" or "sample_interlock_unordered" |
| qualifier, the interlock will ensure that the critical sections of |
| fragment shader invocations X and Y with overlapping coverage will never |
| execute concurrently. That is, invocation X is guaranteed to complete its |
| call to endInvocationInterlockNV() before invocation Y completes its call |
| to beginInvocationInterlockNV(), or vice versa. |
| |
| When using the "pixel_interlock_ordered" or "sample_interlock_ordered" |
| layout qualifier, the critical sections of invocations X and Y with |
| overlapping coverage will be executed in a specific order, based on the |
| relative order assigned to their fragments A and B. If fragment A is |
| considered to precede fragment B, the critical section of invocation X is |
| guaranteed to complete before the critical section of invocation Y begins. |
| When a pair of fragments A and B have overlapping coverage, fragment A is |
| considered to precede fragment B if |
| |
| * the OpenGL API command producing fragment A was called prior to the |
| command producing B, or |
| |
| * the point, line, triangle, [[compatibility profile: quadrilateral, |
| polygon,]] or patch primitive producing fragment A appears earlier in |
| the same strip, loop, fan, or independent primitive list producing |
| fragment B. |
| |
| When [[compatibility profile: decomposing quadrilateral or polygon |
| primitives or]] tessellating a single patch primitive, multiple |
| primitives may be generated in an undefined implementation-dependent |
| order. When fragments A and B are generated from such unordered |
| primitives, their ordering is also implementation-dependent. |
| |
| If fragment shader X completes its critical section before fragment shader |
| Y begins its critical section, all stores to memory performed in the |
| critical section of invocation X using a pointer, image uniform, atomic |
| counter uniform, or buffer variable qualified by "coherent" are guaranteed |
| to be visible to any reads of the same types of variable performed in the |
| critical section of invocation Y. |
| |
| If multisampling is disabled, or if the framebuffer does not include |
| sample buffers, fragment coverage is computed per-pixel. In this case, |
| the "sample_interlock_ordered" or "sample_interlock_unordered" layout |
| qualifiers are treated as "pixel_interlock_ordered" or |
| "pixel_interlock_unordered", respectively. |
| |
| |
| Syntax: |
| |
| void beginInvocationInterlockNV(void); |
| void endInvocationInterlockNV(void); |
| |
| Description: |
| |
| The functions beginInvocationInterlockNV() and endInvocationInterlockNV()
| may only be called inside the function main() of a fragment shader and may
| not be called within any flow control. These functions may not be called after a
| return statement in the function main(), but may be called after a discard |
| statement. A compile- or link-time error will be generated if main() |
| calls either function more than once, contains a call to one function |
| without a matching call to the other, or calls endInvocationInterlockNV() |
| before calling beginInvocationInterlockNV(). |
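
The placement rules above might look as follows in practice. This is a sketch under assumptions: the counter image u_hits and the input v_alpha are hypothetical names, and the rejected forms are listed only as comments because a shader containing them would fail to compile or link.

```glsl
#version 430
#extension GL_NV_fragment_shader_interlock : require

layout(pixel_interlock_unordered) in;

// Hypothetical per-pixel counter image.
layout(binding = 0, r32ui) coherent uniform uimage2D u_hits;

in float v_alpha;  // hypothetical input

void main()
{
    // A discard may precede the interlock calls; a return may not.
    if (v_alpha <= 0.0)
        discard;

    ivec2 p = ivec2(gl_FragCoord.xy);

    // Both calls sit at the top level of main(), exactly once each,
    // with "begin" before "end".
    beginInvocationInterlockNV();
    uint n = imageLoad(u_hits, p).x;
    imageStore(u_hits, p, uvec4(n + 1u, 0u, 0u, 0u));
    endInvocationInterlockNV();

    // Forms rejected by the rules above:
    //   if (cond) beginInvocationInterlockNV();    // inside flow control
    //   calling either function from a function other than main()
    //   a second call to beginInvocationInterlockNV() in main()
    //   endInvocationInterlockNV() before beginInvocationInterlockNV()
}
```

Flow control inside the critical section is not restricted by these rules; only the two interlock calls themselves must appear outside of it.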
| |
| Additions to the AGL/GLX/WGL Specifications |
| |
| None. |
| |
| Errors |
| |
| None. |
| |
| New State |
| |
| None. |
| |
| New Implementation Dependent State |
| |
| None. |
| |
| Interactions with OpenGL ES 3.1 |
| |
| Disabling multisample rasterization is not available on OpenGL ES; |
| it is always enabled. |
| |
| |
| Dependencies on EXT_tessellation_shader |
| |
| If this extension is implemented on OpenGL ES and EXT_tessellation_shader |
| is not supported, remove language referring to tessellation of patch |
| primitives. |
| |
| |
| Dependencies on OES_sample_shading |
| |
| If this extension is implemented on OpenGL ES and OES_sample_shading |
| is not supported, remove references to per-sample shading via |
| MinSampleShading[OES]().
| |
| |
| Dependencies on OES_shader_image_atomic |
| |
| If this extension is implemented on OpenGL ES and OES_shader_image_atomic |
| is not supported, disregard language referring to atomic memory operations. |
| |
| |
| Dependencies on OES_shader_multisample_interpolation |
| |
| If this extension is implemented on OpenGL ES and
| OES_shader_multisample_interpolation is not supported, ignore language
| about the "sample" auxiliary storage qualifier.
| |
| |
| Dependencies on NV_shader_buffer_load and NV_shader_buffer_store |
| |
| If NV_shader_buffer_load and NV_shader_buffer_store are not supported, |
| references to ordering memory accesses using pointers should be deleted. |
| |
| |
| Dependencies on NV_gpu_program4 and NV_fragment_program4 |
| |
| Modify Section 2.X.2, Program Grammar, of the NV_fragment_program4 |
| specification (which modifies the NV_gpu_program4 base grammar) |
| |
| <SpecialInstruction> ::= "FSIB" |
| | "FSIE" |
| |
| |
| Modify Section 2.X.4, Program Execution Environment |
| |
| (add to the opcode table) |
| |
| Modifiers |
| Instruction F I C S H D Out Inputs Description |
| ----------- - - - - - - --- -------- -------------------------------- |
| FSIB - - - - - - - - begin fragment shader interlock |
| FSIE - - - - - - - - end fragment shader interlock |
| |
| |
| Modify Section 2.X.6, Program Options |
| |
| + Fragment Shader Interlock (NV_pixel_interlock_ordered,
| NV_pixel_interlock_unordered, NV_sample_interlock_ordered, and
| NV_sample_interlock_unordered)
| |
| If a fragment program specifies the "NV_pixel_interlock_ordered",
| "NV_pixel_interlock_unordered", "NV_sample_interlock_ordered", or
| "NV_sample_interlock_unordered" option, it will configure a critical
| section using the FSIB (fragment shader interlock begin) and FSIE
| (fragment shader interlock end) opcodes. The execution of the critical
| sections will be ordered for pairs of program invocations corresponding to
| the same pixel, as described in Section 8.13.3 of the OpenGL Shading
| Language Specification, where the four options are considered to specify
| layout qualifiers with names corresponding to the program options.
| |
| A program will fail to load if it specifies more than one of these program |
| options, if it specifies exactly one of these options but does not contain |
| exactly one FSIB instruction and one FSIE instruction, or if it contains |
| an FSIB or FSIE instruction without specifying any of these options. |
| |
| |
| Add the following subsections to section 2.X.8, Program Instruction Set |
| |
| |
| Section 2.X.8.Z, FSIB: Fragment Shader Interlock Begin |
| |
| The FSIB instruction specifies the beginning of a critical section in a |
| fragment program, where execution of the critical section is ordered |
| relative to other fragments. This instruction has no other effect. |
| |
| The FSIB instruction is not allowed in arbitrary locations in a program. |
| A program will fail to load if it includes an FSIB instruction inside an
| IF/ELSE/ENDIF block, inside a REP/ENDREP block, or inside any subroutine
| block other than the one labeled "main". Additionally, a program will |
| fail to load if it contains more than one FSIB instruction, or if its one |
| FSIB instruction is not followed by an FSIE instruction. |
| |
| FSIB has no operands and generates no result. |
| |
| |
| Section 2.X.8.Z, FSIE: Fragment Shader Interlock End |
| |
| The FSIE instruction specifies the end of a critical section in a fragment |
| program, where execution of the critical section is ordered relative to |
| other fragments. This instruction has no other effect. |
| |
| The FSIE instruction is not allowed in arbitrary locations in a program. |
| A program will fail to load if it includes an FSIE instruction inside an
| IF/ELSE/ENDIF block, inside a REP/ENDREP block, or inside any subroutine
| block other than the one labeled "main". Additionally, a program will |
| fail to load if it contains more than one FSIE instruction, or if its one |
| FSIE instruction is not preceded by an FSIB instruction. |
| |
| FSIE has no operands and generates no result. |
| |
| Issues |
| |
| (1) What should this extension be called? |
| |
| RESOLVED: NV_fragment_shader_interlock. The |
| beginInvocationInterlockNV() and endInvocationInterlockNV() commands |
| identify a critical section during which other invocations with |
| overlapping coverage are locked out until the critical section |
| completes. |
| |
| (2) When using multisampling, the OpenGL specification permits |
| multiple fragment shader invocations to be generated for a single |
| fragment. For example, per-sample shading using the "sample" |
| auxiliary storage qualifier or the MinSampleShading() OpenGL API command
| can be used to force per-sample shading. What execution ordering |
| guarantees are provided between fragment shader invocations generated |
| from the same fragment? |
| |
| RESOLVED: We don't provide any ordering guarantees in this extension. |
| This implies that when using multisampling, there is no guarantee that |
| two fragment shader invocations for the same fragment won't be executing |
| their critical sections concurrently. This could cause problems for |
| algorithms sharing data structures between all the samples of a pixel |
| unless accesses to these data structures are performed atomically. |
| |
| When using per-sample shading, the interlock we provide *does* guarantee |
| that no two invocations corresponding to the same sample execute the |
| critical section concurrently. If a separate set of data structures is |
| provided for each sample, no conflicts should occur within the critical |
| section. |
| |
| Note that in addition to the per-sample shading options in the shading |
| language and API, implementations may provide multisample antialiasing |
| modes where the implementation can't simply run the fragment shader once |
| and broadcast results to a large set of covered samples. |
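
The "separate set of data structures for each sample" approach mentioned above could be sketched as follows for desktop GLSL 4.30. The multisample image u_dst, its binding and format, and the input v_color are assumptions made for illustration.

```glsl
#version 430
#extension GL_NV_fragment_shader_interlock : require

layout(sample_interlock_ordered) in;

// Hypothetical multisample image holding one texel per sample.
layout(binding = 0, rgba8) coherent uniform image2DMS u_dst;

in vec4 v_color;  // hypothetical interpolated source color

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    // Any static use of gl_SampleID causes the shader to run per sample.
    int s = gl_SampleID;

    beginInvocationInterlockNV();
    // Each invocation reads and writes only the texel of its own sample,
    // so invocations from the same fragment never share data.
    vec4 dst = imageLoad(u_dst, p, s);
    imageStore(u_dst, p, s, v_color * v_color.a + dst * (1.0 - v_color.a));
    endInvocationInterlockNV();
}
```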
| |
| (3) What performance differences are expected between shaders using the
| "pixel" and "sample" layout qualifier variants in this extension (e.g.,
| "pixel_interlock_ordered" and "sample_interlock_ordered")?
| |
| RESOLVED: We expect that shaders using "sample" qualifiers may have |
| higher performance, since the implementation need not order pairs of |
| fragments that touch the same pixel with "complementary" coverage. Such |
| situations are fairly common: when two adjacent triangles combine to |
| cover a given pixel, two fragments will be generated for the pixel but |
| no sample will be covered by both. When using "sample" qualifiers, the |
| invocations for both fragments can run concurrently. When using "pixel" |
| qualifiers, the critical section for one fragment must wait until the |
| critical section for the other fragment completes. |
| |
| (4) What performance differences are expected between shaders using the |
| "ordered" and "unordered" layout qualifier variants in this extension |
| (e.g., "pixel_interlock_ordered" and "pixel_interlock_unordered")?
| |
| RESOLVED: We expect that shaders using "unordered" may have higher |
| performance, since the critical section implementation doesn't need to |
| ensure that all previous invocations with overlapping coverage have |
| completed their critical sections. Some algorithms (e.g., building data |
| structures in order-independent transparency algorithms) will require |
| mutual exclusion when updating per-pixel data structures, but do not |
| require that shaders execute in a specific ordering. |
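
As an illustration of the order-independent transparency case, the sketch below appends each fragment to a small fixed-size per-pixel list using only mutual exclusion. The images u_count and u_frags, the limit MAX_FRAGS, and the packing of depth into the w component are all assumptions made for the example. It also assumes one shader invocation per fragment; see issue (2) for the multisample caveat.

```glsl
#version 430
#extension GL_NV_fragment_shader_interlock : require

// Mutual exclusion only; the order of insertion does not matter here.
layout(pixel_interlock_unordered) in;

const int MAX_FRAGS = 4;  // hypothetical per-pixel capacity

// Hypothetical per-pixel fragment count and fragment storage
// (one array layer per list slot).
layout(binding = 0, r32ui)   coherent uniform uimage2D     u_count;
layout(binding = 1, rgba32f) coherent uniform image2DArray u_frags;

in vec4 v_color;  // hypothetical input

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);

    beginInvocationInterlockNV();
    uint n = imageLoad(u_count, p).x;
    if (n < uint(MAX_FRAGS)) {
        // Store color in rgb and depth in w for a later sorting pass.
        imageStore(u_frags, ivec3(p, int(n)),
                   vec4(v_color.rgb, gl_FragCoord.z));
        imageStore(u_count, p, uvec4(n + 1u, 0u, 0u, 0u));
    }
    endInvocationInterlockNV();
}
```

A separate resolve pass, not shown, would sort and composite the stored fragments.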
| |
| (5) Are fragment shaders using this extension allowed to write outputs? |
| If so, is there any guarantee on the order in which such outputs are |
| written to the framebuffer? |
| |
| RESOLVED: Yes, fragment shaders with critical sections may still write |
| outputs. If fragment shader outputs are written, they are stored or |
| blended into the framebuffer in API order, as is the case for fragment |
| shaders not using this extension. |
| |
| (6) What considerations apply when using this extension to implement a |
| programmable form of conventional blending using image stores? |
| |
| RESOLVED: Per-fragment operations performed in the pipeline following |
| fragment shader execution obviously have no effect on image stores |
| executing during fragment shader execution. In particular, multisample |
| operations such as broadcasting a single fragment output to multiple |
| samples or modifying the coverage with alpha-to-coverage or a shader |
| coverage mask output value have no effect. Fragments cannot be killed
| before fragment shader blending using the fixed-function alpha test or |
| using the depth test with a Z value produced by the shader. Fragments |
| will normally not be killed by fixed-function depth or stencil tests, |
| but those tests can be enabled before fragment shader invocations using |
| the layout qualifier "early_fragment_tests". Any required |
| fixed-function features that need to be handled before programmable |
| blending that aren't enabled by "early_fragment_tests" would need to be |
| emulated in the shader. |
| |
| Note also that performing blend computations in the shader is not
| guaranteed to produce results that are bit-identical to those produced
| by fixed-function blending hardware, even if mathematically equivalent |
| algorithms are used. |
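
One way to combine these considerations is sketched below: depth and stencil tests are moved ahead of the shader with "early_fragment_tests", and an alpha test is emulated in shader code. The names u_dst, u_alphaRef, and v_color are assumptions made for the example, and the GREATER comparison is an arbitrary choice.

```glsl
#version 430
#extension GL_NV_fragment_shader_interlock : require

// Run depth and stencil tests before the shader, so fragments that fail
// them never reach the image store below.
layout(early_fragment_tests) in;
layout(pixel_interlock_ordered) in;

layout(binding = 0, rgba8) coherent uniform image2D u_dst;  // hypothetical

uniform float u_alphaRef;  // hypothetical alpha-test reference value
in vec4 v_color;           // hypothetical interpolated source color

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);

    beginInvocationInterlockNV();
    // Emulated alpha test (GREATER). The test guards the blend itself;
    // the interlock calls stay outside of flow control.
    if (v_color.a > u_alphaRef) {
        vec4 dst = imageLoad(u_dst, p);
        imageStore(u_dst, p, v_color * v_color.a + dst * (1.0 - v_color.a));
    }
    endInvocationInterlockNV();
}
```

With early fragment tests enabled, depth and stencil updates may already have been made for fragments that the emulated alpha test later rejects, so this sketch is not an exact replacement for the fixed-function ordering of those tests.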
| |
| (7) For operations accessing shared per-pixel data structures in the |
| critical section, what operations (if any) must be performed in shader |
| code to ensure that stores from one shader invocation are visible to |
| the next? |
| |
| RESOLVED: The "coherent" qualifier is required in the declaration of |
| the shared data structures to ensure that writes performed by one |
| invocation are visible to reads performed by another invocation. |
| |
| In shaders that don't use the interlock, "coherent" is not sufficient as |
| there is no guarantee of the ordering of fragment shader invocations -- |
| even if invocation A can see the values written by another invocation B, |
| there is no general guarantee that invocation A's read will be performed |
| before invocation B's write. The built-in function memoryBarrier() can |
| be used to generate a weak ordering by which threads can communicate, |
| but it doesn't order memory transactions between two separate |
| invocations. With the interlock, execution ordering between two threads |
| from the same pixel is well-defined as long as the loads and stores are |
| performed inside the critical section, and the use of "coherent" ensures |
| that stores done by one invocation are visible to other invocations. |
| |
| (8) Should we provide an explicit mechanism for shaders to indicate a
| critical section? Or should we just automatically infer a critical |
| section by analyzing shader code? Or should we just wrap the entire |
| fragment shader in a critical section? |
| |
| RESOLVED: Provide an explicit critical section. |
| |
| We definitely don't want to wrap the entire shader in a critical section |
| when a smaller section will suffice. Doing so would hold off the |
| execution of any other fragment shader invocation with the same (x,y) |
| for the entire (potentially long) life of the fragment shader. Hardware |
| would need to track a large number of fragments awaiting execution, and |
| may be so backed up that further fragments will be blocked even if they |
| don't overlap with any fragments currently executing. Providing a |
| smaller critical section reduces the amount of time other fragments are |
| blocked and allows implementations to perform useful work for |
| conflicting fragments before they hit the critical section. |
| |
| While a compiler could analyze the code and wrap a critical section |
| around all memory accesses, it may be difficult to determine which |
| accesses actually require mutual exclusion and ordering, and which |
| accesses are safe to do with no protection. Requiring shaders to |
| explicitly identify a critical section doesn't seem overwhelmingly |
| burdensome, and allows applications to exclude memory accesses that they
| know to be "safe".
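
The benefit of a small explicit critical section can be sketched as follows: work that touches no shared memory is done first, and only the read-modify-write of the shared per-pixel value is protected. The accumulation image u_accum, the input v_normal, and the uniform array u_lightDirs are assumptions made for the example.

```glsl
#version 430
#extension GL_NV_fragment_shader_interlock : require

layout(pixel_interlock_unordered) in;

// Hypothetical per-pixel accumulation image.
layout(binding = 0, r32f) coherent uniform image2D u_accum;

in vec3 v_normal;             // hypothetical input
uniform vec3 u_lightDirs[8];  // hypothetical light directions

void main()
{
    // Potentially long work that touches no shared memory stays outside
    // the critical section, so conflicting fragments can overlap it.
    vec3 n = normalize(v_normal);
    float lit = 0.0;
    for (int i = 0; i < 8; ++i)
        lit += max(dot(n, u_lightDirs[i]), 0.0);

    ivec2 p = ivec2(gl_FragCoord.xy);

    // Only the shared read-modify-write is protected.
    beginInvocationInterlockNV();
    float sum = imageLoad(u_accum, p).x;
    imageStore(u_accum, p, vec4(sum + lit));
    endInvocationInterlockNV();
}
```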
| |
| (9) What restrictions should be imposed on the use of the |
| beginInvocationInterlockNV() and endInvocationInterlockNV() functions |
| delimiting a critical section? |
| |
| RESOLVED: We impose restrictions similar to those on the barrier() |
| built-in function in tessellation control shaders to ensure that any |
| shader using this functionality has a single critical section that can |
| be easily identified during compilation. In particular, we require that |
| these functions be called in main() and don't permit them to be called |
| in conditional flow control. |
| |
| These restrictions ensure that there is always exactly one call to the |
| "begin" and "end" functions in a predictable location in the compiled |
| shader code, and ensure that the compiler and hardware don't have to |
| deal with unusual cases (like entering a critical section and never |
| leaving, leaving a critical section without entering it, or trying to |
| enter a critical section more than once). |
| |
| Revision History |
| |
| Revision 2, 2015/03/27 |
| - Add ES interactions |
| |
| Revision 1 |
| - Internal revisions |