| Name |
| |
| ARB_fragment_shader_interlock |
| |
| Name Strings |
| |
| GL_ARB_fragment_shader_interlock |
| |
| Contact |
| |
| Slawomir Grajewski, Intel (slawomir.grajewski 'at' intel.com) |
| |
| Contributors |
| |
| Contributors to INTEL_fragment_shader_ordering |
| Contributors to NV_fragment_shader_interlock |
| |
| Notice |
| |
| Copyright (c) 2015 The Khronos Group Inc. Copyright terms at |
| http://www.khronos.org/registry/speccopyright.html |
| |
| Status |
| |
| Complete. Approved by the ARB on June 26, 2015. |
| Ratified by the Khronos Board of Promoters on August 7, 2015. |
| |
| Version |
| |
| Last Modified Date: May 7, 2015 |
| Revision: 2 |
| |
| Number |
| |
| ARB Extension #177 |
| |
| Dependencies |
| |
| This extension is written against the OpenGL 4.5 (Core Profile) |
| Specification. |
| |
| This extension is written against version 4.50 (revision 5) of the OpenGL |
| Shading Language Specification. |
| |
| OpenGL 4.2 or ARB_shader_image_load_store is required; GLSL 4.20 is |
| required. |
| |
| Overview |
| |
| In unextended OpenGL 4.5, applications may produce a |
| large number of fragment shader invocations that perform loads and |
| stores to memory using image uniforms, atomic counter uniforms, |
| buffer variables, or pointers. The order in which loads and stores |
| to common addresses are performed by different fragment shader |
| invocations is largely undefined. For algorithms that use shader |
| writes and touch the same pixels more than once, one or more of the |
| following techniques may be required to ensure proper execution ordering: |
| |
| * inserting Finish or WaitSync commands to drain the pipeline between |
| different "passes" or "layers"; |
| |
| * using only atomic memory operations to write to shader memory (which |
| may be relatively slow and limits how memory may be updated); or |
| |
| * injecting spin loops into shaders to prevent multiple shader |
| invocations from touching the same memory concurrently. |
| |
| This extension provides new GLSL built-in functions |
| beginInvocationInterlockARB() and endInvocationInterlockARB() that delimit |
| a critical section of fragment shader code. For pairs of shader |
| invocations with "overlapping" coverage in a given pixel, the OpenGL |
| implementation will guarantee that the critical section of the fragment |
| shader will be executed for only one fragment at a time. |
| |
| There are four different interlock modes supported by this extension, |
| which are identified by layout qualifiers. The qualifiers |
| "pixel_interlock_ordered" and "pixel_interlock_unordered" provide mutual |
| exclusion in the critical section for any pair of fragments corresponding |
| to the same pixel. When using multisampling, the qualifiers |
| "sample_interlock_ordered" and "sample_interlock_unordered" only provide |
| mutual exclusion for pairs of fragments that both cover at least one |
| common sample in the same pixel; these are recommended for performance if |
| shaders use per-sample data structures. |
| |
| Additionally, when the "pixel_interlock_ordered" or |
| "sample_interlock_ordered" layout qualifier is used, the interlock also |
| guarantees that the critical section for multiple shader invocations with |
| "overlapping" coverage will be executed in the order in which the |
| primitives were processed by the GL. Such a guarantee is useful for |
| applications like blending in the fragment shader, where an application |
| requires fragment values to be composited in the framebuffer in |
| primitive order. |
| |
| This extension can be useful for algorithms that need to access per-pixel |
| data structures via shader loads and stores. Such algorithms using this |
| extension can access such data structures in the critical section without |
| worrying about other invocations for the same pixel accessing the data |
| structures concurrently. Additionally, the ordering guarantees are useful |
| for cases where the API ordering of fragments is meaningful. For example, |
| applications may be able to execute programmable blending operations in |
| the fragment shader, where the destination buffer is read via image loads |
| and the final value is written via image stores. |
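| |
| As a non-normative sketch of the programmable blending case described |
| above, the following fragment shader performs "source over" blending |
| against a destination image. The names u_dst and v_color, the rgba8 |
| format, and the binding point are assumptions made for illustration and |
| are not defined by this extension. |

```glsl
#version 450 core
#extension GL_ARB_fragment_shader_interlock : require

// Order critical sections for the same pixel by primitive order.
layout(pixel_interlock_ordered) in;

// Hypothetical destination buffer; "coherent" makes stores visible to
// later invocations that read the same texel.
layout(binding = 0, rgba8) coherent uniform image2D u_dst;

in vec4 v_color;  // hypothetical non-premultiplied source color

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);

    beginInvocationInterlockARB();
    vec4 dst = imageLoad(u_dst, p);
    vec3 rgb = mix(dst.rgb, v_color.rgb, v_color.a);
    float a  = v_color.a + dst.a * (1.0 - v_color.a);
    imageStore(u_dst, p, vec4(rgb, a));
    endInvocationInterlockARB();
}
```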
| |
| New Procedures and Functions |
| |
| None. |
| |
| New Tokens |
| |
| None. |
| |
| Modifications to the OpenGL Shading Language Specification, Version 4.50 |
| |
| Including the following line in a shader can be used to control the |
| language features described in this extension: |
| |
| #extension GL_ARB_fragment_shader_interlock : <behavior> |
| |
| where <behavior> is as specified in section 3.3. |
| |
| New preprocessor #defines are added to the OpenGL Shading Language: |
| |
| #define GL_ARB_fragment_shader_interlock 1 |
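| |
| A minimal sketch of how a shader might use the directive and the |
| preprocessor define together, falling back to an atomic image operation |
| when the extension is not available. The image name u_count and its |
| binding are hypothetical, and the fallback is only one possible choice. |

```glsl
#version 450 core
// "enable" rather than "require": compilation continues (with a warning)
// if the extension is unsupported, and the #ifdef below selects a path.
#extension GL_ARB_fragment_shader_interlock : enable

// Hypothetical per-pixel fragment counter.
layout(binding = 0, r32ui) coherent uniform uimage2D u_count;

#ifdef GL_ARB_fragment_shader_interlock
layout(pixel_interlock_unordered) in;
#endif

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);

#ifdef GL_ARB_fragment_shader_interlock
    // Plain load/store is safe inside the critical section.
    beginInvocationInterlockARB();
    uint n = imageLoad(u_count, p).x;
    imageStore(u_count, p, uvec4(n + 1u));
    endInvocationInterlockARB();
#else
    // Without the interlock, fall back to an atomic update.
    imageAtomicAdd(u_count, p, 1u);
#endif
}
```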
| |
| |
| Modify Section 4.4.1.3, Fragment Shader Inputs (p. 63) |
| |
| (add to the list of layout qualifiers containing "early_fragment_tests", |
| p. 63, and modify the surrounding language to reflect that multiple |
| layout qualifiers are supported on "in") |
| |
| layout-qualifier-id |
| pixel_interlock_ordered |
| pixel_interlock_unordered |
| sample_interlock_ordered |
| sample_interlock_unordered |
| |
| (add to the end of the section, p. 63) |
| |
| The identifiers "pixel_interlock_ordered", "pixel_interlock_unordered", |
| "sample_interlock_ordered", and "sample_interlock_unordered" control the |
| ordering of the execution of shader invocations between calls to the |
| built-in functions beginInvocationInterlockARB() and |
| endInvocationInterlockARB(), as described in section 8.13.3. A |
| compile or link error will be generated if more than one of these layout |
| qualifiers is specified in shader code. If a program containing a |
| fragment shader includes none of these layout qualifiers, it is as |
| though "pixel_interlock_ordered" were specified. |
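| |
| For illustration only, a shader declaring one of these qualifiers might |
| look like the sketch below. The image u_lastPrim is a hypothetical name. |
| Because the critical sections are ordered, each texel would be expected |
| to end up holding the ID of the last primitive, in primitive order within |
| a draw call, that covered the pixel. |

```glsl
#version 450 core
#extension GL_ARB_fragment_shader_interlock : require

// At most one interlock qualifier may appear. Omitting it entirely would
// behave as if pixel_interlock_ordered had been declared.
layout(pixel_interlock_ordered) in;
// layout(sample_interlock_unordered) in;  // a second qualifier: error

// Hypothetical per-pixel record of the most recent primitive.
layout(binding = 0, r32ui) coherent uniform uimage2D u_lastPrim;

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);

    beginInvocationInterlockARB();
    imageStore(u_lastPrim, p, uvec4(uint(gl_PrimitiveID)));
    endInvocationInterlockARB();
}
```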
| |
| Add to the end of Section 8.13, Fragment Processing Functions (p. 170) |
| |
| 8.13.3, Fragment Shader Execution Ordering Functions |
| |
| By default, fragment shader invocations are generally executed in |
| undefined order. Multiple fragment shader invocations may be executed |
| concurrently, including multiple invocations corresponding to a single |
| pixel. Additionally, fragment shader invocations for a single pixel might |
| not be processed in the order in which the primitives generating the |
| fragments were specified in the OpenGL API. |
| |
| The paired functions beginInvocationInterlockARB() and |
| endInvocationInterlockARB() allow shaders to specify a critical section, |
| inside which stronger execution ordering is guaranteed. When using the |
| "pixel_interlock_ordered" or "pixel_interlock_unordered" qualifier, |
| ordering guarantees are provided for any pair of fragment shader |
| invocations X and Y triggered by fragments A and B corresponding to the |
| same pixel. When using the "sample_interlock_ordered" or |
| "sample_interlock_unordered" qualifier, ordering guarantees are provided |
| for any pair of fragment shader invocations X and Y triggered by fragments |
| A and B that correspond to the same pixel, where at least one sample of |
| the pixel is covered by both fragments. No ordering guarantees are |
| provided for pairs of fragment shader invocations corresponding to |
| different pixels. Additionally, no ordering guarantees are provided for |
| pairs of fragment shader invocations corresponding to the same fragment. |
| When multisampling is enabled and the framebuffer has sample buffers, |
| multiple fragment shader invocations may result from a single fragment due |
| to the use of the "sample" auxiliary storage qualifier, OpenGL API |
| commands forcing multiple shader invocations per fragment, or for other |
| implementation-dependent reasons. |
| |
| When using the "pixel_interlock_unordered" or "sample_interlock_unordered" |
| qualifier, the interlock will ensure that the critical sections of |
| fragment shader invocations X and Y with overlapping coverage will never |
| execute concurrently. That is, invocation X is guaranteed to complete its |
| call to endInvocationInterlockARB() before invocation Y completes its call |
| to beginInvocationInterlockARB(), or vice versa. |
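| |
| A non-normative sketch of a case where mutual exclusion without ordering |
| is sufficient: appending fragments to a small fixed-size per-pixel list. |
| MAX_FRAGS, u_count, FragList, u_width, and v_color are hypothetical |
| names, and the overflow policy (overwrite the last slot) is an arbitrary |
| choice made to keep the example short. |

```glsl
#version 450 core
#extension GL_ARB_fragment_shader_interlock : require

layout(pixel_interlock_unordered) in;

const int MAX_FRAGS = 4;  // hypothetical per-pixel capacity

// Hypothetical per-pixel fragment count and fragment storage.
layout(binding = 0, r32ui) coherent uniform uimage2D u_count;
layout(std430, binding = 0) coherent buffer FragList {
    vec4 frags[];  // MAX_FRAGS consecutive entries per pixel
};

uniform int u_width;  // hypothetical framebuffer width in pixels
in vec4 v_color;      // hypothetical fragment color

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    int base = (p.y * u_width + p.x) * MAX_FRAGS;

    beginInvocationInterlockARB();
    uint n = imageLoad(u_count, p).x;
    // When the list is full, reuse the last slot instead of overflowing.
    uint slot = min(n, uint(MAX_FRAGS) - 1u);
    frags[base + int(slot)] = v_color;
    imageStore(u_count, p, uvec4(min(n + 1u, uint(MAX_FRAGS))));
    endInvocationInterlockARB();
}
```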
| |
| When using the "pixel_interlock_ordered" or "sample_interlock_ordered" |
| layout qualifier, the critical sections of invocations X and Y with |
| overlapping coverage will be executed in a specific order, based on the |
| relative order assigned to their fragments A and B. If fragment A is |
| considered to precede fragment B, the critical section of invocation X is |
| guaranteed to complete before the critical section of invocation Y begins. |
| When a pair of fragments A and B have overlapping coverage, fragment A is |
| considered to precede fragment B if |
| |
| * the OpenGL API command producing fragment A was called prior to the |
| command producing B, or |
| |
| * the point, line, triangle, [[compatibility profile: quadrilateral, |
| polygon,]] or patch primitive producing fragment A appears earlier in |
| the same strip, loop, fan, or independent primitive list producing |
| fragment B. |
| |
| When [[compatibility profile: decomposing quadrilateral or polygon |
| primitives or]] tessellating a single patch primitive, multiple |
| primitives may be generated in an undefined implementation-dependent |
| order. When fragments A and B are generated from such unordered |
| primitives, their ordering is also implementation-dependent. |
| |
| If fragment shader X completes its critical section before fragment shader |
| Y begins its critical section, all stores to memory performed in the |
| critical section of invocation X using a pointer, image uniform, atomic |
| counter uniform, or buffer variable qualified by "coherent" are guaranteed |
| to be visible to any reads of the same types of variable performed in the |
| critical section of invocation Y. |
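| |
| The visibility rule above can be sketched with a buffer variable. In |
| this non-normative example, the block DepthRange and the uniform u_width |
| are hypothetical, and the application is assumed to have initialized |
| every entry to (1.0, 0.0) before drawing. |

```glsl
#version 450 core
#extension GL_ARB_fragment_shader_interlock : require

layout(pixel_interlock_unordered) in;

// Hypothetical per-pixel storage. Without "coherent", the visibility
// guarantee described above would not apply to this block.
layout(std430, binding = 0) coherent buffer DepthRange {
    vec2 range[];  // x = min depth, y = max depth, one entry per pixel
};

uniform int u_width;  // hypothetical framebuffer width in pixels

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    int i = p.y * u_width + p.x;
    float z = gl_FragCoord.z;

    beginInvocationInterlockARB();
    // This read observes the store made by any invocation for the same
    // pixel that already completed its critical section.
    vec2 r = range[i];
    range[i] = vec2(min(r.x, z), max(r.y, z));
    endInvocationInterlockARB();
}
```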
| |
| If multisampling is disabled, or if the framebuffer does not include |
| sample buffers, fragment coverage is computed per-pixel. In this case, |
| the "sample_interlock_ordered" or "sample_interlock_unordered" layout |
| qualifiers are treated as "pixel_interlock_ordered" or |
| "pixel_interlock_unordered", respectively. |
| |
| Syntax: |
| |
| void beginInvocationInterlockARB(void); |
| void endInvocationInterlockARB(void); |
| |
| Description: |
| |
| The functions beginInvocationInterlockARB() and |
| endInvocationInterlockARB() may only be placed inside the function main() |
| of a fragment shader and may not be called within any flow control. |
| These functions may not be called after a |
| return statement in the function main(), but may be called after a discard |
| statement. A compile- or link-time error will be generated if main() |
| calls either function more than once, contains a call to one function |
| without a matching call to the other, or calls endInvocationInterlockARB() |
| before calling beginInvocationInterlockARB(). |
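| |
| The placement rules can be summarized with a non-normative sketch. The |
| names u_dst and v_color are hypothetical. The commented-out lines list |
| usages that would be rejected under the rules above. |

```glsl
#version 450 core
#extension GL_ARB_fragment_shader_interlock : require

layout(pixel_interlock_ordered) in;

layout(binding = 0, rgba8) coherent uniform image2D u_dst;  // hypothetical
in vec4 v_color;                                            // hypothetical

void main()
{
    // A discard before the interlock calls is permitted.
    if (v_color.a <= 0.0) {
        discard;
    }

    ivec2 p = ivec2(gl_FragCoord.xy);

    // Exactly one begin/end pair, in main(), outside any flow control.
    beginInvocationInterlockARB();
    vec4 dst = imageLoad(u_dst, p);
    imageStore(u_dst, p, dst + v_color);  // additive update
    endInvocationInterlockARB();

    // Rejected at compile or link time:
    //   if (c) { beginInvocationInterlockARB(); }  // inside flow control
    //   calling either function from a function other than main()
    //   a second call to beginInvocationInterlockARB()
    //   endInvocationInterlockARB() before beginInvocationInterlockARB()
}
```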
| |
| Additions to the AGL/GLX/WGL Specifications |
| |
| None. |
| |
| Errors |
| |
| None. |
| |
| New State |
| |
| None. |
| |
| New Implementation Dependent State |
| |
| None. |
| |
| Issues |
| |
| (1) When using multisampling, the OpenGL specification permits |
| multiple fragment shader invocations to be generated for a single |
| fragment. For example, the "sample" auxiliary storage qualifier or the |
| MinSampleShading() OpenGL API command can be used to force per-sample |
| shading. What execution ordering |
| guarantees are provided between fragment shader invocations generated |
| from the same fragment? |
| |
| RESOLVED: We don't provide any ordering guarantees in this extension. |
| This implies that when using multisampling, there is no guarantee that |
| two fragment shader invocations for the same fragment won't be executing |
| their critical sections concurrently. This could cause problems for |
| algorithms sharing data structures between all the samples of a pixel |
| unless accesses to these data structures are performed atomically. |
| |
| When using per-sample shading, the interlock we provide *does* guarantee |
| that no two invocations corresponding to the same sample execute the |
| critical section concurrently. If a separate set of data structures is |
| provided for each sample, no conflicts should occur within the critical |
| section. |
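| |
| A non-normative sketch of the per-sample case: each invocation touches |
| only the storage for its own sample, so invocations generated from the |
| same fragment do not conflict. The multisample image u_dst and the input |
| v_color are hypothetical names. |

```glsl
#version 450 core
#extension GL_ARB_fragment_shader_interlock : require

layout(sample_interlock_ordered) in;

// Hypothetical multisample destination: one texel per sample.
layout(binding = 0, rgba8) coherent uniform image2DMS u_dst;

in vec4 v_color;  // hypothetical non-premultiplied source color

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    // Any static use of gl_SampleID causes per-sample shader evaluation.
    int s = gl_SampleID;

    beginInvocationInterlockARB();
    vec4 dst = imageLoad(u_dst, p, s);
    imageStore(u_dst, p, s, mix(dst, v_color, v_color.a));
    endInvocationInterlockARB();
}
```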
| |
| Note that in addition to the per-sample shading options in the shading |
| language and API, implementations may provide multisample antialiasing |
| modes where the implementation can't simply run the fragment shader once |
| and broadcast results to a large set of covered samples. |
| |
| (2) What performance differences are expected between shaders using the |
| "pixel" and "sample" layout qualifier variants in this extension (e.g., |
| "pixel_interlock_ordered" and "sample_interlock_ordered")? |
| |
| RESOLVED: We expect that shaders using "sample" qualifiers may have |
| higher performance, since the implementation need not order pairs of |
| fragments that touch the same pixel with "complementary" coverage. Such |
| situations are fairly common: when two adjacent triangles combine to |
| cover a given pixel, two fragments will be generated for the pixel but |
| no sample will be covered by both. When using "sample" qualifiers, the |
| invocations for both fragments can run concurrently. When using "pixel" |
| qualifiers, the critical section for one fragment must wait until the |
| critical section for the other fragment completes. |
| |
| (3) What performance differences are expected between shaders using the |
| "ordered" and "unordered" layout qualifier variants in this extension |
| (e.g., "pixel_interlock_ordered" and "pixel_interlock_unordered")? |
| |
| RESOLVED: We expect that shaders using "unordered" may have higher |
| performance, since the critical section implementation doesn't need to |
| ensure that all previous invocations with overlapping coverage have |
| completed their critical sections. Some algorithms (e.g., building data |
| structures in order-independent transparency algorithms) will require |
| mutual exclusion when updating per-pixel data structures, but do not |
| require that shaders execute in a specific ordering. |
| |
| (4) Are fragment shaders using this extension allowed to write outputs? |
| If so, is there any guarantee on the order in which such outputs are |
| written to the framebuffer? |
| |
| RESOLVED: Yes, fragment shaders with critical sections may still write |
| outputs. If fragment shader outputs are written, they are stored or |
| blended into the framebuffer in API order, as is the case for fragment |
| shaders not using this extension. |
| |
| (5) What considerations apply when using this extension to implement a |
| programmable form of conventional blending using image stores? |
| |
| RESOLVED: Per-fragment operations performed in the pipeline following |
| fragment shader execution obviously have no effect on image stores |
| executing during fragment shader execution. In particular, multisample |
| operations such as broadcasting a single fragment output to multiple |
| samples or modifying the coverage with alpha-to-coverage or a shader |
| coverage mask output value have no effect. Fragments cannot be killed |
| before fragment shader blending using the fixed-function alpha test or |
| using the depth test with a Z value produced by the shader. Fragments |
| will normally not be killed by fixed-function depth or stencil tests, |
| but those tests can be enabled before fragment shader invocations using |
| the layout qualifier "early_fragment_tests". Any required |
| fixed-function features that need to be handled before programmable |
| blending that aren't enabled by "early_fragment_tests" would need to be |
| emulated in the shader. |
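| |
| A minimal sketch, assuming the application wants fixed-function depth |
| and stencil tests to reject fragments before shader-based blending. The |
| names u_dst and v_color are hypothetical, and the multiplicative blend is |
| only an example of a blend equation. |

```glsl
#version 450 core
#extension GL_ARB_fragment_shader_interlock : require

// Run depth and stencil tests before the shader so that fragments failing
// them never reach the image store below.
layout(early_fragment_tests) in;
layout(pixel_interlock_ordered) in;

layout(binding = 0, rgba8) coherent uniform image2D u_dst;  // hypothetical
in vec4 v_color;                                            // hypothetical

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);

    beginInvocationInterlockARB();
    vec4 dst = imageLoad(u_dst, p);
    imageStore(u_dst, p, dst * v_color);  // multiplicative blend
    endInvocationInterlockARB();
}
```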
| |
| Note also that performing blend computations in the shader is not |
| guaranteed to produce results that are bit-identical to those produced |
| by fixed-function blending hardware, even if mathematically equivalent |
| algorithms are used. |
| |
| (6) For operations accessing shared per-pixel data structures in the |
| critical section, what operations (if any) must be performed in shader |
| code to ensure that stores from one shader invocation are visible to |
| the next? |
| |
| RESOLVED: The "coherent" qualifier is required in the declaration of |
| the shared data structures to ensure that writes performed by one |
| invocation are visible to reads performed by another invocation. |
| |
| In shaders that don't use the interlock, "coherent" is not sufficient as |
| there is no guarantee of the ordering of fragment shader invocations -- |
| even if invocation A can see the values written by another invocation B, |
| there is no general guarantee that invocation A's read will be performed |
| after invocation B's write. The built-in function memoryBarrier() can |
| be used to generate a weak ordering by which threads can communicate, |
| but it doesn't order memory transactions between two separate |
| invocations. With the interlock, execution ordering between two threads |
| from the same pixel is well-defined as long as the loads and stores are |
| performed inside the critical section, and the use of "coherent" ensures |
| that stores done by one invocation are visible to other invocations. |
| |
| (7) Should we provide an explicit mechanism for shaders to indicate a |
| critical section? Or should we just automatically infer a critical |
| section by analyzing shader code? Or should we just wrap the entire |
| fragment shader in a critical section? |
| |
| RESOLVED: Provide an explicit critical section. |
| |
| We definitely don't want to wrap the entire shader in a critical section |
| when a smaller section will suffice. Doing so would hold off the |
| execution of any other fragment shader invocation with the same (x,y) |
| for the entire (potentially long) life of the fragment shader. Hardware |
| would need to track a large number of fragments awaiting execution, and |
| may be so backed up that further fragments will be blocked even if they |
| don't overlap with any fragments currently executing. Providing a |
| smaller critical section reduces the amount of time other fragments are |
| blocked and allows implementations to perform useful work for |
| conflicting fragments before they hit the critical section. |
| |
| While a compiler could analyze the code and wrap a critical section |
| around all memory accesses, it may be difficult to determine which |
| accesses actually require mutual exclusion and ordering, and which |
| accesses are safe to do with no protection. Requiring shaders to |
| explicitly identify a critical section doesn't seem overwhelmingly |
| burdensome, and allows applications to exclude memory accesses that they |
| know to be "safe". |
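| |
| A non-normative sketch of keeping the critical section small: the |
| texture fetch and lighting math run before the interlock, and only the |
| read-modify-write of shared memory is protected. All resource and input |
| names (u_accum, u_albedo, u_lightDir, v_uv, v_normal) are hypothetical. |

```glsl
#version 450 core
#extension GL_ARB_fragment_shader_interlock : require

layout(pixel_interlock_ordered) in;

layout(binding = 0, rgba16f) coherent uniform image2D u_accum;
layout(binding = 0) uniform sampler2D u_albedo;
uniform vec3 u_lightDir;

in vec2 v_uv;
in vec3 v_normal;

void main()
{
    // Work that needs no mutual exclusion stays outside the critical
    // section, so other fragments for this pixel are not held up by it.
    vec4 albedo = texture(u_albedo, v_uv);
    float ndotl = max(dot(normalize(v_normal), normalize(u_lightDir)), 0.0);
    vec4 shaded = vec4(albedo.rgb * ndotl, albedo.a);

    ivec2 p = ivec2(gl_FragCoord.xy);

    // Only the shared-memory update is protected.
    beginInvocationInterlockARB();
    vec4 dst = imageLoad(u_accum, p);
    imageStore(u_accum, p, dst + shaded);
    endInvocationInterlockARB();
}
```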
| |
| (8) What restrictions should be imposed on the use of the |
| beginInvocationInterlockARB() and endInvocationInterlockARB() functions |
| delimiting a critical section? |
| |
| RESOLVED: We impose restrictions similar to those on the barrier() |
| built-in function in tessellation control shaders to ensure that any |
| shader using this functionality has a single critical section that can |
| be easily identified during compilation. In particular, we require that |
| these functions be called in main() and don't permit them to be called |
| in conditional flow control. |
| |
| These restrictions ensure that there is always exactly one call to the |
| "begin" and "end" functions in a predictable location in the compiled |
| shader code, and ensure that the compiler and hardware don't have to |
| deal with unusual cases (like entering a critical section and never |
| leaving, leaving a critical section without entering it, or trying to |
| enter a critical section more than once). |
| |
| Revision History |
| |
| Rev.  Date      Author       Changes |
| ----  --------  -----------  ----------------------------------------- |
| 1     04/01/15  S.Grajewski  Initial version merging |
|                              INTEL_fragment_shader_ordering with |
|                              NV_fragment_shader_interlock |
| |
| 2     05/07/15  S.Grajewski  Built-in functions |
|                              beginInvocationInterlockARB() and |
|                              endInvocationInterlockARB() now have ARB |
|                              suffixes. |