
 Name

     ARB_fragment_shader_interlock

 Name Strings

     GL_ARB_fragment_shader_interlock

 Contact

     Slawomir Grajewski, Intel  (slawomir.grajewski 'at' intel.com)

 Contributors

     Contributors to INTEL_fragment_shader_ordering
     Contributors to NV_fragment_shader_interlock

 Notice

     Copyright (c) 2015 The Khronos Group Inc. Copyright terms at
         http://www.khronos.org/registry/speccopyright.html

 Status

     Complete. Approved by the ARB on June 26, 2015.
     Ratified by the Khronos Board of Promoters on August 7, 2015.

 Version

     Last Modified Date:        May 7, 2015
     Revision:                  2

 Number

     ARB Extension #177

 Dependencies

     This extension is written against the OpenGL 4.5 (Core Profile)
     Specification.

     This extension is written against version 4.50 (revision 5) of the OpenGL
     Shading Language Specification.

     OpenGL 4.2 or ARB_shader_image_load_store is required; GLSL 4.20 is
     required.

 Overview

     In unextended OpenGL 4.5, applications may produce a
     large number of fragment shader invocations that perform loads and
     stores to memory using image uniforms, atomic counter uniforms,
     buffer variables, or pointers. The order in which loads and stores
     to common addresses are performed by different fragment shader
     invocations is largely undefined.  For algorithms that use shader
     writes and touch the same pixels more than once, one or more of the
     following techniques may be required to ensure proper execution ordering:

       * inserting Finish or WaitSync commands to drain the pipeline between
         different "passes" or "layers";

       * using only atomic memory operations to write to shader memory (which
         may be relatively slow and limits how memory may be updated); or

       * injecting spin loops into shaders to prevent multiple shader
         invocations from touching the same memory concurrently.

     This extension provides new GLSL built-in functions
     beginInvocationInterlockARB() and endInvocationInterlockARB() that delimit
     a critical section of fragment shader code.  For pairs of shader
     invocations with "overlapping" coverage in a given pixel, the OpenGL
     implementation will guarantee that the critical section of the fragment
     shader will be executed for only one fragment at a time.

     There are four different interlock modes supported by this extension,
     which are identified by layout qualifiers.  The qualifiers
     "pixel_interlock_ordered" and "pixel_interlock_unordered" provide mutual
     exclusion in the critical section for any pair of fragments corresponding
     to the same pixel.  When using multisampling, the qualifiers
     "sample_interlock_ordered" and "sample_interlock_unordered" only provide
     mutual exclusion for pairs of fragments that both cover at least one
     common sample in the same pixel; these are recommended for performance if
     shaders use per-sample data structures.

     Additionally, when the "pixel_interlock_ordered" or
     "sample_interlock_ordered" layout qualifier is used, the interlock also
     guarantees that the critical section for multiple shader invocations with
     "overlapping" coverage will be executed in the order in which the
     primitives were processed by the GL.  Such a guarantee is useful for
     applications like blending in the fragment shader, where an application
     requires fragment values to be composited in the framebuffer in
     primitive order.

     This extension can be useful for algorithms that need to access per-pixel
     data structures via shader loads and stores.  Such algorithms using this
     extension can access such data structures in the critical section without
     worrying about other invocations for the same pixel accessing the data
     structures concurrently.  Additionally, the ordering guarantees are useful
     for cases where the API ordering of fragments is meaningful.  For example,
     applications may be able to execute programmable blending operations in
     the fragment shader, where the destination buffer is read via image loads
     and the final value is written via image stores.
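
     The programmable blending case described above might look roughly like
     the following fragment shader.  This is a minimal sketch, not text from
     the specification: the image uniform "uDst", its binding and format, and
     the input "vColor" (assumed to carry premultiplied alpha) are
     hypothetical names chosen for illustration.

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : require

// Critical sections for the same pixel run one at a time, in primitive order.
layout(pixel_interlock_ordered) in;

// Hypothetical destination image bound by the application.
layout(binding = 0, rgba8) coherent uniform image2D uDst;

// Hypothetical input: source color, assumed to use premultiplied alpha.
in vec4 vColor;

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);

    beginInvocationInterlockARB();
    // Read the destination, composite "source over destination", write back.
    vec4 dst = imageLoad(uDst, p);
    vec4 result = vColor + dst * (1.0 - vColor.a);
    imageStore(uDst, p, result);
    endInvocationInterlockARB();
}
```

     Because the load and the store both sit inside the critical section, no
     other invocation for the same pixel can observe or overwrite the
     destination value between them.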

 New Procedures and Functions

     None.

 New Tokens

     None.

 Modifications to the OpenGL Shading Language Specification, Version 4.50

     Including the following line in a shader can be used to control the
     language features described in this extension:

       #extension GL_ARB_fragment_shader_interlock : <behavior>

     where <behavior> is as specified in section 3.3.

     New preprocessor #defines are added to the OpenGL Shading Language:

       #define GL_ARB_fragment_shader_interlock           1


     Modify Section 4.4.1.3, Fragment Shader Inputs (p. 63)

     (add to the list of layout qualifiers containing "early_fragment_tests",
      p. 63, and modify the surrounding language to reflect that multiple
      layout qualifiers are supported on "in")

       layout-qualifier-id
         pixel_interlock_ordered
         pixel_interlock_unordered
         sample_interlock_ordered
         sample_interlock_unordered

     (add to the end of the section, p. 63)

     The identifiers "pixel_interlock_ordered", "pixel_interlock_unordered",
     "sample_interlock_ordered", and "sample_interlock_unordered" control the
     ordering of the execution of shader invocations between calls to the
     built-in functions beginInvocationInterlockARB() and
     endInvocationInterlockARB(), as described in section 8.13.3. A
     compile or link error will be generated if more than one of these layout
     qualifiers is specified in shader code. If a program containing a
     fragment shader includes none of these layout qualifiers, it is as
     though "pixel_interlock_ordered" were specified.
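
     As an illustration of the directive, the preprocessor define, and the
     layout qualifiers together, a shader might be set up as sketched below.
     The image uniform "uCount" and its binding are hypothetical; the
     per-pixel counter is only an example of a read-modify-write that needs
     mutual exclusion but not ordering.

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : enable

// The define is only present when the extension is supported and enabled.
#ifndef GL_ARB_fragment_shader_interlock
#error GL_ARB_fragment_shader_interlock is not available
#endif

// At most one interlock qualifier may appear.  Leaving it out would behave
// as if pixel_interlock_ordered had been specified.
layout(pixel_interlock_unordered) in;

// Hypothetical per-pixel counter image bound by the application.
layout(binding = 0, r32ui) coherent uniform uimage2D uCount;

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);

    beginInvocationInterlockARB();
    // Non-atomic increment, made safe by the critical section.
    uint n = imageLoad(uCount, p).x;
    imageStore(uCount, p, uvec4(n + 1u));
    endInvocationInterlockARB();
}
```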

     Add to the end of Section 8.13, Fragment Processing Functions (p. 170)

     8.13.3, Fragment Shader Execution Ordering Functions

     By default, fragment shader invocations are generally executed in
     undefined order. Multiple fragment shader invocations may be executed
     concurrently, including multiple invocations corresponding to a single
     pixel. Additionally, fragment shader invocations for a single pixel might
     not be processed in the order in which the primitives generating the
     fragments were specified in the OpenGL API.

     The paired functions beginInvocationInterlockARB() and
     endInvocationInterlockARB() allow shaders to specify a critical section,
     inside which stronger execution ordering is guaranteed.  When using the
     "pixel_interlock_ordered" or "pixel_interlock_unordered" qualifier,
     ordering guarantees are provided for any pair of fragment shader
     invocations X and Y triggered by fragments A and B corresponding to the
     same pixel. When using the "sample_interlock_ordered" or
     "sample_interlock_unordered" qualifier, ordering guarantees are provided
     for any pair of fragment shader invocations X and Y triggered by fragments
     A and B that correspond to the same pixel, where at least one sample of
     the pixel is covered by both fragments. No ordering guarantees are
     provided for pairs of fragment shader invocations corresponding to
     different pixels. Additionally, no ordering guarantees are provided for
     pairs of fragment shader invocations corresponding to the same fragment.
     When multisampling is enabled and the framebuffer has sample buffers,
     multiple fragment shader invocations may result from a single fragment due
     to the use of the "sample" auxiliary storage qualifier, OpenGL API
     commands forcing multiple shader invocations per fragment, or for other
     implementation-dependent reasons.

     When using the "pixel_interlock_unordered" or "sample_interlock_unordered"
     qualifier, the interlock will ensure that the critical sections of
     fragment shader invocations X and Y with overlapping coverage will never
     execute concurrently. That is, invocation X is guaranteed to complete its
     call to endInvocationInterlockARB() before invocation Y completes its call
     to beginInvocationInterlockARB(), or vice versa.

     When using the "pixel_interlock_ordered" or "sample_interlock_ordered"
     layout qualifier, the critical sections of invocations X and Y with
     overlapping coverage will be executed in a specific order, based on the
     relative order assigned to their fragments A and B.  If fragment A is
     considered to precede fragment B, the critical section of invocation X is
     guaranteed to complete before the critical section of invocation Y begins.
     When a pair of fragments A and B have overlapping coverage, fragment A is
     considered to precede fragment B if

       * the OpenGL API command producing fragment A was called prior to the
         command producing B, or

       * the point, line, triangle, [[compatibility profile: quadrilateral,
         polygon,]] or patch primitive producing fragment A appears earlier in
         the same strip, loop, fan, or independent primitive list producing
         fragment B.

     When [[compatibility profile: decomposing quadrilateral or polygon
     primitives or]] tessellating a single patch primitive, multiple
     primitives may be generated in an undefined implementation-dependent
     order.  When fragments A and B are generated from such unordered
     primitives, their ordering is also implementation-dependent.

     If fragment shader X completes its critical section before fragment shader
     Y begins its critical section, all stores to memory performed in the
     critical section of invocation X using a pointer, image uniform, atomic
     counter uniform, or buffer variable qualified by "coherent" are guaranteed
     to be visible to any reads of the same types of variable performed in the
     critical section of invocation Y.

     If multisampling is disabled, or if the framebuffer does not include
     sample buffers, fragment coverage is computed per-pixel. In this case,
     the "sample_interlock_ordered" or "sample_interlock_unordered" layout
     qualifiers are treated as "pixel_interlock_ordered" or
     "pixel_interlock_unordered", respectively.

       Syntax:

         void beginInvocationInterlockARB(void);
         void endInvocationInterlockARB(void);

       Description:

     The functions beginInvocationInterlockARB() and endInvocationInterlockARB() may only
     be placed inside the function main() of a fragment shader and may not be
     called within any flow control.  These functions may not be called after a
     return statement in the function main(), but may be called after a discard
     statement.  A compile- or link-time error will be generated if main()
     calls either function more than once, contains a call to one function
     without a matching call to the other, or calls endInvocationInterlockARB()
     before calling beginInvocationInterlockARB().
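
     The placement rules above can be illustrated with the following sketch.
     The uniform "uDst" and the input "vColor" are hypothetical.  The sketch
     assumes that flow control between the two calls is acceptable, since the
     restriction is stated for the calls themselves; the forms listed in the
     trailing comment are the ones the rules above reject.

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : require

layout(pixel_interlock_ordered) in;

layout(binding = 0, rgba8) coherent uniform image2D uDst;  // hypothetical
in vec4 vColor;                                             // hypothetical

void main()
{
    // A discard statement before the interlock calls is permitted.
    if (vColor.a < 0.01)
        discard;

    ivec2 p = ivec2(gl_FragCoord.xy);
    bool opaque = vColor.a > 0.99;  // decide outside the critical section

    // Both calls appear exactly once, at the top level of main(),
    // with "begin" before "end".
    beginInvocationInterlockARB();
    vec4 dst = imageLoad(uDst, p);
    if (opaque)
        imageStore(uDst, p, vColor);
    else
        imageStore(uDst, p, mix(dst, vColor, vColor.a));
    endInvocationInterlockARB();

    // Forms that would be rejected at compile or link time:
    //   if (opaque) { beginInvocationInterlockARB(); ... }   (inside flow control)
    //   calling either function from a helper function       (not in main())
    //   a second begin/end pair later in main()               (called more than once)
    //   endInvocationInterlockARB() placed before the begin   (wrong order)
}
```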

 Additions to the AGL/GLX/WGL Specifications

     None.

 Errors

     None.

 New State

     None.

 New Implementation Dependent State

     None.

 Issues

     (1) When using multisampling, the OpenGL specification permits
         multiple fragment shader invocations to be generated for a single
         fragment.  For example, per-sample shading using the "sample"
         auxiliary storage qualifier or the MinSampleShading() OpenGL API command
         can be used to force per-sample shading.  What execution ordering
         guarantees are provided between fragment shader invocations generated
         from the same fragment?

       RESOLVED:  We don't provide any ordering guarantees in this extension.
       This implies that when using multisampling, there is no guarantee that
       two fragment shader invocations for the same fragment won't be executing
       their critical sections concurrently.  This could cause problems for
       algorithms sharing data structures between all the samples of a pixel
       unless accesses to these data structures are performed atomically.

       When using per-sample shading, the interlock we provide *does* guarantee
       that no two invocations corresponding to the same sample execute the
       critical section concurrently.  If a separate set of data structures is
       provided for each sample, no conflicts should occur within the critical
       section.

       Note that in addition to the per-sample shading options in the shading
       language and API, implementations may provide multisample antialiasing
       modes where the implementation can't simply run the fragment shader once
       and broadcast results to a large set of covered samples.
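
       A sketch of the per-sample arrangement described above, assuming a
       hypothetical multisample image "uDstMS" that holds one value per
       sample and a hypothetical input "vColor".  Reading gl_SampleID causes
       the shader to run once per sample, and each invocation touches only
       the storage for its own sample.

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : require

// Only fragments sharing at least one covered sample are interlocked.
layout(sample_interlock_ordered) in;

// Hypothetical multisample destination: separate storage for every sample.
layout(binding = 0, rgba8) coherent uniform image2DMS uDstMS;

in vec4 vColor;  // hypothetical input color

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    int s = gl_SampleID;  // using gl_SampleID forces per-sample shading

    beginInvocationInterlockARB();
    // Each invocation reads and writes only its own sample's slot, so
    // invocations from the same fragment never share data here.
    vec4 dst = imageLoad(uDstMS, p, s);
    imageStore(uDstMS, p, s, vColor + dst * (1.0 - vColor.a));
    endInvocationInterlockARB();
}
```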

     (2) What performance differences are expected between shaders using the
        "pixel" and "sample" layout qualifier variants in this extension (e.g.,
        "pixel_interlock_ordered" and "sample_interlock_ordered")?

       RESOLVED:  We expect that shaders using "sample" qualifiers may have
       higher performance, since the implementation need not order pairs of
       fragments that touch the same pixel with "complementary" coverage.  Such
       situations are fairly common:  when two adjacent triangles combine to
       cover a given pixel, two fragments will be generated for the pixel but
       no sample will be covered by both.  When using "sample" qualifiers, the
       invocations for both fragments can run concurrently.  When using "pixel"
       qualifiers, the critical section for one fragment must wait until the
       critical section for the other fragment completes.

     (3) What performance differences are expected between shaders using the
        "ordered" and "unordered" layout qualifier variants in this extension
        (e.g., "pixel_interlock_ordered" and "pixel_interlock_unordered")?

       RESOLVED:  We expect that shaders using "unordered" may have higher
       performance, since the critical section implementation doesn't need to
       ensure that all previous invocations with overlapping coverage have
       completed their critical sections.  Some algorithms (e.g., building data
       structures in order-independent transparency algorithms) will require
       mutual exclusion when updating per-pixel data structures, but do not
       require that shaders execute in a specific ordering.
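
       As a sketch of the order-independent transparency case, the shader
       below appends each fragment to a fixed-size per-pixel list.  The
       names "uCount", "uFrags", "vColor", and the constant MAX_FRAGS are
       hypothetical, and the storage layout is only one possible choice.
       The list is assumed to be sorted in a later pass, which is why
       mutual exclusion without ordering is enough here.

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : require

// Mutual exclusion per pixel; arrival order does not matter.
layout(pixel_interlock_unordered) in;

const int MAX_FRAGS = 8;  // hypothetical per-pixel capacity

// Hypothetical storage: a fragment count per pixel and one array layer
// per list slot.
layout(binding = 0, r32ui)   coherent uniform uimage2D     uCount;
layout(binding = 1, rgba16f) coherent uniform image2DArray uFrags;

in vec4 vColor;  // hypothetical input color

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);

    beginInvocationInterlockARB();
    uint n = imageLoad(uCount, p).x;
    if (n < uint(MAX_FRAGS)) {
        // Store color in rgb and window-space depth in alpha for later sorting.
        imageStore(uFrags, ivec3(p, int(n)), vec4(vColor.rgb, gl_FragCoord.z));
        imageStore(uCount, p, uvec4(n + 1u));
    }
    endInvocationInterlockARB();
}
```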

     (4) Are fragment shaders using this extension allowed to write outputs?
         If so, is there any guarantee on the order in which such outputs are
         written to the framebuffer?

       RESOLVED:  Yes, fragment shaders with critical sections may still write
       outputs.  If fragment shader outputs are written, they are stored or
       blended into the framebuffer in API order, as is the case for fragment
       shaders not using this extension.

     (5) What considerations apply when using this extension to implement a
         programmable form of conventional blending using image stores?

       RESOLVED:  Per-fragment operations performed in the pipeline following
       fragment shader execution obviously have no effect on image stores
       executing during fragment shader execution.  In particular, multisample
       operations such as broadcasting a single fragment output to multiple
       samples or modifying the coverage with alpha-to-coverage or a shader
       coverage mask output value have no effect.  Fragments can not be killed
       before fragment shader blending using the fixed-function alpha test or
       using the depth test with a Z value produced by the shader.  Fragments
       will normally not be killed by fixed-function depth or stencil tests,
       but those tests can be enabled before fragment shader invocations using
       the layout qualifier "early_fragment_tests".  Any required
       fixed-function features that need to be handled before programmable
       blending that aren't enabled by "early_fragment_tests" would need to be
       emulated in the shader.

       Note also that blend computations performed in the shader are not
       guaranteed to produce results that are bit-identical to those produced
       by fixed-function blending hardware, even if mathematically equivalent
       algorithms are used.
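
       A sketch of combining "early_fragment_tests" with shader blending,
       so that fragments failing the fixed-function depth or stencil test
       never reach the image store.  The names "uDst" and "vColor" are
       hypothetical, and the multiplicative blend is only an example
       operation.

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : require

// Run depth and stencil tests before the shader, so rejected fragments
// do not execute the image store below.
layout(early_fragment_tests) in;
layout(pixel_interlock_ordered) in;

layout(binding = 0, rgba8) coherent uniform image2D uDst;  // hypothetical
in vec4 vColor;                                             // hypothetical

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);

    beginInvocationInterlockARB();
    // Multiplicative ("modulate") blend against the stored destination.
    vec4 dst = imageLoad(uDst, p);
    imageStore(uDst, p, dst * vColor);
    endInvocationInterlockARB();
}
```

       Tests that "early_fragment_tests" does not cover, such as
       alpha-to-coverage, would still have to be emulated in shader code if
       the algorithm depends on them.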

     (6) For operations accessing shared per-pixel data structures in the
         critical section, what operations (if any) must be performed in shader
         code to ensure that stores from one shader invocation are visible to
         the next?

       RESOLVED:  The "coherent" qualifier is required in the declaration of
       the shared data structures to ensure that writes performed by one
       invocation are visible to reads performed by another invocation.

       In shaders that don't use the interlock, "coherent" is not sufficient as
       there is no guarantee of the ordering of fragment shader invocations --
       even if invocation A can see the values written by another invocation B,
       there is no general guarantee that invocation A's read will be performed
       before invocation B's write.  The built-in function memoryBarrier() can
       be used to generate a weak ordering by which threads can communicate,
       but it doesn't order memory transactions between two separate
       invocations.  With the interlock, execution ordering between two threads
       from the same pixel is well-defined as long as the loads and stores are
       performed inside the critical section, and the use of "coherent" ensures
       that stores done by one invocation are visible to other invocations.
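
       A sketch of the "coherent" requirement applied to a buffer variable
       rather than an image.  The block name "PixelSums", the uniform
       "uWidth", and the input "vWeight" are hypothetical.  The
       floating-point accumulation is a plain read-modify-write, which is
       safe only because both the read and the write occur inside the
       critical section and the buffer is declared "coherent".

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : require

layout(pixel_interlock_unordered) in;

// "coherent" makes stores from one invocation visible to reads performed
// by a later invocation's critical section.
layout(std430, binding = 0) coherent buffer PixelSums {
    float sum[];   // hypothetical: one accumulator per pixel
};

uniform int uWidth;  // hypothetical: framebuffer width in pixels
in float vWeight;    // hypothetical per-fragment contribution

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    int idx = p.y * uWidth + p.x;

    beginInvocationInterlockARB();
    // Non-atomic accumulate; the interlock serializes it per pixel.
    sum[idx] = sum[idx] + vWeight;
    endInvocationInterlockARB();
}
```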

     (7) Should we provide an explicit mechanism for shaders to indicate a
         critical section?  Or should we just automatically infer a critical
         section by analyzing shader code?  Or should we just wrap the entire
         fragment shader in a critical section?

       RESOLVED:  Provide an explicit critical section.

       We definitely don't want to wrap the entire shader in a critical section
       when a smaller section will suffice.  Doing so would hold off the
       execution of any other fragment shader invocation with the same (x,y)
       for the entire (potentially long) life of the fragment shader.  Hardware
       would need to track a large number of fragments awaiting execution, and
       may be so backed up that further fragments will be blocked even if they
       don't overlap with any fragments currently executing.  Providing a
       smaller critical section reduces the amount of time other fragments are
       blocked and allows implementations to perform useful work for
       conflicting fragments before they hit the critical section.

       While a compiler could analyze the code and wrap a critical section
       around all memory accesses, it may be difficult to determine which
       accesses actually require mutual exclusion and ordering, and which
       accesses are safe to do with no protection.  Requiring shaders to
       explicitly identify a critical section doesn't seem overwhelmingly
       burdensome, and allows applications to exclude memory accesses that
       they know to be "safe".

     (8) What restrictions should be imposed on the use of the
         beginInvocationInterlockARB() and endInvocationInterlockARB() functions
         delimiting a critical section?

       RESOLVED:  We impose restrictions similar to those on the barrier()
       built-in function in tessellation control shaders to ensure that any
       shader using this functionality has a single critical section that can
       be easily identified during compilation.  In particular, we require that
       these functions be called in main() and don't permit them to be called
       in conditional flow control.

       These restrictions ensure that there is always exactly one call to the
       "begin" and "end" functions in a predictable location in the compiled
       shader code, and ensure that the compiler and hardware don't have to
       deal with unusual cases (like entering a critical section and never
       leaving, leaving a critical section without entering it, or trying to
       enter a critical section more than once).

 Revision History

     Rev.    Date    Author        Changes
     ----  --------  --------     -----------------------------------------
      1    04/01/15  S.Grajewski  Initial version merging
                                  INTEL_fragment_shader_ordering with
                                  NV_fragment_shader_interlock

      2    05/07/15  S.Grajewski  Built-in functions
                                  beginInvocationInterlockARB() and
                                  endInvocationInterlockARB() now have ARB
                                  suffixes.
	Name

	ARB_fragment_shader_interlock

	Name Strings

	GL_ARB_fragment_shader_interlock

	Contact

	Slawomir Grajewski, Intel (slawomir.grajewski 'at' intel.com)

	Contributors

	Contributors to INTEL_fragment_shader_ordering
	Contributers to NV_fragment_shader_interlock

	Notice

	Copyright (c) 2015 The Khronos Group Inc. Copyright terms at
	http://www.khronos.org/registry/speccopyright.html

	Status

	Complete. Approved by the ARB on June 26, 2015.
	Ratified by the Khronos Board of Promoters on August 7, 2015.

	Version

	Last Modified Date: May 7, 2015
	Revision: 2

	Number

	ARB Extension #177

	Dependencies

	This extension is written against the OpenGL 4.5 (Core Profile)
	Specification.

	This extension is written against version 4.50 (revision 5) of the OpenGL
	Shading Language Specification.

	OpenGL 4.2 or ARB_shader_image_load_store is required; GLSL 4.20 is
	required.

	Overview

	In unextended OpenGL 4.5, applications may produce a
	large number of fragment shader invocations that perform loads and
	stores to memory using image uniforms, atomic counter uniforms,
	buffer variables, or pointers. The order in which loads and stores
	to common addresses are performed by different fragment shader
	invocations is largely undefined. For algorithms that use shader
	writes and touch the same pixels more than once, one or more of the
	following techniques may be required to ensure proper execution ordering:

	* inserting Finish or WaitSync commands to drain the pipeline between
	different "passes" or "layers";

	* using only atomic memory operations to write to shader memory (which
	may be relatively slow and limits how memory may be updated); or

	* injecting spin loops into shaders to prevent multiple shader
	invocations from touching the same memory concurrently.

	This extension provides new GLSL built-in functions
	beginInvocationInterlockARB() and endInvocationInterlockARB() that delimit
	a critical section of fragment shader code. For pairs of shader
	invocations with "overlapping" coverage in a given pixel, the OpenGL
	implementation will guarantee that the critical section of the fragment
	shader will be executed for only one fragment at a time.

	There are four different interlock modes supported by this extension,
	which are identified by layout qualifiers. The qualifiers
	"pixel_interlock_ordered" and "pixel_interlock_unordered" provides mutual
	exclusion in the critical section for any pair of fragments corresponding
	to the same pixel. When using multisampling, the qualifiers
	"sample_interlock_ordered" and "sample_interlock_unordered" only provide
	mutual exclusion for pairs of fragments that both cover at least one
	common sample in the same pixel; these are recommended for performance if
	shaders use per-sample data structures.

	Additionally, when the "pixel_interlock_ordered" or
	"sample_interlock_ordered" layout qualifier is used, the interlock also
	guarantees that the critical section for multiple shader invocations with
	"overlapping" coverage will be executed in the order in which the
	primitives were processed by the GL. Such a guarantee is useful for
	applications like blending in the fragment shader, where an application
	requires that fragment values to be composited in the framebuffer in
	primitive order.

	This extension can be useful for algorithms that need to access per-pixel
	data structures via shader loads and stores. Such algorithms using this
	extension can access such data structures in the critical section without
	worrying about other invocations for the same pixel accessing the data
	structures concurrently. Additionally, the ordering guarantees are useful
	for cases where the API ordering of fragments is meaningful. For example,
	applications may be able to execute programmable blending operations in
	the fragment shader, where the destination buffer is read via image loads
	and the final value is written via image stores.

	New Procedures and Functions

	None.

	New Tokens

	None.

	Modifications to the OpenGL Shading Language Specification, Version 4.50

	Including the following line in a shader can be used to control the
	language features described in this extension:

	#extension GL_ARB_fragment_shader_interlock : <behavior>

	where <behavior> is as specified in section 3.3.

	New preprocessor #defines are added to the OpenGL Shading Language:

	#define GL_ARB_fragment_shader_interlock 1


	Modify Section 4.4.1.3, Fragment Shader Inputs (p. 63)

	(add to the list of layout qualifiers containing "early_fragment_tests",
	p. 63, and modify the surrounding language to reflect that multiple
	layout qualifiers are supported on "in")

	layout-qualifier-id
	pixel_interlock_ordered
	pixel_interlock_unordered
	sample_interlock_ordered
	sample_interlock_unordered

	(add to the end of the section, p. 63)

	The identifiers "pixel_interlock_ordered", "pixel_interlock_unordered",
	"sample_interlock_ordered", and "sample_interlock_unordered" control the
	ordering of the execution of shader invocations between calls to the
	built-in functions beginInvocationInterlockARB() and
	endInvocationInterlockARB(), as described in section 8.13.3. A
	compile or link error will be generated if more than one of these layout
	qualifiers is specified in shader code. If a program containing a
	fragment shader includes none of these layout qualifiers, it is as
	though "pixel_interlock_ordered" were specified.

	Add to the end of Section 8.13, Fragment Processing Functions (p. 170)

	8.13.3, Fragment Shader Execution Ordering Functions

	By default, fragment shader invocations are generally executed in
	undefined order. Multiple fragment shader invocations may be executed
	concurrently, including multiple invocations corresponding to a single
	pixel. Additionally, fragment shader invocations for a single pixel might
	not be processed in the order in which the primitives generating the
	fragments were specified in the OpenGL API.

	The paired functions beginInvocationInterlockARB() and
	endInvocationInterlockARB() allow shaders to specify a critical section,
	inside which stronger execution ordering is guaranteed. When using the
	"pixel_interlock_ordered" or "pixel_interlock_unordered" qualifier,
	ordering guarantees are provided for any pair of fragment shader
	invocations X and Y triggered by fragments A and B corresponding to the
	same pixel. When using the "sample_interlock_ordered" or
	"sample_interlock_unordered" qualifier, ordering guarantees are provided
	for any pair of fragment shader invocations X and Y triggered by fragments
	A and B that correspond to the same pixel, where at least one sample of
	the pixel is covered by both fragments. No ordering guarantees are
	provided for pairs of fragment shader invocations corresponding to
	different pixels. Additionally, no ordering guarantees are provided for
	pairs of fragment shader invocations corresponding to the same fragment.
	When multisampling is enabled and the framebuffer has sample buffers,
	multiple fragment shader invocations may result from a single fragment due
	to the use of the "sample" auxiliary storage qualifier, OpenGL API
	commands forcing multiple shader invocations per fragment, or for other
	implementation-dependent reasons.

	When using the "pixel_interlock_unordered" or "sample_interlock_unordered"
	qualifier, the interlock will ensure that the critical sections of
	fragment shader invocations X and Y with overlapping coverage will never
	execute concurrently. That is, invocation X is guaranteed to complete its
	call to endInvocationInterlockARB() before invocation Y completes its call
	to beginInvocationInterlockARB(), or vice versa.

	When using the "pixel_interlock_ordered" or "sample_interlock_ordered"
	layout qualifier, the critical sections of invocations X and Y with
	overlapping coverage will be executed in a specific order, based on the
	relative order assigned to their fragments A and B. If fragment A is
	considered to precede fragment B, the critical section of invocation X is
	guaranteed to complete before the critical section of invocation Y begins.
	When a pair of fragments A and B have overlapping coverage, fragment A is
	considered to precede fragment B if

	* the OpenGL API command producing fragment A was called prior to the
	command producing B, or

	* the point, line, triangle, [[compatibility profile: quadrilateral,
	polygon,]] or patch primitive producing fragment A appears earlier in
	the same strip, loop, fan, or independent primitive list producing
	fragment B.

	When [[compatibility profile: decomposing quadrilateral or polygon
	primitives or]] tessellating a single patch primitive, multiple
	primitives may be generated in an undefined implementation-dependent
	order. When fragments A and B are generated from such unordered
	primitives, their ordering is also implementation-dependent.

	If fragment shader X completes its critical section before fragment shader
	Y begins its critical section, all stores to memory performed in the
	critical section of invocation X using a pointer, image uniform, atomic
	counter uniform, or buffer variable qualified by "coherent" are guaranteed
	to be visible to any reads of the same types of variable performed in the
	critical section of invocation Y.

	If multisampling is disabled, or if the framebuffer does not include
	sample buffers, fragment coverage is computed per-pixel. In this case,
	the "sample_interlock_ordered" or "sample_interlock_unordered" layout
	qualifiers are treated as "pixel_interlock_ordered" or
	"pixel_interlock_unordered", respectively.

	Syntax:

	void beginInvocationInterlockARB(void);
	void endInvocationInterlockARB(void);

	Description:

	The beginInvocationInterlockARB() and endInvocationInterlockARB() may only
	be placed inside the function main() of a fragment shader and may not be
	called within any flow control. These functions may not be called after a
	return statement in the function main(), but may be called after a discard
	statement. A compile- or link-time error will be generated if main()
	calls either function more than once, contains a call to one function
	without a matching call to the other, or calls endInvocationInterlockARB()
	before calling beginInvocationInterlockARB().

	Additions to the AGL/GLX/WGL Specifications

	None.

	Errors

	None.

	New State

	None.

	New Implementation Dependent State

	None.

	Issues

	(1) When using multisampling, the OpenGL specification permits
	multiple fragment shader invocations to be generated for a single
	fragment. For example, per-sample shading using the "sample"
	auxiliary storage qualifier or the MinSampleShading() OpenGL API command
	can be used to force per-sample shading. What execution ordering
	guarantees are provided between fragment shader invocations generated
	from the same fragment?

	RESOLVED: We don't provide any ordering guarantees in this extension.
	This implies that when using multisampling, there is no guarantee that
	two fragment shader invocations for the same fragment won't be executing
	their critical sections concurrently. This could cause problems for
	algorithms sharing data structures between all the samples of a pixel
	unless accesses to these data structures are performed atomically.

	When using per-sample shading, the interlock we provide does guarantee
	that no two invocations corresponding to the same sample execute the
	critical section concurrently. If a separate set of data structures is
	provided for each sample, no conflicts should occur within the critical
	section.

	Note that in addition to the per-sample shading options in the shading
	language and API, implementations may provide multisample antialiasing
	modes where the implementation can't simply run the fragment shader once
	and broadcast results to a large set of covered samples.
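
	The per-sample case described in this issue might be sketched as
	follows, assuming the application allocates one image layer per sample so
	that invocations for different samples never touch the same memory. The
	names "perSampleColor" and "srcColor", the binding, and the format are
	hypothetical.

	```glsl
	#version 450
	#extension GL_ARB_fragment_shader_interlock : require

	layout(sample_interlock_ordered) in;

	// Hypothetical layered image: layer index == sample index.
	layout(binding = 0, rgba32f) coherent uniform image2DArray perSampleColor;

	layout(location = 0) in vec4 srcColor;   // assumed premultiplied alpha

	void main()
	{
	    // Any static use of gl_SampleID causes the shader to be evaluated
	    // per sample, so each invocation owns exactly one layer here.
	    ivec3 p = ivec3(ivec2(gl_FragCoord.xy), gl_SampleID);

	    beginInvocationInterlockARB();
	    vec4 dst = imageLoad(perSampleColor, p);
	    imageStore(perSampleColor, p, srcColor + dst * (1.0 - srcColor.a));
	    endInvocationInterlockARB();
	}
	```

	If all samples of a pixel instead shared one data structure, the
	accesses would need to be atomic, as noted in the resolution above.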

	(2) What performance differences are expected between shaders using the
	"pixel" and "sample" layout qualifier variants in this extension (e.g.,
	"pixel_interlock_ordered" and "sample_interlock_ordered")?

	RESOLVED: We expect that shaders using "sample" qualifiers may have
	higher performance, since the implementation need not order pairs of
	fragments that touch the same pixel with "complementary" coverage. Such
	situations are fairly common: when two adjacent triangles combine to
	cover a given pixel, two fragments will be generated for the pixel but
	no sample will be covered by both. When using "sample" qualifiers, the
	invocations for both fragments can run concurrently. When using "pixel"
	qualifiers, the critical section for one fragment must wait until the
	critical section for the other fragment completes.

	(3) What performance differences are expected between shaders using the
	"ordered" and "unordered" layout qualifier variants in this extension
	(e.g., "pixel_interlock_ordered" and "pixel_interlock_unordered")?

	RESOLVED: We expect that shaders using "unordered" may have higher
	performance, since the critical section implementation doesn't need to
	ensure that all previous invocations with overlapping coverage have
	completed their critical sections. Some algorithms (e.g., building data
	structures in order-independent transparency algorithms) will require
	mutual exclusion when updating per-pixel data structures, but do not
	require that shaders execute in a specific ordering.
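
	A sketch of such an order-independent use is shown below: each
	invocation appends its fragment to a fixed-size per-pixel array, which
	needs mutual exclusion but not primitive order. MAX_FRAGS, "fragCount",
	"FragData", "screenWidth", and "color" are hypothetical names chosen for
	this example, and the storage layout is only one possible choice.

	```glsl
	#version 450
	#extension GL_ARB_fragment_shader_interlock : require

	layout(pixel_interlock_unordered) in;

	const int MAX_FRAGS = 8;   // hypothetical per-pixel capacity

	// Hypothetical per-pixel fragment count and fragment storage.
	layout(binding = 0, r32ui) coherent uniform uimage2D fragCount;
	layout(std430, binding = 0) coherent buffer FragData { vec4 frags[]; };

	uniform int screenWidth;   // hypothetical: framebuffer width in pixels

	layout(location = 0) in vec4 color;

	void main()
	{
	    ivec2 p    = ivec2(gl_FragCoord.xy);
	    int   base = (p.y * screenWidth + p.x) * MAX_FRAGS;

	    beginInvocationInterlockARB();
	    uint n = imageLoad(fragCount, p).x;
	    if (n < uint(MAX_FRAGS)) {
	        // Store color and depth; sorting is left to a later resolve pass.
	        frags[base + int(n)] = vec4(color.rgb, gl_FragCoord.z);
	        imageStore(fragCount, p, uvec4(n + 1u));
	    }
	    endInvocationInterlockARB();
	}
	```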

	(4) Are fragment shaders using this extension allowed to write outputs?
	If so, is there any guarantee on the order in which such outputs are
	written to the framebuffer?

	RESOLVED: Yes, fragment shaders with critical sections may still write
	outputs. If fragment shader outputs are written, they are stored or
	blended into the framebuffer in API order, as is the case for fragment
	shaders not using this extension.

	(5) What considerations apply when using this extension to implement a
	programmable form of conventional blending using image stores?

	RESOLVED: Per-fragment operations performed in the pipeline following
	fragment shader execution obviously have no effect on image stores
	executing during fragment shader execution. In particular, multisample
	operations such as broadcasting a single fragment output to multiple
	samples or modifying the coverage with alpha-to-coverage or a shader
	coverage mask output value have no effect. Fragments cannot be killed
	before fragment shader blending using the fixed-function alpha test or
	using the depth test with a Z value produced by the shader. Fragments
	will normally not be killed by fixed-function depth or stencil tests,
	but those tests can be enabled before fragment shader invocations using
	the layout qualifier "early_fragment_tests". Any required
	fixed-function features that need to be handled before programmable
	blending that aren't enabled by "early_fragment_tests" would need to be
	emulated in the shader.

	Note also that blend computations performed in the shader are not
	guaranteed to produce results that are bit-identical to those produced
	by fixed-function blending hardware, even if mathematically equivalent
	algorithms are used.
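
	A minimal sketch of programmable blending along these lines is given
	below. It assumes the color buffer is bound as an image named
	"colorBuffer" (hypothetical, as are the binding, the format, and the
	input "srcColor") and emulates a conventional source-over blend with
	non-premultiplied alpha. "early_fragment_tests" is used so that the
	fixed-function depth and stencil tests run before the image store.

	```glsl
	#version 450
	#extension GL_ARB_fragment_shader_interlock : require

	layout(early_fragment_tests) in;      // depth/stencil tests run first
	layout(pixel_interlock_ordered) in;   // blend in primitive order

	// Hypothetical image view of the color buffer.
	layout(binding = 0, rgba8) coherent uniform image2D colorBuffer;

	layout(location = 0) in vec4 srcColor;

	void main()
	{
	    ivec2 p = ivec2(gl_FragCoord.xy);

	    beginInvocationInterlockARB();
	    vec4 dst = imageLoad(colorBuffer, p);
	    vec4 result;
	    result.rgb = srcColor.rgb * srcColor.a + dst.rgb * (1.0 - srcColor.a);
	    result.a   = srcColor.a + dst.a * (1.0 - srcColor.a);
	    imageStore(colorBuffer, p, result);
	    endInvocationInterlockARB();
	}
	```

	This sketch does not emulate alpha test, alpha-to-coverage, or
	per-sample coverage; an application needing those would have to handle
	them in the shader as described above.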

	(6) For operations accessing shared per-pixel data structures in the
	critical section, what operations (if any) must be performed in shader
	code to ensure that stores from one shader invocation are visible to
	the next?

	RESOLVED: The "coherent" qualifier is required in the declaration of
	the shared data structures to ensure that writes performed by one
	invocation are visible to reads performed by another invocation.

	In shaders that don't use the interlock, "coherent" is not sufficient as
	there is no guarantee of the ordering of fragment shader invocations --
	even if invocation A can see the values written by another invocation B,
	there is no general guarantee that invocation A's read will be performed
	before invocation B's write. The built-in function memoryBarrier() can
	be used to generate a weak ordering by which threads can communicate,
	but it doesn't order memory transactions between two separate
	invocations. With the interlock, execution ordering between two threads
	from the same pixel is well-defined as long as the loads and stores are
	performed inside the critical section, and the use of "coherent" ensures
	that stores done by one invocation are visible to other invocations.

	(7) Should we provide an explicit mechanism for shaders to indicate a
	critical section? Or should we just automatically infer a critical
	section by analyzing shader code? Or should we just wrap the entire
	fragment shader in a critical section?

	RESOLVED: Provide an explicit critical section.

	We definitely don't want to wrap the entire shader in a critical section
	when a smaller section will suffice. Doing so would hold off the
	execution of any other fragment shader invocation with the same (x,y)
	for the entire (potentially long) life of the fragment shader. Hardware
	would need to track a large number of fragments awaiting execution, and
	may be so backed up that further fragments will be blocked even if they
	don't overlap with any fragments currently executing. Providing a
	smaller critical section reduces the amount of time other fragments are
	blocked and allows implementations to perform useful work for
	conflicting fragments before they hit the critical section.

	While a compiler could analyze the code and wrap a critical section
	around all memory accesses, it may be difficult to determine which
	accesses actually require mutual exclusion and ordering, and which
	accesses are safe to do with no protection. Requiring shaders to
	explicitly identify a critical section doesn't seem overwhelmingly
	burdensome, and allows applications to exclude memory accesses that they
	know to be "safe".
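
	The benefit of a small critical section can be sketched as follows:
	texture fetches and lighting math that touch no shared memory run
	before the "begin" call, so only the read-modify-write of the shared
	image is serialized. All resource and variable names here ("albedoMap",
	"accumBuffer", "lightDir", "uv", "normal") are hypothetical.

	```glsl
	#version 450
	#extension GL_ARB_fragment_shader_interlock : require

	layout(pixel_interlock_ordered) in;

	layout(binding = 0) uniform sampler2D albedoMap;                    // hypothetical
	layout(binding = 0, rgba16f) coherent uniform image2D accumBuffer;  // hypothetical

	uniform vec3 lightDir;   // hypothetical, assumed normalized

	layout(location = 0) in vec2 uv;
	layout(location = 1) in vec3 normal;

	void main()
	{
	    // Unprotected work: no shared memory is accessed here, so other
	    // invocations for the same pixel are not held off.
	    vec4  albedo = texture(albedoMap, uv);
	    float ndotl  = max(dot(normalize(normal), -lightDir), 0.0);
	    vec4  src    = vec4(albedo.rgb * ndotl, albedo.a);
	    ivec2 p      = ivec2(gl_FragCoord.xy);

	    // Protected work: only the shared per-pixel update.
	    beginInvocationInterlockARB();
	    vec4 dst = imageLoad(accumBuffer, p);
	    imageStore(accumBuffer, p, dst + src);
	    endInvocationInterlockARB();
	}
	```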

	(8) What restrictions should be imposed on the use of the
	beginInvocationInterlockARB() and endInvocationInterlockARB() functions
	delimiting a critical section?

	RESOLVED: We impose restrictions similar to those on the barrier()
	built-in function in tessellation control shaders to ensure that any
	shader using this functionality has a single critical section that can
	be easily identified during compilation. In particular, we require that
	these functions be called in main() and don't permit them to be called
	in conditional flow control.

	These restrictions ensure that there is always exactly one call to the
	"begin" and "end" functions in a predictable location in the compiled
	shader code, and ensure that the compiler and hardware don't have to
	deal with unusual cases (like entering a critical section and never
	leaving, leaving a critical section without entering it, or trying to
	enter a critical section more than once).
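
	The sketch below contrasts one accepted placement with several
	placements that these restrictions are intended to reject; the rejected
	forms are left in comments so that the shader itself remains valid. The
	names "counter" and "enableUpdate" are hypothetical. The sketch assumes
	that flow control between the two calls is acceptable, since the
	restrictions apply only to the calls themselves.

	```glsl
	#version 450
	#extension GL_ARB_fragment_shader_interlock : require

	layout(pixel_interlock_ordered) in;

	layout(binding = 0, r32ui) coherent uniform uimage2D counter;  // hypothetical
	uniform bool enableUpdate;                                     // hypothetical

	// Rejected: call placed outside main().
	// void lock() { beginInvocationInterlockARB(); }

	void main()
	{
	    ivec2 p = ivec2(gl_FragCoord.xy);

	    // Rejected: call placed inside flow control.
	    // if (enableUpdate) { beginInvocationInterlockARB(); }

	    // Rejected: "end" before "begin".
	    // endInvocationInterlockARB();

	    beginInvocationInterlockARB();
	    if (enableUpdate) {
	        uint n = imageLoad(counter, p).x;
	        imageStore(counter, p, uvec4(n + 1u));
	    }
	    endInvocationInterlockARB();

	    // Rejected: a second critical section in the same shader.
	    // beginInvocationInterlockARB();
	    // endInvocationInterlockARB();
	}
	```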

	Revision History

	Rev.    Date    Author       Changes
	----  --------  -----------  -----------------------------------------
	1     04/01/15  S.Grajewski  Initial version merging
	                             INTEL_fragment_shader_ordering with
	                             NV_fragment_shader_interlock

	2     05/07/15  S.Grajewski  Built-in functions
	                             beginInvocationInterlockARB() and
	                             endInvocationInterlockARB() now have ARB
	                             suffixes.