Name
ARB_compute_shader
Name Strings
GL_ARB_compute_shader
Contact
Graham Sellers, AMD (graham.sellers 'at' amd.com)
Contributors
Pat Brown, NVIDIA
Daniel Koch, TransGaming
John Kessenich
Members of the ARB working group
Notice
Copyright (c) 2012-2014 The Khronos Group Inc. Copyright terms at
http://www.khronos.org/registry/speccopyright.html
Specification Update Policy
Khronos-approved extension specifications are updated in response to
issues and bugs prioritized by the Khronos OpenGL Working Group. For
extensions which have been promoted to a core Specification, fixes will
first appear in the latest version of that core Specification, and will
eventually be backported to the extension document. This policy is
described in more detail at
https://www.khronos.org/registry/OpenGL/docs/update_policy.php
Status
Complete.
Approved by the ARB on 2012/06/12.
Version
Last Modified Date: December 10, 2018
Revision: 28
Number
ARB Extension #122
Dependencies
OpenGL 4.2 is required.
This extension is written based on the wording of the OpenGL 4.2 (Core
Profile) specification, and on the wording of the OpenGL Shading Language
(GLSL) Specification, version 4.20.
This extension interacts with OpenGL 4.3 and
ARB_shader_storage_buffer_object.
This extension interacts with NV_vertex_buffer_unified_memory.
Overview
Recent graphics hardware has become extremely powerful and a strong desire
to harness this power for work (both graphics and non-graphics) that does
not fit the traditional graphics pipeline well has emerged. To address
this, this extension adds a new single-stage program type known as a
compute program. This program may contain one or more compute shaders
which may be launched in a manner that is essentially stateless. This allows
arbitrary workloads to be sent to the graphics hardware with minimal
disturbance to the GL state machine.
In most respects, a compute program is identical to a traditional OpenGL
program object, with similar status, uniforms, and other such properties.
It has access to many of the same resources as fragment and other shader
types, such as textures, image variables, atomic counters, and so on.
However, it has no predefined inputs nor any fixed-function outputs. It
cannot be part of a pipeline and its visible side effects are through its
actions on images and atomic counters.
OpenCL is another solution for using graphics processors as generalized
compute devices. This extension addresses a different need. For example,
OpenCL is designed to be usable on a wide range of devices ranging from
CPUs, GPUs, and DSPs through to FPGAs. While one could implement GL on these
types of devices, the target here is clearly GPUs. Another difference is
that OpenCL is more full featured and includes features such as multiple
devices, asynchronous queues and strict IEEE semantics for floating point
operations. This extension follows the semantics of OpenGL - implicitly
synchronous, in-order operation with single-device, single queue
logical architecture and somewhat more relaxed numerical precision
requirements. Although not as feature rich, this extension offers several
advantages for applications that can tolerate the omission of these
features. Compute shaders are written in GLSL, for example, and so code may
be shared between compute and other shader types. Objects are created and
owned by the same context as the rest of the GL, and therefore no
interoperability API is required and objects may be freely used by both
compute and graphics simultaneously without acquire-release semantics or
object type translation.
New Procedures and Functions
void DispatchCompute(uint num_groups_x,
uint num_groups_y,
uint num_groups_z);
void DispatchComputeIndirect(intptr indirect);
New Tokens
Accepted by the <type> parameter of CreateShader and returned in the
<params> parameter by GetShaderiv:
COMPUTE_SHADER 0x91B9
Accepted by the <pname> parameter of GetIntegerv, GetBooleanv, GetFloatv,
GetDoublev and GetInteger64v:
MAX_COMPUTE_UNIFORM_BLOCKS 0x91BB
MAX_COMPUTE_TEXTURE_IMAGE_UNITS 0x91BC
MAX_COMPUTE_IMAGE_UNIFORMS 0x91BD
MAX_COMPUTE_SHARED_MEMORY_SIZE 0x8262
MAX_COMPUTE_UNIFORM_COMPONENTS 0x8263
MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS 0x8264
MAX_COMPUTE_ATOMIC_COUNTERS 0x8265
MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS 0x8266
MAX_COMPUTE_WORK_GROUP_INVOCATIONS 0x90EB
Accepted by the <pname> parameter of GetIntegeri_v, GetBooleani_v,
GetFloati_v, GetDoublei_v and GetInteger64i_v:
MAX_COMPUTE_WORK_GROUP_COUNT 0x91BE
MAX_COMPUTE_WORK_GROUP_SIZE 0x91BF
Accepted by the <pname> parameter of GetProgramiv:
COMPUTE_WORK_GROUP_SIZE 0x8267
Accepted by the <pname> parameter of GetActiveUniformBlockiv:
UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER 0x90EC
Accepted by the <pname> parameter of GetActiveAtomicCounterBufferiv:
ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER 0x90ED
Accepted by the <target> parameters of BindBuffer, BufferData,
BufferSubData, MapBuffer, UnmapBuffer, GetBufferSubData, and
GetBufferPointerv:
DISPATCH_INDIRECT_BUFFER 0x90EE
Accepted by the <value> parameter of GetIntegerv, GetBooleanv,
GetInteger64v, GetFloatv, and GetDoublev:
DISPATCH_INDIRECT_BUFFER_BINDING 0x90EF
Accepted by the <stages> parameter of UseProgramStages:
COMPUTE_SHADER_BIT 0x00000020
Additions to Chapter 2 of the OpenGL 4.2 (Core Profile) Specification
(OpenGL Operation)
In section 2.9.1, "Creating and Binding Buffer Objects", add to table 2.8
(p.43):
                                                        Described
Target name               Purpose                       in section(s)
------------------------  ----------------------------  -------------
DISPATCH_INDIRECT_BUFFER  Indirect compute dispatch     5.5
                          commands
Add to the end of section 2.9.8, "Indirect Commands In Buffer Objects"
(p. 53):
Arguments to the DispatchComputeIndirect command are stored in buffer
objects as a group of three unsigned integers.
A buffer object is bound to DISPATCH_INDIRECT_BUFFER by calling BindBuffer
with target set to DISPATCH_INDIRECT_BUFFER, and buffer set to the name of
the buffer object. If no corresponding buffer object exists, one is
initialized as defined in section 2.9.
DispatchComputeIndirect sources its arguments from the buffer object whose
name is bound to DISPATCH_INDIRECT_BUFFER, using the <indirect> parameter as
an offset into the buffer object in the same fashion as described in
section 2.9.6. An INVALID_OPERATION error is generated if this command
sources data beyond the end of the buffer object, if zero is bound to
DISPATCH_INDIRECT_BUFFER, or if <indirect> is less than zero or not a
multiple of the size, in basic machine units, of uint.
In section 2.11, "Vertex Shaders", modify the introductory text on shaders
to include compute shaders (second paragraph, p. 56):
In addition to vertex shaders, tessellation control..., geometry shaders,
fragment shaders, and compute shaders can be created, compiled, and linked
into program objects. .... (section 3.10). Compute shaders perform
general computations for dispatched arrays of shader invocations (section
5.5), but do not operate on primitives processed by the other shader
types. ...
In section 2.11.3, "Program Objects", add to the reasons that LinkProgram
may fail, p. 61:
* The program object contains objects to form a compute shader (see
section 5.5) and objects to form any other type of shader.
In section 2.11.3, modify the description of active programs (last
paragraph, p. 61, first paragraph, p. 62):
... geometry shader stages, those stages are ignored. If there is no
active program for the compute shader stage, compute dispatches will
generate an error. The active program for the compute shader stage has no
effect on the processing of vertices, geometric primitives, and fragments,
and the active program for all other shader stages has no effect on
compute dispatches.
In section 2.11.4, "Program Pipeline Objects", modify the description of
UseProgramStages, p. 65:
The executables in a program object... becomes current. These stages may
include vertex, tessellation control, tessellation evaluation, geometry,
fragment, or compute, indicated by VERTEX_SHADER_BIT,
TESS_CONTROL_SHADER_BIT, TESS_EVALUATION_SHADER_BIT, GEOMETRY_SHADER_BIT,
FRAGMENT_SHADER_BIT, or COMPUTE_SHADER_BIT, respectively. ...
In the unnumbered "Validation" section of section 2.11.12 "Shader
Execution", modify the list of validation errors, pp. 112-113:
This error is generated by any command that transfers vertices to the GL
or launches compute work if:
* (last bullet, p. 112) One program object is active... first program
object was active. The active compute shader is ignored for the
purposes of this test.
* (2nd bullet, p. 113) There is no current program specified by
UseProgram, there is a current program pipeline object, and the
current program for any shader stage has been relinked since...
* (3rd bullet, p. 113) Any two active samplers in the set of active
program objects are of different types but refer to the same texture
image unit.
* (4th bullet, p. 113) The sum of the number of active samplers for each
active program exceeds the maximum number of texture image units
allowed.
Modify the paragraph describing ValidateProgram, p. 113:
... If validation succeeded, ... set to FALSE. If validation succeeded,
no INVALID_OPERATION validation error will be generated if <program> were
made current via UseProgram, given the current state. If validation
failed, such errors will be generated under the current state.
Modify the paragraph describing ValidateProgramPipeline, p. 114:
... can be queried with GetProgramPipelineiv (see section 6.1.12). If
validation succeeded, no INVALID_OPERATION validation error will be
generated if <pipeline> were bound and no program were made current via
UseProgram, given the current state. If validation failed, such errors
will be generated under the current state.
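For illustration only (the following is not spec language), a minimal C sketch
of exercising these validation queries, assuming a current context with loaded
entry points and previously created objects <prog> and <pipeline>:

    GLint status = GL_FALSE;

    glValidateProgram(prog);
    glGetProgramiv(prog, GL_VALIDATE_STATUS, &status);
    /* GL_TRUE: no INVALID_OPERATION validation error would be generated if
       <prog> were made current via UseProgram in the current state. */

    glValidateProgramPipeline(pipeline);
    glGetProgramPipelineiv(pipeline, GL_VALIDATE_STATUS, &status);
    /* GL_TRUE: no validation error would be generated if <pipeline> were
       bound with no program made current via UseProgram. */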
In subsection 2.11.12, "Shader Execution":
Add to the list of implementation dependent constants under the
"Texture Access" sub-heading:
MAX_COMPUTE_TEXTURE_IMAGE_UNITS (for compute shaders),
Add to the list of implementation dependent constants under the "Atomic
Counter Access" sub-heading:
MAX_COMPUTE_ATOMIC_COUNTERS (for compute shaders),
Add to the list of implementation dependent constants under the "Image
Access" sub-heading:
MAX_COMPUTE_IMAGE_UNIFORMS (for compute shaders),
In section 2.16, "Conditional Rendering", modify the sentence describing
conditional rendering, starting with "In this case"...
In this case, all drawing commands (see section 2.8.3), as well as
Clear and ClearBuffer* (see section 4.2.3), and compute dispatch
through DispatchCompute* (see section 5.5), have no effect.
In the "Shared Memory Access Synchronization" subsection of section
2.11.13, "Shader Memory Access", modify the description of
COMMAND_BARRIER_BIT (p. 118):
* COMMAND_BARRIER_BIT: Command data sourced from buffer objects by
Draw*Indirect and DispatchComputeIndirect commands ... The buffer
objects affected by this bit are derived from the DRAW_INDIRECT_BUFFER
and DISPATCH_INDIRECT_BUFFER bindings.
In subsection 2.17.7, "Uniform Variables", replace the paragraph beginning
"If <pname> is UNIFORM_BLOCK_REFERENCED_BY_VERTEX_SHADER,"... with:
If <pname> is UNIFORM_BLOCK_REFERENCED_BY_VERTEX_SHADER,
UNIFORM_BLOCK_REFERENCED_BY_TESS_CONTROL_SHADER,
UNIFORM_BLOCK_REFERENCED_BY_TESS_EVALUATION_SHADER,
UNIFORM_BLOCK_REFERENCED_BY_GEOMETRY_SHADER,
UNIFORM_BLOCK_REFERENCED_BY_FRAGMENT_SHADER or
UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER, then a boolean value indicating
whether the uniform block identified by uniformBlockIndex is referenced
by the vertex, tessellation control, tessellation evaluation, geometry,
fragment or compute programming stages of <program>, respectively, is
returned.
Also in subsection 2.17.7, "Uniform Variables", replace the paragraph
beginning, "If <pname> is ATOMIC_COUNTER_BUFFER_REFERENCED_BY_VERTEX_SHADER"
on p.80 with:
If <pname> is ATOMIC_COUNTER_BUFFER_REFERENCED_BY_VERTEX_SHADER,
ATOMIC_COUNTER_BUFFER_REFERENCED_BY_TESS_CONTROL_SHADER,
ATOMIC_COUNTER_BUFFER_REFERENCED_BY_TESS_EVALUATION_SHADER,
ATOMIC_COUNTER_BUFFER_REFERENCED_BY_GEOMETRY_SHADER,
ATOMIC_COUNTER_BUFFER_REFERENCED_BY_FRAGMENT_SHADER or
ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER, then a single boolean
value indicating whether the atomic counter buffer identified by
bufferIndex is referenced by the vertex, tessellation control, tessellation
evaluation, geometry, fragment or compute programming stages of
<program>, respectively, is returned.
Under the sub-heading "Uniform Blocks" in subsection 2.11.17, replace the
sentence beginning "The limits for vertex, tessellation ..." on p.92
with:
The limits for vertex, tessellation, geometry, fragment and compute
shaders can be obtained by calling GetIntegerv with <pname> set to
MAX_VERTEX_UNIFORM_BLOCKS, MAX_TESS_CONTROL_UNIFORM_BLOCKS,
MAX_TESS_EVALUATION_UNIFORM_BLOCKS, MAX_GEOMETRY_UNIFORM_BLOCKS,
MAX_FRAGMENT_UNIFORM_BLOCKS and MAX_COMPUTE_UNIFORM_BLOCKS, respectively.
Under the sub-heading "Atomic Counter Buffers" in subsection 2.11.17,
replace the sentence beginning "The limits for vertex, geometry, ..."
on p.96 with:
The limits for vertex, tessellation, geometry, fragment and compute
shaders can be obtained by calling GetIntegerv with <pname> set to
MAX_VERTEX_ATOMIC_COUNTER_BUFFERS, MAX_TESS_CONTROL_ATOMIC_COUNTER_BUFFERS,
MAX_TESS_EVALUATION_ATOMIC_COUNTER_BUFFERS,
MAX_GEOMETRY_ATOMIC_COUNTER_BUFFERS, MAX_FRAGMENT_ATOMIC_COUNTER_BUFFERS and
MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS, respectively.
Additions to Chapter 3 of the OpenGL 4.2 (Core Profile) Specification
(Rasterization)
None.
Additions to Chapter 4 of the OpenGL 4.2 (Core Profile) Specification
(Per-Fragment Operations and the Framebuffer)
None.
Additions to Chapter 5 of the OpenGL 4.2 (Core Profile) Specification
(Special Functions)
Add Section 5.5, "Compute Shaders"
In addition to graphics-oriented shading operations such as vertex,
tessellation, geometry and fragment shading, generic computation may be
performed by the GL through the use of compute shaders. The compute pipeline
is a form of single-stage machine that runs generic shaders. Compute shaders
are created as described in section 2.11.1 using a <type> parameter of
COMPUTE_SHADER. They are attached to and used in program objects as
described in section 2.11.3.
Compute workloads are formed from groups of work items called
_workgroups_ and processed by the executable code for a compute program.
A workgroup is a collection of shader invocations that execute the same code,
potentially in parallel. An invocation within a workgroup may share data
with other members of the same workgroup through shared variables and
issue memory and control barriers to synchronize with other members of the
same workgroup. One or more workgroups is launched by calling:
void DispatchCompute(uint num_groups_x,
uint num_groups_y,
uint num_groups_z);
Each workgroup is processed by the active program object for the
compute shader stage. The error INVALID_OPERATION will be generated if
there is no active program object for the compute shader stage. The
active program for the compute shader stage will be determined in the same
manner as the active program for other pipeline stages, as described in
section 2.11.3. While the individual shader invocations within a
workgroup are executed as a unit, workgroups are executed completely
independently and in unspecified order.
<num_groups_x>, <num_groups_y> and <num_groups_z> specify the number of
workgroups that will be dispatched in the X, Y and Z dimensions,
respectively. The builtin vector variable gl_NumWorkGroups will be
initialized with the contents of the <num_groups_x>, <num_groups_y> and
<num_groups_z> parameters. The maximum number of workgroups that may be
dispatched at one time may be determined by calling GetIntegeri_v with
<pname> set to MAX_COMPUTE_WORK_GROUP_COUNT and <index> set to zero, one,
or two, representing the X, Y, and Z dimensions, respectively. The
values in the <num_groups_x>, <num_groups_y> and <num_groups_z> array must
be less than or equal to the maximum workgroup count for the corresponding
dimension, otherwise an INVALID_VALUE error is generated. If the workgroup
count in any dimension is zero, no workgroups are dispatched.
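For illustration only (not spec language), a minimal C sketch that creates and
links a compute-only program and dispatches a 16 x 16 x 1 grid of workgroups;
the shader source, the version/extension directives, and the 8 x 8 local size
are illustrative assumptions, and error checking is omitted:

    /* Assumes a current GL 4.2 context exposing ARB_compute_shader and
       loaded entry points. */
    const GLchar *cs_src =
        "#version 420 core\n"
        "#extension GL_ARB_compute_shader : require\n"
        "layout(local_size_x = 8, local_size_y = 8) in;\n"
        "void main() { /* per-invocation work goes here */ }\n";

    GLuint cs   = glCreateShader(GL_COMPUTE_SHADER);
    GLuint prog = glCreateProgram();
    glShaderSource(cs, 1, &cs_src, NULL);
    glCompileShader(cs);
    glAttachShader(prog, cs);
    glLinkProgram(prog);              /* must contain only compute shaders */

    glUseProgram(prog);               /* active program for the compute stage */
    glDispatchCompute(16, 16, 1);     /* 16x16x1 workgroups of 8x8x1 invocations */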
The workgroup size in each dimension is specified at compile time
using an input layout qualifier in one or more of the compute shaders
attached to the program (see Section 4 of the OpenGL Shading Language
Specification). After the program has been linked, the workgroup size
of the program may be retrieved by calling GetProgramiv with <pname> set to
COMPUTE_WORK_GROUP_SIZE. This will return an array of three integers
containing the workgroup size of the compute program as specified by
its input layout qualifier(s). If <program> is the name of a program that
has not been successfully linked, or is the name of a linked program object
that contains no compute shaders, then an INVALID_OPERATION error is
generated.
The maximum size of a workgroup may be determined by calling
GetIntegeri_v with <pname> set to MAX_COMPUTE_WORK_GROUP_SIZE
and <index> set to 0, 1, or 2 to retrieve the maximum work size in the
X, Y and Z dimension, respectively. Furthermore, the maximum number of
invocations in a single workgroup (i.e., the product of the three
dimensions) may be determined by calling GetIntegerv with <pname> set to
MAX_COMPUTE_WORK_GROUP_INVOCATIONS.
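For illustration only (not spec language), a C sketch of querying these limits,
together with the workgroup size of an assumed, successfully linked compute
program <prog>:

    GLint max_count[3], max_size[3], max_invocations, group_size[3];
    for (int i = 0; i < 3; ++i) {
        glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, i, &max_count[i]);
        glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_SIZE,  i, &max_size[i]);
    }
    glGetIntegerv(GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS, &max_invocations);

    /* Size declared by the program's input layout qualifier(s); generates
       INVALID_OPERATION if <prog> is not a successfully linked compute
       program. */
    glGetProgramiv(prog, GL_COMPUTE_WORK_GROUP_SIZE, group_size);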
The command
void DispatchComputeIndirect(intptr indirect);
is equivalent (assuming no errors are generated) to calling
DispatchCompute with <num_groups_x>, <num_groups_y> and <num_groups_z>
initialized with the three uint values contained in the buffer currently
bound to the DISPATCH_INDIRECT_BUFFER binding at an offset, in basic
machine units, specified by <indirect>. The error INVALID_VALUE is
generated if <indirect> is less than zero or is not a multiple of four.
The error INVALID_OPERATION is generated if no buffer is bound to
DISPATCH_INDIRECT_BUFFER, if the command would source data beyond the end
of the buffer object, or if there is no active program for the compute
shader stage. If any of <num_groups_x>, <num_groups_y> or <num_groups_z>
is greater than MAX_COMPUTE_WORK_GROUP_COUNT for the corresponding
dimension then the results are undefined.
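For illustration only (not spec language), a C sketch of an indirect dispatch
equivalent to DispatchCompute(16, 16, 1); the buffer name <dib> is
illustrative, an active compute program is assumed, and error checking is
omitted:

    const GLuint groups[3] = { 16, 16, 1 };   /* num_groups_x, _y, _z */
    GLuint dib;
    glGenBuffers(1, &dib);
    glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER, dib);
    glBufferData(GL_DISPATCH_INDIRECT_BUFFER, sizeof(groups), groups,
                 GL_STATIC_DRAW);

    /* <indirect> is a byte offset into the bound buffer and must be a
       non-negative multiple of four. */
    glDispatchComputeIndirect(0);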
Add Subsection 5.5.1, "Compute Shader Variables"
Compute shaders can access variables belonging to the current program
object. The amount of storage in the default uniform block accessed by a
compute shader is specified by the value of the implementation dependent
constant MAX_COMPUTE_UNIFORM_COMPONENTS. The total amount of
combined storage available for uniform variables in all uniform blocks
accessed by a compute shader (including the default uniform block) is
specified by the implementation dependent constant
MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS.
There is a limit to the total size of all variables declared as
<shared> in a single program object. This limit, expressed in units of
basic machine units, may be queried as the value of
MAX_COMPUTE_SHARED_MEMORY_SIZE.
Additions to Chapter 6 of the OpenGL 4.2 (Core Profile) Specification
(State and State Requests)
None.
Additions to Chapter 2 of the OpenGL Shading Language Specification, Version
4.20 (Overview of OpenGL Shading)
Replace the last sentence of the first paragraph of the overview with
the following:
"Currently, these processors are the vertex, tessellation control,
tessellation evaluation, geometry, fragment, and compute processors."
Replace the last sentence of the second paragraph of the overview with
the following:
"The specific languages will be referred to by the name of the processor
they target: vertex, tessellation control, tessellation evaluation,
geometry, fragment, or compute."
Add a new Section 2.6 titled "Compute Processor" with the following text:
"The <compute processor> is a programmable unit that operates independently
from the other shader processors. Compilation units written in the OpenGL
Shading Language to run on this processor are called <compute shaders>.
When a complete set of compute shaders are compiled and linked, they
result in a <compute shader executable> that runs on the compute processor.
A compute shader has access to many of the same resources as fragment and
other shader processors, such as textures, buffers, image variables,
atomic counters, and so on. It does not have any predefined inputs
nor any fixed-function outputs. It is not part of the graphics pipeline
and its visible side effects are through actions on images, storage
buffers, and atomic counters.
A compute shader operates on a group of work items called a workgroup.
A workgroup is a collection of shader invocations that execute the same
code, potentially in parallel. An invocation within a workgroup may share data with
other members of the same workgroup through shared variables and issue
memory and control barriers to synchronize with other members of the same workgroup."
Additions to Chapter 4 of the OpenGL Shading Language Specification, Version
4.20 (Variables and Types)
Modify section 4.4.1, second paragraph from
"All shaders allow input layout qualifiers on input variable declarations."
to
"All shaders, except compute shaders, allow input layout location qualifiers on
input variable declarations."
Modify Section 4.3. Add to the table at the start of Section 4.3:
+-------------------+-----------------------------------------------------------+
| Storage Qualifier | Meaning |
+-------------------+-----------------------------------------------------------+
| <shared> | variable storage is shared across all work items in a |
| | workgroup for compute shaders |
+-------------------+-----------------------------------------------------------+
Add the following paragraph to Section 4.3.4, "Input Variables"
Compute shaders do not permit user-defined input variables and do not
form a formal interface with any other shader stage. See section 7.1
for a description of built-in compute shader input variables. All other
input to a compute shader is retrieved explicitly through image loads,
texture fetches, loads from uniforms or uniform buffers, or other user
supplied code. Redeclaration of built-in input variables in compute
shaders is not permitted.
Add the following paragraph to Section 4.3.6, "Output Variables"
Compute shaders have no built-in output variables, do not support
user-defined output variables and do not form a formal interface with any
other shader stage. All outputs from a compute shader take the form of the
side effects such as image stores and operations on atomic counters.
Add Section 4.3.7, "Shared", renumber subsequent sections
The <shared> qualifier is used to declare variables that have storage
shared between all work items of a compute shader workgroup.
Variables declared as <shared> may only be used in compute shaders
(see Section 5.5, "Compute Shaders"). Shared variables are implicitly
coherent. That is, writes to shared variables from one shader invocation
will eventually be seen by other invocations within the same workgroup.
Variables declared as <shared> may not have initializers and their
contents are undefined at the beginning of shader execution. Any data
written to <shared> variables will be visible to other shaders executing
the same shader within the same workgroup. Order of execution
with regards to reads and writes to the same <shared> variables by different
invocations of a shader is not defined. In order to achieve ordering with
respect to reads and writes to <shared> variables, memory barriers must be
employed using the barrier() function (see Section 8.15).
There is a limit to the total size of all variables declared as
<shared> in a single program object. This limit, expressed in units of
basic machine units may be determined by using the OpenGL API to query the
value of MAX_COMPUTE_SHARED_MEMORY_SIZE.
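For illustration only (not spec language), a sketch of a compute shader that
writes a <shared> array and uses barrier() to order those writes against later
reads; the source is shown as a C string as it might appear in application
code, and the image binding, format, and names are illustrative:

    const GLchar *shared_cs_src =
        "#version 420 core\n"
        "#extension GL_ARB_compute_shader : require\n"
        "layout(local_size_x = 64) in;\n"
        "layout(binding = 0, r32f) uniform image2D result;  // illustrative output\n"
        "shared float row[64];\n"
        "void main() {\n"
        "    uint i = gl_LocalInvocationID.x;\n"
        "    row[i] = float(i);\n"
        "    barrier();                 // order writes before the reads below\n"
        "    float neighbour = row[(i + 1u) % 64u];\n"
        "    imageStore(result, ivec2(gl_GlobalInvocationID.xy), vec4(neighbour));\n"
        "}\n";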
Add Section 4.4.1.4, "Compute-Shader Inputs"
There are no layout location qualifiers for compute shader inputs.
Layout qualifier identifiers for compute shader inputs are the workgroup
size qualifiers:
layout-qualifier-id
local_size_x = integer-constant
local_size_y = integer-constant
local_size_z = integer-constant
<local_size_x>, <local_size_y>, and <local_size_z> are used to define the
local size of the kernel defined by the compute shader in the first,
second, and third dimension, respectively. The default size in each
dimension is 1. If a shader does not specify a size for one of the
dimensions, that dimension will have a size of 1.
For example, the following declaration in a compute shader
layout (local_size_x = 32, local_size_y = 32) in;
is used to declare a two-dimensional compute shader with a local size of
32 x 32 elements, which is equivalent to a three-dimensional compute shader
whose third dimension is one element deep.
As another example, the declaration
layout (local_size_x = 8) in;
effectively specifies that a one-dimensional compute shader is being
compiled, and its size is 8 elements.
If the local size of the shader in any dimension is greater than the
maximum size supported by the implementation for that dimension, a
compile-time error results. Also, if such a layout qualifier is declared more
than once in the same shader, all those declarations must indicate the same
workgroup size; otherwise a compile-time error results. If multiple compute
shaders attached to a single program object declare the workgroup size,
the declarations must be identical; otherwise a link-time error results.
Furthermore, if a program object contains any compute shaders, at
least one must contain an input layout qualifier specifying the
workgroup sizes of the program, or a link-time error will occur.
Additions to Chapter 7 of the OpenGL Shading Language Specification, Version
4.20 (Built-in Variables)
Add to the start of Section 7.1, "Built-In Language Variables", before the
description of the vertex language built-in variables:
In the compute language, the built-in variables are declared as follows:
// workgroup dimensions
in uvec3 gl_NumWorkGroups;
const uvec3 gl_WorkGroupSize;
// workgroup and invocation IDs
in uvec3 gl_WorkGroupID;
in uvec3 gl_LocalInvocationID;
// derived variables
in uvec3 gl_GlobalInvocationID;
in uint gl_LocalInvocationIndex;
Add to the end of Section 7.1, before Section 7.1.1:
The built-in variable <gl_NumWorkGroups> is a compute-shader input
variable containing the number of workgroups, in each dimension, that
will execute the compute shader.
Its content is equal to the values specified in the <num_groups_x>,
<num_groups_y>, and <num_groups_z> parameters passed to the
DispatchCompute API entry point.
The built-in constant <gl_WorkGroupSize> is a compute-shader constant
containing the workgroup size of the shader. The size of the workgroup
in the X, Y, and Z dimensions is stored in the x, y, and z components.
The values stored in <gl_WorkGroupSize> match those specified in the
required <local_size_x>, <local_size_y>, and <local_size_z> layout
qualifiers for the current shader. This value is constant so that
it can be used to size arrays of memory that can be shared within
the workgroup.
The built-in variable <gl_WorkGroupID> is a compute-shader input
variable containing the 3-dimensional index of the global workgroup
that the current invocation is executing in. The possible values range
across the parameters passed into DispatchCompute, i.e., from (0, 0, 0) to
(gl_NumWorkGroups.x - 1, gl_NumWorkGroups.y - 1, gl_NumWorkGroups.z - 1).
The built-in variable <gl_LocalInvocationID> is a compute-shader input
variable containing the 3-dimensional index of the current invocation
within the workgroup it is executing in.
The possible values for this variable range across the workgroup
size, i.e. (0,0,0) to (gl_WorkGroupSize.x - 1, gl_WorkGroupSize.y - 1,
gl_WorkGroupSize.z - 1).
The built-in variable <gl_GlobalInvocationID> is a compute shader input
variable containing the global index of the current work item. This
value uniquely identifies this invocation from all other invocations
across all workgroups initiated by the current
DispatchCompute call. This is computed as:
gl_GlobalInvocationID =
gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID.
The built-in variable <gl_LocalInvocationIndex> is a compute shader
input variable that contains the 1-dimensional representation of the
gl_LocalInvocationID. This is useful for uniquely identifying a
unique region of shared memory within the workgroup for this
invocation to use. This is computed as:
gl_LocalInvocationIndex =
gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y +
gl_LocalInvocationID.y * gl_WorkGroupSize.x +
gl_LocalInvocationID.x;
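As a worked example of the two computations above (illustrative, not spec
language), consider a shader declared with gl_WorkGroupSize = (4, 2, 1)
launched by DispatchCompute(3, 2, 1). The invocation with
gl_WorkGroupID = (2, 1, 0) and gl_LocalInvocationID = (3, 1, 0) obtains

    gl_GlobalInvocationID   = (2, 1, 0) * (4, 2, 1) + (3, 1, 0) = (11, 3, 0)
    gl_LocalInvocationIndex = 0 * 4 * 2 + 1 * 4 + 3 = 7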
Add to the list of built-in constants in Section 7.3:
const ivec3 gl_MaxComputeWorkGroupCount = { 65535, 65535, 65535 };
const ivec3 gl_MaxComputeWorkGroupSize = { 1024, 1024, 64 };
const int gl_MaxComputeUniformComponents = 512;
const int gl_MaxComputeTextureImageUnits = 16;
const int gl_MaxComputeImageUniforms = 8;
const int gl_MaxComputeAtomicCounters = 8;
const int gl_MaxComputeAtomicCounterBuffers = 1;
Additions to Chapter 8 of the OpenGL Shading Language Specification, Version
4.20 (Built-in Functions)
Insert "Atomic Memory Functions" section after Section 8.10, Atomic
Counter Functions (p. 149). Atomic memory operations are supported on
shared variables; the set of operations and their definitions are similar
to those for the imageAtomic*() functions. These functions are fully
documented in the ARB_shader_storage_buffer_object extension (see
dependencies).
Modify the first paragraph of Section 8.15, "Shader Invocation Control
Functions" to read:
The shader invocation control function is only available in tessellation
control shaders and compute shaders. It is used to control the relative
execution order of multiple shader invocations used to process a patch
(in the case of tessellation control shaders) or a workgroup (in the
case of compute shaders), which are otherwise executed with an undefined
order.
+----------------+--------------------------------------------------------------------------+
| Syntax | Description |
+----------------+--------------------------------------------------------------------------+
| barrier | For any given static instance of barrier() appearing in a tessellation |
| | control shader or compute shader, all invocations for a single patch |
| | or workgroup, respectively, must enter it before any will continue |
| | beyond it. |
+----------------+--------------------------------------------------------------------------+
Modify the second paragraph as follows:
... Because invocations may execute in an undefined order between these
barrier calls, the values of a per-vertex or per-patch output variable in
a tessellation control shader or shared variables for compute shaders
will be undefined in a number of cases enumerated in Section 4.3.6 "Output
Variables" (for tessellation control shaders) and Section 4.3.7 "Shared
Variables" (for compute shaders).
Replace the third paragraph with the following:
For tessellation control shaders, the barrier() function may only be
placed inside the function main() of the tessellation control shader and
may not be called within any control flow. Barriers are also disallowed
after a return statement in the function main(). Any such misplaced
barriers result in a compile-time error.
For compute shaders, the barrier() function may be placed within flow
control, but that flow control must be uniform flow control. That is, all
the controlling expressions that lead to execution of the barrier must be
dynamically uniform expressions. This ensures that if any shader
invocation enters a conditional statement, then all invocations will enter
it. While compilers are encouraged to give warnings if they can detect
this might not happen, compilers cannot completely determine this. Hence,
it is the author's responsibility to ensure barrier() only exists inside
uniform flow control. Otherwise, some shader invocations will stall
indefinitely, waiting for a barrier that is never reached by other
invocations.
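For illustration only (not spec language), a sketch, shown as a C string, of
barrier() placed inside uniform flow control: the loop bound is a uniform and
therefore dynamically uniform, so every invocation in the workgroup reaches
the same barriers; the uniform name and local size are illustrative:

    const GLchar *uniform_barrier_cs_src =
        "#version 420 core\n"
        "#extension GL_ARB_compute_shader : require\n"
        "layout(local_size_x = 128) in;\n"
        "shared uint counts[128];\n"
        "uniform uint passes;        // dynamically uniform loop bound\n"
        "void main() {\n"
        "    counts[gl_LocalInvocationIndex] = 0u;\n"
        "    for (uint p = 0u; p < passes; ++p) {\n"
        "        counts[gl_LocalInvocationIndex] += 1u;\n"
        "        barrier();          // reached by every invocation on every pass\n"
        "    }\n"
        "    // Placing barrier() under a test of gl_LocalInvocationID instead\n"
        "    // would be non-uniform and could stall invocations indefinitely.\n"
        "}\n";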
Modify the table of memory control functions on p.160,
+-----------------------------------+----------------------------------------------------------------------------------------+
| Syntax | Description |
+-----------------------------------+----------------------------------------------------------------------------------------+
| void memoryBarrier() | Control the ordering of all memory transactions issued by a single shader invocation. |
+-----------------------------------+----------------------------------------------------------------------------------------+
| void memoryBarrierAtomicCounter() | Control the ordering of accesses to atomic counter variables issued by a single shader |
| | invocation. |
+-----------------------------------+----------------------------------------------------------------------------------------+
| void memoryBarrierBuffer() | Control the ordering of memory transactions to buffer variables issued within a |
| | single shader invocation. |
+-----------------------------------+----------------------------------------------------------------------------------------+
| void memoryBarrierImage() | Control the ordering of memory transactions to images issued within a single shader |
| | invocation. |
+-----------------------------------+----------------------------------------------------------------------------------------+
| void memoryBarrierShared() | Control the ordering of memory transactions to shared variables issued within a single |
| | shader invocation. |
| | Only available in compute shaders. |
+-----------------------------------+----------------------------------------------------------------------------------------+
| void groupMemoryBarrier() | Control the ordering of all memory transactions issued within a single shader |
| | invocation, as viewed by other invocations in the same workgroup. |
| | Only available in compute shaders. |
+-----------------------------------+----------------------------------------------------------------------------------------+
Modify the subsequent paragraph as follows:
The memory barrier built-in functions can be used to order reads and
writes to variables stored in memory accessible to other shader
invocations. When called, these functions will wait for the completion of
all reads and writes previously performed by the caller that access
selected variable types, and then return with no other effect. The
built-in functions memoryBarrierAtomicCounter(), memoryBarrierBuffer(),
memoryBarrierImage(), and memoryBarrierShared() wait for the completion of
accesses to atomic counter, buffer, image, and shared variables,
respectively. The built-in functions memoryBarrier() and
groupMemoryBarrier() wait for the completion of accesses to all of the
above variable types. The functions memoryBarrierShared() and
groupMemoryBarrier() are available only in compute shaders; the other
functions are available in all shader types.
When these functions return, any memory stores performed using coherent
variables prior to the call will be visible to any future coherent access
to the same memory performed by any other shader invocation. In
particular, the values written this way in one shader stage are guaranteed
to be visible to coherent memory accesses performed by shader invocations
in subsequent stages when those invocations were triggered by the
execution of the original shader invocation (e.g., fragment shader
invocations for a primitive resulting from a particular geometry shader
invocation).
Additionally, memory barrier functions order stores performed by the
calling invocation, as observed by other shader invocations. Without
memory barriers, if one shader invocation performs two stores to coherent
variables, a second shader invocation might see the values written by the
second store prior to seeing those written by the first. However, if the
first shader invocation calls a memory barrier function between the two
stores, selected other shader invocations will never see the results of
the second store before seeing those of the first. When using the
function groupMemoryBarrier(), this ordering guarantee applies only to
other shader invocations in the same compute shader workgroup; all other
memory barrier functions provide the guarantee to all other shader
invocations. No memory barrier is required to guarantee the order of
memory stores as observed by the invocation performing the stores; an
invocation reading from a variable that it previously wrote will always
see the most recently written value unless another shader invocation also
wrote to the same memory.
Dependencies on OpenGL 4.3 and ARB_shader_storage_buffer_object
If OpenGL 4.3 and ARB_shader_storage_buffer_object are not supported, the
spec language adding the built-in functions atomicAdd(), atomicMin(),
atomicMax(), atomicAnd(), atomicOr(), atomicXor(), atomicExchange(), and
atomicCompSwap() should be considered to be incorporated into this
extension as-is, except that buffer variables will not be supported and
thus cannot be used with these functions. No "#extension" directive is
necessary to use these functions in compute shaders.
If OpenGL 4.3 and ARB_shader_storage_buffer_object are not supported,
references to the GLSL built-in function memoryBarrierBuffer() should be
removed.
Dependencies on NV_vertex_buffer_unified_memory
If NV_vertex_buffer_unified_memory is supported, a new buffer address
range and enable is provided to permit the use of
DispatchComputeIndirect with a resident buffer object without requiring
that it be bound to the DISPATCH_INDIRECT_BUFFER target. The following
additional edits apply:
Accepted by the <target> parameter of GetBufferParameterui64vNV:
DISPATCH_INDIRECT_BUFFER (defined above)
Accepted by the <cap> parameter of Disable, Enable, and IsEnabled, and by
the <pname> parameter of GetIntegerv, GetBooleanv, GetFloatv, GetDoublev
and GetInteger64v:
DISPATCH_INDIRECT_UNIFIED_NV 0x90FD
Accepted by the <pname> parameter of BufferAddressRangeNV
and the <value> parameter of GetIntegerui64vNV:
DISPATCH_INDIRECT_ADDRESS_NV 0x90FE
Accepted by the <value> parameter of GetIntegerv:
DISPATCH_INDIRECT_LENGTH_NV 0x90FF
Add to the end of Section 5.5, after discussion of
DispatchComputeIndirect:
If DISPATCH_INDIRECT_UNIFIED_NV is enabled, DispatchComputeIndirect does
not use the buffer bound to DISPATCH_INDIRECT_BUFFER. Instead, it sources
its arguments from the GPU address range specified by calling
BufferAddressRangeNV with a <pname> of DISPATCH_INDIRECT_ADDRESS_NV and an
<index> of zero. The address is obtained by adding the <indirect>
parameter to the base address of the range, specified by the <address>
parameter of BufferAddressRangeNV. If the command sources data outside
the specified address range, the error INVALID_OPERATION will be
generated. The DISPATCH_INDIRECT_BUFFER binding will be ignored in this
case, and no errors will be generated due to the use of this binding. The
error INVALID_VALUE will still be generated if <indirect> is negative. No
INVALID_VALUE error will be generated if <indirect> is not a multiple of
four, but INVALID_OPERATION will be generated if the effective address is
not a multiple of four. If the indirect dispatch address range does not
belong to a buffer object that is resident at the time of the
DispatchComputeIndirect call, undefined results, possibly including
program termination, may occur.
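For illustration only (not spec language), a C sketch of this path, assuming
both this extension and NV_vertex_buffer_unified_memory (with
NV_shader_buffer_load for residency and address queries) are supported; <dib>
is an illustrative buffer already containing the three dispatch arguments:

    GLuint64EXT addr = 0;
    glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER, dib);
    glMakeBufferResidentNV(GL_DISPATCH_INDIRECT_BUFFER, GL_READ_ONLY);
    glGetBufferParameterui64vNV(GL_DISPATCH_INDIRECT_BUFFER,
                                GL_BUFFER_GPU_ADDRESS_NV, &addr);

    glEnable(GL_DISPATCH_INDIRECT_UNIFIED_NV);
    glBufferAddressRangeNV(GL_DISPATCH_INDIRECT_ADDRESS_NV, 0, addr,
                           3 * sizeof(GLuint));
    glDispatchComputeIndirect(0);   /* sources the three uints at <addr> + 0 */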
Add the following to the "Compute Dispatch State" table defined in this
extension:
Get Value Type Get Command Initial Value Sec Attribute
--------- ---- ----------- ------------- --- ---------
DISPATCH_INDIRECT_UNIFIED_NV B IsEnabled FALSE 5.5 none
DISPATCH_INDIRECT_ADDRESS_NV Z64+ GetIntegerui64vNV 0 5.5 none
DISPATCH_INDIRECT_LENGTH_NV Z+ GetIntegerv 0 5.5 none
Errors
INVALID_OPERATION is generated by DispatchCompute or
DispatchComputeIndirect if there is no active program for the compute
shader stage.
INVALID_VALUE is generated by DispatchCompute if any of <num_groups_x>,
<num_groups_y> or <num_groups_z> is greater than the value of
MAX_COMPUTE_WORK_GROUP_COUNT for the corresponding dimension.
INVALID_VALUE is generated by DispatchComputeIndirect if <indirect> is
less than zero or not a multiple of four.
INVALID_OPERATION is generated by DispatchComputeIndirect if no buffer is
bound to DISPATCH_INDIRECT_BUFFER or if the command would source data
beyond the end of the bound buffer object.
INVALID_OPERATION is generated by GetProgramiv if <pname> is
COMPUTE_WORK_GROUP_SIZE and either the program has not been linked
successfully, or has been linked but contains no compute shaders.
LinkProgram will fail if <program> contains a combination of compute and
non-compute shaders.
New State
None.
New Implementation Dependent State
Add to Table 6.31, "Program Pipeline Object State"
+----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
| Get Value | Type | Get Command | Initial Value | Description | Sec. |
+----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
| COMPUTE_SHADER | Z+ | GetProgramPipelineiv | 0 | Name of current compute shader program object | 2.11.4 |
+----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
Add to Table 6.32, "Program Object State"
+----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
| Get Value | Type | Get Command | Initial Value | Description | Sec. |
+----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
| COMPUTE_WORK_GROUP_SIZE | 3 x Z+ | GetProgramiv | { 0, ... } | Workgroup size of a linked compute program | 5.5 |
| UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER | B | GetActiveUniformBlockiv | FALSE | True if uniform block is referenced by the compute stage | 2.17.7 |
| ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER | B | GetActiveAtomicCounter- | FALSE | AACB has a counter used by compute shaders | 2.17.7 |
| | | Bufferiv | | | |
+----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
Insert new table named "Compute Dispatch State", after Table 6.46 "Hints":
+----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
| Get Value | Type | Get Command | Initial Value | Description | Sec. |
+----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
| DISPATCH_INDIRECT_BUFFER_BINDING | Z+ | GetIntegerv | 0 | Indirect dispatch buffer binding | 5.5 |
+----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
Insert Table 6.50, "Implementation Dependent Compute Shader Limits",
renumber subsequent tables.
+-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+
| Get Value | Type | Get Command | Minimum Value | Description | Sec. |
+-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+
| MAX_COMPUTE_WORK_GROUP_COUNT | 3 x Z+ | GetIntegeri_v | 65535 | Maximum number of workgroups that may be dispatched by a single | 5.5 |
| | | | | dispatch command (per dimension) | |
| MAX_COMPUTE_WORK_GROUP_SIZE | 3 x Z+ | GetIntegeri_v | 1024 (x, y), 64 (z) | Maximum local size of a compute workgroup (per dimension) | 5.5 |
| MAX_COMPUTE_WORK_GROUP_INVOCATIONS | Z+ | GetIntegerv | 1024 | Maximum total compute shader invocations in a single workgroup | 5.5 |
| MAX_COMPUTE_UNIFORM_BLOCKS | Z+ | GetIntegerv | 12 | Maximum number of uniform blocks per compute program | 2.11.7 |
| MAX_COMPUTE_TEXTURE_IMAGE_UNITS | Z+ | GetIntegerv | 16 | Maximum number of texture image units accessible by a compute shader | 2.11.12 |
| MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS | Z+ | GetIntegerv | 8 | Number of atomic counter buffers accessed by a compute shader | 2.11.17 |
| MAX_COMPUTE_ATOMIC_COUNTERS | Z+ | GetIntegerv | 8 | Number of atomic counters accessed by a compute shader | 2.11.12 |
| MAX_COMPUTE_SHARED_MEMORY_SIZE | Z+ | GetIntegerv | 32768 | Maximum total storage size of all variables declared as <shared> in | |
| | | | | all compute shaders linked into a single program object | |
| MAX_COMPUTE_UNIFORM_COMPONENTS | Z+ | GetIntegerv | 512 | Number of components for compute shader uniform variables | 5.5.1 |
| MAX_COMPUTE_IMAGE_UNIFORMS | Z+ | GetIntegerv | 8 | Number of image variables in compute shaders | 2.11.12 |
| MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS | Z+ | GetIntegerv | * | Number of words for compute shader uniform variables in all uniform | 5.5.1 |
| | | | | blocks, including the default | |
+-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+
Modify Table 6.55, increasing the following minimum values:
MAX_COMBINED_TEXTURE_IMAGE_UNITS 96 (6*16), was 80
MAX_UNIFORM_BUFFER_BINDINGS 72 (6*12), was 60
Issues
1) Should <shared> variables be usable only in compute shaders, or in other
stages too?
RESOLVED: Support only in compute shaders. While some hardware may be
able to support shared variables in shader stages other than compute,
it is difficult to clearly define what the semantics are as far as
sharing. For example, what is the equivalent for a workgroup for
vertex shaders?
2) Can we expose atomics on <shared> variables?
RESOLVED: Yes. The existing atomics in OpenGL 4.2 (via image
variables) don't map well to the <shared> declaration. Instead, we've
defined new atomic functions that take a variable as a first input.
These functions are specified in the ARB_shader_storage_buffer_object
extension and are incorporated into this extension via the interaction
described above. We could have also chosen to define operators +=, &=,
etc. to be atomic when applied to <shared> variables, but shaders may
want to use such variables in cases where atomic access (and the
related overhead) is not required.
3) Should the local size and dimensions of the workgroup be specified at
compile time? What are the default local dimensions?
RESOLVED: Dimension is always 3 and a workgroup size declaration is
compulsory at compile time. There is no default. The value used is
queriable. To use a 1- or 2-dimensional workgroup, the extra
dimension(s) can be set to 1.
4) Do we need the local_work_size parameter in dispatch if the local size
may be specified at compile time in the shader?
RESOLVED: The specification of the workgroup size is now mandatory in
the shader source at compile time and the local_work_size may no longer
be specified at dispatch time.
5) How do multiple shaders attached to a single program object work?
RESOLVED: Just as with any other shader stage. Exactly one of the
shaders must provide the 'main' entry point. All shaders attached to a
program object effectively get compiled into a single, large program at
link time. The program is dispatched as one big entity. Über shader
type functionality can be achieved through the use of subroutine
uniforms, which also work exactly as for other shader stages.
6) Should compute dispatch honor conditional rendering?
RESOLVED: Yes, it does honor conditional rendering.
7) Is it possible to pass compute programs to UseProgram, etc.?
RESOLVED: Yes, compute programs can be made current via UseProgram and
can be made current in a program pipeline object via UseProgramStages.
Note that a compute program must be linked with PROGRAM_SEPARABLE set
to TRUE to be passed to UseProgramStages, even though the compute
pipeline has only a single shader stage.
The active compute program that will be used by DispatchCompute will be
determined in the same manner as the active program for any other
program stage:
* If there is a current program specified via UseProgram, that
program is considered current for all stages, including compute.
* Otherwise, if there is a current program pipeline object, the
program current for the compute stage of the pipeline object is
considered current for the compute stage.
* If neither of the former apply, no program is current for the
compute stage.
The program that is current for the compute stage is considered to be
active if and only if it has a compute shader executable. For example,
if a non-compute program is made current via UseProgram, it will also
be considered "current" for the compute stage, but won't be considered
active.
When using program pipeline objects, it's possible to switch between
graphics and compute work without switching programs. For example, in:
glBindProgramPipeline(pipeline);
glUseProgramStages(pipeline, GL_VERTEX_SHADER_BIT, programA);
glUseProgramStages(pipeline, GL_FRAGMENT_SHADER_BIT, programB);
glUseProgramStages(pipeline, GL_COMPUTE_SHADER_BIT, programC);
glDrawArrays(GL_TRIANGLES, 0, 900);
glDispatchCompute(5, 5, 5);
the triangles will be processed by programA and programB, while the
compute dispatch will be processed by programC. Similarly,
glUseProgramStages(pipeline, ~GL_COMPUTE_SHADER_BIT, programAB);
glUseProgramStages(pipeline, GL_COMPUTE_SHADER_BIT, programC);
glDrawArrays(GL_TRIANGLES, 0, 900);
glDispatchCompute(5, 5, 5);
will have the triangles processed by the multi-stage programAB.
8) What happens if you try to draw with no active compute program?
RESOLVED: An INVALID_OPERATION error is generated if there is no
active program for the compute shader stage.
9) Should we increase minimums on certain replicated state bindings
(texture image units, uniform buffer bindings) to reflect the addition
of a sixth shader stage?
RESOLVED: Yes, for MAX_COMBINED_TEXTURE_IMAGE_UNITS and
MAX_UNIFORM_BUFFER_BINDINGS. These limits permit applications to
statically partition the shared set of texture bindings into six
separate sets, one per shader stage.
The limit MAX_COMBINED_UNIFORM_BLOCKS is not increased, because it
reflects the sum of the number of uniform blocks used in each stage of
a single program. Since no single program can have more than five
stages, these limits don't need to be increased.
10) How do the shader built-in variables relate to DirectCompute's
built-in system values (SV_*)?
OpenGL Compute DirectCompute
--------------------------------------------------
gl_NumWorkGroups --
gl_WorkGroupSize --
gl_WorkGroupID SV_GroupID
gl_LocalInvocationID SV_GroupThreadID
gl_GlobalInvocationID SV_DispatchThreadID
gl_LocalInvocationIndex SV_GroupIndex
11) How does "program validation" (checking the active programs against
the current state) apply to DispatchCompute?
RESOLVED: The same program validation logic will be applied to both
graphics primitives (e.g., DrawArrays) and compute dispatches.
Conditions that will cause validation errors for graphics primitives
will also cause validation errors for compute dispatch, even if the
conditions wouldn't otherwise affect compute, for example:
* Mis-configured program pipeline objects (e.g., inserting a geometry
program A between the linked vertex and fragment shaders of
program B).
* A graphics program has a vertex shader that uses a 2D texture from
texture image unit 0 and a fragment shader that uses a 3D texture
from texture image unit 0.
Similarly, validation errors specific to the compute shader executable
(e.g., using different targets on a single texture image unit in a
compute program) will generate validation errors for graphics Draw*
calls.
We chose to specify this behavior for several reasons. First, using the
same logic in both places ensures a single result for ValidateProgram
and ValidateProgramPipeline (a single VALIDATE_STATUS value wouldn't be
good enough if the result could be different for compute and graphics).
Additionally, a single test allows implementations to set up state and
perform validation tests for compute and graphics operations at the same
time, without requiring additional irregular graphics- or
compute-specific logic.
12) We specify an INVALID_OPERATION error for DispatchCompute when there
is no active program on the compute stage. Should we specify similar
errors for Draw* calls if the current program specified by UseProgram
is a compute program?
RESOLVED: Not in the current spec. If a compute shader is made
current with UseProgram, there will be no active program for either the
vertex and fragment stages. In this case, the results of vertex and
fragment processing are undefined, but no error is generated. This
behavior is already specified in unextended OpenGL 4.2.
We don't generate errors in this case for several reasons:
* For the compatibility profile, fixed-function vertex and fragment
processing is available, and INVALID_OPERATION wouldn't make sense
there.
* Even in the core profile, there are cases where no active fragment
shader is needed (e.g., primitives with RASTERIZER_DISCARD enabled).
While there is no case where having only a compute program makes sense,
at least in the core profile, we chose to keep the same undefined
behavior that's already in place.
13) Should we provide any additional support extending the memoryBarrier()
GLSL built-in function provided by ARB_shader_image_load_store and
GLSL 4.20?
RESOLVED: Yes. The memoryBarrier() function provided by GLSL 4.20
requires (a) synchronizing all memory transactions that might be visible
to other shader invocations and (b) ordering memory transactions so that
all other shader invocations never see stores issued after the barrier
before seeing stores issued before the barrier. Hardware
implementations of GLSL 4.20 may have a high degree of parallelism,
where the memory subsystem servicing shader loads and stores may have
multiple independent sub-units, and where the shader invocations
themselves may be executed in parallel on many shader cores. The
memoryBarrier() command may be fairly heavyweight, requiring
synchronization with all memory sub-units and shader cores.
We provide new functions in two different directions that might serve as
lighter weight alternatives to memoryBarrier(). In particular, we
provide four new functions
void memoryBarrierAtomicCounter();
void memoryBarrierBuffer();
void memoryBarrierImage();
void memoryBarrierShared();
that order transactions of only a specific memory type and might require
synchronization with fewer sub-units of the memory subsystem and a new
function:
void groupMemoryBarrier();
that only orders transactions as viewed by other threads in the same
workgroup, which might not require synchronization with other shader cores.
Since shared memory is only accessible to threads within a single
workgroup, memoryBarrierShared() also only requires synchronization with
other threads in the same workgroup.
Revision History
Rev. Date Author Changes
---- -------- --------- -----------------------------------------
28 12/10/18 Jon Leech Use 'workgroup' consistently throughout (Bug
11723, internal API issue 87).
27 07/24/14 Jon Leech Change value of GLSL limit
gl_MaxComputeUniformComponents to 512 for
consistency with the API (Bug 12370).
26 01/30/14 Jon Leech Add table 6.31 COMPUTE_SHADER entry for
program pipeline objects (Bug 11539).
25 10/23/12 pbrown Remove the restriction forbidding the use of
barrier() inside potentially divergent flow
control. Instead, we will allow barrier() to
be executed anywhere, but specify undefined
results (including hangs or program termination)
if the flow control is divergent (bug 9367).
24 07/01/12 Jon Leech Fix typo (bug 8984).
23 06/28/12 johnk Remove two other references to "thread", add
"Only available in compute shaders" to the table
for memoryBarrierShared() and groupMemoryBarrier(),
fixed a typo.
22 06/22/12 pbrown Add a new built-in memoryBarrierBuffer() as an
interaction with ARB_shader_storage_buffer. Add
a new built-in groupMemoryBarrier() that orders
memory transactions only as observed by other
shader invocations in the same work group.
Enhance the description of the GLSL memory
barrier functions. Add issue 13 about the new
memory barrier functions added in this extension
(bug 9199). Mark issues 11 and 12 as resolved.
Add NV_vertex_buffer_unified_memory interaction
allowing DispatchComputeIndirect to read its
arguments from any resident buffer object
instead of the single bound indirect dispatch
buffer.
21 06/21/12 gsellers Clarify that there are no built-in inputs or
outputs in compute shaders (bug 9200).
20 06/21/12 gsellers Throw INVALID_OPERATION if querying
COMPUTE_WORK_GROUP_SIZE from unlinked program or
program with no compute shader (bug 9117).
19 06/18/12 pbrown DispatchComputeIndirect throws INVALID_VALUE
if <indirect> is negative or misaligned (bug
9181).
18 06/17/12 pbrown Clarify that compute-only programs can be used
by both UseProgram and UseProgramStages, and add
a COMPUTE_SHADER_BIT for UseProgramStages (bug
9155). Specify that validation errors checking
programs against each other and the GL state
apply equally to graphics primitives (Draw*) and
compute dispatches. Update issue 7; add new
issues 11 and 12. Clarify that compute shader
invocations in a workgroup are run "potentially
in parallel", but not "in lockstep" (bug 9151).
Other minor wording improvements.
17 06/15/12 johnk Don't allow location layout qualifiers for
compute shader inputs.
16 06/15/12 johnk In the intro material, allow work groups to
only potentially execute in parallel, and use
control barriers to synchronize. Other minor
fixes.
15 06/15/12 dgkoch Added Additions to Ch.2 of Shading Language.
Renamed shader built-in variables, explained
them better, made them uvec3 instead of int[3].
Added derived shading language variables.
Renamed and changed built-in constants for
consistency with the variables. Removed
gl_MaxComputeWorkDimensions since it is no
longer necessary. Renamed API constants to
be consistent with shading language terminology.
Remove a few rogue references to variable
number of dispatch arguments. Added Issue 10.
(bugs 9151, 9167)
14 06/14/12 pbrown Modify DispatchComputeIndirect to accept an
"intptr"-typed offset instead of a "void *",
since it doesn't accept pointers to client memory.
Modify DispatchComputeIndirect to use a new
buffer binding (DISPATCH_INDIRECT_BUFFER)
instead of sharing the binding used by
Draw*Indirect. Add missing entries in the "New
Tokens" section and assign values. Update
documentation of COMMAND_BARRIER_BIT to reflect
the new dispatch indirect binding. Document
DispatchComputeIndirect errors for offsets that
are negative, misaligned, or run off the end of
the bound buffer. Increase minimums for
combined texture image units and uniform buffer
bindings to reflect the new stage. Update
various issues, add new issue 9 (bug 9130).
13 06/14/12 Jon Leech Copy description of MAX_COMPUTE_SHARED_MEMORY_SIZE
into API spec from GLSL spec (bug 9069).
12 05/14/12 pbrown Add interaction with ARB_shader_storage_buffer_
object. The built-in functions provided there
for atomic memory operations on buffer variables
are also supported for the shared variables
provided here. The functions themselves are
documented fully in the other specification.
11 05/14/12 johnk Keep the previous logical contents of the last
paragraph of the memory shader control functions.
10 04/26/12 gsellers Count max compute shared variable size in bytes.
Make shared variables implicitly coherent.
Add MAX_COMPUTE_UNIFORM_COMPONENTS.
Clean up MAX_COMPUTE_IMAGE_UNIFORMS.
9 04/25/12 gsellers Add UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER
and ATOMIC_COUNTER_BUFFER_REFERENCED_BY_-
COMPUTE_SHADER. Remove <program> from dispatch
APIs. Add memoryBarrier{Image,Shared,
AtomicCounter}().
8 04/05/12 gsellers Remove ARB suffixes.
7 02/02/12 gsellers Require OpenGL 4.2.
Add issue 8.
Up various minimums.
Remove variable dimensionality.
6 01/24/12 gsellers Require OpenGL 3.0.
Incorporate feedback from bmerry.
Add compute shader constants to sec. 7.7.
Add modifications to sec. 8.15 of the GLSL spec.
Add issue 7.
5 01/20/12 gsellers Make compute dispatch honor conditional
rendering. Add indirect dispatch.
Change 'global work size' to 'num work groups',
make global size in multiples of work group size.
4 01/10/12 gsellers Fix typos and other small corrections.
Make specification of work group size at compile
time compulsory.
Add COMPUTE_WORK_DIMENSION_ARB and
COMPUTE_LOCAL_WORK_SIZE_ARB queries.
Add issue (5), resolve issues (3) and (4).
3 01/09/12 gsellers Change from AMD to ARB.
Update to be relative to OpenGL 4.2 (+GLSL 4.20).
Add <shared> variables.
Add issues (1) - (4).
Add link failure for programs that contain
compute and non-compute shaders.
2 06/10/11 gsellers Add error behavior.
Shading language changes.
Add global_offset parameter.
Add implementation dependent limits.
1 09/24/10 gsellers Initial revision