 Name

     ARB_compute_shader

 Name Strings

     GL_ARB_compute_shader

 Contact

     Graham Sellers, AMD (graham.sellers 'at' amd.com)

 Contributors

     Pat Brown, NVIDIA
     Daniel Koch, TransGaming
     John Kessenich
     Members of the ARB working group

 Notice

     Copyright (c) 2012-2014 The Khronos Group Inc. Copyright terms at
         http://www.khronos.org/registry/speccopyright.html

 Status

     Complete.
     Approved by the ARB on 2012/06/12.

 Version

     Last Modified Date: July 24, 2014
     Revision: 27

 Number

     ARB Extension #122

 Dependencies

     OpenGL 4.2 is required.

     This extension is written based on the wording of the OpenGL 4.2 (Core
     Profile) specification, and on the wording of the OpenGL Shading Language
     (GLSL) Specification, version 4.20.

     This extension interacts with OpenGL 4.3 and
     ARB_shader_storage_buffer_object.

     This extension interacts with NV_vertex_buffer_unified_memory.

 Overview

     Recent graphics hardware has become extremely powerful, and a strong desire
     has emerged to harness this power for work (both graphics and non-graphics)
     that does not fit the traditional graphics pipeline well. To address
     this, this extension adds a new single-stage program type known as a
     compute program. This program may contain one or more compute shaders
     which may be launched in a manner that is essentially stateless. This allows
     arbitrary workloads to be sent to the graphics hardware with minimal
     disturbance to the GL state machine.

     In most respects, a compute program is identical to a traditional OpenGL
     program object, with similar status, uniforms, and other such properties.
     It has access to many of the same resources as fragment and other shader
     types, such as textures, image variables, atomic counters, and so on.
     However, it has no predefined inputs nor any fixed-function outputs. It
     cannot be part of a pipeline and its visible side effects are through its
     actions on images and atomic counters.

     OpenCL is another solution for using graphics processors as generalized
     compute devices. This extension addresses a different need. For example,
     OpenCL is designed to be usable on a wide range of devices ranging from
     CPUs, GPUs, and DSPs through to FPGAs. While one could implement GL on these
     types of devices, the target here is clearly GPUs. Another difference is
     that OpenCL is more full featured and includes features such as multiple
     devices, asynchronous queues and strict IEEE semantics for floating point
     operations. This extension follows the semantics of OpenGL - implicitly
     synchronous, in-order operation with single-device, single queue
     logical architecture and somewhat more relaxed numerical precision
     requirements. Although not as feature rich, this extension offers several
     advantages for applications that can tolerate the omission of these
     features. Compute shaders are written in GLSL, for example, and so code may
     be shared between compute and other shader types. Objects are created and
     owned by the same context as the rest of the GL, and therefore no
     interoperability API is required and objects may be freely used by both
     compute and graphics simultaneously without acquire-release semantics or
     object type translation.

 New Procedures and Functions

         void DispatchCompute(uint num_groups_x,
                              uint num_groups_y,
                              uint num_groups_z);

         void DispatchComputeIndirect(intptr indirect);

 New Tokens

     Accepted by the <type> parameter of CreateShader and returned in the
     <params> parameter by GetShaderiv:

         COMPUTE_SHADER                                  0x91B9

     Accepted by the <pname> parameter of GetIntegerv, GetBooleanv, GetFloatv,
     GetDoublev and GetInteger64v:

         MAX_COMPUTE_UNIFORM_BLOCKS                      0x91BB
         MAX_COMPUTE_TEXTURE_IMAGE_UNITS                 0x91BC
         MAX_COMPUTE_IMAGE_UNIFORMS                      0x91BD
         MAX_COMPUTE_SHARED_MEMORY_SIZE                  0x8262
         MAX_COMPUTE_UNIFORM_COMPONENTS                  0x8263
         MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS              0x8264
         MAX_COMPUTE_ATOMIC_COUNTERS                     0x8265
         MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS         0x8266
         MAX_COMPUTE_WORK_GROUP_INVOCATIONS              0x90EB

     Accepted by the <pname> parameter of GetIntegeri_v, GetBooleani_v,
     GetFloati_v, GetDoublei_v and GetInteger64i_v:

         MAX_COMPUTE_WORK_GROUP_COUNT                    0x91BE
         MAX_COMPUTE_WORK_GROUP_SIZE                     0x91BF

     Accepted by the <pname> parameter of GetProgramiv:

         COMPUTE_WORK_GROUP_SIZE                         0x8267

     Accepted by the <pname> parameter of GetActiveUniformBlockiv:

         UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER      0x90EC

     Accepted by the <pname> parameter of GetActiveAtomicCounterBufferiv:

         ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER  0x90ED

     Accepted by the <target> parameters of BindBuffer, BufferData,
     BufferSubData, MapBuffer, UnmapBuffer, GetBufferSubData, and
     GetBufferPointerv:

         DISPATCH_INDIRECT_BUFFER                        0x90EE

     Accepted by the <value> parameter of GetIntegerv, GetBooleanv,
     GetInteger64v, GetFloatv, and GetDoublev:

         DISPATCH_INDIRECT_BUFFER_BINDING                0x90EF

     Accepted by the <stages> parameter of UseProgramStages:

         COMPUTE_SHADER_BIT                              0x00000020

 Additions to Chapter 2 of the OpenGL 4.2 (Core Profile) Specification
 (OpenGL Operation)

     In section 2.9.1, "Creating and Binding Buffer Objects", add to table 2.8
     (p.43):

                                                                 Described
       Target name                 Purpose                     in section(s)
       -----------------------     -------------------------  ---------------
       DISPATCH_INDIRECT_BUFFER    Indirect compute dispatch       5.5
                                   commands

     Add to the end of section 2.9.8, "Indirect Commands In Buffer Objects"
     (p. 53):

     Arguments to the DispatchComputeIndirect command are stored in buffer
     objects as a group of three unsigned integers.

     A buffer object is bound to DISPATCH_INDIRECT_BUFFER by calling BindBuffer
     with target set to DISPATCH_INDIRECT_BUFFER, and buffer set to the name of
     the buffer object. If no corresponding buffer object exists, one is
     initialized as defined in section 2.9.

     DispatchComputeIndirect sources its arguments from the buffer object whose
     name is bound to DISPATCH_INDIRECT_BUFFER, using the <indirect> parameter as
     an offset into the buffer object in the same fashion as described in
     section 2.9.6. An INVALID_OPERATION error is generated if this command
     sources data beyond the end of the buffer object, if zero is bound to
     DISPATCH_INDIRECT_BUFFER, or if <indirect> is less than zero or not a
     multiple of the size, in basic machine units, of uint.
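
     As an illustration of the layout and offset rules above, the following C
     sketch models the three-uint argument group and the checks applied to
     <indirect>. The names DispatchIndirectCommand and
     indirect_offset_is_valid are hypothetical and are not part of the GL API.
     The sketch assumes a GL uint is 32 bits wide and takes the buffer size as
     an explicit argument because it does not talk to a GL context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the argument group sourced by
 * DispatchComputeIndirect: three tightly packed unsigned integers. */
typedef struct {
    uint32_t num_groups_x;
    uint32_t num_groups_y;
    uint32_t num_groups_z;
} DispatchIndirectCommand;

/* Returns 1 if <indirect> passes the checks described above for a buffer
 * holding buffer_size basic machine units, and 0 otherwise. */
int indirect_offset_is_valid(ptrdiff_t indirect, size_t buffer_size)
{
    /* The sketch relies on the struct having no padding. */
    assert(sizeof(DispatchIndirectCommand) == 3 * sizeof(uint32_t));

    if (indirect < 0)
        return 0;                       /* negative offset */
    if ((size_t)indirect % sizeof(uint32_t) != 0)
        return 0;                       /* not a multiple of the size of uint */
    if ((size_t)indirect > buffer_size ||
        buffer_size - (size_t)indirect < sizeof(DispatchIndirectCommand))
        return 0;                       /* would source data past the end */
    return 1;
}
```

     Under these assumptions, a 16-unit buffer accepts offsets 0 and 4, while
     a 12-unit buffer accepts only offset 0.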

     In section 2.11, "Vertex Shaders", modify the introductory text on shaders
     to include compute shaders (second paragraph, p. 56):

     In addition to vertex shaders, tessellation control..., geometry shaders,
     fragment shaders, and compute shaders can be created, compiled, and linked
     into program objects.  ....  (section 3.10).  Compute shaders perform
     general computations for dispatched arrays of shader invocations (section
     5.5), but do not operate on primitives processed by the other shader
     types. ...

     In section 2.11.3, "Program Objects", add to the reasons that LinkProgram
     may fail, p. 61:

         * The program object contains objects to form a compute shader (see
           section 5.5) and objects to form any other type of shader.

     In section 2.11.3, modify the description of active programs (last
     paragraph, p. 61, first paragraph, p. 62):

     ... geometry shader stages, those stages are ignored.  If there is no
     active program for the compute shader stage, compute dispatches will
     generate an error.  The active program for the compute shader stage has no
     effect on the processing of vertices, geometric primitives, and fragments,
     and the active program for all other shader stages has no effect on
     compute dispatches.

     In section 2.11.4, "Program Pipeline Objects", modify the description of
     UseProgramStages, p. 65:

     The executables in a program object... becomes current.  These stages may
     include vertex, tessellation control, tessellation evaluation, geometry,
     fragment, or compute, indicated by VERTEX_SHADER_BIT,
     TESS_CONTROL_SHADER_BIT, TESS_EVALUATION_SHADER_BIT, GEOMETRY_SHADER_BIT,
     FRAGMENT_SHADER_BIT, or COMPUTE_SHADER_BIT, respectively. ...

     In the unnumbered "Validation" section of section 2.11.12 "Shader
     Execution", modify the list of validation errors, pp. 112-113:

     This error is generated by any command that transfers vertices to the GL
     or launches compute work if:

       * (last bullet, p. 112) One program object is active... first program
         object was active.  The active compute shader is ignored for the
         purposes of this test.

       * (2nd bullet, p. 113) There is no current program specified by
         UseProgram, there is a current program pipeline object, and the
         current program for any shader stage has been relinked since...

       * (3rd bullet, p. 113) Any two active samplers in the set of active
         program objects are of different types but refer to the same texture
         image unit.

       * (4th bullet, p. 113) The sum of the number of active samplers for each
         active program exceeds the maximum number of texture image units
         allowed.

     Modify the paragraph describing ValidateProgram, p. 113:

     ... If validation succeeded, ... set to FALSE.  If validation succeeded,
     no INVALID_OPERATION validation error will be generated if <program> were
     made current via UseProgram, given the current state.  If validation
     failed, such errors will be generated under the current state.

     Modify the paragraph describing ValidateProgramPipeline, p. 114:

     ... can be queried with GetProgramPipelineiv (see section 6.1.12).  If
     validation succeeded, no INVALID_OPERATION validation error will be
     generated if <pipeline> were bound and no program were made current via
     UseProgram, given the current state.  If validation failed, such errors
     will be generated under the current state.

     In subsection 2.11.12, "Shader Execution":

         Add to the list of implementation dependent constants under the
     "Texture Access" sub-heading:

         MAX_COMPUTE_TEXTURE_IMAGE_UNITS (for compute shaders),

         Add to the list of implementation dependent constants under the "Atomic
     Counter Access" sub-heading:

         MAX_COMPUTE_ATOMIC_COUNTERS (for compute shaders),

         Add to the list of implementation dependent constants under the "Image
     Access" sub-heading:

         MAX_COMPUTE_IMAGE_UNIFORMS (for compute shaders),

     In section 2.16, "Conditional Rendering", modify the sentence describing
     conditional rendering, starting with "In this case"...

     In this case, all drawing commands (see section 2.8.3), as well as
     Clear and ClearBuffer* (see section 4.2.3), and compute dispatch
     through DispatchCompute* (see section 5.5), have no effect.

     In the "Shared Memory Access Synchronization" subsection of section
     2.11.13, "Shader Memory Access", modify the description of
     COMMAND_BARRIER_BIT (p. 118):

       * COMMAND_BARRIER_BIT:  Command data sourced from buffer objects by
         Draw*Indirect and DispatchComputeIndirect commands ... The buffer
         objects affected by this bit are derived from the DRAW_INDIRECT_BUFFER
         and DISPATCH_INDIRECT_BUFFER bindings.

     In subsection 2.17.7, "Uniform Variables", replace the paragraph beginning
     "If <pname> is UNIFORM_BLOCK_REFERENCED_BY_VERTEX_SHADER,"... with:

         If <pname> is UNIFORM_BLOCK_REFERENCED_BY_VERTEX_SHADER,
     UNIFORM_BLOCK_REFERENCED_BY_TESS_CONTROL_SHADER,
     UNIFORM_BLOCK_REFERENCED_BY_TESS_EVALUATION_SHADER,
     UNIFORM_BLOCK_REFERENCED_BY_GEOMETRY_SHADER,
     UNIFORM_BLOCK_REFERENCED_BY_FRAGMENT_SHADER or
     UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER, then a boolean value indicating
     whether the uniform block identified by uniformBlockIndex is referenced
     by the vertex, tessellation control, tessellation evaluation, geometry,
     fragment or compute programming stages of <program>, respectively, is
     returned.

     Also in subsection 2.17.7, "Uniform Variables", replace the paragraph
     beginning, "If <pname> is ATOMIC_COUNTER_BUFFER_REFERENCED_BY_VERTEX_SHADER"
     on p.80 with:

         If <pname> is ATOMIC_COUNTER_BUFFER_REFERENCED_BY_VERTEX_SHADER,
     ATOMIC_COUNTER_BUFFER_REFERENCED_BY_TESS_CONTROL_SHADER,
     ATOMIC_COUNTER_BUFFER_REFERENCED_BY_TESS_EVALUATION_SHADER,
     ATOMIC_COUNTER_BUFFER_REFERENCED_BY_GEOMETRY_SHADER,
     ATOMIC_COUNTER_BUFFER_REFERENCED_BY_FRAGMENT_SHADER or
     ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER, then a single boolean
     value indicating whether the atomic counter buffer identified by
     bufferIndex is referenced by the vertex, tessellation control, tessellation
     evaluation, geometry, fragment or compute programming stages of
     <program>, respectively, is returned.

     Under the sub-heading "Uniform Blocks" in subsection 2.11.17, replace the
     sentence beginning "The limits for vertex, tessellation ..." on p.92
     with:

         The limits for vertex, tessellation, geometry, fragment and compute
     shaders can be obtained by calling GetIntegerv with <pname> set to
     MAX_VERTEX_UNIFORM_BLOCKS, MAX_TESS_CONTROL_UNIFORM_BLOCKS,
     MAX_TESS_EVALUATION_UNIFORM_BLOCKS, MAX_GEOMETRY_UNIFORM_BLOCKS,
     MAX_FRAGMENT_UNIFORM_BLOCKS and MAX_COMPUTE_UNIFORM_BLOCKS, respectively.

     Under the sub-heading "Atomic Counter Buffers" in subsection 2.11.17,
     replace the sentence beginning "The limits for vertex, geometry, ..."
     on p.96 with:

         The limits for vertex, tessellation, geometry, fragment and compute
     shaders can be obtained by calling GetIntegerv with <pname> set to
     MAX_VERTEX_ATOMIC_COUNTER_BUFFERS, MAX_TESS_CONTROL_ATOMIC_COUNTER_BUFFERS,
     MAX_TESS_EVALUATION_ATOMIC_COUNTER_BUFFERS,
     MAX_GEOMETRY_ATOMIC_COUNTER_BUFFERS, MAX_FRAGMENT_ATOMIC_COUNTER_BUFFERS and
     MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS, respectively.

 Additions to Chapter 3 of the OpenGL 4.2 (Core Profile) Specification
 (Rasterization)

     None.

 Additions to Chapter 4 of the OpenGL 4.2 (Core Profile) Specification
 (Per-Fragment Operations and the Framebuffer)

     None.

 Additions to Chapter 5 of the OpenGL 4.2 (Core Profile) Specification
 (Special Functions)

     Add Section 5.5, "Compute Shaders"

         In addition to graphics-oriented shading operations such as vertex,
     tessellation, geometry and fragment shading, generic computation may be
     performed by the GL through the use of compute shaders. The compute pipeline
     is a form of single-stage machine that runs generic shaders. Compute shaders
     are created as described in section 2.11.1 using a <type> parameter of
     COMPUTE_SHADER. They are attached to and used in program objects as
     described in section 2.11.3.

         Compute workloads are formed from groups of work items called work
     groups and processed by the executable code for a compute program. A work
     group is a collection of shader invocations that execute the same code,
     potentially in parallel. An invocation within a work group may share data
     with other members of the same work group through shared variables and
     issue memory and control barriers to synchronize with other members of the
     same work group.  One or more work groups are launched by calling:

         void DispatchCompute(uint num_groups_x,
                              uint num_groups_y,
                              uint num_groups_z);

         Each work group is processed by the active program object for the
     compute shader stage.  The error INVALID_OPERATION will be generated if
     there is no active program object for the compute shader stage.  The
     active program for the compute shader stage will be determined in the same
     manner as the active program for other pipeline stages, as described in
     section 2.11.3.  While the individual shader invocations within a work
     group are executed as a unit, work groups are executed completely
     independently and in unspecified order.

         <num_groups_x>, <num_groups_y> and <num_groups_z> specify the number of
     local work groups that will be dispatched in the X, Y and Z dimensions,
     respectively. The builtin vector variable gl_NumWorkGroups will be
     initialized with the contents of the <num_groups_x>, <num_groups_y> and
     <num_groups_z> parameters. The maximum number of work groups that may be
     dispatched at one time may be determined by calling GetIntegeri_v with
     <pname> set to MAX_COMPUTE_WORK_GROUP_COUNT and <index> set to zero, one,
     or two, representing the X, Y, and Z dimensions, respectively. The
     values of <num_groups_x>, <num_groups_y> and <num_groups_z> must each be
     less than or equal to the maximum work group count for the corresponding
     dimension; otherwise an INVALID_VALUE error is generated. If the work group
     count in any dimension is zero, no work groups are dispatched.
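
     A common way to choose the group counts is to round the number of items
     up to a whole number of local work groups and then compare the result
     with the queried limits. The C sketch below shows that arithmetic only.
     The helper names groups_for and group_counts_within_limits are
     hypothetical, and the limit values are supplied by the caller rather than
     queried from a GL context.

```c
#include <assert.h>
#include <stdint.h>

/* Number of work groups needed so that groups * local_size >= items.
 * Rounds up; returns 0 when there are no items. */
uint32_t groups_for(uint32_t items, uint32_t local_size)
{
    assert(local_size > 0);
    return items / local_size + (items % local_size != 0);
}

/* Returns 1 if every per-dimension count is within max_count[], the values
 * an application would obtain for MAX_COMPUTE_WORK_GROUP_COUNT. */
int group_counts_within_limits(const uint32_t count[3],
                               const uint32_t max_count[3])
{
    for (int i = 0; i < 3; ++i)
        if (count[i] > max_count[i])
            return 0;
    return 1;
}
```

     For example, 1000 items with a local size of 64 need 16 groups. Because
     the count is rounded up, the last group may contain invocations with no
     item to process, so a shader written this way would normally test its
     global index against the item count.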

         The local work size in each dimension is specified at compile time
     using an input layout qualifier in one or more of the compute shaders
     attached to the program (see Section 4 of the OpenGL Shading Language
     Specification). After the program has been linked, the local work group size
     of the program may be retrieved by calling GetProgramiv with <pname> set to
     COMPUTE_WORK_GROUP_SIZE. This will return an array of three integers
     containing the local work group size of the compute program as specified by
     its input layout qualifier(s). If <program> is the name of a program that
     has not been successfully linked, or is the name of a linked program object
     that contains no compute shaders, then an INVALID_OPERATION error is
     generated.

         The maximum size of a local work group may be determined by calling
     GetIntegeri_v with <pname> set to MAX_COMPUTE_WORK_GROUP_SIZE
     and <index> set to 0, 1, or 2 to retrieve the maximum work size in the
     X, Y and Z dimension, respectively. Furthermore, the maximum number of
     invocations in a single local work group (i.e., the product of the three
     dimensions) may be determined by calling GetIntegerv with <pname> set to
     MAX_COMPUTE_WORK_GROUP_INVOCATIONS.
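
     The two limits apply together: a local size can satisfy every
     per-dimension maximum and still exceed the total invocation limit. The C
     sketch below checks both. The function name local_size_is_supported is
     hypothetical, the limits are passed in by the caller, and rejecting a
     zero size is an assumption of the example rather than a rule stated
     in this section.

```c
#include <stdint.h>

/* Returns 1 if a local work group of size[0] x size[1] x size[2] fits both
 * the per-dimension limits (MAX_COMPUTE_WORK_GROUP_SIZE) and the total
 * invocation limit (MAX_COMPUTE_WORK_GROUP_INVOCATIONS), else 0. */
int local_size_is_supported(const uint32_t size[3],
                            const uint32_t max_size[3],
                            uint32_t max_invocations)
{
    uint64_t total = 1;
    for (int i = 0; i < 3; ++i) {
        /* Assumption of this sketch: a zero size is treated as invalid. */
        if (size[i] == 0 || size[i] > max_size[i])
            return 0;
        /* total fits in 32 bits before the multiply, so the 64-bit
         * product cannot overflow. */
        total *= size[i];
        if (total > max_invocations)
            return 0;
    }
    return 1;
}
```

     With illustrative limits of { 1024, 1024, 64 } per dimension and 1024
     invocations in total, a 32 x 32 x 1 group is accepted while 32 x 32 x 2
     is rejected, even though each of its dimensions is individually in range.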

         The command

         void DispatchComputeIndirect(intptr indirect);

     is equivalent (assuming no errors are generated) to calling
     DispatchCompute with <num_groups_x>, <num_groups_y> and <num_groups_z>
     initialized with the three uint values contained in the buffer currently
     bound to the DISPATCH_INDIRECT_BUFFER binding at an offset, in basic
     machine units, specified by <indirect>.  The error INVALID_VALUE is
     generated if <indirect> is less than zero or is not a multiple of four.
     The error INVALID_OPERATION is generated if no buffer is bound to
     DISPATCH_INDIRECT_BUFFER, if the command would source data beyond the end
     of the buffer object, or if there is no active program for the compute
     shader stage.  If any of <num_groups_x>, <num_groups_y> or <num_groups_z>
     is greater than MAX_COMPUTE_WORK_GROUP_COUNT for the corresponding
     dimension then the results are undefined.

     Add Subsection 5.5.1, "Compute Shader Variables"

         Compute shaders can access variables belonging to the current program
     object. The amount of storage in the default uniform block accessed by a
     compute shader is specified by the value of the implementation dependent
     constant MAX_COMPUTE_UNIFORM_COMPONENTS. The total amount of
     combined storage available for uniform variables in all uniform blocks
     accessed by a compute shader (including the default uniform block) is
     specified by the implementation dependent constant
     MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS.

         There is a limit to the total size of all variables declared as
     <shared> in a single program object. This limit, expressed in units of
     basic machine units, may be queried as the value of
     MAX_COMPUTE_SHARED_MEMORY_SIZE.

 Additions to Chapter 6 of the OpenGL 4.2 (Core Profile) Specification
 (State and State Requests)

     None.

 Additions to Chapter 2 of the OpenGL Shading Language Specification, Version
 4.20 (Overview of OpenGL Shading)

     Replace the last sentence of the first paragraph of the overview with
     the following:

     "Currently, these processors are the vertex, tessellation control,
      tessellation evaluation, geometry, fragment, and compute processors."

     Replace the last sentence of the second paragraph of the overview with
     the following:

     "The specific languages will be referred to by the name of the processor
      they target: vertex, tessellation control, tessellation evaluation,
      geometry, fragment, or compute."

     Add a new Section 2.6 titled "Compute Processor" with the following text:

     "The <compute processor> is a programmable unit that operates independently
     from the other shader processors. Compilation units written in the OpenGL
     Shading Language to run on this processor are called <compute shaders>.
     When a complete set of compute shaders are compiled and linked, they
     result in a <compute shader executable> that runs on the compute processor.

     A compute shader has access to many of the same resources as fragment and
     other shader processors, such as textures, buffers, image variables,
     atomic counters, and so on. It does not have any predefined inputs
     nor any fixed-function outputs.  It is not part of the graphics pipeline
     and its visible side effects are through actions on images, storage
     buffers, and atomic counters.

     A compute shader operates on a group of work items called a work group.
     A work group is a collection of shader invocations that execute the same
     code, potentially in parallel. An invocation within a work group may share
     data with other members of the same work group through shared variables and
     issue memory and control barriers to synchronize with other members of the
     same work group."

 Additions to Chapter 4 of the OpenGL Shading Language Specification, Version
 4.20 (Variables and Types)

     Modify section 4.4.1, second paragraph from

     "All shaders allow input layout qualifiers on input variable declarations."

     to

     "All shaders, except compute shaders, allow input layout location qualifiers on
      input variable declarations."

     Modify Section 4.3. Add to the table at the start of Section 4.3:

     +-------------------+-----------------------------------------------------------+
     | Storage Qualifier | Meaning                                                   |
     +-------------------+-----------------------------------------------------------+
     | <shared>          | variable storage is shared across all work items in a     |
     |                   | local work group for compute shaders                      |
     +-------------------+-----------------------------------------------------------+

     Add the following paragraph to Section 4.3.4, "Input Variables"

         Compute shaders do not permit user-defined input variables and do not
     form a formal interface with any other shader stage. See section 7.1
     for a description of built-in compute shader input variables. All other
     input to a compute shader is retrieved explicitly through image loads,
     texture fetches, loads from uniforms or uniform buffers, or other user
     supplied code. Redeclaration of built-in input variables in compute
     shaders is not permitted.

     Add the following paragraph to Section 4.3.6, "Output Variables"

         Compute shaders have no built-in output variables, do not support
     user-defined output variables and do not form a formal interface with any
     other shader stage. All outputs from a compute shader take the form of the
     side effects such as image stores and operations on atomic counters.

     Add Section 4.3.7, "Shared", renumber subsequent sections

         The <shared> qualifier is used to declare variables that have storage
     shared between all work items of a compute shader local work
     group. Variables declared as <shared> may only be used in compute shaders
     (see Section 5.5, "Compute Shaders"). Shared variables are implicitly
     coherent. That is, writes to shared variables from one shader invocation
     will eventually be seen by other invocations within the same local work
     group.

         Variables declared as <shared> may not have initializers and their
     contents are undefined at the beginning of shader execution. Any data
     written to <shared> variables will be visible to other invocations executing
     the same shader within the same local work group. Order of execution
     with regards to reads and writes to the same <shared> variables by different
     invocations of a shader is not defined. In order to achieve ordering with
     respect to reads and writes to <shared> variables, memory barriers must be
     employed using the barrier() function (see Section 8.15).

         There is a limit to the total size of all variables declared as
     <shared> in a single program object. This limit, expressed in units of
     basic machine units, may be determined by using the OpenGL API to query the
     value of MAX_COMPUTE_SHARED_MEMORY_SIZE.

     Add Section 4.4.1.4, "Compute-Shader Inputs"

     There are no layout location qualifiers for compute shader inputs.

     Layout qualifier identifiers for compute shader inputs are the work-group
     size qualifiers:

         layout-qualifier-id
             local_size_x = integer-constant
             local_size_y = integer-constant
             local_size_z = integer-constant

     <local_size_x>, <local_size_y>, and <local_size_z> are used to define the
     local size of the kernel defined by the compute shader in the first,
     second, and third dimension, respectively. The default size in each
     dimension is 1. If a shader does not specify a size for one of the
     dimensions, that dimension will have a size of 1.

     For example, the following declaration in a compute shader

         layout (local_size_x = 32, local_size_y = 32) in;

     declares a two-dimensional compute shader with a local size of 32 x 32
     elements, which is equivalent to a three-dimensional compute shader where
     the third dimension is one element deep.

     As another example, the declaration

         layout (local_size_x = 8) in;

     effectively specifies that a one-dimensional compute shader is being
     compiled, and its size is 8 elements.

         If the local size of the shader in any dimension is greater than the
     maximum size supported by the implementation for that dimension, a
     compile-time error results. Also, if such a layout qualifier is declared more
     than once in the same shader, all those declarations must indicate the same local
     work-group size; otherwise a compile-time error results. If multiple compute
     shaders attached to a single program object declare local work-group size,
     the declarations must be identical; otherwise a link-time error results.
     Furthermore, if a program object contains any compute shaders, at
     least one must contain an input layout qualifier specifying the local work
     sizes of the program, or a link-time error will occur.

 Additions to Chapter 7 of the OpenGL Shading Language Specification, Version
 4.20 (Built-in Variables)

     Add to the start of Section 7.1, "Built-In Language Variables", before the
     description of the vertex language built-in variables:

         In the compute language, the built-in variables are declared as follows:

         // work group dimensions
         in    uvec3 gl_NumWorkGroups;
         const uvec3 gl_WorkGroupSize;

         // work group and invocation IDs
         in    uvec3 gl_WorkGroupID;
         in    uvec3 gl_LocalInvocationID;

         // derived variables
         in    uvec3 gl_GlobalInvocationID;
         in    uint  gl_LocalInvocationIndex;

     Add to the end of Section 7.1, before Section 7.1.1:

         The built-in variable <gl_NumWorkGroups> is a compute-shader input
     variable containing the number of work groups in each dimension of the
     dispatch that will execute the compute shader.
     Its content is equal to the values specified in the <num_groups_x>,
     <num_groups_y>, and <num_groups_z> parameters passed to the
     DispatchCompute API entry point.

         The built-in constant <gl_WorkGroupSize> is a compute-shader constant
     containing the local work-group size of the shader. The size of the work
     group in the X, Y, and Z dimensions is stored in the x, y, and z components.
     The values stored in <gl_WorkGroupSize> match those specified in the
     required <local_size_x>, <local_size_y>, and <local_size_z> layout
     qualifiers for the current shader. This value is constant so that
     it can be used to size arrays of memory that can be shared within
     the local work group.

         The built-in variable <gl_WorkGroupID> is a compute-shader input
     variable containing the 3-dimensional index of the global work group
     that the current invocation is executing in. The possible values range
     across the parameters passed into DispatchCompute, i.e., from (0, 0, 0) to
     (gl_NumWorkGroups.x - 1, gl_NumWorkGroups.y - 1, gl_NumWorkGroups.z - 1).

         The built-in variable <gl_LocalInvocationID> is a compute-shader input
     variable containing the 3-dimensional index of the local work group
     within the global work group that the current invocation is executing in.
     The possible values for this variable range across the local work group
     size, i.e. (0,0,0) to (gl_WorkGroupSize.x - 1, gl_WorkGroupSize.y - 1,
     gl_WorkGroupSize.z - 1).

         The built-in variable <gl_GlobalInvocationID> is a compute shader input
     variable containing the global index of the current work item.  This
     value uniquely identifies this invocation from all other invocations
     across all local and global work groups initiated by the current
     DispatchCompute call.  This is computed as:

         gl_GlobalInvocationID =
             gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID.
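
     The component-wise arithmetic in this formula can be checked on the CPU
     with a small C model. The uvec3 struct below is a stand-in for the GLSL
     type, and the function name global_invocation_id is hypothetical; neither
     is part of GLSL or the GL API.

```c
#include <stdint.h>

/* Stand-in for the GLSL uvec3 type, for illustration only. */
typedef struct { uint32_t x, y, z; } uvec3;

/* Component-wise model of
 *   gl_GlobalInvocationID =
 *       gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID */
uvec3 global_invocation_id(uvec3 work_group_id,
                           uvec3 work_group_size,
                           uvec3 local_invocation_id)
{
    uvec3 id;
    id.x = work_group_id.x * work_group_size.x + local_invocation_id.x;
    id.y = work_group_id.y * work_group_size.y + local_invocation_id.y;
    id.z = work_group_id.z * work_group_size.z + local_invocation_id.z;
    return id;
}
```

     For a 32 x 32 x 1 local size, the invocation with local ID (5, 7, 0) in
     work group (2, 1, 0) has the global ID (69, 39, 0).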

         The built-in variable <gl_LocalInvocationIndex> is a compute shader
     input variable that contains the 1-dimensional representation of the
     gl_LocalInvocationID. This is useful for identifying a unique region of
     shared memory within the local work group for this invocation to use.
     This is computed as:

         gl_LocalInvocationIndex =
             gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y +
             gl_LocalInvocationID.y * gl_WorkGroupSize.x +
             gl_LocalInvocationID.x;
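
     The flattening above assigns every invocation in a local work group a
     distinct index in the range [0, x * y * z).  The following C sketch
     models the formula and checks that property for a given local size; the
     <uvec3> struct and both function names are hypothetical and exist only
     for this illustration.

```c
#include <assert.h>

/* Hypothetical three-component unsigned vector, standing in for GLSL's uvec3. */
typedef struct { unsigned x, y, z; } uvec3;

/* Models the gl_LocalInvocationIndex formula: x varies fastest, then y, then z. */
unsigned local_invocation_index(uvec3 id, uvec3 size)
{
    assert(id.x < size.x && id.y < size.y && id.z < size.z);
    return id.z * size.x * size.y + id.y * size.x + id.x;
}

/* Returns 1 if walking the work group in x-fastest order produces the
 * indices 0, 1, 2, ... with no gaps or repeats, and 0 otherwise. */
int indices_are_dense(uvec3 size)
{
    unsigned expected = 0;
    for (unsigned z = 0; z < size.z; ++z)
        for (unsigned y = 0; y < size.y; ++y)
            for (unsigned x = 0; x < size.x; ++x) {
                uvec3 id = { x, y, z };
                if (local_invocation_index(id, size) != expected++)
                    return 0;
            }
    return expected == size.x * size.y * size.z;
}
```

     With a local size of (8, 4, 2), the invocation with local ID (3, 2, 1)
     receives index 1 * 8 * 4 + 2 * 8 + 3 = 51.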

     Add to the list of built-in constants in Section 7.3:

         const ivec3 gl_MaxComputeWorkGroupCount = { 65535, 65535, 65535 };
         const ivec3 gl_MaxComputeWorkGroupSize = { 1024, 1024, 64 };
         const int gl_MaxComputeUniformComponents = 512;
         const int gl_MaxComputeTextureImageUnits = 16;
         const int gl_MaxComputeImageUniforms = 8;
         const int gl_MaxComputeAtomicCounters = 8;
         const int gl_MaxComputeAtomicCounterBuffers = 1;

 Additions to Chapter 8 of the OpenGL Shading Language Specification, Version
 4.20 (Built-in Functions)

     Insert "Atomic Memory Functions" section after Section 8.10, Atomic
     Counter Functions (p. 149).  Atomic memory operations are supported on
     shared variables; the set of operations and their definitions are similar
     to those for the imageAtomic*() functions.  These functions are fully
     documented in the ARB_shader_storage_buffer_object extension (see
     dependencies).

     Modify the first paragraph of Section 8.15, "Shader Invocation Control
     Functions" to read:

         The shader invocation control function is only available in tessellation
     control shaders and compute shaders. It is used to control the relative
     execution order of multiple shader invocations used to process a patch
     (in the case of tessellation control shaders) or a local work group (in the
     case of compute shaders), which are otherwise executed with an undefined
     order.

     +----------------+--------------------------------------------------------------------------+
     | Syntax         | Description                                                              |
     +----------------+--------------------------------------------------------------------------+
     | void barrier() | For any given static instance of barrier() appearing in a tessellation   |
     |                | control shader or compute shader, all invocations for a single patch     |
     |                | or work group, respectively, must enter it before any will continue      |
     |                | beyond it.                                                               |
     +----------------+--------------------------------------------------------------------------+

     Modify the second paragraph as follows:

     ... Because invocations may execute in an undefined order between these
     barrier calls, the values of a per-vertex or per-patch output variable in
     a tessellation control shader or shared variables for compute shaders
     will be undefined in a number of cases enumerated in Section 4.3.7 "Output
     Variables" (for tessellation control shaders) and Section 4.3.6 "Shared
     Variables" (for compute shaders).

     Replace the third paragraph with the following:

     For tessellation control shaders, the barrier() function may only be
     placed inside the function main() of the tessellation control shader and
     may not be called within any control flow. Barriers are also disallowed
     after a return statement in the function main(). Any such misplaced
     barriers result in a compile-time error.

     For compute shaders, the barrier() function may be placed within flow
     control, but that flow control must be uniform flow control. That is, all
     the controlling expressions that lead to execution of the barrier must be
     dynamically uniform expressions. This ensures that if any shader
     invocation enters a conditional statement, then all invocations will enter
     it. While compilers are encouraged to give warnings if they can detect
     this might not happen, compilers cannot completely determine this. Hence,
     it is the author's responsibility to ensure barrier() only exists inside
     uniform flow control. Otherwise, some shader invocations will stall
     indefinitely, waiting for a barrier that is never reached by other
     invocations.

     Modify the table of memory control functions on p. 160:

     +-----------------------------------+----------------------------------------------------------------------------------------+
     | Syntax                            | Description                                                                            |
     +-----------------------------------+----------------------------------------------------------------------------------------+
     | void memoryBarrier()              | Control the ordering of all memory transactions issued by a single shader invocation.  |
     +-----------------------------------+----------------------------------------------------------------------------------------+
     | void memoryBarrierAtomicCounter() | Control the ordering of accesses to atomic counter variables issued by a single shader |
     |                                   | invocation.                                                                            |
     +-----------------------------------+----------------------------------------------------------------------------------------+
     | void memoryBarrierBuffer()        | Control the ordering of memory transactions to buffer variables issued within a        |
     |                                   | single shader invocation.                                                              |
     +-----------------------------------+----------------------------------------------------------------------------------------+
     | void memoryBarrierImage()         | Control the ordering of memory transactions to images issued within a single shader    |
     |                                   | invocation.                                                                            |
     +-----------------------------------+----------------------------------------------------------------------------------------+
     | void memoryBarrierShared()        | Control the ordering of memory transactions to shared variables issued within a single |
     |                                   | shader invocation.                                                                     |
     |                                   | Only available in compute shaders.                                                     |
     +-----------------------------------+----------------------------------------------------------------------------------------+
     | void groupMemoryBarrier()         | Control the ordering of all memory transactions issued within a single shader          |
     |                                   | invocation, as viewed by other invocations in the same work group.                     |
     |                                   | Only available in compute shaders.                                                     |
     +-----------------------------------+----------------------------------------------------------------------------------------+

     Modify the subsequent paragraph as follows:

     The memory barrier built-in functions can be used to order reads and
     writes to variables stored in memory accessible to other shader
     invocations.  When called, these functions will wait for the completion of
     all reads and writes previously performed by the caller that access
     selected variable types, and then return with no other effect.  The
     built-in functions memoryBarrierAtomicCounter(), memoryBarrierBuffer(),
     memoryBarrierImage(), and memoryBarrierShared() wait for the completion of
     accesses to atomic counter, buffer, image, and shared variables,
     respectively.  The built-in functions memoryBarrier() and
     groupMemoryBarrier() wait for the completion of accesses to all of the
     above variable types.  The functions memoryBarrierShared() and
     groupMemoryBarrier() are available only in compute shaders; the other
     functions are available in all shader types.

     When these functions return, any memory stores performed using coherent
     variables prior to the call will be visible to any future coherent access
     to the same memory performed by any other shader invocation.  In
     particular, the values written this way in one shader stage are guaranteed
     to be visible to coherent memory accesses performed by shader invocations
     in subsequent stages when those invocations were triggered by the
     execution of the original shader invocation (e.g., fragment shader
     invocations for a primitive resulting from a particular geometry shader
     invocation).

     Additionally, memory barrier functions order stores performed by the
     calling invocation, as observed by other shader invocations.  Without
     memory barriers, if one shader invocation performs two stores to coherent
     variables, a second shader invocation might see the values written by the
     second store prior to seeing those written by the first.  However, if the
     first shader invocation calls a memory barrier function between the two
     stores, selected other shader invocations will never see the results of
     the second store before seeing those of the first.  When using the
     function groupMemoryBarrier(), this ordering guarantee applies only to
     other shader invocations in the same compute shader work group; all other
     memory barrier functions provide the guarantee to all other shader
     invocations.  No memory barrier is required to guarantee the order of
     memory stores as observed by the invocation performing the stores; an
     invocation reading from a variable that it previously wrote will always
     see the most recently written value unless another shader invocation also
     wrote to the same memory.

 Dependencies on OpenGL 4.3 and ARB_shader_storage_buffer_object

     If OpenGL 4.3 and ARB_shader_storage_buffer_object are not supported, the
     spec language adding the built-in functions atomicAdd(), atomicMin(),
     atomicMax(), atomicAnd(), atomicOr(), atomicXor(), atomicExchange(), and
     atomicCompSwap() should be considered to be incorporated into this
     extension as-is, except that buffer variables will not be supported and
     thus cannot be used with these functions.  No "#extension" directive is
     necessary to use these functions in compute shaders.

     If OpenGL 4.3 and ARB_shader_storage_buffer_object are not supported,
     references to the GLSL built-in function memoryBarrierBuffer() should be
     removed.

 Dependencies on NV_vertex_buffer_unified_memory

     If NV_vertex_buffer_unified_memory is supported, a new buffer address
     range and enable are provided to permit the use of
     DispatchComputeIndirect with a resident buffer object without requiring
     that it be bound to the DISPATCH_INDIRECT_BUFFER target.  The following
     additional edits apply:

     Accepted by the <target> parameter of GetBufferParameterui64vNV:

         DISPATCH_INDIRECT_BUFFER                        (defined above)

     Accepted by the <cap> parameter of Disable, Enable, and IsEnabled, and by
     the <pname> parameter of GetIntegerv, GetBooleanv, GetFloatv, GetDoublev
     and GetInteger64v:

         DISPATCH_INDIRECT_UNIFIED_NV                    0x90FD

     Accepted by the <pname> parameter of BufferAddressRangeNV
     and the <value> parameter of GetIntegerui64vNV:

         DISPATCH_INDIRECT_ADDRESS_NV                    0x90FE

     Accepted by the <value> parameter of GetIntegerv:

         DISPATCH_INDIRECT_LENGTH_NV                     0x90FF

     Add to the end of Section 5.5, after discussion of
     DispatchComputeIndirect:

     If DISPATCH_INDIRECT_UNIFIED_NV is enabled, DispatchComputeIndirect does
     not use the buffer bound to DISPATCH_INDIRECT_BUFFER.  Instead, it sources
     its arguments from the GPU address range specified by calling
     BufferAddressRangeNV with a <pname> of DISPATCH_INDIRECT_ADDRESS_NV and an
     <index> of zero.  The address is obtained by adding the <indirect>
     parameter to the base address of the range, specified by the <address>
     parameter of BufferAddressRangeNV.  If the command sources data outside
     the specified address range, the error INVALID_OPERATION will be
     generated.  The DISPATCH_INDIRECT_BUFFER binding will be ignored in this
     case, and no errors will be generated due to the use of this binding.  The
     error INVALID_VALUE will still be generated if <indirect> is negative.  No
     INVALID_VALUE error will be generated if <indirect> is not a multiple of
     four, but INVALID_OPERATION will be generated if the effective address is
     not a multiple of four.  If the indirect dispatch address range does not
     belong to a buffer object that is resident at the time of the
     DispatchComputeIndirect call, undefined results, possibly including
     program termination, may occur.
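
     The error rules in the paragraph above can be summarized with a small C
     model.  This is a sketch, not driver code: the enum, the function name
     check_unified_indirect(), and the order in which the checks are applied
     are assumptions made for illustration, and the 12-byte read size assumes
     that the command sources three unsigned 32-bit group counts, as in the
     definition of DispatchComputeIndirect.

```c
#include <stdint.h>

/* Hypothetical error codes used only by this sketch. */
typedef enum {
    SKETCH_NO_ERROR,
    SKETCH_INVALID_VALUE,
    SKETCH_INVALID_OPERATION
} sketch_error;

/* Assumed size of the data sourced by one indirect dispatch:
 * three unsigned 32-bit work group counts. */
enum { SKETCH_INDIRECT_BYTES = 12 };

/* base, length: the range given to BufferAddressRangeNV for
 *               DISPATCH_INDIRECT_ADDRESS_NV.
 * indirect:     the <indirect> argument of DispatchComputeIndirect. */
sketch_error check_unified_indirect(uint64_t base, uint64_t length,
                                    int64_t indirect)
{
    /* A negative <indirect> is still an INVALID_VALUE error. */
    if (indirect < 0)
        return SKETCH_INVALID_VALUE;

    uint64_t offset  = (uint64_t)indirect;
    uint64_t address = base + offset;

    /* Only the effective address must be a multiple of four;
     * <indirect> on its own need not be. */
    if (address % 4 != 0)
        return SKETCH_INVALID_OPERATION;

    /* The sourced data must lie entirely inside the address range. */
    if (offset > length || length - offset < SKETCH_INDIRECT_BYTES)
        return SKETCH_INVALID_OPERATION;

    return SKETCH_NO_ERROR;
}
```

     Note that a base address of 0x1002 combined with an <indirect> of 2 is
     accepted by this model, because the effective address 0x1004 is a
     multiple of four even though <indirect> is not.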

     Add the following to the "Compute Dispatch State" table defined in this
     extension:

     Get Value                           Type    Get Command         Initial Value   Sec     Attribute
     ---------                           ----    -----------         -------------   ---     ---------
     DISPATCH_INDIRECT_UNIFIED_NV         B      IsEnabled               FALSE       5.5     none
     DISPATCH_INDIRECT_ADDRESS_NV        Z64+    GetIntegerui64vNV         0         5.5     none
     DISPATCH_INDIRECT_LENGTH_NV          Z+     GetIntegerv               0         5.5     none

 Errors

     INVALID_OPERATION is generated by DispatchCompute or
     DispatchComputeIndirect if there is no active program for the compute
     shader stage.

     INVALID_VALUE is generated by DispatchCompute if any of <num_groups_x>,
     <num_groups_y> or <num_groups_z> is greater than the value of
     MAX_COMPUTE_WORK_GROUP_COUNT for the corresponding dimension.

     INVALID_VALUE is generated by DispatchComputeIndirect if <indirect> is
     less than zero or not a multiple of four.

     INVALID_OPERATION is generated by DispatchComputeIndirect if no buffer is
     bound to DISPATCH_INDIRECT_BUFFER or if the command would source data
     beyond the end of the bound buffer object.

     INVALID_OPERATION is generated by GetProgramiv if <pname> is
     COMPUTE_WORK_GROUP_SIZE and either the program has not been linked
     successfully, or has been linked but contains no compute shaders.

     LinkProgram will fail if <program> contains a combination of compute and
     non-compute shaders.
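
     The dispatch-time errors listed above can be modeled as two small
     checking functions.  The following C sketch is illustrative only: the
     <sketch_state> struct, the function names, and the order in which the
     checks run are assumptions (this section does not say which error is
     reported when several conditions hold at once), and the 12-byte read
     size assumes three unsigned 32-bit group counts in the indirect buffer.

```c
/* Hypothetical error codes used only by this sketch. */
typedef enum {
    SKETCH_NO_ERROR,
    SKETCH_INVALID_VALUE,
    SKETCH_INVALID_OPERATION
} sketch_error;

/* Hypothetical snapshot of the state the checks depend on. */
typedef struct {
    int       has_active_compute_program; /* nonzero if a program is active for compute */
    unsigned  max_work_group_count[3];    /* MAX_COMPUTE_WORK_GROUP_COUNT, per dimension */
    int       indirect_buffer_bound;      /* nonzero if DISPATCH_INDIRECT_BUFFER is bound */
    long long indirect_buffer_size;       /* size in bytes of the bound buffer */
} sketch_state;

/* Assumed size of the data sourced by one indirect dispatch. */
enum { SKETCH_INDIRECT_BYTES = 12 };

sketch_error check_dispatch_compute(const sketch_state *s,
                                    unsigned nx, unsigned ny, unsigned nz)
{
    if (!s->has_active_compute_program)
        return SKETCH_INVALID_OPERATION;
    /* A count equal to the limit is allowed; only greater values fail. */
    if (nx > s->max_work_group_count[0] ||
        ny > s->max_work_group_count[1] ||
        nz > s->max_work_group_count[2])
        return SKETCH_INVALID_VALUE;
    return SKETCH_NO_ERROR;
}

sketch_error check_dispatch_compute_indirect(const sketch_state *s,
                                             long long indirect)
{
    if (!s->has_active_compute_program)
        return SKETCH_INVALID_OPERATION;
    if (indirect < 0 || indirect % 4 != 0)
        return SKETCH_INVALID_VALUE;
    if (!s->indirect_buffer_bound)
        return SKETCH_INVALID_OPERATION;
    /* The command must not source data beyond the end of the buffer. */
    if (indirect > s->indirect_buffer_size ||
        s->indirect_buffer_size - indirect < SKETCH_INDIRECT_BYTES)
        return SKETCH_INVALID_OPERATION;
    return SKETCH_NO_ERROR;
}
```

     With a 64-byte buffer bound, this model accepts an <indirect> of 52 and
     rejects 56, since the latter would read past the end of the buffer.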

 New State

     None.

 New Implementation Dependent State

     Add to Table 6.31, "Program Pipeline Object State"

     +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
     | Get Value                                          | Type      | Get Command             | Initial Value | Description                                                           | Sec.    |
     +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
     | COMPUTE_SHADER                                     | Z+        | GetProgramPipelineiv    | 0             | Name of current compute shader program object                         | 2.11.4  |
     +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+

     Add to Table 6.32, "Program Object State"

     +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
     | Get Value                                          | Type      | Get Command             | Initial Value | Description                                                           | Sec.    |
     +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
     | COMPUTE_WORK_GROUP_SIZE                            | 3 x Z+    | GetProgramiv            | { 0, ... }    | Local work size of a linked compute program                           | 5.5     |
     | UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER         | B         | GetActiveUniformBlockiv | FALSE         | True if uniform block is referenced by the compute stage              | 2.17.7  |
     | ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER | B         | GetActiveAtomicCounter- | FALSE         | AACB has a counter used by compute shaders                            | 2.17.7  |
     |                                                    |           |   Bufferiv              |               |                                                                       |         |
     +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+

     Insert new table named "Compute Dispatch State", after Table 6.46 "Hints":

     +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
     | Get Value                                          | Type      | Get Command             | Initial Value | Description                                                           | Sec.    |
     +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
     | DISPATCH_INDIRECT_BUFFER_BINDING                   | Z+        | GetIntegerv             | 0             | Indirect dispatch buffer binding                                      | 5.5     |
     +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+

     Insert Table 6.50, "Implementation Dependent Compute Shader Limits",
     renumber subsequent tables.

     +-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+
     | Get Value                               | Type      | Get Command   | Minimum Value       | Description                                                           | Sec.    |
     +-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+
     | MAX_COMPUTE_WORK_GROUP_COUNT            | 3 x Z+    | GetIntegeri_v | 65535               | Maximum number of work groups that may be dispatched by a single      | 5.5     |
     |                                         |           |               |                     | dispatch command (per dimension)                                      |         |
     | MAX_COMPUTE_WORK_GROUP_SIZE             | 3 x Z+    | GetIntegeri_v | 1024 (x, y), 64 (z) | Maximum local size of a compute work group (per dimension)            | 5.5     |
     | MAX_COMPUTE_WORK_GROUP_INVOCATIONS      | Z+        | GetIntegerv   | 1024                | Maximum total compute shader invocations in a single local work group | 5.5     |
     | MAX_COMPUTE_UNIFORM_BLOCKS              | Z+        | GetIntegerv   | 12                  | Maximum number of uniform blocks per compute program                  | 2.11.7  |
     | MAX_COMPUTE_TEXTURE_IMAGE_UNITS         | Z+        | GetIntegerv   | 16                  | Maximum number of texture image units accessible by a compute shader  | 2.11.12 |
     | MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS      | Z+        | GetIntegerv   | 8                   | Number of atomic counter buffers accessed by a compute shader         | 2.11.17 |
     | MAX_COMPUTE_ATOMIC_COUNTERS             | Z+        | GetIntegerv   | 8                   | Number of atomic counters accessed by a compute shader                | 2.11.12 |
     | MAX_COMPUTE_SHARED_MEMORY_SIZE          | Z+        | GetIntegerv   | 32768               | Maximum total storage size of all variables declared as <shared> in   |         |
     |                                         |           |               |                     | all compute shaders linked into a single program object               |         |
     | MAX_COMPUTE_UNIFORM_COMPONENTS          | Z+        | GetIntegerv   | 512                 | Number of components for compute shader uniform variables             | 5.5.1   |
     | MAX_COMPUTE_IMAGE_UNIFORMS              | Z+        | GetIntegerv   | 8                   | Number of image variables in compute shaders                          | 2.11.12 |
     | MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS | Z+        | GetIntegerv   | *                   | Number of words for compute shader uniform variables in all uniform   | 5.5.1   |
     |                                         |           |               |                     | blocks, including the default                                         |         |
     +-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+
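
     Both MAX_COMPUTE_WORK_GROUP_SIZE (per dimension) and
     MAX_COMPUTE_WORK_GROUP_INVOCATIONS (total) constrain the local size of
     a work group.  The following C sketch shows how the two limits combine.
     The <sketch_limits> struct and the function name are hypothetical, and
     treating a zero-sized dimension as invalid is an assumption of this
     sketch.  A real application would query the limits from the
     implementation rather than hard-code the minimum values from the table.

```c
/* Hypothetical container for the two limits that bound a local work group. */
typedef struct {
    unsigned max_size[3];     /* MAX_COMPUTE_WORK_GROUP_SIZE, per dimension */
    unsigned max_invocations; /* MAX_COMPUTE_WORK_GROUP_INVOCATIONS */
} sketch_limits;

/* Returns 1 if a local size of (x, y, z) fits within both limits. */
int local_size_is_valid(const sketch_limits *lim,
                        unsigned x, unsigned y, unsigned z)
{
    /* Assumption of this sketch: every dimension must be at least 1. */
    if (x == 0 || y == 0 || z == 0)
        return 0;

    /* Each dimension is limited individually ... */
    if (x > lim->max_size[0] || y > lim->max_size[1] || z > lim->max_size[2])
        return 0;

    /* ... and the product of all three is limited as well.  The product
     * is formed in 64 bits so that it cannot wrap around. */
    unsigned long long total = (unsigned long long)x * y * z;
    return total <= lim->max_invocations;
}
```

     With the minimum values from the table, a local size of (32, 32, 2)
     passes every per-dimension check but is rejected, because it would
     contain 2048 invocations.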

     Modify Table 6.55, increasing the following minimum values:

            MAX_COMBINED_TEXTURE_IMAGE_UNITS     96 (6*16), was 80
            MAX_UNIFORM_BUFFER_BINDINGS          72 (6*12), was 60

 Issues

     1) Should <shared> variables be usable only in compute shaders, or in other
        stages too?

        RESOLVED:  Support only in compute shaders.  While some hardware may be
        able to support shared variables in shader stages other than compute,
        it is difficult to clearly define what the semantics are as far as
        sharing. For example, what is the equivalent for a local work group for
        vertex shaders?

     2) Can we expose atomics on <shared> variables?

        RESOLVED:  Yes.  The existing atomics in OpenGL 4.2 (via image
        variables) don't map well to the <shared> declaration.  Instead, we've
        defined new atomic functions that take a variable as a first input.
        These functions are specified in the ARB_shader_storage_buffer_object
        extension and are incorporated into this extension via the interaction
        described above.  We could have also chosen to define operators +=, &=,
        etc. to be atomic when applied to <shared> variables, but shaders may
        want to use such variables in cases where atomic access (and the
        related overhead) is not required.

     3) Should the local size and dimensions of the work group be specified at
        compile time? What are the default local dimensions?

        RESOLVED: Dimension is always 3 and a local size declaration is
        compulsory at compile time. There is no default. The value used is
        queryable.  To use a 1- or 2-dimensional work group, the extra
        dimensions can be set to 1.

     4) Do we need the local_work_size parameter in dispatch if the local size
        may be specified at compile time in the shader?

        RESOLVED: The specification of the local work size is now mandatory in
        the shader source at compile time and the local_work_size may no longer
        be specified at dispatch time.

     5) How do multiple shaders attached to a single program object work?

        RESOLVED:  Just as with any other shader stage. Exactly one of the
        shaders must provide the 'main' entry point. All shaders attached to a
        program object effectively get compiled into a single, large program at
        link time.  The program is dispatched as one big entity. Über shader
        type functionality can be achieved through the use of subroutine
        uniforms, which also work exactly as for other shader stages.

     6) Should compute dispatch honor conditional rendering?

        RESOLVED: Yes, it does honor conditional rendering.

     7) Is it possible to pass compute programs to UseProgram, etc.?

        RESOLVED: Yes, compute programs can be made current via UseProgram and
        can be made current in a program pipeline object via UseProgramStages.
        Note that a compute program must be linked with PROGRAM_SEPARABLE set
        to TRUE to be passed to UseProgramStages, even though the compute
        pipeline has only a single shader stage.

        The active compute program that will be used by DispatchCompute will be
        determined in the same manner as the active program for any other
        program stage:

          * If there is a current program specified via UseProgram, that
            program is considered current for all stages, including compute.

          * Otherwise, if there is a current program pipeline object, the
            program current for the compute stage of the pipeline object is
            considered current for the compute stage.

          * If neither of the above applies, no program is current for the
            compute stage.

        The program that is current for the compute stage is considered to be
        active if and only if it has a compute shader executable.  For example,
        if a non-compute program is made current via UseProgram, it will also
        be considered "current" for the compute stage, but won't be considered
        active.
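
        The selection rules above can be sketched as a pair of C functions.
        This is a model for illustration, not an implementation: the
        <sketch_program> and <sketch_pipeline> structs and both function
        names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical program object: only the property relevant here is modeled. */
typedef struct {
    unsigned name;
    int      has_compute_executable; /* nonzero if linked with a compute shader */
} sketch_program;

/* Hypothetical program pipeline object: only its compute stage is modeled. */
typedef struct {
    const sketch_program *compute_stage; /* NULL if no program is attached */
} sketch_pipeline;

/* Returns the program that is current for the compute stage, or NULL.
 * A program set via UseProgram takes precedence over the bound pipeline. */
const sketch_program *current_compute_program(
    const sketch_program *use_program_current,
    const sketch_pipeline *bound_pipeline)
{
    if (use_program_current != NULL)
        return use_program_current;
    if (bound_pipeline != NULL)
        return bound_pipeline->compute_stage;
    return NULL;
}

/* Returns the program that is active for the compute stage, or NULL.
 * The current program is active only if it has a compute executable. */
const sketch_program *active_compute_program(
    const sketch_program *use_program_current,
    const sketch_pipeline *bound_pipeline)
{
    const sketch_program *p =
        current_compute_program(use_program_current, bound_pipeline);
    return (p != NULL && p->has_compute_executable) ? p : NULL;
}
```

        In this model, a graphics-only program made current via UseProgram
        hides the compute stage of the bound pipeline: it is current for
        compute, but no program is active.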

        When using program pipeline objects, it's possible to switch between
        graphics and compute work without switching programs.  For example, in:

          glBindProgramPipeline(pipeline);
          glUseProgramStages(pipeline, GL_VERTEX_SHADER_BIT, programA);
          glUseProgramStages(pipeline, GL_FRAGMENT_SHADER_BIT, programB);
          glUseProgramStages(pipeline, GL_COMPUTE_SHADER_BIT, programC);
          glDrawArrays(GL_TRIANGLES, 0, 900);
          glDispatchCompute(5, 5, 5);

        the triangles will be processed by programA and programB, while the
        compute dispatch will be processed by programC.  Similarly,

          glUseProgramStages(pipeline, ~GL_COMPUTE_SHADER_BIT, programAB);
          glUseProgramStages(pipeline, GL_COMPUTE_SHADER_BIT, programC);
          glDrawArrays(GL_TRIANGLES, 0, 900);
          glDispatchCompute(5, 5, 5);

        will have the triangles processed by the multi-stage programAB.

     8) What happens if you try to dispatch with no active compute program?

        RESOLVED:  An INVALID_OPERATION error is generated if there is no
        active program for the compute shader stage.

     9) Should we increase minimums on certain replicated state bindings
        (texture image units, uniform buffer bindings) to reflect the addition
        of a sixth shader stage?

        RESOLVED:  Yes, for MAX_COMBINED_TEXTURE_IMAGE_UNITS and
        MAX_UNIFORM_BUFFER_BINDINGS.  These limits permit applications to
        statically partition the shared set of texture bindings into six
        separate sets, one per shader stage.

        The limit MAX_COMBINED_UNIFORM_BLOCKS is not increased, because it
        reflects the sum of the number of uniform blocks used in each stage of
        a single program.  Since no single program can have more than five
        stages, this limit doesn't need to be increased.
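
        One way an application might perform the static partitioning
        mentioned above is sketched below in C.  The stage enumeration, its
        ordering, and the function name are hypothetical; the only values
        taken from this extension are 16 units per stage and six stages, for
        a combined minimum of 96.

```c
#include <assert.h>

/* Hypothetical stage ordering chosen by the application for this sketch. */
enum sketch_stage {
    SKETCH_VERTEX,
    SKETCH_TESS_CONTROL,
    SKETCH_TESS_EVALUATION,
    SKETCH_GEOMETRY,
    SKETCH_FRAGMENT,
    SKETCH_COMPUTE,
    SKETCH_STAGE_COUNT
};

/* Minimum number of texture image units guaranteed to each stage. */
enum { SKETCH_UNITS_PER_STAGE = 16 };

/* Maps (stage, slot) to a texture image unit so that no two stages
 * ever share a unit. */
unsigned texture_unit_for(enum sketch_stage stage, unsigned slot)
{
    assert(stage < SKETCH_STAGE_COUNT);
    assert(slot < SKETCH_UNITS_PER_STAGE);
    return (unsigned)stage * SKETCH_UNITS_PER_STAGE + slot;
}
```

        Under this layout the compute stage owns units 80 through 95, which
        is why the combined minimum grows from 80 to 96.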

     10) How do the shader built-in variables relate to DirectCompute's
        built-in system values (SV_*)?

         OpenGL Compute             DirectCompute
         --------------------------------------------------
         gl_NumWorkGroups           --
         gl_WorkGroupSize           --
         gl_WorkGroupID             SV_GroupID
         gl_LocalInvocationID       SV_GroupThreadID
         gl_GlobalInvocationID      SV_DispatchThreadID
         gl_LocalInvocationIndex    SV_GroupIndex

     11) How does "program validation" (checking the active programs against
         the current state) apply to DispatchCompute?

       RESOLVED:  The same program validation logic will be applied to both
       graphics primitives (e.g., DrawArrays) and compute dispatches.
       Conditions that will cause validation errors for graphics primitives
       will also cause validation errors for compute dispatch, even if the
       conditions wouldn't otherwise affect compute, for example:

         * Mis-configured program pipeline objects (e.g., inserting a geometry
           program A between the linked vertex and fragment shaders of
           program B).

         * A graphics program has a vertex shader that uses a 2D texture from
           texture image unit 0 and a fragment shader that uses a 3D texture
           from texture image unit 0.

       Similarly, validation errors specific to the compute shader executable
       (e.g., using different targets on a single texture image unit in a
       compute program) will generate validation errors for graphics Draw*
       calls.

       We chose to specify this behavior for several reasons.  First, using the
       same logic in both places ensures a single result for ValidateProgram
       and ValidateProgramPipeline (a single VALIDATE_STATUS value wouldn't be
       good enough if the result could be different for compute and graphics).
       Additionally, a single test allows implementations to set up state and
       perform validation tests for compute and graphics operations at the same
       time, without requiring additional irregular graphics- or
       compute-specific logic.
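
       As a sketch of what a single shared test might look like, the C
       function below checks one of the conditions mentioned above:
       conflicting texture targets on one texture image unit.  The types and
       names are hypothetical, and a real implementation validates much more
       than this.  The point is only that the same list, covering samplers
       from every stage including compute, feeds one result that both Draw*
       and DispatchCompute would consult.

```c
#include <stddef.h>

/* Hypothetical texture targets; only two are needed for the illustration. */
typedef enum { SKETCH_TARGET_2D, SKETCH_TARGET_3D } sketch_target;

/* One sampler use by any stage (graphics or compute) of the active programs. */
typedef struct {
    unsigned      unit;   /* texture image unit the sampler refers to */
    sketch_target target; /* texture target the sampler type requires */
} sketch_sampler_use;

/* Returns 1 if no texture image unit is used with two different targets,
 * and 0 if such a conflict exists. */
int samplers_validate(const sketch_sampler_use *uses, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        for (size_t j = i + 1; j < count; ++j)
            if (uses[i].unit == uses[j].unit &&
                uses[i].target != uses[j].target)
                return 0;
    return 1;
}
```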

     12) We specify an INVALID_OPERATION error for DispatchCompute when there
         is no active program on the compute stage.  Should we specify similar
         errors for Draw* calls if the current program specified by UseProgram
         is a compute program?

       RESOLVED:  Not in the current spec.  If a compute program is made
       current with UseProgram, there will be no active program for either the
       vertex or fragment stage.  In this case, the results of vertex and
       fragment processing are undefined, but no error is generated.  This
       behavior is already specified in unextended OpenGL 4.2.

       We don't generate errors in this case for several reasons:

         * For the compatibility profile, fixed-function vertex and fragment
           processing is available, and INVALID_OPERATION wouldn't make sense
           there.

         * Even in the core profile, there are cases where no active fragment
           shader is needed (e.g., primitives with RASTERIZER_DISCARD enabled).

       While there is no case where having only a compute program makes sense,
       at least in the core profile, we chose to keep the same undefined
       behavior that's already in place.

     13) Should we provide any additional support extending the memoryBarrier()
         GLSL built-in function provided by ARB_shader_image_load_store and
         GLSL 4.20?

       RESOLVED:  Yes.  The memoryBarrier() function provided by GLSL 4.20
       requires (a) synchronizing all memory transactions that might be visible
       to other shader invocations and (b) ordering memory transactions so that
       all other shader invocations never see stores issued after the barrier
       before seeing stores issued before the barrier.  Hardware
       implementations of GLSL 4.20 may have a high degree of parallelism,
       where the memory subsystem servicing shader loads and stores may have
       multiple independent sub-units, and where the shader invocations
       themselves may be executed in parallel on many shader cores.  The
       memoryBarrier() command may be fairly heavyweight, requiring
       synchronization with all memory sub-units and shader cores.

       We provide two kinds of new functions that might serve as
       lighter-weight alternatives to memoryBarrier().  In particular, we
       provide four new functions

         void memoryBarrierAtomicCounter();
         void memoryBarrierBuffer();
         void memoryBarrierImage();
         void memoryBarrierShared();

       that order transactions of only one specific memory type and might
       require synchronization with fewer sub-units of the memory subsystem,
       and a new function:

         void groupMemoryBarrier();

       that orders transactions only as viewed by other shader invocations in
       the same work group, which might not require synchronization with other
       shader cores.  Since shared memory is only accessible to invocations
       within a single work group, memoryBarrierShared() also only requires
       synchronization with other invocations in the same work group.
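
       As an illustration of the lighter-weight barriers described above, the
       following compute shader is a minimal sketch, not text from this
       specification.  It assumes GLSL 4.30 (which includes this extension and
       ARB_shader_storage_buffer_object); the block name Data, the array names
       values and tile, and the work group size of 64 are hypothetical.  The
       buffer is assumed to hold at least one element per invocation.

```glsl
#version 430

// Hypothetical work group size: 64 invocations along x.
layout(local_size_x = 64) in;

// Hypothetical shader storage block; assumed to hold at least
// gl_NumWorkGroups.x * 64 floats.
layout(std430, binding = 0) buffer Data {
    float values[];
};

// Shared variables are visible only within a single work group.
shared float tile[64];

void main()
{
    uint lid = gl_LocalInvocationID.x;
    uint gid = gl_GlobalInvocationID.x;

    // Each invocation stores one element into shared memory.
    tile[lid] = values[gid];

    // Order the shared-memory store above before any later accesses,
    // touching only the shared memory type ...
    memoryBarrierShared();
    // ... then wait until every invocation in the work group arrives.
    // This call is in uniform flow control.
    barrier();

    // Read a value written by a different invocation in the same group.
    uint neighbor = (lid + 1u) % 64u;
    values[gid] = tile[lid] + tile[neighbor];
}
```

       If the invocations of a work group also exchanged data through buffer
       or image variables, groupMemoryBarrier() would be the corresponding
       choice in place of memoryBarrierShared().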

 Revision History

     Rev.    Date    Author    Changes
     ----  --------  --------  -----------------------------------------
     27    07/24/14  Jon Leech Change value of GLSL limit
                               gl_MaxComputeUniformComponents to 512 for
                               consistency with the API (Bug 12370).
     26    01/30/14  Jon Leech Add table 6.31 COMPUTE_SHADER entry for
                               program pipeline objects (Bug 11539).
     25    10/23/12  pbrown    Remove the restriction forbidding the use of
                               barrier() inside potentially divergent flow
                               control.  Instead, we will allow barrier() to
                               be executed anywhere, but specify undefined
                               results (including hangs or program termination)
                               if the flow control is divergent (bug 9367).
     24    07/01/12  Jon Leech Fix typo (bug 8984).
     23    06/28/12  johnk     Remove two other references to "thread", add
                               "Only available in compute shaders" to the table
                               for memoryBarrierShared() and groupMemoryBarrier(),
                               fixed a typo.
     22    06/22/12  pbrown    Add a new built-in memoryBarrierBuffer() as an
                               interaction with ARB_shader_storage_buffer_object.
                               Add a new built-in groupMemoryBarrier() that orders
                               memory transactions only as observed by other
                               shader invocations in the same work group.
                               Enhance the description of the GLSL memory
                               barrier functions.  Add issue 13 about the new
                               memory barrier functions added in this extension
                               (bug 9199).  Mark issues 11 and 12 as resolved.
                               Add NV_vertex_buffer_unified_memory interaction
                               allowing DispatchComputeIndirect to read its
                               arguments from any resident buffer object
                               instead of the single bound indirect dispatch
                               buffer.
     21    06/21/12  gsellers  Clarify that there are no built-in inputs or
                               outputs in compute shaders (bug 9200).
     20    06/21/12  gsellers  Throw INVALID_OPERATION if querying
                               COMPUTE_WORK_GROUP_SIZE from unlinked program or
                               program with no compute shader (bug 9117).
     19    06/18/12  pbrown    DispatchComputeIndirect throws INVALID_VALUE
                               if <indirect> is negative or misaligned (bug
                               9181).
     18    06/17/12  pbrown    Clarify that compute-only programs can be used
                               by both UseProgram and UseProgramStages, and add
                               a COMPUTE_SHADER_BIT for UseProgramStages (bug
                               9155).  Specify that validation errors checking
                               programs against each other and the GL state
                               apply equally to graphics primitives (Draw*) and
                               compute dispatches.  Update issue 7; add new
                               issues 11 and 12.  Clarify that compute shader
                               invocations in a work group are run "potentially
                               in parallel", but not "in lockstep" (bug 9151).
                               Other minor wording improvements.
     17    06/15/12  johnk     Don't allow location layout qualifiers for
                               compute shader inputs.
     16    06/15/12  johnk     In the intro material, allow work groups to
                               only potentially execute in parallel, and use
                               control barriers to synchronize.  Other minor
                               fixes.
     15    06/15/12  dgkoch    Added Additions to Ch.2 of Shading Language.
                               Renamed shader built-in variables, explained
                               them better, made them uvec3 instead of int[3].
                               Added derived shading language variables.
                               Renamed and changed built-in constants for
                               consistency with the variables. Removed
                               gl_MaxComputeWorkDimensions since it is no
                               longer necessary. Renamed API constants to
                               be consistent with shading language terminology.
                               Remove a few rogue references to variable
                               number of dispatch arguments. Added Issue 10.
                               (bugs 9151, 9167)
     14    06/14/12  pbrown    Modify DispatchComputeIndirect to accept an
                               "intptr"-typed offset instead of a "void *",
                               since it doesn't accept pointers to client memory.
                               Modify DispatchComputeIndirect to use a new
                               buffer binding (DISPATCH_INDIRECT_BUFFER)
                               instead of sharing the binding used by
                               Draw*Indirect.  Add missing entries in the "New
                               Tokens" section and assign values.  Update
                               documentation of COMMAND_BARRIER_BIT to reflect
                               the new dispatch indirect binding.  Document
                               DispatchComputeIndirect errors for offsets that
                               are negative, misaligned, or run off the end of
                               the bound buffer.  Increase minimums for
                               combined texture image units and uniform buffer
                               bindings to reflect the new stage.  Update
                               various issues, add new issue 9 (bug 9130).
     13    06/14/12  Jon Leech Copy description of MAX_COMPUTE_SHARED_MEMORY_SIZE
                               into API spec from GLSL spec (bug 9069).
     12    05/14/12  pbrown    Add interaction with ARB_shader_storage_buffer_
                               object. The built-in functions provided there
                               for atomic memory operations on buffer variables
                               are also supported for the shared variables
                               provided here.  The functions themselves are
                               documented fully in the other specification.
     11    05/14/12  johnk     Keep the previous logical contents of the last
                               paragraph of the shader memory control functions.
     10    04/26/12  gsellers  Count max compute shared variable size in bytes.
                               Make shared variables implicitly coherent.
                               Add MAX_COMPUTE_UNIFORM_COMPONENTS.
                               Clean up MAX_COMPUTE_IMAGE_UNIFORMS.
      9    04/25/12  gsellers  Add UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER
                               and ATOMIC_COUNTER_BUFFER_REFERENCED_BY_-
                               COMPUTE_SHADER.  Remove <program> from dispatch
                               APIs.  Add memoryBarrier{Image,Shared,
                               AtomicCounter}().
      8    04/05/12  gsellers  Remove ARB suffixes.
      7    02/02/12  gsellers  Require OpenGL 4.2.
                               Add issue 8.
                               Up various minimums.
                               Remove variable dimensionality.
      6    01/24/12  gsellers  Require OpenGL 3.0.
                               Incorporate feedback from bmerry.
                               Add compute shader constants to sec. 7.7.
                               Add modifications to sec. 8.15 of the GLSL spec.
                               Add issue 7.
      5    01/20/12  gsellers  Make compute dispatch honor conditional
                               rendering.  Add indirect dispatch.
                               Change 'global work size' to 'num work groups',
                               make global size in multiples of local work size.
      4    01/10/12  gsellers  Fix typos and other small corrections.
                               Make specification of local work size at compile
                               time compulsory.
                               Add COMPUTE_WORK_DIMENSION_ARB and
                               COMPUTE_LOCAL_WORK_SIZE_ARB queries.
                               Add issue (5), resolve issues (3) and (4).
      3    01/09/12  gsellers  Change from AMD to ARB.
                               Update to be relative to OpenGL 4.2 (+GLSL 4.20).
                               Add <shared> variables.
                               Add issues (1) - (4).
                               Add link failure for programs that contain
                               compute and non-compute shaders.
      2    06/10/11  gsellers  Add error behavior.
                               Shading language changes.
                               Add global_offset parameter.
                               Add implementation dependent limits.
      1    09/24/10  gsellers  Initial revision