| Name | 
 |  | 
 |     ARB_compute_shader | 
 |  | 
 | Name Strings | 
 |  | 
 |     GL_ARB_compute_shader | 
 |  | 
 | Contact | 
 |  | 
 |     Graham Sellers, AMD (graham.sellers 'at' amd.com) | 
 |  | 
 | Contributors | 
 |  | 
 |     Pat Brown, NVIDIA | 
 |     Daniel Koch, TransGaming | 
 |     John Kessenich | 
 |     Members of the ARB working group | 
 |  | 
 | Notice | 
 |  | 
 |     Copyright (c) 2012-2014 The Khronos Group Inc. Copyright terms at | 
 |         http://www.khronos.org/registry/speccopyright.html | 
 |  | 
 | Status | 
 |  | 
 |     Complete. | 
 |     Approved by the ARB on 2012/06/12. | 
 |  | 
 | Version | 
 |  | 
 |     Last Modified Date: July 24, 2014 | 
 |     Revision: 27 | 
 |  | 
 | Number | 
 |  | 
 |     ARB Extension #122 | 
 |  | 
 | Dependencies | 
 |  | 
 |     OpenGL 4.2 is required. | 
 |  | 
 |     This extension is written based on the wording of the OpenGL 4.2 (Core | 
 |     Profile) specification, and on the wording of the OpenGL Shading Language | 
 |     (GLSL) Specification, version 4.20. | 
 |  | 
 |     This extension interacts with OpenGL 4.3 and | 
 |     ARB_shader_storage_buffer_object. | 
 |  | 
 |     This extension interacts with NV_vertex_buffer_unified_memory. | 
 |  | 
 | Overview | 
 |  | 
 |     Recent graphics hardware has become extremely powerful and a strong desire | 
 |     to harness this power for work (both graphics and non-graphics) that does | 
 |     not fit the traditional graphics pipeline well has emerged. To address | 
 |     this, this extension adds a new single-stage program type known as a | 
 |     compute program. This program may contain one or more compute shaders | 
 |     which may be launched in a manner that is essentially stateless. This allows | 
 |     arbitrary workloads to be sent to the graphics hardware with minimal | 
 |     disturbance to the GL state machine. | 
 |  | 
 |     In most respects, a compute program is identical to a traditional OpenGL | 
 |     program object, with similar status, uniforms, and other such properties. | 
 |     It has access to many of the same resources as fragment and other shader | 
 |     types, such as textures, image variables, atomic counters, and so on. | 
 |     However, it has no predefined inputs nor any fixed-function outputs. It | 
 |     cannot be part of a pipeline and its visible side effects are through its | 
 |     actions on images and atomic counters. | 
 |  | 
 |     OpenCL is another solution for using graphics processors as generalized | 
 |     compute devices. This extension addresses a different need. For example, | 
 |     OpenCL is designed to be usable on a wide range of devices ranging from | 
 |     CPUs, GPUs, and DSPs through to FPGAs. While one could implement GL on these | 
 |     types of devices, the target here is clearly GPUs. Another difference is | 
 |     that OpenCL is more full featured and includes features such as multiple | 
 |     devices, asynchronous queues and strict IEEE semantics for floating point | 
 |     operations. This extension follows the semantics of OpenGL - implicitly | 
 |     synchronous, in-order operation with single-device, single queue | 
 |     logical architecture and somewhat more relaxed numerical precision | 
 |     requirements. Although not as feature rich, this extension offers several | 
 |     advantages for applications that can tolerate the omission of these | 
 |     features. Compute shaders are written in GLSL, for example, and so code may | 
 |     be shared between compute and other shader types. Objects are created and | 
 |     owned by the same context as the rest of the GL, and therefore no | 
 |     interoperability API is required and objects may be freely used by both | 
 |     compute and graphics simultaneously without acquire-release semantics or | 
 |     object type translation. | 
 |  | 
 | New Procedures and Functions | 
 |  | 
 |         void DispatchCompute(uint num_groups_x, | 
 |                              uint num_groups_y, | 
 |                              uint num_groups_z); | 
 |  | 
 |         void DispatchComputeIndirect(intptr indirect); | 
 |  | 
 | New Tokens | 
 |  | 
 |     Accepted by the <type> parameter of CreateShader and returned in the | 
 |     <params> parameter by GetShaderiv: | 
 |  | 
 |         COMPUTE_SHADER                                  0x91B9 | 
 |  | 
 |     Accepted by the <pname> parameter of GetIntegerv, GetBooleanv, GetFloatv, | 
 |     GetDoublev and GetInteger64v: | 
 |  | 
 |         MAX_COMPUTE_UNIFORM_BLOCKS                      0x91BB | 
 |         MAX_COMPUTE_TEXTURE_IMAGE_UNITS                 0x91BC | 
 |         MAX_COMPUTE_IMAGE_UNIFORMS                      0x91BD | 
 |         MAX_COMPUTE_SHARED_MEMORY_SIZE                  0x8262 | 
 |         MAX_COMPUTE_UNIFORM_COMPONENTS                  0x8263 | 
 |         MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS              0x8264 | 
 |         MAX_COMPUTE_ATOMIC_COUNTERS                     0x8265 | 
 |         MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS         0x8266 | 
 |         MAX_COMPUTE_WORK_GROUP_INVOCATIONS              0x90EB | 
 |  | 
 |     Accepted by the <pname> parameter of GetIntegeri_v, GetBooleani_v, | 
 |     GetFloati_v, GetDoublei_v and GetInteger64i_v: | 
 |  | 
 |         MAX_COMPUTE_WORK_GROUP_COUNT                    0x91BE | 
 |         MAX_COMPUTE_WORK_GROUP_SIZE                     0x91BF | 
 |  | 
 |     Accepted by the <pname> parameter of GetProgramiv: | 
 |  | 
 |         COMPUTE_WORK_GROUP_SIZE                         0x8267 | 
 |  | 
 |     Accepted by the <pname> parameter of GetActiveUniformBlockiv: | 
 |  | 
 |         UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER      0x90EC | 
 |  | 
 |     Accepted by the <pname> parameter of GetActiveAtomicCounterBufferiv: | 
 |  | 
 |         ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER  0x90ED | 
 |  | 
 |     Accepted by the <target> parameters of BindBuffer, BufferData, | 
 |     BufferSubData, MapBuffer, UnmapBuffer, GetBufferSubData, and | 
 |     GetBufferPointerv: | 
 |  | 
 |         DISPATCH_INDIRECT_BUFFER                        0x90EE | 
 |  | 
 |     Accepted by the <value> parameter of GetIntegerv, GetBooleanv, | 
 |     GetInteger64v, GetFloatv, and GetDoublev: | 
 |  | 
 |         DISPATCH_INDIRECT_BUFFER_BINDING                0x90EF | 
 |  | 
 |     Accepted by the <stages> parameter of UseProgramStages: | 
 |  | 
 |         COMPUTE_SHADER_BIT                              0x00000020 | 
 |  | 
 | Additions to Chapter 2 of the OpenGL 4.2 (Core Profile) Specification | 
 | (OpenGL Operation) | 
 |  | 
 |     In section 2.9.1, "Creating and Binding Buffer Objects", add to table 2.8 | 
 |     (p.43): | 
 |  | 
 |                                                                 Described | 
 |       Target name                 Purpose                     in section(s) | 
 |       -----------------------     -------------------------  --------------- | 
 |       DISPATCH_INDIRECT_BUFFER    Indirect compute dispatch       5.5 | 
 |                                   commands | 
 |  | 
 |     Add to the end of section 2.9.8, "Indirect Commands In Buffer Objects" | 
 |     (p. 53): | 
 |  | 
 |     Arguments to the DispatchComputeIndirect command are stored in buffer | 
 |     objects as a group of three unsigned integers. | 
 |  | 
 |     A buffer object is bound to DISPATCH_INDIRECT_BUFFER by calling BindBuffer | 
 |     with target set to DISPATCH_INDIRECT_BUFFER, and buffer set to the name of | 
 |     the buffer object. If no corresponding buffer object exists, one is | 
 |     initialized as defined in section 2.9. | 
 |  | 
 |     DispatchComputeIndirect sources its arguments from the buffer object whose | 
 |     name is bound to DISPATCH_INDIRECT_BUFFER, using the <indirect> parameter as | 
 |     an offset into the buffer object in the same fashion as described in | 
 |     section 2.9.6. An INVALID_OPERATION error is generated if this command | 
 |     sources data beyond the end of the buffer object, if zero is bound to | 
 |     DISPATCH_INDIRECT_BUFFER, or if <indirect> is less than zero or not a | 
 |     multiple of the size, in basic machine units, of uint. | 
 |  | 
 |     In section 2.11, "Vertex Shaders", modify the introductory text on shaders | 
 |     to include compute shaders (second paragraph, p. 56): | 
 |  | 
 |     In addition to vertex shaders, tessellation control..., geometry shaders, | 
 |     fragment shaders, and compute shaders can be created, compiled, and linked | 
 |     into program objects.  ....  (section 3.10).  Compute shaders perform | 
 |     general computations for dispatched arrays of shader invocations (section | 
 |     5.5), but do not operate on primitives processed by the other shader | 
 |     types. ... | 
 |  | 
 |     In section 2.11.3, "Program Objects", add to the reasons that LinkProgram | 
 |     may fail, p. 61: | 
 |  | 
 |         * The program object contains objects to form a compute shader (see | 
 |           section 5.5) and objects to form any other type of shader. | 
 |  | 
 |     In section 2.11.3, modify the description of active programs (last | 
 |     paragraph, p. 61, first paragraph, p. 62): | 
 |  | 
 |     ... geometry shader stages, those stages are ignored.  If there is no | 
 |     active program for the compute shader stage, compute dispatches will | 
 |     generate an error.  The active program for the compute shader stage has no | 
 |     effect on the processing of vertices, geometric primitives, and fragments, | 
 |     and the active program for all other shader stages has no effect on | 
 |     compute dispatches. | 
 |  | 
 |     In section 2.11.4, "Program Pipeline Objects", modify the description of | 
 |     UseProgramStages, p. 65: | 
 |  | 
 |     The executables in a program object... becomes current.  These stages may | 
 |     include vertex, tessellation control, tessellation evaluation, geometry, | 
 |     fragment, or compute, indicated by VERTEX_SHADER_BIT, | 
 |     TESS_CONTROL_SHADER_BIT, TESS_EVALUATION_SHADER_BIT, GEOMETRY_SHADER_BIT, | 
 |     FRAGMENT_SHADER_BIT, or COMPUTE_SHADER_BIT, respectively. ... | 
 |  | 
 |     In the unnumbered "Validation" section of section 2.11.12 "Shader | 
 |     Execution", modify the list of validation errors, pp. 112-113: | 
 |  | 
 |     This error is generated by any command that transfers vertices to the GL | 
 |     or launches compute work if: | 
 |  | 
 |       * (last bullet, p. 112) One program object is active... first program | 
 |         object was active.  The active compute shader is ignored for the | 
 |         purposes of this test. | 
 |  | 
 |       * (2nd bullet, p. 113) There is no current program specified by | 
 |         UseProgram, there is a current program pipeline object, and the | 
 |         current program for any shader stage has been relinked since... | 
 |  | 
 |       * (3rd bullet, p. 113) Any two active samplers in the set of active | 
 |         program objects are of different types but refer to the same texture | 
 |         image unit. | 
 |  | 
 |       * (4th bullet, p. 113) The sum of the number of active samplers for each | 
 |         active program exceeds the maximum number of texture image units | 
 |         allowed. | 
 |  | 
 |     Modify the paragraph describing ValidateProgram, p. 113: | 
 |  | 
 |     ... If validation succeeded, ... set to FALSE.  If validation succeeded, | 
 |     no INVALID_OPERATION validation error will be generated if <program> were | 
 |     made current via UseProgram, given the current state.  If validation | 
 |     failed, such errors will be generated under the current state. | 
 |  | 
 |     Modify the paragraph describing ValidateProgramPipeline, p. 114: | 
 |  | 
 |     ... can be queried with GetProgramPipelineiv (see section 6.1.12).  If | 
 |     validation succeeded, no INVALID_OPERATION validation error will be | 
 |     generated if <pipeline> were bound and no program were made current via | 
 |     UseProgram, given the current state.  If validation failed, such errors | 
 |     will be generated under the current state.     | 
 |  | 
 |     In subsection 2.11.12, "Shader Execution": | 
 |  | 
 |         Add to the list of implementation dependent constants under the | 
 |     "Texture Access" sub-heading: | 
 |  | 
 |         MAX_COMPUTE_TEXTURE_IMAGE_UNITS (for compute shaders), | 
 |  | 
 |         Add to the list of implementation dependent constants under the "Atomic | 
 |     Counter Access" sub-heading: | 
 |  | 
 |         MAX_COMPUTE_ATOMIC_COUNTERS (for compute shaders), | 
 |  | 
 |         Add to the list of implementation dependent constants under the "Image | 
 |     Access" sub-heading: | 
 |  | 
 |         MAX_COMPUTE_IMAGE_UNIFORMS (for compute shaders), | 
 |  | 
 |     In section 2.16, "Conditional Rendering", modify the sentence describing | 
 |     conditional rendering, starting with "In this case"... | 
 |  | 
 |     In this case, all drawing commands (see section 2.8.3), as well as | 
 |     Clear and ClearBuffer* (see section 4.2.3), and compute dispatch | 
 |     through DispatchCompute* (see section 5.5), have no effect. | 
 |  | 
 |     In the "Shared Memory Access Synchronization" subsection of section | 
 |     2.11.13, "Shader Memory Access", modify the description of | 
 |     COMMAND_BARRIER_BIT (p. 118): | 
 |  | 
 |       * COMMAND_BARRIER_BIT:  Command data sourced from buffer objects by | 
 |         Draw*Indirect and DispatchComputeIndirect commands ... The buffer | 
 |         objects affected by this bit are derived from the DRAW_INDIRECT_BUFFER | 
 |         and DISPATCH_INDIRECT_BUFFER bindings. | 
 |  | 
 |     In subsection 2.17.7, "Uniform Variables", replace the paragraph beginning | 
 |     "If <pname> is UNIFORM_BLOCK_REFERENCED_BY_VERTEX_SHADER,"... with: | 
 |  | 
 |         If <pname> is UNIFORM_BLOCK_REFERENCED_BY_VERTEX_SHADER, | 
 |     UNIFORM_BLOCK_REFERENCED_BY_TESS_CONTROL_SHADER, | 
 |     UNIFORM_BLOCK_REFERENCED_BY_TESS_EVALUATION_SHADER, | 
 |     UNIFORM_BLOCK_REFERENCED_BY_GEOMETRY_SHADER, | 
 |     UNIFORM_BLOCK_REFERENCED_BY_FRAGMENT_SHADER or | 
 |     UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER, then a boolean value indicating | 
 |     whether the uniform block identified by uniformBlockIndex is referenced | 
 |     by the vertex, tessellation control, tessellation evaluation, geometry, | 
 |     fragment or compute programming stages of <program>, respectively, is | 
 |     returned. | 
 |  | 
 |     Also in subsection 2.17.7, "Uniform Variables", replace the paragraph | 
 |     beginning, "If <pname> is ATOMIC_COUNTER_BUFFER_REFERENCED_BY_VERTEX_SHADER" | 
 |     on p.80 with: | 
 |  | 
 |         If <pname> is ATOMIC_COUNTER_BUFFER_REFERENCED_BY_VERTEX_SHADER, | 
 |     ATOMIC_COUNTER_BUFFER_REFERENCED_BY_TESS_CONTROL_SHADER, | 
 |     ATOMIC_COUNTER_BUFFER_REFERENCED_BY_TESS_EVALUATION_SHADER, | 
 |     ATOMIC_COUNTER_BUFFER_REFERENCED_BY_GEOMETRY_SHADER, | 
 |     ATOMIC_COUNTER_BUFFER_REFERENCED_BY_FRAGMENT_SHADER or | 
 |     ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER, then a single boolean | 
 |     value indicating whether the atomic counter buffer identified by | 
 |     bufferIndex is referenced by the vertex, tessellation control, tessellation | 
 |     evaluation, geometry, fragment or compute programming stages of | 
 |     <program>, respectively, is returned. | 
 |  | 
 |     Under the sub-heading "Uniform Blocks" in subsection 2.11.17, replace the | 
 |     sentence beginning "The limits for vertex, tessellation ..." on p.92 | 
 |     with: | 
 |  | 
 |         The limits for vertex, tessellation, geometry, fragment and compute | 
 |     shaders can be obtained by calling GetIntegerv with <pname> set to | 
 |     MAX_VERTEX_UNIFORM_BLOCKS, MAX_TESS_CONTROL_UNIFORM_BLOCKS, | 
 |     MAX_TESS_EVALUATION_UNIFORM_BLOCKS, MAX_GEOMETRY_UNIFORM_BLOCKS, | 
 |     MAX_FRAGMENT_UNIFORM_BLOCKS and MAX_COMPUTE_UNIFORM_BLOCKS, respectively. | 
 |  | 
 |     Under the sub-heading "Atomic Counter Buffers" in subsection 2.11.17, | 
 |     replace the sentence beginning "The limits for vertex, geometry, ..." | 
 |     on p.96 with: | 
 |  | 
 |         The limits for vertex, tessellation, geometry, fragment and compute | 
 |     shaders can be obtained by calling GetIntegerv with <pname> set to | 
 |     MAX_VERTEX_ATOMIC_COUNTER_BUFFERS, MAX_TESS_CONTROL_ATOMIC_COUNTER_BUFFERS, | 
 |     MAX_TESS_EVALUATION_ATOMIC_COUNTER_BUFFERS, | 
 |     MAX_GEOMETRY_ATOMIC_COUNTER_BUFFERS, MAX_FRAGMENT_ATOMIC_COUNTER_BUFFERS and | 
 |     MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS, respectively. | 
 |  | 
 | Additions to Chapter 3 of the OpenGL 4.2 (Core Profile) Specification | 
 | (Rasterization) | 
 |  | 
 |     None. | 
 |  | 
 | Additions to Chapter 4 of the OpenGL 4.2 (Core Profile) Specification | 
 | (Per-Fragment Operations and the Framebuffer) | 
 |  | 
 |     None. | 
 |  | 
 | Additions to Chapter 5 of the OpenGL 4.2 (Core Profile) Specification | 
 | (Special Functions) | 
 |  | 
 |     Add Section 5.5, "Compute Shaders" | 
 |  | 
 |         In addition to graphics-oriented shading operations such as vertex, | 
 |     tessellation, geometry and fragment shading, generic computation may be | 
 |     performed by the GL through the use of compute shaders. The compute pipeline | 
 |     is a form of single-stage machine that runs generic shaders. Compute shaders | 
 |     are created as described in section 2.11.1 using a <type> parameter of | 
 |     COMPUTE_SHADER. They are attached to and used in program objects as | 
 |     described in section 2.11.3. | 
 |  | 
 |         Compute workloads are formed from groups of work items called work | 
 |     groups and processed by the executable code for a compute program. A work | 
 |     group is a collection of shader invocations that execute the same code, | 
 |     potentially in parallel. An invocation within a work group may share data | 
 |     with other members of the same work group through shared variables and | 
 |     issue memory and control barriers to synchronize with other members of the | 
 |     same work group.  One or more work groups are launched by calling: | 
 |  | 
 |         void DispatchCompute(uint num_groups_x, | 
 |                              uint num_groups_y, | 
 |                              uint num_groups_z); | 
 |  | 
 |         Each work group is processed by the active program object for the | 
 |     compute shader stage.  The error INVALID_OPERATION will be generated if | 
 |     there is no active program object for the compute shader stage.  The | 
 |     active program for the compute shader stage will be determined in the same | 
 |     manner as the active program for other pipeline stages, as described in | 
 |     section 2.11.3.  While the individual shader invocations within a work | 
 |     group are executed as a unit, work groups are executed completely | 
 |     independently and in unspecified order. | 
 |  | 
 |         <num_groups_x>, <num_groups_y> and <num_groups_z> specify the number of | 
 |     local work groups that will be dispatched in the X, Y and Z dimensions, | 
 |     respectively. The builtin vector variable gl_NumWorkGroups will be | 
 |     initialized with the contents of the <num_groups_x>, <num_groups_y> and | 
 |     <num_groups_z> parameters. The maximum number of work groups that may be | 
 |     dispatched at one time may be determined by calling GetIntegeri_v with | 
 |     <pname> set to MAX_COMPUTE_WORK_GROUP_COUNT and <index> set to zero, one, | 
 |     or two, representing the X, Y, and Z dimensions, respectively. The | 
 |     values of <num_groups_x>, <num_groups_y> and <num_groups_z> must each | 
 |     be less than or equal to the maximum work group count for the corresponding | 
 |     dimension, otherwise an INVALID_VALUE error is generated. If the work group | 
 |     count in any dimension is zero, no work groups are dispatched. | 
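The following non-normative C sketch illustrates how an application might derive the work group counts described above and check them before dispatching. The helper names `groups_for` and `group_counts_ok` are hypothetical, and the `max_count` array is assumed to hold values the application has already read with GetIntegeri_v and MAX_COMPUTE_WORK_GROUP_COUNT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of work groups needed so that groups * local_size >= items
 * (ceiling division). Returns 0 when items is 0, in which case a
 * dispatch would launch no work groups. */
static uint32_t groups_for(uint32_t items, uint32_t local_size)
{
    assert(local_size > 0); /* precondition: avoids division by zero */
    return items / local_size + (items % local_size != 0 ? 1u : 0u);
}

/* Mirrors the INVALID_VALUE rule: every count must be less than or equal
 * to the limit for its dimension. max_count[] is an assumption standing
 * in for the three queried MAX_COMPUTE_WORK_GROUP_COUNT values. */
static bool group_counts_ok(const uint32_t count[3], const uint32_t max_count[3])
{
    for (int i = 0; i < 3; ++i) {
        if (count[i] > max_count[i])
            return false;
    }
    return true;
}
```

Because the ceiling division can launch more invocations than there are items, a shader used this way would normally compare its invocation ID against the item count and return early for the excess invocations.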
 |  | 
 |         The local work size in each dimension is specified at compile time | 
 |     using an input layout qualifier in one or more of the compute shaders | 
 |     attached to the program (see Section 4 of the OpenGL Shading Language | 
 |     Specification). After the program has been linked, the local work group size | 
 |     of the program may be retrieved by calling GetProgramiv with <pname> set to | 
 |     COMPUTE_WORK_GROUP_SIZE. This will return an array of three integers | 
 |     containing the local work group size of the compute program as specified by | 
 |     its input layout qualifier(s). If <program> is the name of a program that | 
 |     has not been successfully linked, or is the name of a linked program object | 
 |     that contains no compute shaders, then an INVALID_OPERATION error is | 
 |     generated. | 
 |  | 
 |         The maximum size of a local work group may be determined by calling | 
 |     GetIntegeri_v with <pname> set to MAX_COMPUTE_WORK_GROUP_SIZE | 
 |     and <index> set to 0, 1, or 2 to retrieve the maximum work size in the | 
 |     X, Y and Z dimension, respectively. Furthermore, the maximum number of | 
 |     invocations in a single local work group (i.e., the product of the three | 
 |     dimensions) may be determined by calling GetIntegerv with <pname> set to | 
 |     MAX_COMPUTE_WORK_GROUP_INVOCATIONS. | 
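The two limits above apply together: each dimension has its own maximum, and the product of the three dimensions has a separate maximum. A non-normative C sketch of that combined check follows. The function name `local_size_fits` is hypothetical, and the limits are passed in as parameters because querying them requires a live GL context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Checks a local work group size against the per-dimension maximum
 * (MAX_COMPUTE_WORK_GROUP_SIZE) and against the maximum number of
 * invocations per work group (MAX_COMPUTE_WORK_GROUP_INVOCATIONS). */
static bool local_size_fits(const uint32_t size[3],
                            const uint32_t max_size[3],
                            uint32_t max_invocations)
{
    uint64_t product = 1;
    assert(size != NULL && max_size != NULL);

    for (int i = 0; i < 3; ++i) {
        if (size[i] > max_size[i])
            return false;
        product *= size[i];
        /* Checking after every step keeps the running product below
         * 2^64, since both factors are smaller than 2^32. */
        if (product > max_invocations)
            return false;
    }
    return true;
}
```

For example, with illustrative limits of 1024 x 1024 x 64 and 1024 invocations, a 32 x 32 x 1 work group passes while 32 x 32 x 2 fails on the product even though each dimension is within its own limit.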
 |  | 
 |         The command | 
 |  | 
 |         void DispatchComputeIndirect(intptr indirect); | 
 |  | 
 |     is equivalent (assuming no errors are generated) to calling | 
 |     DispatchCompute with <num_groups_x>, <num_groups_y> and <num_groups_z> | 
 |     initialized with the three uint values contained in the buffer currently | 
 |     bound to the DISPATCH_INDIRECT_BUFFER binding at an offset, in basic | 
 |     machine units, specified by <indirect>.  The error INVALID_VALUE is | 
 |     generated if <indirect> is less than zero or is not a multiple of four. | 
 |     The error INVALID_OPERATION is generated if no buffer is bound to | 
 |     DISPATCH_INDIRECT_BUFFER, if the command would source data beyond the end | 
 |     of the buffer object, or if there is no active program for the compute | 
 |     shader stage.  If any of <num_groups_x>, <num_groups_y> or <num_groups_z> | 
 |     is greater than MAX_COMPUTE_WORK_GROUP_COUNT for the corresponding | 
 |     dimension then the results are undefined. | 
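To make the offset rules concrete, the following non-normative C sketch emulates the argument fetch on a CPU-side copy of the buffer contents. The function name `read_indirect_counts` is hypothetical, uint is assumed to be 32 bits wide, and the sketch only reports success or failure rather than distinguishing the specific GL error codes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads the three uint values that DispatchComputeIndirect would source
 * from a CPU-side copy of the buffer. Returns false when the offset is
 * negative, is not a multiple of four, or would read past the end. */
static bool read_indirect_counts(const unsigned char *buffer, size_t buffer_size,
                                 ptrdiff_t indirect, uint32_t out_groups[3])
{
    const size_t needed = 3 * sizeof(uint32_t);
    assert(buffer != NULL && out_groups != NULL);

    if (indirect < 0 || (size_t)indirect % 4 != 0)
        return false;
    if (buffer_size < needed || (size_t)indirect > buffer_size - needed)
        return false;

    /* memcpy avoids pointer casts that could violate alignment or
     * aliasing rules on the host. */
    memcpy(out_groups, buffer + indirect, needed);
    return true;
}
```

An application that fills the buffer itself could run a check like this before the dispatch, since counts above MAX_COMPUTE_WORK_GROUP_COUNT lead to undefined results rather than an error on the indirect path.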
 |  | 
 |     Add Subsection 5.5.1, "Compute Shader Variables" | 
 |  | 
 |         Compute shaders can access variables belonging to the current program | 
 |     object. The amount of storage in the default uniform block accessed by a | 
 |     compute shader is specified by the value of the implementation dependent | 
 |     constant MAX_COMPUTE_UNIFORM_COMPONENTS. The total amount of | 
 |     combined storage available for uniform variables in all uniform blocks | 
 |     accessed by a compute shader (including the default uniform block) is | 
 |     specified by the implementation dependent constant | 
 |     MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS. | 
 |  | 
 |         There is a limit to the total size of all variables declared as | 
 |     <shared> in a single program object. This limit, expressed in units of | 
 |     basic machine units, may be queried as the value of | 
 |     MAX_COMPUTE_SHARED_MEMORY_SIZE. | 
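As a rough illustration of budgeting against this limit, the non-normative C sketch below estimates the storage for one shared array and compares it with the queried value. The tile-with-border pattern and the names `tile_bytes` and `shared_storage_fits` are assumptions made for the example. The estimate also assumes tightly packed elements, which an implementation is not required to match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Estimated bytes for a 2D shared array covering a tile_x by tile_y work
 * group plus a border of `halo` elements on every side. The pattern is
 * illustrative only; the extension does not prescribe it. */
static size_t tile_bytes(size_t tile_x, size_t tile_y, size_t halo, size_t elem_size)
{
    assert(elem_size > 0);
    return (tile_x + 2 * halo) * (tile_y + 2 * halo) * elem_size;
}

/* Compares the total shared storage a program would declare against the
 * limit; shared_limit stands in for MAX_COMPUTE_SHARED_MEMORY_SIZE. */
static bool shared_storage_fits(size_t total_bytes, size_t shared_limit)
{
    return total_bytes <= shared_limit;
}
```

With 4-byte elements, a 16 x 16 tile with a one-element border needs 18 * 18 * 4 = 1296 basic machine units under these assumptions.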
 |  | 
 | Additions to Chapter 6 of the OpenGL 4.2 (Core Profile) Specification | 
 | (State and State Requests) | 
 |  | 
 |     None. | 
 |  | 
 | Additions to Chapter 2 of the OpenGL Shading Language Specification, Version | 
 | 4.20 (Overview of OpenGL Shading) | 
 |  | 
 |     Replace the last sentence of the first paragraph of the overview with | 
 |     the following:  | 
 |  | 
 |     "Currently, these processors are the vertex, tessellation control,  | 
 |      tessellation evaluation, geometry, fragment, and compute processors." | 
 |  | 
 |     Replace the last sentence of the second paragraph of the overview with | 
 |     the following: | 
 |  | 
 |     "The specific languages will be referred to by the name of the processor | 
 |      they target: vertex, tessellation control, tessellation evaluation,  | 
 |      geometry, fragment, or compute." | 
 |  | 
 |     Add a new Section 2.6 titled "Compute Processor" with the following text: | 
 |  | 
 |     "The <compute processor> is a programmable unit that operates independently | 
 |     from the other shader processors. Compilation units written in the OpenGL | 
 |     Shading Language to run on this processor are called <compute shaders>.  | 
 |     When a complete set of compute shaders are compiled and linked, they  | 
 |     result in a <compute shader executable> that runs on the compute processor.  | 
 |  | 
 |     A compute shader has access to many of the same resources as fragment and | 
 |     other shader processors, such as textures, buffers, image variables,  | 
 |     atomic counters, and so on. It does not have any predefined inputs  | 
 |     nor any fixed-function outputs.  It is not part of the graphics pipeline | 
 |     and its visible side effects are through actions on images, storage  | 
 |     buffers, and atomic counters.   | 
 |  | 
 |     A compute shader operates on a group of work items called a work group. | 
 |     A work group is a collection of shader invocations that execute the same | 
 |     code, potentially in parallel. An invocation within a work group may share | 
 |     data with other members of the same work group through shared variables and | 
 |     issue memory and control barriers to synchronize with other members of the | 
 |     same work group." | 
 |  | 
 | Additions to Chapter 4 of the OpenGL Shading Language Specification, Version | 
 | 4.20 (Variables and Types) | 
 |  | 
 |     Modify section 4.4.1, second paragraph from  | 
 |  | 
 |     "All shaders allow input layout qualifiers on input variable declarations." | 
 |  | 
 |     to | 
 |   | 
 |     "All shaders, except compute shaders, allow input layout location qualifiers on  | 
 |      input variable declarations." | 
 |  | 
 |     Modify Section 4.3. Add to the table at the start of Section 4.3: | 
 |  | 
 |     +-------------------+-----------------------------------------------------------+ | 
 |     | Storage Qualifier | Meaning                                                   | | 
 |     +-------------------+-----------------------------------------------------------+ | 
 |     | <shared>          | variable storage is shared across all work items in a     | | 
 |     |                   | local work group for compute shaders                      | | 
 |     +-------------------+-----------------------------------------------------------+ | 
 |  | 
 |     Add the following paragraph to Section 4.3.4, "Input Variables" | 
 |  | 
 |         Compute shaders do not permit user-defined input variables and do not | 
 |     form a formal interface with any other shader stage. See section 7.1 | 
 |     for a description of built-in compute shader input variables. All other | 
 |     input to a compute shader is retrieved explicitly through image loads, | 
 |     texture fetches, loads from uniforms or uniform buffers, or other user | 
 |     supplied code. Redeclaration of built-in input variables in compute | 
 |     shaders is not permitted. | 
 |  | 
 |     Add the following paragraph to Section 4.3.6, "Output Variables" | 
 |  | 
 |         Compute shaders have no built-in output variables, do not support | 
 |     user-defined output variables and do not form a formal interface with any | 
 |     other shader stage. All outputs from a compute shader take the form of the | 
 |     side effects such as image stores and operations on atomic counters. | 
 |  | 
 |     Add Section 4.3.7, "Shared", renumber subsequent sections | 
 |  | 
 |         The <shared> qualifier is used to declare variables that have storage | 
 |     shared between all work items of a compute shader local work | 
 |     group. Variables declared as <shared> may only be used in compute shaders | 
 |     (see Section 5.5, "Compute Shaders"). Shared variables are implicitly | 
 |     coherent. That is, writes to shared variables from one shader invocation | 
 |     will eventually be seen by other invocations within the same local work | 
 |     group. | 
 |  | 
 |         Variables declared as <shared> may not have initializers and their | 
 |     contents are undefined at the beginning of shader execution. Any data | 
 |     written to <shared> variables will be visible to other invocations executing | 
 |     the same shader within the same local work group. Order of execution | 
 |     with regards to reads and writes to the same <shared> variables by different | 
 |     invocations of a shader is not defined. In order to achieve ordering with  | 
 |     respect to reads and writes to <shared> variables, memory barriers must be  | 
 |     employed using the barrier() function (see Section 8.15). | 
 |  | 
 |         There is a limit to the total size of all variables declared as | 
 |     <shared> in a single program object. This limit, expressed in units of | 
 |     basic machine units, may be determined by using the OpenGL API to query the  | 
 |     value of MAX_COMPUTE_SHARED_MEMORY_SIZE. | 
 |  | 
 |     Add Section 4.4.1.4, "Compute-Shader Inputs" | 
 |  | 
 |     There are no layout location qualifiers for compute shader inputs. | 
 |  | 
 |     Layout qualifier identifiers for compute shader inputs are the work-group  | 
 |     size qualifiers: | 
 |  | 
 |         layout-qualifier-id | 
 |             local_size_x = integer-constant | 
 |             local_size_y = integer-constant | 
 |             local_size_z = integer-constant | 
 |  | 
 |     <local_size_x>, <local_size_y>, and <local_size_z> are used to define the | 
 |     local size of the kernel defined by the compute shader in the first, | 
 |     second, and third dimension, respectively. The default size in each | 
 |     dimension is 1. If a shader does not specify a size for one of the | 
 |     dimensions, that dimension will have a size of 1. | 
 |  | 
 |     For example, the following declaration in a compute shader | 
 |  | 
 |         layout (local_size_x = 32, local_size_y = 32) in; | 
 |  | 
 |     is used to declare a two-dimensional compute shader with a local size of | 
 |     32 x 32 elements, which is equivalent to a three-dimensional compute shader | 
 |     where the third dimension is one element deep. | 
 |  | 
 |     As another example, the declaration | 
 |  | 
 |         layout (local_size_x = 8) in; | 
 |  | 
 |     effectively specifies that a one-dimensional compute shader is being | 
 |     compiled, and its size is 8 elements.  | 
 |  | 
 |         If the local size of the shader in any dimension is greater than the | 
 |     maximum size supported by the implementation for that dimension, a | 
 |     compile-time error results. Also, if such a layout qualifier is declared more | 
 |     than once in the same shader, all those declarations must indicate the same local | 
 |     work-group size; otherwise a compile-time error results. If multiple compute | 
 |     shaders attached to a single program object declare local work-group size, | 
 |     the declarations must be identical; otherwise a link-time error results. | 
 |     Furthermore, if a program object contains any compute shaders, at | 
 |     least one must contain an input layout qualifier specifying the local work | 
 |     sizes of the program, or a link-time error will occur. | 
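The link-time rules in the preceding paragraph can be modeled as a small reconciliation step over the shaders attached to a program. The following non-normative C sketch shows one way to express them. The type `LocalSizeDecl` and the function `resolve_local_size` are hypothetical, and the per-shader compile-time checks are assumed to have already passed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One shader's local size declaration, if any. Dimensions the shader
 * omits are stored as 1, matching the default size. */
typedef struct {
    bool     declared;   /* false: shader has no local_size_* qualifier */
    uint32_t size[3];
} LocalSizeDecl;

/* Emulates the link-time rules: all declaring shaders must agree, and at
 * least one shader must declare a size. Returns true and fills out[] on
 * success; returns false where a link-time error would occur. */
static bool resolve_local_size(const LocalSizeDecl *shaders, size_t count,
                               uint32_t out[3])
{
    bool found = false;
    assert(shaders != NULL || count == 0);

    for (size_t s = 0; s < count; ++s) {
        if (!shaders[s].declared)
            continue;
        if (!found) {
            for (int i = 0; i < 3; ++i)
                out[i] = shaders[s].size[i];
            found = true;
            continue;
        }
        for (int i = 0; i < 3; ++i) {
            if (out[i] != shaders[s].size[i])
                return false; /* conflicting declarations */
        }
    }
    return found; /* false: no shader declared a local size */
}
```

In this model a shader declaring only local_size_x = 8 is recorded as {8, 1, 1}, so it conflicts with another shader declaring 32 x 32 but agrees with one declaring 8 x 1 x 1 explicitly.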
 |  | 
 | Additions to Chapter 7 of the OpenGL Shading Language Specification, Version | 
 | 4.20 (Built-in Variables) | 
 |  | 
 |     Add to the start of Section 7.1, "Built-In Language Variables", before the | 
 |     description of the vertex language built-in variables: | 
 |  | 
 |         In the compute language, the built-in variables are declared as follows: | 
 |  | 
 |         // work group dimensions | 
 |         in    uvec3 gl_NumWorkGroups; | 
 |         const uvec3 gl_WorkGroupSize; | 
 |  | 
 |         // work group and invocation IDs | 
 |         in    uvec3 gl_WorkGroupID; | 
 |         in    uvec3 gl_LocalInvocationID; | 
 |  | 
 |         // derived variables | 
 |         in    uvec3 gl_GlobalInvocationID; | 
 |         in    uint  gl_LocalInvocationIndex; | 
 |  | 
 |     Add to the end of Section 7.1, before Section 7.1.1: | 
 |  | 
 |         The built-in variable <gl_NumWorkGroups> is a compute-shader input | 
 |     variable containing the number of work groups, in each dimension, that | 
 |     will execute the compute shader. | 
 |     Its content is equal to the values specified in the <num_groups_x>, | 
 |     <num_groups_y>, and <num_groups_z> parameters passed to the  | 
 |     DispatchCompute API entry point. | 
 |  | 
 |         The built-in constant <gl_WorkGroupSize> is a compute-shader constant | 
 |     containing the local work-group size of the shader. The size of the work | 
 |     group in the X, Y, and Z dimensions is stored in the x, y, and z components. | 
 |     The values stored in <gl_WorkGroupSize> match those specified in the  | 
 |     required <local_size_x>, <local_size_y>, and <local_size_z> layout | 
 |     qualifiers for the current shader. This value is constant so that | 
 |     it can be used to size arrays of memory that can be shared within | 
 |     the local work group. | 
 |  | 
 |         The built-in variable <gl_WorkGroupID> is a compute-shader input | 
 |     variable containing the 3-dimensional index of the global work group  | 
 |     that the current invocation is executing in. The possible values range | 
 |     across the parameters passed into DispatchCompute, i.e., from (0, 0, 0) to | 
 |     (gl_NumWorkGroups.x - 1, gl_NumWorkGroups.y - 1, gl_NumWorkGroups.z - 1). | 
 |  | 
 |         The built-in variable <gl_LocalInvocationID> is a compute-shader input | 
 |     variable containing the 3-dimensional index of the current invocation | 
 |     within the local work group that it is executing in. | 
 |     The possible values for this variable range across the local work group | 
 |     size, i.e. (0,0,0) to (gl_WorkGroupSize.x - 1, gl_WorkGroupSize.y - 1, | 
 |     gl_WorkGroupSize.z - 1). | 
 |  | 
 |         The built-in variable <gl_GlobalInvocationID> is a compute shader input | 
 |     variable containing the global index of the current work item.  This | 
 |     value uniquely identifies this invocation from all other invocations  | 
 |     across all local and global work groups initiated by the current  | 
 |     DispatchCompute call.  This is computed as: | 
 |  | 
 |         gl_GlobalInvocationID =  | 
 |             gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID. | 
 |  | 
 |         The built-in variable <gl_LocalInvocationIndex> is a compute shader | 
 |     input variable that contains the 1-dimensional representation of the | 
 |     gl_LocalInvocationID. This is useful for identifying a unique region | 
 |     of shared memory within the local work group for this invocation to | 
 |     use. This is computed as: | 
 |  | 
 |         gl_LocalInvocationIndex =  | 
 |             gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y +  | 
 |             gl_LocalInvocationID.y * gl_WorkGroupSize.x +  | 
 |             gl_LocalInvocationID.x; | 
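
The two derived variables can be reproduced on the CPU, which is sometimes
handy when checking the indexing of a shader by hand. The C sketch below
evaluates the formulas given above. The uvec3 struct and both function names
are assumptions made for illustration and are not part of any API.

```c
#include <assert.h>

/* Stand-in for the GLSL uvec3 type (an assumption of this sketch). */
typedef struct { unsigned x, y, z; } uvec3;

/* gl_GlobalInvocationID =
 *     gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID,
 * evaluated per component. */
static uvec3 global_invocation_id(uvec3 work_group_id,
                                  uvec3 work_group_size,
                                  uvec3 local_id)
{
    uvec3 g;
    g.x = work_group_id.x * work_group_size.x + local_id.x;
    g.y = work_group_id.y * work_group_size.y + local_id.y;
    g.z = work_group_id.z * work_group_size.z + local_id.z;
    return g;
}

/* gl_LocalInvocationIndex: the local ID flattened with x varying fastest. */
static unsigned local_invocation_index(uvec3 local_id, uvec3 work_group_size)
{
    /* The local ID must lie inside the work group. */
    assert(local_id.x < work_group_size.x);
    assert(local_id.y < work_group_size.y);
    assert(local_id.z < work_group_size.z);

    return local_id.z * work_group_size.x * work_group_size.y
         + local_id.y * work_group_size.x
         + local_id.x;
}
```

For a 32 x 32 x 1 work group, the invocation with local ID (5, 7, 0) in work
group (2, 1, 0) has global ID (69, 39, 0) and local index 7 * 32 + 5 = 229.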
 |  | 
 |     Add to the list of built-in constants in Section 7.3: | 
 |  | 
 |         const ivec3 gl_MaxComputeWorkGroupCount = { 65535, 65535, 65535 }; | 
 |         const ivec3 gl_MaxComputeWorkGroupSize = { 1024, 1024, 64 }; | 
 |         const int gl_MaxComputeUniformComponents = 512; | 
 |         const int gl_MaxComputeTextureImageUnits = 16; | 
 |         const int gl_MaxComputeImageUniforms = 8; | 
 |         const int gl_MaxComputeAtomicCounters = 8; | 
 |         const int gl_MaxComputeAtomicCounterBuffers = 1; | 
 |  | 
 | Additions to Chapter 8 of the OpenGL Shading Language Specification, Version | 
 | 4.20 (Built-in Functions) | 
 |  | 
 |     Insert "Atomic Memory Functions" section after Section 8.10, Atomic | 
 |     Counter Functions (p. 149).  Atomic memory operations are supported on | 
 |     shared variables; the set of operations and their definitions are similar | 
 |     to those for the imageAtomic*() functions.  These functions are fully | 
 |     documented in the ARB_shader_storage_buffer_object extension (see | 
 |     dependencies). | 
 |  | 
 |     Modify the first paragraph of Section 8.15, "Shader Invocation Control | 
 |     Functions" to read: | 
 |  | 
 |         The shader invocation control function is only available in tessellation | 
 |     control shaders and compute shaders. It is used to control the relative | 
 |     execution order of multiple shader invocations used to process a patch | 
 |     (in the case of tessellation control shaders) or a local work group (in the | 
 |     case of compute shaders), which are otherwise executed with an undefined | 
 |     order. | 
 |  | 
 |     +----------------+--------------------------------------------------------------------------+ | 
 |     | Syntax         | Description                                                              | | 
 |     +----------------+--------------------------------------------------------------------------+ | 
 |     | barrier        | For any given static instance of barrier() appearing in a tessellation   | | 
 |     |                | control shader or compute shader, all invocations for a single patch     | | 
 |     |                | or work group, respectively, must enter it before any will continue      | | 
 |     |                | beyond it.                                                               | | 
 |     +----------------+--------------------------------------------------------------------------+ | 
 |  | 
 |     Modify the second paragraph as follows: | 
 |  | 
 |     ... Because invocations may execute in an undefined order between these | 
 |     barrier calls, the values of a per-vertex or per-patch output variable in | 
 |     a tessellation control shader or shared variables for compute shaders | 
 |     will be undefined in a number of cases enumerated in Section 4.3.6 "Output | 
 |     Variables" (for tessellation control shaders) and Section 4.3.7 "Shared | 
 |     Variables" (for compute shaders). | 
 |  | 
 |     Replace the third paragraph with the following: | 
 |  | 
 |     For tessellation control shaders, the barrier() function may only be | 
 |     placed inside the function main() of the tessellation control shader and | 
 |     may not be called within any control flow. Barriers are also disallowed | 
 |     after a return statement in the function main(). Any such misplaced | 
 |     barriers result in a compile-time error. | 
 |  | 
 |     For compute shaders, the barrier() function may be placed within flow | 
 |     control, but that flow control must be uniform flow control. That is, all | 
 |     the controlling expressions that lead to execution of the barrier must be | 
 |     dynamically uniform expressions. This ensures that if any shader | 
 |     invocation enters a conditional statement, then all invocations will enter | 
 |     it. While compilers are encouraged to give warnings if they can detect | 
 |     this might not happen, compilers cannot completely determine this. Hence, | 
 |     it is the author's responsibility to ensure barrier() only exists inside | 
 |     uniform flow control. Otherwise, some shader invocations will stall | 
 |     indefinitely, waiting for a barrier that is never reached by other | 
 |     invocations. | 
 |  | 
 |     Modify the table of memory control functions on p.160: | 
 |  | 
 |     +-----------------------------------+----------------------------------------------------------------------------------------+ | 
 |     | Syntax                            | Description                                                                            | | 
 |     +-----------------------------------+----------------------------------------------------------------------------------------+ | 
 |     | void memoryBarrier()              | Control the ordering of all memory transactions issued by a single shader invocation.  | | 
 |     +-----------------------------------+----------------------------------------------------------------------------------------+ | 
 |     | void memoryBarrierAtomicCounter() | Control the ordering of accesses to atomic counter variables issued by a single shader | | 
 |     |                                   | invocation.                                                                            | | 
 |     +-----------------------------------+----------------------------------------------------------------------------------------+ | 
 |     | void memoryBarrierBuffer()        | Control the ordering of memory transactions to buffer variables issued within a        | | 
 |     |                                   | single shader invocation.                                                              | | 
 |     +-----------------------------------+----------------------------------------------------------------------------------------+ | 
 |     | void memoryBarrierImage()         | Control the ordering of memory transactions to images issued within a single shader    | | 
 |     |                                   | invocation.                                                                            | | 
 |     +-----------------------------------+----------------------------------------------------------------------------------------+ | 
 |     | void memoryBarrierShared()        | Control the ordering of memory transactions to shared variables issued within a single | | 
 |     |                                   | shader invocation.                                                                     | | 
 |     |                                   | Only available in compute shaders.                                                     | | 
 |     +-----------------------------------+----------------------------------------------------------------------------------------+ | 
 |     | void groupMemoryBarrier()         | Control the ordering of all memory transactions issued within a single shader          | | 
 |     |                                   | invocation, as viewed by other invocations in the same work group.                     | | 
 |     |                                   | Only available in compute shaders.                                                     | | 
 |     +-----------------------------------+----------------------------------------------------------------------------------------+ | 
 |  | 
 |     Modify the subsequent paragraph as follows: | 
 |  | 
 |     The memory barrier built-in functions can be used to order reads and | 
 |     writes to variables stored in memory accessible to other shader | 
 |     invocations.  When called, these functions will wait for the completion of | 
 |     all reads and writes previously performed by the caller that access | 
 |     selected variable types, and then return with no other effect.  The | 
 |     built-in functions memoryBarrierAtomicCounter(), memoryBarrierBuffer(), | 
 |     memoryBarrierImage(), and memoryBarrierShared() wait for the completion of | 
 |     accesses to atomic counter, buffer, image, and shared variables, | 
 |     respectively.  The built-in functions memoryBarrier() and | 
 |     groupMemoryBarrier() wait for the completion of accesses to all of the | 
 |     above variable types.  The functions memoryBarrierShared() and | 
 |     groupMemoryBarrier() are available only in compute shaders; the other | 
 |     functions are available in all shader types. | 
 |  | 
 |     When these functions return, any memory stores performed using coherent | 
 |     variables prior to the call will be visible to any future coherent access | 
 |     to the same memory performed by any other shader invocation.  In | 
 |     particular, the values written this way in one shader stage are guaranteed | 
 |     to be visible to coherent memory accesses performed by shader invocations | 
 |     in subsequent stages when those invocations were triggered by the | 
 |     execution of the original shader invocation (e.g., fragment shader | 
 |     invocations for a primitive resulting from a particular geometry shader | 
 |     invocation). | 
 |  | 
 |     Additionally, memory barrier functions order stores performed by the | 
 |     calling invocation, as observed by other shader invocations.  Without | 
 |     memory barriers, if one shader invocation performs two stores to coherent | 
 |     variables, a second shader invocation might see the values written by the | 
 |     second store prior to seeing those written by the first.  However, if the | 
 |     first shader invocation calls a memory barrier function between the two | 
 |     stores, selected other shader invocations will never see the results of | 
 |     the second store before seeing those of the first.  When using the | 
 |     function groupMemoryBarrier(), this ordering guarantee applies only to | 
 |     other shader invocations in the same compute shader work group; all other | 
 |     memory barrier functions provide the guarantee to all other shader | 
 |     invocations.  No memory barrier is required to guarantee the order of | 
 |     memory stores as observed by the invocation performing the stores; an | 
 |     invocation reading from a variable that it previously wrote will always | 
 |     see the most recently written value unless another shader invocation also | 
 |     wrote to the same memory. | 
 |  | 
 | Dependencies on OpenGL 4.3 and ARB_shader_storage_buffer_object | 
 |  | 
 |     If OpenGL 4.3 and ARB_shader_storage_buffer_object are not supported, the | 
 |     spec language adding the built-in functions atomicAdd(), atomicMin(), | 
 |     atomicMax(), atomicAnd(), atomicOr(), atomicXor(), atomicExchange(), and | 
 |     atomicCompSwap() should be considered to be incorporated into this | 
 |     extension as-is, except that buffer variables will not be supported and | 
 |     thus cannot be used with these functions.  No "#extension" directive is | 
 |     necessary to use these functions in compute shaders. | 
 |  | 
 |     If OpenGL 4.3 and ARB_shader_storage_buffer_object are not supported, | 
 |     references to the GLSL built-in function memoryBarrierBuffer() should be | 
 |     removed. | 
 |  | 
 | Dependencies on NV_vertex_buffer_unified_memory | 
 |  | 
 |     If NV_vertex_buffer_unified_memory is supported, a new buffer address | 
 |     range and enable are provided to permit the use of | 
 |     DispatchComputeIndirect with a resident buffer object without requiring | 
 |     that it be bound to the DISPATCH_INDIRECT_BUFFER target.  The following | 
 |     additional edits apply: | 
 |          | 
 |     Accepted by the <target> parameter of GetBufferParameterui64vNV: | 
 |  | 
 |         DISPATCH_INDIRECT_BUFFER                        (defined above) | 
 |  | 
 |     Accepted by the <cap> parameter of Disable, Enable, and IsEnabled, and by | 
 |     the <pname> parameter of GetIntegerv, GetBooleanv, GetFloatv, GetDoublev | 
 |     and GetInteger64v: | 
 |  | 
 |         DISPATCH_INDIRECT_UNIFIED_NV                    0x90FD | 
 |  | 
 |     Accepted by the <pname> parameter of BufferAddressRangeNV  | 
 |     and the <value> parameter of GetIntegerui64vNV:  | 
 |  | 
 |         DISPATCH_INDIRECT_ADDRESS_NV                    0x90FE | 
 |  | 
 |     Accepted by the <value> parameter of GetIntegerv: | 
 |  | 
 |         DISPATCH_INDIRECT_LENGTH_NV                     0x90FF | 
 |  | 
 |     Add to the end of Section 5.5, after discussion of | 
 |     DispatchComputeIndirect: | 
 |  | 
 |     If DISPATCH_INDIRECT_UNIFIED_NV is enabled, DispatchComputeIndirect does | 
 |     not use the buffer bound to DISPATCH_INDIRECT_BUFFER.  Instead, it sources | 
 |     its arguments from the GPU address range specified by calling | 
 |     BufferAddressRangeNV with a <pname> of DISPATCH_INDIRECT_ADDRESS_NV and an | 
 |     <index> of zero.  The address is obtained by adding the <indirect> | 
 |     parameter to the base address of the range, specified by the <address> | 
 |     parameter of BufferAddressRangeNV.  If the command sources data outside | 
 |     the specified address range, the error INVALID_OPERATION will be | 
 |     generated.  The DISPATCH_INDIRECT_BUFFER binding will be ignored in this | 
 |     case, and no errors will be generated due to the use of this binding.  The | 
 |     error INVALID_VALUE will still be generated if <indirect> is negative.  No | 
 |     INVALID_VALUE error will be generated if <indirect> is not a multiple of | 
 |     four, but INVALID_OPERATION will be generated if the effective address is | 
 |     not a multiple of four.  If the indirect dispatch address range does not | 
 |     belong to a buffer object that is resident at the time of the | 
 |     DispatchComputeIndirect call, undefined results, possibly including | 
 |     program termination, may occur. | 
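
The error rules in the paragraph above can be summarized as a small decision
function. The C sketch below is a model, not driver code: the enum, the
function name, and the order in which the checks are made are assumptions,
since the text does not state which error takes precedence. It also assumes
that one indirect dispatch sources three unsigned integers (12 bytes), which
are the three work group counts.

```c
#include <stdint.h>

/* Hypothetical result codes standing in for the GL error enums. */
enum dispatch_check {
    DISPATCH_OK,
    DISPATCH_INVALID_VALUE,
    DISPATCH_INVALID_OPERATION
};

/* Bytes sourced by one indirect dispatch: three unsigned integers
 * (num_groups_x, num_groups_y, num_groups_z). */
#define DISPATCH_COMMAND_BYTES 12u

/* range_address and range_length model the values given to
 * BufferAddressRangeNV for DISPATCH_INDIRECT_ADDRESS_NV. */
static enum dispatch_check check_unified_indirect(uint64_t range_address,
                                                  uint64_t range_length,
                                                  int64_t indirect)
{
    uint64_t offset;

    /* A negative <indirect> is still INVALID_VALUE. */
    if (indirect < 0)
        return DISPATCH_INVALID_VALUE;

    /* Sourcing data outside the address range is INVALID_OPERATION. */
    offset = (uint64_t)indirect;
    if (range_length < DISPATCH_COMMAND_BYTES ||
        offset > range_length - DISPATCH_COMMAND_BYTES)
        return DISPATCH_INVALID_OPERATION;

    /* Only the effective address must be a multiple of four. */
    if ((range_address + offset) % 4u != 0u)
        return DISPATCH_INVALID_OPERATION;

    return DISPATCH_OK;
}
```

Note that an <indirect> of 2 is accepted by this model when the base address
is 0x1002, because the effective address 0x1004 is a multiple of four.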
 |  | 
 |     Add the following to the "Compute Dispatch State" table defined in this | 
 |     extension: | 
 |  | 
 |     Get Value                           Type    Get Command         Initial Value   Sec     Attribute | 
 |     ---------                           ----    -----------         -------------   ---     --------- | 
 |     DISPATCH_INDIRECT_UNIFIED_NV         B      IsEnabled               FALSE       5.5     none | 
 |     DISPATCH_INDIRECT_ADDRESS_NV        Z64+    GetIntegerui64vNV         0         5.5     none | 
 |     DISPATCH_INDIRECT_LENGTH_NV          Z+     GetIntegerv               0         5.5     none | 
 |  | 
 | Errors | 
 |  | 
 |     INVALID_OPERATION is generated by DispatchCompute or | 
 |     DispatchComputeIndirect if there is no active program for the compute | 
 |     shader stage. | 
 |  | 
 |     INVALID_VALUE is generated by DispatchCompute if any of <num_groups_x>, | 
 |     <num_groups_y> or <num_groups_z> is greater than the value of | 
 |     MAX_COMPUTE_WORK_GROUP_COUNT for the corresponding dimension. | 
 |  | 
 |     INVALID_VALUE is generated by DispatchComputeIndirect if <indirect> is | 
 |     less than zero or not a multiple of four. | 
 |  | 
 |     INVALID_OPERATION is generated by DispatchComputeIndirect if no buffer is | 
 |     bound to DISPATCH_INDIRECT_BUFFER or if the command would source data | 
 |     beyond the end of the bound buffer object. | 
 |  | 
 |     INVALID_OPERATION is generated by GetProgramiv if <pname> is | 
 |     COMPUTE_WORK_GROUP_SIZE and either the program has not been linked | 
 |     successfully, or has been linked but contains no compute shaders. | 
 |  | 
 |     LinkProgram will fail if <program> contains a combination of compute and  | 
 |     non-compute shaders. | 
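
An application can avoid the INVALID_VALUE errors listed above by checking
its arguments before dispatching. The C sketch below is one possible way to
do that. All three helper names are hypothetical, the limit of 65535 per
dimension is the minimum value of MAX_COMPUTE_WORK_GROUP_COUNT from this
extension (a real application would query it), and the round-up helper
reflects a common convention for covering a problem size rather than
anything this extension requires.

```c
#include <assert.h>

/* Assumed limit: the minimum MAX_COMPUTE_WORK_GROUP_COUNT per dimension. */
static const unsigned kMaxGroupCount[3] = { 65535u, 65535u, 65535u };

/* Number of work groups needed to cover item_count items when each work
 * group handles local_size of them (rounds up). */
static unsigned groups_for(unsigned item_count, unsigned local_size)
{
    assert(local_size > 0u);
    return item_count / local_size + (item_count % local_size != 0u);
}

/* Mirrors the INVALID_VALUE condition listed for DispatchCompute. */
static int group_counts_valid(unsigned nx, unsigned ny, unsigned nz)
{
    return nx <= kMaxGroupCount[0] &&
           ny <= kMaxGroupCount[1] &&
           nz <= kMaxGroupCount[2];
}

/* Mirrors the INVALID_VALUE condition listed for DispatchComputeIndirect. */
static int indirect_offset_valid(long long indirect)
{
    return indirect >= 0 && indirect % 4 == 0;
}
```

When the item count is not a multiple of the local size, the last work group
contains invocations with no item to process, so a shader dispatched this way
typically needs its own bounds check.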
 |  | 
 | New State | 
 |  | 
 |     None. | 
 |  | 
 | New Implementation Dependent State | 
 |  | 
 |     Add to Table 6.31, "Program Pipeline Object State" | 
 |  | 
 |     +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+ | 
 |     | Get Value                                          | Type      | Get Command             | Initial Value | Description                                                           | Sec.    | | 
 |     +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+ | 
 |     | COMPUTE_SHADER                                     | Z+        | GetProgramPipelineiv    | 0             | Name of current compute shader program object                         | 2.11.4  | | 
 |     +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+ | 
 |  | 
 |     Add to Table 6.32, "Program Object State" | 
 |  | 
 |     +----------------------------------------------------+-----------+--------------------------------+---------------+-----------------------------------------------------------------------+---------+ | 
 |     | Get Value                                          | Type      | Get Command                    | Initial Value | Description                                                           | Sec.    | | 
 |     +----------------------------------------------------+-----------+--------------------------------+---------------+-----------------------------------------------------------------------+---------+ | 
 |     | COMPUTE_WORK_GROUP_SIZE                            | 3 x Z+    | GetProgramiv                   | { 0, ... }    | Local work size of a linked compute program                           | 5.5     | | 
 |     | UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER         | B         | GetActiveUniformBlockiv        | FALSE         | True if uniform block is referenced by the compute stage              | 2.17.7  | | 
 |     | ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER | B         | GetActiveAtomicCounterBufferiv | FALSE         | AACB has a counter used by compute shaders                            | 2.17.7  | | 
 |     +----------------------------------------------------+-----------+--------------------------------+---------------+-----------------------------------------------------------------------+---------+ | 
 |  | 
 |     Insert new table named "Compute Dispatch State", after Table 6.46 "Hints": | 
 |  | 
 |     +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+ | 
 |     | Get Value                                          | Type      | Get Command             | Initial Value | Description                                                           | Sec.    | | 
 |     +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+ | 
 |     | DISPATCH_INDIRECT_BUFFER_BINDING                   | Z+        | GetIntegerv             | 0             | Indirect dispatch buffer binding                                      | 5.5     | | 
 |     +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+ | 
 |  | 
 |     Insert Table 6.50, "Implementation Dependent Compute Shader Limits", | 
 |     renumber subsequent tables. | 
 |  | 
 |     +-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+ | 
 |     | Get Value                               | Type      | Get Command   | Minimum Value       | Description                                                           | Sec.    | | 
 |     +-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+ | 
 |     | MAX_COMPUTE_WORK_GROUP_COUNT            | 3 x Z+    | GetIntegeri_v | 65535               | Maximum number of work groups that may be dispatched by a single      | 5.5     | | 
 |     |                                         |           |               |                     | dispatch command (per dimension)                                      |         | | 
 |     | MAX_COMPUTE_WORK_GROUP_SIZE             | 3 x Z+    | GetIntegeri_v | 1024 (x, y), 64 (z) | Maximum local size of a compute work group (per dimension)            | 5.5     | | 
 |     | MAX_COMPUTE_WORK_GROUP_INVOCATIONS      | Z+        | GetIntegerv   | 1024                | Maximum total compute shader invocations in a single local work group | 5.5     | | 
 |     | MAX_COMPUTE_UNIFORM_BLOCKS              | Z+        | GetIntegerv   | 12                  | Maximum number of uniform blocks per compute program                  | 2.11.7  | | 
 |     | MAX_COMPUTE_TEXTURE_IMAGE_UNITS         | Z+        | GetIntegerv   | 16                  | Maximum number of texture image units accessible by a compute shader  | 2.11.12 | | 
 |     | MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS      | Z+        | GetIntegerv   | 8                   | Number of atomic counter buffers accessed by a compute shader         | 2.11.17 | | 
 |     | MAX_COMPUTE_ATOMIC_COUNTERS             | Z+        | GetIntegerv   | 8                   | Number of atomic counters accessed by a compute shader                | 2.11.12 | | 
 |     | MAX_COMPUTE_SHARED_MEMORY_SIZE          | Z+        | GetIntegerv   | 32768               | Maximum total storage size of all variables declared as <shared> in   |         | | 
 |     |                                         |           |               |                     | all compute shaders linked into a single program object               |         | | 
 |     | MAX_COMPUTE_UNIFORM_COMPONENTS          | Z+        | GetIntegerv   | 512                 | Number of components for compute shader uniform variables             | 5.5.1   | | 
 |     | MAX_COMPUTE_IMAGE_UNIFORMS              | Z+        | GetIntegerv   | 8                   | Number of image variables in compute shaders                          | 2.11.12 | | 
 |     | MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS | Z+        | GetIntegerv   | *                   | Number of words for compute shader uniform variables in all uniform   | 5.5.1   | | 
 |     |                                         |           |               |                     | blocks, including the default                                         |         | | 
 |     +-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+ | 
 |  | 
 |     Modify Table 6.55, increasing the following minimum values: | 
 |  | 
 |            MAX_COMBINED_TEXTURE_IMAGE_UNITS     96 (6*16), was 80 | 
 |            MAX_UNIFORM_BUFFER_BINDINGS          72 (6*12), was 60 | 
 |  | 
 | Issues | 
 |  | 
 |     1) Should <shared> variables be usable only in compute shaders, or in other | 
 |        stages too? | 
 |  | 
 |        RESOLVED:  Support only in compute shaders.  While some hardware may be | 
 |        able to support shared variables in shader stages other than compute, | 
 |        it is difficult to clearly define what the semantics are as far as | 
 |        sharing. For example, what is the equivalent for a local work group for | 
 |        vertex shaders? | 
 |  | 
 |     2) Can we expose atomics on <shared> variables? | 
 |  | 
 |        RESOLVED:  Yes.  The existing atomics in OpenGL 4.2 (via image | 
 |        variables) don't map well to the <shared> declaration.  Instead, we've | 
 |        defined new atomic functions that take a variable as a first input. | 
 |        These functions are specified in the ARB_shader_storage_buffer_object | 
 |        extension and are incorporated into this extension via the interaction | 
 |        described above.  We could have also chosen to define operators +=, &=, | 
 |        etc. to be atomic when applied to <shared> variables, but shaders may | 
 |        want to use such variables in cases where atomic access (and the | 
 |        related overhead) is not required. | 
 |  | 
 |     3) Should the local size and dimensions of the work group be specified at | 
 |        compile time? What are the default local dimensions? | 
 |  | 
 |        RESOLVED: Dimension is always 3 and a local size declaration is | 
 |        compulsory at compile time. There is no default. The value used is | 
 |        queryable.  To use a 1- or 2-dimensional work group, the extra | 
 |        dimensions can be set to 1. | 
 |  | 
 |     4) Do we need the local_work_size parameter in dispatch if the local size | 
 |        may be specified at compile time in the shader? | 
 |  | 
 |        RESOLVED: The specification of the local work size is now mandatory in | 
 |        the shader source at compile time and the local_work_size may no longer | 
 |        be specified at dispatch time. | 
 |  | 
 |     5) How do multiple shaders attached to a single program object work? | 
 |  | 
 |        RESOLVED:  Just as with any other shader stage. Exactly one of the | 
 |        shaders must provide the 'main' entry point. All shaders attached to a | 
 |        program object effectively get compiled into a single, large program at | 
 |        link time.  The program is dispatched as one big entity. Über shader | 
 |        type functionality can be achieved through the use of subroutine | 
 |        uniforms, which also work exactly as for other shader stages. | 
 |  | 
 |     6) Should compute dispatch honor conditional rendering? | 
 |  | 
 |        RESOLVED: Yes, it does honor conditional rendering. | 
 |  | 
 |     7) Is it possible to pass compute programs to UseProgram, etc.? | 
 |  | 
 |        RESOLVED: Yes, compute programs can be made current via UseProgram and | 
 |        can be made current in a program pipeline object via UseProgramStages. | 
 |        Note that a compute program must be linked with PROGRAM_SEPARABLE set | 
 |        to TRUE to be passed to UseProgramStages, even though the compute | 
 |        pipeline has only a single shader stage. | 
 |  | 
 |        The active compute program that will be used by DispatchCompute will be | 
 |        determined in the same manner as the active program for any other | 
 |        program stage: | 
 |  | 
 |          * If there is a current program specified via UseProgram, that | 
 |            program is considered current for all stages, including compute. | 
 |  | 
 |          * Otherwise, if there is a current program pipeline object, the | 
 |            program current for the compute stage of the pipeline object is | 
 |            considered current for the compute stage. | 
 |  | 
 |          * If neither of the former apply, no program is current for the | 
 |            compute stage. | 
 |  | 
 |        The program that is current for the compute stage is considered to be | 
 |        active if and only if it has a compute shader executable.  For example, | 
 |        if a non-compute program is made current via UseProgram, it will also | 
 |        be considered "current" for the compute stage, but won't be considered | 
 |        active. | 
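
The selection rules above can be modeled in a few lines of C. This is a
sketch under stated assumptions: the structs and function names are
hypothetical stand-ins for GL context state and do not correspond to any real
API. It shows only the order of precedence (UseProgram first, then the bound
pipeline's compute stage) and the distinction between "current" and "active".

```c
#include <stddef.h>

/* Hypothetical stand-ins for GL objects and context state. */
struct program  { int has_compute_executable; };
struct pipeline { const struct program *compute_stage; };
struct context {
    const struct program  *current_program;  /* set via UseProgram          */
    const struct pipeline *bound_pipeline;   /* set via BindProgramPipeline */
};

/* The program that is current for the compute stage, or NULL if none. */
static const struct program *current_compute_program(const struct context *ctx)
{
    if (ctx->current_program != NULL)
        return ctx->current_program;          /* current for all stages */
    if (ctx->bound_pipeline != NULL)
        return ctx->bound_pipeline->compute_stage;
    return NULL;
}

/* The current program is active only if it has a compute executable. */
static const struct program *active_compute_program(const struct context *ctx)
{
    const struct program *p = current_compute_program(ctx);
    return (p != NULL && p->has_compute_executable) ? p : NULL;
}
```

In this model, making a graphics-only program current via UseProgram hides
the compute program in the bound pipeline: the graphics program is current
for the compute stage but not active, so a dispatch would have no active
compute program.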
 |  | 
 |        When using program pipeline objects, it's possible to switch between | 
 |        graphics and compute work without switching programs.  For example, in: | 
 |  | 
 |          glBindProgramPipeline(pipeline); | 
 |          glUseProgramStages(pipeline, GL_VERTEX_SHADER_BIT, programA); | 
 |          glUseProgramStages(pipeline, GL_FRAGMENT_SHADER_BIT, programB); | 
 |          glUseProgramStages(pipeline, GL_COMPUTE_SHADER_BIT, programC); | 
 |          glDrawArrays(GL_TRIANGLES, 0, 900); | 
 |          glDispatchCompute(5, 5, 5); | 
 |  | 
 |        the triangles will be processed by programA and programB, while the | 
 |        compute dispatch will be processed by programC.  Similarly, | 
 |  | 
 |          glUseProgramStages(pipeline, ~GL_COMPUTE_SHADER_BIT, programAB); | 
 |          glUseProgramStages(pipeline, GL_COMPUTE_SHADER_BIT, programC); | 
 |          glDrawArrays(GL_TRIANGLES, 0, 900); | 
 |          glDispatchCompute(5, 5, 5); | 
 |  | 
 |        will have the triangles processed by the multi-stage programAB. | 
 |  | 
 |     8) What happens if you try to dispatch with no active compute program? | 
 |  | 
 |        RESOLVED:  DispatchCompute generates an INVALID_OPERATION error if | 
 |        there is no active program for the compute shader stage. | 
 |  | 
 |     9) Should we increase minimums on certain replicated state bindings | 
 |        (texture image units, uniform buffer bindings) to reflect the addition | 
 |        of a sixth shader stage? | 
 |  | 
 |        RESOLVED:  Yes, for MAX_COMBINED_TEXTURE_IMAGE_UNITS and | 
 |        MAX_UNIFORM_BUFFER_BINDINGS.  These limits permit applications to | 
 |        statically partition the shared set of texture bindings into six | 
 |        separate sets, one per shader stage. | 
 |  | 
 |        The limit MAX_COMBINED_UNIFORM_BLOCKS is not increased, because it | 
 |        reflects the sum of the number of uniform blocks used in each stage of | 
 |        a single program.  Since no single program can have more than five | 
 |        stages, this limit doesn't need to be increased. | 
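
       The static partitioning mentioned above might look like the following
       sketch.  The Stage enumeration, its ordering, and both helper
       functions are hypothetical; a real application would pass in the value
       it queried for MAX_COMBINED_TEXTURE_IMAGE_UNITS.

```c
/* Hypothetical stage ordering for illustration; these are not GL enums. */
enum Stage {
    STAGE_VERTEX,
    STAGE_TESS_CONTROL,
    STAGE_TESS_EVALUATION,
    STAGE_GEOMETRY,
    STAGE_FRAGMENT,
    STAGE_COMPUTE,
    STAGE_COUNT              /* six stages once compute is added */
};

/* Number of texture image units each stage gets when the combined limit
 * is split evenly across all stages. */
static unsigned units_per_stage(unsigned combined_units)
{
    return combined_units / STAGE_COUNT;
}

/* First texture image unit of the range reserved for `stage`. */
static unsigned first_unit_for_stage(unsigned combined_units, enum Stage stage)
{
    return (unsigned)stage * units_per_stage(combined_units);
}
```

       Assuming, for example, a combined limit of 96, each stage would own 16
       units and the compute stage's range would start at unit 80.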
 |  | 
 |     10) How do the shader built-in variables relate to DirectCompute's | 
 |         built-in system values (SV_*)? | 
 |  | 
 |         OpenGL Compute             DirectCompute | 
 |         -------------------------------------------------- | 
 |         gl_NumWorkGroups           -- | 
 |         gl_WorkGroupSize           -- | 
 |         gl_WorkGroupID             SV_GroupID | 
 |         gl_LocalInvocationID       SV_GroupThreadID | 
 |         gl_GlobalInvocationID      SV_DispatchThreadID | 
 |         gl_LocalInvocationIndex    SV_GroupIndex | 
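
        The last two rows of the table are derived values.  The sketch below
        restates, in C, the relationships that the shading language
        description gives for gl_GlobalInvocationID and
        gl_LocalInvocationIndex.  The uvec3 struct and the function names are
        stand-ins introduced here for illustration only.

```c
/* Stand-in for the GLSL uvec3 type. */
typedef struct { unsigned x, y, z; } uvec3;

/* gl_GlobalInvocationID =
 *     gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID */
static uvec3 global_invocation_id(uvec3 work_group_id,
                                  uvec3 work_group_size,
                                  uvec3 local_invocation_id)
{
    uvec3 g = {
        work_group_id.x * work_group_size.x + local_invocation_id.x,
        work_group_id.y * work_group_size.y + local_invocation_id.y,
        work_group_id.z * work_group_size.z + local_invocation_id.z
    };
    return g;
}

/* gl_LocalInvocationIndex =
 *     gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y +
 *     gl_LocalInvocationID.y * gl_WorkGroupSize.x +
 *     gl_LocalInvocationID.x */
static unsigned local_invocation_index(uvec3 local_invocation_id,
                                       uvec3 work_group_size)
{
    return local_invocation_id.z * work_group_size.x * work_group_size.y
         + local_invocation_id.y * work_group_size.x
         + local_invocation_id.x;
}
```

        For a work group size of (8, 4, 2), work group (2, 1, 0), and local
        invocation (3, 2, 1), these give a global invocation ID of (19, 6, 1)
        and a local invocation index of 51.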
 |  | 
 |     11) How does "program validation" (checking the active programs against | 
 |         the current state) apply to DispatchCompute? | 
 |  | 
 |       RESOLVED:  The same program validation logic will be applied to both | 
 |       graphics primitives (e.g., DrawArrays) and compute dispatches. | 
 |       Conditions that will cause validation errors for graphics primitives | 
 |       will also cause validation errors for compute dispatch, even if the | 
 |       conditions wouldn't otherwise affect compute, for example: | 
 |  | 
 |         * Mis-configured program pipeline objects (e.g., inserting a geometry | 
 |           program A between the linked vertex and fragment shaders of | 
 |           program B). | 
 |  | 
 |         * A graphics program has a vertex shader that uses a 2D texture from | 
 |           texture image unit 0 and a fragment shader that uses a 3D texture | 
 |           from texture image unit 0. | 
 |  | 
 |       Similarly, validation errors specific to the compute shader executable | 
 |       (e.g., using different targets on a single texture image unit in a | 
 |       compute program) will generate validation errors for graphics Draw* | 
 |       calls. | 
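
      One of the checks named above, conflicting texture targets on a single
      texture image unit, can be sketched as a single test over the samplers
      of every active stage.  The types and the function below are
      hypothetical and only model the idea that one shared result gates both
      Draw* and DispatchCompute; they are not how any implementation is
      required to work.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical texture targets and sampler record, for illustration only. */
typedef enum { TARGET_2D, TARGET_3D, TARGET_CUBE_MAP } Target;

typedef struct {
    unsigned unit;    /* texture image unit the sampler refers to */
    Target   target;  /* target implied by the sampler's type */
} SamplerUse;

/* Returns true if no texture image unit is referenced with two different
 * targets.  `uses` holds the samplers of all active stages, graphics and
 * compute alike, so the same answer applies to draws and dispatches. */
static bool samplers_validate(const SamplerUse *uses, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        for (size_t j = i + 1; j < count; ++j) {
            if (uses[i].unit == uses[j].unit &&
                uses[i].target != uses[j].target) {
                return false;
            }
        }
    }
    return true;
}
```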
 |  | 
 |       We chose to specify this behavior for several reasons.  First, using the | 
 |       same logic in both places ensures a single result for ValidateProgram | 
 |       and ValidateProgramPipeline (a single VALIDATE_STATUS value wouldn't be | 
 |       good enough if the result could be different for compute and graphics). | 
 |       Additionally, a single test allows implementations to set up state and | 
 |       perform validation tests for compute and graphics operations at the same | 
 |       time, without requiring additional irregular graphics- or | 
 |       compute-specific logic. | 
 |  | 
 |     12) We specify an INVALID_OPERATION error for DispatchCompute when there | 
 |         is no active program on the compute stage.  Should we specify similar | 
 |         errors for Draw* calls if the current program specified by UseProgram | 
 |         is a compute program? | 
 |  | 
 |       RESOLVED:  Not in the current spec.  If a compute program is made | 
 |       current with UseProgram, there will be no active program for either the | 
 |       vertex or fragment stage.  In this case, the results of vertex and | 
 |       fragment processing are undefined, but no error is generated.  This | 
 |       behavior is already specified in unextended OpenGL 4.2. | 
 |  | 
 |       We don't generate errors in this case for several reasons: | 
 |  | 
 |         * For the compatibility profile, fixed-function vertex and fragment | 
 |           processing is available, and INVALID_OPERATION wouldn't make sense | 
 |           there. | 
 |  | 
 |         * Even in the core profile, there are cases where no active fragment | 
 |           shader is needed (e.g., primitives with RASTERIZER_DISCARD enabled). | 
 |  | 
 |       While there is no case where having only a compute program makes sense, | 
 |       at least in the core profile, we chose to keep the same undefined | 
 |       behavior that's already in place. | 
 |  | 
 |     13) Should we provide any additional support extending the memoryBarrier() | 
 |         GLSL built-in function provided by ARB_shader_image_load_store and | 
 |         GLSL 4.20? | 
 |  | 
 |       RESOLVED:  Yes.  The memoryBarrier() function provided by GLSL 4.20 | 
 |       requires (a) synchronizing all memory transactions that might be visible | 
 |       to other shader invocations and (b) ordering memory transactions so that | 
 |       all other shader invocations never see stores issued after the barrier | 
 |       before seeing stores issued before the barrier.  Hardware | 
 |       implementations of GLSL 4.20 may have a high degree of parallelism, | 
 |       where the memory subsystem servicing shader loads and stores may have | 
 |       multiple independent sub-units, and where the shader invocations | 
 |       themselves may be executed in parallel on many shader cores.  The | 
 |       memoryBarrier() command may be fairly heavyweight, requiring | 
 |       synchronization with all memory sub-units and shader cores. | 
 |  | 
 |       We provide new functions in two different directions that might serve as | 
 |       lighter weight alternatives to memoryBarrier().  In particular, we | 
 |       provide four new functions | 
 |  | 
 |         void memoryBarrierAtomicCounter(); | 
 |         void memoryBarrierBuffer(); | 
 |         void memoryBarrierImage(); | 
 |         void memoryBarrierShared(); | 
 |  | 
 |       that order transactions of only a specific memory type and might require | 
 |       synchronization with fewer sub-units of the memory subsystem, and a new | 
 |       function | 
 |  | 
 |         void groupMemoryBarrier(); | 
 |  | 
 |       that orders transactions only as viewed by other threads in the same work | 
 |       group, which might not require synchronization with other shader cores. | 
 |       Since shared memory is only accessible to threads within a single work | 
 |       group, memoryBarrierShared() also only requires synchronization with | 
 |       other threads in the same work group. | 
 |  | 
 | Revision History | 
 |  | 
 |     Rev.    Date    Author    Changes | 
 |     ----  --------  --------  ----------------------------------------- | 
 |     27    07/24/14  Jon Leech Change value of GLSL limit | 
 |                               gl_MaxComputeUniformComponents to 512 for | 
 |                               consistency with the API (Bug 12370). | 
 |     26    01/30/14  Jon Leech Add table 6.31 COMPUTE_SHADER entry for | 
 |                               program pipeline objects (Bug 11539). | 
 |     25    10/23/12  pbrown    Remove the restriction forbidding the use of  | 
 |                               barrier() inside potentially divergent flow  | 
 |                               control.  Instead, we will allow barrier() to | 
 |                               be executed anywhere, but specify undefined  | 
 |                               results (including hangs or program termination)  | 
 |                               if the flow control is divergent (bug 9367). | 
 |     24    07/01/12  Jon Leech Fix typo (bug 8984). | 
 |     23    06/28/12  johnk     Remove two other references to "thread", add | 
 |                               "Only available in compute shaders" to the table | 
 |                               for memoryBarrierShared() and groupMemoryBarrier(), | 
 |                               fixed a typo. | 
 |     22    06/22/12  pbrown    Add a new built-in memoryBarrierBuffer() as an | 
 |                               interaction with | 
 |                               ARB_shader_storage_buffer_object.  Add a new | 
 |                               built-in groupMemoryBarrier() that orders | 
 |                               memory transactions only as observed by other | 
 |                               shader invocations in the same work group. | 
 |                               Enhance the description of the GLSL memory | 
 |                               barrier functions.  Add issue 13 about the new | 
 |                               memory barrier functions added in this extension | 
 |                               (bug 9199).  Mark issues 11 and 12 as resolved. | 
 |                               Add NV_vertex_buffer_unified_memory interaction | 
 |                               allowing DispatchComputeIndirect to read its | 
 |                               arguments from any resident buffer object | 
 |                               instead of the single bound indirect dispatch | 
 |                               buffer. | 
 |     21    06/21/12  gsellers  Clarify that there are no built-in inputs or | 
 |                               outputs in compute shaders (bug 9200). | 
 |     20    06/21/12  gsellers  Throw INVALID_OPERATION if querying | 
 |                               COMPUTE_WORK_GROUP_SIZE from unlinked program or | 
 |                               program with no compute shader (bug 9117). | 
 |     19    06/18/12  pbrown    DispatchComputeIndirect throws INVALID_VALUE | 
 |                               if <indirect> is negative or misaligned (bug | 
 |                               9181). | 
 |     18    06/17/12  pbrown    Clarify that compute-only programs can be used | 
 |                               by both UseProgram and UseProgramStages, and add | 
 |                               a COMPUTE_SHADER_BIT for UseProgramStages (bug | 
 |                               9155).  Specify that validation errors checking | 
 |                               programs against each other and the GL state | 
 |                               apply equally to graphics primitives (Draw*) and | 
 |                               compute dispatches.  Update issue 7; add new | 
 |                               issues 11 and 12.  Clarify that compute shader | 
 |                               invocations in a workgroup are run "potentially | 
 |                               in parallel", but not "in lockstep" (bug 9151). | 
 |                               Other minor wording improvements. | 
 |     17    06/15/12  johnk     Don't allow location layout qualifiers for | 
 |                               compute shader inputs. | 
 |     16    06/15/12  johnk     In the intro material, allow work groups to  | 
 |                               only potentially execute in parallel, and use  | 
 |                               control barriers to synchronize.  Other minor | 
 |                               fixes. | 
 |     15    06/15/12  dgkoch    Added Additions to Ch.2 of Shading Language. | 
 |                               Renamed shader built-in variables, explained  | 
 |                               them better, made them uvec3 instead of int[3]. | 
 |                               Added derived shading language variables. | 
 |                               Renamed and changed built-in constants for | 
 |                               consistency with the variables. Removed | 
 |                               gl_MaxComputeWorkDimensions since it is no | 
 |                               longer necessary. Renamed API constants to  | 
 |                               be consistent with shading language terminology. | 
 |                               Remove a few rogue references to variable | 
 |                               number of dispatch arguments. Added Issue 10. | 
 |                               (bugs 9151, 9167) | 
 |     14    06/14/12  pbrown    Modify DispatchComputeIndirect to accept an | 
 |                               "intptr"-typed offset instead of a "void *", | 
 |                               since it doesn't accept pointers to client memory. | 
 |                               Modify DispatchComputeIndirect to use a new | 
 |                               buffer binding (DISPATCH_INDIRECT_BUFFER) | 
 |                               instead of sharing the binding used by | 
 |                               Draw*Indirect.  Add missing entries in the "New | 
 |                               Tokens" section and assign values.  Update | 
 |                               documentation of COMMAND_BARRIER_BIT to reflect | 
 |                               the new dispatch indirect binding.  Document | 
 |                               DispatchComputeIndirect errors for offsets that | 
 |                               are negative, misaligned, or run off the end of | 
 |                               the bound buffer.  Increase minimums for | 
 |                               combined texture image units and uniform buffer | 
 |                               bindings to reflect the new stage.  Update | 
 |                               various issues, add new issue 9 (bug 9130). | 
 |     13    06/14/12  Jon Leech Copy description of MAX_COMPUTE_SHARED_MEMORY_SIZE | 
 |                               into API spec from GLSL spec (bug 9069). | 
 |     12    05/14/12  pbrown    Add interaction with ARB_shader_storage_buffer_ | 
 |                               object. The built-in functions provided there  | 
 |                               for atomic memory operations on buffer variables | 
 |                               are also supported for the shared variables | 
 |                               provided here.  The functions themselves are | 
 |                               documented fully in the other specification. | 
 |     11    05/14/12  johnk     Keep the previous logical contents of the last  | 
 |                               paragraph of the memory shader control functions. | 
 |     10    04/26/12  gsellers  Count max compute shared variable size in bytes. | 
 |                               Make shared variables implicitly coherent. | 
 |                               Add MAX_COMPUTE_UNIFORM_COMPONENTS. | 
 |                               Clean up MAX_COMPUTE_IMAGE_UNIFORMS. | 
 |      9    04/25/12  gsellers  Add UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER | 
 |                               and ATOMIC_COUNTER_BUFFER_REFERENCED_BY_- | 
 |                               COMPUTE_SHADER.  Remove <program> from dispatch | 
 |                               APIs.  Add memoryBarrier{Image,Shared, | 
 |                               AtomicCounter}(). | 
 |      8    04/05/12  gsellers  Remove ARB suffixes. | 
 |      7    02/02/12  gsellers  Require OpenGL 4.2. | 
 |                               Add issue 8. | 
 |                               Up various minimums. | 
 |                               Remove variable dimensionality. | 
 |      6    01/24/12  gsellers  Require OpenGL 3.0. | 
 |                               Incorporate feedback from bmerry. | 
 |                               Add compute shader constants to sec. 7.7. | 
 |                               Add modifications to sec. 8.15 of the GLSL spec. | 
 |                               Add issue 7. | 
 |      5    01/20/12  gsellers  Make compute dispatch honor conditional | 
 |                               rendering.  Add indirect dispatch. | 
 |                               Change 'global work size' to 'num work groups', | 
 |                               make global size in multiples of local work size. | 
 |      4    01/10/12  gsellers  Fix typos and other small corrections. | 
 |                               Make specification of local work size at compile | 
 |                               time compulsory. | 
 |                               Add COMPUTE_WORK_DIMENSION_ARB and | 
 |                               COMPUTE_LOCAL_WORK_SIZE_ARB queries. | 
 |                               Add issue (5), resolve issues (3) and (4). | 
 |      3    01/09/12  gsellers  Change from AMD to ARB. | 
 |                               Update to be relative to OpenGL 4.2 (+GLSL 4.20). | 
 |                               Add <shared> variables. | 
 |                               Add issues (1) - (4). | 
 |                               Add link failure for programs that contain | 
 |                               compute and non-compute shaders. | 
 |      2    06/10/11  gsellers  Add error behavior. | 
 |                               Shading language changes. | 
 |                               Add global_offset parameter. | 
 |                               Add implementation dependent limits. | 
 |      1    09/24/10  gsellers  Initial revision |