 Name

     NV_gpu_multicast

 Name Strings

     GL_NV_gpu_multicast

 Contact

     Joshua Schnarr, NVIDIA Corporation (jschnarr 'at' nvidia.com)
     Ingo Esser, NVIDIA Corporation (iesser 'at' nvidia.com)

 Contributors

     Christoph Kubisch, NVIDIA
     Mark Kilgard, NVIDIA
     Robert Menzel, NVIDIA
     Kevin Lefebvre, NVIDIA
     Ralf Biermann, NVIDIA

 Status

     Shipping in NVIDIA release 370.XX drivers and up.

 Version

     Last Modified Date:         January 3, 2019
     Revision:                   6

 Number

     OpenGL Extension #494

 Dependencies

     This extension is written against the OpenGL 4.5 specification
     (Compatibility Profile), dated February 2, 2015.

     This extension requires ARB_copy_image.

     This extension interacts with ARB_sample_locations.

     This extension interacts with ARB_sparse_buffer.

     This extension requires EXT_direct_state_access.

 Overview

     This extension enables novel multi-GPU rendering techniques by providing application control
     over a group of linked GPUs with identical hardware configuration.

     Multi-GPU rendering techniques fall into two categories: implicit and explicit.  Existing
     explicit approaches like WGL_NV_gpu_affinity have two main drawbacks: CPU overhead and
     application complexity.  An application must manage one context per GPU and multi-pump the API
     stream.  Implicit multi-GPU rendering techniques avoid these issues by broadcasting rendering
     from one context to multiple GPUs.  Common implicit approaches include alternate-frame
     rendering (AFR), split-frame rendering (SFR) and multi-GPU anti-aliasing.  They each have
     drawbacks.  AFR scales nicely but interacts poorly with inter-frame dependencies.  SFR can
     improve latency but has challenges with offscreen rendering and scaling of vertex processing.
     With multi-GPU anti-aliasing, each GPU renders the same content with alternate sample
     positions and the driver blends the result to improve quality.  This also has issues with
     offscreen rendering and can conflict with other anti-aliasing techniques.

     These issues with implicit multi-GPU rendering all have the same root cause: the driver lacks
     adequate knowledge to accelerate every application.  To resolve this, NV_gpu_multicast
     provides fine-grained, explicit application control over multiple GPUs with a single context.

     Key points:

     - One context controls multiple GPUs.  Every GPU in the linked group can access every object.

     - Rendering is broadcast.  Each draw is repeated across all GPUs in the linked group.

     - Each GPU gets its own instance of all framebuffers, allowing individualized output for each
       GPU.  Input data can be customized for each GPU using buffers created with the storage flag
       PER_GPU_STORAGE_BIT_NV and a new API, MulticastBufferSubDataNV.

     - New interfaces provide mechanisms to transfer textures and buffers from one GPU to another.

 New Procedures and Functions

     void RenderGpuMaskNV(bitfield mask);

     void MulticastBufferSubDataNV(
         bitfield gpuMask, uint buffer,
         intptr offset, sizeiptr size,
         const void *data);

     void MulticastCopyBufferSubDataNV(
         uint readGpu, bitfield writeGpuMask,
         uint readBuffer, uint writeBuffer,
         intptr readOffset, intptr writeOffset, sizeiptr size);

     void MulticastCopyImageSubDataNV(
         uint srcGpu, bitfield dstGpuMask,
         uint srcName, enum srcTarget,
         int srcLevel,
         int srcX, int srcY, int srcZ,
         uint dstName, enum dstTarget,
         int dstLevel,
         int dstX, int dstY, int dstZ,
         sizei srcWidth, sizei srcHeight, sizei srcDepth);

     void MulticastBlitFramebufferNV(uint srcGpu, uint dstGpu,
                                     int srcX0, int srcY0, int srcX1, int srcY1,
                                     int dstX0, int dstY0, int dstX1, int dstY1,
                                     bitfield mask, enum filter);

     void MulticastFramebufferSampleLocationsfvNV(uint gpu, uint framebuffer, uint start,
                                                  sizei count, const float *v);

     void MulticastBarrierNV(void);

     void MulticastWaitSyncNV(uint signalGpu, bitfield waitGpuMask);

     void MulticastGetQueryObjectivNV(uint gpu, uint id, enum pname, int *params);
     void MulticastGetQueryObjectuivNV(uint gpu, uint id, enum pname, uint *params);
     void MulticastGetQueryObjecti64vNV(uint gpu, uint id, enum pname, int64 *params);
     void MulticastGetQueryObjectui64vNV(uint gpu, uint id, enum pname, uint64 *params);

 New Tokens

     Accepted in the <flags> parameter of BufferStorage and NamedBufferStorageEXT:

         PER_GPU_STORAGE_BIT_NV                     0x0800

     Accepted by the <pname> parameter of GetBooleanv, GetIntegerv, GetInteger64v, GetFloatv, and
     GetDoublev:

         MULTICAST_GPUS_NV                          0x92BA
         RENDER_GPU_MASK_NV                         0x9558

     Accepted as a value for <pname> for the TexParameter{if}, TexParameter{if}v,
     TextureParameter{if}, TextureParameter{if}v, MultiTexParameter{if}EXT and
     MultiTexParameter{if}vEXT commands and for the <value> parameter of GetTexParameter{if}v,
     GetTextureParameter{if}vEXT and GetMultiTexParameter{if}vEXT:

         PER_GPU_STORAGE_NV                          0x9548

     Accepted by the <pname> parameter of GetMultisamplefv:

         MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV   0x9549

 Additions to the OpenGL 4.5 Specification (Compatibility Profile)

     (Add a new chapter after chapter 19 "Compute Shaders")

     20 Multicast Rendering

     Some implementations support multiple linked GPUs driven by a single context.  Often the
     distribution of work to individual GPUs is managed by the GL without client knowledge.  This
     chapter specifies commands for explicitly distributing work across GPUs in a linked group.
     Rendering can be enabled or disabled for specific GPUs.  Draw commands are multicast, or
     repeated across all enabled GPUs.  Objects are shared by all GPUs; however, each GPU has its
     own instance (copy) of many resources, including framebuffers.  When each GPU has its own
     instance of a resource, it is considered to have per-GPU storage.  When all GPUs share a
     single instance of a resource, this is considered GPU-shared storage.

     The mechanism for linking GPUs is implementation specific, as is the mechanism for enabling
     multicast rendering support (if necessary).  The number of GPUs usable for multicast rendering
     by a context can be queried by calling GetIntegerv with the symbolic constant
     MULTICAST_GPUS_NV.  This number is constant for the lifetime of a context.  Individual GPUs
     are identified using zero-based indices in the range [0, n-1], where n is the number of
     multicast GPUs.  GPUs are also identified by bitmasks of the form 2^i, where i is the GPU
     index.  A set of GPUs is specified by the union of masks for each GPU in the set.
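
     As an informal illustration (not part of the specification language), the index-to-mask
     mapping described above can be sketched in C.  The helper names gpu_bit and all_gpus_mask are
     hypothetical, <n> stands for the value queried with MULTICAST_GPUS_NV, and the sketch assumes
     that fewer than 32 GPUs are linked.

```c
#include <stdint.h>

/* Hypothetical helper: mask of the form 2^i for GPU index i.
   Assumes gpu_index < 32. */
uint32_t gpu_bit(unsigned gpu_index)
{
    return (uint32_t)1 << gpu_index;
}

/* Hypothetical helper: union of the masks of all n multicast GPUs,
   which equals (2^n)-1.  Assumes 1 <= n < 32. */
uint32_t all_gpus_mask(unsigned n)
{
    return ((uint32_t)1 << n) - 1u;
}
```

     A set containing GPU 0 and GPU 2 would then be written as gpu_bit(0) | gpu_bit(2).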

     20.1 Controlling Individual GPUs

     Render commands are restricted to a specific set of GPUs with

       void RenderGpuMaskNV(bitfield mask);

     The following errors apply to RenderGpuMaskNV:

     INVALID_OPERATION is generated
     * if <mask> is zero,
     * if <mask> is not zero and <mask> is greater than or equal to 2^n, where n is equal
     to MULTICAST_GPUS_NV,
     * if issued between BeginConditionalRender and the corresponding EndConditionalRender.

     If the command does not generate an error, RENDER_GPU_MASK_NV is set to <mask>.  The default
     value of RENDER_GPU_MASK_NV is (2^n)-1.
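
     The two mask-related error conditions above can be expressed as a small client-side check.
     This is only a sketch: render_gpu_mask_is_valid is a hypothetical name, <n> stands for the
     value of MULTICAST_GPUS_NV, n is assumed to be less than 32, and the conditional rendering
     restriction is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check mirroring the mask rules for RenderGpuMaskNV:
   the mask must be non-zero and smaller than 2^n.  Assumes 1 <= n < 32. */
bool render_gpu_mask_is_valid(uint32_t mask, unsigned n)
{
    uint32_t limit = (uint32_t)1 << n;   /* 2^n */
    return mask != 0 && mask < limit;
}
```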

     Render commands are skipped for a GPU that is not present in RENDER_GPU_MASK_NV.  Examples of
     render commands are draw calls, clears, compute dispatches, and copies or pixel path
     operations that write to a framebuffer (e.g. DrawPixels, BlitFramebuffer).  For a full list of
     render commands see section 2.4 (page 26).  MulticastBlitFramebufferNV is an exception to this
     policy: while it is a rendering command, it has its own source and destination GPU
     parameters.  Note that buffer and texture updates are not affected by RENDER_GPU_MASK_NV.

     20.2 Multi-GPU Buffer Storage

     Like other resources, buffer objects can have two types of storage, per-GPU storage or
     GPU-shared storage.  Per-GPU storage can be explicitly requested using the
     PER_GPU_STORAGE_BIT_NV flag with BufferStorage/NamedBufferStorageEXT.  If this flag is not
     set, the type of storage used is undefined.  The implementation may use either type and
     transition between them at any time.  Client reads of a buffer with per-GPU storage may source
     from any GPU.

     The following rules apply to buffer objects with per-GPU storage:

       When mapped, updates apply to all GPUs (only WRITE_ONLY access is supported).
       When bound to UNIFORM_BUFFER, client uniform updates apply to all GPUs.
       When used as the write buffer for CopyBufferSubData or CopyNamedBufferSubData, writes apply
       to all GPUs.

     The following commands affect storage on all GPUs, even if the buffer object has per-GPU
     storage:

       BufferSubData, NamedBufferSubData, ClearBufferSubData, and ClearNamedBufferData

     An INVALID_VALUE error is generated if BufferStorage/NamedBufferStorageEXT is called with
     PER_GPU_STORAGE_BIT_NV set together with MAP_READ_BIT or SPARSE_STORAGE_BIT_ARB.
     An INVALID_OPERATION error is generated if a buffer with PER_GPU_STORAGE_BIT_NV is bound to
     UNIFORM_BUFFER and GetUniformfv, GetUniformiv, GetUniformuiv or GetUniformdv is called.

     To modify buffer object data on one or more GPUs, the client may use the command

       void MulticastBufferSubDataNV(
           bitfield gpuMask, uint buffer,
           intptr offset, sizeiptr size,
           const void *data);

     This command operates similarly to NamedBufferSubData, except that it updates the per-GPU
     buffer data on the set of GPUs defined by <gpuMask>.  If <buffer> has GPU-shared storage,
     <gpuMask> is ignored and the shared instance of the buffer is updated.

     An INVALID_VALUE error is generated if <gpuMask> is zero or is greater than or equal to 2^n,
     where n is equal to MULTICAST_GPUS_NV.
     An INVALID_OPERATION error is generated if <buffer> is not the name of an existing buffer
     object.
     An INVALID_VALUE error is generated if <offset> or <size> is negative, or if <offset> + <size>
     is greater than the value of BUFFER_SIZE for the buffer object.
     An INVALID_OPERATION error is generated if any part of the specified buffer range is mapped
     with MapBufferRange or MapBuffer (see section 6.3), unless it was mapped with
     MAP_PERSISTENT_BIT set in the MapBufferRange access flags.
     An INVALID_OPERATION error is generated if the BUFFER_IMMUTABLE_STORAGE flag of the buffer
     object is TRUE and the value of BUFFER_STORAGE_FLAGS for the buffer does not have the
     DYNAMIC_STORAGE_BIT set.
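
     For illustration only, the range condition on <offset> and <size> above can be written as a
     client-side predicate.  The name buffer_sub_range_is_valid is hypothetical, buffer_size stands
     for the value of BUFFER_SIZE, and 64-bit signed integers are used here in place of the GL
     intptr and sizeiptr types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate: true when neither value is negative and
   offset + size does not exceed buffer_size.  buffer_size is assumed to be
   non-negative. */
bool buffer_sub_range_is_valid(int64_t offset, int64_t size, int64_t buffer_size)
{
    if (offset < 0 || size < 0)
        return false;
    /* Written as a subtraction so that offset + size cannot overflow. */
    return offset <= buffer_size && size <= buffer_size - offset;
}
```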

     To copy between buffers created with PER_GPU_STORAGE_BIT_NV, the client may use the command

       void MulticastCopyBufferSubDataNV(
         uint readGpu, bitfield writeGpuMask,
         uint readBuffer, uint writeBuffer,
         intptr readOffset, intptr writeOffset, sizeiptr size);

     This command operates similarly to CopyNamedBufferSubData, with the exception that it operates
     on per-GPU instances of the buffer object.  The read GPU index is specified by <readGpu> and
     the set of write GPUs is specified by the mask in <writeGpuMask>.  The following errors apply
     to MulticastCopyBufferSubDataNV:

     An INVALID_OPERATION error is generated if <readBuffer> or <writeBuffer> is not the name of an
     existing buffer object.
     An INVALID_VALUE error is generated if any of <readOffset>, <writeOffset>, or <size> are
     negative, if <readOffset> + <size> exceeds the size of the source buffer object, or if
     <writeOffset> + <size> exceeds the size of the destination buffer object.
     An INVALID_OPERATION error is generated if either the source or destination buffer object is
     mapped, unless it was mapped with MAP_PERSISTENT_BIT set in the Map*BufferRange access
     flags.
     An INVALID_OPERATION error is generated if the value of BUFFER_STORAGE_FLAGS for <readBuffer>
     or <writeBuffer> does not have PER_GPU_STORAGE_BIT_NV set.
     An INVALID_VALUE error is generated if <readGpu> is greater than or equal to
     MULTICAST_GPUS_NV.
     An INVALID_OPERATION error is generated if <writeGpuMask> is zero.  An INVALID_VALUE error is
     generated if <writeGpuMask> is not zero and <writeGpuMask> is greater than or equal to 2^n,
     where n is equal to MULTICAST_GPUS_NV.
     An INVALID_VALUE error is generated if the source and destination are the same buffer object,
     <readGpu> is present in <writeGpuMask>, and the ranges [<readOffset>; <readOffset> + <size>)
     and [<writeOffset>; <writeOffset> + <size>) overlap.
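
     The final error condition combines three tests.  A minimal sketch of that combination
     follows; multicast_copy_overlaps is a hypothetical name, offsets and size are assumed to have
     passed the earlier range checks, and <readGpu> is assumed to be less than 32.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate: true when the copy would be rejected because
   source and destination are the same buffer object, the read GPU is also a
   write GPU, and the half-open ranges [read_offset, read_offset + size) and
   [write_offset, write_offset + size) overlap. */
bool multicast_copy_overlaps(unsigned read_gpu, uint32_t write_gpu_mask,
                             unsigned read_buffer, unsigned write_buffer,
                             int64_t read_offset, int64_t write_offset,
                             int64_t size)
{
    bool same_buffer = (read_buffer == write_buffer);
    bool read_gpu_written = ((write_gpu_mask >> read_gpu) & 1u) != 0;
    bool ranges_overlap = read_offset < write_offset + size &&
                          write_offset < read_offset + size;
    return same_buffer && read_gpu_written && ranges_overlap;
}
```

     Under this reading, a copy within one buffer from GPU 0 to GPU 1 is not rejected even when
     the ranges coincide, because the read and write instances are distinct.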

     20.3 Multi-GPU Framebuffers and Textures

     All buffers in the default framebuffer as well as renderbuffers receive per-GPU storage.  By
     default, storage for textures is undefined: it may be per-GPU or GPU-shared and can transition
     between the types at any time.  Per-GPU storage can be specified via
     [Multi]Tex[ture]Parameter{if}[v] with PER_GPU_STORAGE_NV for the <pname> argument and TRUE for
     the value.  For this storage parameter to take effect, it must be specified after the texture
     object is created and before the texture contents are defined by TexImage*, TexStorage* or
     TextureStorage*.

     20.3.1 Copying Image Data Between GPUs

     To copy texel data between GPUs, the client may use the command:

     void MulticastCopyImageSubDataNV(
         uint srcGpu, bitfield dstGpuMask,
         uint srcName, enum srcTarget,
         int srcLevel,
         int srcX, int srcY, int srcZ,
         uint dstName, enum dstTarget,
         int dstLevel,
         int dstX, int dstY, int dstZ,
         sizei srcWidth, sizei srcHeight, sizei srcDepth);

     This command operates equivalently to CopyImageSubData, except that it takes a source GPU and
     a destination GPU set defined by <srcGpu> and <dstGpuMask> (respectively).  Texel data is
     copied from the source GPU to all destination GPUs.  The following errors apply to
     MulticastCopyImageSubDataNV:

     INVALID_ENUM is generated
      * if either <srcTarget> or <dstTarget>
       - is not RENDERBUFFER or a valid non-proxy texture target
       - is TEXTURE_BUFFER, or
       - is one of the cubemap face selectors described in table 3.17,
      * if the target does not match the type of the object.

     INVALID_OPERATION is generated
      * if either object is a texture and the texture is not complete,
      * if the source and destination formats are not compatible,
      * if the source and destination number of samples do not match,
      * if one image is compressed and the other is uncompressed and the
        block size of the compressed image is not equal to the texel size
        of the uncompressed image.

     INVALID_VALUE is generated
      * if <srcGpu> is greater than or equal to MULTICAST_GPUS_NV,
      * if <dstGpuMask> is zero,
      * if <dstGpuMask> is greater than or equal to 2^n, where n is equal to
        MULTICAST_GPUS_NV,
      * if either <srcName> or <dstName> does not correspond to a valid
        renderbuffer or texture object according to the corresponding
        target parameter, or
      * if the specified level is not a valid level for the image, or
      * if the dimensions of either subregion exceed the boundaries
        of the corresponding image object, or
      * if the image format is compressed and the dimensions of the
        subregion fail to meet the alignment constraints of the format.

     To copy pixel values from one GPU to another use the following command:

     void MulticastBlitFramebufferNV(uint srcGpu, uint dstGpu,
                                     int srcX0, int srcY0, int srcX1, int srcY1,
                                     int dstX0, int dstY0, int dstX1, int dstY1,
                                     bitfield mask, enum filter);

     This command operates equivalently to BlitNamedFramebuffer except that it takes a source GPU
     and a destination GPU defined by <srcGpu> and <dstGpu> (respectively).  Pixel values are
     copied from the read framebuffer on the source GPU to the draw framebuffer on the destination
     GPU.

     In addition to the errors generated by BlitNamedFramebuffer (see listing starting on page
     634), calling MulticastBlitFramebufferNV will generate INVALID_VALUE if <srcGpu> or <dstGpu>
     is greater than or equal to MULTICAST_GPUS_NV.

     20.3.2 Per-GPU Sample Locations

     Programmable sample locations can be customized for each GPU and framebuffer using the
     following command:

     void MulticastFramebufferSampleLocationsfvNV(uint gpu, uint framebuffer, uint start,
                                                  sizei count, const float *v);

     An INVALID_OPERATION error is generated by MulticastFramebufferSampleLocationsfvNV if
     <framebuffer> is not the name of an existing framebuffer object.

     INVALID_VALUE is generated if the sum of <start> and <count> is greater than
     PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB.

     An INVALID_VALUE error is generated if <gpu> is greater than or equal to MULTICAST_GPUS_NV.

     This is equivalent to FramebufferSampleLocationsfvARB except that it sets
     MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV at the appropriate offset for the specified GPU.
     Just as with FramebufferSampleLocationsfvARB, FRAMEBUFFER_PROGRAMMABLE_SAMPLE_LOCATIONS_ARB
     must be enabled for these sample locations to take effect.  FramebufferSampleLocationsfvARB
     and NamedFramebufferSampleLocationsfvARB also set MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV
     but for the specified sample across all multicast GPUs.  If <gpu> is 0,
     MulticastFramebufferSampleLocationsfvNV updates PROGRAMMABLE_SAMPLE_LOCATION_ARB in addition
     to MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV.

     The programmed sample locations can be retrieved using GetMultisamplefv with <pname> set to
     MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV and indices calculated as follows:

         index_x = gpu * PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB + 2 * sample_i;
         index_y = gpu * PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB + 2 * sample_i + 1;
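
     The index formulas above are transcribed below as C helpers, purely as an illustration.  The
     function names are hypothetical, and table_size stands for the value of
     PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB, which an application would query from the GL
     rather than hard-code.

```c
/* Hypothetical helper: index of the x coordinate of sample sample_i on the
   given GPU, following the index_x formula as written. */
unsigned multicast_sample_index_x(unsigned gpu, unsigned table_size,
                                  unsigned sample_i)
{
    return gpu * table_size + 2u * sample_i;
}

/* Hypothetical helper: index of the matching y coordinate, which directly
   follows the x coordinate. */
unsigned multicast_sample_index_y(unsigned gpu, unsigned table_size,
                                  unsigned sample_i)
{
    return gpu * table_size + 2u * sample_i + 1u;
}
```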

     20.4 Interactions with Other Copy Functions

     Many existing commands can be used to copy between resources with GPU-shared, per-GPU or
     undefined storage.  For example: ReadPixels, GetBufferSubData or TexImage2D with a pixel
     unpack buffer.  The following table defines how the storage of the resource influences the
     behavior of these copies.

     Table 20.1 Behavior of Copy Commands with Multi-GPU Storage

     Source     Destination Behavior
     ---------- ----------- -----------------------------------------------------------------------
     GPU-shared GPU-shared  There is just one source and one destination.  Copy from source to
                            destination.
     GPU-shared per-GPU     There is a single source.  Copy it to the destination on all GPUs.
     GPU-shared undefined   Either of the above behaviors for a GPU-shared source may apply.

     per-GPU    GPU-shared  Copy from the GPU with the lowest index set in RENDER_GPU_MASK_NV to
                            the shared destination.
     per-GPU    per-GPU     Implementations are encouraged to copy from source to destination
                            separately on each GPU.  This is not required.  If and when this is not
                            feasible, the copy should source from the GPU with the lowest index set
                            in RENDER_GPU_MASK_NV.
     per-GPU    undefined   Either of the above behaviors for a per-GPU source may apply.

     undefined  GPU-shared  Either of the above behaviors for a GPU-shared destination may apply.
     undefined  per-GPU     Either of the above behaviors for a per-GPU destination may apply.
     undefined  undefined   Any of the above behaviors may apply.
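
     Several rows of Table 20.1 refer to "the GPU with the lowest index set in
     RENDER_GPU_MASK_NV".  As a sketch of how an application might predict that source GPU, the
     hypothetical helper below returns the index of the lowest set bit of a mask.

```c
#include <stdint.h>

/* Hypothetical helper: index of the lowest GPU present in mask, or -1 if the
   mask is empty (a valid RENDER_GPU_MASK_NV is never zero). */
int lowest_gpu_in_mask(uint32_t mask)
{
    int index = 0;
    if (mask == 0)
        return -1;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        ++index;
    }
    return index;
}
```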

     20.5 Multi-GPU Synchronization

     MulticastCopyImageSubDataNV and MulticastCopyBufferSubDataNV each provide implicit
     synchronization with previous work on the source GPU.  MulticastBlitFramebufferNV is
     different, providing implicit synchronization with previous work on the destination GPU.
     In both cases, synchronization of the copies can be achieved with calls to the barrier
     command:

       void MulticastBarrierNV(void);

     This is called to block all GPUs until all previous commands have been completed by all GPUs,
     and all writes have landed.  To guarantee consistency, synchronization must be placed between
     any two accesses by multiple GPUs to the same memory when at least one of the accesses is a
     write.  This includes accesses to both the source and the destination.  The safest approach is
     to call MulticastBarrierNV immediately before and after each copy that involves multiple GPUs.

     GPU writes and reads to/from GPU-shared locations require synchronization as well.  GPU writes
     such as transform feedback, shader image store, CopyTexImage, CopyBufferSubData are not
     automatically synchronized with writes by other GPUs.  Neither are GPU reads such as texture
     fetches, shader image loads, CopyTexImage, etc. synchronized with writes by other GPUs.
     Existing barriers such as TextureBarrier and MemoryBarrier only provide consistency guarantees
     for rendering, writes and reads on a single GPU.

     In some cases it may be desirable to have one or more GPUs wait for an operation to complete
     on another GPU without synchronizing all GPUs with MulticastBarrierNV.  This can be performed
     with the following command:

       void MulticastWaitSyncNV(uint signalGpu, bitfield waitGpuMask);

     INVALID_VALUE is generated
      * if <signalGpu> is greater than or equal to MULTICAST_GPUS_NV,
      * if <waitGpuMask> is zero,
      * if <waitGpuMask> is greater than or equal to 2^n, where n is equal to
        MULTICAST_GPUS_NV, or
      * if <signalGpu> is present in <waitGpuMask>.

     MulticastWaitSyncNV provides the same consistency guarantees as MulticastBarrierNV but only
     between the GPUs specified by <signalGpu> and <waitGpuMask> in a single direction.  It forces
     the GPUs specified by <waitGpuMask> to wait until the GPU specified by <signalGpu> has completed
     all previous commands and writes associated with those commands.
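
     The INVALID_VALUE conditions listed for MulticastWaitSyncNV can be summarized by the
     following sketch.  The name wait_sync_args_are_valid is hypothetical, <n> stands for the
     value of MULTICAST_GPUS_NV, and n is assumed to be less than 32.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: true when none of the four INVALID_VALUE conditions
   for MulticastWaitSyncNV applies.  Assumes 1 <= n < 32. */
bool wait_sync_args_are_valid(unsigned signal_gpu, uint32_t wait_gpu_mask,
                              unsigned n)
{
    if (signal_gpu >= n)
        return false;                       /* signal GPU out of range */
    if (wait_gpu_mask == 0)
        return false;                       /* nobody would wait */
    if (wait_gpu_mask >= ((uint32_t)1 << n))
        return false;                       /* mask names a GPU >= n */
    if ((wait_gpu_mask >> signal_gpu) & 1u)
        return false;                       /* a GPU cannot wait on itself */
    return true;
}
```

     With two GPUs, the pairs (0, 0x2) and (1, 0x1) pass this check, while (0, 0x1) does not
     because the signalling GPU is also in the wait mask.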

     20.6 Multi-GPU Queries

     Queries are performed across all multicast GPUs.  Each query object stores independent result
     values for each GPU.  The result value for a specific GPU can be queried using one of the
     following commands:

     void MulticastGetQueryObjectivNV(uint gpu, uint id, enum pname, int *params);
     void MulticastGetQueryObjectuivNV(uint gpu, uint id, enum pname, uint *params);
     void MulticastGetQueryObjecti64vNV(uint gpu, uint id, enum pname, int64 *params);
     void MulticastGetQueryObjectui64vNV(uint gpu, uint id, enum pname, uint64 *params);

     The behavior of these commands matches the GetQueryObject* equivalent commands, except they
     return the result value for the specified GPU.  A query may be available on one GPU but not on
     another, so it may be necessary to check QUERY_RESULT_AVAILABLE for each GPU.  GetQueryObject*
     return query results and availability for GPU 0 only.

     In addition to the errors generated by GetQueryObject* (see the listing in section 4.2 on page
     49), calling MulticastGetQueryObject* will generate INVALID_VALUE if <gpu> is greater than or
     equal to MULTICAST_GPUS_NV.

 Additions to Chapter 8 of the OpenGL 4.5 (Compatibility Profile) Specification
 (Textures and Samplers)

     Modify Section 8.10 (Texture Parameters)

     Insert the following paragraph before Table 8.25 (Texture parameters and their values):

         If <pname> is PER_GPU_STORAGE_NV, then the state is stored in the texture, but only takes
         effect the next time storage is allocated for a texture using TexImage*, TexStorage* or
         TextureStorage*.  If the value of TEXTURE_IMMUTABLE_FORMAT is TRUE, then PER_GPU_STORAGE_NV
         cannot be changed and an error is generated.

     Additions to Table 8.26 Texture parameters and their values

     Name               Type    Legal values
     ------------------ ------- ------------
     PER_GPU_STORAGE_NV boolean TRUE, FALSE

 Additions to Chapter 10 of the OpenGL 4.5 (Compatibility Profile) Specification
 (Vertex Specification and Drawing Commands)

     Modify Section 10.9 (Conditional Rendering)

     Replace the following text:

         If the result (SAMPLES_PASSED) of the query is zero, or if the result (ANY_SAMPLES_PASSED
         or ANY_SAMPLES_PASSED_CONSERVATIVE) is FALSE, all rendering commands described in
         section 2.4 are discarded and have no effect when issued between BeginConditionalRender
         and the corresponding EndConditionalRender

     with this text:

         For each active render GPU, if the result (SAMPLES_PASSED) of the query on that GPU is
         zero, or if the result (ANY_SAMPLES_PASSED or ANY_SAMPLES_PASSED_CONSERVATIVE) is FALSE,
         all rendering commands described in section 2.4 are discarded by this GPU and have no
         effect when issued between BeginConditionalRender and the corresponding
         EndConditionalRender

     Similarly replace the following:

         If the result (SAMPLES_PASSED) of the query is non-zero, or if the result
         (ANY_SAMPLES_PASSED or ANY_SAMPLES_PASSED_CONSERVATIVE) is TRUE, such commands are not
         discarded.

     with this:

         For each active render GPU, if the result (SAMPLES_PASSED) of the query on that GPU is
         non-zero, or if the result (ANY_SAMPLES_PASSED or ANY_SAMPLES_PASSED_CONSERVATIVE) is
         TRUE, such commands are not discarded.

     Finally, replace all instances of "the GL" with "each active render GPU".
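
     The per-GPU wording above means that a conditionally rendered command may execute on some
     GPUs and be discarded on others.  The following sketch models that outcome under simplifying
     assumptions: conditional_render_gpus is a hypothetical name, query_result holds one already
     available result per GPU, and conditional render modes (such as the no-wait and inverted
     variants) are not modelled.

```c
#include <stdint.h>

/* Hypothetical model: returns the mask of GPUs on which rendering commands
   issued between BeginConditionalRender and EndConditionalRender are not
   discarded.  A GPU executes the commands only if it is present in
   render_gpu_mask and its own query result is non-zero.
   Assumes n < 32 and that query_result has n entries. */
uint32_t conditional_render_gpus(uint32_t render_gpu_mask,
                                 const unsigned *query_result, unsigned n)
{
    uint32_t executing = 0;
    for (unsigned gpu = 0; gpu < n; ++gpu) {
        uint32_t bit = (uint32_t)1 << gpu;
        if ((render_gpu_mask & bit) != 0 && query_result[gpu] != 0)
            executing |= bit;
    }
    return executing;
}
```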

 Additions to Chapter 14 of the OpenGL 4.5 (Compatibility Profile) Specification
 (Fixed-Function Primitive Assembly and Rasterization)

     Modify Section 14.3.1 (Multisampling)

     Replace the following text:

         The location for sample <i> is taken from v[2*(i-start)] and v[2*(i-start)+1].

     with the following:

         These commands set the sample locations for all multicast GPUs in
         MULTICAST_FRAMEBUFFER_PROGRAMMABLE_SAMPLE_LOCATIONS_NV.  The location for sample <i> on
         gpu <g> is taken from v[g*N+2*(i-start)] and v[g*N+2*(i-start)+1].

     Replace the following error generated by GetMultisamplefv:

         An INVALID_ENUM error is generated if <pname> is not SAMPLE_LOCATION_ARB or
         PROGRAMMABLE_SAMPLE_LOCATION_ARB.

     with the following:

         An INVALID_ENUM error is generated if <pname> is not SAMPLE_LOCATION_ARB,
         PROGRAMMABLE_SAMPLE_LOCATION_ARB or MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV.

     Add the following to the list of errors generated by GetMultisamplefv:

         An INVALID_VALUE error is generated if <pname> is
         MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV and <index> is greater than or equal to the
         value of PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB multiplied by the value of
         MULTICAST_GPUS_NV.

     Replace the following pseudocode (in both locations):

         float *table = FRAMEBUFFER_PROGRAMMABLE_SAMPLE_LOCATIONS_ARB;
         sample_location.xy = (table[2*sample_i], table[2*sample_i+1]);

     with the following:

         float *table = MULTICAST_FRAMEBUFFER_PROGRAMMABLE_SAMPLE_LOCATIONS_NV;
         table += PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB * gpu;
         sample_location.xy = (table[2*sample_i], table[2*sample_i+1]);

 Additions to the WGL/GLX/EGL/AGL Specifications

     None

 Dependencies on ARB_sample_locations

     If ARB_sample_locations is not supported, section 20.3.2 and any references to
     MulticastFramebufferSampleLocationsfvNV and MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV should
     be removed.  The modifications to Section 14.3.1 (Multisampling) should also be removed.

 Dependencies on ARB_sparse_buffer

     If ARB_sparse_buffer is not supported, any reference to SPARSE_STORAGE_BIT_ARB should be
     removed.

 Errors

     Relaxation of INVALID_ENUM errors
     ---------------------------------
     GetBooleanv, GetIntegerv, GetInteger64v, GetFloatv, and GetDoublev now accept new tokens as
     described in the "New Tokens" section.

 New State

     Additions to Table 23.4 Rasterization
                                                    Initial
     Get Value                   Type  Get Command Value  Description               Sec.  Attribute
     -------------------------- ------ ----------- -----  -----------------------   ----  ---------
     RENDER_GPU_MASK_NV           Z+   GetIntegerv   *    Mask of GPUs that have    20.1     -
                                                            writes enabled
     * See section 20.1

     Additions to Table 23.19 Textures (state per texture object)

                                                     Initial
     Get Value                Type   Get Command      Value    Description                  Sec.
     ---------                ----   -----------      -------  -----------                  ----
     PER_GPU_STORAGE_NV       B      GetTexParameter  FALSE    Per-GPU storage requested    20.3


     Additions to Table 23.30 Framebuffer (state per framebuffer object)

     Get Value                Get Command      Type Initial Value    Description          Sec.    Attribute
     ---------                -----------      ---- -------------    -----------          ----    ---------
     MULTICAST_PROGRAMMABLE_- GetMultisamplefv  *    (0.5,0.5)       Programmable sample  20.3.2      -
         SAMPLE_LOCATION_NV

     * The type here is "2* x n x 2 x R[0,1]" which is equivalent to PROGRAMMABLE_SAMPLE_LOCATION_ARB
     but with sample locations for all multicast GPUs (one after the other).

 New Implementation Dependent State

     Add to Table 23.82, Implementation-Dependent Values, p. 784

                                                      Minimum
     Get Value                     Type   Get Command  Value  Description               Sec.  Attribute
     ---------------------------- ------ ------------- -----  ----------------------    ----  ---------
     MULTICAST_GPUS_NV              Z+    GetIntegerv    1    Number of linked GPUs     20.0     -
                                                              usable for multicast

 Backwards Compatibility

     This extension replaces NVX_linked_gpu_multicast.  The enumerant values for MULTICAST_GPUS_NV
     and PER_GPU_STORAGE_BIT_NV match those of MAX_LGPU_GPUS_NVX and LGPU_SEPARATE_STORAGE_BIT_NVX
     (respectively).  MulticastBufferSubDataNV, MulticastCopyImageSubDataNV and MulticastBarrierNV
     behave analogously to LGPUNamedBufferSubDataNVX, LGPUCopyImageSubDataNVX and LGPUInterlockNVX
     (respectively).

 Sample Code

     Binocular stereo rendering example using NV_gpu_multicast with single GPU fallback:

     struct ViewData {
         GLint viewport_index;
         GLfloat mvp[16];
         GLfloat modelview[16];
     };
     ViewData leftViewData = { 0, {...}, {...} };
     ViewData rightViewData = { 1, {...}, {...} };

     // Size of one per-view uniform block as uploaded below
     const GLsizeiptr size = sizeof(ViewData);

     GLuint ubo[2];
     glCreateBuffers(2, &ubo[0]);

     if (has_NV_gpu_multicast) {
         glNamedBufferStorage(ubo[0], size, NULL, GL_PER_GPU_STORAGE_BIT_NV | GL_DYNAMIC_STORAGE_BIT);
         glMulticastBufferSubDataNV(0x1, ubo[0], 0, size, &leftViewData);
         glMulticastBufferSubDataNV(0x2, ubo[0], 0, size, &rightViewData);
     } else {
         glNamedBufferStorage(ubo[0], size, &leftViewData, 0);
         glNamedBufferStorage(ubo[1], size, &rightViewData, 0);
     }

     glViewportIndexedf(0, 0, 0, 640, 480);  // left viewport
     glViewportIndexedf(1, 640, 0, 640, 480);  // right viewport
     // Vertex shader sets gl_ViewportIndex according to viewport_index in UBO

     glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

     if (has_NV_gpu_multicast) {
         glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
         drawScene();
         // Make GPU 1 wait for glClear above to complete on GPU 0
         glMulticastWaitSyncNV(0, 0x2);
         // Copy right viewport from GPU 1 to GPU 0
         glMulticastCopyImageSubDataNV(1, 0x1,
                                       renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0,
                                       renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0,
                                       640, 480, 1);
         // Make GPU 0 wait for the copy from GPU 1 to complete
         glMulticastWaitSyncNV(1, 0x1);
     } else {
         glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
         drawScene();
         glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[1]);
         drawScene();
     }
     // Both viewports are now present in GPU 0's renderbuffer

 Issues

   (1) Should we provide explicit inter-GPU synchronization API?  Will this make the implementation
     easier or harder for the driver and applications?

     RESOLVED. Yes. A naive implementation of implicit synchronization would simply synchronize the
     GPUs before and after each copy.  Smart implicit synchronization would have to track all APIs
     that can modify buffers and textures, creating an excessive burden for driver implementation
     and maintenance.  An application can track dependencies more easily and outperform a naive
     driver implementation using explicit synchronization.

   (2) How does this extension interact with queries (e.g. occlusion queries)?

     RESOLVED. Queries are performed separately on each GPU. The standard GetQueryObject* APIs
     return query results for GPU 0 only. However, GetQueryBufferObject* can be used to retrieve
     query results for all GPUs through a buffer with separate storage (PER_GPU_STORAGE_BIT_NV).

   (3) Are copy operations controlled by the render mask?

     RESOLVED. Copies which write to the framebuffer are considered render commands and implicitly
     controlled by the render mask.  Copies between textures and buffers are not considered render
     commands so they are not influenced by the mask.  If masked copies are desired, use
     MulticastCopyImageSubDataNV, MulticastCopyBufferSubDataNV or MulticastBlitFramebufferNV.
     These commands explicitly specify the GPU source and destination and are not influenced by the
     render mask.

   (4) What happens if the MulticastCopyBufferSubDataNV source and destination buffers are the
     same?

     RESOLVED.  When the source and destination involve the same GPU, MulticastCopyBufferSubDataNV
     matches the behavior of CopyBufferSubData: overlapped copies are not allowed and an
     INVALID_VALUE error results.  When the source and destination do not involve the same GPU,
     overlapping copies are allowed and no error is generated.
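
     An application that wants to avoid the error case could pre-check a copy on the client side.
     The following is a minimal sketch, not driver behavior: the function name is hypothetical, and
     it assumes "involve the same GPU" means that the destination mask contains the source GPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical client-side pre-check, not part of NV_gpu_multicast.
 * Returns true when the copy reads and writes overlapping ranges of the same
 * buffer on a GPU that is both the source and one of the destinations. */
static bool multicast_copy_overlaps_on_same_gpu(unsigned src_gpu, uint32_t dst_gpu_mask,
                                                bool same_buffer,
                                                int64_t read_offset, int64_t write_offset,
                                                int64_t size)
{
    assert(src_gpu < 32 && read_offset >= 0 && write_offset >= 0 && size >= 0);
    /* Assumption: the source GPU is "involved" if its bit is set in the mask. */
    bool same_gpu = (dst_gpu_mask & (UINT32_C(1) << src_gpu)) != 0;
    /* Half-open ranges [offset, offset + size) intersect. */
    bool ranges_overlap = read_offset < write_offset + size &&
                          write_offset < read_offset + size;
    return same_buffer && same_gpu && ranges_overlap;
}
```

     With this check, a copy from GPU 0 to mask 0x2 within one buffer passes even when the ranges
     overlap, while the same ranges with mask 0x1 are flagged.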

   (5) How does this extension interact with CopyTexImage2D?

     RESOLVED.  The behavior depends on the storage type of the target.  See section 20.4.  Since
     CopyTexImage* sources from the framebuffer, the source always has per-GPU storage.

   (6) Should we provide a mechanism to modify viewports independently for each GPU?

     RESOLVED. No. This can be achieved using multicast UBOs and ARB_shader_viewport_layer_array.

   (7) Should we add a present API that automatically displays content from a specific GPU? It
     could abstract the transport mechanism, copying when necessary.

     RESOLVED. No. Transfers should be avoided to maximize performance and minimize latency.
     Minimizing transfers requires application awareness of display connectivity to assign
     rendering appropriately.  Hiding transfers behind an API would also prevent some interesting
     multi-GPU rendering techniques (e.g. checkerboard-style split rendering).

     WGL_NV_bridged_display can be used to enable display from multiple GPUs without copies.

   (8) Should we expose the extension on single-GPU configurations?

     RESOLVED.  Yes, this is recommended.  It allows more code sharing between multi-GPU and
     single-GPU code paths.  If there is only one GPU present MULTICAST_GPUS_NV will be 1.  It
     may also be 1 if explicit GPU control is unavailable (e.g. if the active multi-GPU rendering
     mode prevents it).  Note that in revisions 5 and prior of this extension the minimum for
     MULTICAST_GPUS_NV was 2.
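
     One way to share code between the two paths is to derive GPU masks from the queried GPU count
     rather than hard-coding them.  The sketch below is illustrative only; the function name is
     hypothetical, and gpu_count stands for the value an application would obtain from GetIntegerv
     with MULTICAST_GPUS_NV.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of NV_gpu_multicast.
 * Builds a bitmask with one bit set for each of the gpu_count linked GPUs. */
static uint32_t multicast_all_gpus_mask(unsigned gpu_count)
{
    assert(gpu_count >= 1 && gpu_count <= 32);
    /* Shifting a 32-bit value by 32 is undefined in C, so handle it separately. */
    return gpu_count == 32 ? UINT32_MAX : (UINT32_C(1) << gpu_count) - 1;
}
```

     On a single-GPU configuration the helper yields 0x1, so the same mask-based code addresses
     only GPU 0.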

   (9) Should glGet*BufferParameter* return the PER_GPU_STORAGE_BIT_NV bit when
     BUFFER_STORAGE_FLAGS is queried?

     RESOLVED. Yes. BUFFER_STORAGE_FLAGS must match the flags parameter input to *BufferStorage, as
     specified in table 6.3.

   (10) Can a query be complete/available on one GPU and not another?

     RESOLVED. Yes. Independent query completion is important for conditional rendering.  It
     allows each GPU to begin conditional rendering in mode QUERY_WAIT without waiting on other
     GPUs.

   (11) How can custom texel data be uploaded to each GPU for a given texture?

     The easiest way is to create staging textures with the custom texel data and then copy it
     to a texture with per-GPU storage using MulticastCopyImageSubDataNV.

   (12) Should we allow the waitGpuMask in MulticastWaitSyncNV to include the signal GPU?

     RESOLVED. No. There is no reason for a GPU to wait on itself.  This is effectively a no-op in
     the command stream.  Furthermore it is easy to confuse GPU indices and masks, so it is
     beneficial to explicitly generate an error in this case.
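
     Because indices and masks are easy to mix up, an application may want a debug-time check
     before calling MulticastWaitSyncNV.  The following is a minimal sketch with a hypothetical
     function name.  Only the rule that the wait mask must exclude the signal GPU comes from this
     issue; rejecting an empty mask and out-of-range bits are additional assumptions of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical debug check, not part of NV_gpu_multicast.
 * signal_gpu is a GPU index; wait_gpu_mask is a bitmask of GPU indices. */
static bool multicast_wait_args_valid(unsigned signal_gpu, uint32_t wait_gpu_mask,
                                      unsigned gpu_count)
{
    assert(gpu_count >= 1 && gpu_count <= 32);
    uint32_t all = gpu_count == 32 ? UINT32_MAX : (UINT32_C(1) << gpu_count) - 1;
    if (signal_gpu >= gpu_count)
        return false;                         /* assumption: index must be in range */
    if (wait_gpu_mask == 0 || (wait_gpu_mask & ~all) != 0)
        return false;                         /* assumption: non-empty, known GPUs only */
    /* Rule from this issue: a GPU may not wait on itself. */
    return (wait_gpu_mask & (UINT32_C(1) << signal_gpu)) == 0;
}
```

     For example, with two GPUs the arguments (0, 0x2) pass, while (0, 0x1) are rejected because
     mask 0x1 names GPU 0 itself.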

   (13) Will support for NVX_linked_gpu_multicast continue?

     RESOLVED. NVX_linked_gpu_multicast is deprecated and applications should switch to
     NV_gpu_multicast.  However, implementations are encouraged to continue supporting
     NVX_linked_gpu_multicast for backwards compatibility.

   (14) Does RenderGpuMaskNV work with immediate mode rendering?

     RESOLVED. Yes, the render GPU mask applies to immediate mode rendering the same as other
     rendering.  Note that RenderGpuMaskNV is not one of the commands allowed between Begin and End
     (see section 10.7.5) so the render mask must be set before Begin is called.

 Revision History

     Rev.    Date    Author    Changes
     ----  --------  --------  -----------------------------------------------
      6    01/03/19  jschnarr  reduce MULTICAST_GPUS_NV minimum to 1
                               clarify that MULTICAST_GPUS_NV is constant for a context
      5    10/07/16  jschnarr  trivial typo fix
      4    07/21/16  mjk       registered
      3    06/15/16  jschnarr  R370 release