Name
NV_gpu_multicast
Name Strings
GL_NV_gpu_multicast
Contact
Joshua Schnarr, NVIDIA Corporation (jschnarr 'at' nvidia.com)
Ingo Esser, NVIDIA Corporation (iesser 'at' nvidia.com)
Contributors
Christoph Kubisch, NVIDIA
Mark Kilgard, NVIDIA
Robert Menzel, NVIDIA
Kevin Lefebvre, NVIDIA
Ralf Biermann, NVIDIA
Status
Shipping in NVIDIA release 370.XX drivers and up.
Version
Last Modified Date: April 2, 2019
Revision: 7
Number
OpenGL Extension #494
Dependencies
This extension is written against the OpenGL 4.5 specification
(Compatibility Profile), dated February 2, 2015.
This extension requires ARB_copy_image.
This extension interacts with ARB_sample_locations.
This extension interacts with ARB_sparse_buffer.
This extension requires EXT_direct_state_access.
This extension interacts with EXT_bindable_uniform.
Overview
This extension enables novel multi-GPU rendering techniques by providing application control
over a group of linked GPUs with identical hardware configuration.
Multi-GPU rendering techniques fall into two categories: implicit and explicit. Existing
explicit approaches like WGL_NV_gpu_affinity have two main drawbacks: CPU overhead and
application complexity. An application must manage one context per GPU and multi-pump the API
stream. Implicit multi-GPU rendering techniques avoid these issues by broadcasting rendering
from one context to multiple GPUs. Common implicit approaches include alternate-frame
rendering (AFR), split-frame rendering (SFR) and multi-GPU anti-aliasing. They each have
drawbacks. AFR scales nicely but interacts poorly with inter-frame dependencies. SFR can
improve latency but has challenges with offscreen rendering and scaling of vertex processing.
With multi-GPU anti-aliasing, each GPU renders the same content with alternate sample
positions and the driver blends the result to improve quality. This also has issues with
offscreen rendering and can conflict with other anti-aliasing techniques.
These issues with implicit multi-GPU rendering all have the same root cause: the driver lacks
adequate knowledge to accelerate every application. To resolve this, NV_gpu_multicast
provides fine-grained, explicit application control over multiple GPUs with a single context.
Key points:
- One context controls multiple GPUs. Every GPU in the linked group can access every object.
- Rendering is broadcast. Each draw is repeated across all GPUs in the linked group.
- Each GPU gets its own instance of all framebuffers, allowing individualized output for each
GPU. Input data can be customized for each GPU using buffers created with the storage flag
PER_GPU_STORAGE_BIT_NV and a new API, MulticastBufferSubDataNV.
- New interfaces provide mechanisms to transfer textures and buffers from one GPU to another.
New Procedures and Functions
void RenderGpuMaskNV(bitfield mask);
void MulticastBufferSubDataNV(
bitfield gpuMask, uint buffer,
intptr offset, sizeiptr size,
const void *data);
void MulticastCopyBufferSubDataNV(
uint readGpu, bitfield writeGpuMask,
uint readBuffer, uint writeBuffer,
intptr readOffset, intptr writeOffset, sizeiptr size);
void MulticastCopyImageSubDataNV(
uint srcGpu, bitfield dstGpuMask,
uint srcName, enum srcTarget,
int srcLevel,
int srcX, int srcY, int srcZ,
uint dstName, enum dstTarget,
int dstLevel,
int dstX, int dstY, int dstZ,
sizei srcWidth, sizei srcHeight, sizei srcDepth);
void MulticastBlitFramebufferNV(uint srcGpu, uint dstGpu,
int srcX0, int srcY0, int srcX1, int srcY1,
int dstX0, int dstY0, int dstX1, int dstY1,
bitfield mask, enum filter);
void MulticastFramebufferSampleLocationsfvNV(uint gpu, uint framebuffer, uint start,
sizei count, const float *v);
void MulticastBarrierNV(void);
void MulticastWaitSyncNV(uint signalGpu, bitfield waitGpuMask);
void MulticastGetQueryObjectivNV(uint gpu, uint id, enum pname, int *params);
void MulticastGetQueryObjectuivNV(uint gpu, uint id, enum pname, uint *params);
void MulticastGetQueryObjecti64vNV(uint gpu, uint id, enum pname, int64 *params);
void MulticastGetQueryObjectui64vNV(uint gpu, uint id, enum pname, uint64 *params);
New Tokens
Accepted in the <flags> parameter of BufferStorage and NamedBufferStorageEXT:
PER_GPU_STORAGE_BIT_NV 0x0800
Accepted by the <pname> parameter of GetBooleanv, GetIntegerv, GetInteger64v, GetFloatv, and
GetDoublev:
MULTICAST_GPUS_NV 0x92BA
RENDER_GPU_MASK_NV 0x9558
Accepted as a value for <pname> for the TexParameter{if}, TexParameter{if}v,
TextureParameter{if}, TextureParameter{if}v, MultiTexParameter{if}EXT and
MultiTexParameter{if}vEXT commands and for the <value> parameter of GetTexParameter{if}v,
GetTextureParameter{if}vEXT and GetMultiTexParameter{if}vEXT:
PER_GPU_STORAGE_NV 0x9548
Accepted by the <pname> parameter of GetMultisamplefv:
MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV 0x9549
Additions to the OpenGL 4.5 Specification (Compatibility Profile)
(Add a new chapter after chapter 19 "Compute Shaders")
20 Multicast Rendering
Some implementations support multiple linked GPUs driven by a single context. Often the
distribution of work to individual GPUs is managed by the GL without client knowledge. This
chapter specifies commands for explicitly distributing work across GPUs in a linked group.
Rendering can be enabled or disabled for specific GPUs. Draw commands are multicast, or
repeated across all enabled GPUs. Objects are shared by all GPUs, however each GPU has its
own instance (copy) of many resources, including framebuffers. When each GPU has its own
instance of a resource, it is considered to have per-GPU storage. When all GPUs share a
single instance of a resource, this is considered GPU-shared storage.
The mechanism for linking GPUs is implementation specific, as is the mechanism for enabling
multicast rendering support (if necessary). The number of GPUs usable for multicast rendering
by a context can be queried by calling GetIntegerv with the symbolic constant
MULTICAST_GPUS_NV. This number is constant for the lifetime of a context. Individual GPUs
are identified using zero-based indices in the range [0, n-1], where n is the number of
multicast GPUs. GPUs are also identified by bitmasks of the form 2^i, where i is the GPU
index. A set of GPUs is specified by the union of masks for each GPU in the set.
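The index/mask correspondence can be sketched in C (the helper names here are illustrative,
not part of the extension):

```c
#include <assert.h>

/* Mask identifying GPU i: 2^i, as described above. */
static unsigned gpu_mask(unsigned gpu_index)
{
    return 1u << gpu_index;
}

/* Mask identifying all n GPUs in the linked group: the union of the
 * per-GPU masks, i.e. (2^n)-1. */
static unsigned all_gpus_mask(unsigned n)
{
    return (1u << n) - 1u;
}
```

A set containing GPUs 0 and 2, for example, is specified as gpu_mask(0) | gpu_mask(2) = 0x5.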
20.1 Controlling Individual GPUs
Render commands are restricted to a specific set of GPUs with
void RenderGpuMaskNV(bitfield mask);
The following errors apply to RenderGpuMaskNV:
INVALID_OPERATION is generated
* if <mask> is zero,
* if <mask> is not zero and <mask> is greater than or equal to 2^n, where n is equal
to MULTICAST_GPUS_NV,
* if issued between BeginConditionalRender and the corresponding EndConditionalRender.
If the command does not generate an error, RENDER_GPU_MASK_NV is set to <mask>. The default
value of RENDER_GPU_MASK_NV is (2^n)-1.
Render commands are skipped for a GPU that is not present in RENDER_GPU_MASK_NV. For example:
draw calls, clears, compute dispatches, and copies or pixel path operations that write to a
framebuffer (e.g. DrawPixels, BlitFramebuffer). For a full list of render commands see
section 2.4 (page 26). MulticastBlitFramebufferNV is an exception to this policy: while it is
a rendering command, it has its own source and destination GPU parameters. Note that buffer
and texture updates are not affected by RENDER_GPU_MASK_NV.
20.2 Multi-GPU Buffer Storage
Like other resources, buffer objects can have two types of storage, per-GPU storage or
GPU-shared storage. Per-GPU storage can be explicitly requested using the
PER_GPU_STORAGE_BIT_NV flag with BufferStorage/NamedBufferStorageEXT. If this flag is not
set, the type of storage used is undefined. The implementation may use either type and
transition between them at any time. Client reads of a buffer with per-GPU storage may source
from any GPU.
The following rules apply to buffer objects with per-GPU storage:
- When mapped, updates apply to all GPUs (only WRITE_ONLY access is supported).
- When used as the write buffer for CopyBufferSubData or CopyNamedBufferSubData, writes apply
  to all GPUs.
The following commands affect storage on all GPUs, even if the buffer object has per-GPU
storage:
BufferSubData, NamedBufferSubData, ClearBufferSubData, and ClearNamedBufferData
An INVALID_VALUE error is generated if BufferStorage/NamedBufferStorageEXT is called with
PER_GPU_STORAGE_BIT_NV set with MAP_READ_BIT or SPARSE_STORAGE_BIT_ARB.
To modify buffer object data on one or more GPUs, the client may use the command
void MulticastBufferSubDataNV(
bitfield gpuMask, uint buffer,
intptr offset, sizeiptr size,
const void *data);
This command operates similarly to NamedBufferSubData, except that it updates the per-GPU
buffer data on the set of GPUs defined by <gpuMask>. If <buffer> has GPU-shared storage,
<gpuMask> is ignored and the shared instance of the buffer is updated.
An INVALID_VALUE error is generated if <gpuMask> is zero or is greater than or equal to 2^n,
where n is equal to MULTICAST_GPUS_NV.
An INVALID_OPERATION error is generated if <buffer> is not the name of an existing buffer
object.
An INVALID_VALUE error is generated if <offset> or <size> is negative, or if <offset> + <size>
is greater than the value of BUFFER_SIZE for the buffer object.
An INVALID_OPERATION error is generated if any part of the specified buffer range is mapped
with MapBufferRange or MapBuffer (see section 6.3), unless it was mapped with
MAP_PERSISTENT_BIT set in the MapBufferRange access flags.
An INVALID_OPERATION error is generated if the BUFFER_IMMUTABLE_STORAGE flag of the buffer
object is TRUE and the value of BUFFER_STORAGE_FLAGS for the buffer does not have the
DYNAMIC_STORAGE_BIT set.
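As a sketch, a client could mirror these error conditions before issuing the call. The helper
below and its parameters are hypothetical; n_gpus stands for the result of querying
MULTICAST_GPUS_NV and buffer_size for BUFFER_SIZE:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical client-side pre-check mirroring the MulticastBufferSubDataNV
 * error conditions above. */
static bool multicast_subdata_args_valid(uint32_t gpuMask, int64_t offset,
                                         int64_t size, int64_t buffer_size,
                                         unsigned n_gpus)
{
    if (gpuMask == 0u || gpuMask >= (1u << n_gpus))
        return false;  /* INVALID_VALUE: empty or out-of-range GPU set */
    if (offset < 0 || size < 0 || offset + size > buffer_size)
        return false;  /* INVALID_VALUE: range outside the buffer */
    return true;
}
```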
To copy between buffers created with PER_GPU_STORAGE_BIT_NV, the client may use the command
void MulticastCopyBufferSubDataNV(
uint readGpu, bitfield writeGpuMask,
uint readBuffer, uint writeBuffer,
intptr readOffset, intptr writeOffset, sizeiptr size);
This command operates similarly to CopyNamedBufferSubData, while adding control over the
source and destination GPU(s). The read GPU index is specified by <readGpu> and
the set of write GPUs is specified by the mask in <writeGpuMask>.
Implementations may also support this command with buffers not created with
PER_GPU_STORAGE_BIT_NV. This support can be determined with one test copy with an error check
(see error discussion below). Note that a buffer created without PER_GPU_STORAGE_BIT_NV is
considered to have undefined storage and the behavior of the command depends on the storage
type (per-GPU or GPU-shared) currently used for <writeBuffer>. If <writeBuffer> is using
GPU-shared storage, the normal error checks apply but the command behaves as if <writeGpuMask>
includes all GPUs. If <writeBuffer> is using per-GPU storage, the command behaves as if
PER_GPU_STORAGE_BIT_NV were set, however performance may be reduced.
The following error may apply to MulticastCopyBufferSubDataNV on some implementations and not
on others. In earlier revisions of this extension the error was required, therefore
applications should perform a test copy using buffers without PER_GPU_STORAGE_BIT_NV before
relying on that functionality:
An INVALID_OPERATION error is generated if the value of BUFFER_STORAGE_FLAGS for <readBuffer>
or <writeBuffer> does not have PER_GPU_STORAGE_BIT_NV set.
The following errors apply to MulticastCopyBufferSubDataNV:
An INVALID_OPERATION error is generated if <readBuffer> or <writeBuffer> is not the name of an
existing buffer object.
An INVALID_VALUE error is generated if any of <readOffset>, <writeOffset>, or <size> are
negative, if <readOffset> + <size> exceeds the size of the source buffer object, or if
<writeOffset> + <size> exceeds the size of the destination buffer object.
An INVALID_OPERATION error is generated if either the source or destination buffer object is
mapped, unless it was mapped with MAP_PERSISTENT_BIT set in the Map*BufferRange access
flags.
An INVALID_VALUE error is generated if <readGpu> is greater than or equal to
MULTICAST_GPUS_NV.
An INVALID_OPERATION error is generated if <writeGpuMask> is zero. An INVALID_VALUE error is
generated if <writeGpuMask> is not zero and <writeGpuMask> is greater than or equal to 2^n,
where n is equal to MULTICAST_GPUS_NV.
An INVALID_VALUE error is generated if the source and destination are the same buffer object,
<readGpu> is present in <writeGpuMask>, and the ranges [<readOffset>; <readOffset> + <size>)
and [<writeOffset>; <writeOffset> + <size>) overlap.
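The self-copy rule above can be expressed as a small predicate (the names below are
illustrative, not extension API): the overlap error applies only when the same buffer is both
source and destination, the read GPU is in the write mask, and the byte ranges intersect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open ranges [read_off, read_off+size) and [write_off, write_off+size)
 * intersect exactly when each start precedes the other end. */
static bool ranges_overlap(int64_t read_off, int64_t write_off, int64_t size)
{
    return read_off < write_off + size && write_off < read_off + size;
}

static bool self_copy_overlap_error(unsigned readGpu, uint32_t writeGpuMask,
                                    bool same_buffer, int64_t readOffset,
                                    int64_t writeOffset, int64_t size)
{
    return same_buffer &&
           (writeGpuMask & (1u << readGpu)) != 0u &&
           ranges_overlap(readOffset, writeOffset, size);
}
```

Note that a copy between overlapping ranges of the same buffer on *different* GPUs is legal,
per issue (4) below.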
20.3 Multi-GPU Framebuffers and Textures
All buffers in the default framebuffer as well as renderbuffers receive per-GPU storage. By
default, storage for textures is undefined: it may be per-GPU or GPU-shared and can transition
between the types at any time. Per-GPU storage can be specified via
[Multi]Tex[ture]Parameter{if}[v] with PER_GPU_STORAGE_NV for the <pname> argument and TRUE for
the value. For this storage parameter to take effect, it must be specified after the texture
object is created and before the texture contents are defined by TexImage*, TexStorage* or
TextureStorage*.
20.3.1 Copying Image Data Between GPUs
To copy texel data between GPUs, the client may use the command:
void MulticastCopyImageSubDataNV(
uint srcGpu, bitfield dstGpuMask,
uint srcName, enum srcTarget,
int srcLevel,
int srcX, int srcY, int srcZ,
uint dstName, enum dstTarget,
int dstLevel,
int dstX, int dstY, int dstZ,
sizei srcWidth, sizei srcHeight, sizei srcDepth);
This command operates equivalently to CopyImageSubData, except that it takes a source GPU and
a destination GPU set defined by <srcGpu> and <dstGpuMask> (respectively). Texel data is
copied from the source GPU to all destination GPUs. The following errors apply to
MulticastCopyImageSubDataNV:
INVALID_ENUM is generated
* if either <srcTarget> or <dstTarget>
- is not RENDERBUFFER or a valid non-proxy texture target
- is TEXTURE_BUFFER, or
- is one of the cubemap face selectors described in table 3.17,
* if the target does not match the type of the object.
INVALID_OPERATION is generated
* if either object is a texture and the texture is not complete,
* if the source and destination formats are not compatible,
* if the source and destination number of samples do not match,
* if one image is compressed and the other is uncompressed and the
block size of the compressed image is not equal to the texel size
of the uncompressed image.
INVALID_VALUE is generated
* if <srcGpu> is greater than or equal to MULTICAST_GPUS_NV,
* if <dstGpuMask> is zero,
* if <dstGpuMask> is greater than or equal to 2^n, where n is equal to
MULTICAST_GPUS_NV,
* if either <srcName> or <dstName> does not correspond to a valid
renderbuffer or texture object according to the corresponding
target parameter, or
* if the specified level is not a valid level for the image, or
* if the dimensions of either subregion exceed the boundaries
of the corresponding image object, or
* if the image format is compressed and the dimensions of the
subregion fail to meet the alignment constraints of the format.
To copy pixel values from one GPU to another use the following command:
void MulticastBlitFramebufferNV(uint srcGpu, uint dstGpu,
int srcX0, int srcY0, int srcX1, int srcY1,
int dstX0, int dstY0, int dstX1, int dstY1,
bitfield mask, enum filter);
This command operates equivalently to BlitNamedFramebuffer except that it takes a source GPU
and a destination GPU defined by <srcGpu> and <dstGpu> (respectively). Pixel values are
copied from the read framebuffer on the source GPU to the draw framebuffer on the destination
GPU.
In addition to the errors generated by BlitNamedFramebuffer (see listing starting on page
634), calling MulticastBlitFramebufferNV will generate INVALID_VALUE if <srcGpu> or <dstGpu>
is greater than or equal to MULTICAST_GPUS_NV.
20.3.2 Per-GPU Sample Locations
Programmable sample locations can be customized for each GPU and framebuffer using the
following command:
void MulticastFramebufferSampleLocationsfvNV(uint gpu, uint framebuffer, uint start,
sizei count, const float *v);
An INVALID_OPERATION error is generated by MulticastFramebufferSampleLocationsfvNV if
<framebuffer> is not the name of an existing framebuffer object.
INVALID_VALUE is generated if the sum of <start> and <count> is greater than
PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB.
An INVALID_VALUE error is generated if <gpu> is greater than or equal to MULTICAST_GPUS_NV.
This is equivalent to FramebufferSampleLocationsfvARB except that it sets
MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV at the appropriate offset for the specified GPU.
Just as with FramebufferSampleLocationsfvARB, FRAMEBUFFER_PROGRAMMABLE_SAMPLE_LOCATIONS_ARB
must be enabled for these sample locations to take effect. FramebufferSampleLocationsfvARB
and NamedFramebufferSampleLocationsfvARB also set MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV
but for the specified sample across all multicast GPUs. If <gpu> is 0,
MulticastFramebufferSampleLocationsfvNV updates PROGRAMMABLE_SAMPLE_LOCATION_ARB in addition
to MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV.
The programmed sample locations can be retrieved using GetMultisamplefv with <pname> set to
MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV and indices calculated as follows:
index_x = gpu * PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB + 2 * sample_i;
index_y = gpu * PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB + 2 * sample_i + 1;
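A minimal sketch of this index computation, with table_size standing for the queried value of
PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB (the helper names are illustrative):

```c
#include <assert.h>

/* Index of the x coordinate of sample sample_i on GPU gpu, per the
 * formulas above. */
static unsigned location_index_x(unsigned gpu, unsigned table_size,
                                 unsigned sample_i)
{
    return gpu * table_size + 2u * sample_i;
}

/* The y coordinate immediately follows the x coordinate. */
static unsigned location_index_y(unsigned gpu, unsigned table_size,
                                 unsigned sample_i)
{
    return gpu * table_size + 2u * sample_i + 1u;
}
```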
20.4 Interactions with Other Copy Functions
Many existing commands can be used to copy between resources with GPU-shared, per-GPU or
undefined storage. For example: ReadPixels, GetBufferSubData or TexImage2D with a pixel
unpack buffer. The following table defines how the storage of the resource influences the
behavior of these copies.
Table 20.1 Behavior of Copy Commands with Multi-GPU Storage
Source      Destination  Behavior
----------  -----------  -----------------------------------------------------------------
GPU-shared  GPU-shared   There is just one source and one destination. Copy from source
                         to destination.
GPU-shared  per-GPU      There is a single source. Copy it to the destination on all GPUs.
GPU-shared  undefined    Either of the above behaviors for a GPU-shared source may apply.
per-GPU     GPU-shared   Copy from the GPU with the lowest index set in RENDER_GPU_MASK_NV
                         to the shared destination.
per-GPU     per-GPU      Implementations are encouraged to copy from source to destination
                         separately on each GPU. This is not required. If and when this is
                         not feasible, the copy should source from the GPU with the lowest
                         index set in RENDER_GPU_MASK_NV.
per-GPU     undefined    Either of the above behaviors for a per-GPU source may apply.
undefined   GPU-shared   Either of the above behaviors for a GPU-shared destination may apply.
undefined   per-GPU      Either of the above behaviors for a per-GPU destination may apply.
undefined   undefined    Any of the above behaviors may apply.
20.5 Multi-GPU Synchronization
MulticastCopyImageSubDataNV and MulticastCopyBufferSubDataNV each provide implicit
synchronization with previous work on the source GPU. MulticastBlitFramebufferNV is
different, providing implicit synchronization with previous work on the destination GPU.
In both cases, synchronization of the copies can be achieved with calls to the barrier
command:
void MulticastBarrierNV(void);
This is called to block all GPUs until all previous commands have been completed by all GPUs,
and all writes have landed. To guarantee consistency, synchronization must be placed between
any two accesses by multiple GPUs to the same memory when at least one of the accesses is a
write. This includes accesses to both the source and the destination. The safest approach is
to call MulticastBarrierNV immediately before and after each copy that involves multiple GPUs.
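That placement can be illustrated with stubbed commands. The stub_* functions below only
record call order so the pattern can be checked without a GL context; they are not real GL
entry points, and their parameters are omitted:

```c
#include <assert.h>
#include <string.h>

/* Tiny command log filled in by the stubs. */
static const char *cmd_log[8];
static int cmd_count;

static void stub_MulticastBarrierNV(void)          { cmd_log[cmd_count++] = "barrier"; }
static void stub_MulticastCopyImageSubDataNV(void) { cmd_log[cmd_count++] = "copy"; }

/* The safest pattern described above: fence a cross-GPU copy on both sides. */
static void synchronized_copy(void)
{
    stub_MulticastBarrierNV();           /* prior writes land on all GPUs  */
    stub_MulticastCopyImageSubDataNV();  /* the cross-GPU transfer         */
    stub_MulticastBarrierNV();           /* copied data visible everywhere */
}
```

MulticastWaitSyncNV, introduced below, can replace either barrier when only a one-directional
dependency between specific GPUs is needed.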
GPU writes and reads to/from GPU-shared locations require synchronization as well. GPU writes
such as transform feedback, shader image store, CopyTexImage, CopyBufferSubData are not
automatically synchronized with writes by other GPUs. Neither are GPU reads such as texture
fetches, shader image loads, CopyTexImage, etc. synchronized with writes by other GPUs.
Existing barriers such as TextureBarrier and MemoryBarrier only provide consistency guarantees
for rendering, writes and reads on a single GPU.
In some cases it may be desirable to have one or more GPUs wait for an operation to complete
on another GPU without synchronizing all GPUs with MulticastBarrierNV. This can be performed
with the following command:
void MulticastWaitSyncNV(uint signalGpu, bitfield waitGpuMask);
INVALID_VALUE is generated
* if <signalGpu> is greater than or equal to MULTICAST_GPUS_NV,
* if <waitGpuMask> is zero,
* if <waitGpuMask> is greater than or equal to 2^n, where n is equal to
MULTICAST_GPUS_NV, or
* if <signalGpu> is present in <waitGpuMask>.
MulticastWaitSyncNV provides the same consistency guarantees as MulticastBarrierNV but only
between the GPUs specified by <signalGpu> and <waitGpuMask> in a single direction. It forces
the GPUs specified by waitGpuMask to wait until the GPU specified by <signalGpu> has completed
all previous commands and writes associated with those commands.
20.6 Multi-GPU Queries
Queries are performed across all multicast GPUs. Each query object stores independent result
values for each GPU. The result value for a specific GPU can be queried using one of the
following commands:
void MulticastGetQueryObjectivNV(uint gpu, uint id, enum pname, int *params);
void MulticastGetQueryObjectuivNV(uint gpu, uint id, enum pname, uint *params);
void MulticastGetQueryObjecti64vNV(uint gpu, uint id, enum pname, int64 *params);
void MulticastGetQueryObjectui64vNV(uint gpu, uint id, enum pname, uint64 *params);
The behavior of these commands matches the GetQueryObject* equivalent commands, except they
return the result value for the specified GPU. A query may be available on one GPU but not on
another, so it may be necessary to check QUERY_RESULT_AVAILABLE for each GPU. GetQueryObject*
return query results and availability for GPU 0 only.
In addition to the errors generated by GetQueryObject* (see the listing in section 4.2 on page
49), calling MulticastGetQueryObject* will generate INVALID_VALUE if <gpu> is greater than or
equal to MULTICAST_GPUS_NV.
Additions to Chapter 8 of the OpenGL 4.5 (Compatibility Profile) Specification
(Textures and Samplers)
Modify Section 8.10 (Texture Parameters)
Insert the following paragraph before Table 8.25 (Texture parameters and their values):
If <pname> is PER_GPU_STORAGE_NV, then the state is stored in the texture, but only takes
effect the next time storage is allocated for a texture using TexImage*, TexStorage* or
TextureStorage*. If the value of TEXTURE_IMMUTABLE_FORMAT is TRUE, then PER_GPU_STORAGE_NV
cannot be changed and an error is generated.
Additions to Table 8.26 Texture parameters and their values
Name Type Legal values
------------------ ------- ------------
PER_GPU_STORAGE_NV boolean TRUE, FALSE
Additions to Chapter 10 of the OpenGL 4.5 (Compatibility Profile) Specification
(Vertex Specification and Drawing Commands)
Modify Section 10.9 (Conditional Rendering)
Replace the following text:
If the result (SAMPLES_PASSED) of the query is zero, or if the result (ANY_SAMPLES_PASSED
or ANY_SAMPLES_PASSED_CONSERVATIVE) is FALSE, all rendering commands described in
section 2.4 are discarded and have no effect when issued between BeginConditionalRender
and the corresponding EndConditionalRender
with this text:
For each active render GPU, if the result (SAMPLES_PASSED) of the query on that GPU is
zero, or if the result (ANY_SAMPLES_PASSED or ANY_SAMPLES_PASSED_CONSERVATIVE) is FALSE,
all rendering commands described in section 2.4 are discarded by this GPU and have no
effect when issued between BeginConditionalRender and the corresponding
EndConditionalRender
Similarly replace the following:
If the result (SAMPLES_PASSED) of the query is non-zero, or if the result
(ANY_SAMPLES_PASSED or ANY_SAMPLES_PASSED_CONSERVATIVE) is TRUE, such commands are not
discarded.
with this:
For each active render GPU, if the result (SAMPLES_PASSED) of the query on that GPU is
non-zero, or if the result (ANY_SAMPLES_PASSED or ANY_SAMPLES_PASSED_CONSERVATIVE) is
TRUE, such commands are not discarded.
Finally, replace all instances of "the GL" with "each active render GPU".
Additions to Chapter 14 of the OpenGL 4.5 (Compatibility Profile) Specification
(Fixed-Function Primitive Assembly and Rasterization)
Modify Section 14.3.1 (Multisampling)
Replace the following text:
The location for sample <i> is taken from v[2*(i-start)] and v[2*(i-start)+1].
with the following:
These commands set the sample locations for all multicast GPUs in
MULTICAST_FRAMEBUFFER_PROGRAMMABLE_SAMPLE_LOCATIONS_NV. The location for sample <i> on
gpu <g> is taken from v[g*N+2*(i-start)] and v[g*N+2*(i-start)+1].
Replace the following error generated by GetMultisamplefv:
An INVALID_ENUM error is generated if <pname> is not SAMPLE_LOCATION_ARB or
PROGRAMMABLE_SAMPLE_LOCATION_ARB.
with the following:
An INVALID_ENUM error is generated if <pname> is not SAMPLE_LOCATION_ARB,
PROGRAMMABLE_SAMPLE_LOCATION_ARB or MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV.
Add the following to the list of errors generated by GetMultisamplefv:
An INVALID_VALUE error is generated if <pname> is
MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV and <index> is greater than or equal to the
value of PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB multiplied by the value of
MULTICAST_GPUS_NV.
Replace the following pseudocode (in both locations):
float *table = FRAMEBUFFER_PROGRAMMABLE_SAMPLE_LOCATIONS_ARB;
sample_location.xy = (table[2*sample_i], table[2*sample_i+1]);
with the following:
float *table = MULTICAST_FRAMEBUFFER_PROGRAMMABLE_SAMPLE_LOCATIONS_NV;
table += PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB * gpu;
sample_location.xy = (table[2*sample_i], table[2*sample_i+1]);
Additions to the WGL/GLX/EGL/AGL Specifications
None
Dependencies on ARB_sample_locations
If ARB_sample_locations is not supported, section 20.3.2 and any references to
MulticastFramebufferSampleLocationsfvNV and MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV should
be removed. The modifications to Section 14.3.1 (Multisampling) should also be removed.
Dependencies on ARB_sparse_buffer
If ARB_sparse_buffer is not supported, any reference to SPARSE_STORAGE_BIT_ARB should be
removed.
Interactions with EXT_bindable_uniform
When using the functionality of EXT_bindable_uniform and a per-GPU storage buffer is bound
to a bindable location in a program object, client uniform updates apply to all GPUs.
An INVALID_OPERATION error is generated if a buffer with PER_GPU_STORAGE_BIT_NV is bound to a
program object's bindable location and GetUniformfv, GetUniformiv, GetUniformuiv or
GetUniformdv is called.
Errors
Relaxation of INVALID_ENUM errors
---------------------------------
GetBooleanv, GetIntegerv, GetInteger64v, GetFloatv, and GetDoublev now accept new tokens as
described in the "New Tokens" section.
New State
Additions to Table 23.4 Rasterization
Initial
Get Value Type Get Command Value Description Sec. Attribute
-------------------------- ------ ----------- ----- ----------------------- ---- ---------
RENDER_GPU_MASK_NV Z+ GetIntegerv * Mask of GPUs that have 20.1 -
writes enabled
* See section 20.1
Additions to Table 23.19 Textures (state per texture object)
Initial
Get Value Type Get Command Value Description Sec.
--------- ---- ----------- ------- ----------- ----
PER_GPU_STORAGE_NV B GetTexParameter FALSE Per-GPU storage requested 20.3
Additions to Table 23.30 Framebuffer (state per framebuffer object)
Get Value Get Command Type Initial Value Description Sec. Attribute
--------- ----------- ---- ------------- ----------- ---- ---------
MULTICAST_PROGRAMMABLE_- GetMultisamplefv * (0.5,0.5) Programmable sample 20.3.2 -
SAMPLE_LOCATION_NV
* The type here is "2* x n x 2 x R[0,1]" which is equivalent to PROGRAMMABLE_SAMPLE_LOCATION_ARB
but with sample locations for all multicast GPUs (one after the other).
New Implementation Dependent State
Add to Table 23.82, Implementation-Dependent Values, p. 784
Minimum
Get Value Type Get Command Value Description Sec. Attribute
---------------------------- ------ ------------- ----- ---------------------- ---- ---------
MULTICAST_GPUS_NV Z+ GetIntegerv 1 Number of linked GPUs 20.0 -
usable for multicast
Backwards Compatibility
This extension replaces NVX_linked_gpu_multicast. The enumerant values for MULTICAST_GPUS_NV
and PER_GPU_STORAGE_BIT_NV match those of MAX_LGPU_GPUS_NVX and LGPU_SEPARATE_STORAGE_BIT_NVX
(respectively). MulticastBufferSubDataNV, MulticastCopyImageSubDataNV and MulticastBarrierNV
behave analogously to LGPUNamedBufferSubDataNVX, LGPUCopyImageSubDataNVX and LGPUInterlockNVX
(respectively).
Sample Code
Binocular stereo rendering example using NV_gpu_multicast with single GPU fallback:
struct ViewData {
GLint viewport_index;
GLfloat mvp[16];
GLfloat modelview[16];
};
ViewData leftViewData = { 0, {...}, {...} };
ViewData rightViewData = { 1, {...}, {...} };
GLuint ubo[2];
glCreateBuffers(2, &ubo[0]);
if (has_NV_gpu_multicast) {
glNamedBufferStorage(ubo[0], size, NULL, GL_PER_GPU_STORAGE_BIT_NV | GL_DYNAMIC_STORAGE_BIT);
glMulticastBufferSubDataNV(0x1, ubo[0], 0, size, &leftViewData);
glMulticastBufferSubDataNV(0x2, ubo[0], 0, size, &rightViewData);
} else {
glNamedBufferStorage(ubo[0], size, &leftViewData, 0);
glNamedBufferStorage(ubo[1], size, &rightViewData, 0);
}
glViewportIndexedf(0, 0, 0, 640, 480); // left viewport
glViewportIndexedf(1, 640, 0, 640, 480); // right viewport
// Vertex shader sets gl_ViewportIndex according to viewport_index in UBO
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
if (has_NV_gpu_multicast) {
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
drawScene();
// Make GPU 1 wait for glClear above to complete on GPU 0
glMulticastWaitSyncNV(0, 0x2);
// Copy right viewport from GPU 1 to GPU 0
glMulticastCopyImageSubDataNV(1, 0x1,
renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0,
renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0,
640, 480, 1);
// Make GPU 0 wait for GPU 1 copy to GPU 0
glMulticastWaitSyncNV(1, 0x1);
} else {
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
drawScene();
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[1]);
drawScene();
}
// Both viewports are now present in GPU 0's renderbuffer
Issues
(1) Should we provide explicit inter-GPU synchronization API? Will this make the implementation
easier or harder for the driver and applications?
RESOLVED. Yes. A naive implementation of implicit synchronization would simply synchronize the
GPUs before and after each copy. Smart implicit synchronization would have to track all APIs
that can modify buffers and textures, creating an excessive burden for driver implementation
and maintenance. An application can track dependencies more easily and outperform a naive
driver implementation using explicit synchronization.
(2) How does this extension interact with queries (e.g. occlusion queries)?
RESOLVED. Queries are performed separately on each GPU. The standard GetQueryObject* APIs
return query results for GPU 0 only. However, GetQueryBufferObject* can be used to retrieve
query results for all GPUs through a buffer with per-GPU storage (PER_GPU_STORAGE_BIT_NV).
(3) Are copy operations controlled by the render mask?
RESOLVED. Copies which write to the framebuffer are considered render commands and implicitly
controlled by the render mask. Copies between textures and buffers are not considered render
commands so they are not influenced by the mask. If masked copies are desired, use
MulticastCopyImageSubDataNV, MulticastCopyBufferSubDataNV or MulticastBlitFramebufferNV.
These commands explicitly specify the GPU source and destination and are not influenced by the
render mask.
(4) What happens if the MulticastCopyBufferSubDataNV source and destination buffer is the same?
RESOLVED. When the source and destination involve the same GPU, MulticastCopyBufferSubDataNV
matches the behavior of CopyBufferSubData: overlapped copies are not allowed and an
INVALID_VALUE error results. When the source and destination do not involve the same GPU,
overlapping copies are allowed and no error is generated.
(5) How does this extension interact with CopyTexImage2D?
RESOLVED. The behavior depends on the storage type of the target. See section 20.4. Since
CopyTexImage* sources from the framebuffer, the source always has per-GPU storage.
(6) Should we provide a mechanism to modify viewports independently for each GPU?
RESOLVED. No. This can be achieved using multicast UBOs and ARB_shader_viewport_layer_array.
(7) Should we add a present API that automatically displays content from a specific GPU? It
could abstract the transport mechanism, copying when necessary.
RESOLVED. No. Transfers should be avoided to maximize performance and minimize latency.
Minimizing transfers requires application awareness of display connectivity to assign
rendering appropriately. Hiding transfers behind an API would also prevent some interesting
multi-GPU rendering techniques (e.g. checkerboard-style split rendering).
WGL_NV_bridged_display can be used to enable display from multiple GPUs without copies.
(8) Should we expose the extension on single-GPU configurations?
RESOLVED. Yes, this is recommended. It allows more code sharing between multi-GPU and
single-GPU code paths. If there is only one GPU present MULTICAST_GPUS_NV will be 1. It
may also be 1 if explicit GPU control is unavailable (e.g. if the active multi-GPU rendering
mode prevents it). Note that in revisions 5 and prior of this extension the minimum for
MULTICAST_GPUS_NV was 2.
(9) Should glGet*BufferParameter* return the PER_GPU_STORAGE_BIT_NV bit when
BUFFER_STORAGE_FLAGS is queried?
RESOLVED. Yes. BUFFER_STORAGE_FLAGS must match the flags parameter input to *BufferStorage, as
specified in table 6.3.
(10) Can a query be complete/available on one GPU and not another?
RESOLVED. Yes. Independent query completion is important for conditional rendering. It
allows each GPU to begin conditional rendering in mode QUERY_WAIT without waiting on other
GPUs.
(11) How can custom texel data be uploaded to each GPU for a given texture?
RESOLVED. The easiest way is to create staging textures with the custom texel data and then
copy them to a texture with per-GPU storage using MulticastCopyImageSubDataNV.
(12) Should we allow the waitGpuMask in MulticastWaitSyncNV to include the signal GPU?
RESOLVED. No. There is no reason for a GPU to wait on itself. This is effectively a no-op in
the command stream. Furthermore it is easy to confuse GPU indices and masks, so it is
beneficial to explicitly generate an error in this case.
(13) Will support for NVX_linked_gpu_multicast continue?
RESOLVED. NVX_linked_gpu_multicast is deprecated and applications should switch to
NV_gpu_multicast. However, implementations are encouraged to continue supporting
NVX_linked_gpu_multicast for backwards compatibility.
(14) Does RenderGpuMaskNV work with immediate mode rendering?
RESOLVED. Yes, the render GPU mask applies to immediate mode rendering the same as other
rendering. Note that RenderGpuMaskNV is not one of the commands allowed between Begin and End
(see section 10.7.5) so the render mask must be set before Begin is called.
Revision History
Rev. Date Author Changes
---- -------- -------- -----------------------------------------------
7 04/02/19 jschnarr clarify that the interactions with uniform APIs only apply to
EXT_bindable_uniform (not ARB_uniform_buffer_object).
optionally allow MulticastCopyBufferSubDataNV with buffers lacking
per-GPU storage
6 01/03/19 jschnarr reduce MULTICAST_GPUS_NV minimum to 1
clarify that MULTICAST_GPUS_NV is constant for a context
5 10/07/16 jschnarr trivial typo fix
4 07/21/16 mjk registered
3 06/15/16 jschnarr R370 release