Name Strings
Joshua Schnarr, NVIDIA Corporation (jschnarr 'at'
Ingo Esser, NVIDIA Corporation (iesser 'at'
Christoph Kubisch, NVIDIA
Mark Kilgard, NVIDIA
Shipping in NVIDIA release 361 drivers.
Last Modified Date: July 21, 2016
NVIDIA Revision: 4
OpenGL Extension #493
This extension is written against the OpenGL 4.5 specification (Compatibility Profile), dated
February 2, 2015.
This extension interacts with ARB_sparse_buffer.
This extension interacts with ARB_copy_image.
This extension interacts with EXT_direct_state_access.
This extension interacts with ARB_shader_viewport_layer_array.
This extension enables novel multi-GPU rendering techniques by providing application control
over a group of linked GPUs with identical hardware configuration.
Multi-GPU rendering techniques fall into two categories: implicit and explicit. Existing
explicit approaches like WGL_NV_gpu_affinity have two main drawbacks: CPU overhead and
application complexity. An application must manage one context per GPU and multi-pump the API
stream. Implicit multi-GPU rendering techniques avoid these issues by broadcasting rendering
from one context to multiple GPUs. Common implicit approaches include alternate-frame
rendering (AFR), split-frame rendering (SFR) and multi-GPU anti-aliasing. They each have
drawbacks. AFR scales nicely but interacts poorly with inter-frame dependencies. SFR can
improve latency but has challenges with offscreen rendering and scaling of vertex processing.
With multi-GPU anti-aliasing, each GPU renders the same content with alternate sample
positions and the driver blends the result to improve quality. This also has issues with
offscreen rendering and can conflict with other anti-aliasing techniques.
These issues with implicit multi-GPU rendering all have the same root cause: the driver lacks
adequate knowledge to accelerate every application. To resolve this, NVX_linked_gpu_multicast
provides application control over multiple GPUs with a single context.
Key points:
- One context controls multiple GPUs. Every GPU in the linked group can access every object.
- Rendering is broadcast. Each draw is repeated across all GPUs in the linked group.
- Each GPU gets its own instance of all framebuffers and attached textures, allowing
individualized output for each GPU. Input data can be customized for each GPU using buffers
created with the storage flag LGPU_SEPARATE_STORAGE_BIT_NVX and a new API,
LGPUNamedBufferSubDataNVX.
- Textures can be transferred from one GPU to another using LGPUCopyImageSubDataNVX.
New Procedures and Functions
void LGPUNamedBufferSubDataNVX(
bitfield gpuMask, uint buffer,
intptr offset, sizeiptr size,
const void *data);
void LGPUCopyImageSubDataNVX(
uint sourceGpu, bitfield destinationGpuMask,
uint srcName, enum srcTarget,
int srcLevel,
int srcX, int srcY, int srcZ,
uint dstName, enum dstTarget,
int dstLevel,
int dstX, int dstY, int dstZ,
sizei width, sizei height, sizei depth);
void LGPUInterlockNVX(void);
New Tokens
Accepted in the <flags> parameter of BufferStorage and
Accepted by the <pname> parameter of GetBooleanv, GetIntegerv,
GetInteger64v, GetFloatv, and GetDoublev:
Additions to the OpenGL 4.5 Specification (Compatibility Profile)
(Add a new chapter after chapter 19 "Compute Shaders")
20 Multicast Rendering
This chapter specifies commands for using multiple GPUs in a linked group. Commands are
multicast, or repeated across all linked GPUs. Objects are shared by all GPUs; however, each
GPU has its own instance (copy) of many resources, including framebuffers. When each GPU has
its own instance of a resource, it is considered to have per-GPU storage. When all GPUs share
a single instance of a resource, this is considered GPU-shared storage.
The mechanism for linking GPUs is implementation specific, as is the process-global mechanism
for enabling multicast rendering support (if necessary). The number of GPUs usable for
multicast rendering by a context can be queried by calling GetIntegerv with the symbolic
constant MAX_LGPU_GPUS_NVX. Individual GPUs are identified using zero-based indices in the
range [0, n-1], where n is the number of multicast GPUs. GPUs are also identified by
bitmasks of the form 2^i, where i is the GPU index. A set of GPUs is specified by the union of
masks for each GPU in the set.
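The index-to-mask mapping described above can be sketched in plain C. The helper names below are hypothetical and are not part of the extension; the sketch assumes the GPU count has already been queried through MAX_LGPU_GPUS_NVX and that masks fit in a 32-bit bitfield.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bitmask 2^i for the GPU with zero-based index i. */
static uint32_t lgpu_mask_from_index(unsigned index)
{
    assert(index < 32u); /* a 32-bit bitfield can name at most 32 GPUs */
    return (uint32_t)1u << index;
}

/* Hypothetical helper: union of the masks for a list of GPU indices. */
static uint32_t lgpu_mask_from_indices(const unsigned *indices, size_t count)
{
    uint32_t mask = 0;
    for (size_t i = 0; i < count; ++i)
        mask |= lgpu_mask_from_index(indices[i]);
    return mask;
}

/* Hypothetical helper: mask selecting every GPU in a group of gpu_count GPUs. */
static uint32_t lgpu_mask_all(unsigned gpu_count)
{
    assert(gpu_count >= 1u && gpu_count <= 32u);
    return gpu_count == 32u ? UINT32_MAX : (((uint32_t)1u << gpu_count) - 1u);
}
```

With two linked GPUs, GPU 0 maps to 0x1, GPU 1 maps to 0x2, and the set containing both is 0x3, which matches the masks used in the sample code of this extension.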
20.1 Multi-GPU Buffer Storage
Like other resources, buffer objects can have two types of storage, per-GPU storage or
GPU-shared storage. Per-GPU storage can be explicitly requested using the
LGPU_SEPARATE_STORAGE_BIT_NVX flag with BufferStorage/NamedBufferStorageEXT. If this flag is
not set, the type of storage used is undefined. The implementation may use either type
and transition between them at any time. Client reads of a buffer with per-GPU storage may
source from any GPU.
The following rules apply to buffer objects with per-GPU storage:
When mapped with WRITE_ONLY access, writes apply to all GPUs.
When bound to UNIFORM_BUFFER, client uniform updates apply to all GPUs.
When used as the write buffer for CopyBufferSubData or CopyNamedBufferSubData, writes apply to
all GPUs.
The following commands affect storage on all GPUs, even if the buffer object has per-GPU
storage:
BufferSubData, NamedBufferSubData, ClearBufferSubData, and ClearNamedBufferData
An INVALID_VALUE error is generated if BufferStorage/NamedBufferStorageEXT is called with
To modify buffer object data on one or more GPUs, the client may use the command
void LGPUNamedBufferSubDataNVX(
bitfield gpuMask, uint buffer,
intptr offset, sizeiptr size,
const void *data);
This function operates similarly to NamedBufferSubData, except that it updates the per-GPU
buffer data on the set of GPUs defined by <gpuMask>.
An INVALID_VALUE error is generated if <gpuMask> is zero.
An INVALID_OPERATION error is generated if <buffer> is not the name of an existing buffer
object.
An INVALID_VALUE error is generated if <offset> or <size> is negative, or if <offset> + <size>
is greater than the value of BUFFER_SIZE for the buffer object.
An INVALID_OPERATION error is generated if any part of the specified buffer range is mapped
with MapBufferRange or MapBuffer (see section 6.3), unless it was mapped with
MAP_PERSISTENT_BIT set in the MapBufferRange access flags.
An INVALID_OPERATION error is generated if the BUFFER_IMMUTABLE_STORAGE flag of the buffer
object is TRUE and the value of BUFFER_STORAGE_FLAGS for the buffer does not have the
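As an illustration only, the first three error rules for LGPUNamedBufferSubDataNVX can be modeled in plain C. This is a sketch of the rules as stated in this section, not driver code: the ModelBuffer type and the function name are hypothetical, the order in which the checks run is an assumption (the text does not define one), and the mapped-range and immutable-storage rules are left out.

```c
#include <stdint.h>

/* Error codes for this model only. The numeric values mirror GL_NO_ERROR,
 * GL_INVALID_VALUE and GL_INVALID_OPERATION from the OpenGL headers. */
enum {
    MODEL_NO_ERROR          = 0,
    MODEL_INVALID_VALUE     = 0x0501,
    MODEL_INVALID_OPERATION = 0x0502
};

/* Hypothetical description of a buffer object as the checks see it. */
typedef struct {
    int     exists; /* nonzero if the name refers to an existing buffer object */
    int64_t size;   /* value of BUFFER_SIZE, assumed non-negative */
} ModelBuffer;

/* Returns the error the modeled rules would generate, or MODEL_NO_ERROR. */
static int model_check_lgpu_sub_data(uint32_t gpu_mask, const ModelBuffer *buf,
                                     int64_t offset, int64_t size)
{
    if (gpu_mask == 0)
        return MODEL_INVALID_VALUE;      /* <gpuMask> is zero */
    if (!buf->exists)
        return MODEL_INVALID_OPERATION;  /* <buffer> is not an existing buffer */
    if (offset < 0 || size < 0)
        return MODEL_INVALID_VALUE;      /* negative <offset> or <size> */
    if (offset > buf->size - size)
        return MODEL_INVALID_VALUE;      /* <offset> + <size> > BUFFER_SIZE,
                                            written to avoid integer overflow */
    return MODEL_NO_ERROR;
}
```

For a 256-byte buffer, an update of 16 bytes at offset 0 with mask 0x3 passes these checks, while the same update with mask 0 or at offset 250 is rejected with INVALID_VALUE.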
20.2 Multi-GPU Framebuffers and Textures
All buffers in the default framebuffer as well as renderbuffers and textures bound to
framebuffer objects receive per-GPU storage. Storage for other textures is undefined: it may
be per-GPU or GPU-shared and can transition between the types at any time.
To copy texel data between GPUs, the client may use the command
void LGPUCopyImageSubDataNVX(
uint sourceGpu, bitfield destinationGpuMask,
uint srcName, enum srcTarget,
int srcLevel,
int srcX, int srcY, int srcZ,
uint dstName, enum dstTarget,
int dstLevel,
int dstX, int dstY, int dstZ,
sizei width, sizei height, sizei depth);
This function operates similarly to CopyImageSubData, except that it takes a source GPU
and a destination GPU set defined by <destinationGpuMask>.
INVALID_ENUM is generated
* if either <srcTarget> or <dstTarget>
- is not RENDERBUFFER or a valid non-proxy texture target
- is one of the cubemap face selectors described in table 3.17,
* if the target does not match the type of the object.
INVALID_OPERATION is generated
* if either object is a texture and the texture is not complete,
* if the source and destination formats are not compatible,
* if the source and destination number of samples do not match,
* if one image is compressed and the other is uncompressed and the
block size of the compressed image is not equal to the texel size
of the uncompressed image.
INVALID_VALUE is generated
* if <sourceGpu> is greater than or equal to MAX_LGPU_GPUS_NVX,
* if <destinationGpuMask> is zero,
* if either <srcName> or <dstName> does not correspond to a valid
renderbuffer or texture object according to the corresponding
target parameter, or
* if the specified level is not a valid level for the image, or
* if the dimensions of either subregion exceed the boundaries
of the corresponding image object, or
* if the image format is compressed and the dimensions of the
subregion fail to meet the alignment constraints of the format.
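Some of the INVALID_VALUE conditions above depend only on the arguments and the image sizes, so they can be modeled without a GL context. The sketch below is an assumption-laden illustration: ModelExtent and both function names are hypothetical, gpu_count stands for the value of MAX_LGPU_GPUS_NVX, and the object-name, level, and compressed-format alignment checks are not modeled.

```c
#include <stdint.h>

/* Size of one image level in texels. Hypothetical type for this model;
 * all fields are assumed to be non-negative. */
typedef struct {
    int width;
    int height;
    int depth;
} ModelExtent;

/* Returns 1 if the box at (x, y, z) with the given size fits inside the image. */
static int model_region_in_bounds(ModelExtent image, int x, int y, int z,
                                  int width, int height, int depth)
{
    if (x < 0 || y < 0 || z < 0 || width < 0 || height < 0 || depth < 0)
        return 0;
    /* Compare against the remaining space to avoid overflow in x + width. */
    return width  <= image.width  - x &&
           height <= image.height - y &&
           depth  <= image.depth  - z;
}

/* Returns 1 if none of the modeled INVALID_VALUE conditions apply. */
static int model_copy_args_valid(unsigned source_gpu, uint32_t destination_gpu_mask,
                                 unsigned gpu_count,
                                 ModelExtent src_image, int src_x, int src_y, int src_z,
                                 ModelExtent dst_image, int dst_x, int dst_y, int dst_z,
                                 int width, int height, int depth)
{
    if (source_gpu >= gpu_count)    /* <sourceGpu> >= MAX_LGPU_GPUS_NVX */
        return 0;
    if (destination_gpu_mask == 0)  /* <destinationGpuMask> is zero */
        return 0;
    return model_region_in_bounds(src_image, src_x, src_y, src_z, width, height, depth) &&
           model_region_in_bounds(dst_image, dst_x, dst_y, dst_z, width, height, depth);
}
```

Assuming a 1280x480 renderbuffer and two linked GPUs, copying the 640x480 region at x = 640 from GPU 1 to mask 0x1 passes these checks; shifting the region to x = 641 does not.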
20.3 Multi-GPU Synchronization
LGPUCopyImageSubDataNVX provides implicit synchronization with previous rendering to the given
texture or renderbuffer on the source GPU. Synchronization of the copy with the destination
GPU(s) is achieved with the interlock function:
void LGPUInterlockNVX(void);
This is called to synchronize all linked GPUs to the same point in the API stream. To
guarantee consistency, the interlock command must be used as a barrier between any two
accesses by multiple GPUs to the same memory when at least one of the accesses is a write.
For consistent copies between GPUs, synchronization is required before and after each copy:
1. Prior to each call to LGPUCopyImageSubDataNVX, LGPUInterlockNVX() must be called after
the most recent read or write of the target image by a destination GPU.
2. After each call to LGPUCopyImageSubDataNVX, LGPUInterlockNVX() must be called
prior to any future read or write of the target image by a destination GPU.
GPU writes and reads to/from GPU-shared locations require synchronization as well. GPU writes
such as transform feedback, shader image stores, CopyTexImage, and CopyBufferSubData are not
automatically synchronized with writes by other GPUs. Likewise, GPU reads such as texture
fetches, shader image loads, and CopyTexImage are not automatically synchronized with writes
by other GPUs.
Existing barriers such as TextureBarrier and MemoryBarrier only provide consistency guarantees
for rendering, writes and reads on a single GPU.
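One way an application might track the barrier rule above is to record, for each piece of memory that GPUs share or copy into, which GPUs touched it last and whether that access was a write. The sketch below is one reading of the rule, not a statement of how any driver behaves: the ModelAccess type and function names are hypothetical, and both accesses are assumed to refer to the same memory.

```c
#include <stdint.h>

/* One access to a given piece of memory. Hypothetical type for this model. */
typedef struct {
    uint32_t gpu_mask; /* GPUs performing the access, one bit per GPU */
    int      is_write; /* nonzero for a write, zero for a read */
} ModelAccess;

/* Counts the GPUs named in a mask by clearing the lowest set bit each step. */
static unsigned model_gpu_count(uint32_t mask)
{
    unsigned count = 0;
    while (mask != 0) {
        mask &= mask - 1u;
        ++count;
    }
    return count;
}

/* Returns 1 if, under this reading of the rule, an interlock is needed
 * between two consecutive accesses to the same memory: at least one access
 * is a write and more than one GPU is involved overall. */
static int model_needs_interlock(ModelAccess earlier, ModelAccess later)
{
    int any_write = earlier.is_write || later.is_write;
    int multiple_gpus = model_gpu_count(earlier.gpu_mask | later.gpu_mask) > 1u;
    return any_write && multiple_gpus;
}
```

Under this model, rendering on GPU 0 followed by a copy that writes the same image on behalf of GPU 1 needs an interlock, while two reads by different GPUs, or a write and a read on a single GPU, do not. The single-GPU case is left to existing barriers such as TextureBarrier and MemoryBarrier.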
Additions to the AGL/GLX/WGL Specifications
GLX Protocol
Relaxation of INVALID_ENUM errors
GetBooleanv, GetIntegerv, GetInteger64v, GetFloatv, and GetDoublev now accept new tokens as
described in the "New Tokens" section.
New State
New Implementation Dependent State
Add to Table 23.82, Implementation-Dependent Values, p. 784
Get Value              Type  Get Command  Value  Description              Sec.  Attribute
---------------------  ----  -----------  -----  -----------------------  ----  ---------
MAX_LGPU_GPUS_NVX      Z+    GetIntegerv  2      Maximum number of        6.9   -
                                                 usable GPUs
Sample Code
Binocular stereo rendering example using NVX_linked_gpu_multicast with single GPU fallback:
struct ViewData {
    GLint viewport_index;
    GLfloat mvp[16];
    GLfloat modelview[16];
};
ViewData leftViewData = { 0, {...}, {...} };
ViewData rightViewData = { 1, {...}, {...} };
GLsizeiptr size = sizeof(ViewData);

GLuint ubo[2];
glCreateBuffers(2, &ubo[0]);

if (has_NVX_linked_gpu_multicast) {
    // One buffer with per-GPU storage; each GPU receives its own view data.
    // GL_DYNAMIC_STORAGE_BIT is assumed to be required for the updates below.
    glNamedBufferStorage(ubo[0], size, NULL,
                         GL_LGPU_SEPARATE_STORAGE_BIT_NVX | GL_DYNAMIC_STORAGE_BIT);
    glLGPUNamedBufferSubDataNVX(0x1, ubo[0], 0, size, &leftViewData);
    glLGPUNamedBufferSubDataNVX(0x2, ubo[0], 0, size, &rightViewData);
} else {
    glNamedBufferStorage(ubo[0], size, &leftViewData, 0);
    glNamedBufferStorage(ubo[1], size, &rightViewData, 0);
}

glViewportIndexedf(0, 0, 0, 640, 480);    // left viewport
glViewportIndexedf(1, 640, 0, 640, 480);  // right viewport
// Vertex shader sets gl_ViewportIndex according to viewport_index in UBO

glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

if (has_NVX_linked_gpu_multicast) {
    glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
    drawScene();  // placeholder for the application's draw calls

    // Make GPU 1 wait for glClear above to complete on GPU 0
    glLGPUInterlockNVX();

    // Copy right viewport from GPU 1 to GPU 0
    glLGPUCopyImageSubDataNVX(1, 0x1,
                              renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0,
                              renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0,
                              640, 480, 1);

    // Make GPU 0 wait for GPU 1 copy to GPU 0
    glLGPUInterlockNVX();
} else {
    glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
    drawScene();  // left view
    glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[1]);
    drawScene();  // right view
}
// Both viewports are now present in GPU 0's renderbuffer
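The sample relies on a has_NVX_linked_gpu_multicast flag without showing how it is set. A minimal sketch of the lookup follows, assuming the application has already collected the extension names into an array (for example by calling glGetStringi with GL_EXTENSIONS for each index) and that the name string follows the usual GL_ prefix convention. The helper name has_extension is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if `wanted` appears in the list of extension names, else 0.
 * Whole-string comparison avoids matching a name that is only a prefix
 * of another extension's name. */
static int has_extension(const char *const *names, size_t count, const char *wanted)
{
    for (size_t i = 0; i < count; ++i) {
        if (names[i] != NULL && strcmp(names[i], wanted) == 0)
            return 1;
    }
    return 0;
}
```

The result of has_extension(names, count, "GL_NVX_linked_gpu_multicast") could then be stored in has_NVX_linked_gpu_multicast before any of the LGPU entry points are called.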
Issues
(1) Should we provide an explicit inter-GPU synchronization API? Will this make the implementation
easier or harder for the driver and applications?
RESOLVED. Yes. A naive implementation of implicit synchronization would simply interlock the
GPUs before and after each copy. Smart implicit synchronization would have to track all APIs
that can modify buffers and textures, creating an excessive burden for driver implementation
and maintenance. An application can track dependencies more easily and outperform a naive
driver implementation using explicit synchronization.
(2) How does this extension interact with queries (e.g. occlusion queries)?
RESOLVED. Queries are performed separately on each GPU. The standard GetQueryObject* APIs
return query results for GPU 0 only. However, GetQueryBufferObject* can be used to retrieve
query results for all GPUs through a buffer with separate storage (LGPU_SEPARATE_STORAGE_BIT_NVX).
(3) Which textures and buffers have separate storage for each GPU?
The default framebuffer and framebuffer texture attachments have separate storage, as do buffers
allocated with LGPU_SEPARATE_STORAGE_BIT_NVX. Other buffers and textures may or may not have
separate storage.
(4) Should we provide a mechanism to modify viewports independently for each GPU?
RESOLVED. No. This can be achieved using multicast UBOs and ARB_shader_viewport_layer_array.
(5) Should we expose this extension on single-GPU configurations?
RESOLVED. No. The extension provides no value unless MAX_LGPU_GPUS_NVX > 1. Limiting exposure
to these configurations guarantees that at least two GPUs will be available when the extension
is reported.
(6) Can rendering be enabled/disabled on a specific subset of GPUs?
This functionality will be added in a future version of this extension.
(7) Should glGet*BufferParameter* return the LGPU_SEPARATE_STORAGE_BIT_NVX bit when
RESOLVED. Yes. BUFFER_STORAGE_FLAGS must match the flags parameter input to *BufferStorage, as
specified in table 6.3.
Revision History
Rev. Date Author Changes
---- -------- -------- -----------------------------------------
4 07/21/16 mjk Register extension