| Name | 
 |  | 
 |     NVX_linked_gpu_multicast | 
 |  | 
 | Name Strings | 
 |  | 
 |     GL_NVX_linked_gpu_multicast | 
 |  | 
 | Contact | 
 |  | 
 |     Joshua Schnarr, NVIDIA Corporation (jschnarr 'at' nvidia.com) | 
 |     Ingo Esser, NVIDIA Corporation (iesser 'at' nvidia.com) | 
 |  | 
 | Contributors | 
 |  | 
 |     Christoph Kubisch, NVIDIA | 
 |     Mark Kilgard, NVIDIA | 
 |  | 
 | Status | 
 |  | 
 |     Shipping in NVIDIA release 361 drivers. | 
 |  | 
 | Version | 
 |  | 
 |     Last Modified Date:         July 21, 2016 | 
 |     NVIDIA Revision:            4 | 
 |  | 
 | Number | 
 |  | 
 |     OpenGL Extension #493 | 
 |  | 
 | Dependencies | 
 |  | 
 |     This extension is written against the OpenGL 4.5 specification (Compatibility Profile), dated | 
 |     February 2, 2015. | 
 |  | 
 |     This extension interacts with ARB_sparse_buffer. | 
 |  | 
 |     This extension interacts with ARB_copy_image. | 
 |  | 
 |     This extension interacts with EXT_direct_state_access. | 
 |  | 
 |     This extension interacts with ARB_shader_viewport_layer_array. | 
 |  | 
 | Overview | 
 |  | 
 |     This extension enables novel multi-GPU rendering techniques by providing application control | 
 |     over a group of linked GPUs with identical hardware configuration. | 
 |  | 
 |     Multi-GPU rendering techniques fall into two categories: implicit and explicit.  Existing | 
 |     explicit approaches like WGL_NV_gpu_affinity have two main drawbacks: CPU overhead and | 
 |     application complexity.  An application must manage one context per GPU and multi-pump the API | 
 |     stream.  Implicit multi-GPU rendering techniques avoid these issues by broadcasting rendering | 
 |     from one context to multiple GPUs.  Common implicit approaches include alternate-frame | 
 |     rendering (AFR), split-frame rendering (SFR) and multi-GPU anti-aliasing.  They each have | 
 |     drawbacks.  AFR scales nicely but interacts poorly with inter-frame dependencies.  SFR can | 
 |     improve latency but has challenges with offscreen rendering and scaling of vertex processing. | 
 |     With multi-GPU anti-aliasing, each GPU renders the same content with alternate sample | 
 |     positions and the driver blends the result to improve quality.  This also has issues with | 
 |     offscreen rendering and can conflict with other anti-aliasing techniques. | 
 |      | 
 |     These issues with implicit multi-GPU rendering all have the same root cause: the driver lacks | 
 |     adequate knowledge to accelerate every application.  To resolve this, NVX_linked_gpu_multicast | 
 |     provides application control over multiple GPUs with a single context. | 
 |  | 
 |     Key points: | 
 |  | 
 |     - One context controls multiple GPUs.  Every GPU in the linked group can access every object. | 
 |  | 
 |     - Rendering is broadcast.  Each draw is repeated across all GPUs in the linked group. | 
 |  | 
 |     - Each GPU gets its own instance of all framebuffers and attached textures, allowing | 
 |       individualized output for each GPU.  Input data can be customized for each GPU using buffers | 
 |       created with the storage flag LGPU_SEPARATE_STORAGE_BIT_NVX and the new command | 
 |       LGPUNamedBufferSubDataNVX. | 
 |  | 
 |     - Textures can be transferred from one GPU to another using LGPUCopyImageSubDataNVX. | 
 |      | 
 |      | 
 | New Procedures and Functions | 
 |  | 
 |     void LGPUNamedBufferSubDataNVX( | 
 |         bitfield gpuMask, uint buffer, | 
 |         intptr offset, sizeiptr size, | 
 |         const void *data); | 
 |  | 
 |     void LGPUCopyImageSubDataNVX( | 
 |         uint sourceGpu, bitfield destinationGpuMask, | 
 |         uint srcName, enum srcTarget,  | 
 |         int srcLevel, | 
 |         int srcX, int srcY, int srcZ, | 
 |         uint dstName, enum dstTarget, | 
 |         int dstLevel, | 
 |         int dstX, int dstY, int dstZ, | 
 |         sizei width, sizei height, sizei depth); | 
 |  | 
 |     void LGPUInterlockNVX(void); | 
 |      | 
 | New Tokens | 
 |  | 
 |     Accepted in the <flags> parameter of BufferStorage and | 
 |     NamedBufferStorageEXT: | 
 |  | 
 |         LGPU_SEPARATE_STORAGE_BIT_NVX               0x0800 | 
 |  | 
 |     Accepted by the <pname> parameter of GetBooleanv, GetIntegerv, | 
 |     GetInteger64v, GetFloatv, and GetDoublev: | 
 |  | 
 |         MAX_LGPU_GPUS_NVX                           0x92BA | 
 |  | 
 | Additions to the OpenGL 4.5 Specification (Compatibility Profile) | 
 |  | 
 |     (Add a new chapter after chapter 19 "Compute Shaders") | 
 |  | 
 |     20 Multicast Rendering | 
 |  | 
 |     This chapter specifies commands for using multiple GPUs in a linked group.  Commands are | 
 |     multicast, or repeated across all linked GPUs.  Objects are shared by all GPUs; however, each | 
 |     GPU has its own instance (copy) of many resources, including framebuffers.  When each GPU has | 
 |     its own instance of a resource, it is considered to have per-GPU storage.  When all GPUs share | 
 |     a single instance of a resource, this is considered GPU-shared storage.  | 
 |  | 
 |     The mechanism for linking GPUs is implementation specific, as is the process-global mechanism | 
 |     for enabling multicast rendering support (if necessary).  The number of GPUs usable for | 
 |     multicast rendering by a context can be queried by calling GetIntegerv with the symbolic | 
 |     constant MAX_LGPU_GPUS_NVX.  Individual GPUs are identified using zero-based indices in the | 
 |     range [0, n-1], where n is the number of multicast GPUs.  GPUs are also identified by | 
 |     bitmasks of the form 2^i, where i is the GPU index.  A set of GPUs is specified by the union of | 
 |     masks for each GPU in the set. | 
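
    The index and bitmask conventions above can be sketched in plain C.  This is a minimal
    illustration, not part of the extension: the helper names are hypothetical, gpuCount stands
    for the value the application queried for MAX_LGPU_GPUS_NVX, and rejecting mask bits at or
    above gpuCount is an application-side sanity check assumed here rather than a rule stated
    by this specification.

```c
#include <assert.h>
#include <stdint.h>

/* Bitmask identifying a single GPU: 2^i for zero-based GPU index i. */
uint32_t lgpu_mask_from_index(unsigned gpuIndex)
{
    assert(gpuIndex < 32u);  /* a 32-bit bitfield can name at most 32 GPUs */
    return (uint32_t)1u << gpuIndex;
}

/* Mask naming every GPU in the range [0, gpuCount-1]. */
uint32_t lgpu_mask_all(unsigned gpuCount)
{
    assert(gpuCount >= 1u && gpuCount <= 32u);
    return (gpuCount == 32u) ? 0xFFFFFFFFu : (((uint32_t)1u << gpuCount) - 1u);
}

/* Nonzero when the mask is non-empty and names only GPUs below gpuCount. */
int lgpu_mask_is_valid(uint32_t mask, unsigned gpuCount)
{
    return mask != 0u && (mask & ~lgpu_mask_all(gpuCount)) == 0u;
}
```

    A set of GPUs is then formed with bitwise OR, for example
    lgpu_mask_from_index(0) | lgpu_mask_from_index(1) for the first two GPUs.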
 |  | 
 |     20.1 Multi-GPU Buffer Storage | 
 |  | 
 |     Like other resources, buffer objects can have two types of storage, per-GPU storage or | 
 |     GPU-shared storage.  Per-GPU storage can be explicitly requested using the | 
 |     LGPU_SEPARATE_STORAGE_BIT_NVX flag with BufferStorage/NamedBufferStorageEXT.  If this flag is | 
 |     not set, the type of storage used is undefined.  The implementation may use either type | 
 |     and transition between them at any time.  Client reads of a buffer with per-GPU storage may | 
 |     source from any GPU. | 
 |  | 
 |     The following rules apply to buffer objects with per-GPU storage: | 
 |  | 
 |       When mapped with WRITE_ONLY access, writes apply to all GPUs. | 
 |       When bound to UNIFORM_BUFFER, client uniform updates apply to all GPUs. | 
 |       When used as the write buffer for CopyBufferSubData or CopyNamedBufferSubData, writes apply to | 
 |       all GPUs. | 
 |  | 
 |     The following commands affect storage on all GPUs, even if the buffer object has per-GPU | 
 |     storage: | 
 |  | 
 |       BufferSubData, NamedBufferSubData, ClearBufferSubData, and ClearNamedBufferData | 
 |  | 
 |     An INVALID_VALUE error is generated if BufferStorage/NamedBufferStorageEXT is called with | 
 |     LGPU_SEPARATE_STORAGE_BIT_NVX set in combination with MAP_PERSISTENT_BIT or | 
 |     SPARSE_STORAGE_BIT_ARB. | 
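
    An application can check for the rejected flag combination before creating a buffer.  The
    sketch below is illustrative only: lgpu_storage_flags_ok is a hypothetical helper, the token
    values are repeated locally so the fragment compiles without GL headers, and only the rule
    introduced by this extension is checked (the existing BufferStorage flag rules are not).

```c
#include <assert.h>

/* Token values copied locally so the sketch builds without GL headers. */
#define GL_MAP_PERSISTENT_BIT            0x0040
#define GL_SPARSE_STORAGE_BIT_ARB        0x0400
#define GL_LGPU_SEPARATE_STORAGE_BIT_NVX 0x0800

/* Returns 1 when <flags> avoids the combination rejected with INVALID_VALUE. */
int lgpu_storage_flags_ok(unsigned flags)
{
    const unsigned conflicting = GL_MAP_PERSISTENT_BIT | GL_SPARSE_STORAGE_BIT_ARB;

    if ((flags & GL_LGPU_SEPARATE_STORAGE_BIT_NVX) == 0u)
        return 1;  /* the rule only concerns per-GPU storage requests */
    return (flags & conflicting) == 0u;
}
```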
 |  | 
 |     To modify buffer object data on one or more GPUs, the client may use the command | 
 |  | 
 |     void LGPUNamedBufferSubDataNVX( | 
 |         bitfield gpuMask, uint buffer, | 
 |         intptr offset, sizeiptr size, | 
 |         const void *data); | 
 |  | 
 |     This function operates similarly to NamedBufferSubData, except that it updates the per-GPU | 
 |     buffer data on the set of GPUs defined by <gpuMask>.   | 
 |  | 
 |     An INVALID_VALUE error is generated if <gpuMask> is zero. | 
 |     An INVALID_OPERATION error is generated if <buffer> is not the name of an existing buffer | 
 |     object. | 
 |     An INVALID_VALUE error is generated if <offset> or <size> is negative, or if <offset> + <size> | 
 |     is greater than the value of BUFFER_SIZE for the buffer object. | 
 |     An INVALID_OPERATION error is generated if any part of the specified buffer range is mapped | 
 |     with MapBufferRange or MapBuffer (see section 6.3), unless it was mapped with | 
 |     MAP_PERSISTENT_BIT set in the MapBufferRange access flags. | 
 |     An INVALID_OPERATION error is generated if the BUFFER_IMMUTABLE_STORAGE flag of the buffer | 
 |     object is TRUE and the value of BUFFER_STORAGE_FLAGS for the buffer does not have the | 
 |     DYNAMIC_STORAGE_BIT set. | 
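
    The error conditions above can be modeled as a pure function over a snapshot of buffer
    state.  This is a sketch under stated assumptions, not driver behavior: BufferModel and
    lgpu_sub_data_error are hypothetical, a single mapped range is assumed, and the order in
    which the checks run is chosen for readability since the specification does not define
    error precedence.

```c
#include <assert.h>
#include <stddef.h>

#define GL_NO_ERROR            0
#define GL_INVALID_VALUE       0x0501
#define GL_INVALID_OPERATION   0x0502
#define GL_MAP_PERSISTENT_BIT  0x0040
#define GL_DYNAMIC_STORAGE_BIT 0x0100

/* Hypothetical snapshot of the buffer state the error rules refer to. */
typedef struct {
    int       exists;        /* name refers to an existing buffer object */
    ptrdiff_t size;          /* BUFFER_SIZE */
    int       immutable;     /* BUFFER_IMMUTABLE_STORAGE */
    unsigned  storageFlags;  /* BUFFER_STORAGE_FLAGS */
    int       mapped;        /* nonzero while a range is mapped */
    ptrdiff_t mapOffset;     /* start of the mapped range */
    ptrdiff_t mapLength;     /* length of the mapped range */
    unsigned  mapAccess;     /* MapBufferRange access flags */
} BufferModel;

/* Returns the error the listed rules would produce, or GL_NO_ERROR. */
unsigned lgpu_sub_data_error(unsigned gpuMask, const BufferModel *buf,
                             ptrdiff_t offset, ptrdiff_t size)
{
    if (gpuMask == 0u)
        return GL_INVALID_VALUE;
    if (buf == NULL || !buf->exists)
        return GL_INVALID_OPERATION;
    /* offset + size > BUFFER_SIZE, written to avoid signed overflow */
    if (offset < 0 || size < 0 || size > buf->size - offset)
        return GL_INVALID_VALUE;
    /* overlap with a non-persistent mapping */
    if (buf->mapped && (buf->mapAccess & GL_MAP_PERSISTENT_BIT) == 0u &&
        offset < buf->mapOffset + buf->mapLength &&
        buf->mapOffset < offset + size)
        return GL_INVALID_OPERATION;
    if (buf->immutable && (buf->storageFlags & GL_DYNAMIC_STORAGE_BIT) == 0u)
        return GL_INVALID_OPERATION;
    return GL_NO_ERROR;
}
```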
 |  | 
 |     20.2 Multi-GPU Framebuffers and Textures | 
 |  | 
 |     All buffers in the default framebuffer as well as renderbuffers and textures bound to | 
 |     framebuffer objects receive per-GPU storage.  Storage for other textures is undefined: it may | 
 |     be per-GPU or GPU-shared and can transition between the types at any time.  | 
 |  | 
 |     To copy texel data between GPUs, the client may use the command | 
 |  | 
 |     void LGPUCopyImageSubDataNVX( | 
 |         uint sourceGpu, bitfield destinationGpuMask, | 
 |         uint srcName, enum srcTarget,  | 
 |         int srcLevel, | 
 |         int srcX, int srcY, int srcZ, | 
 |         uint dstName, enum dstTarget, | 
 |         int dstLevel, | 
 |         int dstX, int dstY, int dstZ, | 
 |         sizei width, sizei height, sizei depth); | 
 |  | 
 |     This function operates similarly to CopyImageSubData, except that it takes a source GPU | 
 |     and a destination GPU set defined by <destinationGpuMask>. | 
 |  | 
 |     INVALID_ENUM is generated | 
 |      * if either <srcTarget> or <dstTarget>  | 
 |       - is not RENDERBUFFER or a valid non-proxy texture target, | 
 |       - is TEXTURE_BUFFER, or | 
 |       - is one of the cubemap face selectors described in table 3.17, | 
 |      * if the target does not match the type of the object. | 
 |  | 
 |     INVALID_OPERATION is generated | 
 |      * if either object is a texture and the texture is not complete, | 
 |      * if the source and destination formats are not compatible, | 
 |      * if the source and destination number of samples do not match, | 
 |      * if one image is compressed and the other is uncompressed and the | 
 |        block size of the compressed image is not equal to the texel size | 
 |        of the uncompressed image. | 
 |  | 
 |     INVALID_VALUE is generated | 
 |      * if <sourceGpu> is greater than or equal to MAX_LGPU_GPUS_NVX, | 
 |      * if <destinationGpuMask> is zero, | 
 |      * if either <srcName> or <dstName> does not correspond to a valid | 
 |        renderbuffer or texture object according to the corresponding | 
 |        target parameter, | 
 |      * if the specified level is not a valid level for the image, | 
 |      * if the dimensions of either subregion exceed the boundaries | 
 |        of the corresponding image object, or | 
 |      * if the image format is compressed and the dimensions of the | 
 |        subregion fail to meet the alignment constraints of the format. | 
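
    Three of the INVALID_VALUE conditions above depend only on the call arguments and the image
    dimensions, so they can be sketched without a GL context.  The types Origin and Extent and
    the function lgpu_copy_value_error are hypothetical, gpuCount stands for the queried value
    of MAX_LGPU_GPUS_NVX, and treating a negative origin as out of bounds is an assumption of
    this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define GL_NO_ERROR      0
#define GL_INVALID_VALUE 0x0501

/* Hypothetical helper types; not part of the extension. */
typedef struct { int x, y, z; } Origin;
typedef struct { int width, height, depth; } Extent;

/* 1 when [start, start + count) lies inside [0, limit). */
static int span_fits(int start, int count, int limit)
{
    return start >= 0 && count >= 0 && count <= limit - start;
}

/* 1 when the subregion lies inside the image on all three axes. */
static int region_fits(Origin o, int w, int h, int d, const Extent *image)
{
    return span_fits(o.x, w, image->width) &&
           span_fits(o.y, h, image->height) &&
           span_fits(o.z, d, image->depth);
}

/* Checks the GPU selection and subregion bounds only. */
unsigned lgpu_copy_value_error(unsigned sourceGpu, unsigned destinationGpuMask,
                               unsigned gpuCount,
                               Origin src, const Extent *srcImage,
                               Origin dst, const Extent *dstImage,
                               int width, int height, int depth)
{
    assert(srcImage != NULL && dstImage != NULL);
    if (sourceGpu >= gpuCount)
        return GL_INVALID_VALUE;
    if (destinationGpuMask == 0u)
        return GL_INVALID_VALUE;
    if (!region_fits(src, width, height, depth, srcImage) ||
        !region_fits(dst, width, height, depth, dstImage))
        return GL_INVALID_VALUE;
    return GL_NO_ERROR;
}
```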
 |  | 
 |  | 
 |     20.3 Multi-GPU Synchronization | 
 |  | 
 |     LGPUCopyImageSubDataNVX provides implicit synchronization with previous rendering to the given | 
 |     texture or renderbuffer on the source GPU.  Synchronization of the copy with the destination | 
 |     GPU(s) is achieved with the interlock function: | 
 |  | 
 |       void LGPUInterlockNVX(void); | 
 |  | 
 |     This is called to synchronize all linked GPUs to the same point in the API stream.  To | 
 |     guarantee consistency, the interlock command must be used as a barrier between any two | 
 |     accesses by multiple GPUs to the same memory when at least one of the accesses is a write. | 
 |     For consistent copies between GPUs, synchronization is required before and after each copy: | 
 |      | 
 |     1. Prior to each call to LGPUCopyImageSubDataNVX, LGPUInterlockNVX() must be called after | 
 |     the most recent read or write of the target image by a destination GPU. | 
 |  | 
 |     2. After each call to LGPUCopyImageSubDataNVX, LGPUInterlockNVX() must be called  | 
 |     prior to any future read or write of the target image by a destination GPU. | 
 |  | 
 |     GPU writes and reads to/from GPU-shared locations require synchronization as well.  GPU writes | 
 |     such as transform feedback, shader image store, CopyTexImage, CopyBufferSubData are not | 
 |     automatically synchronized with writes by other GPUs.  Neither are GPU reads such as texture | 
 |     fetches, shader image loads, CopyTexImage, etc. synchronized with writes by other GPUs. | 
 |     Existing barriers such as TextureBarrier and MemoryBarrier only provide consistency guarantees | 
 |     for rendering, writes and reads on a single GPU. | 
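
    The two copy rules in this section can be expressed as a small state machine that an
    application might use in debug builds to catch a missing interlock.  This is a simplified,
    hypothetical model: it tracks a single image, does not distinguish individual GPUs, and
    none of the names below are part of the extension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-image bookkeeping for the two copy rules. */
typedef struct {
    int accessedSinceInterlock;  /* a destination GPU read or wrote the image */
    int copyPendingInterlock;    /* a copy was issued with no interlock after it */
} CopyHazardTracker;

/* Record LGPUInterlockNVX: all GPUs reach the same point in the API stream. */
void tracker_on_interlock(CopyHazardTracker *t)
{
    assert(t != NULL);
    t->accessedSinceInterlock = 0;
    t->copyPendingInterlock = 0;
}

/* Record a read or write by a destination GPU.  Returns 1 if rule 2 holds. */
int tracker_on_access(CopyHazardTracker *t)
{
    int safe;
    assert(t != NULL);
    safe = !t->copyPendingInterlock;
    t->accessedSinceInterlock = 1;
    return safe;
}

/* Record LGPUCopyImageSubDataNVX.  Returns 1 if rule 1 holds. */
int tracker_on_copy(CopyHazardTracker *t)
{
    int safe;
    assert(t != NULL);
    safe = !t->accessedSinceInterlock;
    t->copyPendingInterlock = 1;
    return safe;
}
```

    With this model, the sequence access, interlock, copy, interlock, access reports no hazard,
    while omitting either interlock makes the following copy or access report one.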
 |      | 
 |            | 
 | Additions to the AGL/GLX/WGL Specifications | 
 |  | 
 |     None | 
 |  | 
 | GLX Protocol | 
 |  | 
 |     None | 
 |  | 
 | Errors | 
 |  | 
 |     Relaxation of INVALID_ENUM errors | 
 |     --------------------------------- | 
 |     GetBooleanv, GetIntegerv, GetInteger64v, GetFloatv, and GetDoublev now accept new tokens as | 
 |     described in the "New Tokens" section. | 
 |  | 
 | New State | 
 |  | 
 |     None | 
 |  | 
 | New Implementation Dependent State | 
 |  | 
 |     Add to Table 23.82, Implementation-Dependent Values, p. 784 | 
 |  | 
 |                                                 Minimum | 
 |     Get Value               Type   Get Command  Value   Description               Sec.  Attribute | 
 |     ----------------------  ----   -----------  ------- -----------------------   ----  --------- | 
 |     MAX_LGPU_GPUS_NVX        Z+   GetIntegerv      2    Maximum number of         6.9     - | 
 |                                                         usable GPUs | 
 | Sample Code | 
 |  | 
 |     Binocular stereo rendering example using NVX_linked_gpu_multicast with single GPU fallback: | 
 |     | 
 |     struct ViewData { | 
 |         GLint viewport_index; | 
 |         GLfloat mvp[16]; | 
 |         GLfloat modelview[16]; | 
 |     }; | 
 |     ViewData leftViewData = { 0, {...}, {...} }; | 
 |     ViewData rightViewData = { 1, {...}, {...} }; | 
 |  | 
 |     GLuint ubo[2]; | 
 |     glCreateBuffers(2, &ubo[0]); | 
 |  | 
 |     if (has_NVX_linked_gpu_multicast) { | 
 |         glNamedBufferStorage(ubo[0], size, NULL, GL_LGPU_SEPARATE_STORAGE_BIT_NVX | GL_DYNAMIC_STORAGE_BIT); | 
 |         glLGPUNamedBufferSubDataNVX(0x1, ubo[0], 0, size, &leftViewData); | 
 |         glLGPUNamedBufferSubDataNVX(0x2, ubo[0], 0, size, &rightViewData); | 
 |     } else { | 
 |         glNamedBufferStorage(ubo[0], size, &leftViewData, 0); | 
 |         glNamedBufferStorage(ubo[1], size, &rightViewData, 0); | 
 |     } | 
 |  | 
 |     glViewportIndexedf(0, 0, 0, 640, 480);  // left viewport | 
 |     glViewportIndexedf(1, 640, 0, 640, 480);  // right viewport | 
 |     // Vertex shader sets gl_ViewportIndex according to viewport_index in UBO | 
 |  | 
 |     glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); | 
 |  | 
 |     if (has_NVX_linked_gpu_multicast) { | 
 |         glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]); | 
 |         drawScene(); | 
 |         // Make GPU 1 wait for glClear above to complete on GPU 0 | 
 |         glLGPUInterlockNVX(); | 
 |         // Copy right viewport from GPU 1 to GPU 0 | 
 |         glLGPUCopyImageSubDataNVX(1, 0x1, | 
 |                                   renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0, | 
 |                                   renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0, | 
 |                                   640, 480, 1); | 
 |         // Make GPU 0 wait for GPU 1 copy to GPU 0 | 
 |         glLGPUInterlockNVX(); | 
 |     } else { | 
 |         glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]); | 
 |         drawScene(); | 
 |         glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[1]); | 
 |         drawScene(); | 
 |     } | 
 |     // Both viewports are now present in GPU 0's renderbuffer | 
 |  | 
 | Issues | 
 |  | 
 |   (1) Should we provide an explicit inter-GPU synchronization API?  Will this make the implementation | 
 |     easier or harder for the driver and applications? | 
 |  | 
 |     RESOLVED. Yes. A naive implementation of implicit synchronization would simply interlock the | 
 |     GPUs before and after each copy.  Smart implicit synchronization would have to track all APIs | 
 |     that can modify buffers and textures, creating an excessive burden for driver implementation | 
 |     and maintenance.  An application can track dependencies more easily and outperform a naive | 
 |     driver implementation using explicit synchronization. | 
 |  | 
 |   (2) How does this extension interact with queries (e.g. occlusion queries)? | 
 |  | 
 |     RESOLVED. Queries are performed separately on each GPU. The standard GetQueryObject* APIs | 
 |     return query results for GPU 0 only. However, GetQueryBufferObject* can be used to retrieve | 
 |     query results for all GPUs through a buffer with separate storage | 
 |     (LGPU_SEPARATE_STORAGE_BIT_NVX). | 
 |  | 
 |   (3) Which textures and buffers have separate storage for each GPU? | 
 |    | 
 |     The default framebuffer and framebuffer texture attachments. Also buffers allocated with | 
 |     LGPU_SEPARATE_STORAGE_BIT_NVX. Other buffers and textures may or may not have separate storage. | 
 |  | 
 |   (4) Should we provide a mechanism to modify viewports independently for each GPU? | 
 |  | 
 |     RESOLVED. No. This can be achieved using multicast UBOs and ARB_shader_viewport_layer_array. | 
 |  | 
 |   (5) Should we expose this extension on single-GPU configurations? | 
 |  | 
 |     RESOLVED. No. The extension provides no value unless MAX_LGPU_GPUS_NVX > 1.  Limiting exposure | 
 |     to these configurations guarantees that at least two GPUs will be available when the extension | 
 |     is reported. | 
 |  | 
 |   (6) Can rendering be enabled/disabled on a specific subset of GPUs? | 
 |  | 
 |     This functionality will be added in a future version of this extension. | 
 |  | 
 |   (7) Should glGet*BufferParameter* return the LGPU_SEPARATE_STORAGE_BIT_NVX bit when | 
 |     BUFFER_STORAGE_FLAGS is queried? | 
 |  | 
 |     RESOLVED. Yes. BUFFER_STORAGE_FLAGS must match the flags parameter input to *BufferStorage, as | 
 |     specified in table 6.3. | 
 |  | 
 | Revision History | 
 |  | 
 |     Rev.    Date    Author    Changes | 
 |     ----  --------  --------  ----------------------------------------- | 
 |      4    07/21/16  mjk       Register extension |