Name
NV_shader_thread_shuffle
Name Strings
GL_NV_shader_thread_shuffle
Contributors
Jeannot Breton, NVIDIA
Pat Brown, NVIDIA
Eric Werness, NVIDIA
Mark Kilgard, NVIDIA
Contact
Jeannot Breton, NVIDIA Corporation (jbreton 'at' nvidia.com)
Status
Shipping.
Version
Last Modified Date: 2/14/2014
NVIDIA Revision: 3
Number
OpenGL Extension #448
Dependencies
This extension is written against the OpenGL 4.3 (Compatibility Profile)
Specification.
This extension is written against version 4.30 (revision 07) of the OpenGL
Shading Language Specification.
OpenGL 4.3 and GLSL 4.3 are required.
This extension interacts with NV_gpu_program5.
Overview
Implementations of the OpenGL Shading Language may, but are not required to,
run multiple shader threads for a single stage as a SIMD thread group,
where individual execution threads are assigned to thread groups in an
undefined, implementation-dependent order. This extension provides a set
of new features to the OpenGL Shading Language to share data between
multiple threads within a thread group.
Shaders using the new functionality provided by this extension should
enable it via the construct
#extension GL_NV_shader_thread_shuffle : require (or enable)
This extension also specifies some modifications to the program assembly
language to support the thread data sharing functionality.
New Procedures and Functions
None
New Tokens
None
Modifications to The OpenGL Shading Language Specification, Version 4.30
(Revision 07)
Including the following line in a shader can be used to control the
language features described in this extension:
#extension GL_NV_shader_thread_shuffle : <behavior>
where <behavior> is as specified in section 3.3.
New preprocessor #defines are added to the OpenGL Shading Language:
#define GL_NV_shader_thread_shuffle 1
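The following is a minimal sketch of how a shader might use the preprocessor define above to guard its use of the extension. The compute shader stage, the work group size of 32, the buffer name "Values", and the do-nothing fallback path are illustrative assumptions and are not part of this extension.

```glsl
#version 430
// "enable" rather than "require": compilation still succeeds when the
// extension is unavailable, and the macro tested below is then undefined.
#extension GL_NV_shader_thread_shuffle : enable

layout(local_size_x = 32) in;

// Hypothetical storage buffer, used only for this illustration.
layout(std430, binding = 0) buffer Values { float values[]; };

void main()
{
    uint i = gl_GlobalInvocationID.x;
    float v = values[i];
#ifdef GL_NV_shader_thread_shuffle
    // Extension present: exchange values with the neighboring thread
    // (the thread whose id differs only in bit 0).
    v = shuffleXorNV(v, 1u, 32u);
#endif
    // Without the extension, this sketch writes the value back unchanged.
    values[i] = v;
}
```

The sketch assumes the dispatch covers exactly the elements of the buffer and that all threads of the thread group are active when the shuffle executes.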
Modify Section 8.3, Common Functions, p. 133
(add functions to share data between threads in a thread group)
Syntax:
int shuffleDownNV(int data, uint index, uint width,
[out bool threadIdValid])
ivec2 shuffleDownNV(ivec2 data, uint index, uint width,
[out bool threadIdValid])
ivec3 shuffleDownNV(ivec3 data, uint index, uint width,
[out bool threadIdValid])
ivec4 shuffleDownNV(ivec4 data, uint index, uint width,
[out bool threadIdValid])
uint shuffleDownNV(uint data, uint index, uint width,
[out bool threadIdValid])
uvec2 shuffleDownNV(uvec2 data, uint index, uint width,
[out bool threadIdValid])
uvec3 shuffleDownNV(uvec3 data, uint index, uint width,
[out bool threadIdValid])
uvec4 shuffleDownNV(uvec4 data, uint index, uint width,
[out bool threadIdValid])
float shuffleDownNV(float data, uint index, uint width,
[out bool threadIdValid])
vec2 shuffleDownNV(vec2 data, uint index, uint width,
[out bool threadIdValid])
vec3 shuffleDownNV(vec3 data, uint index, uint width,
[out bool threadIdValid])
vec4 shuffleDownNV(vec4 data, uint index, uint width,
[out bool threadIdValid])
bool shuffleDownNV(bool data, uint index, uint width,
[out bool threadIdValid])
bvec2 shuffleDownNV(bvec2 data, uint index, uint width,
[out bool threadIdValid])
bvec3 shuffleDownNV(bvec3 data, uint index, uint width,
[out bool threadIdValid])
bvec4 shuffleDownNV(bvec4 data, uint index, uint width,
[out bool threadIdValid])
int shuffleUpNV(int data, uint index, uint width,
[out bool threadIdValid])
ivec2 shuffleUpNV(ivec2 data, uint index, uint width,
[out bool threadIdValid])
ivec3 shuffleUpNV(ivec3 data, uint index, uint width,
[out bool threadIdValid])
ivec4 shuffleUpNV(ivec4 data, uint index, uint width,
[out bool threadIdValid])
uint shuffleUpNV(uint data, uint index, uint width,
[out bool threadIdValid])
uvec2 shuffleUpNV(uvec2 data, uint index, uint width,
[out bool threadIdValid])
uvec3 shuffleUpNV(uvec3 data, uint index, uint width,
[out bool threadIdValid])
uvec4 shuffleUpNV(uvec4 data, uint index, uint width,
[out bool threadIdValid])
float shuffleUpNV(float data, uint index, uint width,
[out bool threadIdValid])
vec2 shuffleUpNV(vec2 data, uint index, uint width,
[out bool threadIdValid])
vec3 shuffleUpNV(vec3 data, uint index, uint width,
[out bool threadIdValid])
vec4 shuffleUpNV(vec4 data, uint index, uint width,
[out bool threadIdValid])
bool shuffleUpNV(bool data, uint index, uint width,
[out bool threadIdValid])
bvec2 shuffleUpNV(bvec2 data, uint index, uint width,
[out bool threadIdValid])
bvec3 shuffleUpNV(bvec3 data, uint index, uint width,
[out bool threadIdValid])
bvec4 shuffleUpNV(bvec4 data, uint index, uint width,
[out bool threadIdValid])
int shuffleXorNV(int data, uint index, uint width,
[out bool threadIdValid])
ivec2 shuffleXorNV(ivec2 data, uint index, uint width,
[out bool threadIdValid])
ivec3 shuffleXorNV(ivec3 data, uint index, uint width,
[out bool threadIdValid])
ivec4 shuffleXorNV(ivec4 data, uint index, uint width,
[out bool threadIdValid])
uint shuffleXorNV(uint data, uint index, uint width,
[out bool threadIdValid])
uvec2 shuffleXorNV(uvec2 data, uint index, uint width,
[out bool threadIdValid])
uvec3 shuffleXorNV(uvec3 data, uint index, uint width,
[out bool threadIdValid])
uvec4 shuffleXorNV(uvec4 data, uint index, uint width,
[out bool threadIdValid])
float shuffleXorNV(float data, uint index, uint width,
[out bool threadIdValid])
vec2 shuffleXorNV(vec2 data, uint index, uint width,
[out bool threadIdValid])
vec3 shuffleXorNV(vec3 data, uint index, uint width,
[out bool threadIdValid])
vec4 shuffleXorNV(vec4 data, uint index, uint width,
[out bool threadIdValid])
bool shuffleXorNV(bool data, uint index, uint width,
[out bool threadIdValid])
bvec2 shuffleXorNV(bvec2 data, uint index, uint width,
[out bool threadIdValid])
bvec3 shuffleXorNV(bvec3 data, uint index, uint width,
[out bool threadIdValid])
bvec4 shuffleXorNV(bvec4 data, uint index, uint width,
[out bool threadIdValid])
int shuffleNV(int data, uint index, uint width,
[out bool threadIdValid])
ivec2 shuffleNV(ivec2 data, uint index, uint width,
[out bool threadIdValid])
ivec3 shuffleNV(ivec3 data, uint index, uint width,
[out bool threadIdValid])
ivec4 shuffleNV(ivec4 data, uint index, uint width,
[out bool threadIdValid])
uint shuffleNV(uint data, uint index, uint width,
[out bool threadIdValid])
uvec2 shuffleNV(uvec2 data, uint index, uint width,
[out bool threadIdValid])
uvec3 shuffleNV(uvec3 data, uint index, uint width,
[out bool threadIdValid])
uvec4 shuffleNV(uvec4 data, uint index, uint width,
[out bool threadIdValid])
float shuffleNV(float data, uint index, uint width,
[out bool threadIdValid])
vec2 shuffleNV(vec2 data, uint index, uint width,
[out bool threadIdValid])
vec3 shuffleNV(vec3 data, uint index, uint width,
[out bool threadIdValid])
vec4 shuffleNV(vec4 data, uint index, uint width,
[out bool threadIdValid])
bool shuffleNV(bool data, uint index, uint width,
[out bool threadIdValid])
bvec2 shuffleNV(bvec2 data, uint index, uint width,
[out bool threadIdValid])
bvec3 shuffleNV(bvec3 data, uint index, uint width,
[out bool threadIdValid])
bvec4 shuffleNV(bvec4 data, uint index, uint width,
[out bool threadIdValid])
Shuffle functions allow active threads within a thread group to exchange
data using 4 different modes (up, down, xor, indexed). They all take the
operand <data>, which can differ per thread, and return the value of <data>
read from a source thread whose id is computed from the <index> and <width>
operands.
<index> is a 5-bit value in the range 0 to 31; higher-order bits are ignored.
<threadIdValid> is an optional operand. It holds the value of the predicate
that specifies whether the source thread from which the current thread reads
data is in range.
<width> is used for segmenting the thread group into multiple segments. The
segments must all be the same size, so <width> needs to be a power of 2
in the range 2 to 32. Using a <width> of 32 treats the thread group as a
single segment. A <width> of 8 divides the thread group into
4 segments of size 8. Using a <width> that is not a power of 2, or that is
lower than 2 or larger than 32, will return an undefined value.
Threads can only share data within their own segment. Each thread
executing the built-in shuffle function will determine the ID of another
thread by combining its value of gl_ThreadInWarpNV with its value of
<index> as described below. Each such thread will attempt to read the value
of <data> held by that other thread and return that value to the caller.
When a shuffle function attempts to access the value of <data> from another
thread, it determines whether the other thread is in accessible range or
not. If it is in range, true will be returned in the optional
<threadIdValid> parameter, if provided by the caller. If it is out of
range, false will be returned in <threadIdValid>, if provided by the
caller, and the value returned by the function will come from the current
thread.
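As an illustration of the optional <threadIdValid> parameter and of segmentation by <width>, the sketch below computes a forward difference inside 8-thread segments. The buffer names "Samples" and "Deltas", the work group size, and the choice of 0.0 at segment boundaries are assumptions made for this example only.

```glsl
#version 430
#extension GL_NV_shader_thread_shuffle : require

layout(local_size_x = 32) in;

// Hypothetical storage buffers, used only for this illustration.
layout(std430, binding = 0) buffer Samples { float samples[]; };
layout(std430, binding = 1) buffer Deltas  { float deltas[];  };

void main()
{
    uint i = gl_GlobalInvocationID.x;
    float mine = samples[i];

    bool valid;
    // Read the value held by the thread one position higher within the
    // same 8-thread segment.
    float next = shuffleDownNV(mine, 1u, 8u, valid);

    // For the last thread of each segment the source is out of range:
    // valid is false and next is this thread's own value. Testing valid
    // makes the boundary handling explicit.
    deltas[i] = valid ? (next - mine) : 0.0;
}
```

Because the shuffle is executed unconditionally, every source thread is active when it is read, provided the whole thread group is running.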
The 4 modes use the following logic to compute the source thread index and
the <threadIdValid> value:
shuffleNV computes the source index using <index> as an absolute address
within the thread group segment.
srcThreadId = <index>
<threadIdValid> = <index> < <width>
For example, with this thread group segment:
-----------------
Thread Id |0|1|2|3|4|5|6|7|
-----------------
Thread <data> |a|b|c|d|e|f|g|h|
-----------------
If <index> is 2
-----------------
src thread Id |2|2|2|2|2|2|2|2|
-----------------
<threadIdValid> |1|1|1|1|1|1|1|1|
-----------------
result |c|c|c|c|c|c|c|c|
-----------------
If <index> is 9
-----------------
src thread Id |9|9|9|9|9|9|9|9|
-----------------
<threadIdValid> |0|0|0|0|0|0|0|0|
-----------------
result |a|b|c|d|e|f|g|h|
-----------------
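A common use of the indexed mode is a broadcast. The sketch below, written under the assumption of a compute shader with a hypothetical buffer named "Values", makes every thread read the value held by the first thread of its own 8-thread segment.

```glsl
#version 430
#extension GL_NV_shader_thread_shuffle : require

layout(local_size_x = 32) in;

// Hypothetical storage buffer, used only for this illustration.
layout(std430, binding = 0) buffer Values { float values[]; };

void main()
{
    uint i = gl_GlobalInvocationID.x;
    float v = values[i];

    // <index> 0 is an absolute position inside each 8-thread segment, so
    // every thread receives the value held by the first thread of its own
    // segment.
    float first = shuffleNV(v, 0u, 8u);

    // Store each value relative to its segment's first value.
    values[i] = v - first;
}
```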
shuffleUpNV subtracts <index> from the current thread id to get the source
thread id. This has the effect of shifting the segment up by <index>
threads. Source thread ids do not wrap around, so the lowest <index>
threads of the segment are left unchanged.
srcThreadId = currentThreadId - <index>
<threadIdValid> = srcThreadId >= 0
For example, with this thread group segment:
-----------------
Thread Id |0|1|2|3|4|5|6|7|
-----------------
Thread <data> |a|b|c|d|e|f|g|h|
-----------------
If <index> is 1
------------------
src thread Id |-1|0|1|2|3|4|5|6|
------------------
<threadIdValid> |0 |1|1|1|1|1|1|1|
------------------
result |a |a|b|c|d|e|f|g|
------------------
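One way the up mode might be used is an inclusive prefix sum over a 32-thread segment. The sketch below uses the Hillis-Steele scheme, which is a choice made for this example and not something this extension prescribes. The buffer name "Counts" and the work group size are likewise assumptions.

```glsl
#version 430
#extension GL_NV_shader_thread_shuffle : require

layout(local_size_x = 32) in;

// Hypothetical storage buffer, used only for this illustration.
layout(std430, binding = 0) buffer Counts { uint counts[]; };

void main()
{
    uint i = gl_GlobalInvocationID.x;
    uint sum = counts[i];

    // Inclusive scan: distances 1, 2, 4, 8, 16.
    for (uint offset = 1u; offset < 32u; offset <<= 1u) {
        bool valid;
        // Every thread executes the shuffle so that all sources are active.
        uint lower = shuffleUpNV(sum, offset, 32u, valid);
        // Threads whose source would fall below the segment get their own
        // value back with valid == false and must not add it.
        if (valid) {
            sum += lower;
        }
    }
    counts[i] = sum;
}
```

After the loop, each thread holds the sum of its own input and the inputs of all lower-numbered threads in the segment.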
shuffleDownNV adds <index> to the current thread id to get the source
thread id. This has the effect of shifting the segment down by
<index> threads. Source thread ids do not wrap around, so the highest
<index> threads of the segment are left unchanged.
srcThreadId = currentThreadId + <index>
<threadIdValid> = srcThreadId < <width>
For example, with this thread group segment:
-----------------
Thread Id |0|1|2|3|4|5|6|7|
-----------------
Thread <data> |a|b|c|d|e|f|g|h|
-----------------
If <index> is 2
-----------------
src thread Id |2|3|4|5|6|7|8|9|
-----------------
<threadIdValid> |1|1|1|1|1|1|0|0|
-----------------
result |c|d|e|f|g|h|g|h|
-----------------
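The down mode lends itself to a tree reduction. The sketch below sums 32 values into thread 0 of the thread group. It assumes that NV_shader_thread_group is also available (for gl_ThreadInWarpNV) and that a work group of 32 invocations occupies a single thread group, which is implementation-dependent. The buffer names "Values" and "Totals" are hypothetical.

```glsl
#version 430
#extension GL_NV_shader_thread_shuffle : require
#extension GL_NV_shader_thread_group : require   // for gl_ThreadInWarpNV

layout(local_size_x = 32) in;

// Hypothetical storage buffers, used only for this illustration.
layout(std430, binding = 0) buffer Values { float values[]; };
layout(std430, binding = 1) buffer Totals { float totals[]; };

void main()
{
    float sum = values[gl_GlobalInvocationID.x];

    // Halve the distance at each step: 16, 8, 4, 2, 1.
    for (uint offset = 16u; offset > 0u; offset >>= 1u) {
        // Out-of-range threads receive their own value back; their partial
        // sums become meaningless but are never consumed by thread 0.
        sum += shuffleDownNV(sum, offset, 32u);
    }

    // Only thread 0 is guaranteed to hold the complete total.
    if (gl_ThreadInWarpNV == 0u) {
        totals[gl_WorkGroupID.x] = sum;
    }
}
```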
shuffleXorNV does a bitwise xor between the <index> and the current
thread id to get the source thread id:
srcThreadId = currentThreadId ^ <index>
<threadIdValid> = srcThreadId < <width>
For example, with this thread group segment:
-----------------
Thread Id |0|1|2|3|4|5|6|7|
-----------------
Thread <data> |a|b|c|d|e|f|g|h|
-----------------
If <index> is 0x1
-----------------
src thread Id |1|0|3|2|5|4|7|6|
-----------------
<threadIdValid> |1|1|1|1|1|1|1|1|
-----------------
result |b|a|d|c|f|e|h|g|
-----------------
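The xor mode is a natural fit for a butterfly exchange, in which every thread ends up with the reduced result. The sketch below computes a maximum over 32 threads. The buffer name "Values" and the work group size are assumptions for illustration.

```glsl
#version 430
#extension GL_NV_shader_thread_shuffle : require

layout(local_size_x = 32) in;

// Hypothetical storage buffer, used only for this illustration.
layout(std430, binding = 0) buffer Values { float values[]; };

void main()
{
    uint i = gl_GlobalInvocationID.x;
    float m = values[i];

    // Butterfly pattern: pair up threads whose ids differ in bit 4, then
    // bit 3, and so on down to bit 0. With a <width> of 32 the xor of two
    // 5-bit ids always stays inside the segment.
    for (uint laneMask = 16u; laneMask > 0u; laneMask >>= 1u) {
        m = max(m, shuffleXorNV(m, laneMask, 32u));
    }

    // Every thread now holds the maximum of all 32 inputs.
    values[i] = m;
}
```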
Dependencies on NV_gpu_program5
If NV_gpu_program5 is supported and "OPTION NV_shader_thread_shuffle" is
specified in an assembly program, the following edits are made to extend
the assembly programming model documented in the NV_gpu_program4 extension
and extended by NV_gpu_program5.
If NV_gpu_program5 is not supported, or if
"OPTION NV_shader_thread_shuffle" is not specified in an assembly program,
the contents of this dependencies section should be ignored.
Section 2.X.2, Program Grammar
(add the following rules to the grammar)
<VECTORop> ::= "SHFDOWN"
| "SHFIDX"
| "SHFUP"
| "SHFXOR"
Modify Section 2.X.4, Program Execution Environment
(Add the table entries and relevant text describing the program
instructions to exchange data between threads.)
Instr- Modifiers
uction V F I C S H D Out Inputs Description
------- -- - - - - - - --- -------- --------------------------------
...
SHFDOWN 50 X X - - - - F v v,vu,vu warp shuffle with added index
SHFIDX 50 X X - - - - F v v,vu,vu warp shuffle with absolute index
SHFUP 50 X X - - - - F v v,vu,vu warp shuffle with subtracted index
SHFXOR 50 X X - - - - F v v,vu,vu warp shuffle with XORed index
...
(Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension,
as extended by NV_gpu_program5)
+ Shader thread shuffle (NV_shader_thread_shuffle)
If a program specifies the "NV_shader_thread_shuffle" option, it may use
the "SHFXOR", "SHFDOWN", "SHFIDX" and "SHFUP" instructions. If this option
is not specified, a program will fail to compile if it uses those
instructions.
Section 2.X.8.Z, SHFDOWN: warp shuffle with added index
The SHFDOWN instruction allows a 32-bit scalar value to be exchanged
between multiple threads within a thread group. The instruction has 3
operands as input. The first operand is a 32-bit scalar. This value will
be shared between threads; it can be a float, a signed integer, or an
unsigned integer. The second operand is an unsigned integer index in the
range 0 to 31. It is used to compute the thread from which the current
thread will read the 32-bit scalar value. For the SHFDOWN instruction, the
source thread id is the id of the current thread plus the index operand.
The last operand is an unsigned integer mask. The mask is used for
segmenting the thread group and limiting the source thread index. Bits 0
to 4 of <mask> are a clamp value that limits the source thread index, and
bits 8 to 12 are a segmentation mask used to divide the thread group into
multiple smaller groups. Together the clamp value and the segmentation
mask generate 2 internal values, minThreadId and maxThreadId, using the
following logic:
minThreadId = current thread id & segmentationMask
maxThreadId = minThreadId | (clamp & ~segmentationMask)
These 2 values segment the thread group by restricting the range of thread
ids that a specific thread can access.
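The minThreadId and maxThreadId logic above can be mirrored in GLSL to inspect the values a given mask would produce. This is only a sketch of the arithmetic, not of the instruction itself. The helper names, the buffer "Ranges", and the example mask 0x181F (segmentation mask 0x18, clamp 0x1F) are assumptions chosen for illustration, and gl_ThreadInWarpNV is assumed to be provided by NV_shader_thread_group.

```glsl
#version 430
#extension GL_NV_shader_thread_group : require   // for gl_ThreadInWarpNV

layout(local_size_x = 32) in;

// Hypothetical storage buffer, used only for this illustration.
layout(std430, binding = 0) buffer Ranges { uvec2 ranges[]; };

// Bits 0..4 of the mask operand: the clamp value.
uint clampOf(uint mask) { return mask & 0x1Fu; }

// Bits 8..12 of the mask operand: the segmentation mask.
uint segmentationMaskOf(uint mask) { return (mask >> 8u) & 0x1Fu; }

// Returns (minThreadId, maxThreadId) for one thread.
uvec2 threadIdRange(uint threadId, uint mask)
{
    uint seg = segmentationMaskOf(mask);
    uint minThreadId = threadId & seg;
    uint maxThreadId = minThreadId | (clampOf(mask) & ~seg);
    return uvec2(minThreadId, maxThreadId);
}

void main()
{
    // With mask 0x181F, thread 13 gets minThreadId = 13 & 0x18 = 8 and
    // maxThreadId = 8 | (0x1F & ~0x18) = 8 | 7 = 15.
    ranges[gl_GlobalInvocationID.x] = threadIdRange(gl_ThreadInWarpNV, 0x181Fu);
}
```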
SHFDOWN returns a 2-component vector. The first component is a predicate
that is TRUE when the computed source thread id is in range and FALSE when
it is out of bounds. For SHFDOWN, the source thread id is in range when it
is lower than maxThreadId. The second component holds a 32-bit value.
When the source thread id is in range, this value comes from the source
thread. When the source thread id is out of range, the value is read from
the current thread. If the source thread id refers to an inactive
thread, the returned result is undefined.
SHFDOWN supports all data type modifiers. For floating-point data types,
the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data
types, the TRUE value is -1 and the FALSE value is 0. For unsigned integer
data types, the TRUE value is the maximum integer value (all bits are ones)
and the FALSE value is zero.
Section 2.X.8.Z, SHFIDX: warp shuffle with absolute index
The SHFIDX instruction allows a 32-bit scalar value to be exchanged between
multiple threads within a thread group. The instruction has 3 operands as
input. The first operand is a 32-bit scalar. This value will be shared
between threads; it can be a float, a signed integer, or an unsigned
integer. The second operand is an unsigned integer index in the range 0 to
31. It is used to compute the thread from which the current thread will
read the 32-bit scalar value. For the SHFIDX instruction, the source thread
id is computed using the following operation, where segmentationMask and
minThreadId are as defined below:
source thread id = (index operand & ~segmentationMask) | minThreadId
The last operand is an unsigned integer mask. The mask is used for
segmenting the thread group and limiting the source thread index. Bits 0
to 4 of <mask> are a clamp value that limits the source thread index, and
bits 8 to 12 are a segmentation mask used to divide the thread group into
multiple smaller groups. Together the clamp value and the segmentation
mask generate 2 internal values, minThreadId and maxThreadId, using the
following logic:
minThreadId = current thread id & segmentationMask
maxThreadId = minThreadId | (clamp & ~segmentationMask)
These 2 values segment the thread group by restricting the range of thread
ids that a specific thread can access.
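The SHFIDX source thread id operation above can be sketched in GLSL as follows. The helper name, the buffer "Sources", the example mask 0x181F, and the defensive masking of the index to 5 bits are assumptions of this sketch. gl_ThreadInWarpNV is assumed to be provided by NV_shader_thread_group.

```glsl
#version 430
#extension GL_NV_shader_thread_group : require   // for gl_ThreadInWarpNV

layout(local_size_x = 32) in;

// Hypothetical storage buffer, used only for this illustration.
layout(std430, binding = 0) buffer Sources { uint sources[]; };

uint shfidxSourceThreadId(uint threadId, uint index, uint mask)
{
    uint segmentationMask = (mask >> 8u) & 0x1Fu;   // bits 8..12 of the mask
    uint minThreadId = threadId & segmentationMask;
    // Assumption: keep only the low 5 bits of the index (range 0 to 31).
    return ((index & 0x1Fu) & ~segmentationMask) | minThreadId;
}

void main()
{
    // With mask 0x181F (segmentation mask 0x18) and index 2, thread 13
    // computes (2 & ~0x18) | 8 = 2 | 8 = 10.
    sources[gl_GlobalInvocationID.x] =
        shfidxSourceThreadId(gl_ThreadInWarpNV, 2u, 0x181Fu);
}
```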
SHFIDX returns a 2-component vector. The first component is a predicate
that is TRUE when the computed source thread id is in range and FALSE when
it is out of bounds. For SHFIDX, the source thread id is in range when it
is lower than maxThreadId. The second component holds a 32-bit value.
When the source thread id is in range, this value comes from the source
thread. When the source thread id is out of range, the value is read from
the current thread. If the source thread id refers to an inactive
thread, the returned result is undefined.
SHFIDX supports all data type modifiers. For floating-point data types,
the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data
types, the TRUE value is -1 and the FALSE value is 0. For unsigned integer
data types, the TRUE value is the maximum integer value (all bits are ones)
and the FALSE value is zero.
Section 2.X.8.Z, SHFUP: warp shuffle with subtracted index
The SHFUP instruction allows a 32-bit scalar value to be exchanged between
multiple threads within a thread group. The instruction has 3 operands as
input. The first operand is a 32-bit scalar. This value will be shared
between threads; it can be a float, a signed integer, or an unsigned
integer. The second operand is an unsigned integer index in the range 0 to
31. It is used to compute the thread from which the current thread will
read the 32-bit scalar value. For the SHFUP instruction, the source thread
id is the id of the current thread minus the index operand.
The last operand is an unsigned integer mask. The mask is used for
segmenting the thread group and limiting the source thread index. Bits 0
to 4 of <mask> are a clamp value that limits the source thread index, and
bits 8 to 12 are a segmentation mask used to divide the thread group into
multiple smaller groups. Together the clamp value and the segmentation
mask generate 2 internal values, minThreadId and maxThreadId, using the
following logic:
minThreadId = current thread id & segmentationMask
maxThreadId = minThreadId | (clamp & ~segmentationMask)
These 2 values segment the thread group by restricting the range of thread
ids that a specific thread can access.
SHFUP returns a 2-component vector. The first component is a predicate
that is TRUE when the computed source thread id is in range and FALSE when
it is out of bounds. For SHFUP, the source thread id is in range when it
is greater than maxThreadId. The second component holds a 32-bit value.
When the source thread id is in range, this value comes from the source
thread. When the source thread id is out of range, the value is read from
the current thread. If the source thread id refers to an inactive
thread, the returned result is undefined.
SHFUP supports all data type modifiers. For floating-point data types,
the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data
types, the TRUE value is -1 and the FALSE value is 0. For unsigned integer
data types, the TRUE value is the maximum integer value (all bits are ones)
and the FALSE value is zero.
Section 2.X.8.Z, SHFXOR: warp shuffle with XORed index
The SHFXOR instruction allows a 32-bit scalar value to be exchanged
between multiple threads within a thread group. The instruction has 3
operands as input. The first operand is a 32-bit scalar. This value will
be shared between threads; it can be a float, a signed integer, or an
unsigned integer. The second operand is an unsigned integer index in the
range 0 to 31. It is used to compute the thread from which the current
thread will read the 32-bit scalar value. For the SHFXOR instruction, the
source thread id is the id of the current thread XORed with the index
operand.
The last operand is an unsigned integer mask. The mask is used for
segmenting the thread group and limiting the source thread index. Bits 0
to 4 of <mask> are a clamp value that limits the source thread index, and
bits 8 to 12 are a segmentation mask used to divide the thread group into
multiple smaller groups. Together the clamp value and the segmentation
mask generate 2 internal values, minThreadId and maxThreadId, using the
following logic:
minThreadId = current thread id & segmentationMask
maxThreadId = minThreadId | (clamp & ~segmentationMask)
These 2 values segment the thread group by restricting the range of thread
ids that a specific thread can access.
SHFXOR returns a 2-component vector. The first component is a predicate
that is TRUE when the computed source thread id is in range and FALSE when
it is out of bounds. For SHFXOR, the source thread id is in range when it
is lower than maxThreadId. The second component holds a 32-bit value.
When the source thread id is in range, this value comes from the source
thread. When the source thread id is out of range, the value is read from
the current thread. If the source thread id refers to an inactive
thread, the returned result is undefined.
SHFXOR supports all data type modifiers. For floating-point data types,
the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data
types, the TRUE value is -1 and the FALSE value is 0. For unsigned integer
data types, the TRUE value is the maximum integer value (all bits are ones)
and the FALSE value is zero.
Errors
None.
New State
None.
New Implementation Dependent State
None.
Issues
None
Revision History
Rev. Date Author Changes
---- -------- -------- -----------------------------------------
3 2/14/14 jbreton Rename the extension from NVX to NV.
2 9/4/13 jbreton Replace mask by width in the shuffle functions.
1 11/27/12 jbreton Internal revisions.