| Name |
| |
| NV_shader_thread_shuffle |
| |
| Name Strings |
| |
| GL_NV_shader_thread_shuffle |
| |
| Contributors |
| |
| Jeannot Breton, NVIDIA |
| Pat Brown, NVIDIA |
| Eric Werness, NVIDIA |
| Mark Kilgard, NVIDIA |
| |
| Contact |
| |
| Jeannot Breton, NVIDIA Corporation (jbreton 'at' nvidia.com) |
| |
| Status |
| |
| Shipping. |
| |
| Version |
| |
| Last Modified Date: 2/14/2014 |
| NVIDIA Revision: 3 |
| |
| Number |
| |
| OpenGL Extension #448 |
| |
| Dependencies |
| |
| This extension is written against the OpenGL 4.3 (Compatibility Profile) |
| Specification. |
| |
| This extension is written against version 4.30 (revision 07) of the OpenGL |
| Shading Language Specification. |
| |
| OpenGL 4.3 and GLSL 4.3 are required. |
| |
| This extension interacts with NV_gpu_program5. |
| |
| Overview |
| |
| Implementations of the OpenGL Shading Language may, but are not required |
| to, run multiple shader threads for a single stage as a SIMD thread group, |
| where individual execution threads are assigned to thread groups in an |
| undefined, implementation-dependent order. This extension provides a set |
| of new features to the OpenGL Shading Language to share data between |
| multiple threads within a thread group. |
| |
| Shaders using the new functionality provided by this extension should |
| enable it via the construct |
| |
| #extension GL_NV_shader_thread_shuffle : require (or enable) |
| |
| This extension also specifies some modifications to the program assembly |
| language to support the thread data sharing functionalities. |
| |
| New Procedures and Functions |
| |
| None |
| |
| |
| New Tokens |
| |
| None |
| |
| |
| Modifications to The OpenGL Shading Language Specification, Version 4.30 |
| (Revision 07) |
| |
| Including the following line in a shader can be used to control the |
| language features described in this extension: |
| |
| #extension GL_NV_shader_thread_shuffle : <behavior> |
| |
| where <behavior> is as specified in section 3.3. |
| |
| New preprocessor #defines are added to the OpenGL Shading Language: |
| |
| #define GL_NV_shader_thread_shuffle 1 |
| |
| |
| Modify Section 8.3, Common Functions, p. 133 |
| |
| (add a function to share data between threads in a thread group) |
| |
| Syntax: |
| |
| int shuffleDownNV(int data, uint index, uint width, |
| [out bool threadIdValid]) |
| ivec2 shuffleDownNV(ivec2 data, uint index, uint width, |
| [out bool threadIdValid]) |
| ivec3 shuffleDownNV(ivec3 data, uint index, uint width, |
| [out bool threadIdValid]) |
| ivec4 shuffleDownNV(ivec4 data, uint index, uint width, |
| [out bool threadIdValid]) |
| |
| uint shuffleDownNV(uint data, uint index, uint width, |
| [out bool threadIdValid]) |
| uvec2 shuffleDownNV(uvec2 data, uint index, uint width, |
| [out bool threadIdValid]) |
| uvec3 shuffleDownNV(uvec3 data, uint index, uint width, |
| [out bool threadIdValid]) |
| uvec4 shuffleDownNV(uvec4 data, uint index, uint width, |
| [out bool threadIdValid]) |
| |
| float shuffleDownNV(float data, uint index, uint width, |
| [out bool threadIdValid]) |
| vec2 shuffleDownNV(vec2 data, uint index, uint width, |
| [out bool threadIdValid]) |
| vec3 shuffleDownNV(vec3 data, uint index, uint width, |
| [out bool threadIdValid]) |
| vec4 shuffleDownNV(vec4 data, uint index, uint width, |
| [out bool threadIdValid]) |
| |
| bool shuffleDownNV(bool data, uint index, uint width, |
| [out bool threadIdValid]) |
| bvec2 shuffleDownNV(bvec2 data, uint index, uint width, |
| [out bool threadIdValid]) |
| bvec3 shuffleDownNV(bvec3 data, uint index, uint width, |
| [out bool threadIdValid]) |
| bvec4 shuffleDownNV(bvec4 data, uint index, uint width, |
| [out bool threadIdValid]) |
| |
| |
| int shuffleUpNV(int data, uint index, uint width, |
| [out bool threadIdValid]) |
| ivec2 shuffleUpNV(ivec2 data, uint index, uint width, |
| [out bool threadIdValid]) |
| ivec3 shuffleUpNV(ivec3 data, uint index, uint width, |
| [out bool threadIdValid]) |
| ivec4 shuffleUpNV(ivec4 data, uint index, uint width, |
| [out bool threadIdValid]) |
| |
| uint shuffleUpNV(uint data, uint index, uint width, |
| [out bool threadIdValid]) |
| uvec2 shuffleUpNV(uvec2 data, uint index, uint width, |
| [out bool threadIdValid]) |
| uvec3 shuffleUpNV(uvec3 data, uint index, uint width, |
| [out bool threadIdValid]) |
| uvec4 shuffleUpNV(uvec4 data, uint index, uint width, |
| [out bool threadIdValid]) |
| |
| float shuffleUpNV(float data, uint index, uint width, |
| [out bool threadIdValid]) |
| vec2 shuffleUpNV(vec2 data, uint index, uint width, |
| [out bool threadIdValid]) |
| vec3 shuffleUpNV(vec3 data, uint index, uint width, |
| [out bool threadIdValid]) |
| vec4 shuffleUpNV(vec4 data, uint index, uint width, |
| [out bool threadIdValid]) |
| |
| bool shuffleUpNV(bool data, uint index, uint width, |
| [out bool threadIdValid]) |
| bvec2 shuffleUpNV(bvec2 data, uint index, uint width, |
| [out bool threadIdValid]) |
| bvec3 shuffleUpNV(bvec3 data, uint index, uint width, |
| [out bool threadIdValid]) |
| bvec4 shuffleUpNV(bvec4 data, uint index, uint width, |
| [out bool threadIdValid]) |
| |
| |
| int shuffleXorNV(int data, uint index, uint width, |
| [out bool threadIdValid]) |
| ivec2 shuffleXorNV(ivec2 data, uint index, uint width, |
| [out bool threadIdValid]) |
| ivec3 shuffleXorNV(ivec3 data, uint index, uint width, |
| [out bool threadIdValid]) |
| ivec4 shuffleXorNV(ivec4 data, uint index, uint width, |
| [out bool threadIdValid]) |
| |
| uint shuffleXorNV(uint data, uint index, uint width, |
| [out bool threadIdValid]) |
| uvec2 shuffleXorNV(uvec2 data, uint index, uint width, |
| [out bool threadIdValid]) |
| uvec3 shuffleXorNV(uvec3 data, uint index, uint width, |
| [out bool threadIdValid]) |
| uvec4 shuffleXorNV(uvec4 data, uint index, uint width, |
| [out bool threadIdValid]) |
| |
| float shuffleXorNV(float data, uint index, uint width, |
| [out bool threadIdValid]) |
| vec2 shuffleXorNV(vec2 data, uint index, uint width, |
| [out bool threadIdValid]) |
| vec3 shuffleXorNV(vec3 data, uint index, uint width, |
| [out bool threadIdValid]) |
| vec4 shuffleXorNV(vec4 data, uint index, uint width, |
| [out bool threadIdValid]) |
| |
| bool shuffleXorNV(bool data, uint index, uint width, |
| [out bool threadIdValid]) |
| bvec2 shuffleXorNV(bvec2 data, uint index, uint width, |
| [out bool threadIdValid]) |
| bvec3 shuffleXorNV(bvec3 data, uint index, uint width, |
| [out bool threadIdValid]) |
| bvec4 shuffleXorNV(bvec4 data, uint index, uint width, |
| [out bool threadIdValid]) |
| |
| |
| int shuffleNV(int data, uint index, uint width, |
| [out bool threadIdValid]) |
| ivec2 shuffleNV(ivec2 data, uint index, uint width, |
| [out bool threadIdValid]) |
| ivec3 shuffleNV(ivec3 data, uint index, uint width, |
| [out bool threadIdValid]) |
| ivec4 shuffleNV(ivec4 data, uint index, uint width, |
| [out bool threadIdValid]) |
| |
| uint shuffleNV(uint data, uint index, uint width, |
| [out bool threadIdValid]) |
| uvec2 shuffleNV(uvec2 data, uint index, uint width, |
| [out bool threadIdValid]) |
| uvec3 shuffleNV(uvec3 data, uint index, uint width, |
| [out bool threadIdValid]) |
| uvec4 shuffleNV(uvec4 data, uint index, uint width, |
| [out bool threadIdValid]) |
| |
| float shuffleNV(float data, uint index, uint width, |
| [out bool threadIdValid]) |
| vec2 shuffleNV(vec2 data, uint index, uint width, |
| [out bool threadIdValid]) |
| vec3 shuffleNV(vec3 data, uint index, uint width, |
| [out bool threadIdValid]) |
| vec4 shuffleNV(vec4 data, uint index, uint width, |
| [out bool threadIdValid]) |
| |
| bool shuffleNV(bool data, uint index, uint width, |
| [out bool threadIdValid]) |
| bvec2 shuffleNV(bvec2 data, uint index, uint width, |
| [out bool threadIdValid]) |
| bvec3 shuffleNV(bvec3 data, uint index, uint width, |
| [out bool threadIdValid]) |
| bvec4 shuffleNV(bvec4 data, uint index, uint width, |
| [out bool threadIdValid]) |
| |
| Shuffle functions allow active threads within a thread group to exchange |
| data using 4 different modes (up, down, xor, indexed). They all load |
| the operand <data> which can be different per thread and return a value |
| read from the source thread at an address computed with the <index> and |
| the <width> operands. |
| |
| <index> is a 5-bit value in the range 0 to 31; the MSBs are ignored. |
| <threadIdValid> is an optional operand. It holds the value of the predicate |
| that specifies whether the source thread from which the current thread |
| reads data is in range. |
| |
| <width> is used for segmenting the thread group into multiple segments. The |
| segments must be of equal size, so <width> needs to be a power of 2 |
| in the range 2 to 32. Using a <width> of 32 treats the whole thread |
| group as a single segment. A <width> of 8 divides the thread group into |
| 4 segments of size 8. Using a <width> that is not a power of 2, or that is |
| lower than 2 or larger than 32, will return an undefined value. |
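| |
| The <width> rules above can be sketched on the CPU as follows. This is an |
| illustrative C model, not part of the extension: the function names are |
| hypothetical, and the thread group size of 32 is an assumption taken from |
| the 2 to 32 range of <width>. |
| |

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 32 threads per thread group, matching the 2..32 range of <width>. */
#define THREAD_GROUP_SIZE 32u

/* True when width is a power of 2 in the range 2 to 32. */
bool shuffle_width_is_valid(uint32_t width)
{
    return width >= 2u && width <= 32u && (width & (width - 1u)) == 0u;
}

/* Number of segments the thread group is split into, or 0 for an invalid width. */
uint32_t shuffle_segment_count(uint32_t width)
{
    return shuffle_width_is_valid(width) ? THREAD_GROUP_SIZE / width : 0u;
}

/* First thread id of the segment that contains thread_id (width must be valid). */
uint32_t shuffle_segment_base(uint32_t thread_id, uint32_t width)
{
    return thread_id & ~(width - 1u);
}
```

| |
| With a <width> of 8, for instance, this model reports 4 segments and |
| places thread 13 in the segment that starts at thread 8. |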
| |
| Threads can only share data within their own segment. Each thread |
| executing the built-in shuffle function will determine the ID of another |
| thread by combining its value of gl_ThreadInWarpNV with its value of |
| <index> as described below. Such threads will attempt to read the value of |
| <data> in the computed other thread and return that value to the caller. |
| |
| When a shuffle function attempts to access the value of <data> from another |
| thread, it determines whether the other thread is in accessible range or |
| not. If it is in range, true will be returned in the optional |
| <threadIdValid> parameter, if provided by the caller. If it's out of |
| range, false will be returned in <threadIdValid>, if provided by the |
| caller, and the value returned by the function will come from the current |
| thread. |
| |
| |
| The 4 modes use the following logic to compute the source thread index and |
| the <threadIdValid> value: |
| |
| shuffleNV computes the source index using <index> as an absolute address |
| within the thread group segment. |
| |
| srcThreadId = <index> |
| <threadIdValid> = <index> < <width> |
| |
| For example, with this thread group segment: |
| |
| ----------------- |
| Thread Id |0|1|2|3|4|5|6|7| |
| ----------------- |
| Thread <data> |a|b|c|d|e|f|g|h| |
| ----------------- |
| |
| If <index> is 2 |
| |
| ----------------- |
| src thread Id |2|2|2|2|2|2|2|2| |
| ----------------- |
| <threadIdValid> |1|1|1|1|1|1|1|1| |
| ----------------- |
| result |c|c|c|c|c|c|c|c| |
| ----------------- |
| |
| If <index> is 9 |
| |
| ----------------- |
| src thread Id |9|9|9|9|9|9|9|9| |
| ----------------- |
| <threadIdValid> |0|0|0|0|0|0|0|0| |
| ----------------- |
| result |a|b|c|d|e|f|g|h| |
| ----------------- |
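| |
| The indexed mode can be checked against a small CPU model. The sketch |
| below is written in C rather than GLSL and is only an illustration: the |
| function name is hypothetical, and it models a single segment of <width> |
| threads whose ids are counted from the start of the segment. |
| |

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical CPU model of shuffleNV for ONE segment of `width` threads.
 * data[i] is the <data> value of thread i; result[i] and valid[i] receive
 * the return value and <threadIdValid> that thread i would observe. */
void model_shuffle_idx(const char *data, uint32_t width, uint32_t index,
                       char *result, bool *valid)
{
    index &= 31u; /* <index> is a 5-bit value, MSBs are ignored */
    for (uint32_t tid = 0; tid < width; ++tid) {
        uint32_t src = index;            /* srcThreadId = <index> */
        valid[tid] = src < width;        /* <threadIdValid> = <index> < <width> */
        result[tid] = valid[tid] ? data[src] : data[tid];
    }
}
```

| |
| For the segment a..h, an <index> of 2 makes every thread read the value |
| of thread 2 (c), while an <index> of 9 is out of range and leaves each |
| thread with its own value. |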
| |
| |
| shuffleUpNV subtracts <index> from the current thread id to get the source |
| thread id. This has the effect of shifting the segment up by <index> |
| threads. Source thread ids do not wrap around, so the lower thread ids |
| will be left unchanged. |
| |
| srcThreadId = currentThreadId - <index> |
| <threadIdValid> = srcThreadId >= 0 |
| |
| For example, with this thread group segment: |
| |
| ----------------- |
| Thread Id |0|1|2|3|4|5|6|7| |
| ----------------- |
| Thread <data> |a|b|c|d|e|f|g|h| |
| ----------------- |
| |
| If <index> is 1 |
| |
| ------------------ |
| src thread Id |-1|0|1|2|3|4|5|6| |
| ------------------ |
| <threadIdValid> |0 |1|1|1|1|1|1|1| |
| ------------------ |
| result |a |a|b|c|d|e|f|g| |
| ------------------ |
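| |
| A matching CPU sketch for the up mode, again in C and under the same |
| assumptions (hypothetical function name, one segment of <width> threads, |
| ids counted from the start of the segment): |
| |

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical CPU model of shuffleUpNV for ONE segment of `width` threads. */
void model_shuffle_up(const char *data, uint32_t width, uint32_t index,
                      char *result, bool *valid)
{
    index &= 31u; /* <index> is a 5-bit value */
    for (uint32_t tid = 0; tid < width; ++tid) {
        /* Signed arithmetic so that an id below the segment shows up as negative. */
        int32_t src = (int32_t)tid - (int32_t)index;
        valid[tid] = src >= 0;           /* <threadIdValid> = srcThreadId >= 0 */
        result[tid] = valid[tid] ? data[src] : data[tid];
    }
}
```

| |
| With the segment a..h and an <index> of 1, thread 0 has no source and |
| keeps a, and the remaining threads read a, b, c, d, e, f, g. |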
| |
| |
| shuffleDownNV adds <index> to the current thread id to get the source |
| thread id. This has the effect of shifting the segment down by |
| <index> threads. Source thread ids do not wrap around, so the higher thread |
| ids will be left unchanged. |
| |
| srcThreadId = currentThreadId + <index> |
| <threadIdValid> = srcThreadId < <width> |
| |
| For example, with this thread group segment: |
| |
| ----------------- |
| Thread Id |0|1|2|3|4|5|6|7| |
| ----------------- |
| Thread <data> |a|b|c|d|e|f|g|h| |
| ----------------- |
| |
| If <index> is 2 |
| |
| ----------------- |
| src thread Id |2|3|4|5|6|7|8|9| |
| ----------------- |
| <threadIdValid> |1|1|1|1|1|1|0|0| |
| ----------------- |
| result |c|d|e|f|g|h|g|h| |
| ----------------- |
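| |
| The down mode can be modeled the same way. As before, this is only a C |
| sketch with a hypothetical function name, covering one segment of <width> |
| threads with ids counted from the start of the segment. |
| |

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical CPU model of shuffleDownNV for ONE segment of `width` threads. */
void model_shuffle_down(const char *data, uint32_t width, uint32_t index,
                        char *result, bool *valid)
{
    index &= 31u; /* <index> is a 5-bit value */
    for (uint32_t tid = 0; tid < width; ++tid) {
        uint32_t src = tid + index;      /* srcThreadId = currentThreadId + <index> */
        valid[tid] = src < width;        /* <threadIdValid> = srcThreadId < <width> */
        result[tid] = valid[tid] ? data[src] : data[tid];
    }
}
```

| |
| With the segment a..h and an <index> of 2, threads 6 and 7 fall outside |
| the segment and keep g and h. |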
| |
| |
| shuffleXorNV does a bitwise xor between <index> and the current |
| thread id to get the source thread id: |
| |
| srcThreadId = currentThreadId ^ <index> |
| <threadIdValid> = srcThreadId < <width> |
| |
| For example, with this thread group segment: |
| |
| ----------------- |
| Thread Id |0|1|2|3|4|5|6|7| |
| ----------------- |
| Thread <data> |a|b|c|d|e|f|g|h| |
| ----------------- |
| |
| If <index> is 0x1 |
| |
| ----------------- |
| src thread Id |1|0|3|2|5|4|7|6| |
| ----------------- |
| <threadIdValid> |1|1|1|1|1|1|1|1| |
| ----------------- |
| result |b|a|d|c|f|e|h|g| |
| ----------------- |
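| |
| A CPU sketch of the xor mode follows, together with one possible use: a |
| butterfly sum that repeats the xor exchange with offsets <width>/2, ..., 2, |
| 1 so that every thread ends up with the segment total. Both function names |
| are hypothetical, the model covers a single segment of at most 32 threads, |
| and the reduction is an illustration rather than something this |
| extension prescribes. |
| |

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical CPU model of shuffleXorNV for ONE segment of `width` threads. */
void model_shuffle_xor(const int *data, uint32_t width, uint32_t index,
                       int *result, bool *valid)
{
    index &= 31u; /* <index> is a 5-bit value */
    for (uint32_t tid = 0; tid < width; ++tid) {
        uint32_t src = tid ^ index;      /* srcThreadId = currentThreadId ^ <index> */
        valid[tid] = src < width;        /* <threadIdValid> = srcThreadId < <width> */
        result[tid] = valid[tid] ? data[src] : data[tid];
    }
}

/* Butterfly sum over one segment: after the loop every entry of `values`
 * holds the sum of the original entries. Assumes width is a power of 2
 * in the range 2 to 32. */
void model_butterfly_sum(int *values, uint32_t width)
{
    int received[32];
    bool valid[32];
    for (uint32_t offset = width / 2u; offset > 0u; offset >>= 1) {
        model_shuffle_xor(values, width, offset, received, valid);
        for (uint32_t tid = 0; tid < width; ++tid)
            values[tid] += received[tid];
    }
}
```

| |
| Because <width> is a power of 2 and each offset is smaller than <width>, |
| every xor exchange in the reduction stays inside the segment. |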
| |
| Dependencies on NV_gpu_program5 |
| |
| If NV_gpu_program5 is supported and "OPTION NV_shader_thread_shuffle" is |
| specified in an assembly program, the following edits are made to extend |
| the assembly programming model documented in the NV_gpu_program4 extension |
| and extended by NV_gpu_program5. |
| |
| If NV_gpu_program5 is not supported, or if |
| "OPTION NV_shader_thread_shuffle" is not specified in an assembly program, |
| the contents of this dependencies section should be ignored. |
| |
| Section 2.X.2, Program Grammar |
| |
| (add the following rules to the grammar) |
| |
| <VECTORop> ::= "SHFDOWN" |
| | "SHFIDX" |
| | "SHFUP" |
| | "SHFXOR" |
| |
| |
| Modify Section 2.X.4, Program Execution Environment |
| |
| (Add the table entries and relevant text describing the program |
| instructions to exchange data between threads.) |
| |
| Instr- Modifiers |
| uction V F I C S H D Out Inputs Description |
| ------- -- - - - - - - --- -------- -------------------------------- |
| ... |
| SHFDOWN 50 X X - - - - F v v,vu,vu warp shuffle with added index |
| SHFIDX 50 X X - - - - F v v,vu,vu warp shuffle with absolute index |
| SHFUP 50 X X - - - - F v v,vu,vu warp shuffle with subtracted index |
| SHFXOR 50 X X - - - - F v v,vu,vu warp shuffle with XORed index |
| ... |
| |
| |
| (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension, |
| as extended by NV_gpu_program5) |
| |
| + Shader thread shuffle (NV_shader_thread_shuffle) |
| |
| If a program specifies the "NV_shader_thread_shuffle" option, it may use |
| the "SHFXOR", "SHFDOWN", "SHFIDX" and "SHFUP" instructions. If this option |
| is not specified, a program will fail to compile if it uses those |
| instructions. |
| |
| |
| Section 2.X.8.Z, SHFDOWN: warp shuffle with added index |
| |
| The SHFDOWN instruction allows a 32-bit scalar value to be exchanged |
| between multiple threads within a thread group. The instruction has 3 |
| operands as input. The first operand is a 32-bit scalar. This value will |
| be shared between threads; it can be a float, a signed or an unsigned |
| integer. The second operand is an unsigned integer index in the range 0 to |
| 31. It is used to compute from which thread the current thread will read |
| the 32-bit scalar value. For the SHFDOWN instruction, the source thread id |
| is the id of the current thread plus the index operand. |
| |
| The last operand is an unsigned integer mask. The mask is used for |
| segmenting the thread group and limiting the source thread index. Bits 0 |
| to 4 of <mask> are a clamp value that limits the source thread index and |
| bits 8 to 12 are a segmentation mask used to segment the thread group into |
| multiple smaller groups. Together the clamp value and the segmentation |
| mask will generate 2 internal values, the minThreadId and the maxThreadId, |
| using the following logic: |
| |
| minThreadId = current thread id & segmentationMask |
| |
| maxThreadId = minThreadId | (clamp & ~segmentationMask) |
| |
| Those 2 values will segment the thread group by restricting the address |
| range a specific thread can access. |
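| |
| The decoding of the mask operand can be sketched in C as shown below. The |
| type and function names are hypothetical. shf_mask_for_width in particular |
| is an assumption: it shows one encoding that yields segments of <width> |
| threads under the two formulas above, and is not taken from this |
| specification. |
| |

```c
#include <stdint.h>

typedef struct {
    uint32_t min_thread_id;
    uint32_t max_thread_id;
} shf_range;

/* Derive minThreadId and maxThreadId for one thread from the mask operand. */
shf_range shf_decode_mask(uint32_t thread_id, uint32_t mask)
{
    uint32_t clamp = mask & 0x1Fu;                     /* bits 0 to 4 */
    uint32_t segmentation_mask = (mask >> 8) & 0x1Fu;  /* bits 8 to 12 */
    shf_range range;
    range.min_thread_id = thread_id & segmentation_mask;
    range.max_thread_id = range.min_thread_id | (clamp & ~segmentation_mask);
    return range;
}

/* Assumed encoding (not from the spec): clamp = 31 and
 * segmentationMask = 32 - width, for width a power of 2 in 2..32. */
uint32_t shf_mask_for_width(uint32_t width)
{
    return ((32u - width) << 8) | 0x1Fu;
}
```

| |
| Under that assumed encoding, a width of 8 gives the mask 0x181F, and |
| thread 13 decodes to minThreadId 8 and maxThreadId 15. |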
| |
| SHFDOWN returns a 2-component vector. The first component is a predicate |
| that is TRUE when the computed source thread id is in range and FALSE when |
| it's out of bounds. For SHFDOWN, the source thread id is in range when it |
| is lower than maxThreadId. The second component holds a 32-bit value. |
| When the source thread id is in range, this value comes from the source |
| thread. When the source thread id is out of range, the value is read from |
| the current thread. If the source thread id refers to an inactive |
| thread, the returned result will be undefined. |
| |
| SHFDOWN supports all data type modifiers. For floating-point data types, |
| the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data |
| types, the TRUE value is -1 and the FALSE value is 0. For unsigned integer |
| data types, the TRUE value is the maximum integer value (all bits are ones) |
| and the FALSE value is zero. |
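| |
| The predicate encodings can be written out as 32-bit patterns. The C |
| sketch below uses hypothetical names and assumes IEEE 754 single-precision |
| floats. It also makes visible that the signed TRUE value (-1) and the |
| unsigned TRUE value (all bits ones) are the same bit pattern. |
| |

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef enum { SHF_TYPE_FLOAT, SHF_TYPE_SIGNED, SHF_TYPE_UNSIGNED } shf_type;

/* Bit pattern of the predicate component for a given data type modifier. */
uint32_t shf_predicate_bits(shf_type type, bool in_range)
{
    if (type == SHF_TYPE_FLOAT) {
        float value = in_range ? 1.0f : 0.0f;
        uint32_t bits;
        memcpy(&bits, &value, sizeof bits); /* reinterpret without aliasing issues */
        return bits;
    }
    /* Signed TRUE is -1 and unsigned TRUE is all ones: one shared pattern. */
    return in_range ? 0xFFFFFFFFu : 0u;
}
```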
| |
| |
| Section 2.X.8.Z, SHFIDX: warp shuffle with absolute index |
| |
| The SHFIDX instruction allows a 32-bit scalar value to be exchanged between |
| multiple threads within a thread group. The instruction has 3 operands as |
| input. The first operand is a 32-bit scalar. This value will be shared |
| between threads; it can be a float, a signed or an unsigned integer. The |
| second operand is an unsigned integer index in the range 0 to 31. It is |
| used to compute from which thread the current thread will read the |
| 32-bit scalar value. For the SHFIDX instruction, this source thread id is |
| computed using the following operation: |
| |
| source thread id = (index operand & ~segmentationMask) | minThreadId |
| |
| The last operand is an unsigned integer mask. The mask is used for |
| segmenting the thread group and limiting the source thread index. Bits 0 |
| to 4 of <mask> are a clamp value that limits the source thread index and |
| bits 8 to 12 are a segmentation mask used to segment the thread group into |
| multiple smaller groups. Together the clamp value and the segmentation |
| mask will generate 2 internal values, the minThreadId and the maxThreadId, |
| using the following logic: |
| |
| minThreadId = current thread id & segmentationMask |
| |
| maxThreadId = minThreadId | (clamp & ~segmentationMask) |
| |
| Those 2 values will segment the thread group by restricting the address |
| range a specific thread can access. |
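| |
| Combining the SHFIDX source thread id operation with the minThreadId |
| formula gives the following C sketch. The function name is hypothetical, |
| and the example mask value 0x181F (clamp 31, segmentationMask 0x18) is an |
| assumption chosen to produce segments of 8 threads. |
| |

```c
#include <stdint.h>

/* Source thread id for SHFIDX:
 * (index operand & ~segmentationMask) | minThreadId */
uint32_t shfidx_source_thread_id(uint32_t thread_id, uint32_t index,
                                 uint32_t mask)
{
    uint32_t segmentation_mask = (mask >> 8) & 0x1Fu;  /* bits 8 to 12 */
    uint32_t min_thread_id = thread_id & segmentation_mask;
    /* The index operand is in the range 0 to 31, so keep its low 5 bits. */
    return ((index & 0x1Fu) & ~segmentation_mask) | min_thread_id;
}
```

| |
| With the mask 0x181F, thread 13 and index 2 give source thread id 10: the |
| index selects position 2 inside the segment that starts at thread 8. |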
| |
| SHFIDX returns a 2-component vector. The first component is a predicate |
| that is TRUE when the computed source thread id is in range and FALSE when |
| it's out of bounds. For SHFIDX, the source thread id is in range when it |
| is lower than maxThreadId. The second component holds a 32-bit value. |
| When the source thread id is in range, this value comes from the source |
| thread. When the source thread id is out of range, the value is read from |
| the current thread. If the source thread id refers to an inactive |
| thread, the returned result will be undefined. |
| |
| SHFIDX supports all data type modifiers. For floating-point data types, |
| the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data |
| types, the TRUE value is -1 and the FALSE value is 0. For unsigned integer |
| data types, the TRUE value is the maximum integer value (all bits are ones) |
| and the FALSE value is zero. |
| |
| |
| Section 2.X.8.Z, SHFUP: warp shuffle with subtracted index |
| |
| The SHFUP instruction allows a 32-bit scalar value to be exchanged between |
| multiple threads within a thread group. The instruction has 3 operands as |
| input. The first operand is a 32-bit scalar. This value will be shared |
| between threads; it can be a float, a signed or an unsigned integer. The |
| second operand is an unsigned integer index in the range 0 to 31. It is |
| used to compute from which thread the current thread will read the 32-bit |
| scalar value. For the SHFUP instruction, the source thread id is the id of |
| the current thread minus the index operand. |
| |
| The last operand is an unsigned integer mask. The mask is used for |
| segmenting the thread group and limiting the source thread index. Bits 0 |
| to 4 of <mask> are a clamp value that limits the source thread index and |
| bits 8 to 12 are a segmentation mask used to segment the thread group into |
| multiple smaller groups. Together the clamp value and the segmentation |
| mask will generate 2 internal values, the minThreadId and the maxThreadId, |
| using the following logic: |
| |
| minThreadId = current thread id & segmentationMask |
| |
| maxThreadId = minThreadId | (clamp & ~segmentationMask) |
| |
| Those 2 values will segment the thread group by restricting the address |
| range a specific thread can access. |
| |
| SHFUP returns a 2-component vector. The first component is a predicate |
| that is TRUE when the computed source thread id is in range and FALSE when |
| it's out of bounds. For SHFUP, the source thread id is in range when it |
| is greater than maxThreadId. The second component holds a 32-bit value. |
| When the source thread id is in range, this value comes from the source |
| thread. When the source thread id is out of range, the value is read from |
| the current thread. If the source thread id refers to an inactive |
| thread, the returned result will be undefined. |
| |
| SHFUP supports all data type modifiers. For floating-point data types, |
| the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data |
| types, the TRUE value is -1 and the FALSE value is 0. For unsigned integer |
| data types, the TRUE value is the maximum integer value (all bits are ones) |
| and the FALSE value is zero. |
| |
| |
| Section 2.X.8.Z, SHFXOR: warp shuffle with XORed index |
| |
| The SHFXOR instruction allows a 32-bit scalar value to be exchanged |
| between multiple threads within a thread group. The instruction has 3 |
| operands as input. The first operand is a 32-bit scalar. This value will |
| be shared between threads; it can be a float, a signed or an unsigned |
| integer. The second operand is an unsigned integer index in the range 0 to |
| 31. It is used to compute from which thread the current thread will read |
| the 32-bit scalar value. For the SHFXOR instruction, the source thread id |
| is the id of the current thread XORed with the index operand. |
| |
| The last operand is an unsigned integer mask. The mask is used for |
| segmenting the thread group and limiting the source thread index. Bits 0 |
| to 4 of <mask> are a clamp value that limits the source thread index and |
| bits 8 to 12 are a segmentation mask used to segment the thread group into |
| multiple smaller groups. Together the clamp value and the segmentation |
| mask will generate 2 internal values, the minThreadId and the maxThreadId, |
| using the following logic: |
| |
| minThreadId = current thread id & segmentationMask |
| |
| maxThreadId = minThreadId | (clamp & ~segmentationMask) |
| |
| Those 2 values will segment the thread group by restricting the address |
| range a specific thread can access. |
| |
| SHFXOR returns a 2-component vector. The first component is a predicate |
| that is TRUE when the computed source thread id is in range and FALSE when |
| it's out of bounds. For SHFXOR, the source thread id is in range when it |
| is lower than maxThreadId. The second component holds a 32-bit value. |
| When the source thread id is in range, this value comes from the source |
| thread. When the source thread id is out of range, the value is read from |
| the current thread. If the source thread id refers to an inactive |
| thread, the returned result will be undefined. |
| |
| SHFXOR supports all data type modifiers. For floating-point data types, |
| the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data |
| types, the TRUE value is -1 and the FALSE value is 0. For unsigned integer |
| data types, the TRUE value is the maximum integer value (all bits are ones) |
| and the FALSE value is zero. |
| |
| Errors |
| |
| None. |
| |
| New State |
| |
| None. |
| |
| New Implementation Dependent State |
| |
| None. |
| |
| Issues |
| |
| None |
| |
| |
| Revision History |
| |
| Rev. Date Author Changes |
| ---- -------- -------- ----------------------------------------- |
| 3 2/14/14 jbreton Rename the extension from NVX to NV. |
| 2 9/4/13 jbreton Replace mask by width in the shuffle functions. |
| 1 11/27/12 jbreton Internal revisions. |