 Name

     NV_gpu_program5_mem_extended

 Name Strings

     GL_NV_gpu_program5_mem_extended

 Contact

     Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com)

 Status

     Shipping.

 Version

     Last Modified Date:         October 30, 2012
     NVIDIA Revision:            1

 Number

     OpenGL Extension #434

 Dependencies

     NV_gpu_program5 is required.

     This extension is written against the NV_gpu_program5 extension
     specification, which itself is written against the NV_gpu_program4 and
     OpenGL 2.0 Specifications.

     This extension interacts trivially with EXT_shader_image_load_store,
     NV_shader_storage_buffer_object, and NV_compute_program5.

 Overview

     This extension provides a new set of storage modifiers that can be used by
     NV_gpu_program5 assembly program instructions loading from or storing to
     various forms of GPU memory.  In particular, we provide support for loads
     and stores using the storage modifiers:

         .F16X2  .F16X4  .F16    (for 16-bit floating-point scalars/vectors)
         .S8X2   .S8X4           (for 8-bit signed integer vectors)
         .S16X2  .S16X4          (for 16-bit signed integer vectors)
         .U8X2   .U8X4           (for 8-bit unsigned integer vectors)
         .U16X2  .U16X4          (for 16-bit unsigned integer vectors)

     These modifiers are allowed for the following load/store instructions:

         LDC             Load from constant buffer

         LOAD            Global load
         STORE           Global store

         LOADIM          Image load (via EXT_shader_image_load_store)
         STOREIM         Image store (via EXT_shader_image_load_store)

         LDB             Load from storage buffer (via
                           NV_shader_storage_buffer_object)
         STB             Store to storage buffer (via
                           NV_shader_storage_buffer_object)

         LDS             Load from shared memory (via NV_compute_program5)
         STS             Store to shared memory (via NV_compute_program5)

     Prior to this extension, assembly programs had to access such data using
     packed types and then unpack the components with additional shader
     instructions.
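
     As an illustration of the unpacking work described above, the following
     C sketch contrasts the two approaches for four unsigned bytes.  It is not
     part of the specification:  the type "uvec4_sketch" and both function
     names are hypothetical, and the packed path assumes a little-endian
     layout in which byte 0 occupies the low bits of the fetched word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical four-component result, standing in for a program register. */
typedef struct { uint32_t x, y, z, w; } uvec4_sketch;

/* Older approach: fetch one packed 32-bit word, then unpack each byte
 * with separate shift and mask steps.  The word is assembled explicitly
 * to model a little-endian fetch (an assumption of this sketch). */
static uvec4_sketch load_u32_then_unpack(const unsigned char *address)
{
    uint32_t packed;
    uvec4_sketch r;

    assert(address != NULL);
    packed = (uint32_t)address[0]
           | ((uint32_t)address[1] << 8)
           | ((uint32_t)address[2] << 16)
           | ((uint32_t)address[3] << 24);

    r.x = packed & 0xFFu;          /* extra unpack step per component */
    r.y = (packed >> 8) & 0xFFu;
    r.z = (packed >> 16) & 0xFFu;
    r.w = (packed >> 24) & 0xFFu;
    return r;
}

/* Approach modeled on a U8X4 access: each byte is read straight into
 * its own component, with no separate unpacking step. */
static uvec4_sketch load_u8x4_direct(const unsigned char *address)
{
    uvec4_sketch r;

    assert(address != NULL);
    r.x = address[0];
    r.y = address[1];
    r.z = address[2];
    r.w = address[3];
    return r;
}
```

     Both functions produce the same four components; the second simply
     avoids the shift and mask steps that a program would otherwise spend
     instructions on.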

     Similar capabilities have already been provided in the OpenGL Shading
     Language (GLSL) via the NV_gpu_shader5 extension, using the extended data
     types provided there (e.g., "float16_t", "u8vec4", "i16vec2").

 New Procedures and Functions

     None.

 New Tokens

     None.

 Additions to Chapter 2 of the OpenGL 2.0 Specification (OpenGL Operation)

     (All modifications are relative to Section 2.X, GPU Programs, from the
      NV_gpu_program4 specification.)

     Modify Section 2.X.2, Program Grammar

     (add after the long list of grammar rules) If a program specifies the
     NV_gpu_program5_mem_extended program option, the following rules are added
     to the NV_gpu_program5 base program grammar:

     <opModifier>            ::= "F16X2"
                               | "F16X4"
                               | "S8X2"
                               | "S8X4"
                               | "S16X2"
                               | "S16X4"
                               | "U8X2"
                               | "U8X4"
                               | "U16X2"
                               | "U16X4"

     (Note:  This extension also provides new capabilities for the "F16"
      modifier.  Since it was already supported in NV_gpu_program5, it isn't
      being added to the grammar here.)


     Modify Section 2.X.4.1, Program Instruction Modifiers

     (add to Table X.14 of the NV_gpu_program4 specification.)

       Modifier  Description
       --------  ---------------------------------------------------
       F16       Convert to or from one 16-bit floating-point value,
                 or access one 16-bit floating-point value

       F16X2     Access two 16-bit floating-point values
       F16X4     Access four 16-bit floating-point values
       S8X2      Access two 8-bit signed integer values
       S8X4      Access four 8-bit signed integer values
       S16X2     Access two 16-bit signed integer values
       S16X4     Access four 16-bit signed integer values
       U8X2      Access two 8-bit unsigned integer values
       U8X4      Access four 8-bit unsigned integer values
       U16X2     Access two 16-bit unsigned integer values
       U16X4     Access four 16-bit unsigned integer values

     (modify discussion of storage modifiers for load and store operations,
      adding the entries added to the table above)

     For load and store operations, the "F32", "F32X2", "F32X4", "F64",
     "F64X2", "F64X4", "S8", "S8X2", "S8X4", "S16", "S16X2", "S16X4", "S32",
     "S32X2", "S32X4", "S64", "S64X2", "S64X4", "U8", "U8X2", "U8X4", "U16",
     "U16X2", "U16X4", "U32", "U32X2", "U32X4", "U64", "U64X2", "U64X4", "F16",
     "F16X2", and "F16X4" storage modifiers control how data are loaded from or
     stored to memory. ...


     Modify Section 2.X.4.5, Program Memory Access, from NV_gpu_program5

     (update pseudocode for BufferMemoryLoad)

       result_t_vec BufferMemoryLoad(char *address, OpModifier modifier)
       {
         result_t_vec result = { 0, 0, 0, 0 };
         switch (modifier) {

         /* Existing cases and code from NV_gpu_program5 unchanged. */

         case F16:
             result.x = ((float16_t *)address)[0];
             break;
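         case F16X2:
             result.x = ((float16_t *)address)[0];
             result.y = ((float16_t *)address)[1];
             break;
         case F16X4:
             result.x = ((float16_t *)address)[0];
             result.y = ((float16_t *)address)[1];
             result.z = ((float16_t *)address)[2];
             result.w = ((float16_t *)address)[3];
             break;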
         case S8X2:
             result.x = ((int8_t *)address)[0];
             result.y = ((int8_t *)address)[1];
             break;
         case S8X4:
             result.x = ((int8_t *)address)[0];
             result.y = ((int8_t *)address)[1];
             result.z = ((int8_t *)address)[2];
             result.w = ((int8_t *)address)[3];
             break;
         case S16X2:
             result.x = ((int16_t *)address)[0];
             result.y = ((int16_t *)address)[1];
             break;
         case S16X4:
             result.x = ((int16_t *)address)[0];
             result.y = ((int16_t *)address)[1];
             result.z = ((int16_t *)address)[2];
             result.w = ((int16_t *)address)[3];
             break;
         case U8X2:
             result.x = ((uint8_t *)address)[0];
             result.y = ((uint8_t *)address)[1];
             break;
         case U8X4:
             result.x = ((uint8_t *)address)[0];
             result.y = ((uint8_t *)address)[1];
             result.z = ((uint8_t *)address)[2];
             result.w = ((uint8_t *)address)[3];
             break;
         case U16X2:
             result.x = ((uint16_t *)address)[0];
             result.y = ((uint16_t *)address)[1];
             break;
         case U16X4:
             result.x = ((uint16_t *)address)[0];
             result.y = ((uint16_t *)address)[1];
             result.z = ((uint16_t *)address)[2];
             result.w = ((uint16_t *)address)[3];
             break;
         }
         return result;
       }
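
     The load pseudocode above can be exercised on a host CPU with a sketch
     such as the following, which covers only the S8X4 and U16X2 cases.  The
     names "ivec4_sketch", "modifier_sketch", and "buffer_load_sketch" are
     hypothetical, and the result components are assumed to be 32-bit signed
     integers.  It uses memcpy rather than pointer casts so that the host
     code does not depend on alignment or aliasing rules.  As the C-style
     pseudocode suggests, signed components are sign-extended and unsigned
     components are zero-extended, while components not written by the load
     remain zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result vector with 32-bit signed components. */
typedef struct { int32_t x, y, z, w; } ivec4_sketch;

/* Hypothetical subset of the storage modifiers. */
enum modifier_sketch { MOD_S8X4, MOD_U16X2 };

static ivec4_sketch buffer_load_sketch(const unsigned char *address,
                                       enum modifier_sketch modifier)
{
    ivec4_sketch result = { 0, 0, 0, 0 };

    assert(address != NULL);
    switch (modifier) {
    case MOD_S8X4: {
        int8_t v[4];
        memcpy(v, address, sizeof v);   /* four signed bytes */
        result.x = v[0];                /* widening sign-extends */
        result.y = v[1];
        result.z = v[2];
        result.w = v[3];
        break;
    }
    case MOD_U16X2: {
        uint16_t v[2];
        memcpy(v, address, sizeof v);   /* two unsigned 16-bit values */
        result.x = v[0];                /* widening zero-extends */
        result.y = v[1];
        break;                          /* z and w stay zero */
    }
    }
    return result;
}
```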

     (update pseudocode for BufferMemoryStore)

       void BufferMemoryStore(char *address, operand_t_vec operand,
                              OpModifier modifier)
       {
         switch (modifier) {

         /* Existing cases and code from NV_gpu_program5 unchanged. */

         case F16:
             ((float16_t *)address)[0] = operand.x;
             break;
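         case F16X2:
             ((float16_t *)address)[0] = operand.x;
             ((float16_t *)address)[1] = operand.y;
             break;
         case F16X4:
             ((float16_t *)address)[0] = operand.x;
             ((float16_t *)address)[1] = operand.y;
             ((float16_t *)address)[2] = operand.z;
             ((float16_t *)address)[3] = operand.w;
             break;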
         case S8X2:
             ((int8_t *)address)[0] = operand.x;
             ((int8_t *)address)[1] = operand.y;
             break;
         case S8X4:
             ((int8_t *)address)[0] = operand.x;
             ((int8_t *)address)[1] = operand.y;
             ((int8_t *)address)[2] = operand.z;
             ((int8_t *)address)[3] = operand.w;
             break;
         case S16X2:
             ((int16_t *)address)[0] = operand.x;
             ((int16_t *)address)[1] = operand.y;
             break;
         case S16X4:
             ((int16_t *)address)[0] = operand.x;
             ((int16_t *)address)[1] = operand.y;
             ((int16_t *)address)[2] = operand.z;
             ((int16_t *)address)[3] = operand.w;
             break;
         case U8X2:
             ((uint8_t *)address)[0] = operand.x;
             ((uint8_t *)address)[1] = operand.y;
             break;
         case U8X4:
             ((uint8_t *)address)[0] = operand.x;
             ((uint8_t *)address)[1] = operand.y;
             ((uint8_t *)address)[2] = operand.z;
             ((uint8_t *)address)[3] = operand.w;
             break;
         case U16X2:
             ((uint16_t *)address)[0] = operand.x;
             ((uint16_t *)address)[1] = operand.y;
             break;
         case U16X4:
             ((uint16_t *)address)[0] = operand.x;
             ((uint16_t *)address)[1] = operand.y;
             ((uint16_t *)address)[2] = operand.z;
             ((uint16_t *)address)[3] = operand.w;
             break;
         }
       }

     (modify paragraph to indicate the alignment requirement for new storage
     modifiers) The address used for global memory loads or stores or offset
     used for constant buffer loads must be aligned to the fetch size
     corresponding to the storage opcode modifier.  For S8 and U8, the offset
     has no alignment requirements.  For F16, S8X2, S16, U8X2, and U16, the
     offset must be a multiple of two basic machine units.  For F32, S32, U32,
     F16X2, S16X2, U16X2, S8X4, and U8X4, the offset must be a multiple of
     four.  For F32X2, F64, S32X2, S64, U32X2, U64, F16X4, S16X4, and U16X4,
     the offset must be a multiple of eight.  ...  If an offset is not
     correctly aligned, the values returned by a buffer memory load will be
     undefined, and the effects of a buffer memory store will also be
     undefined.
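
     The alignment rule can be checked with a small helper like the one
     sketched below.  It assumes that the fetch size equals the component
     size in basic machine units multiplied by the component count, which
     agrees with every case listed in the paragraph above; the elided part
     of that paragraph is not reproduced here, so wider modifiers are not
     confirmed by this sketch.  Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fetch size in basic machine units for a storage modifier, described
 * by its component size and component count.  For example, S16X2 is
 * (2, 2) and U8X4 is (1, 4).  Assumption of this sketch: fetch size is
 * simply the product of the two. */
static size_t fetch_size_sketch(size_t component_bytes,
                                size_t component_count)
{
    assert(component_bytes == 1 || component_bytes == 2 ||
           component_bytes == 4 || component_bytes == 8);
    assert(component_count == 1 || component_count == 2 ||
           component_count == 4);
    return component_bytes * component_count;
}

/* Returns nonzero if the offset is a multiple of the fetch size.  A
 * misaligned offset yields undefined load results and undefined store
 * effects, so callers would normally check this before issuing the
 * access. */
static int offset_is_aligned_sketch(size_t offset,
                                    size_t component_bytes,
                                    size_t component_count)
{
    size_t size = fetch_size_sketch(component_bytes, component_count);
    return offset % size == 0;
}
```

     For instance, an offset of 6 is acceptable for S8X2 (fetch size two) but
     not for S16X2 (fetch size four).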


     Modify Section 2.X.6, Program Options

     + Extended Memory Format Support (NV_gpu_program5_mem_extended)

     If a program specifies the "NV_gpu_program5_mem_extended" option, it may
     use the "F16", "F16X2", "F16X4", "S8X2", "S8X4", "S16X2", "S16X4", "U8X2",
     "U8X4", "U16X2", and "U16X4" storage modifiers on instructions loading
     values from memory or storing values to memory (LDC, LOAD, STORE, LOADIM,
     STOREIM, LDB, STB, LDS, STS).


 Additions to Chapter 3 of the OpenGL 2.0 Specification (Rasterization)

     None.

 Additions to Chapter 4 of the OpenGL 2.0 Specification (Per-Fragment
 Operations and the Frame Buffer)

     None.

 Additions to Chapter 5 of the OpenGL 2.0 Specification (Special Functions)

     None.

 Additions to Chapter 6 of the OpenGL 2.0 Specification (State and
 State Requests)

     None.

 Additions to Appendix A of the OpenGL 2.0 Specification (Invariance)

     None.

 Additions to the AGL/GLX/WGL Specifications

     None.

 Dependencies on EXT_shader_image_load_store, NV_shader_storage_buffer_object,
 and NV_compute_program5

     If EXT_shader_image_load_store is not supported, references to the LOADIM
     and STOREIM opcodes should be removed.

     If NV_shader_storage_buffer_object is not supported, references to the LDB
     and STB opcodes should be removed.

     If NV_compute_program5 is not supported, references to the LDS and STS
     opcodes should be removed.

 Errors

     None.

 New State

     None.

 New Implementation Dependent State

     None.

 Issues

     (1) Should this extension have its own extension string entry, or should
         its existence be inferred from the NV_gpu_program5 extension or some
         other extension?

       RESOLVED:  Provide a separate extension string entry, since this
       functionality was added after NV_gpu_program5 was published and may not
       be available on older drivers supporting NV_gpu_program5.
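
       Because support is advertised through a separate string, an
       application would test for that string before using the program
       option.  The sketch below shows one way to search a space-separated
       extension list; the function name is hypothetical, and in a real
       application the list would typically come from
       glGetString(GL_EXTENSIONS).  Whole tokens are compared rather than
       substrings, since "GL_NV_gpu_program5" is a prefix of
       "GL_NV_gpu_program5_mem_extended" and a plain substring search would
       confuse the two.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns nonzero if `name` appears as a complete, space-delimited
 * token in `ext_list`. */
static int has_extension_sketch(const char *ext_list, const char *name)
{
    size_t name_len;
    const char *p;

    assert(ext_list != NULL && name != NULL);
    name_len = strlen(name);
    assert(name_len > 0);

    p = ext_list;
    while (*p != '\0') {
        /* Length of the token starting at p (up to the next space). */
        size_t token_len = strcspn(p, " ");

        if (token_len == name_len && strncmp(p, name, name_len) == 0)
            return 1;

        p += token_len;        /* skip the token */
        p += strspn(p, " ");   /* skip any separating spaces */
    }
    return 0;
}
```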

 Revision History

     Revision 1, October 30, 2012 (pbrown):  Initial revision.