blob: 8acedbf90da632da9bf2cafb85924574e6008a6d [file] [log] [blame]
Name Strings
cl_intel_thread_local_exec
Contributors
Yariv Aridor
Evgeny Fiksman
Doron Singer
Contact
Yariv.Aridor(at)Intel.Com
IP Status
N/A
Version
Version 2, August 27, 2015
Number
OpenCL Extension #16
Status
Complete.
Extension Type
OpenCL device extension
Dependencies
OpenCL 1.0 is required
Overview
This extension allows the user to execute OpenCL tasks and kernels with
the user application's threads. The extension defines a token that can
be passed to clCreateCommandQueue, creating a queue with the "thread
local exec" capability.
All enqueue APIs (e.g., clEnqueueRead) submitted to such a queue
never enqueue commands. An Enqueue API call is executed by the
caller host-thread itself without involving any of the OpenCL
runtime threads, much like function calls. The queue would typically
stay empty - the queue handler argument is used only for
compatibility of the enqueue APIs (i.e., keeping the same API
signatures) and checking, at runtime, whether the enqueue API needs
to be executed in such a fashion. Enqueue API calls on a "local"
command queue can still use event dependency lists and output
events. A non-NULL event dependency list will block the caller
application thread until all the corresponding events are completed.
Output events will be accessible only after the return of the
enqueue API call (as with regular command queues) and should always
have "completed" status (as expected). They might be useful for
querying event status and profiling data. "Local" command queues
behave the same regardless if they are defined as in-order or
out-of-order. However, with in-order queues, threads may be blocked
until execution of previously enqueued commands (by other threads)
is completed. Given an enqueue API with a blocking flag (e.g.,
clEnqueueReadBuffer), both its blocking non-blocking calls behave
the same and have the exact same impact on the application. Calls to
clFinish and clFlush, clEnqueueBarrier and clEnqueueMarker are valid
although meaningless for these command queues. An optimal
implementation of these APIs for Immediate command queues will incur
the minimal overhead of a function call and possibly an if-then-else
to distinguish between a local command queue and a regular one.
clEnqueueNDRangeCommand and clEnqueueTask should have optimized
implementations using a single execution thread.
Header File
cl_ext.h
New Procedures and Functions
None
New Tokens
CL_QUEUE_THREAD_LOCAL_EXEC_ENABLE_INTEL (((cl_bitfield)1) << 31)
Additions to Chapter 5 of the OpenCL 1.1 (v45) Specification
Add CL_QUEUE_THREAD_LOCAL_EXEC_INTEL as a new property of command
queues into table 5.1 (chapter 5)
In chapter 5.1, add the following error codes to clCreateCommandQueue:
CL_INVALID_QUEUE_PROPERTIES if the target device doesn't support the
corresponding execution mode.
And in a dedicated section to thread local execution, also in
chapter 5.1, page 55:
A host thread which fails to immediately invoke an enqueue API call
because of out of stack memory should return CL_OUT_OF_HOST_MEMORY
for this enqueue API call. The same host code should work, as is,
with and without immediate command queues, except for code changes
due to
I. Potential deadlocks. These are the same type of deadlocks which
may occur with current blocking API calls in OpenCL 1.1
II. Any optimizations which take into account the immediate
execution of commands e.g., writing to a mapped buffer immediately
after a call to clEnqueueMapBuffer without a follow-up call to
clFinish.
Interactions with other extensions
None
Issues
N/A
Sample Code
cl_device_id device_id;
cl_context context;
cl_platform_id platform = 0;
cl_int err = clGetPlatformIDs(1, &platform, NULL);
cl_context_properties prop[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0 };
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device_id, NULL);
// create context
context = clCreateContext(prop, 1, &device_id, NULL, NULL, &err);
// create program with source
cl_program program = clCreateProgramWithSource(context, 1, (const char**)&ocl_test_program, NULL, &err);
err = clBuildProgram(program, NULL, NULL, NULL, NULL, NULL);
cl_float src[IMMEDIATE_EXECUTION_GLOBAL_SIZE];
cl_float dst[IMMEDIATE_EXECUTION_GLOBAL_SIZE];
// ... Fill buffers
// Create an in-order immediate queue
cl_command_queue queue1 = clCreateCommandQueue (context, device_id, CL_QUEUE_THREAD_LOCAL_EXEC_ENABLE_INTEL, &err);
// Create Kernel
cl_kernel kernel1 = clCreateKernel(program, "copy", &err);
// Create buffers
size_t size = sizeof(cl_float);
cl_mem buffer_src = clCreateBuffer(context, CL_MEM_READ_ONLY, size * IMMEDIATE_EXECUTION_GLOBAL_SIZE, NULL, &err);
cl_mem buffer_dst = clCreateBuffer(context, CL_MEM_READ_WRITE, size * IMMEDIATE_EXECUTION_GLOBAL_SIZE, NULL, &err);
// Set arguments
err = clSetKernelArg(kernel1, 0, sizeof(cl_mem), &buffer_src);
err = clSetKernelArg(kernel1, 1, sizeof(cl_mem), &buffer_dst);
// Execute commands - Write buffers
err = clEnqueueWriteBuffer (queue1, buffer_src, false, 0, size* IMMEDIATE_EXECUTION_GLOBAL_SIZE, src, 0, NULL, NULL);
err = clEnqueueWriteBuffer (queue1, buffer_dst, false, 0, size* IMMEDIATE_EXECUTION_GLOBAL_SIZE, dst, 0, NULL, NULL);
// Execute kernel
size_t global_work_size[1] = { IMMEDIATE_EXECUTION_GLOBAL_SIZE };
size_t local_work_size[1] = { IMMEDIATE_EXECUTION_LOCAL_SIZE };
cl_event evt;
err = clEnqueueNDRangeKernel(queue1, kernel1, 1, NULL, global_work_size, local_work_size, 0, NULL, &evt);
err = clEnqueueReadBuffer (queue1, buffer_dst, CL_TRUE, 0, size*IMMEDIATE_EXECUTION_GLOBAL_SIZE, dst, 0, NULL, NULL);
// release OpenCL objects
Conformance Tests
N/A
Revision History
Version 1, March 28, 2012 - Initial revision