<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="generator" content="AsciiDoc 8.6.9">
<title>The OpenCL Specification</title>
<style type="text/css">
/* Shared CSS for AsciiDoc xhtml11 and html5 backends */
/* Default font. */
body {
font-family: Georgia,serif;
}
/* Title font. */
h1, h2, h3, h4, h5, h6,
div.title, caption.title,
thead, p.table.header,
#toctitle,
#author, #revnumber, #revdate, #revremark,
#footer {
font-family: Arial,Helvetica,sans-serif;
}
body {
margin: 1em 5% 1em 5%;
}
a {
color: blue;
text-decoration: underline;
}
a:visited {
color: fuchsia;
}
em {
font-style: italic;
color: navy;
}
strong {
font-weight: bold;
color: #083194;
}
h1, h2, h3, h4, h5, h6 {
color: #527bbd;
margin-top: 1.2em;
margin-bottom: 0.5em;
line-height: 1.3;
}
h1, h2, h3 {
border-bottom: 2px solid silver;
}
h2 {
padding-top: 0.5em;
}
h3 {
float: left;
}
h3 + * {
clear: left;
}
h5 {
font-size: 1.0em;
}
div.sectionbody {
margin-left: 0;
}
hr {
border: 1px solid silver;
}
p {
margin-top: 0.5em;
margin-bottom: 0.5em;
}
ul, ol, li > p {
margin-top: 0;
}
ul > li { color: #aaa; }
ul > li > * { color: black; }
.monospaced, code, pre {
font-family: "Courier New", Courier, monospace;
font-size: inherit;
color: navy;
padding: 0;
margin: 0;
}
pre {
white-space: pre-wrap;
}
#author {
color: #527bbd;
font-weight: bold;
font-size: 1.1em;
}
#email {
}
#revnumber, #revdate, #revremark {
}
#footer {
font-size: small;
border-top: 2px solid silver;
padding-top: 0.5em;
margin-top: 4.0em;
}
#footer-text {
float: left;
padding-bottom: 0.5em;
}
#footer-badges {
float: right;
padding-bottom: 0.5em;
}
#preamble {
margin-top: 1.5em;
margin-bottom: 1.5em;
}
div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.admonitionblock {
margin-top: 2.0em;
margin-bottom: 2.0em;
margin-right: 10%;
color: #606060;
}
div.content { /* Block element content. */
padding: 0;
}
/* Block element titles. */
div.title, caption.title {
color: #527bbd;
font-weight: bold;
text-align: left;
margin-top: 1.0em;
margin-bottom: 0.5em;
}
div.title + * {
margin-top: 0;
}
td div.title:first-child {
margin-top: 0.0em;
}
div.content div.title:first-child {
margin-top: 0.0em;
}
div.content + div.title {
margin-top: 0.0em;
}
div.sidebarblock > div.content {
background: #ffffee;
border: 1px solid #dddddd;
border-left: 4px solid #f0f0f0;
padding: 0.5em;
}
div.listingblock > div.content {
border: 1px solid #dddddd;
border-left: 5px solid #f0f0f0;
background: #f8f8f8;
padding: 0.5em;
}
div.quoteblock, div.verseblock {
padding-left: 1.0em;
margin-left: 1.0em;
margin-right: 10%;
border-left: 5px solid #f0f0f0;
color: #888;
}
div.quoteblock > div.attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock > pre.content {
font-family: inherit;
font-size: inherit;
}
div.verseblock > div.attribution {
padding-top: 0.75em;
text-align: left;
}
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
div.verseblock + div.attribution {
text-align: left;
}
div.admonitionblock .icon {
vertical-align: top;
font-size: 1.1em;
font-weight: bold;
text-decoration: underline;
color: #527bbd;
padding-right: 0.5em;
}
div.admonitionblock td.content {
padding-left: 0.5em;
border-left: 3px solid #dddddd;
}
div.exampleblock > div.content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
div.imageblock div.content { padding-left: 0; }
span.image img { border-style: none; vertical-align: text-bottom; }
a.image:visited { color: white; }
dl {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
dt {
margin-top: 0.5em;
margin-bottom: 0;
font-style: normal;
color: navy;
}
dd > *:first-child {
margin-top: 0.1em;
}
ul, ol {
list-style-position: outside;
}
ol.arabic {
list-style-type: decimal;
}
ol.loweralpha {
list-style-type: lower-alpha;
}
ol.upperalpha {
list-style-type: upper-alpha;
}
ol.lowerroman {
list-style-type: lower-roman;
}
ol.upperroman {
list-style-type: upper-roman;
}
div.compact ul, div.compact ol,
div.compact p, div.compact p,
div.compact div, div.compact div {
margin-top: 0.1em;
margin-bottom: 0.1em;
}
tfoot {
font-weight: bold;
}
td > div.verse {
white-space: pre;
}
div.hdlist {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
div.hdlist tr {
padding-bottom: 15px;
}
dt.hdlist1.strong, td.hdlist1.strong {
font-weight: bold;
}
td.hdlist1 {
vertical-align: top;
font-style: normal;
padding-right: 0.8em;
color: navy;
}
td.hdlist2 {
vertical-align: top;
}
div.hdlist.compact tr {
margin: 0;
padding-bottom: 0;
}
.comment {
background: yellow;
}
.footnote, .footnoteref {
font-size: 0.8em;
}
span.footnote, span.footnoteref {
vertical-align: super;
}
#footnotes {
margin: 20px 0 20px 0;
padding: 7px 0 0 0;
}
#footnotes div.footnote {
margin: 0 0 5px 0;
}
#footnotes hr {
border: none;
border-top: 1px solid silver;
height: 1px;
text-align: left;
margin-left: 0;
width: 20%;
min-width: 100px;
}
div.colist td {
padding-right: 0.5em;
padding-bottom: 0.3em;
vertical-align: top;
}
div.colist td img {
margin-top: 0.3em;
}
@media print {
#footer-badges { display: none; }
}
#toc {
margin-bottom: 2.5em;
}
#toctitle {
color: #527bbd;
font-size: 1.1em;
font-weight: bold;
margin-top: 1.0em;
margin-bottom: 0.1em;
}
div.toclevel0, div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
margin-top: 0;
margin-bottom: 0;
}
div.toclevel2 {
margin-left: 2em;
font-size: 0.9em;
}
div.toclevel3 {
margin-left: 4em;
font-size: 0.9em;
}
div.toclevel4 {
margin-left: 6em;
font-size: 0.9em;
}
span.aqua { color: aqua; }
span.black { color: black; }
span.blue { color: blue; }
span.fuchsia { color: fuchsia; }
span.gray { color: gray; }
span.green { color: green; }
span.lime { color: lime; }
span.maroon { color: maroon; }
span.navy { color: navy; }
span.olive { color: olive; }
span.purple { color: purple; }
span.red { color: red; }
span.silver { color: silver; }
span.teal { color: teal; }
span.white { color: white; }
span.yellow { color: yellow; }
span.aqua-background { background: aqua; }
span.black-background { background: black; }
span.blue-background { background: blue; }
span.fuchsia-background { background: fuchsia; }
span.gray-background { background: gray; }
span.green-background { background: green; }
span.lime-background { background: lime; }
span.maroon-background { background: maroon; }
span.navy-background { background: navy; }
span.olive-background { background: olive; }
span.purple-background { background: purple; }
span.red-background { background: red; }
span.silver-background { background: silver; }
span.teal-background { background: teal; }
span.white-background { background: white; }
span.yellow-background { background: yellow; }
span.big { font-size: 2em; }
span.small { font-size: 0.6em; }
span.underline { text-decoration: underline; }
span.overline { text-decoration: overline; }
span.line-through { text-decoration: line-through; }
div.unbreakable { page-break-inside: avoid; }
/*
* xhtml11 specific
*
* */
div.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.tableblock > table {
border: 3px solid #527bbd;
}
thead, p.table.header {
font-weight: bold;
color: #527bbd;
}
p.table {
margin-top: 0;
}
/* Because the table frame attribute is overridden by CSS in most browsers. */
div.tableblock > table[frame="void"] {
border-style: none;
}
div.tableblock > table[frame="hsides"] {
border-left-style: none;
border-right-style: none;
}
div.tableblock > table[frame="vsides"] {
border-top-style: none;
border-bottom-style: none;
}
/*
* html5 specific
*
* */
table.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
thead, p.tableblock.header {
font-weight: bold;
color: #527bbd;
}
p.tableblock {
margin-top: 0;
}
table.tableblock {
border-width: 3px;
border-spacing: 0px;
border-style: solid;
border-color: #527bbd;
border-collapse: collapse;
}
th.tableblock, td.tableblock {
border-width: 1px;
padding: 4px;
border-style: solid;
border-color: #527bbd;
}
table.tableblock.frame-topbot {
border-left-style: hidden;
border-right-style: hidden;
}
table.tableblock.frame-sides {
border-top-style: hidden;
border-bottom-style: hidden;
}
table.tableblock.frame-none {
border-style: hidden;
}
th.tableblock.halign-left, td.tableblock.halign-left {
text-align: left;
}
th.tableblock.halign-center, td.tableblock.halign-center {
text-align: center;
}
th.tableblock.halign-right, td.tableblock.halign-right {
text-align: right;
}
th.tableblock.valign-top, td.tableblock.valign-top {
vertical-align: top;
}
th.tableblock.valign-middle, td.tableblock.valign-middle {
vertical-align: middle;
}
th.tableblock.valign-bottom, td.tableblock.valign-bottom {
vertical-align: bottom;
}
/*
* manpage specific
*
* */
body.manpage h1 {
padding-top: 0.5em;
padding-bottom: 0.5em;
border-top: 2px solid silver;
border-bottom: 2px solid silver;
}
body.manpage h2 {
border-style: none;
}
body.manpage div.sectionbody {
margin-left: 3em;
}
@media print {
body.manpage div#toc { display: none; }
}
@media screen {
body {
max-width: 50em; /* approximately 80 characters wide */
margin-left: 16em;
}
#toc {
position: fixed;
top: 0;
left: 0;
bottom: 0;
width: 13em;
padding: 0.5em;
padding-bottom: 1.5em;
margin: 0;
overflow: auto;
border-right: 3px solid #f8f8f8;
background-color: white;
}
#toc .toclevel1 {
margin-top: 0.5em;
}
#toc .toclevel2 {
margin-top: 0.25em;
display: list-item;
color: #aaaaaa;
}
#toctitle {
margin-top: 0.5em;
}
}
</style>
<script type="text/javascript">
/*<![CDATA[*/
var asciidoc = { // Namespace.
/////////////////////////////////////////////////////////////////////
// Table Of Contents generator
/////////////////////////////////////////////////////////////////////
/* Author: Mihai Bazon, September 2002
* http://students.infoiasi.ro/~mishoo
*
* Table Of Content generator
* Version: 0.4
*
* Feel free to use this script under the terms of the GNU General Public
* License, as long as you do not remove or alter this notice.
*/
/* modified by Troy D. Hanson, September 2006. License: GPL */
/* modified by Stuart Rackham, 2006, 2009. License: GPL */
// toclevels = 1..4.
toc: function (toclevels) {
function getText(el) {
var text = "";
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants.
text += i.data;
else if (i.firstChild != null)
text += getText(i);
}
return text;
}
function TocEntry(el, text, toclevel) {
this.element = el;
this.text = text;
this.toclevel = toclevel;
}
function tocEntries(el, toclevels) {
var result = new Array;
var re = new RegExp('[hH]([1-'+(toclevels+1)+'])');
// Function that scans the DOM tree for header elements (the DOM2
// nodeIterator API would be a better technique but not supported by all
// browsers).
var iterate = function (el) {
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
var mo = re.exec(i.tagName);
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
}
iterate(i);
}
}
}
iterate(el);
return result;
}
var toc = document.getElementById("toc");
if (!toc) {
return;
}
// Delete existing TOC entries in case we're reloading the TOC.
var tocEntriesToRemove = [];
var i;
for (i = 0; i < toc.childNodes.length; i++) {
var entry = toc.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div'
&& entry.getAttribute("class")
&& entry.getAttribute("class").match(/^toclevel/))
tocEntriesToRemove.push(entry);
}
for (i = 0; i < tocEntriesToRemove.length; i++) {
toc.removeChild(tocEntriesToRemove[i]);
}
// Rebuild TOC entries.
var entries = tocEntries(document.getElementById("content"), toclevels);
for (var i = 0; i < entries.length; ++i) {
var entry = entries[i];
if (entry.element.id == "")
entry.element.id = "_toc_" + i;
var a = document.createElement("a");
a.href = "#" + entry.element.id;
a.appendChild(document.createTextNode(entry.text));
var div = document.createElement("div");
div.appendChild(a);
div.className = "toclevel" + entry.toclevel;
toc.appendChild(div);
}
if (entries.length == 0)
toc.parentNode.removeChild(toc);
},
/////////////////////////////////////////////////////////////////////
// Footnotes generator
/////////////////////////////////////////////////////////////////////
/* Based on footnote generation code from:
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
*/
footnotes: function () {
// Delete existing footnote entries in case we're reloading the footnodes.
var i;
var noteholder = document.getElementById("footnotes");
if (!noteholder) {
return;
}
var entriesToRemove = [];
for (i = 0; i < noteholder.childNodes.length; i++) {
var entry = noteholder.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote")
entriesToRemove.push(entry);
}
for (i = 0; i < entriesToRemove.length; i++) {
noteholder.removeChild(entriesToRemove[i]);
}
// Rebuild footnote entries.
var cont = document.getElementById("content");
var spans = cont.getElementsByTagName("span");
var refs = {};
var n = 0;
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnote") {
n++;
var note = spans[i].getAttribute("data-note");
if (!note) {
// Use [\s\S] in place of . so multi-line matches work.
// Because JavaScript has no s (dotall) regex flag.
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
spans[i].innerHTML =
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
spans[i].setAttribute("data-note", note);
}
noteholder.innerHTML +=
"<div class='footnote' id='_footnote_" + n + "'>" +
"<a href='#_footnoteref_" + n + "' title='Return to text'>" +
n + "</a>. " + note + "</div>";
var id =spans[i].getAttribute("id");
if (id != null) refs["#"+id] = n;
}
}
if (n == 0)
noteholder.parentNode.removeChild(noteholder);
else {
// Process footnoterefs.
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnoteref") {
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
href = href.match(/#.*/)[0]; // Because IE return full URL.
n = refs[href];
spans[i].innerHTML =
"[<a href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
}
}
}
},
install: function(toclevels) {
var timerId;
function reinstall() {
asciidoc.footnotes();
if (toclevels) {
asciidoc.toc(toclevels);
}
}
function reinstallAndRemoveTimer() {
clearInterval(timerId);
reinstall();
}
timerId = setInterval(reinstall, 500);
if (document.addEventListener)
document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false);
else
window.onload = reinstallAndRemoveTimer;
}
}
asciidoc.install(3);
/*]]>*/
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
MathML: { extensions: ["content-mathml.js"] },
tex2jax: { inlineMath: [['$','$'], ['\\(','\\)']] }
});
</script>
<script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
</head>
<body class="book">
<div id="header">
<h1>The OpenCL Specification</h1>
<span id="author">Khronos OpenCL Working Group</span><br>
<span id="revnumber">version v2.2-3</span>
<div id="toc">
<div id="toctitle">Table of Contents</div>
<noscript><p><b>JavaScript must be enabled in your browser to display the table of contents.</b></p></noscript>
</div>
</div>
<div id="content">
<div id="preamble">
<div class="sectionbody">
<div class="paragraph"><p>Copyright 2008-2017 The Khronos Group.</p></div>
<div class="paragraph"><p>This specification is protected by copyright laws and contains material proprietary
to the Khronos Group, Inc. Except as described by these terms, it or any components
may not be reproduced, republished, distributed, transmitted, displayed, broadcast
or otherwise exploited in any manner without the express prior written permission
of Khronos Group.</p></div>
<div class="paragraph"><p>Khronos Group grants a conditional copyright license to use and reproduce the
unmodified specification for any purpose, without fee or royalty, EXCEPT no licenses
to any patent, trademark or other intellectual property rights are granted under
these terms. Parties desiring to implement the specification and make use of
Khronos trademarks in relation to that implementation, and receive reciprocal patent
license protection under the Khronos IP Policy must become Adopters and confirm the
implementation as conformant under the process defined by Khronos for this
specification; see <a href="https://www.khronos.org/adopters">https://www.khronos.org/adopters</a>.</p></div>
<div class="paragraph"><p>Khronos Group makes no, and expressly disclaims any, representations or warranties,
express or implied, regarding this specification, including, without limitation:
merchantability, fitness for a particular purpose, non-infringement of any
intellectual property, correctness, accuracy, completeness, timeliness, and
reliability. Under no circumstances will the Khronos Group, or any of its Promoters,
Contributors or Members, or their respective partners, officers, directors,
employees, agents or representatives be liable for any damages, whether direct,
indirect, special or consequential damages for lost revenues, lost profits, or
otherwise, arising from or in connection with these materials.</p></div>
<div class="paragraph"><p>Vulkan is a registered trademark and Khronos, OpenXR, SPIR, SPIR-V, SYCL, WebGL,
WebCL, OpenVX, OpenVG, EGL, COLLADA, glTF, NNEF, OpenKODE, OpenKCAM, StreamInput,
OpenWF, OpenSL ES, OpenMAX, OpenMAX AL, OpenMAX IL, OpenMAX DL, OpenML and DevU are
trademarks of the Khronos Group Inc. ASTC is a trademark of ARM Holdings PLC,
OpenCL is a trademark of Apple Inc. and OpenGL and OpenML are registered trademarks
and the OpenGL ES and OpenGL SC logos are trademarks of Silicon Graphics
International used under license by Khronos. All other product names, trademarks,
and/or company names are used solely for identification and belong to their
respective owners.</p></div>
<div style="page-break-after:always"></div>
<div class="paragraph"><p><strong>Acknowledgements</strong></p></div>
<div class="paragraph"><p>The OpenCL specification is the result of the contributions of many
people, representing a cross section of the desktop, hand-held, and
embedded computer industry. Following is a partial list of the
contributors, including the company that they represented at the time of
their contribution:</p></div>
<div class="paragraph"><p>Chuck Rose, Adobe<br>
Eric Berdahl, Adobe<br>
Shivani Gupta, Adobe<br>
Bill Licea Kane, AMD<br>
Ed Buckingham, AMD<br>
Jan Civlin, AMD<br>
Laurent Morichetti, AMD<br>
Mark Fowler, AMD<br>
Marty Johnson, AMD<br>
Michael Mantor, AMD<br>
Norm Rubin, AMD<br>
Ofer Rosenberg, AMD<br>
Brian Sumner, AMD<br>
Victor Odintsov, AMD<br>
Aaftab Munshi, Apple<br>
Abe Stephens, Apple<br>
Alexandre Namaan, Apple<br>
Anna Tikhonova, Apple<br>
Chendi Zhang, Apple<br>
Eric Bainville, Apple<br>
David Hayward, Apple<br>
Giridhar Murthy, Apple<br>
Ian Ollmann, Apple<br>
Inam Rahman, Apple<br>
James Shearer, Apple<br>
MonPing Wang, Apple<br>
Tanya Lattner, Apple<br>
Mikael Bourges-Sevenier, Aptina<br>
Anton Lokhmotov, ARM<br>
Dave Shreiner, ARM<br>
Hedley Francis, ARM<br>
Robert Elliott, ARM<br>
Scott Moyers, ARM<br>
Tom Olson, ARM<br>
Anastasia Stulova, ARM<br>
Christopher Thompson-Walsh, Broadcom<br>
Holger Waechtler, Broadcom<br>
Norman Rink, Broadcom<br>
Andrew Richards, Codeplay<br>
Maria Rovatsou, Codeplay<br>
Alistair Donaldson, Codeplay<br>
Alastair Murray, Codeplay<br>
Stephen Frye, Electronic Arts<br>
Eric Schenk, Electronic Arts<br>
Daniel Laroche, Freescale<br>
David Neto, Google<br>
Robin Grosman, Huawei<br>
Craig Davies, Huawei<br>
Brian Horton, IBM<br>
Brian Watt, IBM<br>
Gordon Fossum, IBM<br>
Greg Bellows, IBM<br>
Joaquin Madruga, IBM<br>
Mark Nutter, IBM<br>
Mike Perks, IBM<br>
Sean Wagner, IBM<br>
Jon Parr, Imagination Technologies<br>
Robert Quill, Imagination Technologies<br>
James McCarthy, Imagination Technologies<br>
Aaron Kunze, Intel<br>
Aaron Lefohn, Intel<br>
Adam Lake, Intel<br>
Alexey Bader, Intel<br>
Allen Hux, Intel<br>
Andrew Brownsword, Intel<br>
Andrew Lauritzen, Intel<br>
Bartosz Sochacki, Intel<br>
Ben Ashbaugh, Intel<br>
Brian Lewis, Intel<br>
Geoff Berry, Intel<br>
Hong Jiang, Intel<br>
Jayanth Rao, Intel<br>
Josh Fryman, Intel<br>
Larry Seiler, Intel<br>
Mike MacPherson, Intel<br>
Murali Sundaresan, Intel<br>
Paul Lalonde, Intel<br>
Raun Krisch, Intel<br>
Stephen Junkins, Intel<br>
Tim Foley, Intel<br>
Timothy Mattson, Intel<br>
Yariv Aridor, Intel<br>
Michael Kinsner, Intel<br>
Kevin Stevens, Intel<br>
Jon Leech, Khronos<br>
Benjamin Bergen, Los Alamos National Laboratory<br>
Roy Ju, Mediatek<br>
Bor-Sung Liang, Mediatek<br>
Rahul Agarwal, Mediatek<br>
Michal Witaszek, Mobica<br>
JenqKuen Lee, NTHU<br>
Amit Rao, NVIDIA<br>
Ashish Srivastava, NVIDIA<br>
Bastiaan Aarts, NVIDIA<br>
Chris Cameron, NVIDIA<br>
Christopher Lamb, NVIDIA<br>
Dibyapran Sanyal, NVIDIA<br>
Guatam Chakrabarti, NVIDIA<br>
Ian Buck, NVIDIA<br>
Jaydeep Marathe, NVIDIA<br>
Jian-Zhong Wang, NVIDIA<br>
Karthik Raghavan Ravi, NVIDIA<br>
Kedar Patil, NVIDIA<br>
Manjunath Kudlur, NVIDIA<br>
Mark Harris, NVIDIA<br>
Michael Gold, NVIDIA<br>
Neil Trevett, NVIDIA<br>
Richard Johnson, NVIDIA<br>
Sean Lee, NVIDIA<br>
Tushar Kashalikar, NVIDIA<br>
Vinod Grover, NVIDIA<br>
Xiangyun Kong, NVIDIA<br>
Yogesh Kini, NVIDIA<br>
Yuan Lin, NVIDIA<br>
Mayuresh Pise, NVIDIA<br>
Allan Tzeng, QUALCOMM<br>
Alex Bourd, QUALCOMM<br>
Anirudh Acharya, QUALCOMM<br>
Andrew Gruber, QUALCOMM<br>
Andrzej Mamona, QUALCOMM<br>
Benedict Gaster, QUALCOMM<br>
Bill Torzewski, QUALCOMM<br>
Bob Rychlik, QUALCOMM<br>
Chihong Zhang, QUALCOMM<br>
Chris Mei, QUALCOMM<br>
Colin Sharp, QUALCOMM<br>
David Garcia, QUALCOMM<br>
David Ligon, QUALCOMM<br>
Jay Yun, QUALCOMM<br>
Lee Howes, QUALCOMM<br>
Richard Ruigrok, QUALCOMM<br>
Robert J. Simpson, QUALCOMM<br>
Sumesh Udayakumaran, QUALCOMM<br>
Vineet Goel, QUALCOMM<br>
Lihan Bin, QUALCOMM<br>
Vlad Shimanskiy, QUALCOMM<br>
Jian Liu, QUALCOMM<br>
Tasneem Brutch, Samsung<br>
Yoonseo Choi, Samsung<br>
Dennis Adams, Sony<br>
Pär-Anders Aronsson, Sony<br>
Jim Rasmusson, Sony<br>
Thierry Lepley, STMicroelectronics<br>
Anton Gorenko, StreamComputing<br>
Jakub Szuppe, StreamComputing<br>
Vincent Hindriksen, StreamComputing<br>
Alan Ward, Texas Instruments<br>
Yuan Zhao, Texas Instruments<br>
Pete Curry, Texas Instruments<br>
Simon McIntosh-Smith, University of Bristol<br>
James Price, University of Bristol<br>
Paul Preney, University of Windsor<br>
Shane Peelar, University of Windsor<br>
Brian Hutsell, Vivante<br>
Mike Cai, Vivante<br>
Sumeet Kumar, Vivante<br>
Wei-Lun Kao, Vivante<br>
Xing Wang, Vivante<br>
Jeff Fifield, Xilinx<br>
Hem C. Neema, Xilinx<br>
Henry Styles, Xilinx<br>
Ralph Wittig, Xilinx<br>
Ronan Keryell, Xilinx<br>
AJ Guillon, YetiWare Inc<br></p></div>
<div style="page-break-after:always"></div>
</div>
</div>
<div class="sect1">
<h2 id="_introduction">1. Introduction</h2>
<div class="sectionbody">
<div class="paragraph"><p>Modern processor architectures have embraced parallelism as an important
pathway to increased performance. Facing technical challenges with
higher clock speeds in a fixed power envelope, Central Processing Units
(CPUs) now improve performance by adding multiple cores. Graphics
Processing Units (GPUs) have also evolved from fixed function rendering
devices into programmable parallel processors. As today's computer
systems often include highly parallel CPUs, GPUs and other types of
processors, it is important to enable software developers to take full
advantage of these heterogeneous processing platforms.
<br>
<br>
Creating applications for heterogeneous parallel processing platforms is
challenging as traditional programming approaches for multi-core CPUs
and GPUs are very different. CPU-based parallel programming models are
typically based on standards but usually assume a shared address space
and do not encompass vector operations. General purpose GPU
programming models address complex memory hierarchies and vector
operations but are traditionally platform-, vendor- or
hardware-specific. These limitations make it difficult for a developer
to access the compute power of heterogeneous CPUs, GPUs and other types
of processors from a single, multi-platform source code base. More than
ever, there is a need to enable software developers to effectively take
full advantage of heterogeneous processing platforms, from high
performance compute servers through desktop computer systems to handheld
devices, that include a diverse mix of parallel CPUs, GPUs and other
processors such as DSPs and the Cell/B.E. processor.
<br>
<br>
<strong>OpenCL</strong> (Open Computing Language) is an open royalty-free standard for
general purpose parallel programming across CPUs, GPUs and other
processors, giving software developers portable and efficient access to
the power of these heterogeneous processing platforms.
<br>
<br>
OpenCL supports a wide range of applications, ranging from embedded and
consumer software to HPC solutions, through a low-level,
high-performance, portable abstraction. By creating an efficient,
close-to-the-metal programming interface, OpenCL will form the
foundation layer of a parallel computing ecosystem of
platform-independent tools, middleware and applications. OpenCL is
particularly suited to play an increasingly significant role in emerging
interactive graphics applications that combine general parallel compute
algorithms with graphics rendering pipelines.
<br>
<br>
OpenCL consists of an API for coordinating parallel computation across
heterogeneous processors; and a cross-platform intermediate language
with a well-specified computation environment. The OpenCL standard:</p></div>
<div class="ulist"><ul>
<li>
<p>
Supports both data- and
task-based parallel programming models
</p>
</li>
<li>
<p>
Utilizes a portable and
self-contained intermediate representation with support for parallel
execution
</p>
</li>
<li>
<p>
Defines consistent
numerical requirements based on IEEE 754
</p>
</li>
<li>
<p>
Defines a configuration
profile for handheld and embedded devices
</p>
</li>
<li>
<p>
Efficiently interoperates
with OpenGL, OpenGL ES and other graphics APIs
</p>
</li>
</ul></div>
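To give a flavor of the host-side API mentioned above, the sketch below queries the first available platform and its default device and prints the device name. This is an illustrative minimal program, not part of the specification; it assumes the OpenCL headers and an ICD loader are installed, and omits the error handling a real application would need.

```c
#include <stdio.h>
#include <CL/cl.h>   /* on macOS: #include <OpenCL/opencl.h> */

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    char name[256];

    /* Take the first available platform and its default device. */
    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS)
        return 1;
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1,
                       &device, NULL) != CL_SUCCESS)
        return 1;

    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("default device: %s\n", name);
    return 0;
}
```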
<div class="paragraph"><p>This document begins with an overview of basic concepts and the
architecture of OpenCL, followed by a detailed description of its
execution model, memory model and synchronization support. It then
discusses the OpenCL platform and runtime API. Some examples are given
that describe sample compute use-cases and how they would be written in
OpenCL. The specification is divided into a core specification that any
OpenCL compliant implementation must support; a handheld/embedded
profile which relaxes the OpenCL compliance requirements for handheld
and embedded devices; and a set of optional extensions that are likely
to move into the core specification in later revisions of the OpenCL
specification.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_glossary">2. Glossary</h2>
<div class="sectionbody">
<div class="paragraph"><p><strong>Application</strong>: The combination of the program running on the host and
OpenCL devices.
<br>
<br>
<strong>Acquire semantics</strong>: One of the memory order semantics defined for
synchronization operations.  Acquire semantics apply to atomic
operations that load from memory.  Given two units of execution, <strong>A</strong> and
<strong>B</strong>, acting on a shared atomic object <strong>M</strong>, if <strong>A</strong> uses an atomic load of
<strong>M</strong> with acquire semantics to synchronize-with an atomic store to <strong>M</strong> by
<strong>B</strong> that used release semantics, then <strong>A</strong>'s atomic load will occur before
any subsequent operations by <strong>A</strong>.  Note that the memory orders
<em>release</em>, <em>sequentially consistent</em>, and <em>acquire_release</em> all include
<em>release semantics</em> and effectively pair with a load using acquire
semantics.
<br>
<br>
<strong>Acquire release semantics</strong>: A memory order semantics for
synchronization operations (such as atomic operations) that has the
properties of both acquire and release memory orders. It is used with
read-modify-write operations.
<br>
<br>
<strong>Atomic operations</strong>: Operations that at any point, and from any
perspective, have either occurred completely, or not at all. Memory
orders associated with atomic operations may constrain the visibility of
loads and stores with respect to the atomic operations (see <em>relaxed
semantics</em>, <em>acquire semantics</em>, <em>release semantics</em> or <em>acquire release
semantics</em>).
<br>
<br>
<strong>Blocking and Non-Blocking Enqueue API calls</strong>: A <em>non-blocking enqueue
API call</em> places a <em>command</em> on a <em>command-queue</em> and returns
immediately to the host. The <em>blocking-mode enqueue API calls</em> do not
return to the host until the command has completed.
<br>
<br>
<strong>Barrier</strong>: There are three types of <em>barriers</em>: a command-queue barrier,
a work-group barrier and a sub-group barrier.</p></div>
<div class="ulist"><ul>
<li>
<p>
The OpenCL API provides a
function to enqueue a <em>command-queue</em> <em>barrier</em> command. This <em>barrier</em>
command ensures that all previously enqueued commands to a command-queue
have finished execution before any following <em>commands</em> enqueued in the
<em>command-queue</em> can begin execution.
</p>
</li>
<li>
<p>
The OpenCL kernel
execution model provides built-in <em>work-group barrier</em> functionality.
This <em>barrier</em> built-in function can be used by a <em>kernel</em> executing on
a <em>device</em> to perform synchronization between <em>work-items</em> in a
<em>work-group</em> executing the <em>kernel</em>. All the <em>work-items</em> of a
<em>work-group</em> must execute the <em>barrier</em> construct before any are allowed
to continue execution beyond the <em>barrier</em>.
</p>
</li>
<li>
<p>
The OpenCL kernel
execution model provides built-in <em>sub-group barrier</em> functionality.
This <em>barrier</em> built-in function can be used by a <em>kernel</em> executing on
a <em>device</em> to perform synchronization between <em>work-items</em> in a
<em>sub-group</em> executing the <em>kernel</em>. All the <em>work-items</em> of a
<em>sub-group</em> must execute the <em>barrier</em> construct before any are allowed
to continue execution beyond the <em>barrier</em>.
</p>
</li>
</ul></div>
<div class="paragraph"><p><strong>Buffer Object</strong>: A memory object that stores a linear collection of
bytes. Buffer objects are accessible using a pointer in a <em>kernel</em>
executing on a <em>device</em>. Buffer objects can be manipulated by the host
using OpenCL API calls. A <em>buffer object</em> encapsulates the following
information:</p></div>
<div class="ulist"><ul>
<li>
<p>
Size in bytes.
</p>
</li>
<li>
<p>
Properties that describe
usage information and which region to allocate from.
</p>
</li>
<li>
<p>
Buffer data.
</p>
</li>
</ul></div>
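To make the size/properties/data triple concrete, the hedged sketch below creates a 4 KiB read-only buffer initialized from host memory with <code>clCreateBuffer</code>. Error handling is abbreviated, and an installed OpenCL implementation with at least one device is assumed.

```c
#include <stdio.h>
#include <CL/cl.h>   /* on macOS: #include <OpenCL/opencl.h> */

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    /* Size: 4096 bytes; properties: read-only, copied from host;
     * data: the contents of host_data at creation time. */
    float host_data[1024] = {0};
    cl_mem buf = clCreateBuffer(ctx,
                                CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(host_data), host_data, &err);
    if (err == CL_SUCCESS)
        printf("buffer of %zu bytes created\n", sizeof(host_data));

    clReleaseMemObject(buf);
    clReleaseContext(ctx);
    return 0;
}
```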
<div class="paragraph"><p><strong>Built-in Kernel</strong>: A <em>built-in kernel</em> is a <em>kernel</em> that is executed on
an OpenCL <em>device</em> or <em>custom device</em> by fixed-function hardware or in
firmware. <em>Applications</em> can query the <em>built-in kernels</em> supported by
a <em>device</em> or <em>custom device</em>. A <em>program object</em> can only contain
<em>kernels</em> written in OpenCL C or <em>built-in kernels</em> but not both. See
also <em>Kernel</em> and <em>Program</em>.
<br>
<br>
<strong>Child kernel</strong>: see <em>device-side enqueue.</em>
<br>
<br>
<strong>Command</strong>: The OpenCL operations that are submitted to a <em>command-queue</em>
for execution. For example, OpenCL commands issue kernels for execution
on a compute device, manipulate memory objects, etc.
<br>
<br>
<strong>Command-queue</strong>: An object that holds <em>commands</em> that will be executed on
a specific <em>device</em>. The <em>command-queue</em> is created on a specific
<em>device</em> in a <em>context</em>. <em>Commands</em> to a <em>command-queue</em> are queued
in-order but may be executed in-order or out-of-order. Refer to
<em>In-order Execution</em> and <em>Out-of-order Execution</em>.
<br>
<br>
<strong>Command-queue Barrier</strong>. See <em>Barrier</em>.
<br>
<br>
<strong>Command synchronization</strong>: Constraints on the order that commands are
launched for execution on a device defined in terms of the
synchronization points that occur between commands in host
command-queues and between commands in device-side command-queues. See
<em>synchronization points</em>.
<br>
<br>
<strong>Complete</strong>: The final state in the six state model for the execution of
a command. The transition into this state is signaled through
event objects or callback functions associated with a command.
<br>
<br>
<strong>Compute Device Memory</strong>: This refers to one or more memories attached
to the compute device.
<br>
<br>
<strong>Compute Unit</strong>: An OpenCL <em>device</em> has one or more <em>compute units</em>. A
<em>work-group</em> executes on a single <em>compute unit</em>. A <em>compute unit</em> is
composed of one or more <em>processing elements</em> and <em>local memory</em>. A
<em>compute unit</em> may also include dedicated texture filter units that can
be accessed by its processing elements.
<br>
<br>
<strong>Concurrency</strong>: A property of a system in which a set of tasks in a system
can remain active and make progress at the same time. To utilize
concurrent execution when running a program, a programmer must identify
the concurrency in their problem, expose it within the source code, and
then exploit it using a notation that supports concurrency.
<br>
<br>
<strong>Constant Memory</strong>: A region of <em>global memory</em> that remains constant
during the execution of a <em>kernel</em>. The <em>host</em> allocates and
initializes memory objects placed into <em>constant memory</em>.</p></div>
<div class="paragraph"><p><strong>Context</strong>: The environment within which the kernels execute and the
domain in which synchronization and memory management is defined. The
<em>context</em> includes a set of <em>devices</em>, the memory accessible to those
<em>devices</em>, the corresponding memory properties and one or more
<em>command-queues</em> used to schedule execution of a <em>kernel(s)</em> or
operations on <em>memory objects</em>.
<br>
<br>
<strong>Control flow</strong>: The flow of instructions executed by a work-item.
Multiple logically related work-items may or may not execute the same
control flow. The control flow is said to be <em>converged</em> if all the
work-items in the set execute the same stream of instructions. In a
<em>diverged</em> control flow, the work-items in the set execute different
instructions. At a later point, if a diverged control flow becomes
converged, it is said to be a re-converged control flow.
<br>
<br>
<strong>Converged control flow</strong>: see <strong>control flow</strong>.
<br>
<br>
<strong>Custom Device</strong>: An OpenCL <em>device</em> that fully implements the OpenCL
Runtime but does not support <em>programs</em> written in OpenCL C.  A custom
device may be specialized non-programmable hardware that is very power
efficient and performant for directed tasks or hardware with limited
programmable capabilities such as specialized DSPs. Custom devices are
not OpenCL conformant. Custom devices may support an online compiler.  
Programs for custom devices can be created using the OpenCL runtime APIs
that allow OpenCL programs to be created from source (if an online
compiler is supported) and/or binary, or from <em>built-in kernels</em>
supported by the <em>device</em>. See also <em>Device</em>.
<br>
<br>
<strong>Data Parallel Programming Model</strong>: Traditionally, this term refers to a
programming model where concurrency is expressed as instructions from a
single program applied to multiple elements within a set of data
structures.  The term has been generalized in OpenCL to refer to a model
wherein a set of instructions from a single program are applied
concurrently to each point within an abstract domain of indices.
<br>
<br>
<strong>Data race</strong>: The execution of a program contains a data race if it
contains two actions in different work-items or host threads where (1)
one action modifies a memory location and the other action reads or
modifies the same memory location, and (2) at least one of these actions
is not atomic, or the corresponding memory scopes are not inclusive, and
(3) the actions are global actions unordered by the
global-happens-before relation or are local actions unordered by the
local-happens-before relation.
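The three conditions can be encoded directly. This is an illustrative checker, not part of any OpenCL API; the action fields are hypothetical:

```python
# Illustrative encoding of the three-part data-race definition above.
# An "action" is a dict with hypothetical fields: unit (work-item or
# host-thread id), kind ("read"/"modify"), location, and atomic flag.

def is_data_race(a, b, scopes_inclusive, ordered_by_happens_before):
    different_units = a["unit"] != b["unit"]
    same_location = a["location"] == b["location"]
    one_modifies = "modify" in (a["kind"], b["kind"])          # condition (1)
    weakly_atomic = (not (a["atomic"] and b["atomic"])
                     or not scopes_inclusive)                  # condition (2)
    unordered = not ordered_by_happens_before                  # condition (3)
    return (different_units and same_location and one_modifies
            and weakly_atomic and unordered)

w = {"unit": 0, "kind": "modify", "location": "x", "atomic": False}
r = {"unit": 1, "kind": "read",   "location": "x", "atomic": False}
assert is_data_race(w, r, scopes_inclusive=False, ordered_by_happens_before=False)
```

Two atomic actions with inclusive scopes, or any pair ordered by the relevant happens-before relation, do not race under this definition.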
<br>
<br>
<strong>Deprecation</strong>: Existing features are marked as deprecated if their usage is not recommended because that feature is being de-emphasized or superseded and may be removed from a future version of the specification.
<br>
<br>
<strong>Device</strong>: A <em>device</em> is a collection of <em>compute units</em>. A
<em>command-queue</em> is used to queue <em>commands</em> to a <em>device</em>. Examples of
<em>commands</em> include executing <em>kernels</em>, or reading and writing <em>memory
objects</em>. OpenCL devices typically correspond to a GPU, a multi-core
CPU, and other processors such as DSPs and the Cell/B.E. processor.
<br>
<br>
<strong>Device-side enqueue</strong>: A mechanism whereby a kernel-instance is enqueued
by a kernel-instance running on a device without direct involvement by
the host program. This produces <em>nested parallelism</em>; i.e. additional
levels of concurrency are nested inside a running kernel-instance. The
kernel-instance executing on a device (the <em>parent kernel</em>) enqueues a
kernel-instance (the <em>child kernel</em>) to a device-side command queue.
Child and parent kernels execute asynchronously though a parent kernel
does not complete until all of its child-kernels have completed.
<br>
<br>
<strong>Diverged control flow</strong>: see <em>control flow</em>.
<br>
<br>
<strong>Ended</strong>: The fifth state in the six state model for the execution of a
command. The transition into this state occurs when execution of a
command has ended. When a Kernel-enqueue command ends, all of the
work-groups associated with that command have finished their execution.
<br>
<br>
<strong>Event Object</strong>: An <em>event object</em> encapsulates the status of an
operation such as a <em>command</em>. It can be used to synchronize operations
in a context.
<br>
<br>
<strong>Event Wait List</strong>: An <em>event wait list</em> is a list of <em>event objects</em> that
can be used to control when a particular <em>command</em> begins execution.
<br>
<br>
<strong>Fence</strong>: A memory ordering operation without an associated atomic
object. A fence can use the <em>acquire semantics, release semantics</em>, or
<em>acquire release semantics</em>.
<br>
<br>
<strong>Framework</strong>: A software system that contains the set of components to
support software development and execution. A <em>framework</em> typically
includes libraries, APIs, runtime systems, compilers, etc.
<br>
<br>
<strong>Generic address space</strong>: An address space that includes the <em>private</em>,
<em>local</em>, and <em>global</em> address spaces available to a device. The generic
address space supports conversion of pointers to and from private, local
and global address spaces, and hence lets a programmer write a single
function that at compile time can take arguments from any of the three
named address spaces.
<br>
<br>
<strong>Global Happens before</strong>: see <em>happens before</em>.
<br>
<br>
<strong>Global ID</strong>: A <em>global ID</em> is used to uniquely identify a <em>work-item</em> and
is derived from the number of <em>global work-items</em> specified when
executing a <em>kernel</em>. The <em>global ID</em> is an N-dimensional value that
starts at (0, 0, 0). See also <em>Local ID</em>.
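In one dimension, the global ID relates to the work-group ID and local ID as sketched below; this is the relation OpenCL C's get_global_id() satisfies, though the helper itself is illustrative:

```python
# How a work-item's global ID relates to its work-group ID and local ID
# in one dimension; global_offset defaults to 0, as it does when no
# offset is specified at enqueue time.

def global_id(group_id, local_size, local_id, global_offset=0):
    return global_offset + group_id * local_size + local_id

# Work-item 3 of work-group 2, with 64 work-items per group:
assert global_id(group_id=2, local_size=64, local_id=3) == 131
```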
<br>
<br>
<strong>Global Memory</strong>: A memory region accessible to all <em>work-items</em> executing
in a <em>context</em>. It is accessible to the <em>host</em> using <em>commands</em> such as
read, write and map. <em>Global memory</em> is included within the <em>generic
address space</em> that includes the private and local address spaces.
<br>
<br>
<strong>GL share group</strong>: A <em>GL share group</em> object manages shared OpenGL or
OpenGL ES resources
such as textures, buffers, framebuffers, and renderbuffers and is
associated with one or more GL context objects. The <em>GL share group</em> is
typically an opaque object and not directly accessible.
<br>
<br>
<strong>Handle</strong>: An opaque type that references an <em>object</em> allocated by
OpenCL. Any operation on an <em>object</em> occurs by reference to that
object's handle.
<br>
<br>
<strong>Happens before</strong>: An ordering relationship between operations that
execute on multiple units of execution. If an operation A happens-before
operation B then A must occur before B; in particular, any value written
by A will be visible to B. We define two separate happens-before
relations: <em>global-happens-before</em> and <em>local-happens-before</em>. These are
defined in section 3.3.6.
<br>
<br>
<strong>Host</strong>: The <em>host</em> interacts with the <em>context</em> using the OpenCL API.
<br>
<br>
<strong>Host-thread</strong>: the unit of execution that executes the statements in the
Host program.
<br>
<br>
<strong>Host pointer</strong>: A pointer to memory that is in the virtual address space
on the <em>host</em>.
<br>
<br>
<strong>Illegal</strong>: Behavior of a system that is explicitly not allowed and will
be reported as an error when encountered by OpenCL.
<br>
<br>
<strong>Image Object</strong>: A <em>memory object</em> that stores a two- or
three-dimensional structured array. Image data can only be accessed with read
and write functions. The read functions use a <em>sampler</em>.
<br>
<br>
The <em>image object</em> encapsulates the following information:</p></div>
<div class="ulist"><ul>
<li>
<p>
Dimensions of the image.
</p>
</li>
<li>
<p>
Description of each
element in the image.
</p>
</li>
<li>
<p>
Properties that describe
usage information and which region to allocate from.
</p>
</li>
<li>
<p>
Image data.
</p>
</li>
</ul></div>
<div class="paragraph"><p>The elements of an image are selected from a list of predefined image
formats.
<br>
<br>
<strong>Implementation Defined</strong>: Behavior that is explicitly allowed to vary
between conforming implementations of OpenCL. An OpenCL implementor is
required to document the implementation-defined behavior.
<br>
<br>
<strong>Independent Forward Progress</strong>: If an entity supports independent forward
progress, then if it is otherwise not dependent on any actions due to be
performed by any other entity (for example it does not wait on a lock
held by, and thus that must be released by, any other entity), then its
execution cannot be blocked by the execution of any other entity in the
system (it will not be starved). Work-items in a sub-group, for example,
typically do not support independent forward progress, so one work-item
in a sub-group may be completely blocked (starved) if a different
work-item in the same sub-group enters a spin loop.
<br>
<br>
<strong>In-order Execution</strong>: A model of execution in OpenCL where the <em>commands</em>
in a <em>command-queue</em> are executed in order of submission with each
<em>command</em> running to completion before the next one begins. See
<em>Out-of-order Execution</em>.
<br>
<br>
<strong>Intermediate Language</strong>: A lower-level language that may be used to
create programs. SPIR-V is a required IL for OpenCL 2.2 runtimes.
Additional ILs may be accepted on an implementation-defined basis.
<br>
<br>
<strong>Kernel</strong>: A <em>kernel</em> is a function declared in a <em>program</em> and executed
on an OpenCL <em>device</em>. A <em>kernel</em> is identified by the <code>__kernel</code> or
<code>kernel</code> qualifier applied to any function defined in a <em>program</em>.
<br>
<br>
<strong>Kernel-instance</strong>: The work carried out by an OpenCL program occurs
through the execution of kernel-instances on devices. The kernel
instance is the <em>kernel object</em>, the values associated with the
arguments to the kernel, and the parameters that define the <em>NDRange</em>
index space.
<br>
<br>
<strong>Kernel Object</strong>: A <em>kernel object</em> encapsulates a specific <em>kernel
function</em> declared in a <em>program</em> and the argument values to be used when
executing this <em>kernel function</em>.
<br>
<br>
<strong>Kernel Language</strong>: A language that is used to create source code for a
kernel. Supported kernel languages include OpenCL C, OpenCL C++, and the
OpenCL dialect of SPIR-V.
<br>
<br>
<strong>Launch</strong>: The transition of a command from the <em>submitted</em> state to the
<em>ready</em> state. See <em>Ready</em>.
<br>
<br>
<strong>Local ID</strong>: A <em>local ID</em> specifies a unique <em>work-item ID</em> within a given
<em>work-group</em> that is executing a <em>kernel</em>. The <em>local ID</em> is an
N-dimensional value that starts at (0, 0, 0). See also <em>Global ID</em>.
<br>
<br>
<strong>Local Memory</strong>: A memory region associated with a <em>work-group</em> and
accessible only by <em>work-items</em> in that <em>work-group</em>. <em>Local memory</em> is
included within the <em>generic address space</em> that includes the private
and global address spaces.
<br>
<br>
<strong>Marker</strong>: A <em>command</em> queued in a <em>command-queue</em> that can be used to
tag all <em>commands</em> queued before the <em>marker</em> in the <em>command-queue</em>.
The <em>marker</em> command returns an <em>event</em> which can be used by the
<em>application</em> to queue a wait on the marker event i.e. wait for all
commands queued before the <em>marker</em> command to complete.
<br>
<br>
<strong>Memory Consistency Model</strong>: Rules that define which values are observed
when multiple units of execution load data from any shared memory plus
the synchronization operations that constrain the order of memory
operations and define synchronization relationships. The memory
consistency model in OpenCL is based on the memory model from the ISO
C11 programming language.
<br>
<br>
<strong>Memory Objects</strong>: A <em>memory object</em> is a handle to a reference counted
region of <em>global memory</em>. Also see <em>Buffer Object</em> and <em>Image Object</em>.
<br>
<br>
<strong>Memory Regions (or Pools)</strong>: A distinct address space in OpenCL. <em>Memory
regions</em> may overlap in physical memory though OpenCL will treat them as
logically distinct. The <em>memory regions</em> are denoted as <em>private</em>,
<em>local</em>, <em>constant,</em> and <em>global</em>.
<br>
<br>
<strong>Memory Scopes</strong>: These memory scopes define a hierarchy of visibilities
when analyzing the ordering constraints of memory operations. They are
defined by the values of the memory_scope enumeration constant. Current
values are <strong>memory_scope_work_item</strong> (memory constraints only apply to a
single work-item and in practice apply only to image operations),
<strong>memory_scope_sub_group</strong> (memory-ordering constraints only apply to
work-items executing in a sub-group), <strong>memory_scope_work_group</strong>
(memory-ordering constraints only apply to work-items executing in a
work-group), <strong>memory_scope_device</strong> (memory-ordering constraints only
apply to work-items executing on a single device) and
<strong>memory_scope_all_svm_devices</strong> (memory-ordering constraints only apply
to work-items executing across multiple devices and when using shared
virtual memory).
<br>
<br>
<strong>Modification Order</strong>: All modifications to a particular atomic object M
occur in some particular <strong>total order</strong>, called the <strong>modification
order</strong> of M. If A and B are modifications of an atomic object M, and A
happens-before B, then A shall precede B in the modification order of M.
Note that the modification order of an atomic object M is independent of
whether M is in local or global memory.
<br>
<br>
<strong>Nested Parallelism</strong>: See <em>device-side enqueue</em>.
<br>
<br>
<strong>Object</strong>: Objects are abstract representations of the resources that can
be manipulated by the OpenCL API. Examples include <em>program objects</em>,
<em>kernel objects</em>, and <em>memory objects</em>.
<br>
<br>
<strong>Out-of-Order Execution</strong>: A model of execution in which <em>commands</em> placed
in the <em>work queue</em> may begin and complete execution in any order
consistent with constraints imposed by <em>event wait lists</em> and
<em>command-queue barriers</em>. See <em>In-order Execution</em>.
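A toy scheduler illustrating the constraint: a command may launch in any order so long as every event in its wait list has completed. The command names and wait lists here are hypothetical:

```python
# Toy out-of-order scheduler. Each command carries a wait list (a set of
# command names whose completion it must observe); any command whose
# wait list is satisfied may launch next, in any order.

def execution_order(commands):
    """commands: dict mapping command name -> set of names it waits on."""
    done, order = set(), []
    pending = dict(commands)
    while pending:
        ready = [c for c, waits in pending.items() if waits <= done]
        # An out-of-order queue may pick any ready command; picking the
        # last one shows that submission order need not be preserved.
        chosen = ready[-1]
        order.append(chosen)
        done.add(chosen)
        del pending[chosen]
    return order

# C must run after both A and B, but A and B may run in either order:
order = execution_order({"A": set(), "B": set(), "C": {"A", "B"}})
assert order.index("C") == 2
```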
<br>
<br>
<strong>Parent device</strong>: The OpenCL <em>device</em> which is partitioned to create
<em>sub-devices</em>. Not all <em>parent devices</em> are <em>root devices</em>. A <em>root
device</em> might be partitioned and the <em>sub-devices</em> partitioned again.
In this case, the first set of <em>sub-devices</em> would be <em>parent devices</em>
of the second set, but not the <em>root devices</em>. Also see <em>device</em>,
<em>parent device</em> and <em>root device</em>.
<br>
<br>
<strong>Parent kernel</strong>: see <em>device-side enqueue</em>.
<br>
<br>
<strong>Pipe</strong>: The <em>pipe</em> memory object conceptually is an ordered sequence of
data items. A pipe has two endpoints: a write endpoint into which data
items are inserted, and a read endpoint from which data items are
removed. At any one time, only one kernel instance may write into a
pipe, and only one kernel instance may read from a pipe. To support the
producer-consumer design pattern, one kernel instance connects to the
write endpoint (the producer) while another kernel instance connects to
the read endpoint (the consumer).
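A host-side sketch of the pipe's FIFO behavior; this is illustrative Python, not the OpenCL pipe API:

```python
from collections import deque

# Conceptual pipe: one producer writes packets into the write endpoint,
# one consumer removes them from the read endpoint, in FIFO order.

class Pipe:
    def __init__(self, max_packets):
        self.fifo = deque()
        self.max_packets = max_packets

    def write(self, packet):          # producer endpoint
        if len(self.fifo) >= self.max_packets:
            return False              # pipe is full, the write fails
        self.fifo.append(packet)
        return True

    def read(self):                   # consumer endpoint
        return self.fifo.popleft() if self.fifo else None

p = Pipe(max_packets=2)
assert p.write(1) and p.write(2) and not p.write(3)
assert p.read() == 1 and p.read() == 2 and p.read() is None
```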
<br>
<br>
<strong>Platform</strong>: The <em>host</em> plus a collection of <em>devices</em> managed by the
OpenCL <em>framework</em> that allow an application to share <em>resources</em> and
execute <em>kernels</em> on <em>devices</em> in the <em>platform</em>.
<br>
<br>
<strong>Private Memory</strong>: A region of memory private to a <em>work-item</em>. Variables
defined in one <em>work-item's</em> <em>private memory</em> are not visible to another
<em>work-item</em>.
<br>
<br>
<strong>Processing Element</strong>: A virtual scalar processor. A work-item may
execute on one or more processing elements.
<br>
<br>
<strong>Program</strong>: An OpenCL <em>program</em> consists of a set of <em>kernels</em>.
<em>Programs</em> may also contain auxiliary functions called by the <em>kernel</em>
functions and constant data.
<br>
<br>
<strong>Program Object</strong>: A <em>program object</em> encapsulates the following
information:</p></div>
<div class="ulist"><ul>
<li>
<p>
A reference to an
associated <em>context</em>.
</p>
</li>
<li>
<p>
A <em>program</em> source or
binary.
</p>
</li>
<li>
<p>
The latest successfully
built program executable, the list of <em>devices</em> for which the program
executable is built, the build options used and a build log.
</p>
</li>
<li>
<p>
The number of <em>kernel
objects</em> currently attached.
</p>
</li>
</ul></div>
<div class="paragraph"><p><strong>Queued</strong>: The first state in the six state model for the execution of a
command. The transition into this state occurs when the command is
enqueued into a command-queue.
<br>
<br>
<strong>Ready</strong>: The third state in the six state model for the execution of a
command. The transition into this state occurs when pre-requisites
constraining execution of a command have been met; i.e. the command has
been launched. When a Kernel-enqueue command is launched, work-groups
associated with the command are placed in a device's work-pool from
which they are scheduled for execution.
<br>
<br>
<strong>Re-converged Control Flow</strong>: see <em>control flow</em>.
<br>
<br>
<strong>Reference Count</strong>: The life span of an OpenCL object is determined by its
<em>reference count</em>, an internal count of the number of references to the
object. When you create an object in OpenCL, its <em>reference count</em> is
set to one. Subsequent calls to the appropriate <em>retain</em> API (such as
clRetainContext, clRetainCommandQueue) increment the <em>reference count</em>.
Calls to the appropriate <em>release</em> API (such as clReleaseContext,
clReleaseCommandQueue) decrement the <em>reference count</em>.
Implementations may also modify the <em>reference count</em>, e.g. to track
attached objects or to ensure correct operation of in-progress or
scheduled activities. The object becomes inaccessible to host code when
the number of <em>release</em> operations performed matches the number of
<em>retain</em> operations plus the allocation of the object. At this point the
reference count may be zero but this is not guaranteed.
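The lifecycle described above can be sketched as follows; the class and method names are illustrative stand-ins for calls such as clRetainContext and clReleaseContext:

```python
# Sketch of the retain/release lifecycle of a reference-counted OpenCL
# object. Names here are illustrative, not OpenCL API names.

class RefCounted:
    def __init__(self):
        self.ref_count = 1            # creation sets the count to one
        self.accessible = True

    def retain(self):                 # e.g. clRetainContext
        self.ref_count += 1

    def release(self):                # e.g. clReleaseContext
        self.ref_count -= 1
        if self.ref_count == 0:
            # releases now match retains plus the initial allocation
            self.accessible = False

obj = RefCounted()
obj.retain()                          # a second owner takes a reference
obj.release()
assert obj.accessible                 # one reference still outstanding
obj.release()
assert not obj.accessible
```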
<br>
<br>
<strong>Relaxed Consistency</strong>: A memory consistency model in which the contents
of memory visible to different <em>work-items</em> or <em>commands</em> may be
different except at a <em>barrier</em> or other explicit synchronization
points.
<br>
<br>
<strong>Relaxed Semantics</strong>: A memory order semantics for atomic operations that
implies no order constraints. The operation is <em>atomic</em> but it has no
impact on the order of memory operations.
<br>
<br>
<strong>Release Semantics</strong>: One of the memory order semantics defined for
synchronization operations.  Release semantics apply to atomic
operations that store to memory.  Given two units of execution, <strong>A</strong> and
<strong>B</strong>, acting on a shared atomic object <strong>M</strong>, if <strong>A</strong> uses an atomic store
of <strong>M</strong> with release semantics to synchronize-with an atomic load to <strong>M</strong>
by <strong>B</strong> that used acquire semantics, then <strong>A</strong>'s atomic store will occur
<em>after</em> any prior operations by <strong>A</strong>. Note that the memory orders
<em>acquire</em>, <em>sequentially consistent</em>, and <em>acquire_release</em> all include
<em>acquire semantics</em> and effectively pair with a store using release
semantics.
<br>
<br>
<strong>Remainder work-groups</strong>: When the work-groups associated with a
kernel-instance are defined, the sizes of a work-group in each dimension
may not evenly divide the size of the NDRange in the corresponding
dimensions. The result is a collection of work-groups on the boundaries
of the NDRange that are smaller than the base work-group size. These are
known as <em>remainder work-groups</em>.
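For example, along one dimension the work-group sizes, including any remainder work-group, can be computed as:

```python
# Work-group sizes along one NDRange dimension when the NDRange size is
# not a multiple of the base work-group size: the boundary work-group is
# smaller, i.e. a "remainder" work-group.

def work_group_sizes(ndrange_size, local_size):
    full = ndrange_size // local_size
    remainder = ndrange_size % local_size
    sizes = [local_size] * full
    if remainder:
        sizes.append(remainder)       # the remainder work-group
    return sizes

# A 1000-item NDRange with a base work-group size of 128:
assert work_group_sizes(1000, 128) == [128] * 7 + [104]
```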
<br>
<br>
<strong>Running</strong>: The fourth state in the six state model for the execution of
a command. The transition into this state occurs when the execution of
the command starts. When a Kernel-enqueue command starts, one or more
work-groups associated with the command start to execute.
<br>
<br>
<strong>Root device</strong>: A <em>root device</em> is an OpenCL <em>device</em> that has not been
partitioned. Also see <em>device</em>, <em>parent device</em> and <em>root device</em>.
<br>
<br>
<strong>Resource</strong>: A class of <em>objects</em> defined by OpenCL. An instance of a
<em>resource</em> is an <em>object</em>. The most common <em>resources</em> are the
<em>context</em>, <em>command-queue</em>, <em>program objects</em>, <em>kernel objects</em>, and
<em>memory objects</em>. Computational resources are hardware elements that
participate in the action of advancing a program counter. Examples
include the <em>host</em>, <em>devices</em>, <em>compute units</em> and <em>processing
elements</em>.
<br>
<br>
<strong>Retain</strong>, Release: The action of incrementing (retain) and decrementing
(release) the reference count of an OpenCL <em>object</em>. This is a
book-keeping functionality to make sure the system doesn't remove an <em>object</em>
before all instances that use this <em>object</em> have finished. Refer to
<em>Reference Count</em>.
<br>
<br>
<strong>Sampler</strong>: An <em>object</em> that describes how to sample an image when the
image is read in the <em>kernel</em>. The image read functions take a
<em>sampler</em> as an argument. The <em>sampler</em> specifies the image
addressing-mode i.e. how out-of-range image coordinates are handled, the
filter mode, and whether the input image coordinate is a normalized or
unnormalized value.
<br>
<br>
<strong>Scope inclusion</strong>: Two actions <strong>A</strong> and <strong>B</strong> are defined to have an
inclusive scope if they have the same scope <strong>P</strong> such that: (1) if <strong>P</strong> is
memory_scope_sub_group, and <strong>A</strong> and <strong>B</strong> are executed by work-items
within the same sub-group, or (2) if <strong>P</strong> is memory_scope_work_group, and
<strong>A</strong> and <strong>B</strong> are executed by work-items within the same work-group, or
(3) if <strong>P</strong> is memory_scope_device, and <strong>A</strong> and <strong>B</strong> are executed by
work-items on the same device, or (4) if <strong>P</strong> is
memory_scope_all_svm_devices, and <strong>A</strong> and <strong>B</strong> are executed by host
threads or by work-items on one or more devices that can share SVM
memory with each other and the host process.
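A direct, illustrative encoding of the four cases; the tuple layout describing a work-item is hypothetical:

```python
# Checking scope inclusion for two actions with the same scope P.
# A work-item is described by a hypothetical (device, work_group,
# sub_group) tuple; svm_shared says whether the devices involved can
# share SVM memory with each other and the host process.

def inclusive_scope(scope, a, b, svm_shared=False):
    if scope == "memory_scope_sub_group":
        return a[:3] == b[:3]         # same device, work-group, sub-group
    if scope == "memory_scope_work_group":
        return a[:2] == b[:2]         # same device and work-group
    if scope == "memory_scope_device":
        return a[0] == b[0]           # same device
    if scope == "memory_scope_all_svm_devices":
        return svm_shared
    return False

wi_a = (0, 4, 1)   # device 0, work-group 4, sub-group 1
wi_b = (0, 4, 2)   # same work-group, different sub-group
assert inclusive_scope("memory_scope_work_group", wi_a, wi_b)
assert not inclusive_scope("memory_scope_sub_group", wi_a, wi_b)
```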
<br>
<br>
<strong>Sequenced before</strong>: A relation between evaluations executed by a single
unit of execution. Sequenced-before is an asymmetric, transitive,
pair-wise relation that induces a partial order between evaluations.
Given any two evaluations A and B, if A is sequenced-before B, then the
execution of A shall precede the execution of B.
<br>
<br>
<strong>Sequential consistency</strong>: Sequential consistency interleaves the steps
executed by each unit of execution. Each access to a memory location
sees the last assignment to that location in that interleaving.
<br>
<br>
<strong>Sequentially consistent semantics</strong>: One of the memory order semantics
defined for synchronization operations. When using
sequentially-consistent synchronization operations, the loads and stores
within one unit of execution appear to execute in program order (i.e.,
the sequenced-before order), and loads and stores from different units
of execution appear to be simply interleaved.
<br>
<br>
<strong>Shared Virtual Memory (SVM)</strong>: An address space exposed to both the host
and the devices within a context. SVM causes addresses to be meaningful
between the host and all of the devices within a context and therefore
supports the use of pointer based data structures in OpenCL kernels. It
logically extends a portion of the global memory into the host address
space therefore giving work-items access to the host address space.
There are three types of SVM in OpenCL: <strong>Coarse-Grained buffer SVM</strong>:
Sharing occurs at the granularity of regions of OpenCL buffer memory
objects. <strong>Fine-Grained buffer SVM</strong>: Sharing occurs at the granularity
of individual loads/stores into bytes within OpenCL buffer memory
objects. <strong>Fine-Grained system SVM</strong>: Sharing occurs at the granularity of
individual loads/stores into bytes occurring anywhere within the host
memory.
<br>
<br>
<strong>SIMD</strong>: Single Instruction Multiple Data. A programming model where a
<em>kernel</em> is executed concurrently on multiple <em>processing elements</em> each
with its own data and a shared program counter. All <em>processing
elements</em> execute a strictly identical set of instructions.
<br>
<br>
<strong>Specialization constants</strong>: Specialization is intended for constant
objects that will not have known constant values until after initial
generation of a SPIR-V module. Such objects are called specialization
constants. An application might provide values for
the specialization constants that will be used when the SPIR-V program is
built. Specialization constants that do not receive a value from an
application shall use the default value defined in the SPIR-V specification.
<br>
<br>
<strong>SPMD</strong>: Single Program Multiple Data. A programming model where a
<em>kernel</em> is executed concurrently on multiple <em>processing elements</em> each
with its own data and its own program counter. Hence, while all
computational resources run the same <em>kernel</em> they maintain their own
instruction counter and due to branches in a <em>kernel</em>, the actual
sequence of instructions can be quite different across the set of
<em>processing elements</em>.
<br>
<br>
<strong>Sub-device</strong>: An OpenCL <em>device</em> can be partitioned into multiple
<em>sub-devices</em>. The new <em>sub-devices</em> alias specific collections of
compute units within the parent <em>device</em>, according to a partition
scheme. The <em>sub-devices</em> may be used in any situation that their
parent <em>device</em> may be used. Partitioning a <em>device</em> does not destroy
the parent <em>device</em>, which may continue to be used alongside and
intermingled with its child <em>sub-devices</em>. Also see <em>device</em>, <em>parent
device</em> and <em>root device</em>.
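For example, partitioning a parent device's compute units equally; this is an illustrative helper, not an OpenCL API call:

```python
# Splitting a parent device's compute units into equally sized
# sub-devices. Compute units not covered by a full sub-device are
# simply not assigned to one.

def partition_equally(num_compute_units, units_per_sub_device):
    count = num_compute_units // units_per_sub_device
    return [units_per_sub_device] * count

# A 16-compute-unit parent device split into sub-devices of 4 CUs each:
assert partition_equally(16, 4) == [4, 4, 4, 4]
```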
<br>
<br>
<strong>Sub-group</strong>: Sub-groups are an implementation-dependent grouping of
work-items within a work-group. The size and number of sub-groups is
implementation-defined.
<br>
<br>
<strong>Sub-group Barrier</strong>. See <em>Barrier</em>.
<br>
<br>
<strong>Submitted</strong>: The second state in the six state model for the execution
of a command. The transition into this state occurs when the command is
flushed from the command-queue and submitted for execution on the
device. Once submitted, a programmer can assume a command will execute
once its prerequisites have been met.
<br>
<br>
<strong>SVM Buffer</strong>: A memory allocation enabled to work with Shared Virtual
Memory (SVM). Depending on how the SVM buffer is created, it can be a
coarse-grained or fine-grained SVM buffer. Optionally it may be wrapped
by a Buffer Object. See <em>Shared Virtual Memory (SVM)</em>.
<br>
<br>
<strong>Synchronization</strong>: Synchronization refers to mechanisms that constrain
the order of execution and the visibility of memory operations between
two or more units of execution.
<br>
<br>
<strong>Synchronization operations</strong>: Operations that define memory order
constraints in a program. They play a special role in controlling how
memory operations in one unit of execution (such as work-items or, when
using SVM, a host thread) are made visible to another. Synchronization
operations in OpenCL include <em>atomic operations</em> and <em>fences</em>.
<br>
<br>
<strong>Synchronization point</strong>: A synchronization point between a pair of
commands (A and B) assures that the results of command A happen-before
command B is launched (i.e. enters the <em>ready</em> state).
<br>
<br>
<strong>Synchronizes with</strong>: A relation between operations in two different
units of execution that defines a memory order constraint in global
memory (<em>global-synchronizes-with</em>) or local memory
(<em>local-synchronizes-with</em>).
<br>
<br>
<strong>Task Parallel Programming Model</strong>: A programming model in which
computations are expressed in terms of multiple concurrent tasks
executing in one or more <em>command-queues</em>. The concurrent tasks can be
running different <em>kernels</em>.
<br>
<br>
<strong>Thread-safe</strong>: An OpenCL API call is considered to be <em>thread-safe</em> if
the internal state as managed by OpenCL remains consistent when called
simultaneously by multiple <em>host</em> threads. OpenCL API calls that are
<em>thread-safe</em> allow an application to call these functions in multiple
<em>host</em> threads without having to implement mutual exclusion across these
<em>host</em> threads i.e. they are also re-entrant-safe.
<br>
<br>
<strong>Undefined</strong>: The behavior of an OpenCL API call, built-in function used
inside a <em>kernel</em> or execution of a <em>kernel</em> that is explicitly not
defined by OpenCL. A conforming implementation is not required to
specify what occurs when an undefined construct is encountered in
OpenCL.
<br>
<br>
<strong>Unit of execution</strong>: A generic term for a process, an OS-managed thread
running on the host (a host-thread), kernel-instance, host program,
work-item or any other executable agent that advances the work
associated with a program.
<br>
<br>
<strong>Work-group</strong>: A collection of related <em>work-items</em> that execute on a
single <em>compute unit</em>. The <em>work-items</em> in the group execute the same
<em>kernel-instance</em> and share <em>local</em> <em>memory</em> and <em>work-group functions</em>.
<br>
<br>
<strong>Work-group Barrier</strong>: See <em>Barrier</em>.
<br>
<br>
<strong>Work-group Function</strong>: A function that carries out collective operations
across all the work-items in a work-group. Available collective
operations are a barrier, reduction, broadcast, prefix sum, and
evaluation of a predicate. A work-group function must occur within a
<em>converged control flow</em>; i.e. all work-items in the work-group must
encounter precisely the same work-group function.
<br>
<br>
<strong>Work-group Synchronization</strong>: Constraints on the order of execution for
work-items in a single work-group.
<br>
<br>
<strong>Work-pool</strong>: A logical pool associated with a device that holds commands
and work-groups from kernel-instances that are ready to execute. OpenCL
does not constrain the order that commands and work-groups are scheduled
for execution from the work-pool; i.e. a programmer must assume that
they could be interleaved. There is one work-pool per device used by
all command-queues associated with that device. The work-pool may be
implemented in any manner as long as it assures that work-groups placed
in the pool will eventually execute.
<br>
<br>
<strong>Work-item</strong>: One of a collection of parallel executions of a <em>kernel</em>
invoked on a <em>device</em> by a <em>command</em>. A <em>work-item</em> is executed by one
or more <em>processing elements</em> as part of a <em>work-group</em> executing on a
<em>compute unit</em>. A <em>work-item</em> is distinguished from other work-items by
its <em>global ID</em> or the combination of its <em>work-group</em> ID and its <em>local
ID</em> within a <em>work-group</em>.</p></div>
<div class="paragraph"><p> </p></div>
</div>
</div>
<div class="sect1">
<h2 id="_the_opencl_architecture">3. The OpenCL Architecture</h2>
<div class="sectionbody">
<div class="paragraph"><p><strong>OpenCL</strong> is an open industry standard for programming a heterogeneous
collection of CPUs, GPUs and other discrete computing devices organized
into a single platform. It is more than a language. OpenCL is a
framework for parallel programming and includes a language, API,
libraries and a runtime system to support software development. Using
OpenCL, for example, a programmer can write general purpose programs
that execute on GPUs without the need to map their algorithms onto a 3D
graphics API such as OpenGL or DirectX.
<br>
<br>
The target of OpenCL is expert programmers wanting to write portable yet
efficient code. This includes library writers, middleware vendors, and
performance-oriented application programmers. Therefore OpenCL provides
a low-level hardware abstraction plus a framework to support programming,
and many details of the underlying hardware are exposed.
<br>
<br>
To describe the core ideas behind OpenCL, we will use a hierarchy of
models:</p></div>
<div class="ulist"><ul>
<li>
<p>
Platform Model
</p>
</li>
<li>
<p>
Memory Model
</p>
</li>
<li>
<p>
Execution Model
</p>
</li>
<li>
<p>
Programming Model
</p>
</li>
</ul></div>
<div class="sect2">
<h3 id="_platform_model">3.1. Platform Model</h3>
<div class="paragraph"><p>The Platform model for OpenCL is defined in <em>figure 3.1</em>. The model
consists of a <strong>host</strong> connected to one or more <strong>OpenCL devices</strong>. An OpenCL
device is divided into one or more <strong>compute units</strong> (CUs) which are further
divided into one or more <strong>processing elements</strong> (PEs). Computations on a
device occur within the processing elements.
<br>
<br>
An OpenCL application is implemented as both host code and device kernel
code.  The host code portion of an OpenCL application runs on a host
processor according to the models native to the host platform. The
OpenCL application host code submits the kernel code as commands from
the host to OpenCL devices.  An OpenCL device executes the command&#8217;s
computation on the processing elements within the device.
<br>
<br>
An OpenCL device has considerable latitude on how computations are
mapped onto the device&#8217;s processing elements.  When processing elements
within a compute unit execute the same sequence of statements across the
processing elements, the control flow is said to be <em>converged.</em>
Hardware optimized for executing a single stream of instructions over
multiple processing elements is well suited to converged control
flows. When the control flow varies from one processing element to
another, it is said to be <em>diverged.</em> While a kernel always begins
execution with a converged control flow, due to branching statements
within a kernel, converged and diverged control flows may occur within a
single kernel. This provides a great deal of flexibility in the
algorithms that can be implemented with OpenCL.
<br>
<br></p></div>
<div class="paragraph"><p><span class="image">
<img src="opencl22-API_files/image004_new.png" alt="opencl22-API_files/image004_new.png" width="320" height="180">
</span></p></div>
<div class="paragraph"><p><strong>Figure 3.1</strong>: <em>Platform model &#8230; one host plus one or more compute devices each
with one or more compute units composed of one or more processing elements</em>.
<br>
<br>
Programmers provide programs in the form of SPIR-V binaries, OpenCL C
or OpenCL C++ source strings, or implementation-defined binary objects. The
OpenCL platform provides a compiler to translate program input in any of
these forms into executable program objects. The device code compiler may be
<em>online</em> or <em>offline</em>. An <em>online</em> <em>compiler</em> is available during host
program execution using standard APIs. An <em>offline compiler</em> is
invoked outside of host program control, using platform-specific
methods. The OpenCL runtime allows developers to retrieve a previously
compiled device program executable and to load and execute it.
<br>
<br>
OpenCL defines two kinds of platform profiles: a <em>full profile</em> and a
reduced-functionality <em>embedded profile</em>. A full profile platform must
provide an online compiler for all its devices. An embedded platform
may provide an online compiler, but is not required to do so.
<br>
<br>
A device may expose special purpose functionality as a <em>built-in
function</em>. The platform provides APIs for enumerating and invoking the
built-in functions offered by a device, but otherwise does not define
their construction or semantics. A <em>custom device</em> supports only
built-in functions, and cannot be programmed via a kernel language.
<br>
<br>
All device types support the OpenCL execution model, the OpenCL memory
model, and the APIs used in OpenCL to manage devices.
<br>
<br>
The platform model is an abstraction describing how OpenCL views the
hardware. The relationship between the elements of the platform model
and the hardware in a system may be a fixed property of a device or it
may be a dynamic feature of a program dependent on how a compiler
optimizes code to best utilize physical hardware.</p></div>
</div>
<div class="sect2">
<h3 id="_execution_model">3.2. Execution Model</h3>
<div class="paragraph"><p>The OpenCL execution model is defined in terms of two distinct units of
execution: <strong>kernels</strong> that execute on one or more OpenCL devices and a
<strong>host program</strong> that executes on the host. With regard to OpenCL, the
kernels are where the "work" associated with a computation occurs. This
work occurs through <strong>work-items</strong> that execute in groups (<strong>work-groups</strong>).
<br>
<br>
A kernel executes within a well-defined context managed by the host.
The context defines the environment within which kernels execute. It
includes the following resources:</p></div>
<div class="ulist"><ul>
<li>
<p>
<strong>Devices</strong>: One or
more devices exposed by the OpenCL platform.
</p>
</li>
<li>
<p>
<strong>Kernel Objects</strong>: The
OpenCL functions with their associated argument values that run on
OpenCL devices.
</p>
</li>
<li>
<p>
<strong>Program Objects</strong>: The
program source and executable that implement the kernels.
</p>
</li>
<li>
<p>
<strong>Memory Objects</strong>: Variables visible to the host and the OpenCL devices.
Instances of kernels operate on these objects as they execute.
</p>
</li>
</ul></div>
<div class="paragraph"><p>The host program uses the OpenCL API to create and manage the context.
Functions from the OpenCL API enable the host to interact with a device
through a <em>command-queue</em>. Each command-queue is associated with a
single device. The commands placed into the command-queue fall into
one of three types:</p></div>
<div class="ulist"><ul>
<li>
<p>
<strong>Kernel-enqueue commands</strong>:
Enqueue a kernel for execution on a device.
</p>
</li>
<li>
<p>
<strong>Memory commands</strong>:
Transfer data between the host and device memory, between memory
objects, or map and unmap memory objects from the host address space.
</p>
</li>
<li>
<p>
<strong>Synchronization
commands</strong>: Explicit synchronization points that define order constraints
between commands.
</p>
</li>
</ul></div>
<div class="paragraph"><p>In addition to commands submitted from the host command-queue, a kernel
running on a device can enqueue commands to a device-side command queue.
This results in <em>child kernels</em> enqueued by a kernel executing on a
device (the <em>parent kernel</em>). Regardless of whether the command-queue
resides on the host or a device, each command passes through six states.</p></div>
<div class="olist arabic"><ol class="arabic">
<li>
<p>
<strong>Queued</strong>: The command is enqueued to a command-queue. A
command may reside in the queue until it is flushed either explicitly (a
call to clFlush) or implicitly by some other command.
</p>
</li>
<li>
<p>
<strong>Submitted</strong>: The command is flushed from the command-queue and
submitted for execution on the device. Once flushed from the
command-queue, a command will execute after any prerequisites for
execution are met.
</p>
</li>
<li>
<p>
<strong>Ready</strong>: All prerequisites constraining execution of a command
have been met. The command, or for a kernel-enqueue command the
collection of work groups associated with a command, is placed in a
device work-pool from which it is scheduled for execution.
</p>
</li>
<li>
<p>
<strong>Running</strong>: Execution of the command starts. For the case of a
kernel-enqueue command, one or more work-groups associated with the
command start to execute.
</p>
</li>
<li>
<p>
<strong>Ended</strong>: Execution of a command ends. When a Kernel-enqueue
command ends, all of the work-groups associated with that command have
finished their execution. <em>Immediate side effects</em>, i.e. those
associated with the kernel but not necessarily with its child kernels,
are visible to other units of execution. These side effects include
updates to values in global memory.
</p>
</li>
<li>
<p>
<strong>Complete</strong>: The command and its child commands have finished
execution and the status of the event object, if any, associated with
the command is set to CL_COMPLETE.
</p>
</li>
</ol></div>
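<div class="paragraph"><p>The six states above can be sketched as a simple forward-only state
machine. This is an illustrative model only, not part of the OpenCL API;
the enum and function names below are hypothetical.</p></div>

```c
/* Illustrative sketch (not an OpenCL API type): the six command
 * states, in the order a command passes through them. */
typedef enum {
    CMD_QUEUED,
    CMD_SUBMITTED,
    CMD_READY,
    CMD_RUNNING,
    CMD_ENDED,
    CMD_COMPLETE
} cmd_state;

/* A command only ever advances to the next state, never backwards.
 * The one shortcut the text allows: a command with no visible side
 * effects (a marker, a zero-size copy) may pass directly from the
 * ready state to the ended state. */
static int valid_transition(cmd_state from, cmd_state to)
{
    if (to == from + 1)
        return 1;
    if (from == CMD_READY && to == CMD_ENDED)
        return 1;
    return 0;
}
```

<div class="paragraph"><p>A real implementation is free to expose these states however it
chooses; only a subset of the transitions is observable, through the
profiling interface.</p></div>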
<div class="paragraph"><p>The execution states and the transitions between them are summarized in
Figure 3-2. These states and the concept of a device work-pool are
conceptual elements of the execution model. An implementation of OpenCL
has considerable freedom in how these are exposed to a program. Five of
the transitions, however, are directly observable through a profiling
interface. These profiled states are shown in Figure 3-2.</p></div>
<div class="paragraph"><p><span class="image">
<img src="opencl22-API_files/image006.jpg" alt="image">
</span></p></div>
<div class="paragraph"><p><strong>Figure 3-2: The states and transitions between states defined in the
OpenCL execution model. A subset of these transitions is exposed
through the profiling interface (see section 5.14).</strong></p></div>
<div class="paragraph"><p>Commands communicate their status through <em>Event objects</em>. Successful
completion is indicated by setting the event status associated with a
command to CL_COMPLETE. Unsuccessful completion results in abnormal
termination of the command which is indicated by setting the event
status to a negative value. In this case, the command-queue associated
with the abnormally terminated command and all other command-queues in
the same context may no longer be available and their behavior is
implementation defined.
<br>
<br>
A command submitted to a device will not launch until prerequisites that
constrain the order of commands have been resolved. These
prerequisites have three sources:</p></div>
<div class="ulist"><ul>
<li>
<p>
They may arise from
commands submitted to a command-queue that constrain the order in which
commands are launched. For example, commands that follow a command queue
barrier will not launch until all commands prior to the barrier are
complete.
</p>
</li>
<li>
<p>
The second source of
prerequisites is dependencies between commands expressed through events.
A command may include an optional list of events. The command will wait
and not launch until all the events in the list are in the state
CL_COMPLETE. By this mechanism, event objects define order constraints
between commands and coordinate execution between the host and one or
more devices.
</p>
</li>
<li>
<p>
The third source of
prerequisites can be the presence of non-trivial C initializers or C++
constructors for program scope global variables. In this case, the OpenCL
C/C++ compiler shall generate program initialization kernels that
perform the C initialization or C++ construction. These kernels must be
executed by the OpenCL runtime on a device before any kernel from the same
program can be executed on the same device. The ND-range for any program
initialization kernel is (1,1,1). When multiple programs are linked
together, the order of execution of program initialization kernels
that belong to different programs is undefined.
<br>
<br>
Program clean up may result in the execution of one or more program
clean up kernels by the OpenCL runtime. This is due to the presence of
non-trivial C++ destructors for program scope variables. The ND-range
for executing any program clean up kernel is (1,1,1). The order of
execution of clean up kernels from different programs (that are linked
together) is undefined.
<br>
<br>
Note that C initializers, C++ constructors, or C++ destructors for
program scope variables cannot use pointers to coarse grain and fine
grain SVM allocations.
<br>
<br>
A command may be submitted to a device and yet have no visible side effects
outside of waiting on and satisfying event dependences. Examples include
markers, kernels executed over ranges containing no work-items, or copy
operations with zero sizes. Such commands may pass directly from the
<em>ready</em> state to the <em>ended</em> state.
<br>
<br>
Command execution can be blocking or non-blocking. Consider a sequence
of OpenCL commands. For blocking commands, the OpenCL API functions
that enqueue commands don&#8217;t return until the command has completed.
Alternatively, OpenCL functions that enqueue non-blocking commands
return immediately and require that a programmer defines dependencies
between enqueued commands to ensure that enqueued commands are not
launched before needed resources are available. In both cases, the
actual execution of the command may occur asynchronously with execution
of the host program.
<br>
<br>
Commands within a single command-queue execute relative to each other in
one of two modes:
</p>
</li>
</ul></div>
<div class="paragraph"><p> </p></div>
<div class="ulist"><ul>
<li>
<p>
<strong>In-order Execution</strong>:
Commands and any side effects associated with commands appear to the
OpenCL application as if they execute in the same order they are
enqueued to a command-queue.
</p>
</li>
<li>
<p>
<strong>Out-of-order Execution</strong>:
Commands execute in any order constrained only by explicit
synchronization points (e.g. through command queue barriers) or explicit
dependencies on events.
<br>
<br>
Multiple command-queues can be present within a single context.
Multiple command-queues execute commands independently. Event objects
visible to the host program can be used to define synchronization points
between commands in multiple command queues. If such synchronization
points are established between commands in multiple command-queues, an
implementation must assure that the command-queues progress concurrently
and correctly account for the dependencies established by the
synchronization points. For a detailed explanation of synchronization
points, see section 3.2.4.
<br>
<br>
The core of the OpenCL execution model is defined by how the kernels
execute. When a kernel-enqueue command submits a kernel for execution,
an index space is defined. The kernel, the argument values associated
with the arguments to the kernel, and the parameters that define the
index space define a <em>kernel-instance</em>. When a kernel-instance executes
on a device, the kernel function executes for each point in the defined
index space. Each of these executing kernel functions is called a
<em>work-item</em>. The work-items associated with a given kernel-instance are
managed by the device in groups called <em>work-groups</em>. These work-groups
define a coarse grained decomposition of the Index space. Work-groups
are further divided into <em>sub-groups</em>, which provide an additional level
of control over execution.
<br>
<br>
Work-items have a global ID based on their coordinates within the Index
space. They can also be defined in terms of their work-group and the
local ID within a work-group. The details of this mapping are described
in the following section.
</p>
</li>
</ul></div>
<div class="sect3">
<h4 id="_execution_model_mapping_work_items_onto_an_ndrange">3.2.1. Execution Model: Mapping work-items onto an NDRange</h4>
<div class="paragraph"><p>The index space supported by OpenCL is called an NDRange. An NDRange is
an N-dimensional index space, where N is one, two or three. The NDRange
is decomposed into work-groups forming blocks that cover the Index
space. An NDRange is defined by three integer arrays of length N:</p></div>
<div class="ulist"><ul>
<li>
<p>
The extent of the index
space (or global size) in each dimension.
</p>
</li>
<li>
<p>
An offset index F
indicating the initial value of the indices in each dimension (zero by
default).
</p>
</li>
<li>
<p>
The size of a work-group
(local size) in each dimension.
</p>
</li>
</ul></div>
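<div class="paragraph"><p>As a sketch, the three arrays above can be collected into a small
descriptor. The struct and helper below are hypothetical, not OpenCL
API types.</p></div>

```c
#include <stddef.h>

/* Hypothetical descriptor (not an OpenCL API type) holding the three
 * length-N arrays that define an NDRange, for N = 1, 2 or 3. */
typedef struct {
    int    work_dim;        /* N: 1, 2 or 3 */
    size_t global_size[3];  /* extent of the index space per dimension */
    size_t offset[3];       /* offset index F per dimension (zero by default) */
    size_t local_size[3];   /* work-group size per dimension */
} ndrange_t;

/* The total number of work-items is the product of the global sizes. */
static size_t total_work_items(const ndrange_t *r)
{
    size_t n = 1;
    for (int d = 0; d < r->work_dim; ++d)
        n *= r->global_size[d];
    return n;
}
```

<div class="paragraph"><p>For example, a 2-dimensional range with global sizes 1024 and 768
contains 1024 * 768 = 786432 work-items regardless of the chosen
work-group decomposition.</p></div>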
<div class="paragraph"><p> </p></div>
<div class="paragraph"><p>Each work-item&#8217;s global ID is an N-dimensional tuple. The global ID
components are values in the range from F to F plus the number of
elements in that dimension minus one.
<br>
<br>
If a kernel is created from OpenCL C 2.0 or SPIR-V, the size of work-groups
in an NDRange (the local size) need not be the same for all work-groups.
In this case, any single dimension for which the global size is not
divisible by the local size will be partitioned into two regions. One
region will have work-groups that have the same number of work items as
was specified for that dimension by the programmer (the local size). The
other region will have work-groups with less than the number of work
items specified by the local size parameter in that dimension (the
<em>remainder work-groups</em>). Work-group sizes could be non-uniform in
multiple dimensions, potentially producing work-groups of up to 4
different sizes in a 2D range and 8 different sizes in a 3D range.
<br>
<br>
Each work-item is assigned to a work-group and given a local ID to
represent its position within the work-group. A work-item&#8217;s local ID is
an N-dimensional tuple with components in the range from zero to the
size of the work-group in that dimension minus one.
<br>
<br>
Work-groups are assigned IDs similarly. The number of work-groups in
each dimension is not directly defined but is inferred from the local
and global NDRanges provided when a kernel-instance is enqueued. A
work-group&#8217;s ID is an N-dimensional tuple with components in the range
zero to the ceiling of the global size in that dimension divided by the
local size in the same dimension, minus one. As a result, the combination of a
work-group ID and the local-ID within a work-group uniquely defines a
work-item. Each work-item is identifiable in two ways; in terms of a
global index, and in terms of a work-group index plus a local index
within a work group.
<br>
<br>
For example, consider the 2-dimensional index space in figure 3-3. We
input the index space for the work-items (G<sub>x</sub>, G<sub>y</sub>), the size of each
work-group (S<sub>x</sub>, S<sub>y</sub>) and the global ID offset (F<sub>x</sub>, F<sub>y</sub>). The
global indices define a G<sub>x</sub> by G<sub>y</sub> index space where the total number
of work-items is the product of G<sub>x</sub> and G<sub>y</sub>. The local indices define
an S<sub>x</sub> by S<sub>y</sub> index space where the number of work-items in a single
work-group is the product of S<sub>x</sub> and S<sub>y</sub>. Given the size of each
work-group and the total number of work-items we can compute the number
of work-groups. A 2-dimensional index space is used to uniquely identify
a work-group. Each work-item is identified by its global ID (<em>g</em><sub>x</sub>,
<em>g</em><sub>y</sub>) or by the combination of the work-group ID (<em>w</em><sub>x</sub>, <em>w</em><sub>y</sub>), the
size of each work-group (S<sub>x</sub>,S<sub>y</sub>) and the local ID (s<sub>x</sub>, s<sub>y</sub>) inside
the work-group such that
<br></p></div>
<div class="paragraph"><p>&#160; &#160; &#160; &#160; (g<sub>x</sub> , g<sub>y</sub>) = (w<sub>x</sub> * S<sub>x</sub> + s<sub>x</sub> + F<sub>x</sub>, w<sub>y</sub> * S<sub>y</sub> + s<sub>y</sub> + F<sub>y</sub>)
<br>
<br>
The number of work-groups can be computed as:
<br></p></div>
<div class="paragraph"><p>&#160; &#160; &#160; &#160; (W<sub>x</sub>, W<sub>y</sub>) = (ceil(G<sub>x</sub> / S<sub>x</sub>), ceil(G<sub>y</sub> / S<sub>y</sub>))
<br>
<br>
Given a global ID and the work-group size, the work-group ID for a
work-item is computed as:
<br></p></div>
<div class="paragraph"><p>&#160; &#160; &#160; &#160; (w<sub>x</sub>, w<sub>y</sub>) = ( (g<sub>x</sub> - s<sub>x</sub> - F<sub>x</sub>) / S<sub>x</sub>, (g<sub>y</sub> - s<sub>y</sub> - F<sub>y</sub>) / S<sub>y</sub> )</p></div>
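<div class="paragraph"><p>The three formulas above can be checked with a few one-dimensional
helpers. The function names are hypothetical, and uniform work-group
sizes are assumed.</p></div>

```c
/* One dimension of the mapping described in the text (illustrative
 * helpers, not OpenCL API calls). */

/* global ID from work-group ID, local ID and offset: g = w*S + s + F */
static int global_id(int w, int S, int s, int F) { return w * S + s + F; }

/* number of work-groups: W = ceil(G / S), via integer ceiling division */
static int num_groups(int G, int S) { return (G + S - 1) / S; }

/* work-group ID recovered from a global ID: w = (g - s - F) / S */
static int group_id(int g, int s, int F, int S) { return (g - s - F) / S; }
```

<div class="paragraph"><p>For example, with a global size G = 12, work-group size S = 4 and
offset F = 2 in one dimension, the work-item with work-group ID 2 and
local ID 3 has global ID 2 * 4 + 3 + 2 = 13; there are ceil(12 / 4) = 3
work-groups, and the inverse mapping recovers (13 - 3 - 2) / 4 = 2.</p></div>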
<div class="paragraph"><p><span class="image">
<img src="opencl22-API_files/image007.jpg" alt="image">
</span></p></div>
<div class="paragraph"><p><strong>Figure 3-3: An example of an NDRange index space showing work-items,
their global IDs and their mapping onto the pair of work-group and local
IDs. In this case, we assume that in each dimension, the size of the
work-group evenly divides the global NDRange size (i.e. all work-groups
have the same size) and that the offset is equal to zero.</strong>
<br>
<br>
Within a work-group work-items may be divided into sub-groups. The
mapping of work-items to sub-groups is implementation-defined and may be
queried at runtime. While sub-groups may be used in multi-dimensional
work-groups, each sub-group is 1-dimensional and any given work-item may
query which sub-group it is a member of.
<br>
<br>
Work-items are mapped into sub-groups through a combination of
compile-time decisions and the parameters of the dispatch. The mapping
to sub-groups is invariant for the duration of a kernel&#8217;s execution,
across dispatches of a given kernel with the same work-group dimensions,
between dispatches and query operations consistent with the dispatch
parameterization, and from one work-group to another within the dispatch
(excluding the trailing edge work-groups in the presence of non-uniform
work-group sizes). In addition, all sub-groups within a work-group will
be the same size, apart from the sub-group with the maximum index which
may be smaller if the size of the work-group is not evenly divisible by
the size of the sub-groups.
<br>
<br>
In the degenerate case, a single sub-group must be supported for each
work-group. In this situation all sub-group scope functions are
equivalent to their work-group level equivalents.</p></div>
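<div class="paragraph"><p>Assuming a fixed sub-group size chosen by the implementation, the
sizing rule above can be sketched as follows. The helpers are
hypothetical; a real application would query the actual mapping at
runtime.</p></div>

```c
/* Sketch of the sub-group sizing rule: all sub-groups in a work-group
 * share one size, except the sub-group with the maximum index, which
 * is smaller when the work-group size is not evenly divisible by the
 * sub-group size. */
static int sub_group_count(int wg_size, int sg_size)
{
    return (wg_size + sg_size - 1) / sg_size;   /* ceiling division */
}

static int sub_group_size_of(int index, int wg_size, int sg_size)
{
    int count = sub_group_count(wg_size, sg_size);
    if (index == count - 1 && wg_size % sg_size != 0)
        return wg_size % sg_size;               /* trailing sub-group */
    return sg_size;
}
```

<div class="paragraph"><p>For example, a work-group of 100 work-items with a sub-group size of 32
decomposes into four sub-groups of sizes 32, 32, 32 and 4.</p></div>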
</div>
<div class="sect3">
<h4 id="_execution_model_execution_of_kernel_instances">3.2.2. Execution Model: Execution of kernel-instances</h4>
<div class="paragraph"><p>The work carried out by an OpenCL program occurs through the execution
of kernel-instances on compute devices. To understand the details of
OpenCL&#8217;s execution model, we need to consider how a kernel object moves
from the kernel-enqueue command, into a command-queue, executes on a
device, and completes.
<br>
<br>
A kernel-object is defined from a function within the program object and
a collection of arguments connecting the kernel to a set of argument
values. The host program enqueues a kernel-object to the command queue
along with the NDRange, and the work-group decomposition. These define
a <em>kernel-instance</em>. In addition, an optional set of events may be
defined when the kernel is enqueued. The events associated with a
particular kernel-instance are used to constrain when the
kernel-instance is launched with respect to other commands in the queue
or to commands in other queues within the same context.
<br>
<br>
A kernel-instance is submitted to a device. For an in-order command-queue,
the kernel-instances appear to launch and then execute in that
same order; we use the term &#8220;appear&#8221; to emphasize that, when there
are no dependencies between commands and hence differences in the order
that commands execute cannot be observed in a program, an implementation
can reorder commands even in an in-order command-queue. For an
out-of-order command-queue, kernel-instances wait to be launched until:</p></div>
<div class="ulist"><ul>
<li>
<p>
Synchronization commands
enqueued prior to the kernel-instance are satisfied.
</p>
</li>
<li>
<p>
Each of the events in an
optional event list defined when the kernel-instance was enqueued are
set to CL_COMPLETE.
</p>
</li>
</ul></div>
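<div class="paragraph"><p>The launch condition above can be modeled minimally: a kernel-instance
is ready only when every event in its optional wait list has reached
CL_COMPLETE. This is an illustrative model, not the OpenCL API; the
MY_CL_* constants mirror the OpenCL convention in which CL_COMPLETE is
zero and abnormal termination is a negative value.</p></div>

```c
/* Event execution status values, following the OpenCL convention
 * (hypothetical stand-ins, not the real CL_* constants). */
enum {
    MY_CL_COMPLETE  = 0,
    MY_CL_RUNNING   = 1,
    MY_CL_SUBMITTED = 2,
    MY_CL_QUEUED    = 3
};

/* A kernel-instance may launch only when every event in its wait list
 * has reached the complete state. */
static int ready_to_launch(const int *event_status, int num_events)
{
    for (int i = 0; i < num_events; ++i)
        if (event_status[i] != MY_CL_COMPLETE)
            return 0;   /* at least one prerequisite is unresolved */
    return 1;
}
```

<div class="paragraph"><p>An empty wait list imposes no event-based constraint, so a command with
no events is ready as soon as any queue-ordering constraints are met.</p></div>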
<div class="paragraph"><p>Once these conditions are met, the kernel-instance is launched and the
work-groups associated with the kernel-instance are placed into a pool
of ready to execute work-groups. This pool is called a <em>work-pool</em>.
The work-pool may be implemented in any manner as long as it assures
that work-groups placed in the pool will eventually execute. The
device schedules work-groups from the work-pool for execution on the
compute units of the device. The kernel-enqueue command is complete when
all work-groups associated with the kernel-instance end their execution,
updates to global memory associated with a command are visible globally,
and the device signals successful completion by setting the event
associated with the kernel-enqueue command to CL_COMPLETE.
<br>
<br>
While a command-queue is associated with only one device, a single
device may be associated with multiple command-queues all feeding into
the single work-pool. A device may also be associated with command
queues associated with different contexts within the same platform,
again all feeding into the single work-pool. The device will pull
work-groups from the work-pool and execute them on one or several
compute units in any order; possibly interleaving execution of
work-groups from multiple commands. A conforming implementation may
choose to serialize the work-groups so a correct algorithm cannot assume
that work-groups will execute in parallel. There is no safe and
portable way to synchronize across the independent execution of
work-groups since once in the work-pool, they can execute in any order.
<br>
<br>
The work-items within a single sub-group execute concurrently but not
necessarily in parallel (i.e. they are not guaranteed to make
independent forward progress). Therefore, only high-level
synchronization constructs (e.g. sub-group functions such as barriers)
that apply to all the work-items in a sub-group are well defined and
included in OpenCL.
<br>
<br>
Sub-groups execute concurrently within a given work-group and with
appropriate device support (see <em>Section 4.2</em>) may make independent
forward progress with respect to each other, with respect to host
threads and with respect to any entities external to the OpenCL system
but running on an OpenCL device, even in the absence of work-group
barrier operations. In this situation, sub-groups are able to internally
synchronize using barrier operations without synchronizing with each
other and may perform operations that rely on runtime dependencies on
operations other sub-groups perform.
<br>
<br>
The work-items within a single work-group execute concurrently but are
only guaranteed to make independent progress in the presence of
sub-groups and device support. In the absence of this capability, only
high-level synchronization constructs (e.g. work-group functions such as
barriers) that apply to all the work-items in a work-group are well
defined and included in OpenCL for synchronization within the
work-group.
<br>
<br>
In the absence of synchronization functions (e.g. a barrier), work-items
within a sub-group may be serialized. In the presence of sub-group
functions, work-items within a sub-group may be serialized before any
given sub-group function, between dynamically encountered pairs of
sub-group functions, and between a work-group function and the end of the
kernel.
<br>
<br>
In the absence of independent forward progress of constituent
sub-groups, work-items within a work-group may be serialized before,
after or between work-group synchronization functions.</p></div>
</div>
<div class="sect3">
<h4 id="_execution_model_device_side_enqueue">3.2.3. Execution Model: Device-side enqueue</h4>
<div class="paragraph"><p>Algorithms may need to generate additional work as they execute. In
many cases, this additional work cannot be determined statically; so the
work associated with a kernel only emerges at runtime as the
kernel-instance executes. This capability could be implemented in logic
running within the host program, but involvement of the host may add
significant overhead and/or complexity to the application control
flow. A more efficient approach would be to nest kernel-enqueue
commands from inside other kernels. This <strong>nested parallelism</strong> can be
realized by supporting the enqueuing of kernels on a device without
direct involvement by the host program; so-called <strong>device-side
enqueue</strong>.
<br>
<br>
Device-side kernel-enqueue commands are similar to host-side
kernel-enqueue commands. The kernel executing on a device (the <strong>parent
kernel</strong>) enqueues a kernel-instance (the <strong>child kernel</strong>) to a
device-side command queue. This is an out-of-order command-queue and
follows the same behavior as the out-of-order command-queues exposed to
the host program. Commands enqueued to a device side command-queue
generate and use events to enforce order constraints just as for the
command-queue on the host. These events, however, are only visible to
the parent kernel running on the device. When these prerequisite
events take on the value CL_COMPLETE, the work-groups associated with
the child kernel are launched into the device&#8217;s work pool. The device
then schedules them for execution on the compute units of the device.
Child and parent kernels execute asynchronously. However, a parent will
not indicate that it is complete by setting its event to CL_COMPLETE
until all child kernels have ended execution and have signaled
completion by setting any associated events to the value CL_COMPLETE.
Should any child kernel complete with an event status set to a negative
value (i.e. abnormally terminate), the parent kernel will abnormally
terminate and propagate the child&#8217;s negative event value as the value of
the parent&#8217;s event. If there are multiple children that have an event
status set to a negative value, the selection of which child&#8217;s negative
event value is propagated is implementation-defined.</p></div>
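<div class="paragraph"><p>As a sketch of device-side enqueue, a hypothetical parent kernel written in OpenCL C 2.0 might enqueue a child kernel as a block on the default device queue; the names <code>parent</code> and <code>data</code> are illustrative.</p></div>

```c
// Sketch: a parent kernel enqueues a child kernel-instance (OpenCL C 2.0).
__kernel void parent(__global int *data, int n)
{
    queue_t q = get_default_queue();

    // With this flag the child waits for the enqueuing kernel to reach
    // the end state before it can be launched.
    enqueue_kernel(q, CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(n),
                   ^{ data[get_global_id(0)] *= 2; });
}
```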
</div>
<div class="sect3">
<h4 id="_execution_model_synchronization">3.2.4. Execution Model: Synchronization</h4>
<div class="paragraph"><p>Synchronization refers to mechanisms that constrain the order of
execution between two or more units of execution. Consider the
following three domains of synchronization in OpenCL:</p></div>
<div class="ulist"><ul>
<li>
<p>
Work-group
synchronization: Constraints on the order of execution for work-items in
a single work-group
</p>
</li>
<li>
<p>
Sub-group synchronization:
Constraints on the order of execution for work-items in a single
sub-group
</p>
</li>
<li>
<p>
Command synchronization:
Constraints on the order of commands launched for execution
</p>
</li>
</ul></div>
<div class="paragraph"><p>Synchronization across all work-items within a single work-group is
carried out using a <em>work-group function</em>. These functions carry out
collective operations across all the work-items in a work-group.
Available collective operations are: barrier, reduction, broadcast,
prefix sum, and evaluation of a predicate. A work-group function must
occur within a converged control flow; i.e. all work-items in the
work-group must encounter precisely the same work-group function. For
example, if a work-group function occurs within a loop, the work-items
must encounter the same work-group function in the same loop
iterations. All the work-items of a work-group must execute the
work-group function and complete reads and writes to memory before any
are allowed to continue execution beyond the work-group function.
Work-group functions that apply between work-groups are not provided in
OpenCL: since OpenCL does not define forward-progress or ordering
relations between work-groups, collective synchronization
operations across work-groups are not well defined.
<br>
<br>
Synchronization across all work-items within a single sub-group is
carried out using a <em>sub-group function</em>. These functions carry out
collective operations across all the work-items in a sub-group.
Available collective operations are: barrier, reduction, broadcast,
prefix sum, and evaluation of a predicate. A sub-group function must
occur within a converged control flow; i.e. all work-items in the
sub-group must encounter precisely the same sub-group function. For
example, if a sub-group function occurs within a loop, the work-items
must encounter the same sub-group function in the same loop iterations.
All the work-items of a sub-group must execute the sub-group function
and complete reads and writes to memory before any are allowed to
continue execution beyond the sub-group function. Synchronization
between sub-groups must either be performed using work-group functions,
or through memory operations. Memory operations should be used carefully
for sub-group synchronization, as forward progress of
sub-groups relative to each other is only optionally supported by OpenCL
implementations.
<br>
<br>
Command synchronization is defined in terms of distinct <strong>synchronization
points</strong>. The synchronization points occur between commands in host
command-queues and between commands in device-side command-queues. The
synchronization points defined in OpenCL include:</p></div>
<div class="ulist"><ul>
<li>
<p>
<strong>Launching a command:</strong> A
kernel-instance is launched onto a device after all events that kernel
is waiting-on have been set to CL_COMPLETE.
</p>
</li>
<li>
<p>
<strong>Ending a command:</strong> Child
kernels may be enqueued such that they wait for the parent kernel to
reach the <em>end</em> state before they can be launched. In this case, the
ending of the parent command defines a synchronization point.
</p>
</li>
<li>
<p>
<strong>Completion of a command:</strong>
A kernel-instance is complete after all of the work-groups in the kernel
and all of its child kernels have completed. This is signaled to the
host, a parent kernel or other kernels within command queues by setting
the value of the event associated with a kernel to CL_COMPLETE.
</p>
</li>
<li>
<p>
<strong>Blocking Commands:</strong> A
blocking command defines a synchronization point between the unit of
execution that calls the blocking API function and the enqueued command
reaching the complete state.
</p>
</li>
<li>
<p>
<strong>Command-queue barrier:</strong>
The command-queue barrier ensures that all previously enqueued commands
have completed before subsequently enqueued commands can be launched.
</p>
</li>
<li>
<p>
<strong>clFinish:</strong> This function
blocks until all previously enqueued commands in the command queue have
completed after which clFinish defines a synchronization point and the
clFinish function returns.
</p>
</li>
</ul></div>
<div class="paragraph"><p>A synchronization point between a pair of commands (A and B) assures
that the results of command A happen-before command B is launched. This
requires that any updates to memory from command A complete and are made
available to other commands before the synchronization point completes.
Likewise, this requires that command B waits until after the
synchronization point before loading values from global memory. The
concept of a synchronization point works in a similar fashion for
commands such as a barrier that apply to two sets of commands. All the
commands prior to the barrier must complete and make their results
available to following commands. Furthermore, any commands following
the barrier must wait for the commands prior to the barrier before
loading values and continuing their execution.
<br>
<br>
These <em>happens-before</em> relationships are a fundamental part of the
OpenCL memory model. When applied at the level of commands, they are
straightforward to define at a language level in terms of ordering
relationships between different commands. Ordering memory operations
inside different commands, however, requires rules more complex than can
be captured by the high level concept of a synchronization point.
These rules are described in detail in section 3.3.6.</p></div>
</div>
<div class="sect3">
<h4 id="_execution_model_categories_of_kernels">3.2.5. Execution Model: Categories of Kernels</h4>
<div class="paragraph"><p>The OpenCL execution model supports three types of kernels:</p></div>
<div class="ulist"><ul>
<li>
<p>
<strong>OpenCL kernels</strong> are
managed by the OpenCL API as kernel-objects associated with kernel
functions within program-objects. OpenCL kernels are provided via a
kernel language.
All OpenCL implementations must support OpenCL kernels supplied in the
standard SPIR-V intermediate language with the appropriate environment
specification, and the OpenCL C programming language defined in earlier
versions of the OpenCL specification. Implementations must also support
OpenCL kernels in
SPIR-V intermediate language. SPIR-V binaries nay be
generated from an
OpenCL kernel language or by a third party compiler from an
alternative input.
</p>
</li>
<li>
<p>
<strong>Native kernels</strong> are
accessed through a host function pointer. Native kernels are queued for
execution along with OpenCL kernels on a device and share memory objects
with OpenCL kernels. For example, these native kernels could be
functions defined in application code or exported from a library. The
ability to execute native kernels is optional within OpenCL and the
semantics of native kernels are implementation-defined. The OpenCL API
includes functions to query capabilities of a device and determine if
this capability is supported.
</p>
</li>
<li>
<p>
<strong>Built-in kernels</strong> are tied
to a particular device and are not built at runtime from source code in a
program object. The common use of built-in kernels is to expose
fixed-function hardware or firmware associated with a particular OpenCL
device or custom device. The semantics of a built-in kernel may be
defined outside of OpenCL and hence are implementation-defined.
</p>
</li>
</ul></div>
<div class="paragraph"><p>All three types of kernels are manipulated through the OpenCL command
queues and must conform to the synchronization points defined in the
OpenCL execution model.</p></div>
</div>
</div>
<div class="sect2">
<h3 id="_memory_model">3.3. Memory Model</h3>
<div class="paragraph"><p>The OpenCL memory model describes the structure, contents, and behavior
of the memory exposed by an OpenCL platform as an OpenCL program runs.
The model allows a programmer to reason about values in memory as the
host program and multiple kernel-instances execute.
<br>
<br>
An OpenCL program defines a context that includes a host, one or more
devices, command-queues, and memory exposed within the context.
Consider the units of execution involved with such a program. The host
program runs as one or more host threads managed by the operating system
running on the host (the details of which are defined outside of
OpenCL). There may be multiple devices in a single context which all
have access to memory objects defined by OpenCL. On a single device,
multiple work-groups may execute in parallel with potentially
overlapping updates to memory. Finally, within a single work-group,
multiple work-items concurrently execute, once again with potentially
overlapping updates to memory.
<br>
<br>
The memory model must precisely define how the values in memory as seen
from each of these units of execution interact so a programmer can
reason about the correctness of OpenCL programs. We define the memory
model in four parts.</p></div>
<div class="ulist"><ul>
<li>
<p>
Memory regions: The
distinct memories visible to the host and the devices that share a
context.
</p>
</li>
<li>
<p>
Memory objects: The
objects defined by the OpenCL API and their management by the host and
devices.
</p>
</li>
<li>
<p>
Shared Virtual Memory: A
virtual address space exposed to both the host and the devices within a
context.
</p>
</li>
<li>
<p>
Consistency Model: Rules
that define which values are observed when multiple units of execution
load data from memory plus the atomic/fence operations that constrain
the order of memory operations and define synchronization relationships.
</p>
</li>
</ul></div>
<div class="sect3">
<h4 id="_memory_model_fundamental_memory_regions">3.3.1. Memory Model: Fundamental Memory Regions</h4>
<div class="paragraph"><p>Memory in OpenCL is divided into two parts.</p></div>
<div class="ulist"><ul>
<li>
<p>
<strong>Host Memory:</strong> The memory
directly available to the host. The detailed behavior of host memory is
defined outside of OpenCL. Memory objects move between the Host and the
devices through functions within the OpenCL API or through a shared
virtual memory interface.
</p>
</li>
<li>
<p>
<strong>Device Memory:</strong> Memory
directly available to kernels executing on OpenCL devices.
</p>
</li>
</ul></div>
<div class="paragraph"><p>Device memory consists of four named address spaces or <em>memory regions</em>:</p></div>
<div class="ulist"><ul>
<li>
<p>
<strong>Global Memory:</strong> This
memory region permits read/write access to all work-items in all
work-groups running on any device within a context. Work-items can read
from or write to any element of a memory object. Reads and writes to
global memory may be cached depending on the capabilities of the device.
</p>
</li>
<li>
<p>
<strong>Constant Memory</strong>: A
region of global memory that remains constant during the execution of a
kernel-instance. The host allocates and initializes memory objects
placed into constant memory.
</p>
</li>
<li>
<p>
<strong>Local Memory</strong>: A memory
region local to a work-group. This memory region can be used to allocate
variables that are shared by all work-items in that work-group.
</p>
</li>
<li>
<p>
<strong>Private Memory</strong>: A region
of memory private to a work-item. Variables defined in one work-item&#8217;s
private memory are not visible to another work-item.
</p>
</li>
</ul></div>
<div class="paragraph"><p> </p></div>
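<div class="paragraph"><p>The four device-side regions correspond to address space qualifiers in OpenCL C. The following hypothetical kernel (all names illustrative) touches each of them:</p></div>

```c
// Sketch: the four named address spaces as seen from an OpenCL C kernel.
__constant float scale = 0.5f;            // constant memory, program scope

__kernel void touch_regions(__global float *g,   // global memory
                            __local  float *l)   // local memory, per work-group
{
    float p = g[get_global_id(0)];        // p resides in private memory

    l[get_local_id(0)] = p * scale;
    barrier(CLK_LOCAL_MEM_FENCE);
    g[get_global_id(0)] = l[get_local_id(0)];
}
```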
<div class="paragraph"><p>The memory regions and their relationship to the OpenCL Platform model
are summarized in figure 3-4. Local and private memories are always
associated with a particular device. The global and constant memories,
however, are shared between all devices within a given context. An
OpenCL device may include a cache to support efficient access to these
shared memories.
<br>
<br>
To understand memory in OpenCL, it is important to appreciate the
relationships between these named address spaces. The four named
address spaces available to a device are disjoint, meaning they do not
overlap. This is a logical relationship, however, and an
implementation may choose to let these disjoint named address spaces
share physical memory.
<br>
<br>
Programmers often need functions callable from kernels where the
pointers manipulated by those functions can point to multiple named
address spaces. This saves a programmer from the error-prone and
wasteful practice of creating multiple copies of functions, one for each
named address space. Therefore the global, local and private address
spaces belong to a single <em>generic address space</em>. This is closely
modeled after the concept of a generic address space used in the
embedded C standard (ISO/IEC 9899:1999). Since they all belong to a
single generic address space, the following properties are supported for
pointers to named address spaces in device memory:</p></div>
<div class="ulist"><ul>
<li>
<p>
A pointer to the generic
address space can be cast to a pointer to a global, local or private
address space
</p>
</li>
<li>
<p>
A pointer to a global,
local or private address space can be cast to a pointer to the generic
address space.
</p>
</li>
<li>
<p>
A pointer to a global,
local or private address space can be implicitly converted to a pointer
to the generic address space, but the converse is not allowed.
</p>
</li>
</ul></div>
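<div class="paragraph"><p>As a sketch of these conversion rules, a single helper function taking an unqualified (generic) pointer can serve both global and local arguments; this assumes OpenCL C 2.0, where unqualified pointers point into the generic address space, and all names are illustrative.</p></div>

```c
// Sketch: implicit conversion of __global* and __local* pointers to the
// generic address space lets one helper serve both (OpenCL C 2.0).
static float sum2(const float *p)   // p points into the generic address space
{
    return p[0] + p[1];
}

__kernel void use_generic(__global float *g, __local float *l)
{
    float a = sum2(g);   // __global* converts implicitly to generic
    float b = sum2(l);   // __local*  converts implicitly to generic
    g[get_global_id(0)] = a + b;
}
```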
<div class="paragraph"><p> </p></div>
<div class="paragraph"><p>The constant address space is disjoint from the generic address space.
<br>
<br>
The addresses of memory associated with memory objects in Global memory
are not preserved between kernel instances, between a device and the
host, and between devices. In this regard global memory acts as a global
pool of memory objects rather than an address space. This restriction is
relaxed when shared virtual memory (SVM) is used.
<br>
<br>
SVM causes addresses to be meaningful between the host and all of the
devices within a context, hence supporting the use of pointer-based data
structures in OpenCL kernels. It logically extends a portion of the
global memory into the host address space giving work-items access to
the host address space. On platforms with hardware support for a shared
address space between the host and one or more devices, SVM may also
provide a more efficient way to share data between devices and the host.
Details about SVM are presented in section 3.3.3.</p></div>
<div class="paragraph"><p><span class="image">
<img src="opencl22-API_files/image008.jpg" alt="image">
</span></p></div>
<div class="paragraph"><p><strong>Figure 3-4: The named address spaces exposed in an OpenCL Platform.
Global and Constant memories are shared between the one or more devices
within a context, while local and private memories are associated with a
single device. Each device may include an optional cache to support
efficient access to their view of the global and constant address
spaces.</strong></p></div>
<div class="paragraph"><p>A programmer may use the features of the memory consistency model
(section 3.3.4) to manage safe access to global memory from multiple
work-items potentially running on one or more devices. In addition, when
using shared virtual memory (SVM), the memory consistency model may also
be used to ensure that host threads safely access memory locations in
the shared memory region.</p></div>
</div>
<div class="sect3">
<h4 id="_memory_model_memory_objects">3.3.2. Memory Model: Memory Objects</h4>
<div class="paragraph"><p>The contents of global memory are <em>memory objects</em>. A memory object is
a handle to a reference counted region of global memory. Memory objects
use the OpenCL type <em>cl_mem</em> and fall into three distinct classes.</p></div>
<div class="ulist"><ul>
<li>
<p>
<strong>Buffer</strong>: A memory object
stored as a block of contiguous memory and used as a general purpose
object to hold data used in an OpenCL program. The types of the values
within a buffer may be any of the built-in types (such as int, float),
vector types, or user-defined structures. The buffer can be
manipulated through pointers much as one would with any block of memory
in C.
</p>
</li>
<li>
<p>
<strong>Image</strong>: An image memory
object holds one, two or three dimensional images. The formats are
based on the standard image formats used in graphics applications. An
image is an opaque data structure managed by functions defined in the
OpenCL API. To optimize the manipulation of images stored in the
texture memories found in many GPUs, OpenCL kernels have traditionally
been disallowed from both reading and writing a single image. In OpenCL
2.0, however, we have relaxed this restriction by providing
synchronization and fence operations that let programmers properly
synchronize their code to safely allow a kernel to read and write a
single image.
</p>
</li>
<li>
<p>
<strong>Pipe</strong>: The <em>pipe</em> memory
object conceptually is an ordered sequence of data items. A pipe has
two endpoints: a write endpoint into which data items are inserted, and
a read endpoint from which data items are removed. At any one time,
only one kernel instance may write into a pipe, and only one kernel
instance may read from a pipe. To support the producer consumer design
pattern, one kernel instance connects to the write endpoint (the
producer) while another kernel instance connects to the reading endpoint
(the consumer).
</p>
</li>
</ul></div>
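<div class="paragraph"><p>Each class of memory object is created through the host API. A hypothetical host-side fragment might create one of each; <code>ctx</code> is assumed to be an existing <code>cl_context</code>, and error handling is elided.</p></div>

```c
/* Sketch: creating a buffer, an image and a pipe on the host
   (ctx is an existing cl_context; error handling omitted). */
cl_int err;

cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                            1024 * sizeof(float), NULL, &err);

cl_image_format fmt  = { CL_RGBA, CL_UNORM_INT8 };
cl_image_desc   desc = { .image_type  = CL_MEM_OBJECT_IMAGE2D,
                         .image_width = 256, .image_height = 256 };
cl_mem img = clCreateImage(ctx, CL_MEM_READ_ONLY, &fmt, &desc, NULL, &err);

cl_mem pipe = clCreatePipe(ctx, 0, sizeof(int), 128, NULL, &err);
```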
<div class="paragraph"><p> </p></div>
<div class="paragraph"><p>Memory objects are allocated by host APIs. The host program can provide
the runtime with a pointer to a block of contiguous memory to hold the
memory object when the object is created (CL_MEM_USE_HOST_PTR).
Alternatively, the physical memory can be managed by the OpenCL runtime
and not be directly accessible to the host program.
<br>
<br>
Allocation and access to memory objects within the different memory
regions varies between the host and work-items running on a device.
This is summarized in table 3-1, which describes whether the kernel or
the host can allocate from a memory region, the type of allocation
(static at compile time vs. dynamic at runtime) and the type of access
allowed (i.e. whether the kernel or the host can read and/or write to a
memory region).</p></div>
<div style="page-break-after:always"></div>
<table class="tableblock frame-all grid-all"
style="
width:80%;
">
<col style="width:20%;">
<col style="width:20%;">
<col style="width:20%;">
<col style="width:20%;">
<col style="width:20%;">
<tbody>
<tr>
<td class="tableblock halign-left valign-top" ><p class="tableblock"></p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">Global</p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">Constant</p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">Local</p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">Private</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top" rowspan="2" ><p class="tableblock">Host</p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">Dynamic Allocation</p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">Dynamic Allocation</p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">Dynamic Allocation</p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">No Allocation</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top" ><p class="tableblock">Read/Write access to buffers and images but not pipes</p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">Read/Write access</p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">No access</p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">No access</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top" rowspan="2" ><p class="tableblock">Kernel</p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">Static Allocation for program scope variables</p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">Static Allocation</p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">Static Allocation. Dynamic allocation for child kernel</p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">Static Allocation</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top" ><p class="tableblock">Read/Write access</p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">Read-only access</p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">Read/Write access. No access to child&#8217;s local memory.</p></td>
<td class="tableblock halign-left valign-top" ><p class="tableblock">Read/Write access</p></td>
</tr>
</tbody>
</table>
<div class="paragraph"><p> </p></div>
<div class="paragraph"><p><strong>Table 3-1: The different memory regions in
OpenCL and how memory objects are allocated and accessed by the host and
by an executing instance of a kernel. For the case of kernels, we
distinguish between the behavior of local memory with respect to a
kernel (self) and its child kernels.</strong></p></div>
<div class="paragraph"><p>Once allocated, a memory object is made available to kernel-instances
running on one or more devices. In addition to shared virtual memory
(section 3.3.3) there are three basic ways to manage the contents of
buffers between the host and devices.</p></div>
<div class="ulist"><ul>
<li>
<p>
<strong>Read/Write/Fill
commands</strong>: The data associated with a memory object is explicitly read
and written between the host and global memory regions using commands
enqueued to an OpenCL command queue.
</p>
</li>
<li>
<p>
<strong>Map/Unmap commands</strong>: Data
from the memory object is mapped into a contiguous block of memory
accessed through a host accessible pointer. The host program enqueues a
<em>map</em> command on a block of a memory object before it can be safely
manipulated by the host program. When the host program is finished
working with the block of memory, the host program enqueues an <em>unmap</em>
command to allow a kernel-instance to safely read and/or write the
buffer.
</p>
</li>
<li>
<p>
<strong>Copy commands:</strong> The data
associated with a memory object is copied between two buffers, each of
which may reside either on the host or on the device.
</p>
</li>
</ul></div>
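<div class="paragraph"><p>For example, the map/unmap pattern might look like the following hypothetical host fragment; <code>queue</code>, <code>buf</code>, <code>nbytes</code> and <code>n</code> are assumed to exist, and error handling is elided.</p></div>

```c
/* Sketch: host access to a buffer through map/unmap (error handling omitted). */
cl_int err;
float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                       0, nbytes, 0, NULL, NULL, &err);
for (size_t i = 0; i < n; ++i)
    p[i] = 0.0f;                 /* the host works on the mapped block */

/* After unmapping, kernel-instances may again safely access the buffer. */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
```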
<div class="paragraph"><p> </p></div>
<div class="paragraph"><p>In all cases, the commands to transfer data between devices and the
host can be blocking or non-blocking operations. The OpenCL function
call for a blocking memory transfer returns once the associated memory
resources on the host can be safely reused. For a non-blocking memory
transfer, the OpenCL function call returns as soon as the command is
enqueued.
<br>
<br>
Memory objects are bound to a context and hence can appear in multiple
kernel-instances running on more than one physical device. The OpenCL
platform must support a large range of hardware platforms including
systems that do not support a single shared address space in hardware;
hence the ways memory objects can be shared between kernel-instances are
restricted. The basic principle is that multiple read operations on
memory objects from multiple kernel-instances that overlap in time are
allowed, but mixing overlapping reads and writes into the same memory
objects from different kernel instances is only allowed when fine
grained synchronization is used with shared virtual memory (see section
3.3.3).
<br>
<br>
When global memory is manipulated by multiple kernel-instances running
on multiple devices, the OpenCL runtime system must manage the
association of memory objects with a given device. In most cases the
OpenCL runtime will implicitly associate a memory object with a device.
A kernel instance is naturally associated with the command queue to
which the kernel was submitted. Since a command-queue can only access a
single device, the queue uniquely defines which device is involved with
any given kernel-instance; hence defining a clear association between
memory objects, kernel-instances and devices. Programmers may
anticipate these associations in their programs and explicitly manage
association of memory objects with devices in order to improve
performance.</p></div>
</div>
<div class="sect3">
<h4 id="_memory_model_shared_virtual_memory">3.3.3. Memory Model: Shared Virtual Memory</h4>
<div class="paragraph"><p>OpenCL extends the global memory region into the host memory region
through a shared virtual memory (SVM) mechanism. There are three types
of SVM in OpenCL:</p></div>
<div class="ulist"><ul>
<li>
<p>
<strong>Coarse-Grained buffer
SVM</strong>: Sharing occurs at the granularity of regions of OpenCL buffer
memory objects. Consistency is enforced at synchronization points and
with map/unmap commands to drive updates between the host and the
device. This form of SVM is similar to non-SVM use of memory; however,
it lets kernel-instances share pointer-based data structures (such as
linked-lists) with the host program. Program scope global variables are
treated as per-device coarse-grained SVM for addressing and sharing
purposes.
</p>
</li>
<li>
<p>
<strong>Fine-Grained buffer
SVM</strong>: Sharing occurs at the granularity of individual loads/stores into
bytes within OpenCL buffer memory objects. Loads and stores may be
cached, so consistency is only guaranteed at synchronization points.
If the optional OpenCL atomics are supported, they can be used to
provide fine-grained control of memory consistency.
</p>
</li>
<li>
<p>
<strong>Fine-Grained system SVM</strong>:
Sharing occurs at the granularity of individual loads/stores into bytes
occurring anywhere within the host memory. Loads and stores may be
cached, so consistency is only guaranteed at synchronization points. If the
optional OpenCL atomics are supported, they can be used to provide
fine-grained control of memory consistency.
</p>
</li>
</ul></div>
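<div class="paragraph"><p>A hypothetical host fragment using coarse-grained buffer SVM; <code>ctx</code>, <code>k</code> and <code>nbytes</code> are assumed to exist, and error handling is elided.</p></div>

```c
/* Sketch: coarse-grained buffer SVM allocation shared with a kernel. */
void *p = clSVMAlloc(ctx, CL_MEM_READ_WRITE, nbytes, 0 /* default alignment */);

/* The kernel dereferences the same addresses the host sees. */
clSetKernelArgSVMPointer(k, 0, p);
/* ... enqueue k, synchronize, use the results ... */
clSVMFree(ctx, p);
```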
<table class="tableblock frame-all grid-all"
style="
width:100%;
">
<caption class="title">Table 3-2. <strong>A summary of shared virtual memory (SVM) options in OpenCL</strong></caption>
<col style="width:20%;">
<col style="width:20%;">
<col style="width:20%;">
<col style="width:20%;">
<col style="width:20%;">
<tbody>
<tr>
<td class="tableblock halign-center valign-top" ><p class="tableblock"></p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">Granularity of sharing</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">Memory Allocation</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">Mechanisms to enforce Consistency</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">Explicit updates
between host and device</p></td>
</tr>
<tr>
<td class="tableblock halign-center valign-top" ><p class="tableblock">Non-SVM buffers</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">OpenCL Memory objects(buffer)</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">clCreateBuffer</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">Host synchronization points on the same or between
devices.</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">yes, through Map and Unmap commands.</p></td>
</tr>
<tr>
<td class="tableblock halign-center valign-top" ><p class="tableblock">Coarse-Grained buffer SVM</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">OpenCL Memory objects (buffer)</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">clSVMAlloc</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">Host synchronization points
between devices</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">yes, through Map and Unmap commands.</p></td>
</tr>
<tr>
<td class="tableblock halign-center valign-top" ><p class="tableblock">Fine Grained buffer SVM</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">Bytes within OpenCL Memory objects (buffer)</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">clSVMAlloc</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">Synchronization points plus atomics (if supported)</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">No</p></td>
</tr>
<tr>
<td class="tableblock halign-center valign-top" ><p class="tableblock">Fine-Grained system SVM</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">Bytes within Host memory (system)</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">Host memory allocation mechanisms (e.g. malloc)</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">Synchronization points plus atomics (if
supported)</p></td>
<td class="tableblock halign-center valign-top" ><p class="tableblock">No</p></td>
</tr>
</tbody>
</table>
<div class="paragraph"><p>Coarse-Grained buffer SVM is required in the core OpenCL specification.
The two finer grained approaches are optional features in OpenCL. The
various SVM mechanisms to access host memory from the work-items
associated with a kernel instance are summarized in table 3-2.</p></div>
</div>
<div class="sect3">
<h4 id="_memory_model_memory_consistency_model">3.3.4. Memory Model: Memory Consistency Model</h4>
<div class="paragraph"><p>The OpenCL memory model tells programmers what they can expect from an
OpenCL implementation; which memory operations are guaranteed to happen
in which order and which memory values each read operation will return.
The memory model tells compiler writers which restrictions they must
follow when implementing compiler optimizations; which variables they
can cache in registers and when they can move reads or writes around a
barrier or atomic operation. The memory model also tells hardware
designers about limitations on hardware optimizations; for example, when
they must flush or invalidate hardware caches.
<br>
<br>
The memory consistency model in OpenCL is based on the memory model from
the ISO C11 programming language. To help make the presentation more
precise and self-contained, we include modified paragraphs taken
verbatim from the ISO C11 international standard. When a paragraph is
taken or modified from the C11 standard, it is identified as such along
with its original location in the C11 standard.
<br>
<br>
For programmers, the most intuitive model is the <em>sequential
consistency</em> memory model. Sequential consistency interleaves the steps
executed by each of the units of execution. Each access to a memory
location sees the last assignment to that location in that
interleaving. While sequential consistency is relatively
straightforward for a programmer to reason about, implementing
sequential consistency is expensive. Therefore, OpenCL implements a
relaxed memory consistency model; i.e. it is possible to write programs
where the loads from memory violate sequential consistency. Fortunately,
if a program is free of data races and uses only atomic operations with
the sequentially consistent memory order (the default memory ordering
for OpenCL), it appears to execute with sequential consistency.
<br>
<br>
Programmers can, to some degree, control how the memory model is relaxed by choosing the memory order for synchronization operations. The precise semantics of synchronization and the memory orders are formally defined in section 3.3.6. Here, we give a high-level description of how these memory orders apply to atomic operations on atomic objects shared between units of execution. OpenCL memory_order choices are based on those from the ISO C11 standard memory model. They are specified in certain OpenCL functions through the following enumeration constants:</p></div>
<div class="ulist"><ul>
<li>
<p>
<strong>memory_order_relaxed</strong>:
implies no order constraints. This memory order can be used safely to
increment counters that are concurrently incremented, but it doesn't
guarantee anything about the ordering with respect to operations to
other memory locations. It can also be used, for example, to do ticket
allocation and by expert programmers implementing lock-free algorithms.
</p>
</li>
<li>
<p>
<strong>memory_order_acquire</strong>: A
synchronization operation (fence or atomic) that has acquire semantics
"acquires" side-effects from a release operation that synchronises with
it: if an acquire synchronises with a release, the acquiring unit of
execution will see all side-effects preceding that release (and possibly
subsequent side-effects). As part of carefully-designed protocols,
programmers can use an "acquire" to safely observe the work of another
unit of execution.
</p>
</li>
<li>
<p>
<strong>memory_order_release</strong>: A
synchronization operation (fence or atomic operation) that has release
semantics "releases" side effects to an acquire operation that
synchronises with it. All side effects that precede the release are
included in the release. As part of carefully-designed protocols,
programmers can use a "release" to make changes made in one unit of
execution visible to other units of execution.
</p>
</li>
</ul></div>
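<div class="paragraph"><p>Because the OpenCL memory orders are based on those of ISO C11, their behaviour can be illustrated with a host-side C11 sketch. The example below (an illustration, not part of the OpenCL API) uses <span class="monospaced">memory_order_relaxed</span> to increment a shared counter from several POSIX threads: each increment is atomic, so no count is lost, even though relaxed ordering implies nothing about other memory locations. The same principle applies to <span class="monospaced">atomic_fetch_add_explicit</span> on atomic objects in OpenCL C kernels.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Host-side C11 illustration; thread and iteration counts are arbitrary. */
#define NUM_THREADS 4
#define INCS_PER_THREAD 100000

static atomic_int counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < INCS_PER_THREAD; ++i) {
        /* Relaxed ordering: the increment itself is atomic, but no
           ordering is implied with respect to other memory locations. */
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(threads[i], NULL);
    /* Every increment is counted exactly once: 4 * 100000. */
    printf("%d\n", atomic_load(&counter));
    return 0;
}
```</code></pre>
</div>
</div>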
<div class="admonitionblock">
<table><tr>
<td class="icon">
<div class="title">Note</div>
</td>
<td class="content">In general, an acquire operation is not guaranteed to
synchronise with any particular release operation. However,
synchronisation can be forced by certain executions. See 3.3.6.2 for
detailed rules for when synchronisation must occur.</td>
</tr></table>
</div>
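<div class="paragraph"><p>The acquire/release pairing described above is the classic message-passing idiom. The following C11 sketch (again a host-side analogue, not OpenCL C) has one thread write an ordinary variable and then "release" a flag; once the other thread "acquires" the flag, all side effects that preceded the release, including the write to the ordinary variable, are guaranteed visible to it.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload;           /* ordinary, non-atomic data */
static atomic_int ready = 0;  /* flag used for synchronisation */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;  /* side effect that precedes the release */
    /* Release: publishes all preceding side effects to any acquire
       operation that synchronises with this store. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    /* Acquire: once ready == 1 is observed, the write to payload made
       before the matching release is guaranteed to be visible. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;  /* spin until the flag is released */
    printf("%d\n", payload);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```</code></pre>
</div>
</div>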
<div class="ulist"><ul>
<li>
<p>
<strong>memory_order_acq_rel</strong>: A
synchronization operation with acquire-release semantics has the
properties of both the acquire and release memory orders. It is
typically used to order read-modify-write operations.
</p>
</li>
<li>
<p>
<strong>memory_order_seq_cst</strong>:
The loads and stores of each unit of execution appear to execute in
program (i.e., sequenced-before) order, and the loads and stores from
different units of execution appear to be simply interleaved.
<br>
<br>
Regardless of which memory_order is specified, resolving constraints on
memory operations across a heterogeneous platform adds considerable