| <!DOCTYPE html> |
| <html lang="en"> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> |
| <meta name="generator" content="AsciiDoc 8.6.9"> |
| <title>The OpenCL Specification</title> |
| <style type="text/css"> |
| /* Shared CSS for AsciiDoc xhtml11 and html5 backends */ |
| |
| /* Default font. */ |
| body { |
| font-family: Georgia,serif; |
| } |
| |
| /* Title font. */ |
| h1, h2, h3, h4, h5, h6, |
| div.title, caption.title, |
| thead, p.table.header, |
| #toctitle, |
| #author, #revnumber, #revdate, #revremark, |
| #footer { |
| font-family: Arial,Helvetica,sans-serif; |
| } |
| |
| body { |
| margin: 1em 5% 1em 5%; |
| } |
| |
| a { |
| color: blue; |
| text-decoration: underline; |
| } |
| a:visited { |
| color: fuchsia; |
| } |
| |
| em { |
| font-style: italic; |
| color: navy; |
| } |
| |
| strong { |
| font-weight: bold; |
| color: #083194; |
| } |
| |
| h1, h2, h3, h4, h5, h6 { |
| color: #527bbd; |
| margin-top: 1.2em; |
| margin-bottom: 0.5em; |
| line-height: 1.3; |
| } |
| |
| h1, h2, h3 { |
| border-bottom: 2px solid silver; |
| } |
| h2 { |
| padding-top: 0.5em; |
| } |
| h3 { |
| float: left; |
| } |
| h3 + * { |
| clear: left; |
| } |
| h5 { |
| font-size: 1.0em; |
| } |
| |
| div.sectionbody { |
| margin-left: 0; |
| } |
| |
| hr { |
| border: 1px solid silver; |
| } |
| |
| p { |
| margin-top: 0.5em; |
| margin-bottom: 0.5em; |
| } |
| |
| ul, ol, li > p { |
| margin-top: 0; |
| } |
| ul > li { color: #aaa; } |
| ul > li > * { color: black; } |
| |
| .monospaced, code, pre { |
| font-family: "Courier New", Courier, monospace; |
| font-size: inherit; |
| color: navy; |
| padding: 0; |
| margin: 0; |
| } |
| pre { |
| white-space: pre-wrap; |
| } |
| |
| #author { |
| color: #527bbd; |
| font-weight: bold; |
| font-size: 1.1em; |
| } |
| #email { |
| } |
| #revnumber, #revdate, #revremark { |
| } |
| |
| #footer { |
| font-size: small; |
| border-top: 2px solid silver; |
| padding-top: 0.5em; |
| margin-top: 4.0em; |
| } |
| #footer-text { |
| float: left; |
| padding-bottom: 0.5em; |
| } |
| #footer-badges { |
| float: right; |
| padding-bottom: 0.5em; |
| } |
| |
| #preamble { |
| margin-top: 1.5em; |
| margin-bottom: 1.5em; |
| } |
| div.imageblock, div.exampleblock, div.verseblock, |
| div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock, |
| div.admonitionblock { |
| margin-top: 1.0em; |
| margin-bottom: 1.5em; |
| } |
| div.admonitionblock { |
| margin-top: 2.0em; |
| margin-bottom: 2.0em; |
| margin-right: 10%; |
| color: #606060; |
| } |
| |
| div.content { /* Block element content. */ |
| padding: 0; |
| } |
| |
| /* Block element titles. */ |
| div.title, caption.title { |
| color: #527bbd; |
| font-weight: bold; |
| text-align: left; |
| margin-top: 1.0em; |
| margin-bottom: 0.5em; |
| } |
| div.title + * { |
| margin-top: 0; |
| } |
| |
| td div.title:first-child { |
| margin-top: 0.0em; |
| } |
| div.content div.title:first-child { |
| margin-top: 0.0em; |
| } |
| div.content + div.title { |
| margin-top: 0.0em; |
| } |
| |
| div.sidebarblock > div.content { |
| background: #ffffee; |
| border: 1px solid #dddddd; |
| border-left: 4px solid #f0f0f0; |
| padding: 0.5em; |
| } |
| |
| div.listingblock > div.content { |
| border: 1px solid #dddddd; |
| border-left: 5px solid #f0f0f0; |
| background: #f8f8f8; |
| padding: 0.5em; |
| } |
| |
| div.quoteblock, div.verseblock { |
| padding-left: 1.0em; |
| margin-left: 1.0em; |
| margin-right: 10%; |
| border-left: 5px solid #f0f0f0; |
| color: #888; |
| } |
| |
| div.quoteblock > div.attribution { |
| padding-top: 0.5em; |
| text-align: right; |
| } |
| |
| div.verseblock > pre.content { |
| font-family: inherit; |
| font-size: inherit; |
| } |
| div.verseblock > div.attribution { |
| padding-top: 0.75em; |
| text-align: left; |
| } |
| /* DEPRECATED: Pre version 8.2.7 verse style literal block. */ |
| div.verseblock + div.attribution { |
| text-align: left; |
| } |
| |
| div.admonitionblock .icon { |
| vertical-align: top; |
| font-size: 1.1em; |
| font-weight: bold; |
| text-decoration: underline; |
| color: #527bbd; |
| padding-right: 0.5em; |
| } |
| div.admonitionblock td.content { |
| padding-left: 0.5em; |
| border-left: 3px solid #dddddd; |
| } |
| |
| div.exampleblock > div.content { |
| border-left: 3px solid #dddddd; |
| padding-left: 0.5em; |
| } |
| |
| div.imageblock div.content { padding-left: 0; } |
| span.image img { border-style: none; vertical-align: text-bottom; } |
| a.image:visited { color: white; } |
| |
| dl { |
| margin-top: 0.8em; |
| margin-bottom: 0.8em; |
| } |
| dt { |
| margin-top: 0.5em; |
| margin-bottom: 0; |
| font-style: normal; |
| color: navy; |
| } |
| dd > *:first-child { |
| margin-top: 0.1em; |
| } |
| |
| ul, ol { |
| list-style-position: outside; |
| } |
| ol.arabic { |
| list-style-type: decimal; |
| } |
| ol.loweralpha { |
| list-style-type: lower-alpha; |
| } |
| ol.upperalpha { |
| list-style-type: upper-alpha; |
| } |
| ol.lowerroman { |
| list-style-type: lower-roman; |
| } |
| ol.upperroman { |
| list-style-type: upper-roman; |
| } |
| |
| div.compact ul, div.compact ol, |
| div.compact p, div.compact p, |
| div.compact div, div.compact div { |
| margin-top: 0.1em; |
| margin-bottom: 0.1em; |
| } |
| |
| tfoot { |
| font-weight: bold; |
| } |
| td > div.verse { |
| white-space: pre; |
| } |
| |
| div.hdlist { |
| margin-top: 0.8em; |
| margin-bottom: 0.8em; |
| } |
| div.hdlist tr { |
| padding-bottom: 15px; |
| } |
| dt.hdlist1.strong, td.hdlist1.strong { |
| font-weight: bold; |
| } |
| td.hdlist1 { |
| vertical-align: top; |
| font-style: normal; |
| padding-right: 0.8em; |
| color: navy; |
| } |
| td.hdlist2 { |
| vertical-align: top; |
| } |
| div.hdlist.compact tr { |
| margin: 0; |
| padding-bottom: 0; |
| } |
| |
| .comment { |
| background: yellow; |
| } |
| |
| .footnote, .footnoteref { |
| font-size: 0.8em; |
| } |
| |
| span.footnote, span.footnoteref { |
| vertical-align: super; |
| } |
| |
| #footnotes { |
| margin: 20px 0 20px 0; |
| padding: 7px 0 0 0; |
| } |
| |
| #footnotes div.footnote { |
| margin: 0 0 5px 0; |
| } |
| |
| #footnotes hr { |
| border: none; |
| border-top: 1px solid silver; |
| height: 1px; |
| text-align: left; |
| margin-left: 0; |
| width: 20%; |
| min-width: 100px; |
| } |
| |
| div.colist td { |
| padding-right: 0.5em; |
| padding-bottom: 0.3em; |
| vertical-align: top; |
| } |
| div.colist td img { |
| margin-top: 0.3em; |
| } |
| |
| @media print { |
| #footer-badges { display: none; } |
| } |
| |
| #toc { |
| margin-bottom: 2.5em; |
| } |
| |
| #toctitle { |
| color: #527bbd; |
| font-size: 1.1em; |
| font-weight: bold; |
| margin-top: 1.0em; |
| margin-bottom: 0.1em; |
| } |
| |
| div.toclevel0, div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 { |
| margin-top: 0; |
| margin-bottom: 0; |
| } |
| div.toclevel2 { |
| margin-left: 2em; |
| font-size: 0.9em; |
| } |
| div.toclevel3 { |
| margin-left: 4em; |
| font-size: 0.9em; |
| } |
| div.toclevel4 { |
| margin-left: 6em; |
| font-size: 0.9em; |
| } |
| |
| span.aqua { color: aqua; } |
| span.black { color: black; } |
| span.blue { color: blue; } |
| span.fuchsia { color: fuchsia; } |
| span.gray { color: gray; } |
| span.green { color: green; } |
| span.lime { color: lime; } |
| span.maroon { color: maroon; } |
| span.navy { color: navy; } |
| span.olive { color: olive; } |
| span.purple { color: purple; } |
| span.red { color: red; } |
| span.silver { color: silver; } |
| span.teal { color: teal; } |
| span.white { color: white; } |
| span.yellow { color: yellow; } |
| |
| span.aqua-background { background: aqua; } |
| span.black-background { background: black; } |
| span.blue-background { background: blue; } |
| span.fuchsia-background { background: fuchsia; } |
| span.gray-background { background: gray; } |
| span.green-background { background: green; } |
| span.lime-background { background: lime; } |
| span.maroon-background { background: maroon; } |
| span.navy-background { background: navy; } |
| span.olive-background { background: olive; } |
| span.purple-background { background: purple; } |
| span.red-background { background: red; } |
| span.silver-background { background: silver; } |
| span.teal-background { background: teal; } |
| span.white-background { background: white; } |
| span.yellow-background { background: yellow; } |
| |
| span.big { font-size: 2em; } |
| span.small { font-size: 0.6em; } |
| |
| span.underline { text-decoration: underline; } |
| span.overline { text-decoration: overline; } |
| span.line-through { text-decoration: line-through; } |
| |
| div.unbreakable { page-break-inside: avoid; } |
| |
| |
| /* |
| * xhtml11 specific |
| * |
| * */ |
| |
| div.tableblock { |
| margin-top: 1.0em; |
| margin-bottom: 1.5em; |
| } |
| div.tableblock > table { |
| border: 3px solid #527bbd; |
| } |
| thead, p.table.header { |
| font-weight: bold; |
| color: #527bbd; |
| } |
| p.table { |
| margin-top: 0; |
| } |
| /* Because the table frame attribute is overriden by CSS in most browsers. */ |
| div.tableblock > table[frame="void"] { |
| border-style: none; |
| } |
| div.tableblock > table[frame="hsides"] { |
| border-left-style: none; |
| border-right-style: none; |
| } |
| div.tableblock > table[frame="vsides"] { |
| border-top-style: none; |
| border-bottom-style: none; |
| } |
| |
| |
| /* |
| * html5 specific |
| * |
| * */ |
| |
| table.tableblock { |
| margin-top: 1.0em; |
| margin-bottom: 1.5em; |
| } |
| thead, p.tableblock.header { |
| font-weight: bold; |
| color: #527bbd; |
| } |
| p.tableblock { |
| margin-top: 0; |
| } |
| table.tableblock { |
| border-width: 3px; |
| border-spacing: 0px; |
| border-style: solid; |
| border-color: #527bbd; |
| border-collapse: collapse; |
| } |
| th.tableblock, td.tableblock { |
| border-width: 1px; |
| padding: 4px; |
| border-style: solid; |
| border-color: #527bbd; |
| } |
| |
| table.tableblock.frame-topbot { |
| border-left-style: hidden; |
| border-right-style: hidden; |
| } |
| table.tableblock.frame-sides { |
| border-top-style: hidden; |
| border-bottom-style: hidden; |
| } |
| table.tableblock.frame-none { |
| border-style: hidden; |
| } |
| |
| th.tableblock.halign-left, td.tableblock.halign-left { |
| text-align: left; |
| } |
| th.tableblock.halign-center, td.tableblock.halign-center { |
| text-align: center; |
| } |
| th.tableblock.halign-right, td.tableblock.halign-right { |
| text-align: right; |
| } |
| |
| th.tableblock.valign-top, td.tableblock.valign-top { |
| vertical-align: top; |
| } |
| th.tableblock.valign-middle, td.tableblock.valign-middle { |
| vertical-align: middle; |
| } |
| th.tableblock.valign-bottom, td.tableblock.valign-bottom { |
| vertical-align: bottom; |
| } |
| |
| |
| /* |
| * manpage specific |
| * |
| * */ |
| |
| body.manpage h1 { |
| padding-top: 0.5em; |
| padding-bottom: 0.5em; |
| border-top: 2px solid silver; |
| border-bottom: 2px solid silver; |
| } |
| body.manpage h2 { |
| border-style: none; |
| } |
| body.manpage div.sectionbody { |
| margin-left: 3em; |
| } |
| |
| @media print { |
| body.manpage div#toc { display: none; } |
| } |
| |
| |
| @media screen { |
| body { |
| max-width: 50em; /* approximately 80 characters wide */ |
| margin-left: 16em; |
| } |
| |
| #toc { |
| position: fixed; |
| top: 0; |
| left: 0; |
| bottom: 0; |
| width: 13em; |
| padding: 0.5em; |
| padding-bottom: 1.5em; |
| margin: 0; |
| overflow: auto; |
| border-right: 3px solid #f8f8f8; |
| background-color: white; |
| } |
| |
| #toc .toclevel1 { |
| margin-top: 0.5em; |
| } |
| |
| #toc .toclevel2 { |
| margin-top: 0.25em; |
| display: list-item; |
| color: #aaaaaa; |
| } |
| |
| #toctitle { |
| margin-top: 0.5em; |
| } |
| } |
| </style> |
| <script type="text/javascript"> |
| /*<![CDATA[*/ |
| var asciidoc = {  // Namespace. |
| |
| ///////////////////////////////////////////////////////////////////// |
| // Table Of Contents generator |
| ///////////////////////////////////////////////////////////////////// |
| |
| /* Author: Mihai Bazon, September 2002 |
| * http://students.infoiasi.ro/~mishoo |
| * |
| * Table Of Content generator |
| * Version: 0.4 |
| * |
| * Feel free to use this script under the terms of the GNU General Public |
| * License, as long as you do not remove or alter this copyright notice. |
| */ |
| |
| /* modified by Troy D. Hanson, September 2006. License: GPL */ |
| /* modified by Stuart Rackham, 2006, 2009. License: GPL */ |
| |
| // toclevels = 1..4. |
| toc: function (toclevels) { |
| |
| function getText(el) { |
| var text = ""; |
| for (var i = el.firstChild; i != null; i = i.nextSibling) { |
| if (i.nodeType == 3 /* Node.TEXT_NODE */) |
| text += i.data; |
| else if (i.firstChild != null) |
| text += getText(i); |
| } |
| return text; |
| } |
| |
| function TocEntry(el, text, toclevel) { |
| this.element = el; |
| this.text = text; |
| this.toclevel = toclevel; |
| } |
| |
| function tocEntries(el, toclevels) { |
| var result = new Array; |
| var re = new RegExp('[hH]([1-'+(toclevels+1)+'])'); |
| // Function that scans the DOM tree for header elements (the DOM2 |
| // nodeIterator API would be a better technique but not supported by all |
| // browsers). |
| var iterate = function (el) { |
| for (var i = el.firstChild; i != null; i = i.nextSibling) { |
| if (i.nodeType == 1 /* Node.ELEMENT_NODE */) { |
| var mo = re.exec(i.tagName); |
| if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") { |
| result[result.length] = new TocEntry(i, getText(i), mo[1]-1); |
| } |
| iterate(i); |
| } |
| } |
| } |
| iterate(el); |
| return result; |
| } |
| |
| var toc = document.getElementById("toc"); |
| if (!toc) { |
| return; |
| } |
| |
| // Delete existing TOC entries in case we're reloading the TOC. |
| var tocEntriesToRemove = []; |
| var i; |
| for (i = 0; i < toc.childNodes.length; i++) { |
| var entry = toc.childNodes[i]; |
| if (entry.nodeName.toLowerCase() == 'div' |
| && entry.getAttribute("class") |
| && entry.getAttribute("class").match(/^toclevel/)) |
| tocEntriesToRemove.push(entry); |
| } |
| for (i = 0; i < tocEntriesToRemove.length; i++) { |
| toc.removeChild(tocEntriesToRemove[i]); |
| } |
| |
| // Rebuild TOC entries. |
| var entries = tocEntries(document.getElementById("content"), toclevels); |
| for (var i = 0; i < entries.length; ++i) { |
| var entry = entries[i]; |
| if (entry.element.id == "") |
| entry.element.id = "_toc_" + i; |
| var a = document.createElement("a"); |
| a.href = "#" + entry.element.id; |
| a.appendChild(document.createTextNode(entry.text)); |
| var div = document.createElement("div"); |
| div.appendChild(a); |
| div.className = "toclevel" + entry.toclevel; |
| toc.appendChild(div); |
| } |
| if (entries.length == 0) |
| toc.parentNode.removeChild(toc); |
| }, |
| |
| |
| ///////////////////////////////////////////////////////////////////// |
| // Footnotes generator |
| ///////////////////////////////////////////////////////////////////// |
| |
| /* Based on footnote generation code from: |
| * http://www.brandspankingnew.net/archive/2005/07/format_footnote.html |
| */ |
| |
| footnotes: function () { |
| // Delete existing footnote entries in case we're reloading the footnodes. |
| var i; |
| var noteholder = document.getElementById("footnotes"); |
| if (!noteholder) { |
| return; |
| } |
| var entriesToRemove = []; |
| for (i = 0; i < noteholder.childNodes.length; i++) { |
| var entry = noteholder.childNodes[i]; |
| if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote") |
| entriesToRemove.push(entry); |
| } |
| for (i = 0; i < entriesToRemove.length; i++) { |
| noteholder.removeChild(entriesToRemove[i]); |
| } |
| |
| // Rebuild footnote entries. |
| var cont = document.getElementById("content"); |
| var spans = cont.getElementsByTagName("span"); |
| var refs = {}; |
| var n = 0; |
| for (i=0; i<spans.length; i++) { |
| if (spans[i].className == "footnote") { |
| n++; |
| var note = spans[i].getAttribute("data-note"); |
| if (!note) { |
| // Use [\s\S] in place of . so multi-line matches work. |
| // Because JavaScript has no s (dotall) regex flag. |
| note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1]; |
| spans[i].innerHTML = |
| "[<a id='_footnoteref_" + n + "' href='#_footnote_" + n + |
| "' title='View footnote' class='footnote'>" + n + "</a>]"; |
| spans[i].setAttribute("data-note", note); |
| } |
| noteholder.innerHTML += |
| "<div class='footnote' id='_footnote_" + n + "'>" + |
| "<a href='#_footnoteref_" + n + "' title='Return to text'>" + |
| n + "</a>. " + note + "</div>"; |
| var id =spans[i].getAttribute("id"); |
| if (id != null) refs["#"+id] = n; |
| } |
| } |
| if (n == 0) |
| noteholder.parentNode.removeChild(noteholder); |
| else { |
| // Process footnoterefs. |
| for (i=0; i<spans.length; i++) { |
| if (spans[i].className == "footnoteref") { |
| var href = spans[i].getElementsByTagName("a")[0].getAttribute("href"); |
| href = href.match(/#.*/)[0]; // Because IE return full URL. |
| n = refs[href]; |
| spans[i].innerHTML = |
| "[<a href='#_footnote_" + n + |
| "' title='View footnote' class='footnote'>" + n + "</a>]"; |
| } |
| } |
| } |
| }, |
| |
| install: function(toclevels) { |
| var timerId; |
| |
| function reinstall() { |
| asciidoc.footnotes(); |
| if (toclevels) { |
| asciidoc.toc(toclevels); |
| } |
| } |
| |
| function reinstallAndRemoveTimer() { |
| clearInterval(timerId); |
| reinstall(); |
| } |
| |
| timerId = setInterval(reinstall, 500); |
| if (document.addEventListener) |
| document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false); |
| else |
| window.onload = reinstallAndRemoveTimer; |
| } |
| |
| } |
| asciidoc.install(3); |
| /*]]>*/ |
| </script> |
| <script type="text/x-mathjax-config"> |
| MathJax.Hub.Config({ |
| MathML: { extensions: ["content-mathml.js"] }, |
| tex2jax: { inlineMath: [['$','$'], ['\\(','\\)']] } |
| }); |
| </script> |
| <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"> |
| </script> |
| </head> |
| <body class="book"> |
| <div id="header"> |
| <h1>The OpenCL Specification</h1> |
| <span id="author">Khronos OpenCL Working Group</span><br> |
| <span id="revnumber">version v2.2-3</span> |
| <div id="toc"> |
| <div id="toctitle">Table of Contents</div> |
| <noscript><p><b>JavaScript must be enabled in your browser to display the table of contents.</b></p></noscript> |
| </div> |
| </div> |
| <div id="content"> |
| <div id="preamble"> |
| <div class="sectionbody"> |
| <div class="paragraph"><p>Copyright 2008-2017 The Khronos Group.</p></div> |
| <div class="paragraph"><p>This specification is protected by copyright laws and contains material proprietary |
| to the Khronos Group, Inc. Except as described by these terms, it or any components |
| may not be reproduced, republished, distributed, transmitted, displayed, broadcast |
| or otherwise exploited in any manner without the express prior written permission |
| of Khronos Group.</p></div> |
| <div class="paragraph"><p>Khronos Group grants a conditional copyright license to use and reproduce the |
| unmodified specification for any purpose, without fee or royalty, EXCEPT no licenses |
| to any patent, trademark or other intellectual property rights are granted under |
| these terms. Parties desiring to implement the specification and make use of |
| Khronos trademarks in relation to that implementation, and receive reciprocal patent |
| license protection under the Khronos IP Policy must become Adopters and confirm the |
| implementation as conformant under the process defined by Khronos for this |
| specification; see <a href="https://www.khronos.org/adopters">https://www.khronos.org/adopters</a>.</p></div> |
| <div class="paragraph"><p>Khronos Group makes no, and expressly disclaims any, representations or warranties, |
| express or implied, regarding this specification, including, without limitation: |
| merchantability, fitness for a particular purpose, non-infringement of any |
| intellectual property, correctness, accuracy, completeness, timeliness, and |
| reliability. Under no circumstances will the Khronos Group, or any of its Promoters, |
| Contributors or Members, or their respective partners, officers, directors, |
| employees, agents or representatives be liable for any damages, whether direct, |
| indirect, special or consequential damages for lost revenues, lost profits, or |
| otherwise, arising from or in connection with these materials.</p></div> |
| <div class="paragraph"><p>Vulkan is a registered trademark and Khronos, OpenXR, SPIR, SPIR-V, SYCL, WebGL, |
| WebCL, OpenVX, OpenVG, EGL, COLLADA, glTF, NNEF, OpenKODE, OpenKCAM, StreamInput, |
| OpenWF, OpenSL ES, OpenMAX, OpenMAX AL, OpenMAX IL, OpenMAX DL, OpenML and DevU are |
| trademarks of the Khronos Group Inc. ASTC is a trademark of ARM Holdings PLC, |
| OpenCL is a trademark of Apple Inc. and OpenGL and OpenML are registered trademarks |
| and the OpenGL ES and OpenGL SC logos are trademarks of Silicon Graphics |
| International used under license by Khronos. All other product names, trademarks, |
| and/or company names are used solely for identification and belong to their |
| respective owners.</p></div> |
| <div style="page-break-after:always"></div> |
| <div class="paragraph"><p><strong>Acknowledgements</strong></p></div> |
| <div class="paragraph"><p>The OpenCL specification is the result of the contributions of many |
| people, representing a cross section of the desktop, hand-held, and |
| embedded computer industry. Following is a partial list of the |
| contributors, including the company that they represented at the time of |
| their contribution:</p></div> |
| <div class="paragraph"><p>Chuck Rose, Adobe<br> |
| Eric Berdahl, Adobe<br> |
| Shivani Gupta, Adobe<br> |
| Bill Licea Kane, AMD<br> |
| Ed Buckingham, AMD<br> |
| Jan Civlin, AMD<br> |
| Laurent Morichetti, AMD<br> |
| Mark Fowler, AMD<br> |
| Marty Johnson, AMD<br> |
| Michael Mantor, AMD<br> |
| Norm Rubin, AMD<br> |
| Ofer Rosenberg, AMD<br> |
| Brian Sumner, AMD<br> |
| Victor Odintsov, AMD<br> |
| Aaftab Munshi, Apple<br> |
| Abe Stephens, Apple<br> |
| Alexandre Namaan, Apple<br> |
| Anna Tikhonova, Apple<br> |
| Chendi Zhang, Apple<br> |
| Eric Bainville, Apple<br> |
| David Hayward, Apple<br> |
| Giridhar Murthy, Apple<br> |
| Ian Ollmann, Apple<br> |
| Inam Rahman, Apple<br> |
| James Shearer, Apple<br> |
| MonPing Wang, Apple<br> |
| Tanya Lattner, Apple<br> |
| Mikael Bourges-Sevenier, Aptina<br> |
| Anton Lokhmotov, ARM<br> |
| Dave Shreiner, ARM<br> |
| Hedley Francis, ARM<br> |
| Robert Elliott, ARM<br> |
| Scott Moyers, ARM<br> |
| Tom Olson, ARM<br> |
| Anastasia Stulova, ARM<br> |
| Christopher Thompson-Walsh, Broadcom<br> |
| Holger Waechtler, Broadcom<br> |
| Norman Rink, Broadcom<br> |
| Andrew Richards, Codeplay<br> |
| Maria Rovatsou, Codeplay<br> |
| Alistair Donaldson, Codeplay<br> |
| Alastair Murray, Codeplay<br> |
| Stephen Frye, Electronic Arts<br> |
| Eric Schenk, Electronic Arts<br> |
| Daniel Laroche, Freescale<br> |
| David Neto, Google<br> |
| Robin Grosman, Huawei<br> |
| Craig Davies, Huawei<br> |
| Brian Horton, IBM<br> |
| Brian Watt, IBM<br> |
| Gordon Fossum, IBM<br> |
| Greg Bellows, IBM<br> |
| Joaquin Madruga, IBM<br> |
| Mark Nutter, IBM<br> |
| Mike Perks, IBM<br> |
| Sean Wagner, IBM<br> |
| Jon Parr, Imagination Technologies<br> |
| Robert Quill, Imagination Technologies<br> |
| James McCarthy, Imagination Technologies<br> |
| Aaron Kunze, Intel<br> |
| Aaron Lefohn, Intel<br> |
| Adam Lake, Intel<br> |
| Alexey Bader, Intel<br> |
| Allen Hux, Intel<br> |
| Andrew Brownsword, Intel<br> |
| Andrew Lauritzen, Intel<br> |
| Bartosz Sochacki, Intel<br> |
| Ben Ashbaugh, Intel<br> |
| Brian Lewis, Intel<br> |
| Geoff Berry, Intel<br> |
| Hong Jiang, Intel<br> |
| Jayanth Rao, Intel<br> |
| Josh Fryman, Intel<br> |
| Larry Seiler, Intel<br> |
| Mike MacPherson, Intel<br> |
| Murali Sundaresan, Intel<br> |
| Paul Lalonde, Intel<br> |
| Raun Krisch, Intel<br> |
| Stephen Junkins, Intel<br> |
| Tim Foley, Intel<br> |
| Timothy Mattson, Intel<br> |
| Yariv Aridor, Intel<br> |
| Michael Kinsner, Intel<br> |
| Kevin Stevens, Intel<br> |
| Jon Leech, Khronos<br> |
| Benjamin Bergen, Los Alamos National Laboratory<br> |
| Roy Ju, Mediatek<br> |
| Bor-Sung Liang, Mediatek<br> |
| Rahul Agarwal, Mediatek<br> |
| Michal Witaszek, Mobica<br> |
| JenqKuen Lee, NTHU<br> |
| Amit Rao, NVIDIA<br> |
| Ashish Srivastava, NVIDIA<br> |
| Bastiaan Aarts, NVIDIA<br> |
| Chris Cameron, NVIDIA<br> |
| Christopher Lamb, NVIDIA<br> |
| Dibyapran Sanyal, NVIDIA<br> |
| Guatam Chakrabarti, NVIDIA<br> |
| Ian Buck, NVIDIA<br> |
| Jaydeep Marathe, NVIDIA<br> |
| Jian-Zhong Wang, NVIDIA<br> |
| Karthik Raghavan Ravi, NVIDIA<br> |
| Kedar Patil, NVIDIA<br> |
| Manjunath Kudlur, NVIDIA<br> |
| Mark Harris, NVIDIA<br> |
| Michael Gold, NVIDIA<br> |
| Neil Trevett, NVIDIA<br> |
| Richard Johnson, NVIDIA<br> |
| Sean Lee, NVIDIA<br> |
| Tushar Kashalikar, NVIDIA<br> |
| Vinod Grover, NVIDIA<br> |
| Xiangyun Kong, NVIDIA<br> |
| Yogesh Kini, NVIDIA<br> |
| Yuan Lin, NVIDIA<br> |
| Mayuresh Pise, NVIDIA<br> |
| Allan Tzeng, QUALCOMM<br> |
| Alex Bourd, QUALCOMM<br> |
| Anirudh Acharya, QUALCOMM<br> |
| Andrew Gruber, QUALCOMM<br> |
| Andrzej Mamona, QUALCOMM<br> |
| Benedict Gaster, QUALCOMM<br> |
| Bill Torzewski, QUALCOMM<br> |
| Bob Rychlik, QUALCOMM<br> |
| Chihong Zhang, QUALCOMM<br> |
| Chris Mei, QUALCOMM<br> |
| Colin Sharp, QUALCOMM<br> |
| David Garcia, QUALCOMM<br> |
| David Ligon, QUALCOMM<br> |
| Jay Yun, QUALCOMM<br> |
| Lee Howes, QUALCOMM<br> |
| Richard Ruigrok, QUALCOMM<br> |
| Robert J. Simpson, QUALCOMM<br> |
| Sumesh Udayakumaran, QUALCOMM<br> |
| Vineet Goel, QUALCOMM<br> |
| Lihan Bin, QUALCOMM<br> |
| Vlad Shimanskiy, QUALCOMM<br> |
| Jian Liu, QUALCOMM<br> |
| Tasneem Brutch, Samsung<br> |
| Yoonseo Choi, Samsung<br> |
| Dennis Adams, Sony<br> |
| Pär-Anders Aronsson, Sony<br> |
| Jim Rasmusson, Sony<br> |
| Thierry Lepley, STMicroelectronics<br> |
| Anton Gorenko, StreamComputing<br> |
| Jakub Szuppe, StreamComputing<br> |
| Vincent Hindriksen, StreamComputing<br> |
| Alan Ward, Texas Instruments<br> |
| Yuan Zhao, Texas Instruments<br> |
| Pete Curry, Texas Instruments<br> |
| Simon McIntosh-Smith, University of Bristol<br> |
| James Price, University of Bristol<br> |
| Paul Preney, University of Windsor<br> |
| Shane Peelar, University of Windsor<br> |
| Brian Hutsell, Vivante<br> |
| Mike Cai, Vivante<br> |
| Sumeet Kumar, Vivante<br> |
| Wei-Lun Kao, Vivante<br> |
| Xing Wang, Vivante<br> |
| Jeff Fifield, Xilinx<br> |
| Hem C. Neema, Xilinx<br> |
| Henry Styles, Xilinx<br> |
| Ralph Wittig, Xilinx<br> |
| Ronan Keryell, Xilinx<br> |
| AJ Guillon, YetiWare Inc<br></p></div> |
| <div style="page-break-after:always"></div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_introduction">1. Introduction</h2> |
| <div class="sectionbody"> |
| <div class="paragraph"><p>Modern processor architectures have embraced parallelism as an important |
| pathway to increased performance. Facing technical challenges with |
| higher clock speeds in a fixed power envelope, Central Processing Units |
| (CPUs) now improve performance by adding multiple cores. Graphics |
| Processing Units (GPUs) have also evolved from fixed function rendering |
| devices into programmable parallel processors. As today's computer |
| systems often include highly parallel CPUs, GPUs and other types of |
| processors, it is important to enable software developers to take full |
| advantage of these heterogeneous processing platforms. |
| <br> |
| <br> |
| Creating applications for heterogeneous parallel processing platforms is |
| challenging as traditional programming approaches for multi-core CPUs |
| and GPUs are very different. CPU-based parallel programming models are |
| typically based on standards but usually assume a shared address space |
| and do not encompass vector operations. General purpose GPU |
| programming models address complex memory hierarchies and vector |
| operations but are traditionally platform-, vendor- or |
| hardware-specific. These limitations make it difficult for a developer |
| to access the compute power of heterogeneous CPUs, GPUs and other types |
| of processors from a single, multi-platform source code base. More than |
| ever, there is a need to enable software developers to effectively take |
| full advantage of heterogeneous processing platforms, from high |
| performance compute servers through desktop computer systems to |
| handheld devices, that include a diverse mix of parallel CPUs, GPUs and |
| other processors such as DSPs and the Cell/B.E. processor. |
| <br> |
| <br> |
| <strong>OpenCL</strong> (Open Computing Language) is an open royalty-free standard for |
| general purpose parallel programming across CPUs, GPUs and other |
| processors, giving software developers portable and efficient access to |
| the power of these heterogeneous processing platforms. |
| <br> |
| <br> |
| OpenCL supports a wide range of applications, ranging from embedded and |
| consumer software to HPC solutions, through a low-level, |
| high-performance, portable abstraction. By creating an efficient, |
| close-to-the-metal programming interface, OpenCL will form the |
| foundation layer of a parallel computing ecosystem of |
| platform-independent tools, middleware and applications. OpenCL is |
| particularly suited to play an increasingly significant role in emerging |
| interactive graphics applications that combine general parallel compute |
| algorithms with graphics rendering pipelines. |
| <br> |
| <br> |
| OpenCL consists of an API for coordinating parallel computation across |
| heterogeneous processors; and a cross-platform intermediate language |
| with a well-specified computation environment. The OpenCL standard:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| Supports both data- and |
| task-based parallel programming models |
| </p> |
| </li> |
| <li> |
| <p> |
| Utilizes a portable and |
| self-contained intermediate representation with support for parallel |
| execution |
| </p> |
| </li> |
| <li> |
| <p> |
| Defines consistent |
| numerical requirements based on IEEE 754 |
| </p> |
| </li> |
| <li> |
| <p> |
| Defines a configuration |
| profile for handheld and embedded devices |
| </p> |
| </li> |
| <li> |
| <p> |
| Efficiently interoperates |
| with OpenGL, OpenGL ES and other graphics APIs |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>This document begins with an overview of basic concepts and the |
| architecture of OpenCL, followed by a detailed description of its |
| execution model, memory model and synchronization support. It then |
| discusses the OpenCL platform and runtime API. Some examples are given |
| that describe sample compute use-cases and how they would be written in |
| OpenCL. The specification is divided into a core specification that any |
| OpenCL compliant implementation must support; a handheld/embedded |
| profile which relaxes the OpenCL compliance requirements for handheld |
| and embedded devices; and a set of optional extensions that are likely |
| to move into the core specification in later revisions of the OpenCL |
| specification.</p></div> |
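| <div class="paragraph"><p>As an informal illustration of the host API described in this document, the |
| following minimal sketch obtains a platform and a device, creates a context |
| and a command-queue, and then releases them. It assumes at least one platform |
| and device are available, omits all error checking, and is not part of the |
| normative specification.</p></div> |
| <div class="listingblock"> |
| <div class="content monospaced"> |
| <pre>#include &lt;CL/cl.h&gt; |
| |
| /* Minimal host-side flow: platform -&gt; device -&gt; context -&gt; command-queue. |
|  * A real application must check the cl_int return code of every call. */ |
| int main(void) |
| { |
|     cl_platform_id platform; |
|     cl_device_id   device; |
| |
|     clGetPlatformIDs(1, &amp;platform, NULL); |
|     clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &amp;device, NULL); |
| |
|     cl_context context = clCreateContext(NULL, 1, &amp;device, NULL, NULL, NULL); |
|     cl_command_queue queue = |
|         clCreateCommandQueueWithProperties(context, device, NULL, NULL); |
| |
|     /* Program, kernel and memory object setup would follow here. */ |
| |
|     clReleaseCommandQueue(queue); |
|     clReleaseContext(context); |
|     return 0; |
| }</pre> |
| </div></div> |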
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_glossary">2. Glossary</h2> |
| <div class="sectionbody"> |
| <div class="paragraph"><p><strong>Application</strong>: The combination of the program running on the host and |
| OpenCL devices. |
| <br> |
| <br> |
| <strong>Acquire semantics</strong>: One of the memory order semantics defined for |
| synchronization operations. Acquire semantics apply to atomic |
| operations that load from memory. Given two units of execution, <strong>A</strong> and |
| <strong>B</strong>, acting on a shared atomic object <strong>M</strong>, if <strong>A</strong> uses an atomic load of |
| <strong>M</strong> with acquire semantics to synchronize-with an atomic store to <strong>M</strong> by |
| <strong>B</strong> that used release semantics, then <strong>A</strong>'s atomic load will occur before |
| any subsequent operations by <strong>A</strong>. Note that the memory orders |
| <em>release</em>, <em>sequentially consistent</em>, and <em>acquire_release</em> all include |
| <em>release semantics</em> and effectively pair with a load using acquire |
| semantics. |
| <br> |
| <br> |
| <strong>Acquire release semantics</strong>: A memory order semantics for |
| synchronization operations (such as atomic operations) that has the |
| properties of both acquire and release memory orders. It is used with |
| read-modify-write operations. |
| <br> |
| <br> |
| <strong>Atomic operations</strong>: Operations that at any point, and from any |
| perspective, have either occurred completely, or not at all. Memory |
| orders associated with atomic operations may constrain the visibility of |
| loads and stores with respect to the atomic operations (see <em>relaxed |
| semantics</em>, <em>acquire semantics</em>, <em>release semantics</em> or <em>acquire release |
| semantics</em>). |
| <br> |
| <br> |
| <strong>Blocking and Non-Blocking Enqueue API calls</strong>: A <em>non-blocking enqueue |
| API call</em> places a <em>command</em> on a <em>command-queue</em> and returns |
| immediately to the host. The <em>blocking-mode enqueue API calls</em> do not |
| return to the host until the command has completed. |
| <br> |
| <br> |
| <strong>Barrier</strong>: There are three types of <em>barriers</em>: a command-queue barrier, |
| a work-group barrier and a sub-group barrier.</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| The OpenCL API provides a |
| function to enqueue a <em>command-queue</em> <em>barrier</em> command. This <em>barrier</em> |
| command ensures that all previously enqueued commands to a command-queue |
| have finished execution before any following <em>commands</em> enqueued in the |
| <em>command-queue</em> can begin execution. |
| </p> |
| </li> |
| <li> |
| <p> |
| The OpenCL kernel |
| execution model provides built-in <em>work-group barrier</em> functionality. |
| This <em>barrier</em> built-in function can be used by a <em>kernel</em> executing on |
| a <em>device</em> to perform synchronization between <em>work-items</em> in a |
| <em>work-group</em> executing the <em>kernel</em>. All the <em>work-items</em> of a |
| <em>work-group</em> must execute the <em>barrier</em> construct before any are allowed |
| to continue execution beyond the <em>barrier</em>. |
| </p> |
| </li> |
| <li> |
| <p> |
| The OpenCL kernel |
| execution model provides built-in <em>sub-group barrier</em> functionality. |
| This <em>barrier</em> built-in function can be used by a <em>kernel</em> executing on |
| a <em>device</em> to perform synchronization between <em>work-items</em> in a |
| <em>sub-group</em> executing the <em>kernel</em>. All the <em>work-items</em> of a |
| <em>sub-group</em> must execute the <em>barrier</em> construct before any are allowed |
| to continue execution beyond the <em>barrier</em>. |
| </p> |
| </li> |
| </ul></div> |
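| <div class="paragraph"><p>As an informal sketch of the work-group barrier described above, the |
| following hypothetical OpenCL C kernel reverses the elements handled by each |
| work-group through local memory; every work-item of the work-group must |
| execute the barrier before any work-item reads the staged data. It is not |
| part of the normative text.</p></div> |
| <div class="listingblock"> |
| <div class="content monospaced"> |
| <pre>// All work-items of a work-group must execute the barrier before any of |
| // them may continue past it. |
| kernel void reverse_in_workgroup(global const float *in, |
|                                  global float *out, |
|                                  local float *scratch) |
| { |
|     size_t gid = get_global_id(0); |
|     size_t lid = get_local_id(0); |
|     size_t lsz = get_local_size(0); |
| |
|     scratch[lid] = in[gid]; |
| |
|     // Work-group barrier: every store to local memory above is complete |
|     // and visible before any work-item reads from scratch below. |
|     barrier(CLK_LOCAL_MEM_FENCE); |
| |
|     out[gid] = scratch[lsz - 1 - lid]; |
| }</pre> |
| </div></div> |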
| <div class="paragraph"><p><strong>Buffer Object</strong>: A memory object that stores a linear collection of |
| bytes. Buffer objects are accessible using a pointer in a <em>kernel</em> |
| executing on a <em>device</em>. Buffer objects can be manipulated by the host |
| using OpenCL API calls. A <em>buffer object</em> encapsulates the following |
| information:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| Size in bytes. |
| </p> |
| </li> |
| <li> |
| <p> |
| Properties that describe |
| usage information and which region to allocate from. |
| </p> |
| </li> |
| <li> |
| <p> |
| Buffer data. |
| </p> |
| </li> |
| </ul></div> |
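| <div class="paragraph"><p>For illustration only, the following hedged host-side sketch creates a buffer |
| object carrying the information listed above (a size in bytes and usage |
| properties) and then copies host data into it; the context, command-queue and |
| host array are assumed to exist and error checking is omitted.</p></div> |
| <div class="listingblock"> |
| <div class="content monospaced"> |
| <pre>/* Create a buffer object of 'count' floats that kernels will only read. */ |
| size_t size = count * sizeof(float); |
| cl_int err; |
| cl_mem input = clCreateBuffer(context, CL_MEM_READ_ONLY, size, NULL, &amp;err); |
| |
| /* Copy the buffer data from host memory using a blocking write command. */ |
| clEnqueueWriteBuffer(queue, input, CL_TRUE, 0, size, host_data, |
|                      0, NULL, NULL);</pre> |
| </div></div> |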
| <div class="paragraph"><p><strong>Built-in Kernel</strong>: A <em>built-in kernel</em> is a <em>kernel</em> that is executed on |
| an OpenCL <em>device</em> or <em>custom device</em> by fixed-function hardware or in |
| firmware. <em>Applications</em> can query the <em>built-in kernels</em> supported by |
| a <em>device</em> or <em>custom device</em>. A <em>program object</em> can only contain |
| <em>kernels</em> written in OpenCL C or <em>built-in kernels</em> but not both. See |
| also <em>Kernel</em> and <em>Program</em>. |
| <br> |
| <br> |
| <strong>Child kernel</strong>: see <em>device-side enqueue.</em> |
| <br> |
| <br> |
| <strong>Command</strong>: The OpenCL operations that are submitted to a <em>command-queue</em> |
| for execution. For example, OpenCL commands issue kernels for execution |
| on a compute device, manipulate memory objects, etc. |
| <br> |
| <br> |
| <strong>Command-queue</strong>: An object that holds <em>commands</em> that will be executed on |
| a specific <em>device</em>. The <em>command-queue</em> is created on a specific |
| <em>device</em> in a <em>context</em>. <em>Commands</em> to a <em>command-queue</em> are queued |
| in-order but may be executed in-order or out-of-order. Refer to |
| <em>In-order Execution</em> and <em>Out-of-order Execution</em>. |
| <br> |
| <br> |
| <strong>Command-queue Barrier</strong>. See <em>Barrier</em>. |
| <br> |
| <br> |
| <strong>Command synchronization</strong>: Constraints on the order that commands are |
| launched for execution on a device defined in terms of the |
| synchronization points that occur between commands in host |
| command-queues and between commands in device-side command-queues. See |
| <em>synchronization points</em>. |
| <br> |
| <br> |
| <strong>Complete</strong>: The final state in the six state model for the execution of |
| a command. The transition into this state is signaled through |
| event objects or callback functions associated with a command. |
| <br> |
| <br> |
| <strong>Compute Device Memory</strong>: This refers to one or more memories attached |
| to the compute device. |
| <br> |
| <br> |
| <strong>Compute Unit</strong>: An OpenCL <em>device</em> has one or more <em>compute units</em>. A |
| <em>work-group</em> executes on a single <em>compute unit</em>. A <em>compute unit</em> is |
| composed of one or more <em>processing elements</em> and <em>local memory</em>. A |
| <em>compute unit</em> may also include dedicated texture filter units that can |
| be accessed by its processing elements. |
| <br> |
| <br> |
| <strong>Concurrency</strong>: A property of a system in which a set of tasks |
| can remain active and make progress at the same time. To utilize |
| concurrent execution when running a program, a programmer must identify |
| the concurrency in their problem, expose it within the source code, and |
| then exploit it using a notation that supports concurrency. |
| <br> |
| <br> |
| <strong>Constant Memory</strong>: A region of <em>global memory</em> that remains constant |
| during the execution of a <em>kernel</em>. The <em>host</em> allocates and |
| initializes memory objects placed into <em>constant memory</em>.</p></div> |
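| <div class="paragraph"><p>A hypothetical OpenCL C kernel using constant memory as just described might |
| look like the following sketch; the coefficient buffer is allocated and |
| initialized by the host and is read, but never written, while the kernel |
| executes.</p></div> |
| <div class="listingblock"> |
| <div class="content monospaced"> |
| <pre>// 'coeff' resides in the constant memory region; 'in' and 'out' are in |
| // global memory. The host allocates and initializes all three buffers. |
| kernel void scale_and_offset(constant float *coeff, |
|                              global const float *in, |
|                              global float *out) |
| { |
|     size_t i = get_global_id(0); |
|     out[i] = coeff[0] * in[i] + coeff[1]; |
| }</pre> |
| </div></div> |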
| <div class="paragraph"><p><strong>Context</strong>: The environment within which the kernels execute and the |
| domain in which synchronization and memory management is defined. The |
| <em>context</em> includes a set of <em>devices</em>, the memory accessible to those |
| <em>devices</em>, the corresponding memory properties and one or more |
| <em>command-queues</em> used to schedule execution of a <em>kernel(s)</em> or |
| operations on <em>memory objects</em>. |
| <br> |
| <br> |
| <strong>Control flow</strong>: The flow of instructions executed by a work-item. |
| Multiple logically related work items may or may not execute the same |
| control flow. The control flow is said to be <em>converged</em> if all the |
| work-items in the set execute the same stream of instructions. In a |
| <em>diverged</em> control flow, the work-items in the set execute different |
| instructions. At a later point, if a diverged control flow becomes |
| converged, it is said to be a re-converged control flow. |
| <br> |
| <br> |
| <strong>Converged control flow</strong>: see <strong>control flow</strong>. |
| <br> |
| <br> |
| <strong>Custom Device</strong>: An OpenCL <em>device</em> that fully implements the OpenCL |
| Runtime but does not support <em>programs</em> written in OpenCL C. A custom |
| device may be specialized non-programmable hardware that is very power |
| efficient and performant for directed tasks or hardware with limited |
| programmable capabilities such as specialized DSPs. Custom devices are |
| not OpenCL conformant. Custom devices may support an online compiler. |
| Programs for custom devices can be created using the OpenCL runtime APIs |
| that allow OpenCL programs to be created from source (if an online |
| compiler is supported) and/or binary, or from <em>built-in |
| kernels</em> supported by the <em>device</em>. See also <em>Device</em>. |
| <br> |
| <br> |
| <strong>Data Parallel Programming Model</strong>: Traditionally, this term refers to a |
| programming model where concurrency is expressed as instructions from a |
| single program applied to multiple elements within a set of data |
| structures. The term has been generalized in OpenCL to refer to a model |
| wherein a set of instructions from a single program are applied |
| concurrently to each point within an abstract domain of indices. |
| <br> |
| <br> |
| <strong>Data race</strong>: The execution of a program contains a data race if it |
| contains two actions in different work items or host threads where (1) |
| one action modifies a memory location and the other action reads or |
| modifies the same memory location, and (2) at least one of these actions |
| is not atomic, or the corresponding memory scopes are not inclusive, and |
| (3) the actions are global actions unordered by the |
| global-happens-before relation or are local actions unordered by the |
| local-happens-before relation. |
| <br> |
| <br> |
| <strong>Deprecation</strong>: Existing features are marked as deprecated if their usage is not recommended because that feature is being de-emphasized or superseded and may be removed from a future version of the specification. |
| <br> |
| <br> |
| <strong>Device</strong>: A <em>device</em> is a collection of <em>compute units</em>. A |
| <em>command-queue</em> is used to queue <em>commands</em> to a <em>device</em>. Examples of |
| <em>commands</em> include executing <em>kernels</em>, or reading and writing <em>memory |
| objects</em>. OpenCL devices typically correspond to a GPU, a multi-core |
| CPU, and other processors such as DSPs and the Cell/B.E. processor. |
| <br> |
| <br> |
| <strong>Device-side enqueue</strong>: A mechanism whereby a kernel-instance is enqueued |
| by a kernel-instance running on a device without direct involvement by |
| the host program. This produces <em>nested parallelism</em>; i.e. additional |
| levels of concurrency are nested inside a running kernel-instance. The |
| kernel-instance executing on a device (the <em>parent kernel</em>) enqueues a |
| kernel-instance (the <em>child kernel</em>) to a device-side command queue. |
| Child and parent kernels execute asynchronously though a parent kernel |
| does not complete until all of its child-kernels have completed. |
| <br> |
| <br> |
| <strong>Diverged control flow</strong>: see <em>control flow</em>. |
| <br> |
| <br> |
| <strong>Ended</strong>: The fifth state in the six state model for the execution of a |
| command. The transition into this state occurs when execution of a |
| command has ended. When a Kernel-enqueue command ends, all of the |
| work-groups associated with that command have finished their execution. |
| <br> |
| <br> |
| <strong>Event Object</strong>: An <em>event object</em> encapsulates the status of an |
| operation such as a <em>command</em>. It can be used to synchronize operations |
| in a context. |
| <br> |
| <br> |
| <strong>Event Wait List</strong>: An <em>event wait list</em> is a list of <em>event objects</em> that |
| can be used to control when a particular <em>command</em> begins execution. |
| <br> |
| <br> |
| <strong>Fence</strong>: A memory ordering operation without an associated atomic |
| object. A fence can use the <em>acquire semantics, release semantics</em>, or |
| <em>acquire release semantics</em>. |
| <br> |
| <br> |
| <strong>Framework</strong>: A software system that contains the set of components to |
| support software development and execution. A <em>framework</em> typically |
| includes libraries, APIs, runtime systems, compilers, etc. |
| <br> |
| <br> |
| <strong>Generic address space</strong>: An address space that includes the <em>private</em>, |
| <em>local</em>, and <em>global</em> address spaces available to a device. The generic |
| address space supports conversion of pointers to and from private, local |
| and global address spaces, and hence lets a programmer write a single |
| function that at compile time can take arguments from any of the three |
| named address spaces. |
| <br> |
| <br> |
| <strong>Global Happens before</strong>: see <em>happens before</em>. |
| <br> |
| <br> |
| <strong>Global ID</strong>: A <em>global ID</em> is used to uniquely identify a <em>work-item</em> and |
| is derived from the number of <em>global work-items</em> specified when |
| executing a <em>kernel</em>. The <em>global ID</em> is a N-dimensional value that |
| starts at (0, 0, 0). See also <em>Local ID</em>. |
| <br> |
| <br> |
| <strong>Global Memory</strong>: A memory region accessible to all <em>work-items</em> executing |
| in a <em>context</em>. It is accessible to the <em>host</em> using <em>commands</em> such as |
| read, write and map. <em>Global memory</em> is included within the <em>generic |
| address space</em> that includes the private and local address spaces. |
| <br> |
| <br> |
| <strong>GL share group</strong>: A <em>GL share group</em> object manages shared OpenGL or |
| OpenGL ES resources |
| such as textures, buffers, framebuffers, and renderbuffers and is |
| associated with one or more GL context objects. The <em>GL share group</em> is |
| typically an opaque object and not directly accessible. |
| <br> |
| <br> |
| <strong>Handle</strong>: An opaque type that references an <em>object</em> allocated by |
| OpenCL. Any operation on an <em>object</em> occurs by reference to that |
| object's handle. |
| <br> |
| <br> |
| <strong>Happens before</strong>: An ordering relationship between operations that |
| execute on multiple units of execution. If an operation A happens-before |
| operation B then A must occur before B; in particular, any value written |
| by A will be visible to B. We define two separate happens-before |
| relations: <em>global-happens-before</em> and <em>local-happens-before</em>. These are |
| defined in section 3.3.6. |
| <br> |
| <br> |
| <strong>Host</strong>: The <em>host</em> interacts with the <em>context</em> using the OpenCL API. |
| <br> |
| <br> |
| <strong>Host-thread</strong>: the unit of execution that executes the statements in the |
| Host program. |
| <br> |
| <br> |
| <strong>Host pointer</strong>: A pointer to memory that is in the virtual address space |
| on the <em>host</em>. |
| <br> |
| <br> |
| <strong>Illegal</strong>: Behavior of a system that is explicitly not allowed and will |
| be reported as an error when encountered by OpenCL. |
| <br> |
| <br> |
| <strong>Image Object</strong>: A <em>memory object</em> that stores a two- or three- |
| dimensional structured array. Image data can only be accessed with read |
| and write functions. The read functions use a <em>sampler</em>. |
| <br> |
| <br> |
| The <em>image object</em> encapsulates the following information:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| Dimensions of the image. |
| </p> |
| </li> |
| <li> |
| <p> |
| Description of each |
| element in the image. |
| </p> |
| </li> |
| <li> |
| <p> |
| Properties that describe |
| usage information and which region to allocate from. |
| </p> |
| </li> |
| <li> |
| <p> |
| Image data. |
| </p> |
| </li> |
| </ul></div> |
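| <div class="paragraph"><p>As an informal sketch of accessing an image object through read and write |
| functions, the hypothetical kernel below copies one 2D image to another; the |
| read uses a sampler supplied by the host as a kernel argument.</p></div> |
| <div class="listingblock"> |
| <div class="content monospaced"> |
| <pre>// Image data is accessed only through the built-in read/write functions; |
| // the read function takes a sampler describing addressing and filtering. |
| kernel void copy_image(read_only image2d_t src, |
|                        write_only image2d_t dst, |
|                        sampler_t smp) |
| { |
|     int2 pos = (int2)(get_global_id(0), get_global_id(1)); |
|     float4 px = read_imagef(src, smp, pos); |
|     write_imagef(dst, pos, px); |
| }</pre> |
| </div></div> |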
| <div class="paragraph"><p>The elements of an image are selected from a list of predefined image |
| formats. |
| <br> |
| <br> |
| <strong>Implementation Defined</strong>: Behavior that is explicitly allowed to vary |
| between conforming implementations of OpenCL. An OpenCL implementor is |
| required to document the implementation-defined behavior. |
| <br> |
| <br> |
| <strong>Independent Forward Progress</strong>: If an entity supports independent forward |
| progress, then if it is otherwise not dependent on any actions due to be |
| performed by any other entity (for example it does not wait on a lock |
| held by, and thus that must be released by, any other entity), then its |
| execution cannot be blocked by the execution of any other entity in the |
| system (it will not be starved). Work items in a subgroup, for example, |
| typically do not support independent forward progress, so one work item |
| in a subgroup may be completely blocked (starved) if a different work |
| item in the same subgroup enters a spin loop. |
| <br> |
| <br> |
| <strong>In-order Execution</strong>: A model of execution in OpenCL where the <em>commands</em> |
| in a <em>command-queue</em> are executed in order of submission with each |
| <em>command</em> running to completion before the next one begins. See |
| Out-of-order Execution. |
| <br> |
| <br> |
| <strong>Intermediate Language</strong>: A lower-level language that may be used to |
| create programs. SPIR-V is a required IL for OpenCL 2.2 runtimes. |
| Additional ILs may be accepted on an implementation-defined basis. |
| <br> |
| <br> |
| <strong>Kernel</strong>: A <em>kernel</em> is a function declared in a <em>program</em> and executed |
| on an OpenCL <em>device</em>. A <em>kernel</em> is identified by the __kernel or |
| kernel qualifier applied to any function defined in a <em>program</em>. |
| <br> |
| <br> |
| <strong>Kernel-instance</strong>: The work carried out by an OpenCL program occurs |
| through the execution of kernel-instances on devices. The kernel |
| instance is the <em>kernel object</em>, the values associated with the |
| arguments to the kernel, and the parameters that define the <em>NDRange</em> |
| index space. |
| <br> |
| <br> |
| <strong>Kernel Object</strong>: A <em>kernel object</em> encapsulates a specific <em>kernel</em> |
| function declared in a <em>program</em> and the argument values to be used when |
| executing this <em>kernel</em> function. |
| <br> |
| <br> |
| <strong>Kernel Language</strong>: A language that is used to create source code for kernels. |
| Supported kernel languages include OpenCL C, OpenCL C++, and the OpenCL dialect of SPIR-V. |
| <br> |
| <br> |
| <strong>Launch</strong>: The transition of a command from the <em>submitted</em> state to the |
| <em>ready</em> state. See <em>Ready</em>. |
| <br> |
| <br> |
| <strong>Local ID</strong>: A <em>local ID</em> specifies a unique <em>work-item ID</em> within a given |
| <em>work-group</em> that is executing a <em>kernel</em>. The <em>local ID</em> is a |
| N-dimensional value that starts at (0, 0, 0). See also <em>Global ID</em>. |
| <br> |
| <br> |
| <strong>Local Memory</strong>: A memory region associated with a <em>work-group</em> and |
| accessible only by <em>work-items</em> in that <em>work-group</em>. <em>Local memory</em> is |
| included within the <em>generic address space</em> that includes the private |
| and global address spaces. |
| <br> |
| <br> |
| <strong>Marker</strong>: A <em>command</em> queued in a <em>command-queue</em> that can be used to |
| tag all <em>commands</em> queued before the <em>marker</em> in the <em>command-queue</em>. |
| The <em>marker</em> command returns an <em>event</em> which can be used by the |
| <em>application</em> to queue a wait on the marker event i.e. wait for all |
| commands queued before the <em>marker</em> command to complete. |
| <br> |
| <br> |
| <strong>Memory Consistency Model</strong>: Rules that define which values are observed |
| when multiple units of execution load data from any shared memory plus |
| the synchronization operations that constrain the order of memory |
| operations and define synchronization relationships. The memory |
| consistency model in OpenCL is based on the memory model from the ISO |
| C11 programming language. |
| <br> |
| <br> |
| <strong>Memory Objects</strong>: A <em>memory object</em> is a handle to a reference counted |
| region of <em>global memory</em>. Also see <em>Buffer Object</em> and <em>Image Object</em>. |
| <br> |
| <br> |
| <strong>Memory Regions (or Pools)</strong>: A distinct address space in OpenCL. <em>Memory |
| regions</em> may overlap in physical memory though OpenCL will treat them as |
| logically distinct. The <em>memory regions</em> are denoted as <em>private</em>, |
| <em>local</em>, <em>constant,</em> and <em>global</em>. |
| <br> |
| <br> |
| <strong>Memory Scopes</strong>: These memory scopes define a hierarchy of visibilities |
| when analyzing the ordering constraints of memory operations. They are |
| defined by the values of the memory_scope enumeration constant. Current |
| values are <strong>memory_scope_work_item</strong> (memory-ordering constraints only apply to a |
| single work-item and in practice apply only to image operations), |
| <strong>memory_scope_sub_group</strong> (memory-ordering constraints only apply to |
| work-items executing in a sub-group), <strong>memory_scope_work_group</strong> |
| (memory-ordering constraints only apply to work-items executing in a |
| work-group), <strong>memory_scope_device</strong> (memory-ordering constraints only |
| apply to work-items executing on a single device) and |
| <strong>memory_scope_all_svm_devices</strong> (memory-ordering constraints only apply |
| to work-items executing across multiple devices and when using shared |
| virtual memory). |
| <br> |
| <br> |
| <strong>Modification Order</strong>: All modifications to a particular atomic object M |
| occur in some particular <strong>total order</strong>, called the <strong>modification |
| order</strong> of M. If A and B are modifications of an atomic object M, and A |
| happens-before B, then A shall precede B in the modification order of M. |
| Note that the modification order of an atomic object M is independent of |
| whether M is in local or global memory. |
| <br> |
| <br> |
| <strong>Nested Parallelism</strong>: See <em>device-side enqueue</em>. |
| <br> |
| <br> |
| <strong>Object</strong>: Objects are abstract representations of the resources that can |
| be manipulated by the OpenCL API. Examples include <em>program objects</em>, |
| <em>kernel objects</em>, and <em>memory objects</em>. |
| <br> |
| <br> |
| <strong>Out-of-Order Execution</strong>: A model of execution in which <em>commands</em> placed |
| in the <em>work queue</em> may begin and complete execution in any order |
| consistent with constraints imposed by <em>event wait |
| lists</em> and <em>command-queue barriers</em>. See <em>In-order Execution</em>. |
| <br> |
| <br> |
| <strong>Parent device</strong>: The OpenCL <em>device</em> which is partitioned to create |
| <em>sub-devices</em>. Not all <em>parent devices</em> are <em>root devices</em>. A <em>root |
| device</em> might be partitioned and the <em>sub-devices</em> partitioned again. |
| In this case, the first set of <em>sub-devices</em> would be <em>parent devices</em> |
| of the second set, but not the <em>root devices</em>. Also see <em>device</em>, |
| <em>parent device</em> and <em>root device</em>. |
| <br> |
| <br> |
| <strong>Parent kernel</strong>: see <em>device-side enqueue</em>. |
| <br> |
| <br> |
| <strong>Pipe</strong>: The <em>pipe</em> memory object conceptually is an ordered sequence of |
| data items. A pipe has two endpoints: a write endpoint into which data |
| items are inserted, and a read endpoint from which data items are |
| removed. At any one time, only one kernel instance may write into a |
| pipe, and only one kernel instance may read from a pipe. To support the |
| producer consumer design pattern, one kernel instance connects to the |
| write endpoint (the producer) while another kernel instance connects to |
| the reading endpoint (the consumer). |
| <br> |
| <br> |
| <strong>Platform</strong>: The <em>host</em> plus a collection of <em>devices</em> managed by the |
| OpenCL <em>framework</em> that allow an application to share <em>resources</em> and |
| execute <em>kernels</em> on <em>devices</em> in the <em>platform</em>. |
| <br> |
| <br> |
| <strong>Private Memory</strong>: A region of memory private to a <em>work-item</em>. Variables |
| defined in one <em>work-item's</em> <em>private memory</em> are not visible to another |
| <em>work-item</em>. |
| <br> |
| <br> |
| <strong>Processing Element</strong>: A virtual scalar processor. A work-item may |
| execute on one or more processing elements. |
| <br> |
| <br> |
| <strong>Program</strong>: An OpenCL <em>program</em> consists of a set of <em>kernels</em>. |
| <em>Programs</em> may also contain auxiliary functions called by the <em>kernel</em> |
| functions and constant data. |
| <br> |
| <br> |
| <strong>Program Object</strong>: A <em>program object</em> encapsulates the following |
| information:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| A reference to an |
| associated <em>context</em>. |
| </p> |
| </li> |
| <li> |
| <p> |
| A <em>program</em> source or |
| binary. |
| </p> |
| </li> |
| <li> |
| <p> |
| The latest successfully |
| built program executable, the list of <em>devices</em> for which the program |
| executable is built, the build options used and a build log. |
| </p> |
| </li> |
| <li> |
| <p> |
| The number of <em>kernel |
| objects</em> currently attached. |
| </p> |
| </li> |
| </ul></div> |
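| <div class="paragraph"><p>To illustrate, the following hedged host-side sketch creates a program object |
| from OpenCL C source within a context, builds it for one device, and then |
| creates a kernel object from the built executable; the context, device and |
| source string are assumed to exist, and error handling and the build-log |
| query are omitted.</p></div> |
| <div class="listingblock"> |
| <div class="content monospaced"> |
| <pre>cl_int err; |
| |
| /* The program object is associated with 'context' and holds the source. */ |
| cl_program program = |
|     clCreateProgramWithSource(context, 1, &amp;source, NULL, &amp;err); |
| |
| /* Build the program executable for 'device' with default build options. */ |
| clBuildProgram(program, 1, &amp;device, "", NULL, NULL); |
| |
| /* Create a kernel object attached to the program object (the kernel name |
|  * here refers to the hypothetical example shown earlier). */ |
| cl_kernel kernel = clCreateKernel(program, "scale_and_offset", &amp;err);</pre> |
| </div></div> |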
| <div class="paragraph"><p> </p></div> |
| <div class="paragraph"><p><strong>Queued</strong>: The first state in the six state model for the execution of a |
| command. The transition into this state occurs when the command is |
| enqueued into a command-queue. |
| <br> |
| <br> |
| <strong>Ready</strong>: The third state in the six state model for the execution of a |
| command. The transition into this state occurs when pre-requisites |
| constraining execution of a command have been met; i.e. the command has |
| been launched. When a Kernel-enqueue command is launched, work-groups |
| associated with the command are placed in a device's work-pool from |
| which they are scheduled for execution. |
| <br> |
| <br> |
| <strong>Re-converged Control Flow</strong>: see <em>control flow</em>. |
| <br> |
| <br> |
| <strong>Reference Count</strong>: The life span of an OpenCL object is determined by its |
| <em>reference count</em>, an internal count of the number of references to the |
| object. When you create an object in OpenCL, its <em>reference count</em> is |
| set to one. Subsequent calls to the appropriate <em>retain</em> API (such as |
| clRetainContext, clRetainCommandQueue) increment the <em>reference count</em>. |
| Calls to the appropriate <em>release</em> API (such as clReleaseContext, |
| clReleaseCommandQueue) decrement the <em>reference count</em>. |
| Implementations may also modify the <em>reference count</em>, e.g. to track |
| attached objects or to ensure correct operation of in-progress or |
| scheduled activities. The object becomes inaccessible to host code when |
| the number of <em>release</em> operations performed matches the number of |
| <em>retain</em> operations plus the allocation of the object. At this point the |
| reference count may be zero but this is not guaranteed. |
| <br> |
| <br> |
| <strong>Relaxed Consistency</strong>: A memory consistency model in which the contents |
| of memory visible to different <em>work-items</em> or <em>commands</em> may be |
| different except at a <em>barrier</em> or other explicit synchronization |
| points. |
| <br> |
| <br> |
| <strong>Relaxed Semantics</strong>: A memory order semantics for atomic operations that |
| implies no order constraints. The operation is <em>atomic</em> but it has no |
| impact on the order of memory operations. |
| <br> |
| <br> |
| <strong>Release Semantics</strong>: One of the memory order semantics defined for |
| synchronization operations. Release semantics apply to atomic |
| operations that store to memory. Given two units of execution, <strong>A</strong> and |
| <strong>B</strong>, acting on a shared atomic object <strong>M</strong>, if <strong>A</strong> uses an atomic store |
| of <strong>M</strong> with release semantics to synchronize-with an atomic load to <strong>M</strong> |
| by <strong>B</strong> that used acquire semantics, then <strong>A</strong>'s atomic store will occur |
| <em>after</em> any prior operations by <strong>A</strong>. Note that the memory orders |
| <em>acquire</em>, <em>sequentially consistent</em>, and <em>acquire_release</em> all include |
| <em>acquire semantics</em> and effectively pair with a store using release |
| semantics. |
| <br> |
| <br> |
| <strong>Remainder work-groups</strong>: When the work-groups associated with a |
| kernel-instance are defined, the sizes of a work-group in each dimension |
| may not evenly divide the size of the NDRange in the corresponding |
| dimensions. The result is a collection of work-groups on the boundaries |
| of the NDRange that are smaller than the base work-group size. These are |
| known as <em>remainder work-groups</em>. |
| <br> |
| <br> |
| <strong>Running</strong>: The fourth state in the six state model for the execution of |
| a command. The transition into this state occurs when the execution of |
| the command starts. When a Kernel-enqueue command starts, one or more |
| work-groups associated with the command start to execute. |
| <br> |
| <br> |
| <strong>Root device</strong>: A <em>root device</em> is an OpenCL <em>device</em> that has not been |
| partitioned. Also see <em>device</em>, <em>parent device</em> and <em>root device</em>. |
| <br> |
| <br> |
| <strong>Resource</strong>: A class of <em>objects</em> defined by OpenCL. An instance of a |
| <em>resource</em> is an <em>object</em>. The most common <em>resources</em> are the |
| <em>context</em>, <em>command-queue</em>, <em>program objects</em>, <em>kernel objects</em>, and |
| <em>memory objects</em>. Computational resources are hardware elements that |
| participate in the action of advancing a program counter. Examples |
| include the <em>host</em>, <em>devices</em>, <em>compute units</em> and <em>processing |
| elements</em>. |
| <br> |
| <br> |
| <strong>Retain</strong>, Release: The action of incrementing (retain) and decrementing |
| (release) the reference count of an OpenCL <em>object</em>. This is a |
| bookkeeping functionality to make sure the system doesn't remove an <em>object</em> |
| before all instances that use this <em>object</em> have finished. Refer to |
| <em>Reference Count</em>. |
| <br> |
| <br> |
| <strong>Sampler</strong>: An <em>object</em> that describes how to sample an image when the |
| image is read in the <em>kernel</em>. The image read functions take a |
| <em>sampler</em> as an argument. The <em>sampler</em> specifies the image |
| addressing-mode, i.e. how out-of-range image coordinates are handled, the |
| filter mode, and whether the input image coordinate is a normalized or |
| unnormalized value. |
| <br> |
| <br> |
| <strong>Scope inclusion</strong>: Two actions <strong>A</strong> and <strong>B</strong> are defined to have an |
| inclusive scope if they have the same scope <strong>P</strong> such that: (1) if <strong>P</strong> is |
| memory_scope_sub_group, and <strong>A</strong> and <strong>B</strong> are executed by work-items |
| within the same sub-group, or (2) if <strong>P</strong> is memory_scope_work_group, and |
| <strong>A</strong> and <strong>B</strong> are executed by work-items within the same work-group, or |
| (3) if <strong>P</strong> is memory_scope_device, and <strong>A</strong> and <strong>B</strong> are executed by |
| work-items on the same device, or (4) if <strong>P</strong> is |
| memory_scope_all_svm_devices, if <strong>A</strong> and <strong>B</strong> are executed by host |
| threads or by work-items on one or more devices that can share SVM |
| memory with each other and the host process. |
| <br> |
| <br> |
| <strong>Sequenced before</strong>: A relation between evaluations executed by a single |
| unit of execution. Sequenced-before is an asymmetric, transitive, |
| pair-wise relation that induces a partial order between evaluations. |
| Given any two evaluations A and B, if A is sequenced-before B, then the |
| execution of A shall precede the execution of B. |
| <br> |
| <br> |
| <strong>Sequential consistency</strong>: Sequential consistency interleaves the steps |
| executed by each unit of execution. Each access to a memory location |
| sees the last assignment to that location in that interleaving. |
| <br> |
| <br> |
| <strong>Sequentially consistent semantics</strong>: One of the memory order semantics |
| defined for synchronization operations. When using |
| sequentially-consistent synchronization operations, the loads and stores |
| within one unit of execution appear to execute in program order (i.e., |
| the sequenced-before order), and loads and stores from different units |
| of execution appear to be simply interleaved. |
| <br> |
| <br> |
| <strong>Shared Virtual Memory (SVM)</strong>: An address space exposed to both the host |
| and the devices within a context. SVM causes addresses to be meaningful |
| between the host and all of the devices within a context and therefore |
| supports the use of pointer based data structures in OpenCL kernels. It |
| logically extends a portion of the global memory into the host address |
| space, therefore giving work-items access to the host address space. |
| There are three types of SVM in OpenCL: <strong>Coarse-Grained buffer SVM</strong>: |
| Sharing occurs at the granularity of regions of OpenCL buffer memory |
| objects. <strong>Fine-Grained buffer SVM</strong>: Sharing occurs at the granularity |
| of individual loads/stores into bytes within OpenCL buffer memory |
| objects. <strong>Fine-Grained system SVM</strong>: Sharing occurs at the granularity of |
| individual loads/stores into bytes occurring anywhere within the host |
| memory. |
| <br> |
| <br> |
| <strong>SIMD</strong>: Single Instruction Multiple Data. A programming model where a |
| <em>kernel</em> is executed concurrently on multiple <em>processing elements</em> each |
| with its own data and a shared program counter. All <em>processing |
| elements</em> execute a strictly identical set of instructions. |
| <br> |
| <br> |
| <strong>Specialization constants</strong>: Specialization is intended for constant |
| objects that will not have known constant values until after initial |
| generation of a SPIR-V module. Such objects are called specialization |
| constants. An application may provide values for |
| the specialization constants that will be used when the SPIR-V program is |
| built. Specialization constants that do not receive a value from an |
| application shall use the default value as defined in the SPIR-V specification. |
| <br> |
| <br> |
| <strong>SPMD</strong>: Single Program Multiple Data. A programming model where a |
| <em>kernel</em> is executed concurrently on multiple <em>processing elements</em> each |
| with its own data and its own program counter. Hence, while all |
| computational resources run the same <em>kernel</em> they maintain their own |
| instruction counter and due to branches in a <em>kernel</em>, the actual |
| sequence of instructions can be quite different across the set of |
| <em>processing elements</em>. |
| <br> |
| <br> |
| <strong>Sub-device</strong>: An OpenCL <em>device</em> can be partitioned into multiple |
| <em>sub-devices</em>. The new <em>sub-devices</em> alias specific collections of |
| compute units within the parent <em>device</em>, according to a partition |
| scheme. The <em>sub-devices</em> may be used in any situation that their |
| parent <em>device</em> may be used. Partitioning a <em>device</em> does not destroy |
| the parent <em>device</em>, which may continue to be used along side and |
| intermingled with its child <em>sub-devices</em>. Also see <em>device</em>, <em>parent |
| device</em> and <em>root device</em>. |
| <br> |
| <br> |
| <strong>Sub-group</strong>: Sub-groups are an implementation-dependent grouping of |
| work-items within a work-group. The size and number of sub-groups is |
| implementation-defined. |
| <br> |
| <br> |
| <strong>Sub-group Barrier</strong>. See <em>Barrier</em>. |
| <br> |
| <br> |
| <strong>Submitted</strong>: The second state in the six state model for the execution |
| of a command. The transition into this state occurs when the command is |
| flushed from the command-queue and submitted for execution on the |
| device. Once submitted, a programmer can assume a command will execute |
| once its prerequisites have been met. |
| <br> |
| <br> |
| <strong>SVM Buffer</strong>: A memory allocation enabled to work with Shared Virtual |
| Memory (SVM). Depending on how the SVM buffer is created, it can be a |
| coarse-grained or fine-grained SVM buffer. Optionally it may be wrapped |
| by a Buffer Object. See <em>Shared Virtual Memory (SVM)</em>. |
| <br> |
| <br> |
| <strong>Synchronization</strong>: Synchronization refers to mechanisms that constrain |
| the order of execution and the visibility of memory operations between |
| two or more units of execution. |
| <br> |
| <br> |
| <strong>Synchronization operations</strong>: Operations that define memory order |
| constraints in a program. They play a special role in controlling how |
| memory operations in one unit of execution (such as work-items or, when |
| using SVM a host thread) are made visible to another. Synchronization |
| operations in OpenCL include <em>atomic operations</em> and <em>fences</em>. |
| <br> |
| <br> |
| <strong>Synchronization point</strong>: A synchronization point between a pair of |
| commands (A and B) assures that results of command A happens-before |
| command B is launched (i.e. enters the ready state). |
| <br> |
| <br> |
| <strong>Synchronizes with</strong>: A relation between operations in two different |
| units of execution that defines a memory order constraint in global |
| memory (<em>global-synchronizes-with</em>) or local memory |
| (<em>local-synchronizes-with</em>). |
| <br> |
| <br> |
| <strong>Task Parallel Programming Model</strong>: A programming model in which |
| computations are expressed in terms of multiple concurrent tasks |
| executing in one or more <em>command-queues</em>. The concurrent tasks can be |
| running different <em>kernels</em>. |
| <br> |
| <br> |
| <strong>Thread-safe</strong>: An OpenCL API call is considered to be <em>thread-safe</em> if |
| the internal state as managed by OpenCL remains consistent when called |
| simultaneously by multiple <em>host</em> threads. OpenCL API calls that are |
| <em>thread-safe</em> allow an application to call these functions in multiple |
| <em>host</em> threads without having to implement mutual exclusion across these |
| <em>host</em> threads, i.e. they are also re-entrant-safe. |
| <br> |
| <br> |
| <strong>Undefined</strong>: The behavior of an OpenCL API call, built-in function used |
| inside a <em>kernel</em> or execution of a <em>kernel</em> that is explicitly not |
| defined by OpenCL. A conforming implementation is not required to |
| specify what occurs when an undefined construct is encountered in |
| OpenCL. |
| <br> |
| <br> |
| <strong>Unit of execution</strong>: a generic term for a process, OS managed thread |
| running on the host (a host-thread), kernel-instance, host program, |
| work-item or any other executable agent that advances the work |
| associated with a program. |
| <br> |
| <br> |
| <strong>Work-group</strong>: A collection of related <em>work-items</em> that execute on a |
| single <em>compute unit</em>. The <em>work-items</em> in the group execute the same |
| <em>kernel-instance</em> and share <em>local</em> <em>memory</em> and <em>work-group functions</em>. |
| <br> |
| <br> |
| <strong>Work-group Barrier</strong>. See <em>Barrier</em>. |
| <br> |
| <br> |
| <strong>Work-group Function</strong>: A function that carries out collective operations |
| across all the work-items in a work-group. Available collective |
| operations are a barrier, reduction, broadcast, prefix sum, and |
| evaluation of a predicate. A work-group function must occur within a |
| <em>converged control flow</em>; i.e. all work-items in the work-group must |
| encounter precisely the same work-group function. |
| <br> |
| <br> |
| <strong>Work-group Synchronization</strong>: Constraints on the order of execution for |
| work-items in a single work-group. |
| <br> |
| <br> |
| <strong>Work-pool</strong>: A logical pool associated with a device that holds commands |
| and work-groups from kernel-instances that are ready to execute. OpenCL |
| does not constrain the order that commands and work-groups are scheduled |
| for execution from the work-pool; i.e. a programmer must assume that |
| they could be interleaved. There is one work-pool per device used by |
| all command-queues associated with that device. The work-pool may be |
| implemented in any manner as long as it assures that work-groups placed |
| in the pool will eventually execute. |
| <br> |
| <br> |
| <strong>Work-item</strong>: One of a collection of parallel executions of a <em>kernel</em> |
| invoked on a <em>device</em> by a <em>command</em>. A <em>work-item</em> is executed by one |
| or more <em>processing elements</em> as part of a <em>work-group</em> executing on a |
| <em>compute unit</em>. A <em>work-item</em> is distinguished from other work-items by |
| its <em>global ID</em> or the combination of its <em>work-group</em> ID and its <em>local |
| ID</em> within a <em>work-group</em>.</p></div> |
| <div class="paragraph"><p> </p></div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_the_opencl_architecture">3. The OpenCL Architecture</h2> |
| <div class="sectionbody"> |
| <div class="paragraph"><p><strong>OpenCL</strong> is an open industry standard for programming a heterogeneous |
| collection of CPUs, GPUs and other discrete computing devices organized |
| into a single platform. It is more than a language. OpenCL is a |
| framework for parallel programming and includes a language, API, |
| libraries and a runtime system to support software development. Using |
| OpenCL, for example, a programmer can write general purpose programs |
| that execute on GPUs without the need to map their algorithms onto a 3D |
| graphics API such as OpenGL or DirectX. |
| <br> |
| <br> |
| The target of OpenCL is expert programmers wanting to write portable yet |
| efficient code. This includes library writers, middleware vendors, and |
| performance oriented application programmers. Therefore OpenCL provides |
| a low-level hardware abstraction plus a framework to support programming |
| and many details of the underlying hardware are exposed. |
| <br> |
| <br> |
| To describe the core ideas behind OpenCL, we will use a hierarchy of |
| models:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| Platform Model |
| </p> |
| </li> |
| <li> |
| <p> |
| Memory Model |
| </p> |
| </li> |
| <li> |
| <p> |
| Execution Model |
| </p> |
| </li> |
| <li> |
| <p> |
| Programming Model |
| </p> |
| </li> |
| </ul></div> |
| <div class="sect2"> |
| <h3 id="_platform_model">3.1. Platform Model</h3> |
| <div class="paragraph"><p>The Platform model for OpenCL is defined in <em>figure 3.1</em>. The model |
| consists of a <strong>host</strong> connected to one or more <strong>OpenCL devices</strong>. An OpenCL |
| device is divided into one or more <strong>compute units</strong> (CUs) which are further |
| divided into one or more <strong>processing elements</strong> (PEs). Computations on a |
| device occur within the processing elements. |
| <br> |
| <br> |
| An OpenCL application is implemented as both host code and device kernel |
| code. The host code portion of an OpenCL application runs on a host |
| processor according to the models native to the host platform. The |
| OpenCL application host code submits the kernel code as commands from |
| the host to OpenCL devices. An OpenCL device executes the commands |
| computation on the processing elements within the device. |
| <br> |
| <br> |
| An OpenCL device has considerable latitude on how computations are |
| mapped onto the device's processing elements. When processing elements |
| within a compute unit execute the same sequence of statements across the |
| processing elements, the control flow is said to be <em>converged.</em> |
| Hardware optimized for executing a single stream of instructions over |
| multiple processing elements is well suited to converged control |
| flows. When the control flow varies from one processing element to |
| another, it is said to be <em>diverged.</em> While a kernel always begins |
| execution with a converged control flow, due to branching statements |
| within a kernel, converged and diverged control flows may occur within a |
| single kernel. This provides a great deal of flexibility in the |
| algorithms that can be implemented with OpenCL. |
| <br> |
| <br></p></div> |
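| <div class="paragraph"><p>As an informative illustration (not part of the normative text), the host |
| code below enumerates this hierarchy: a platform, a device on that platform, |
| and the number of compute units the device exposes. Error handling is |
| omitted.</p></div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre><code>#include &lt;CL/cl.h&gt; |
|  |
| cl_platform_id platform; |
| clGetPlatformIDs(1, &amp;platform, NULL);                 /* an OpenCL platform */ |
|  |
| cl_device_id device; |
| clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, |
|                1, &amp;device, NULL);                      /* an OpenCL device */ |
|  |
| cl_uint compute_units;                                  /* compute units (CUs) */ |
| clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, |
|                 sizeof(compute_units), &amp;compute_units, NULL);</code></pre> |
| </div></div> |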
| <div class="paragraph"><p><span class="image"> |
| <img src="opencl22-API_files/image004_new.png" alt="opencl22-API_files/image004_new.png" width="320" height="180"> |
| </span></p></div> |
| <div class="paragraph"><p><strong>Figure 3.1</strong>: <em>Platform model … one host plus one or more compute devices each |
| with one or more compute units composed of one or more processing elements</em>. |
| <br> |
| <br> |
| Programmers provide programs in the form of SPIR-V source binaries, |
| OpenCL C or OpenCL C++ source strings or implementation-defined binary objects. The |
| OpenCL platform provides a compiler to translate program input of either |
| form into executable program objects. The device code compiler may be |
| <em>online</em> or <em>offline</em>. An <em>online</em> <em>compiler</em> is available during host |
| program execution using standard APIs. An <em>offline compiler</em> is |
| invoked outside of host program control, using platform-specific |
| methods. The OpenCL runtime allows developers to obtain a previously |
| compiled device program executable and to load and execute it. |
| <br> |
| <br> |
| OpenCL defines two kinds of platform profiles: a <em>full profile</em> and a |
| reduced-functionality <em>embedded profile</em>. A full profile platform must |
| provide an online compiler for all its devices. An embedded platform |
| may provide an online compiler, but is not required to do so. |
| <br> |
| <br> |
| A device may expose special purpose functionality as a <em>built-in |
| function</em>. The platform provides APIs for enumerating and invoking the |
| built-in functions offered by a device, but otherwise does not define |
| their construction or semantics. A <em>custom device</em> supports only |
| built-in functions, and cannot be programmed via a kernel language. |
| <br> |
| <br> |
| All device types support the OpenCL execution model, the OpenCL memory |
| model, and the APIs used in OpenCL to manage devices. |
| <br> |
| <br> |
| The platform model is an abstraction describing how OpenCL views the |
| hardware. The relationship between the elements of the platform model |
| and the hardware in a system may be a fixed property of a device or it |
| may be a dynamic feature of a program dependent on how a compiler |
| optimizes code to best utilize physical hardware.</p></div> |
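| <div class="paragraph"><p>The informative fragment below sketches how a host program might test for an |
| online compiler and build a program object from OpenCL C source; <span class="monospaced">ctx</span> and |
| <span class="monospaced">device</span> are assumed to be a valid context and device, and error handling is |
| omitted.</p></div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre><code>cl_bool has_compiler = CL_FALSE; |
| clGetDeviceInfo(device, CL_DEVICE_COMPILER_AVAILABLE, |
|                 sizeof(has_compiler), &amp;has_compiler, NULL); |
|  |
| if (has_compiler) { |
|     const char *src = "kernel void noop(void) { }";    /* trivial OpenCL C source */ |
|     cl_int err; |
|     cl_program program = clCreateProgramWithSource(ctx, 1, &amp;src, NULL, &amp;err); |
|     clBuildProgram(program, 1, &amp;device, "", NULL, NULL);  /* online compilation */ |
| }</code></pre> |
| </div></div> |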
| </div> |
| <div class="sect2"> |
| <h3 id="_execution_model">3.2. Execution Model</h3> |
| <div class="paragraph"><p>The OpenCL execution model is defined in terms of two distinct units of |
| execution: <strong>kernels</strong> that execute on one or more OpenCL devices and a |
| <strong>host program</strong> that executes on the host. With regard to OpenCL, the |
| kernels are where the "work" associated with a computation occurs. This |
| work occurs through <strong>work-items</strong> that execute in groups (<strong>work-groups</strong>). |
| <br> |
| <br> |
| A kernel executes within a well-defined context managed by the host. |
| The context defines the environment within which kernels execute. It |
| includes the following resources:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>Devices</strong>: One or |
| more devices exposed by the OpenCL platform. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Kernel Objects</strong>: The |
| OpenCL functions with their associated argument values that run on |
| OpenCL devices. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Program Objects</strong>: The |
| program source and executable that implement the kernels. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Memory |
| Objects</strong>: Variables visible to the host and the OpenCL devices. |
| Instances of kernels operate on these objects as they execute. |
| </p> |
| </li> |
| </ul></div> |
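| <div class="paragraph"><p>An informative sketch of creating such a context is shown below; the |
| <span class="monospaced">platform</span> and <span class="monospaced">device</span> handles are assumed to have been obtained as in the |
| platform-model example, and error handling is omitted.</p></div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre><code>cl_context_properties props[] = { |
|     CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0 |
| }; |
| cl_int err; |
| cl_context ctx = clCreateContext(props, 1, &amp;device, NULL, NULL, &amp;err); |
|  |
| /* Program objects, kernel objects and memory objects (the resources |
|  * listed above) are subsequently created against this context. */</code></pre> |
| </div></div> |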
| <div class="paragraph"><p>The host program uses the OpenCL API to create and manage the context. |
| Functions from the OpenCL API enable the host to interact with a device |
| through a <em>command-queue</em>. Each command-queue is associated with a |
| single device. The commands placed into the command-queue fall into |
| one of three types:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>Kernel-enqueue commands</strong>: |
| Enqueue a kernel for execution on a device. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Memory commands</strong>: |
| Transfer data between the host and device memory, between memory |
| objects, or map and unmap memory objects from the host address space. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Synchronization |
| commands</strong>: Explicit synchronization points that define order constraints |
| between commands. |
| </p> |
| </li> |
| </ul></div> |
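| <div class="paragraph"><p>The informative fragment below creates a host command-queue and enqueues one |
| command of each of the three types; the <span class="monospaced">ctx</span>, <span class="monospaced">device</span>, <span class="monospaced">kernel</span> and |
| <span class="monospaced">buf</span> handles and the <span class="monospaced">host_data</span> and <span class="monospaced">bytes</span> values are illustrative |
| only.</p></div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre><code>cl_int err; |
| cl_command_queue queue = |
|     clCreateCommandQueueWithProperties(ctx, device, NULL, &amp;err); |
|  |
| /* Memory command: transfer data from the host into a buffer object. */ |
| clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, host_data, |
|                      0, NULL, NULL); |
|  |
| /* Kernel-enqueue command: execute a kernel over a 1-dimensional NDRange. */ |
| size_t gsize = 1024; |
| clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &amp;gsize, NULL, |
|                        0, NULL, NULL); |
|  |
| /* Synchronization command: later commands wait for all earlier commands. */ |
| clEnqueueBarrierWithWaitList(queue, 0, NULL, NULL);</code></pre> |
| </div></div> |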
| <div class="paragraph"><p>In addition to commands submitted from the host command-queue, a kernel |
| running on a device can enqueue commands to a device-side command queue. |
| This results in <em>child kernels</em> enqueued by a kernel executing on a |
| device (the <em>parent kernel</em>). Regardless of whether the command-queue |
| resides on the host or a device, each command passes through six states.</p></div> |
| <div class="olist arabic"><ol class="arabic"> |
| <li> |
| <p> |
| <strong>Queued</strong>: The command is enqueued to a command-queue. A |
| command may reside in the queue until it is flushed either explicitly (a |
| call to clFlush) or implicitly by some other command. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Submitted</strong>: The command is flushed from the command-queue and |
| submitted for execution on the device. Once flushed from the |
| command-queue, a command will execute after any prerequisites for |
| execution are met. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Ready</strong>: All prerequisites constraining execution of a command |
| have been met. The command, or for a kernel-enqueue command the |
| collection of work groups associated with a command, is placed in a |
| device work-pool from which it is scheduled for execution. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Running</strong>: Execution of the command starts. For the case of a |
| kernel-enqueue command, one or more work-groups associated with the |
| command start to execute. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Ended</strong>: Execution of a command ends. When a Kernel-enqueue |
| command ends, all of the work-groups associated with that command have |
| finished their execution. <em>Immediate side effects</em>, i.e. those |
| associated with the kernel but not necessarily with its child kernels, |
| are visible to other units of execution. These side effects include |
| updates to values in global memory. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Complete</strong>: The command and its child commands have finished |
| execution and the status of the event object, if any, associated with |
| the command is set to CL_COMPLETE. |
| </p> |
| </li> |
| </ol></div> |
| <div class="paragraph"><p>The execution states and the transitions between them are summarized in |
| Figure 3-2. These states and the concept of a device work-pool are |
| conceptual elements of the execution model. An implementation of OpenCL |
| has considerable freedom in how these are exposed to a program. Five of |
| the transitions, however, are directly observable through a profiling |
| interface. These profiled states are shown in Figure 3-2.</p></div> |
| <div class="paragraph"><p><span class="image"> |
| <img src="opencl22-API_files/image006.jpg" alt="image"> |
| </span></p></div> |
| <div class="paragraph"><p><strong>Figure 3-2: The states and transitions between states defined in the |
| OpenCL execution model. A subset of these transitions is exposed |
| through the profiling interface (see section 5.14).</strong></p></div> |
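| <div class="paragraph"><p>As an informative illustration, the fragment below reads the profiled |
| transitions through the event associated with a kernel-enqueue command; the |
| queue is assumed to have been created with the CL_QUEUE_PROFILING_ENABLE |
| property and the other handles to be valid.</p></div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre><code>cl_event evt; |
| clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &amp;gsize, NULL, 0, NULL, &amp;evt); |
| clWaitForEvents(1, &amp;evt); |
|  |
| cl_ulong queued, submitted, started, ended; |
| clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, |
|                         sizeof(queued), &amp;queued, NULL); |
| clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_SUBMIT, |
|                         sizeof(submitted), &amp;submitted, NULL); |
| clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, |
|                         sizeof(started), &amp;started, NULL); |
| clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, |
|                         sizeof(ended), &amp;ended, NULL);</code></pre> |
| </div></div> |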
| <div class="paragraph"><p>Commands communicate their status through <em>Event objects</em>. Successful |
| completion is indicated by setting the event status associated with a |
| command to CL_COMPLETE. Unsuccessful completion results in abnormal |
| termination of the command which is indicated by setting the event |
| status to a negative value. In this case, the command-queue associated |
| with the abnormally terminated command and all other command-queues in |
| the same context may no longer be available and their behavior is |
| implementation defined. |
| <br> |
| <br> |
| A command submitted to a device will not launch until prerequisites that |
| constrain the order of commands have been resolved. These |
| prerequisites have three sources:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| They may arise from |
| commands submitted to a command-queue that constrain the order in which |
| commands are launched. For example, commands that follow a command queue |
| barrier will not launch until all commands prior to the barrier are |
| complete. |
| </p> |
| </li> |
| <li> |
| <p> |
| The second source of |
| prerequisites is dependencies between commands expressed through events. |
| A command may include an optional list of events. The command will wait |
| and not launch until all the events in the list are in the state |
| CL_COMPLETE. By this mechanism, event objects define order constraints |
| between commands and coordinate execution between the host and one or |
| more devices. |
| </p> |
| </li> |
| <li> |
| <p> |
| The third source of |
| prerequisites can be the presence of non-trivial C initializers or C++ |
| constructors for program scope global variables. In this case, the OpenCL |
| C/C++ compiler shall generate program initialization kernels that |
| perform C initialization or C++ construction. These kernels must be |
| executed by the OpenCL runtime on a device before any kernel from the same |
| program can be executed on the same device. The ND-range for any program |
| initialization kernel is (1,1,1). When multiple programs are linked |
| together, the order of execution of program initialization kernels |
| that belong to different programs is undefined. |
| <br> |
| <br> |
| Program clean up may result in the execution of one or more program |
| clean up kernels by the OpenCL runtime. This is due to the presence of |
| non-trivial C++ destructors for program scope variables. The ND-range |
| for executing any program clean up kernel is (1,1,1). The order of |
| execution of clean up kernels from different programs (that are linked |
| together) is undefined. |
| <br> |
| <br> |
| Note that C initializers, C++ constructors, or C++ destructors for |
| program scope variables cannot use pointers to coarse grain and fine |
| grain SVM allocations. |
| <br> |
| <br> |
| A command may be submitted to a device and yet have no visible side effects |
| outside of waiting on and satisfying event dependences. Examples include |
| markers, kernels executed over ranges of no work-items or copy |
| operations with zero sizes. Such commands may pass directly from the |
| <em>ready</em> state to the <em>ended</em> state. |
| <br> |
| <br> |
| Command execution can be blocking or non-blocking. Consider a sequence |
| of OpenCL commands. For blocking commands, the OpenCL API functions |
| that enqueue commands don’t return until the command has completed. |
| Alternatively, OpenCL functions that enqueue non-blocking commands |
| return immediately and require that a programmer defines dependencies |
| between enqueued commands to ensure that enqueued commands are not |
| launched before needed resources are available. In both cases, the |
| actual execution of the command may occur asynchronously with execution |
| of the host program. |
| <br> |
| <br> |
| Commands within a single command-queue execute relative to each other in |
| one of two modes: |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p> </p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>In-order Execution</strong>: |
| Commands and any side effects associated with commands appear to the |
| OpenCL application as if they execute in the same order they are |
| enqueued to a command-queue. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Out-of-order Execution</strong>: |
| Commands execute in any order constrained only by explicit |
| synchronization points (e.g. through command queue barriers) or explicit |
| dependencies on events. |
| <br> |
| <br> |
| Multiple command-queues can be present within a single context. |
| Multiple command-queues execute commands independently. Event objects |
| visible to the host program can be used to define synchronization points |
| between commands in multiple command queues. If such synchronization |
| points are established between commands in multiple command-queues, an |
| implementation must assure that the command-queues progress concurrently |
| and correctly account for the dependencies established by the |
| synchronization points. For a detailed explanation of synchronization |
| points, see section 3.2.4. |
| <br> |
| <br> |
| The core of the OpenCL execution model is defined by how the kernels |
| execute. When a kernel-enqueue command submits a kernel for execution, |
| an index space is defined. The kernel, the argument values associated |
| with the arguments to the kernel, and the parameters that define the |
| index space define a <em>kernel-instance</em>. When a kernel-instance executes |
| on a device, the kernel function executes for each point in the defined |
| index space. Each of these executing kernel functions is called a |
| <em>work-item</em>. The work-items associated with a given kernel-instance are |
| managed by the device in groups called <em>work-groups</em>. These work-groups |
| define a coarse grained decomposition of the Index space. Work-groups |
| are further divided into <em>sub-groups</em>, which provide an additional level |
| of control over execution. |
| <br> |
| <br> |
| Work-items have a global ID based on their coordinates within the Index |
| space. They can also be defined in terms of their work-group and the |
| local ID within a work-group. The details of this mapping are described |
| in the following section. |
| </p> |
| </li> |
| </ul></div> |
| <div class="sect3"> |
| <h4 id="_execution_model_mapping_work_items_onto_an_ndrange">3.2.1. Execution Model: Mapping work-items onto an NDRange</h4> |
| <div class="paragraph"><p>The index space supported by OpenCL is called an NDRange. An NDRange is |
| an N-dimensional index space, where N is one, two or three. The NDRange |
| is decomposed into work-groups forming blocks that cover the Index |
| space. An NDRange is defined by three integer arrays of length N:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| The extent of the index |
| space (or global size) in each dimension. |
| </p> |
| </li> |
| <li> |
| <p> |
| An offset index F |
| indicating the initial value of the indices in each dimension (zero by |
| default). |
| </p> |
| </li> |
| <li> |
| <p> |
| The size of a work-group |
| (local size) in each dimension. |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p> </p></div> |
| <div class="paragraph"><p>Each work-items global ID is an N-dimensional tuple. The global ID |
| components are values in the range from F, to F plus the number of |
| elements in that dimension minus one. |
| <br> |
| <br> |
| If a kernel is created from OpenCL C 2.0 or SPIR-V, the size of work-groups |
| in an NDRange (the local size) need not be the same for all work-groups. |
| In this case, any single dimension for which the global size is not |
| divisible by the local size will be partitioned into two regions. One |
| region will have work-groups that have the same number of work items as |
| was specified for that dimension by the programmer (the local size). The |
| other region will have work-groups with less than the number of work |
| items specified by the local size parameter in that dimension (the |
| <em>remainder work-groups</em>). Work-group sizes could be non-uniform in |
| multiple dimensions, potentially producing work-groups of up to 4 |
| different sizes in a 2D range and 8 different sizes in a 3D range. |
| <br> |
| <br> |
| Each work-item is assigned to a work-group and given a local ID to |
| represent its position within the work-group. A work-item’s local ID is |
| an N-dimensional tuple with components in the range from zero to the |
| size of the work-group in that dimension minus one. |
| <br> |
| <br> |
| Work-groups are assigned IDs similarly. The number of work-groups in |
| each dimension is not directly defined but is inferred from the local |
| and global NDRanges provided when a kernel-instance is enqueued. A |
| work-group’s ID is an N-dimensional tuple with components in the range 0 |
| to the ceiling of the global size in that dimension divided by the local |
| size in the same dimension. As a result, the combination of a |
| work-group ID and the local-ID within a work-group uniquely defines a |
| work-item. Each work-item is identifiable in two ways; in terms of a |
| global index, and in terms of a work-group index plus a local index |
| within a work group. |
| <br> |
| <br> |
| For example, consider the 2-dimensional index space in figure 3-3. We |
| input the index space for the work-items (G<sub>x</sub>, G<sub>y</sub>), the size of each |
| work-group (S<sub>x</sub>, S<sub>y</sub>) and the global ID offset (F<sub>x</sub>, F<sub>y</sub>). The |
| global indices define a G<sub>x</sub> by G<sub>y</sub> index space where the total number |
| of work-items is the product of G<sub>x</sub> and G<sub>y</sub>. The local indices define |
| an S<sub>x</sub> by S<sub>y</sub> index space where the number of work-items in a single |
| work-group is the product of S<sub>x</sub> and S<sub>y</sub>. Given the size of each |
| work-group and the total number of work-items we can compute the number |
| of work-groups. A 2-dimensional index space is used to uniquely identify |
| a work-group. Each work-item is identified by its global ID (<em>g</em><sub>x</sub>, |
| <em>g</em><sub>y</sub>) or by the combination of the work-group ID (<em>w</em><sub>x</sub>, <em>w</em><sub>y</sub>), the |
| size of each work-group (S<sub>x</sub>,S<sub>y</sub>) and the local ID (s<sub>x</sub>, s<sub>y</sub>) inside |
| the work-group such that |
| <br></p></div> |
| <div class="paragraph"><p>        (g<sub>x</sub> , g<sub>y</sub>) = (w<sub>x</sub> * S<sub>x</sub> + s<sub>x</sub> + F<sub>x</sub>, w<sub>y</sub> * S<sub>y</sub> + s<sub>y</sub> + F<sub>y</sub>) |
| <br> |
| <br> |
| The number of work-groups can be computed as: |
| <br></p></div> |
| <div class="paragraph"><p>        (W<sub>x</sub>, W<sub>y</sub>) = (ceil(G<sub>x</sub> / S<sub>x</sub>),ceil( G<sub>y</sub> / S<sub>y</sub>)) |
| <br> |
| <br> |
| Given a global ID and the work-group size, the work-group ID for a |
| work-item is computed as: |
| <br></p></div> |
| <div class="paragraph"><p>        (w<sub>x</sub>, w<sub>y</sub>) = ( (g<sub>x</sub> s<sub>x</sub> F<sub>x</sub>) / S<sub>x</sub>, (g<sub>y</sub> s<sub>y</sub> F<sub>y</sub>) / |
| S<sub>y</sub> )</p></div> |
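| <div class="paragraph"><p>As an informative illustration, the OpenCL C fragment below reads the |
| quantities used in these equations from within a kernel; the kernel name and |
| output buffer are illustrative only.</p></div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre><code>kernel void show_ids(global int *out) |
| { |
|     size_t gx = get_global_id(0);       /* g: global ID                      */ |
|     size_t sx = get_local_id(0);        /* s: local ID within the work-group */ |
|     size_t wx = get_group_id(0);        /* w: work-group ID                  */ |
|     size_t Sx = get_local_size(0);      /* S: work-group size                */ |
|     size_t Fx = get_global_offset(0);   /* F: global ID offset               */ |
|  |
|     /* With uniform work-group sizes: gx == wx*Sx + sx + Fx. */ |
|     out[gx - Fx] = (int)(wx * Sx + sx); |
| }</code></pre> |
| </div></div> |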
| <div class="paragraph"><p><span class="image"> |
| <img src="opencl22-API_files/image007.jpg" alt="image"> |
| </span></p></div> |
| <div class="paragraph"><p><strong>Figure 3-3: An example of an NDRange index space showing work-items, |
| their global IDs and their mapping onto the pair of work-group and local |
| IDs. In this case, we assume that in each dimension, the size of the |
| work-group evenly divides the global NDRange size (i.e. all work-groups |
| have the same size) and that the offset is equal to zero.</strong> |
| <br> |
| <br> |
| Within a work-group work-items may be divided into sub-groups. The |
| mapping of work-items to sub-groups is implementation-defined and may be |
| queried at runtime. While sub-groups may be used in multi-dimensional |
| work-groups, each sub-group is 1-dimensional and any given work-item may |
| query which sub-group it is a member of. |
| <br> |
| <br> |
| Work-items are mapped into sub-groups through a combination of |
| compile-time decisions and the parameters of the dispatch. The mapping |
| to sub-groups is invariant for the duration of a kernel's execution, |
| across dispatches of a given kernel with the same work-group dimensions, |
| between dispatches and query operations consistent with the dispatch |
| parameterization, and from one work-group to another within the dispatch |
| (excluding the trailing edge work-groups in the presence of non-uniform |
| work-group sizes). In addition, all sub-groups within a work-group will |
| be the same size, apart from the sub-group with the maximum index which |
| may be smaller if the size of the work-group is not evenly divisible by |
| the size of the sub-groups. |
| <br> |
| <br> |
| In the degenerate case, a single sub-group must be supported for each |
| work-group. In this situation all sub-group scope functions are |
| equivalent to their work-group level equivalents.</p></div> |
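| <div class="paragraph"><p>The informative OpenCL C fragment below queries this mapping from within a |
| kernel; it assumes a device and compiler with sub-group support (for example |
| the cl_khr_subgroups extension or OpenCL 2.1 and later), and the buffer names |
| are illustrative only.</p></div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre><code>kernel void show_subgroups(global uint *ids, global uint *sizes) |
| { |
|     size_t gid = get_global_id(0); |
|     ids[gid]   = get_sub_group_id();     /* which sub-group this work-item is in */ |
|     sizes[gid] = get_sub_group_size();   /* size of that sub-group               */ |
|     /* get_sub_group_local_id() and get_num_sub_groups() complete the query set. */ |
| }</code></pre> |
| </div></div> |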
| </div> |
| <div class="sect3"> |
| <h4 id="_execution_model_execution_of_kernel_instances">3.2.2. Execution Model: Execution of kernel-instances</h4> |
| <div class="paragraph"><p>The work carried out by an OpenCL program occurs through the execution |
| of kernel-instances on compute devices. To understand the details of |
| OpenCL's execution model, we need to consider how a kernel object moves |
| from the kernel-enqueue command, into a command-queue, executes on a |
| device, and completes. |
| <br> |
| <br> |
| A kernel-object is defined from a function within the program object and |
| a collection of arguments connecting the kernel to a set of argument |
| values. The host program enqueues a kernel-object to the command queue |
| along with the NDRange, and the work-group decomposition. These define |
| a <em>kernel-instance</em>. In addition, an optional set of events may be |
| defined when the kernel is enqueued. The events associated with a |
| particular kernel-instance are used to constrain when the |
| kernel-instance is launched with respect to other commands in the queue |
| or to commands in other queues within the same context. |
| <br> |
| <br> |
| A kernel-instance is submitted to a device. For an in-order command |
| queue, the kernel instances appear to launch and then execute in that |
| same order, where we use the term <em>appear</em> to emphasize that when there |
| are no dependencies between commands and hence differences in the order |
| that commands execute cannot be observed in a program, an implementation |
| can reorder commands even in an in-order command queue. For an out of |
| order command-queue, kernel-instances wait to be launched until:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| Synchronization commands |
| enqueued prior to the kernel-instance are satisfied. |
| </p> |
| </li> |
| <li> |
| <p> |
| Each of the events in an |
| optional event list defined when the kernel-instance was enqueued are |
| set to CL_COMPLETE. |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>Once these conditions are met, the kernel-instance is launched and the |
| work-groups associated with the kernel-instance are placed into a pool |
| of ready to execute work-groups. This pool is called a <em>work-pool</em>. |
| The work-pool may be implemented in any manner as long as it assures |
| that work-groups placed in the pool will eventually execute. The |
| device schedules work-groups from the work-pool for execution on the |
| compute units of the device. The kernel-enqueue command is complete when |
| all work-groups associated with the kernel-instance end their execution, |
| updates to global memory associated with a command are visible globally, |
| and the device signals successful completion by setting the event |
| associated with the kernel-enqueue command to CL_COMPLETE. |
| <br> |
| <br> |
| While a command-queue is associated with only one device, a single |
| device may be associated with multiple command-queues all feeding into |
| the single work-pool. A device may also be associated with command |
| queues associated with different contexts within the same platform, |
| again all feeding into the single work-pool. The device will pull |
| work-groups from the work-pool and execute them on one or several |
| compute units in any order; possibly interleaving execution of |
| work-groups from multiple commands. A conforming implementation may |
| choose to serialize the work-groups so a correct algorithm cannot assume |
| that work-groups will execute in parallel. There is no safe and |
| portable way to synchronize across the independent execution of |
| work-groups since once in the work-pool, they can execute in any order. |
| <br> |
| <br> |
| The work-items within a single sub-group execute concurrently but not |
| necessarily in parallel (i.e. they are not guaranteed to make |
| independent forward progress). Therefore, only high-level |
| synchronization constructs (e.g. sub-group functions such as barriers) |
| that apply to all the work-items in a sub-group are well defined and |
| included in OpenCL. |
| <br> |
| <br> |
| Sub-groups execute concurrently within a given work-group and with |
| appropriate device support (<em>see Section 4.2</em>) may make independent |
| forward progress with respect to each other, with respect to host |
| threads and with respect to any entities external to the OpenCL system |
| but running on an OpenCL device, even in the absence of work-group |
| barrier operations. In this situation, sub-groups are able to internally |
| synchronize using barrier operations without synchronizing with each |
| other and may perform operations that rely on runtime dependencies on |
| operations other sub-groups perform. |
| <br> |
| <br> |
| The work-items within a single work-group execute concurrently but are |
| only guaranteed to make independent progress in the presence of |
| sub-groups and device support. In the absence of this capability, only |
| high-level synchronization constructs (e.g. work-group functions such as |
| barriers) that apply to all the work-items in a work-group are well |
| defined and included in OpenCL for synchronization within the |
| work-group. |
| <br> |
| <br> |
| In the absence of synchronization functions (e.g. a barrier), work-items |
| within a sub-group may be serialized. In the presence of sub-group |
| functions, work-items within a sub-group may be serialized before any |
| given sub-group function, between dynamically encountered pairs of |
| sub-group functions and between a work-group function and the end of the |
| kernel. |
| <br> |
| <br> |
| In the absence of independent forward progress of constituent |
| sub-groups, work-items within a work-group may be serialized before, |
| after or between work-group synchronization functions.</p></div> |
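| <div class="paragraph"><p>As an informative illustration of intra-work-group synchronization, the |
| OpenCL C kernel below reverses the items owned by each work-group through |
| local memory; the work-group barrier guarantees that all stores to the local |
| array are visible before any work-item reads it. The OpenCL C 2.0 spelling |
| <span class="monospaced">work_group_barrier</span> is used (earlier versions use <span class="monospaced">barrier</span>), and uniform |
| work-group sizes are assumed.</p></div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre><code>kernel void reverse_in_group(global float *data, local float *tmp) |
| { |
|     size_t lid = get_local_id(0); |
|     size_t lsz = get_local_size(0); |
|     size_t gid = get_global_id(0); |
|  |
|     tmp[lid] = data[gid];                     /* stage into local memory         */ |
|     work_group_barrier(CLK_LOCAL_MEM_FENCE);  /* all work-items reach this point */ |
|     data[gid] = tmp[lsz - 1 - lid];           /* safe to read only after barrier */ |
| }</code></pre> |
| </div></div> |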
| </div> |
| <div class="sect3"> |
| <h4 id="_execution_model_device_side_enqueue">3.2.3. Execution Model: Device-side enqueue</h4> |
| <div class="paragraph"><p>Algorithms may need to generate additional work as they execute. In |
| many cases, this additional work cannot be determined statically; so the |
| work associated with a kernel only emerges at runtime as the |
| kernel-instance executes. This capability could be implemented in logic |
| running within the host program, but involvement of the host may add |
| significant overhead and/or complexity to the application control |
| flow. A more efficient approach would be to nest kernel-enqueue |
| commands from inside other kernels. This <strong>nested parallelism</strong> can be |
| realized by supporting the enqueuing of kernels on a device without |
| direct involvement by the host program; so-called <strong>device-side |
| enqueue</strong>. |
| <br> |
| <br> |
| Device-side kernel-enqueue commands are similar to host-side |
| kernel-enqueue commands. The kernel executing on a device (the <strong>parent |
| kernel</strong>) enqueues a kernel-instance (the <strong>child kernel</strong>) to a |
| device-side command queue. This is an out-of-order command-queue and |
| follows the same behavior as the out-of-order command-queues exposed to |
| the host program. Commands enqueued to a device side command-queue |
| generate and use events to enforce order constraints just as for the |
| command-queue on the host. These events, however, are only visible to |
| the parent kernel running on the device. When these prerequisite |
| events take on the value CL_COMPLETE, the work-groups associated with |
| the child kernel are launched into the device's work-pool. The device |
| then schedules them for execution on the compute units of the device. |
| Child and parent kernels execute asynchronously. However, a parent will |
| not indicate that it is complete by setting its event to CL_COMPLETE |
| until all child kernels have ended execution and have signaled |
| completion by setting any associated events to the value CL_COMPLETE. |
| Should any child kernel complete with an event status set to a negative |
| value (i.e. abnormally terminate), the parent kernel will abnormally |
| terminate and propagate the child's negative event value as the value of |
| the parent's event. If there are multiple children that have an event |
| status set to a negative value, the selection of which child's negative |
| event value is propagated is implementation-defined.</p></div> |
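| <div class="paragraph"><p>The informative OpenCL C fragment below shows a parent kernel enqueuing a |
| child kernel to the default device-side queue; it assumes the host created an |
| on-device default queue and that the device supports device-side enqueue, and |
| the kernel name and size <span class="monospaced">n</span> are illustrative only.</p></div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre><code>kernel void parent(global int *data, int n) |
| { |
|     if (get_global_id(0) == 0) { |
|         enqueue_kernel(get_default_queue(), |
|                        CLK_ENQUEUE_FLAGS_WAIT_KERNEL,  /* child launches after the parent ends */ |
|                        ndrange_1D((size_t)n), |
|                        ^{ data[get_global_id(0)] += 1; });  /* the child kernel */ |
|     } |
| }</code></pre> |
| </div></div> |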
| </div> |
| <div class="sect3"> |
| <h4 id="_execution_model_synchronization">3.2.4. Execution Model: Synchronization</h4> |
| <div class="paragraph"><p>Synchronization refers to mechanisms that constrain the order of |
| execution between two or more units of execution. Consider the |
| following three domains of synchronization in OpenCL:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| Work-group |
| synchronization: Constraints on the order of execution for work-items in |
| a single work-group |
| </p> |
| </li> |
| <li> |
| <p> |
| Sub-group synchronization: |
| Constraints on the order of execution for work-items in a single |
| sub-group |
| </p> |
| </li> |
| <li> |
| <p> |
| Command synchronization: |
| Constraints on the order of commands launched for execution |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>Synchronization across all work-items within a single work-group is |
| carried out using a <em>work-group function</em>. These functions carry out |
| collective operations across all the work-items in a work-group. |
| Available collective operations are: barrier, reduction, broadcast, |
| prefix sum, and evaluation of a predicate. A work-group function must |
| occur within a converged control flow; i.e. all work-items in the |
| work-group must encounter precisely the same work-group function. For |
| example, if a work-group function occurs within a loop, the work-items |
| must encounter the same work-group function in the same loop |
| iterations. All the work-items of a work-group must execute the |
| work-group function and complete reads and writes to memory before any |
| are allowed to continue execution beyond the work-group function. |
| Work-group functions that apply between work-groups are not provided in |
| OpenCL since OpenCL does not define forward-progress or ordering |
| relations between work-groups, hence collective synchronization |
| operations are not well defined. |
| <br> |
| <br> |
| Synchronization across all work-items within a single sub-group is |
| carried out using a <em>sub-group function</em>. These functions carry out |
| collective operations across all the work-items in a sub-group. |
| Available collective operations are: barrier, reduction, broadcast, |
| prefix sum, and evaluation of a predicate. A sub-group function must |
| occur within a converged control flow; i.e. all work-items in the |
| sub-group must encounter precisely the same sub-group function. For |
| example, if a sub-group function occurs within a loop, the work-items |
| must encounter the same sub-group function in the same loop iterations. |
| All the work-items of a sub-group must execute the sub-group function |
| and complete reads and writes to memory before any are allowed to |
| continue execution beyond the sub-group function. Synchronization |
| between sub-groups must either be performed using work-group functions, |
| or through memory operations. Memory operations for sub-group |
| synchronization should be used carefully, as forward progress of |
| sub-groups relative to each other is only supported optionally by OpenCL |
| implementations. |
| <br> |
| <br> |
| Command synchronization is defined in terms of distinct <strong>synchronization |
| points</strong>. The synchronization points occur between commands in host |
| command-queues and between commands in device-side command-queues. The |
| synchronization points defined in OpenCL include:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>Launching a command:</strong> A |
| kernel-instance is launched onto a device after all events that the kernel |
| is waiting on have been set to CL_COMPLETE. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Ending a command:</strong> Child |
| kernels may be enqueued such that they wait for the parent kernel to |
| reach the <em>ended</em> state before they can be launched. In this case, the |
| ending of the parent command defines a synchronization point. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Completion of a command:</strong> |
| A kernel-instance is complete after all of the work-groups in the kernel |
| and all of its child kernels have completed. This is signaled to the |
| host, a parent kernel or other kernels within command queues by setting |
| the value of the event associated with a kernel to CL_COMPLETE. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Blocking Commands:</strong> A |
| blocking command defines a synchronization point between the unit of |
| execution that calls the blocking API function and the enqueued command |
| reaching the complete state. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Command-queue barrier:</strong> |
| The command-queue barrier ensures that all previously enqueued commands |
| have completed before subsequently enqueued commands can be launched. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>clFinish:</strong> This function |
| blocks until all previously enqueued commands in the command queue have |
| completed, after which clFinish defines a synchronization point and the |
| clFinish function returns. |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>A synchronization point between a pair of commands (A and B) assures |
| that results of command A happens-before command B is launched. This |
| requires that any updates to memory from command A complete and are made |
| available to other commands before the synchronization point completes. |
| Likewise, this requires that command B waits until after the |
| synchronization point before loading values from global memory. The |
| concept of a synchronization point works in a similar fashion for |
| commands such as a barrier that apply to two sets of commands. All the |
| commands prior to the barrier must complete and make their results |
| available to following commands. Furthermore, any commands following |
| the barrier must wait for the commands prior to the barrier before |
| loading values and continuing their execution. |
| <br> |
| <br> |
| These <em>happens-before</em> relationships are a fundamental part of the |
| OpenCL memory model. When applied at the level of commands, they are |
| straightforward to define at a language level in terms of ordering |
| relationships between different commands. Ordering memory operations |
| inside different commands, however, requires rules more complex than can |
| be captured by the high level concept of a synchronization point. |
| These rules are described in detail in section 3.3.6.</p></div> |
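| <div class="paragraph"><p>The informative host-code fragment below exercises several of the |
| synchronization points described above: an event dependency that gates |
| launching, a command-queue barrier, and the blocking clFinish call. Handles |
| and sizes are illustrative only.</p></div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre><code>cl_event write_done, kernel_done; |
|  |
| clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, host_data, |
|                      0, NULL, &amp;write_done); |
|  |
| /* Launching a command: the kernel may not launch until write_done |
|  * reaches CL_COMPLETE. */ |
| clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &amp;gsize, NULL, |
|                        1, &amp;write_done, &amp;kernel_done); |
|  |
| /* Command-queue barrier: later commands wait for all earlier commands. */ |
| clEnqueueBarrierWithWaitList(queue, 0, NULL, NULL); |
|  |
| /* Blocking synchronization point between the host and the queue. */ |
| clFinish(queue);</code></pre> |
| </div></div> |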
| </div> |
| <div class="sect3"> |
| <h4 id="_execution_model_categories_of_kernels">3.2.5. Execution Model: Categories of Kernels</h4> |
| <div class="paragraph"><p>The OpenCL execution model supports three types of kernels:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>OpenCL kernels</strong> are |
| managed by the OpenCL API as kernel-objects associated with kernel |
| functions within program-objects. OpenCL kernels are provided via a |
| kernel language. |
| All OpenCL implementations must support OpenCL kernels supplied in the |
| standard SPIR-V intermediate language with the appropriate environment |
| specification, and the OpenCL C programming language defined in earlier |
| versions of the OpenCL specification. SPIR-V binaries may be |
| generated from an |
| OpenCL kernel language or by a third party compiler from an |
| alternative input. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Native kernels</strong> are |
| accessed through a host function pointer. Native kernels are queued for |
| execution along with OpenCL kernels on a device and share memory objects |
| with OpenCL kernels. For example, these native kernels could be |
| functions defined in application code or exported from a library. The |
| ability to execute native kernels is optional within OpenCL and the |
| semantics of native kernels are implementation-defined. The OpenCL API |
| includes functions to query the capabilities of a device and determine if |
| this capability is supported. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Built-in kernels</strong> are tied |
| to particular device and are not built at runtime from source code in a |
| program object. The common use of built in kernels is to expose |
| fixed-function hardware or firmware associated with a particular OpenCL |
| device or custom device. The semantics of a built-in kernel may be |
| defined outside of OpenCL and hence are implementation defined. |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>All three types of kernels are manipulated through the OpenCL command |
| queues and must conform to the synchronization points defined in the |
| OpenCL execution model.</p></div> |
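| <div class="paragraph"><p>As a non-normative sketch, the host code below shows one way each category |
| of kernel can reach the runtime. It assumes a valid |
| <span class="monospaced">context</span>, <span class="monospaced">device</span> and |
| <span class="monospaced">queue</span>; the names |
| <span class="monospaced">source_str</span>, <span class="monospaced">spirv_bytes</span>, |
| <span class="monospaced">spirv_size</span>, <span class="monospaced">"vendor_fft"</span>, |
| <span class="monospaced">my_host_function</span>, <span class="monospaced">args</span> and |
| <span class="monospaced">args_size</span> are placeholders, and error handling is omitted.</p></div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre><code>cl_int err; |
|  |
| /* OpenCL kernels: built from OpenCL C source or from a SPIR-V module. */ |
| cl_program prog_src = clCreateProgramWithSource(context, 1, &amp;source_str, NULL, &amp;err); |
| cl_program prog_il  = clCreateProgramWithIL(context, spirv_bytes, spirv_size, &amp;err); |
|  |
| /* Built-in kernels: exposed by name for a particular device. */ |
| cl_program prog_builtin = |
|     clCreateProgramWithBuiltInKernels(context, 1, &amp;device, "vendor_fft", &amp;err); |
|  |
| /* Native kernels: optional; query the device before using them. */ |
| cl_device_exec_capabilities caps; |
| clGetDeviceInfo(device, CL_DEVICE_EXECUTION_CAPABILITIES, sizeof(caps), &amp;caps, NULL); |
| if (caps &amp; CL_EXEC_NATIVE_KERNEL) |
|     clEnqueueNativeKernel(queue, my_host_function, args, args_size, |
|                           0, NULL, NULL, 0, NULL, NULL);</code></pre> |
| </div></div> |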
| </div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_memory_model">3.3. Memory Model</h3> |
| <div class="paragraph"><p>The OpenCL memory model describes the structure, contents, and behavior |
| of the memory exposed by an OpenCL platform as an OpenCL program runs. |
| The model allows a programmer to reason about values in memory as the |
| host program and multiple kernel-instances execute. |
| <br> |
| <br> |
| An OpenCL program defines a context that includes a host, one or more |
| devices, command-queues, and memory exposed within the context. |
| Consider the units of execution involved with such a program. The host |
| program runs as one or more host threads managed by the operating system |
| running on the host (the details of which are defined outside of |
| OpenCL). There may be multiple devices in a single context which all |
| have access to memory objects defined by OpenCL. On a single device, |
| multiple work-groups may execute in parallel with potentially |
| overlapping updates to memory. Finally, within a single work-group, |
| multiple work-items concurrently execute, once again with potentially |
| overlapping updates to memory. |
| <br> |
| <br> |
| The memory model must precisely define how the values in memory as seen |
| from each of these units of execution interact so a programmer can |
| reason about the correctness of OpenCL programs. We define the memory |
| model in four parts.</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| Memory regions: The |
| distinct memories visible to the host and the devices that share a |
| context. |
| </p> |
| </li> |
| <li> |
| <p> |
| Memory objects: The |
| objects defined by the OpenCL API and their management by the host and |
| devices. |
| </p> |
| </li> |
| <li> |
| <p> |
| Shared Virtual Memory: A |
| virtual address space exposed to both the host and the devices within a |
| context. |
| </p> |
| </li> |
| <li> |
| <p> |
| Consistency Model: Rules |
| that define which values are observed when multiple units of execution |
| load data from memory plus the atomic/fence operations that constrain |
| the order of memory operations and define synchronization relationships. |
| </p> |
| </li> |
| </ul></div> |
| <div class="sect3"> |
| <h4 id="_memory_model_fundamental_memory_regions">3.3.1. Memory Model: Fundamental Memory Regions</h4> |
| <div class="paragraph"><p>Memory in OpenCL is divided into two parts.</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>Host Memory:</strong> The memory |
| directly available to the host. The detailed behavior of host memory is |
| defined outside of OpenCL. Memory objects move between the Host and the |
| devices through functions within the OpenCL API or through a shared |
| virtual memory interface. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Device Memory:</strong> Memory |
| directly available to kernels executing on OpenCL devices. |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>Device memory consists of four named address spaces or <em>memory regions</em>:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>Global Memory:</strong> This |
| memory region permits read/write access to all work-items in all |
| work-groups running on any device within a context. Work-items can read |
| from or write to any element of a memory object. Reads and writes to |
| global memory may be cached depending on the capabilities of the device. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Constant Memory</strong>: A |
| region of global memory that remains constant during the execution of a |
| kernel-instance. The host allocates and initializes memory objects |
| placed into constant memory. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Local Memory</strong>: A memory |
| region local to a work-group. This memory region can be used to allocate |
| variables that are shared by all work-items in that work-group. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Private Memory</strong>: A region |
| of memory private to a work-item. Variables defined in one work-item's |
| private memory are not visible to another work-item. |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p> </p></div> |
| <div class="paragraph"><p>The memory regions and their relationship to the OpenCL Platform model |
| are summarized in figure 3-4. Local and private memories are always |
| associated with a particular device. The global and constant memories, |
| however, are shared between all devices within a given context. An |
| OpenCL device may include a cache to support efficient access to these |
| shared memories. |
| <br> |
| <br> |
| To understand memory in OpenCL, it is important to appreciate the |
| relationships between these named address spaces. The four named |
| address spaces available to a device are disjoint, meaning they do not |
| overlap. This is a logical relationship, however, and an |
| implementation may choose to let these disjoint named address spaces |
| share physical memory. |
| <br> |
| <br> |
| Programmers often need functions callable from kernels where the |
| pointers manipulated by those functions can point to multiple named |
| address spaces. This saves a programmer from the error-prone and |
| wasteful practice of creating multiple copies of functions, one for each |
| named address space. Therefore, the global, local and private address |
| spaces belong to a single <em>generic address space</em>. This is closely |
| modeled after the concept of a generic address space used in the |
| embedded C specification (ISO/IEC TR 18037). Since they all belong to a |
| single generic address space, the following properties are supported for |
| pointers to named address spaces in device memory:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| A pointer to the generic |
| address space can be cast to a pointer to a global, local or private |
| address space. |
| </p> |
| </li> |
| <li> |
| <p> |
| A pointer to a global, |
| local or private address space can be cast to a pointer to the generic |
| address space. |
| </p> |
| </li> |
| <li> |
| <p> |
| A pointer to a global, |
| local or private address space can be implicitly converted to a pointer |
| to the generic address space, but the converse is not allowed. |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p> </p></div> |
| <div class="paragraph"><p>The constant address space is disjoint from the generic address space. |
| <br> |
| <br> |
| The addresses of memory associated with memory objects in Global memory |
| are not preserved between kernel instances, between a device and the |
| host, and between devices. In this regard global memory acts as a global |
| pool of memory objects rather than an address space. This restriction is |
| relaxed when shared virtual memory (SVM) is used. |
| <br> |
| <br> |
| SVM causes addresses to be meaningful between the host and all of the |
| devices within a context, hence supporting the use of pointer-based data |
| structures in OpenCL kernels. It logically extends a portion of the |
| global memory into the host address space giving work-items access to |
| the host address space. On platforms with hardware support for a shared |
| address space between the host and one or more devices, SVM may also |
| provide a more efficient way to share data between devices and the host. |
| Details about SVM are presented in section 3.3.3.</p></div> |
| <div class="paragraph"><p><span class="image"> |
| <img src="opencl22-API_files/image008.jpg" alt="image"> |
| </span></p></div> |
| <div class="paragraph"><p><strong>Figure 3-4: The named address spaces exposed in an OpenCL Platform. |
| Global and Constant memories are shared by the one or more devices |
| within a context, while local and private memories are associated with a |
| single device. Each device may include an optional cache to support |
| efficient access to their view of the global and constant address |
| spaces.</strong></p></div> |
| <div class="paragraph"><p>A programmer may use the features of the memory consistency model |
| (section 3.3.4) to manage safe access to global memory from multiple |
| work-items potentially running on one or more devices. In addition, when |
| using shared virtual memory (SVM), the memory consistency model may also |
| be used to ensure that host threads safely access memory locations in |
| the shared memory region.</p></div> |
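| <div class="paragraph"><p>A non-normative, coarse-grained buffer SVM sketch of the ideas above is |
| shown below (SVM itself is detailed in section 3.3.3). It assumes a valid |
| <span class="monospaced">context</span>, <span class="monospaced">queue</span> and |
| <span class="monospaced">kernel</span> whose first argument is a pointer; error |
| handling is omitted.</p></div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre><code>/* Hypothetical coarse-grained buffer SVM sketch. */ |
| size_t bytes = 1024 * sizeof(float); |
|  |
| /* Allocate SVM: the same address is meaningful to the host and the devices. */ |
| float *data = (float *)clSVMAlloc(context, CL_MEM_READ_WRITE, bytes, 0); |
|  |
| /* Coarse-grained SVM: map before the host touches the data, unmap afterwards. */ |
| clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, bytes, 0, NULL, NULL); |
| for (size_t i = 0; i &lt; 1024; ++i) |
|     data[i] = (float)i; |
| clEnqueueSVMUnmap(queue, data, 0, NULL, NULL); |
|  |
| /* The kernel receives the raw pointer, so pointer-based structures built |
|  * by the host remain valid on the device. */ |
| clSetKernelArgSVMPointer(kernel, 0, data); |
| size_t global = 1024; |
| clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &amp;global, NULL, 0, NULL, NULL); |
|  |
| clFinish(queue); |
| clSVMFree(context, data);</code></pre> |
| </div></div> |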
| </div> |
| <div class="sect3"> |
| <h4 id="_memory_model_memory_objects">3.3.2. Memory Model: Memory Objects</h4> |
| <div class="paragraph"><p>The contents of global memory are <em>memory objects</em>. A memory object is |
| a handle to a reference counted region of global memory. Memory objects |
| use the OpenCL type <em>cl_mem</em> and fall into three distinct classes.</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>Buffer</strong>: A memory object |
| stored as a block of contiguous memory and used as a general purpose |
| object to hold data used in an OpenCL program. The types of the values |
| within a buffer may be any of the built-in types (such as int, float), |
| vector types, or user-defined structures. The buffer can be |
| manipulated through pointers much as one would with any block of memory |
| in C. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Image</strong>: An image memory |
| object holds one-, two- or three-dimensional images. The formats are |
| based on the standard image formats used in graphics applications. An |
| image is an opaque data structure managed by functions defined in the |
| OpenCL API. To optimize the manipulation of images stored in the |
| texture memories found in many GPUs, OpenCL kernels have traditionally |
| been disallowed from both reading and writing a single image. In OpenCL |
| 2.0, however, we have relaxed this restriction by providing |
| synchronization and fence operations that let programmers properly |
| synchronize their code to safely allow a kernel to read and write a |
| single image. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Pipe</strong>: The <em>pipe</em> memory |
| object conceptually is an ordered sequence of data items. A pipe has |
| two endpoints: a write endpoint into which data items are inserted, and |
| a read endpoint from which data items are removed. At any one time, |
| only one kernel instance may write into a pipe, and only one kernel |
| instance may read from a pipe. To support the producer-consumer design |
| pattern, one kernel instance connects to the write endpoint (the |
| producer) while another kernel instance connects to the read endpoint |
| (the consumer). |
| </p> |
| </li> |
|