| <!DOCTYPE html> |
| <html lang="en"> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> |
| <meta name="generator" content="AsciiDoc 8.6.9"> |
| <title>The OpenCL Specification</title> |
| <style type="text/css"> |
| /* Shared CSS for AsciiDoc xhtml11 and html5 backends */ |
| |
| /* Default font. */ |
| body { |
| font-family: Georgia,serif; |
| } |
| |
| /* Title font. */ |
| h1, h2, h3, h4, h5, h6, |
| div.title, caption.title, |
| thead, p.table.header, |
| #toctitle, |
| #author, #revnumber, #revdate, #revremark, |
| #footer { |
| font-family: Arial,Helvetica,sans-serif; |
| } |
| |
| body { |
| margin: 1em 5% 1em 5%; |
| } |
| |
| a { |
| color: blue; |
| text-decoration: underline; |
| } |
| a:visited { |
| color: fuchsia; |
| } |
| |
| em { |
| font-style: italic; |
| color: navy; |
| } |
| |
| strong { |
| font-weight: bold; |
| color: #083194; |
| } |
| |
| h1, h2, h3, h4, h5, h6 { |
| color: #527bbd; |
| margin-top: 1.2em; |
| margin-bottom: 0.5em; |
| line-height: 1.3; |
| } |
| |
| h1, h2, h3 { |
| border-bottom: 2px solid silver; |
| } |
| h2 { |
| padding-top: 0.5em; |
| } |
| h3 { |
| float: left; |
| } |
| h3 + * { |
| clear: left; |
| } |
| h5 { |
| font-size: 1.0em; |
| } |
| |
| div.sectionbody { |
| margin-left: 0; |
| } |
| |
| hr { |
| border: 1px solid silver; |
| } |
| |
| p { |
| margin-top: 0.5em; |
| margin-bottom: 0.5em; |
| } |
| |
| ul, ol, li > p { |
| margin-top: 0; |
| } |
| ul > li { color: #aaa; } |
| ul > li > * { color: black; } |
| |
| .monospaced, code, pre { |
| font-family: "Courier New", Courier, monospace; |
| font-size: inherit; |
| color: navy; |
| padding: 0; |
| margin: 0; |
| } |
| pre { |
| white-space: pre-wrap; |
| } |
| |
| #author { |
| color: #527bbd; |
| font-weight: bold; |
| font-size: 1.1em; |
| } |
| #email { |
| } |
| #revnumber, #revdate, #revremark { |
| } |
| |
| #footer { |
| font-size: small; |
| border-top: 2px solid silver; |
| padding-top: 0.5em; |
| margin-top: 4.0em; |
| } |
| #footer-text { |
| float: left; |
| padding-bottom: 0.5em; |
| } |
| #footer-badges { |
| float: right; |
| padding-bottom: 0.5em; |
| } |
| |
| #preamble { |
| margin-top: 1.5em; |
| margin-bottom: 1.5em; |
| } |
| div.imageblock, div.exampleblock, div.verseblock, |
| div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock, |
| div.admonitionblock { |
| margin-top: 1.0em; |
| margin-bottom: 1.5em; |
| } |
| div.admonitionblock { |
| margin-top: 2.0em; |
| margin-bottom: 2.0em; |
| margin-right: 10%; |
| color: #606060; |
| } |
| |
| div.content { /* Block element content. */ |
| padding: 0; |
| } |
| |
| /* Block element titles. */ |
| div.title, caption.title { |
| color: #527bbd; |
| font-weight: bold; |
| text-align: left; |
| margin-top: 1.0em; |
| margin-bottom: 0.5em; |
| } |
| div.title + * { |
| margin-top: 0; |
| } |
| |
| td div.title:first-child { |
| margin-top: 0.0em; |
| } |
| div.content div.title:first-child { |
| margin-top: 0.0em; |
| } |
| div.content + div.title { |
| margin-top: 0.0em; |
| } |
| |
| div.sidebarblock > div.content { |
| background: #ffffee; |
| border: 1px solid #dddddd; |
| border-left: 4px solid #f0f0f0; |
| padding: 0.5em; |
| } |
| |
| div.listingblock > div.content { |
| border: 1px solid #dddddd; |
| border-left: 5px solid #f0f0f0; |
| background: #f8f8f8; |
| padding: 0.5em; |
| } |
| |
| div.quoteblock, div.verseblock { |
| padding-left: 1.0em; |
| margin-left: 1.0em; |
| margin-right: 10%; |
| border-left: 5px solid #f0f0f0; |
| color: #888; |
| } |
| |
| div.quoteblock > div.attribution { |
| padding-top: 0.5em; |
| text-align: right; |
| } |
| |
| div.verseblock > pre.content { |
| font-family: inherit; |
| font-size: inherit; |
| } |
| div.verseblock > div.attribution { |
| padding-top: 0.75em; |
| text-align: left; |
| } |
| /* DEPRECATED: Pre version 8.2.7 verse style literal block. */ |
| div.verseblock + div.attribution { |
| text-align: left; |
| } |
| |
| div.admonitionblock .icon { |
| vertical-align: top; |
| font-size: 1.1em; |
| font-weight: bold; |
| text-decoration: underline; |
| color: #527bbd; |
| padding-right: 0.5em; |
| } |
| div.admonitionblock td.content { |
| padding-left: 0.5em; |
| border-left: 3px solid #dddddd; |
| } |
| |
| div.exampleblock > div.content { |
| border-left: 3px solid #dddddd; |
| padding-left: 0.5em; |
| } |
| |
| div.imageblock div.content { padding-left: 0; } |
| span.image img { border-style: none; vertical-align: text-bottom; } |
| a.image:visited { color: white; } |
| |
| dl { |
| margin-top: 0.8em; |
| margin-bottom: 0.8em; |
| } |
| dt { |
| margin-top: 0.5em; |
| margin-bottom: 0; |
| font-style: normal; |
| color: navy; |
| } |
| dd > *:first-child { |
| margin-top: 0.1em; |
| } |
| |
| ul, ol { |
| list-style-position: outside; |
| } |
| ol.arabic { |
| list-style-type: decimal; |
| } |
| ol.loweralpha { |
| list-style-type: lower-alpha; |
| } |
| ol.upperalpha { |
| list-style-type: upper-alpha; |
| } |
| ol.lowerroman { |
| list-style-type: lower-roman; |
| } |
| ol.upperroman { |
| list-style-type: upper-roman; |
| } |
| |
| div.compact ul, div.compact ol, |
| div.compact p, div.compact p, |
| div.compact div, div.compact div { |
| margin-top: 0.1em; |
| margin-bottom: 0.1em; |
| } |
| |
| tfoot { |
| font-weight: bold; |
| } |
| td > div.verse { |
| white-space: pre; |
| } |
| |
| div.hdlist { |
| margin-top: 0.8em; |
| margin-bottom: 0.8em; |
| } |
| div.hdlist tr { |
| padding-bottom: 15px; |
| } |
| dt.hdlist1.strong, td.hdlist1.strong { |
| font-weight: bold; |
| } |
| td.hdlist1 { |
| vertical-align: top; |
| font-style: normal; |
| padding-right: 0.8em; |
| color: navy; |
| } |
| td.hdlist2 { |
| vertical-align: top; |
| } |
| div.hdlist.compact tr { |
| margin: 0; |
| padding-bottom: 0; |
| } |
| |
| .comment { |
| background: yellow; |
| } |
| |
| .footnote, .footnoteref { |
| font-size: 0.8em; |
| } |
| |
| span.footnote, span.footnoteref { |
| vertical-align: super; |
| } |
| |
| #footnotes { |
| margin: 20px 0 20px 0; |
| padding: 7px 0 0 0; |
| } |
| |
| #footnotes div.footnote { |
| margin: 0 0 5px 0; |
| } |
| |
| #footnotes hr { |
| border: none; |
| border-top: 1px solid silver; |
| height: 1px; |
| text-align: left; |
| margin-left: 0; |
| width: 20%; |
| min-width: 100px; |
| } |
| |
| div.colist td { |
| padding-right: 0.5em; |
| padding-bottom: 0.3em; |
| vertical-align: top; |
| } |
| div.colist td img { |
| margin-top: 0.3em; |
| } |
| |
| @media print { |
| #footer-badges { display: none; } |
| } |
| |
| #toc { |
| margin-bottom: 2.5em; |
| } |
| |
| #toctitle { |
| color: #527bbd; |
| font-size: 1.1em; |
| font-weight: bold; |
| margin-top: 1.0em; |
| margin-bottom: 0.1em; |
| } |
| |
| div.toclevel0, div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 { |
| margin-top: 0; |
| margin-bottom: 0; |
| } |
| div.toclevel2 { |
| margin-left: 2em; |
| font-size: 0.9em; |
| } |
| div.toclevel3 { |
| margin-left: 4em; |
| font-size: 0.9em; |
| } |
| div.toclevel4 { |
| margin-left: 6em; |
| font-size: 0.9em; |
| } |
| |
| span.aqua { color: aqua; } |
| span.black { color: black; } |
| span.blue { color: blue; } |
| span.fuchsia { color: fuchsia; } |
| span.gray { color: gray; } |
| span.green { color: green; } |
| span.lime { color: lime; } |
| span.maroon { color: maroon; } |
| span.navy { color: navy; } |
| span.olive { color: olive; } |
| span.purple { color: purple; } |
| span.red { color: red; } |
| span.silver { color: silver; } |
| span.teal { color: teal; } |
| span.white { color: white; } |
| span.yellow { color: yellow; } |
| |
| span.aqua-background { background: aqua; } |
| span.black-background { background: black; } |
| span.blue-background { background: blue; } |
| span.fuchsia-background { background: fuchsia; } |
| span.gray-background { background: gray; } |
| span.green-background { background: green; } |
| span.lime-background { background: lime; } |
| span.maroon-background { background: maroon; } |
| span.navy-background { background: navy; } |
| span.olive-background { background: olive; } |
| span.purple-background { background: purple; } |
| span.red-background { background: red; } |
| span.silver-background { background: silver; } |
| span.teal-background { background: teal; } |
| span.white-background { background: white; } |
| span.yellow-background { background: yellow; } |
| |
| span.big { font-size: 2em; } |
| span.small { font-size: 0.6em; } |
| |
| span.underline { text-decoration: underline; } |
| span.overline { text-decoration: overline; } |
| span.line-through { text-decoration: line-through; } |
| |
| div.unbreakable { page-break-inside: avoid; } |
| |
| |
| /* |
| * xhtml11 specific |
| * |
| * */ |
| |
| div.tableblock { |
| margin-top: 1.0em; |
| margin-bottom: 1.5em; |
| } |
| div.tableblock > table { |
| border: 3px solid #527bbd; |
| } |
| thead, p.table.header { |
| font-weight: bold; |
| color: #527bbd; |
| } |
| p.table { |
| margin-top: 0; |
| } |
/* Because the table frame attribute is overridden by CSS in most browsers. */
| div.tableblock > table[frame="void"] { |
| border-style: none; |
| } |
| div.tableblock > table[frame="hsides"] { |
| border-left-style: none; |
| border-right-style: none; |
| } |
| div.tableblock > table[frame="vsides"] { |
| border-top-style: none; |
| border-bottom-style: none; |
| } |
| |
| |
| /* |
| * html5 specific |
| * |
| * */ |
| |
| table.tableblock { |
| margin-top: 1.0em; |
| margin-bottom: 1.5em; |
| } |
| thead, p.tableblock.header { |
| font-weight: bold; |
| color: #527bbd; |
| } |
| p.tableblock { |
| margin-top: 0; |
| } |
| table.tableblock { |
| border-width: 3px; |
| border-spacing: 0px; |
| border-style: solid; |
| border-color: #527bbd; |
| border-collapse: collapse; |
| } |
| th.tableblock, td.tableblock { |
| border-width: 1px; |
| padding: 4px; |
| border-style: solid; |
| border-color: #527bbd; |
| } |
| |
| table.tableblock.frame-topbot { |
| border-left-style: hidden; |
| border-right-style: hidden; |
| } |
| table.tableblock.frame-sides { |
| border-top-style: hidden; |
| border-bottom-style: hidden; |
| } |
| table.tableblock.frame-none { |
| border-style: hidden; |
| } |
| |
| th.tableblock.halign-left, td.tableblock.halign-left { |
| text-align: left; |
| } |
| th.tableblock.halign-center, td.tableblock.halign-center { |
| text-align: center; |
| } |
| th.tableblock.halign-right, td.tableblock.halign-right { |
| text-align: right; |
| } |
| |
| th.tableblock.valign-top, td.tableblock.valign-top { |
| vertical-align: top; |
| } |
| th.tableblock.valign-middle, td.tableblock.valign-middle { |
| vertical-align: middle; |
| } |
| th.tableblock.valign-bottom, td.tableblock.valign-bottom { |
| vertical-align: bottom; |
| } |
| |
| |
| /* |
| * manpage specific |
| * |
| * */ |
| |
| body.manpage h1 { |
| padding-top: 0.5em; |
| padding-bottom: 0.5em; |
| border-top: 2px solid silver; |
| border-bottom: 2px solid silver; |
| } |
| body.manpage h2 { |
| border-style: none; |
| } |
| body.manpage div.sectionbody { |
| margin-left: 3em; |
| } |
| |
| @media print { |
| body.manpage div#toc { display: none; } |
| } |
| |
| |
| @media screen { |
| body { |
| max-width: 50em; /* approximately 80 characters wide */ |
| margin-left: 16em; |
| } |
| |
| #toc { |
| position: fixed; |
| top: 0; |
| left: 0; |
| bottom: 0; |
| width: 13em; |
| padding: 0.5em; |
| padding-bottom: 1.5em; |
| margin: 0; |
| overflow: auto; |
| border-right: 3px solid #f8f8f8; |
| background-color: white; |
| } |
| |
| #toc .toclevel1 { |
| margin-top: 0.5em; |
| } |
| |
| #toc .toclevel2 { |
| margin-top: 0.25em; |
| display: list-item; |
| color: #aaaaaa; |
| } |
| |
| #toctitle { |
| margin-top: 0.5em; |
| } |
| } |
| </style> |
| <script type="text/javascript"> |
| /*<![CDATA[*/ |
| var asciidoc = { // Namespace. |
| |
| ///////////////////////////////////////////////////////////////////// |
| // Table Of Contents generator |
| ///////////////////////////////////////////////////////////////////// |
| |
| /* Author: Mihai Bazon, September 2002 |
| * http://students.infoiasi.ro/~mishoo |
| * |
| * Table Of Content generator |
| * Version: 0.4 |
| * |
| * Feel free to use this script under the terms of the GNU General Public |
| * License, as long as you do not remove or alter this notice. |
| */ |
| |
| /* modified by Troy D. Hanson, September 2006. License: GPL */ |
| /* modified by Stuart Rackham, 2006, 2009. License: GPL */ |
| |
| // toclevels = 1..4. |
| toc: function (toclevels) { |
| |
| function getText(el) { |
| var text = ""; |
| for (var i = el.firstChild; i != null; i = i.nextSibling) { |
| if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants. |
| text += i.data; |
| else if (i.firstChild != null) |
| text += getText(i); |
| } |
| return text; |
| } |
| |
| function TocEntry(el, text, toclevel) { |
| this.element = el; |
| this.text = text; |
| this.toclevel = toclevel; |
| } |
| |
| function tocEntries(el, toclevels) { |
| var result = new Array; |
| var re = new RegExp('[hH]([1-'+(toclevels+1)+'])'); |
| // Function that scans the DOM tree for header elements (the DOM2 |
| // nodeIterator API would be a better technique but not supported by all |
| // browsers). |
| var iterate = function (el) { |
| for (var i = el.firstChild; i != null; i = i.nextSibling) { |
| if (i.nodeType == 1 /* Node.ELEMENT_NODE */) { |
| var mo = re.exec(i.tagName); |
| if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") { |
| result[result.length] = new TocEntry(i, getText(i), mo[1]-1); |
| } |
| iterate(i); |
| } |
| } |
| } |
| iterate(el); |
| return result; |
| } |
| |
| var toc = document.getElementById("toc"); |
| if (!toc) { |
| return; |
| } |
| |
| // Delete existing TOC entries in case we're reloading the TOC. |
| var tocEntriesToRemove = []; |
| var i; |
| for (i = 0; i < toc.childNodes.length; i++) { |
| var entry = toc.childNodes[i]; |
| if (entry.nodeName.toLowerCase() == 'div' |
| && entry.getAttribute("class") |
| && entry.getAttribute("class").match(/^toclevel/)) |
| tocEntriesToRemove.push(entry); |
| } |
| for (i = 0; i < tocEntriesToRemove.length; i++) { |
| toc.removeChild(tocEntriesToRemove[i]); |
| } |
| |
| // Rebuild TOC entries. |
| var entries = tocEntries(document.getElementById("content"), toclevels); |
| for (var i = 0; i < entries.length; ++i) { |
| var entry = entries[i]; |
| if (entry.element.id == "") |
| entry.element.id = "_toc_" + i; |
| var a = document.createElement("a"); |
| a.href = "#" + entry.element.id; |
| a.appendChild(document.createTextNode(entry.text)); |
| var div = document.createElement("div"); |
| div.appendChild(a); |
| div.className = "toclevel" + entry.toclevel; |
| toc.appendChild(div); |
| } |
| if (entries.length == 0) |
| toc.parentNode.removeChild(toc); |
| }, |
| |
| |
| ///////////////////////////////////////////////////////////////////// |
| // Footnotes generator |
| ///////////////////////////////////////////////////////////////////// |
| |
| /* Based on footnote generation code from: |
| * http://www.brandspankingnew.net/archive/2005/07/format_footnote.html |
| */ |
| |
| footnotes: function () { |
  // Delete existing footnote entries in case we're reloading the footnotes.
| var i; |
| var noteholder = document.getElementById("footnotes"); |
| if (!noteholder) { |
| return; |
| } |
| var entriesToRemove = []; |
| for (i = 0; i < noteholder.childNodes.length; i++) { |
| var entry = noteholder.childNodes[i]; |
| if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote") |
| entriesToRemove.push(entry); |
| } |
| for (i = 0; i < entriesToRemove.length; i++) { |
| noteholder.removeChild(entriesToRemove[i]); |
| } |
| |
| // Rebuild footnote entries. |
| var cont = document.getElementById("content"); |
| var spans = cont.getElementsByTagName("span"); |
| var refs = {}; |
| var n = 0; |
| for (i=0; i<spans.length; i++) { |
| if (spans[i].className == "footnote") { |
| n++; |
| var note = spans[i].getAttribute("data-note"); |
| if (!note) { |
| // Use [\s\S] in place of . so multi-line matches work. |
| // Because JavaScript has no s (dotall) regex flag. |
| note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1]; |
| spans[i].innerHTML = |
| "[<a id='_footnoteref_" + n + "' href='#_footnote_" + n + |
| "' title='View footnote' class='footnote'>" + n + "</a>]"; |
| spans[i].setAttribute("data-note", note); |
| } |
| noteholder.innerHTML += |
| "<div class='footnote' id='_footnote_" + n + "'>" + |
| "<a href='#_footnoteref_" + n + "' title='Return to text'>" + |
| n + "</a>. " + note + "</div>"; |
| var id =spans[i].getAttribute("id"); |
| if (id != null) refs["#"+id] = n; |
| } |
| } |
| if (n == 0) |
| noteholder.parentNode.removeChild(noteholder); |
| else { |
| // Process footnoterefs. |
| for (i=0; i<spans.length; i++) { |
| if (spans[i].className == "footnoteref") { |
| var href = spans[i].getElementsByTagName("a")[0].getAttribute("href"); |
        href = href.match(/#.*/)[0];  // Because IE returns the full URL.
| n = refs[href]; |
| spans[i].innerHTML = |
| "[<a href='#_footnote_" + n + |
| "' title='View footnote' class='footnote'>" + n + "</a>]"; |
| } |
| } |
| } |
| }, |
| |
| install: function(toclevels) { |
| var timerId; |
| |
| function reinstall() { |
| asciidoc.footnotes(); |
| if (toclevels) { |
| asciidoc.toc(toclevels); |
| } |
| } |
| |
| function reinstallAndRemoveTimer() { |
| clearInterval(timerId); |
| reinstall(); |
| } |
| |
| timerId = setInterval(reinstall, 500); |
| if (document.addEventListener) |
| document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false); |
| else |
| window.onload = reinstallAndRemoveTimer; |
| } |
| |
| } |
| asciidoc.install(3); |
| /*]]>*/ |
| </script> |
| <script type="text/x-mathjax-config"> |
| MathJax.Hub.Config({ |
| MathML: { extensions: ["content-mathml.js"] }, |
| tex2jax: { inlineMath: [['$','$'], ['\\(','\\)']] } |
| }); |
| </script> |
| <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"> |
| </script> |
| </head> |
| <body class="book"> |
| <div id="header"> |
| <h1>The OpenCL Specification</h1> |
| <span id="author">Khronos OpenCL Working Group</span><br> |
| <span id="revnumber">version v2.2-3</span> |
| <div id="toc"> |
| <div id="toctitle">Table of Contents</div> |
| <noscript><p><b>JavaScript must be enabled in your browser to display the table of contents.</b></p></noscript> |
| </div> |
| </div> |
| <div id="content"> |
| <div id="preamble"> |
| <div class="sectionbody"> |
| <div class="paragraph"><p>Copyright 2008-2017 The Khronos Group.</p></div> |
| <div class="paragraph"><p>This specification is protected by copyright laws and contains material proprietary |
| to the Khronos Group, Inc. Except as described by these terms, it or any components |
| may not be reproduced, republished, distributed, transmitted, displayed, broadcast |
| or otherwise exploited in any manner without the express prior written permission |
| of Khronos Group.</p></div> |
| <div class="paragraph"><p>Khronos Group grants a conditional copyright license to use and reproduce the |
| unmodified specification for any purpose, without fee or royalty, EXCEPT no licenses |
| to any patent, trademark or other intellectual property rights are granted under |
| these terms. Parties desiring to implement the specification and make use of |
| Khronos trademarks in relation to that implementation, and receive reciprocal patent |
| license protection under the Khronos IP Policy must become Adopters and confirm the |
| implementation as conformant under the process defined by Khronos for this |
| specification; see <a href="https://www.khronos.org/adopters">https://www.khronos.org/adopters</a>.</p></div> |
| <div class="paragraph"><p>Khronos Group makes no, and expressly disclaims any, representations or warranties, |
| express or implied, regarding this specification, including, without limitation: |
| merchantability, fitness for a particular purpose, non-infringement of any |
| intellectual property, correctness, accuracy, completeness, timeliness, and |
| reliability. Under no circumstances will the Khronos Group, or any of its Promoters, |
| Contributors or Members, or their respective partners, officers, directors, |
| employees, agents or representatives be liable for any damages, whether direct, |
| indirect, special or consequential damages for lost revenues, lost profits, or |
| otherwise, arising from or in connection with these materials.</p></div> |
| <div class="paragraph"><p>Vulkan is a registered trademark and Khronos, OpenXR, SPIR, SPIR-V, SYCL, WebGL, |
| WebCL, OpenVX, OpenVG, EGL, COLLADA, glTF, NNEF, OpenKODE, OpenKCAM, StreamInput, |
| OpenWF, OpenSL ES, OpenMAX, OpenMAX AL, OpenMAX IL, OpenMAX DL, OpenML and DevU are |
| trademarks of the Khronos Group Inc. ASTC is a trademark of ARM Holdings PLC, |
| OpenCL is a trademark of Apple Inc. and OpenGL and OpenML are registered trademarks |
| and the OpenGL ES and OpenGL SC logos are trademarks of Silicon Graphics |
| International used under license by Khronos. All other product names, trademarks, |
| and/or company names are used solely for identification and belong to their |
| respective owners.</p></div> |
| <div style="page-break-after:always"></div> |
| <div class="paragraph"><p><strong>Acknowledgements</strong></p></div> |
| <div class="paragraph"><p>The OpenCL specification is the result of the contributions of many |
| people, representing a cross section of the desktop, hand-held, and |
| embedded computer industry. Following is a partial list of the |
| contributors, including the company that they represented at the time of |
| their contribution:</p></div> |
| <div class="paragraph"><p>Chuck Rose, Adobe<br> |
| Eric Berdahl, Adobe<br> |
| Shivani Gupta, Adobe<br> |
| Bill Licea Kane, AMD<br> |
| Ed Buckingham, AMD<br> |
| Jan Civlin, AMD<br> |
| Laurent Morichetti, AMD<br> |
| Mark Fowler, AMD<br> |
| Marty Johnson, AMD<br> |
| Michael Mantor, AMD<br> |
| Norm Rubin, AMD<br> |
| Ofer Rosenberg, AMD<br> |
| Brian Sumner, AMD<br> |
| Victor Odintsov, AMD<br> |
| Aaftab Munshi, Apple<br> |
| Abe Stephens, Apple<br> |
| Alexandre Namaan, Apple<br> |
| Anna Tikhonova, Apple<br> |
| Chendi Zhang, Apple<br> |
| Eric Bainville, Apple<br> |
| David Hayward, Apple<br> |
| Giridhar Murthy, Apple<br> |
| Ian Ollmann, Apple<br> |
| Inam Rahman, Apple<br> |
| James Shearer, Apple<br> |
| MonPing Wang, Apple<br> |
| Tanya Lattner, Apple<br> |
| Mikael Bourges-Sevenier, Aptina<br> |
| Anton Lokhmotov, ARM<br> |
| Dave Shreiner, ARM<br> |
| Hedley Francis, ARM<br> |
| Robert Elliott, ARM<br> |
| Scott Moyers, ARM<br> |
| Tom Olson, ARM<br> |
| Anastasia Stulova, ARM<br> |
| Christopher Thompson-Walsh, Broadcom<br> |
| Holger Waechtler, Broadcom<br> |
| Norman Rink, Broadcom<br> |
| Andrew Richards, Codeplay<br> |
| Maria Rovatsou, Codeplay<br> |
| Alistair Donaldson, Codeplay<br> |
| Alastair Murray, Codeplay<br> |
| Stephen Frye, Electronic Arts<br> |
| Eric Schenk, Electronic Arts<br> |
| Daniel Laroche, Freescale<br> |
| David Neto, Google<br> |
| Robin Grosman, Huawei<br> |
| Craig Davies, Huawei<br> |
| Brian Horton, IBM<br> |
| Brian Watt, IBM<br> |
| Gordon Fossum, IBM<br> |
| Greg Bellows, IBM<br> |
| Joaquin Madruga, IBM<br> |
| Mark Nutter, IBM<br> |
| Mike Perks, IBM<br> |
| Sean Wagner, IBM<br> |
| Jon Parr, Imagination Technologies<br> |
| Robert Quill, Imagination Technologies<br> |
James McCarthy, Imagination Technologies<br>
| Aaron Kunze, Intel<br> |
| Aaron Lefohn, Intel<br> |
| Adam Lake, Intel<br> |
| Alexey Bader, Intel<br> |
| Allen Hux, Intel<br> |
| Andrew Brownsword, Intel<br> |
| Andrew Lauritzen, Intel<br> |
| Bartosz Sochacki, Intel<br> |
| Ben Ashbaugh, Intel<br> |
| Brian Lewis, Intel<br> |
| Geoff Berry, Intel<br> |
| Hong Jiang, Intel<br> |
| Jayanth Rao, Intel<br> |
| Josh Fryman, Intel<br> |
| Larry Seiler, Intel<br> |
| Mike MacPherson, Intel<br> |
| Murali Sundaresan, Intel<br> |
| Paul Lalonde, Intel<br> |
| Raun Krisch, Intel<br> |
| Stephen Junkins, Intel<br> |
| Tim Foley, Intel<br> |
| Timothy Mattson, Intel<br> |
| Yariv Aridor, Intel<br> |
| Michael Kinsner, Intel<br> |
| Kevin Stevens, Intel<br> |
| Jon Leech, Khronos<br> |
| Benjamin Bergen, Los Alamos National Laboratory<br> |
| Roy Ju, Mediatek<br> |
| Bor-Sung Liang, Mediatek<br> |
| Rahul Agarwal, Mediatek<br> |
| Michal Witaszek, Mobica<br> |
| JenqKuen Lee, NTHU<br> |
| Amit Rao, NVIDIA<br> |
| Ashish Srivastava, NVIDIA<br> |
| Bastiaan Aarts, NVIDIA<br> |
| Chris Cameron, NVIDIA<br> |
| Christopher Lamb, NVIDIA<br> |
| Dibyapran Sanyal, NVIDIA<br> |
| Guatam Chakrabarti, NVIDIA<br> |
| Ian Buck, NVIDIA<br> |
| Jaydeep Marathe, NVIDIA<br> |
| Jian-Zhong Wang, NVIDIA<br> |
| Karthik Raghavan Ravi, NVIDIA<br> |
| Kedar Patil, NVIDIA<br> |
| Manjunath Kudlur, NVIDIA<br> |
| Mark Harris, NVIDIA<br> |
| Michael Gold, NVIDIA<br> |
| Neil Trevett, NVIDIA<br> |
| Richard Johnson, NVIDIA<br> |
| Sean Lee, NVIDIA<br> |
| Tushar Kashalikar, NVIDIA<br> |
| Vinod Grover, NVIDIA<br> |
| Xiangyun Kong, NVIDIA<br> |
| Yogesh Kini, NVIDIA<br> |
| Yuan Lin, NVIDIA<br> |
| Mayuresh Pise, NVIDIA<br> |
| Allan Tzeng, QUALCOMM<br> |
| Alex Bourd, QUALCOMM<br> |
| Anirudh Acharya, QUALCOMM<br> |
| Andrew Gruber, QUALCOMM<br> |
| Andrzej Mamona, QUALCOMM<br> |
| Benedict Gaster, QUALCOMM<br> |
| Bill Torzewski, QUALCOMM<br> |
| Bob Rychlik, QUALCOMM<br> |
| Chihong Zhang, QUALCOMM<br> |
| Chris Mei, QUALCOMM<br> |
| Colin Sharp, QUALCOMM<br> |
| David Garcia, QUALCOMM<br> |
| David Ligon, QUALCOMM<br> |
| Jay Yun, QUALCOMM<br> |
| Lee Howes, QUALCOMM<br> |
| Richard Ruigrok, QUALCOMM<br> |
| Robert J. Simpson, QUALCOMM<br> |
| Sumesh Udayakumaran, QUALCOMM<br> |
| Vineet Goel, QUALCOMM<br> |
| Lihan Bin, QUALCOMM<br> |
| Vlad Shimanskiy, QUALCOMM<br> |
| Jian Liu, QUALCOMM<br> |
| Tasneem Brutch, Samsung<br> |
| Yoonseo Choi, Samsung<br> |
| Dennis Adams, Sony<br> |
Pär-Anders Aronsson, Sony<br>
| Jim Rasmusson, Sony<br> |
| Thierry Lepley, STMicroelectronics<br> |
| Anton Gorenko, StreamComputing<br> |
| Jakub Szuppe, StreamComputing<br> |
| Vincent Hindriksen, StreamComputing<br> |
| Alan Ward, Texas Instruments<br> |
| Yuan Zhao, Texas Instruments<br> |
| Pete Curry, Texas Instruments<br> |
| Simon McIntosh-Smith, University of Bristol<br> |
| James Price, University of Bristol<br> |
| Paul Preney, University of Windsor<br> |
| Shane Peelar, University of Windsor<br> |
| Brian Hutsell, Vivante<br> |
| Mike Cai, Vivante<br> |
| Sumeet Kumar, Vivante<br> |
| Wei-Lun Kao, Vivante<br> |
| Xing Wang, Vivante<br> |
| Jeff Fifield, Xilinx<br> |
| Hem C. Neema, Xilinx<br> |
| Henry Styles, Xilinx<br> |
| Ralph Wittig, Xilinx<br> |
| Ronan Keryell, Xilinx<br> |
| AJ Guillon, YetiWare Inc<br></p></div> |
| <div style="page-break-after:always"></div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_introduction">1. Introduction</h2> |
| <div class="sectionbody"> |
| <div class="paragraph"><p>Modern processor architectures have embraced parallelism as an important |
| pathway to increased performance. Facing technical challenges with |
| higher clock speeds in a fixed power envelope, Central Processing Units |
| (CPUs) now improve performance by adding multiple cores. Graphics |
| Processing Units (GPUs) have also evolved from fixed function rendering |
devices into programmable parallel processors. As today's computer
| systems often include highly parallel CPUs, GPUs and other types of |
| processors, it is important to enable software developers to take full |
| advantage of these heterogeneous processing platforms. |
| <br> |
| <br> |
| Creating applications for heterogeneous parallel processing platforms is |
| challenging as traditional programming approaches for multi-core CPUs |
| and GPUs are very different. CPU-based parallel programming models are |
| typically based on standards but usually assume a shared address space |
| and do not encompass vector operations. General purpose GPU |
| programming models address complex memory hierarchies and vector |
| operations but are traditionally platform-, vendor- or |
| hardware-specific. These limitations make it difficult for a developer |
| to access the compute power of heterogeneous CPUs, GPUs and other types |
| of processors from a single, multi-platform source code base. More than |
| ever, there is a need to enable software developers to effectively take |
| full advantage of heterogeneous processing platforms from high |
| performance compute servers, through desktop computer systems to |
| handheld devices - that include a diverse mix of parallel CPUs, GPUs and |
| other processors such as DSPs and the Cell/B.E. processor. |
| <br> |
| <br> |
| <strong>OpenCL</strong> (Open Computing Language) is an open royalty-free standard for |
| general purpose parallel programming across CPUs, GPUs and other |
| processors, giving software developers portable and efficient access to |
| the power of these heterogeneous processing platforms. |
| <br> |
| <br> |
| OpenCL supports a wide range of applications, ranging from embedded and |
| consumer software to HPC solutions, through a low-level, |
| high-performance, portable abstraction. By creating an efficient, |
| close-to-the-metal programming interface, OpenCL will form the |
| foundation layer of a parallel computing ecosystem of |
| platform-independent tools, middleware and applications. OpenCL is |
| particularly suited to play an increasingly significant role in emerging |
| interactive graphics applications that combine general parallel compute |
| algorithms with graphics rendering pipelines. |
| <br> |
| <br> |
| OpenCL consists of an API for coordinating parallel computation across |
| heterogeneous processors; and a cross-platform intermediate language |
| with a well-specified computation environment. The OpenCL standard:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| Supports both data- and |
| task-based parallel programming models |
| </p> |
| </li> |
| <li> |
| <p> |
| Utilizes a portable and |
| self-contained intermediate representation with support for parallel |
| execution |
| </p> |
| </li> |
| <li> |
| <p> |
| Defines consistent |
| numerical requirements based on IEEE 754 |
| </p> |
| </li> |
| <li> |
| <p> |
| Defines a configuration |
| profile for handheld and embedded devices |
| </p> |
| </li> |
| <li> |
| <p> |
| Efficiently interoperates |
| with OpenGL, OpenGL ES and other graphics APIs |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>This document begins with an overview of basic concepts and the |
| architecture of OpenCL, followed by a detailed description of its |
| execution model, memory model and synchronization support. It then |
discusses the OpenCL platform and runtime API. Some examples are given
| that describe sample compute use-cases and how they would be written in |
| OpenCL. The specification is divided into a core specification that any |
| OpenCL compliant implementation must support; a handheld/embedded |
| profile which relaxes the OpenCL compliance requirements for handheld |
| and embedded devices; and a set of optional extensions that are likely |
| to move into the core specification in later revisions of the OpenCL |
| specification.</p></div> |
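<div class="paragraph"><p>As an informal illustration of the host-side API described above, the
following minimal C sketch discovers a platform and device, creates a context
and command-queue, builds a trivial kernel from source and runs it over 1024
work-items. It is only a sketch: the kernel name and source are hypothetical
and all error handling is omitted.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>#include &lt;CL/cl.h&gt;   /* OpenCL host API header */
#include &lt;stdio.h&gt;

/* Hypothetical kernel source: scales each element of a buffer. */
static const char *src =
    "kernel void scale(global float *x, float a) {\n"
    "    size_t i = get_global_id(0);\n"
    "    x[i] = a * x[i];\n"
    "}\n";

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &amp;platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &amp;device, NULL);

    /* A context groups the devices; a command-queue feeds one device. */
    cl_context ctx = clCreateContext(NULL, 1, &amp;device, NULL, NULL, NULL);
    cl_command_queue queue =
        clCreateCommandQueueWithProperties(ctx, device, NULL, NULL);

    float data[1024];
    for (int i = 0; i &lt; 1024; ++i) data[i] = (float)i;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(data), data, NULL);

    /* Build the program and create a kernel object from it. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &amp;src, NULL, NULL);
    clBuildProgram(prog, 1, &amp;device, "", NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", NULL);

    float a = 2.0f;
    clSetKernelArg(k, 0, sizeof(cl_mem), &amp;buf);
    clSetKernelArg(k, 1, sizeof(float), &amp;a);

    /* Enqueue the kernel over a 1-dimensional NDRange, then read back. */
    size_t global = 1024;
    clEnqueueNDRangeKernel(queue, k, 1, NULL, &amp;global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data,
                        0, NULL, NULL);
    printf("data[1] = %f\n", data[1]);
    return 0;
}</code></pre>
</div></div>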
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_glossary">2. Glossary</h2> |
| <div class="sectionbody"> |
| <div class="paragraph"><p><strong>Application</strong>: The combination of the program running on the host and |
| OpenCL devices. |
| <br> |
| <br> |
| <strong>Acquire semantics</strong>: One of the memory order semantics defined for |
| synchronization operations. Acquire semantics apply to atomic |
| operations that load from memory. Given two units of execution, <strong>A</strong> and |
| <strong>B</strong>, acting on a shared atomic object <strong>M</strong>, if <strong>A</strong> uses an atomic load of |
| <strong>M</strong> with acquire semantics to synchronize-with an atomic store to <strong>M</strong> by |
| <strong>B</strong> that used release semantics, then <strong>A</strong>'s atomic load will occur before |
| any subsequent operations by <strong>A</strong>. Note that the memory orders |
| <em>release</em>, <em>sequentially consistent</em>, and <em>acquire_release</em> all include |
| <em>release semantics</em> and effectively pair with a load using acquire |
| semantics. |
| <br> |
| <br> |
| <strong>Acquire release semantics</strong>: A memory order semantics for |
| synchronization operations (such as atomic operations) that has the |
| properties of both acquire and release memory orders. It is used with |
| read-modify-write operations. |
| <br> |
| <br> |
| <strong>Atomic operations</strong>: Operations that at any point, and from any |
| perspective, have either occurred completely, or not at all. Memory |
| orders associated with atomic operations may constrain the visibility of |
| loads and stores with respect to the atomic operations (see <em>relaxed |
| semantics</em>, <em>acquire semantics</em>, <em>release semantics</em> or <em>acquire release |
| semantics</em>). |
| <br> |
| <br> |
| <strong>Blocking and Non-Blocking Enqueue API calls</strong>: A <em>non-blocking enqueue |
| API call</em> places a <em>command</em> on a <em>command-queue</em> and returns |
| immediately to the host. The <em>blocking-mode enqueue API calls</em> do not |
| return to the host until the command has completed. |
| <br> |
| <br> |
<strong>Barrier</strong>: There are three types of <em>barriers</em>: a command-queue barrier,
| a work-group barrier and a sub-group barrier.</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| The OpenCL API provides a |
| function to enqueue a <em>command-queue</em> <em>barrier</em> command. This <em>barrier</em> |
| command ensures that all previously enqueued commands to a command-queue |
| have finished execution before any following <em>commands</em> enqueued in the |
| <em>command-queue</em> can begin execution. |
| </p> |
| </li> |
| <li> |
| <p> |
| The OpenCL kernel |
| execution model provides built-in <em>work-group barrier</em> functionality. |
| This <em>barrier</em> built-in function can be used by a <em>kernel</em> executing on |
| a <em>device</em> to perform synchronization between <em>work-items</em> in a |
| <em>work-group</em> executing the <em>kernel</em>. All the <em>work-items</em> of a |
| <em>work-group</em> must execute the <em>barrier</em> construct before any are allowed |
| to continue execution beyond the <em>barrier</em>. |
| </p> |
| </li> |
| <li> |
| <p> |
| The OpenCL kernel |
| execution model provides built-in <em>sub-group barrier</em> functionality. |
| This <em>barrier</em> built-in function can be used by a <em>kernel</em> executing on |
| a <em>device</em> to perform synchronization between <em>work-items</em> in a |
| <em>sub-group</em> executing the <em>kernel</em>. All the <em>work-items</em> of a |
| <em>sub-group</em> must execute the <em>barrier</em> construct before any are allowed |
| to continue execution beyond the <em>barrier</em>. |
| </p> |
| </li> |
| </ul></div> |
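<div class="paragraph"><p>The following OpenCL C fragment is a purely illustrative (hypothetical)
kernel showing the work-group barrier described above: every work-item stores
one element into local memory, and no work-item reads past the barrier until
all work-items in its work-group have reached it.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>/* Illustrative only: rotate elements within each work-group. */
kernel void rotate_left(global const float *in,
                        global float *out,
                        local float *tmp)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    tmp[lid] = in[gid];

    /* Work-group barrier: no work-item continues until every work-item in
     * the work-group has reached it; the local stores above become visible. */
    work_group_barrier(CLK_LOCAL_MEM_FENCE);

    out[gid] = tmp[(lid + 1) % lsz];
}</code></pre>
</div></div>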
| <div class="paragraph"><p><strong>Buffer Object</strong>: A memory object that stores a linear collection of |
| bytes. Buffer objects are accessible using a pointer in a <em>kernel</em> |
| executing on a <em>device</em>. Buffer objects can be manipulated by the host |
| using OpenCL API calls. A <em>buffer object</em> encapsulates the following |
| information:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| Size in bytes. |
| </p> |
| </li> |
| <li> |
| <p> |
| Properties that describe |
| usage information and which region to allocate from. |
| </p> |
| </li> |
| <li> |
| <p> |
| Buffer data. |
| </p> |
| </li> |
| </ul></div> |
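<div class="paragraph"><p>As an informal host-side sketch (hypothetical helper, error handling
omitted), the fragment below creates a buffer object of <em>n</em> floats whose
initial data is copied from a host pointer; a kernel would then see this
object as an ordinary pointer argument such as <em>global float *data</em>.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>#include &lt;CL/cl.h&gt;

/* Illustrative only: create a read-write buffer initialized from host data. */
cl_mem make_buffer(cl_context ctx, const float *host_data, size_t n)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx,
                                CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float),    /* size in bytes       */
                                (void *)host_data,    /* initial buffer data */
                                &amp;err);
    return buf;
}</code></pre>
</div></div>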
| <div class="paragraph"><p><strong>Built-in Kernel</strong>: A <em>built-in kernel</em> is a <em>kernel</em> that is executed on |
| an OpenCL <em>device</em> or <em>custom device</em> by fixed-function hardware or in |
| firmware. <em>Applications</em> can query the <em>built-in kernels</em> supported by |
| a <em>device</em> or <em>custom device</em>. A <em>program object</em> can only contain |
| <em>kernels</em> written in OpenCL C or <em>built-in kernels</em> but not both. See |
| also <em>Kernel</em> and <em>Program</em>. |
| <br> |
| <br> |
| <strong>Child kernel</strong>: see <em>device-side enqueue.</em> |
| <br> |
| <br> |
| <strong>Command</strong>: The OpenCL operations that are submitted to a <em>command-queue</em> |
| for execution. For example, OpenCL commands issue kernels for execution |
| on a compute device, manipulate memory objects, etc. |
| <br> |
| <br> |
| <strong>Command-queue</strong>: An object that holds <em>commands</em> that will be executed on |
| a specific <em>device</em>. The <em>command-queue</em> is created on a specific |
| <em>device</em> in a <em>context</em>. <em>Commands</em> to a <em>command-queue</em> are queued |
in-order but may be executed in-order or out-of-order. Refer to
<em>In-order Execution</em> and <em>Out-of-order Execution</em>.
| <br> |
| <br> |
| <strong>Command-queue Barrier</strong>. See <em>Barrier</em>. |
| <br> |
| <br> |
| <strong>Command synchronization</strong>: Constraints on the order that commands are |
| launched for execution on a device defined in terms of the |
| synchronization points that occur between commands in host |
| command-queues and between commands in device-side command-queues. See |
| <em>synchronization points</em>. |
| <br> |
| <br> |
| <strong>Complete</strong>: The final state in the six state model for the execution of |
a command. The transition into this state is signaled through
| event objects or callback functions associated with a command. |
| <br> |
| <br> |
| <strong>Compute Device Memory</strong>: This refers to one or more memories attached |
| to the compute device. |
| <br> |
| <br> |
| <strong>Compute Unit</strong>: An OpenCL <em>device</em> has one or more <em>compute units</em>. A |
| <em>work-group</em> executes on a single <em>compute unit</em>. A <em>compute unit</em> is |
| composed of one or more <em>processing elements</em> and <em>local memory</em>. A |
| <em>compute unit</em> may also include dedicated texture filter units that can |
| be accessed by its processing elements. |
| <br> |
| <br> |
| <strong>Concurrency</strong>: A property of a system in which a set of tasks in a system |
| can remain active and make progress at the same time. To utilize |
| concurrent execution when running a program, a programmer must identify |
| the concurrency in their problem, expose it within the source code, and |
| then exploit it using a notation that supports concurrency. |
| <br> |
| <br> |
| <strong>Constant Memory</strong>: A region of <em>global memory</em> that remains constant |
| during the execution of a <em>kernel</em>. The <em>host</em> allocates and |
| initializes memory objects placed into <em>constant memory</em>.</p></div> |
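<div class="paragraph"><p>The command-queue behavior described above can be illustrated with a short,
non-normative host-side C sketch (hypothetical helper, error handling
omitted): the first queue uses the default in-order execution, while the
second requests out-of-order execution via its queue properties.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>#include &lt;CL/cl.h&gt;

/* Illustrative only: create an in-order and an out-of-order command-queue. */
void make_queues(cl_context ctx, cl_device_id dev)
{
    cl_int err;

    /* Default properties: commands execute in order of submission. */
    cl_command_queue in_order =
        clCreateCommandQueueWithProperties(ctx, dev, NULL, &amp;err);

    /* Out-of-order execution: ordering is constrained only by event
     * wait lists and barrier commands. */
    cl_queue_properties props[] = {
        CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
        0
    };
    cl_command_queue out_of_order =
        clCreateCommandQueueWithProperties(ctx, dev, props, &amp;err);

    clReleaseCommandQueue(in_order);
    clReleaseCommandQueue(out_of_order);
}</code></pre>
</div></div>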
| <div class="paragraph"><p><strong>Context</strong>: The environment within which the kernels execute and the |
| domain in which synchronization and memory management is defined. The |
| <em>context</em> includes a set of <em>devices</em>, the memory accessible to those |
| <em>devices</em>, the corresponding memory properties and one or more |
| <em>command-queues</em> used to schedule execution of a <em>kernel(s)</em> or |
| operations on <em>memory objects</em>. |
| <br> |
| <br> |
| <strong>Control flow</strong>: The flow of instructions executed by a work-item. |
| Multiple logically related work items may or may not execute the same |
| control flow. The control flow is said to be <em>converged</em> if all the |
work-items in the set execute the same stream of instructions. In a
| <em>diverged</em> control flow, the work-items in the set execute different |
| instructions. At a later point, if a diverged control flow becomes |
| converged, it is said to be a re-converged control flow. |
| <br> |
| <br> |
| <strong>Converged control flow</strong>: see <strong>control flow</strong>. |
| <br> |
| <br> |
| <strong>Custom Device</strong>: An OpenCL <em>device</em> that fully implements the OpenCL |
Runtime but does not support <em>programs</em> written in OpenCL C. A custom
device may be specialized non-programmable hardware that is very power
efficient and performant for directed tasks or hardware with limited
programmable capabilities such as specialized DSPs. Custom devices are
not OpenCL conformant. Custom devices may support an online compiler.
Programs for custom devices can be created using the OpenCL runtime APIs
that allow OpenCL programs to be created from source (if an online
compiler is supported) and/or binary, or from <em>built-in
kernels</em> supported by the <em>device</em>. See also <em>Device</em>.
| <br> |
| <br> |
| <strong>Data Parallel Programming Model</strong>: Traditionally, this term refers to a |
| programming model where concurrency is expressed as instructions from a |
| single program applied to multiple elements within a set of data |
structures. The term has been generalized in OpenCL to refer to a model
wherein a set of instructions from a single program are applied
| concurrently to each point within an abstract domain of indices. |
| <br> |
| <br> |
| <strong>Data race</strong>: The execution of a program contains a data race if it |
| contains two actions in different work items or host threads where (1) |
| one action modifies a memory location and the other action reads or |
| modifies the same memory location, and (2) at least one of these actions |
| is not atomic, or the corresponding memory scopes are not inclusive, and |
| (3) the actions are global actions unordered by the |
| global-happens-before relation or are local actions unordered by the |
| local-happens before relation. |
| <br> |
| <br> |
<strong>Deprecation</strong>: Existing features are marked as deprecated if their usage is not recommended, as that feature is being de-emphasized, superseded and may be removed from a future version of the specification.
| <br> |
| <br> |
| <strong>Device</strong>: A <em>device</em> is a collection of <em>compute units</em>. A |
| <em>command-queue</em> is used to queue <em>commands</em> to a <em>device</em>. Examples of |
| <em>commands</em> include executing <em>kernels</em>, or reading and writing <em>memory |
| objects</em>. OpenCL devices typically correspond to a GPU, a multi-core |
| CPU, and other processors such as DSPs and the Cell/B.E. processor. |
| <br> |
| <br> |
| <strong>Device-side enqueue</strong>: A mechanism whereby a kernel-instance is enqueued |
| by a kernel-instance running on a device without direct involvement by |
| the host program. This produces <em>nested parallelism</em>; i.e. additional |
| levels of concurrency are nested inside a running kernel-instance. The |
| kernel-instance executing on a device (the <em>parent kernel</em>) enqueues a |
| kernel-instance (the <em>child kernel</em>) to a device-side command queue. |
| Child and parent kernels execute asynchronously though a parent kernel |
| does not complete until all of its child-kernels have completed. |
| <br> |
| <br> |
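</p></div>
<div class="paragraph"><p>As an informal illustration of device-side enqueue, the hypothetical OpenCL C
fragment below has a parent kernel enqueue a child kernel (expressed as a
block) to the default device-side queue; it assumes the host has created a
default on-device command-queue.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>/* Illustrative only: one work-item of the parent enqueues a child kernel
 * over n work-items. The parent kernel does not complete until the child
 * kernel has completed. */
kernel void parent(global float *data, int n)
{
    if (get_global_id(0) == 0) {
        enqueue_kernel(get_default_queue(),
                       CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                       ndrange_1D(n),
                       ^{ data[get_global_id(0)] += 1.0f; });
    }
}</code></pre>
</div></div>
<div class="paragraph"><p>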
| <strong>Diverged control flow</strong>: see <em>control flow</em>. |
| <br> |
| <br> |
| <strong>Ended</strong>: The fifth state in the six state model for the execution of a |
| command. The transition into this state occurs when execution of a |
| command has ended. When a Kernel-enqueue command ends, all of the |
| work-groups associated with that command have finished their execution. |
| <br> |
| <br> |
<strong>Event Object</strong>: An <em>event object</em> encapsulates the status of an
operation such as a <em>command</em>. It can be used to synchronize operations
| in a context. |
| <br> |
| <br> |
| <strong>Event Wait List</strong>: An <em>event wait list</em> is a list of <em>event objects</em> that |
| can be used to control when a particular <em>command</em> begins execution. |
| <br> |
| <br> |
| <strong>Fence</strong>: A memory ordering operation without an associated atomic |
| object. A fence can use the <em>acquire semantics, release semantics</em>, or |
| <em>acquire release semantics</em>. |
| <br> |
| <br> |
| <strong>Framework</strong>: A software system that contains the set of components to |
| support software development and execution. A <em>framework</em> typically |
| includes libraries, APIs, runtime systems, compilers, etc. |
| <br> |
| <br> |
<strong>Generic address space</strong>: An address space that includes the <em>private</em>,
| <em>local</em>, and <em>global</em> address spaces available to a device. The generic |
| address space supports conversion of pointers to and from private, local |
| and global address spaces, and hence lets a programmer write a single |
| function that at compile time can take arguments from any of the three |
| named address spaces. |
| <br> |
| <br> |
| <strong>Global Happens before</strong>: see <em>happens before</em>. |
| <br> |
| <br> |
| <strong>Global ID</strong>: A <em>global ID</em> is used to uniquely identify a <em>work-item</em> and |
| is derived from the number of <em>global work-items</em> specified when |
| executing a <em>kernel</em>. The <em>global ID</em> is a N-dimensional value that |
| starts at (0, 0, 0). See also <em>Local ID</em>. |
| <br> |
| <br> |
| <strong>Global Memory</strong>: A memory region accessible to all <em>work-items</em> executing |
| in a <em>context</em>. It is accessible to the <em>host</em> using <em>commands</em> such as |
| read, write and map. <em>Global memory</em> is included within the <em>generic |
| address space</em> that includes the private and local address spaces. |
| <br> |
| <br> |
| <strong>GL share group</strong>: A <em>GL share group</em> object manages shared OpenGL or |
| OpenGL ES resources |
| such as textures, buffers, framebuffers, and renderbuffers and is |
| associated with one or more GL context objects. The <em>GL share group</em> is |
| typically an opaque object and not directly accessible. |
| <br> |
| <br> |
| <strong>Handle</strong>: An opaque type that references an <em>object</em> allocated by |
| OpenCL. Any operation on an <em>object</em> occurs by reference to that |
object's handle.
| <br> |
| <br> |
| <strong>Happens before</strong>: An ordering relationship between operations that |
| execute on multiple units of execution. If an operation A happens-before |
| operation B then A must occur before B; in particular, any value written |
by A will be visible to B. We define two separate happens-before
| relations: <em>global-happens-before</em> and <em>local-happens-before</em>. These are |
| defined in section 3.3.6. |
| <br> |
| <br> |
| <strong>Host</strong>: The <em>host</em> interacts with the <em>context</em> using the OpenCL API. |
| <br> |
| <br> |
| <strong>Host-thread</strong>: the unit of execution that executes the statements in the |
| Host program. |
| <br> |
| <br> |
| <strong>Host pointer</strong>: A pointer to memory that is in the virtual address space |
| on the <em>host</em>. |
| <br> |
| <br> |
| <strong>Illegal</strong>: Behavior of a system that is explicitly not allowed and will |
| be reported as an error when encountered by OpenCL. |
| <br> |
| <br> |
<strong>Image Object</strong>: A <em>memory object</em> that stores a two- or
three-dimensional structured array. Image data can only be accessed with read
| and write functions. The read functions use a <em>sampler</em>. |
| <br> |
| <br> |
| The <em>image object</em> encapsulates the following information:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| Dimensions of the image. |
| </p> |
| </li> |
| <li> |
| <p> |
| Description of each |
| element in the image. |
| </p> |
| </li> |
| <li> |
| <p> |
| Properties that describe |
| usage information and which region to allocate from. |
| </p> |
| </li> |
| <li> |
| <p> |
| Image data. |
| </p> |
| </li> |
| </ul></div> |
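<div class="paragraph"><p>The following hypothetical OpenCL C fragment illustrates the entry above:
image data is accessed only through the built-in read and write functions,
and the read goes through a <em>sampler</em> declared at program scope.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>/* Illustrative only: invert a 2D image, one pixel per work-item. */
constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                         CLK_ADDRESS_CLAMP_TO_EDGE |
                         CLK_FILTER_NEAREST;

kernel void invert(read_only image2d_t src, write_only image2d_t dst)
{
    int2 pos = (int2)((int)get_global_id(0), (int)get_global_id(1));
    float4 px = read_imagef(src, smp, pos);           /* read via sampler */
    write_imagef(dst, pos, (float4)(1.0f) - px);      /* write result     */
}</code></pre>
</div></div>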
| <div class="paragraph"><p>The elements of an image are selected from a list of predefined image |
| formats. |
| <br> |
| <br> |
| <strong>Implementation Defined</strong>: Behavior that is explicitly allowed to vary |
| between conforming implementations of OpenCL. An OpenCL implementor is |
| required to document the implementation-defined behavior. |
| <br> |
| <br> |
| <strong>Independent Forward Progress</strong>: If an entity supports independent forward |
| progress, then if it is otherwise not dependent on any actions due to be |
| performed by any other entity (for example it does not wait on a lock |
| held by, and thus that must be released by, any other entity), then its |
| execution cannot be blocked by the execution of any other entity in the |
| system (it will not be starved). Work items in a subgroup, for example, |
| typically do not support independent forward progress, so one work item |
| in a subgroup may be completely blocked (starved) if a different work |
| item in the same subgroup enters a spin loop. |
| <br> |
| <br> |
| <strong>In-order Execution</strong>: A model of execution in OpenCL where the <em>commands</em> |
in a <em>command-queue</em> are executed in order of submission with each
<em>command</em> running to completion before the next one begins. See
<em>Out-of-order Execution</em>.
| <br> |
| <br> |
| <strong>Intermediate Language</strong>: A lower-level language that may be used to |
| create programs. SPIR-V is a required IL for OpenCL 2.2 runtimes. |
| Additional ILs may be accepted on an implementation-defined basis. |
| <br> |
| <br> |
| <strong>Kernel</strong>: A <em>kernel</em> is a function declared in a <em>program</em> and executed |
on an OpenCL <em>device</em>. A <em>kernel</em> is identified by the __kernel or
kernel qualifier applied to any function defined in a <em>program</em>.
| <br> |
| <br> |
| <strong>Kernel-instance</strong>: The work carried out by an OpenCL program occurs |
| through the execution of kernel-instances on devices. The kernel |
| instance is the <em>kernel object</em>, the values associated with the |
| arguments to the kernel, and the parameters that define the <em>NDRange</em> |
| index space. |
| <br> |
| <br> |
<strong>Kernel Object</strong>: A <em>kernel object</em> encapsulates a specific <em>kernel</em>
function declared in a <em>program</em> and the argument values to be used when
executing this <em>kernel</em> function.
| <br> |
| <br> |
<strong>Kernel Language</strong>: A language that is used to create source code for kernels.
Supported kernel languages include OpenCL C, OpenCL C++, and the OpenCL dialect of SPIR-V.
| <br> |
| <br> |
| <strong>Launch</strong>: The transition of a command from the <em>submitted</em> state to the |
| <em>ready</em> state. See <em>Ready</em>. |
| <br> |
| <br> |
| <strong>Local ID</strong>: A <em>local ID</em> specifies a unique <em>work-item ID</em> within a given |
| <em>work-group</em> that is executing a <em>kernel</em>. The <em>local ID</em> is a |
| N-dimensional value that starts at (0, 0, 0). See also <em>Global ID</em>. |
| <br> |
| <br> |
| <strong>Local Memory</strong>: A memory region associated with a <em>work-group</em> and |
| accessible only by <em>work-items</em> in that <em>work-group</em>. <em>Local memory</em> is |
| included within the <em>generic address space</em> that includes the private |
| and global address spaces. |
| <br> |
| <br> |
| <strong>Marker</strong>: A <em>command</em> queued in a <em>command-queue</em> that can be used to |
| tag all <em>commands</em> queued before the <em>marker</em> in the <em>command-queue</em>. |
| The <em>marker</em> command returns an <em>event</em> which can be used by the |
| <em>application</em> to queue a wait on the marker event i.e. wait for all |
| commands queued before the <em>marker</em> command to complete. |
| <br> |
| <br> |
| <strong>Memory Consistency Model</strong>: Rules that define which values are observed |
| when multiple units of execution load data from any shared memory plus |
| the synchronization operations that constrain the order of memory |
| operations and define synchronization relationships. The memory |
| consistency model in OpenCL is based on the memory model from the ISO |
| C11 programming language. |
| <br> |
| <br> |
| <strong>Memory Objects</strong>: A <em>memory object</em> is a handle to a reference counted |
region of <em>global memory</em>. Also see <em>Buffer Object</em> and <em>Image Object</em>.
| <br> |
| <br> |
| <strong>Memory Regions (or Pools)</strong>: A distinct address space in OpenCL. <em>Memory |
| regions</em> may overlap in physical memory though OpenCL will treat them as |
| logically distinct. The <em>memory regions</em> are denoted as <em>private</em>, |
| <em>local</em>, <em>constant,</em> and <em>global</em>. |
| <br> |
| <br> |
| <strong>Memory Scopes</strong>: These memory scopes define a hierarchy of visibilities |
| when analyzing the ordering constraints of memory operations. They are |
| defined by the values of the memory_scope enumeration constant. Current |
values are <strong>memory_scope_work_item</strong> (memory constraints only apply to a
single work-item and in practice apply only to image operations),
<strong>memory_scope_sub_group</strong> (memory-ordering constraints only apply to
| work-items executing in a sub-group), <strong>memory_scope_work_group</strong> |
| (memory-ordering constraints only apply to work-items executing in a |
| work-group), <strong>memory_scope_device</strong> (memory-ordering constraints only |
| apply to work-items executing on a single device) and |
| <strong>memory_scope_all_svm_devices</strong> (memory-ordering constraints only apply |
| to work-items executing across multiple devices and when using shared |
| virtual memory). |
| <br> |
| <br> |
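</p></div>
<div class="paragraph"><p>As a purely illustrative (hypothetical) OpenCL C fragment, the following
kernel pairs a release store with an acquire load on an atomic flag, with the
memory-ordering constraint limited to work-items executing on the same device
(<strong>memory_scope_device</strong>). It is a sketch of the semantics only and assumes
both work-items make forward progress.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>kernel void handoff(global atomic_int *flag, global int *payload)
{
    if (get_global_id(0) == 0) {
        payload[0] = 42;
        /* Release store: makes the payload store above visible to any
         * work-item whose acquire load observes the value 1. */
        atomic_store_explicit(flag, 1,
                              memory_order_release,
                              memory_scope_device);
    } else if (get_global_id(0) == 1) {
        /* Acquire load: once it observes 1, payload[0] is visible. */
        while (atomic_load_explicit(flag,
                                    memory_order_acquire,
                                    memory_scope_device) == 0)
            ;
        payload[1] = payload[0];
    }
}</code></pre>
</div></div>
<div class="paragraph"><p>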
<strong>Modification Order</strong>: All modifications to a particular atomic object M
| occur in some particular <strong>total order</strong>, called the <strong>modification |
| order</strong> of M. If A and B are modifications of an atomic object M, and A |
| happens-before B, then A shall precede B in the modification order of M. |
| Note that the modification order of an atomic object M is independent of |
| whether M is in local or global memory. |
| <br> |
| <br> |
| <strong>Nested Parallelism</strong>: See <em>device-side enqueue</em>. |
| <br> |
| <br> |
| <strong>Object</strong>: Objects are abstract representation of the resources that can |
| be manipulated by the OpenCL API. Examples include <em>program objects</em>, |
| <em>kernel objects</em>, and <em>memory objects</em>. |
| <br> |
| <br> |
| <strong>Out-of-Order Execution</strong>: A model of execution in which <em>commands</em> placed |
| in the <em>work queue</em> may begin and complete execution in any order |
consistent with constraints imposed by <em>event wait
lists</em> and <em>command-queue barriers</em>. See <em>In-order Execution</em>.
| <br> |
| <br> |
| <strong>Parent device</strong>: The OpenCL <em>device</em> which is partitioned to create |
<em>sub-devices</em>. Not all <em>parent devices</em> are <em>root devices</em>. A <em>root
| device</em> might be partitioned and the <em>sub-devices</em> partitioned again. |
| In this case, the first set of <em>sub-devices</em> would be <em>parent devices</em> |
| of the second set, but not the <em>root devices</em>. Also see <em>device</em>, |
| <em>parent device</em> and <em>root device</em>. |
| <br> |
| <br> |
| <strong>Parent kernel</strong>: see <em>device-side enqueue</em>. |
| <br> |
| <br> |
| <strong>Pipe</strong>: The <em>pipe</em> memory object conceptually is an ordered sequence of |
| data items. A pipe has two endpoints: a write endpoint into which data |
| items are inserted, and a read endpoint from which data items are |
| removed. At any one time, only one kernel instance may write into a |
| pipe, and only one kernel instance may read from a pipe. To support the |
| producer consumer design pattern, one kernel instance connects to the |
| write endpoint (the producer) while another kernel instance connects to |
| the reading endpoint (the consumer). |
| <br> |
| <br> |
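</p></div>
<div class="paragraph"><p>A hypothetical OpenCL C sketch of the producer-consumer pattern described
above is shown below; it assumes the host has created the pipe object and
passed it to both kernels.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>/* Illustrative only: one kernel instance writes data items into the pipe
 * (the producer) and another reads them out (the consumer). */
kernel void producer(write_only pipe int out_pipe, global const int *src)
{
    int v = src[get_global_id(0)];
    write_pipe(out_pipe, &amp;v);           /* insert one data item */
}

kernel void consumer(read_only pipe int in_pipe, global int *dst)
{
    int v;
    if (read_pipe(in_pipe, &amp;v) == 0)    /* 0 indicates success */
        dst[get_global_id(0)] = v;
}</code></pre>
</div></div>
<div class="paragraph"><p>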
| <strong>Platform</strong>: The <em>host</em> plus a collection of <em>devices</em> managed by the |
| OpenCL <em>framework</em> that allow an application to share <em>resources</em> and |
| execute <em>kernels</em> on <em>devices</em> in the <em>platform</em>. |
| <br> |
| <br> |
| <strong>Private Memory</strong>: A region of memory private to a <em>work-item</em>. Variables |
defined in one <em>work-item's</em> <em>private memory</em> are not visible to another
| <em>work-item</em>. |
| <br> |
| <br> |
| <strong>Processing Element</strong>: A virtual scalar processor. A work-item may |
| execute on one or more processing elements. |
| <br> |
| <br> |
| <strong>Program</strong>: An OpenCL <em>program</em> consists of a set of <em>kernels</em>. |
<em>Programs</em> may also contain auxiliary functions called by the <em>kernel</em>
functions, and constant data.
| <br> |
| <br> |
<strong>Program Object</strong>: A <em>program object</em> encapsulates the following
| information:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| A reference to an |
| associated <em>context</em>. |
| </p> |
| </li> |
| <li> |
| <p> |
| A <em>program</em> source or |
| binary. |
| </p> |
| </li> |
| <li> |
| <p> |
| The latest successfully |
| built program executable, the list of <em>devices</em> for which the program |
| executable is built, the build options used and a build log. |
| </p> |
| </li> |
| <li> |
| <p> |
| The number of <em>kernel |
| objects</em> currently attached. |
| </p> |
| </li> |
| </ul></div> |
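<div class="paragraph"><p>For illustration only, a minimal host-side sketch of creating and building a
program object and attaching a kernel object to it. It assumes an existing
context <span class="monospaced">ctx</span>, device <span class="monospaced">dev</span> and source string <span class="monospaced">src</span>; these names and the
kernel name <span class="monospaced">my_kernel</span> are hypothetical.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>#include &lt;CL/cl.h&gt;
#include &lt;stdlib.h&gt;

/* Assumed to exist already (hypothetical names):
 *   cl_context ctx;  cl_device_id dev;  const char *src;  */
cl_int err;
cl_program prog = clCreateProgramWithSource(ctx, 1, &amp;src, NULL, &amp;err);

/* Build for one device; the build options and build log become part of
 * the program object's state. */
err = clBuildProgram(prog, 1, &amp;dev, "-cl-std=CL2.0", NULL, NULL);

size_t log_size;
clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, 0, NULL, &amp;log_size);
char *log = (char *)malloc(log_size);
clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);

/* Creating a kernel object increments the count of attached kernel objects. */
cl_kernel k = clCreateKernel(prog, "my_kernel", &amp;err);</code></pre>
</div>
</div>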
<div class="paragraph"><p>&#160;</p></div>
| <div class="paragraph"><p><strong>Queued</strong>: The first state in the six state model for the execution of a |
| command. The transition into this state occurs when the command is |
| enqueued into a command-queue. |
| <br> |
| <br> |
| <strong>Ready</strong>: The third state in the six state model for the execution of a |
| command. The transition into this state occurs when pre-requisites |
| constraining execution of a command have been met; i.e. the command has |
| been launched. When a Kernel-enqueue command is launched, work-groups |
associated with the command are placed in a device's work-pool from
| which they are scheduled for execution. |
| <br> |
| <br> |
| <strong>Re-converged Control Flow</strong>: see <em>control flow</em>. |
| <br> |
| <br> |
<strong>Reference Count</strong>: The life span of an OpenCL object is determined by its
<em>reference count</em>, an internal count of the number of references to the
object. When you create an object in OpenCL, its <em>reference count</em> is
| set to one. Subsequent calls to the appropriate <em>retain</em> API (such as |
| clRetainContext, clRetainCommandQueue) increment the <em>reference count</em>. |
| Calls to the appropriate <em>release</em> API (such as clReleaseContext, |
| clReleaseCommandQueue) decrement the <em>reference count</em>. |
| Implementations may also modify the <em>reference count</em>, e.g. to track |
| attached objects or to ensure correct operation of in-progress or |
| scheduled activities. The object becomes inaccessible to host code when |
| the number of <em>release</em> operations performed matches the number of |
| <em>retain</em> operations plus the allocation of the object. At this point the |
| reference count may be zero but this is not guaranteed. |
| <br> |
| <br> |
| <strong>Relaxed Consistency</strong>: A memory consistency model in which the contents |
| of memory visible to different <em>work-items</em> or <em>commands</em> may be |
| different except at a <em>barrier</em> or other explicit synchronization |
| points. |
| <br> |
| <br> |
| <strong>Relaxed Semantics</strong>: A memory order semantics for atomic operations that |
| implies no order constraints. The operation is <em>atomic</em> but it has no |
| impact on the order of memory operations. |
| <br> |
| <br> |
| <strong>Release Semantics</strong>: One of the memory order semantics defined for |
| synchronization operations. Release semantics apply to atomic |
| operations that store to memory. Given two units of execution, <strong>A</strong> and |
| <strong>B</strong>, acting on a shared atomic object <strong>M</strong>, if <strong>A</strong> uses an atomic store |
| of <strong>M</strong> with release semantics to synchronize-with an atomic load to <strong>M</strong> |
by <strong>B</strong> that used acquire semantics, then <strong>A</strong>'s atomic store will occur
<em>after</em> any prior operations by <strong>A</strong>. Note that the memory orders
<em>acquire</em>, <em>sequentially consistent</em>, and <em>acquire_release</em> all include
| <em>acquire semantics</em> and effectively pair with a store using release |
| semantics. |
| <br> |
| <br> |
| <strong>Remainder work-groups</strong>: When the work-groups associated with a |
| kernel-instance are defined, the sizes of a work-group in each dimension |
| may not evenly divide the size of the NDRange in the corresponding |
| dimensions. The result is a collection of work-groups on the boundaries |
| of the NDRange that are smaller than the base work-group size. These are |
| known as <em>remainder work-groups</em>. |
| <br> |
| <br> |
| <strong>Running</strong>: The fourth state in the six state model for the execution of |
| a command. The transition into this state occurs when the execution of |
| the command starts. When a Kernel-enqueue command starts, one or more |
| work-groups associated with the command start to execute. |
| <br> |
| <br> |
| <strong>Root device</strong>: A <em>root device</em> is an OpenCL <em>device</em> that has not been |
| partitioned. Also see <em>device</em>, <em>parent device</em> and <em>root device</em>. |
| <br> |
| <br> |
| <strong>Resource</strong>: A class of <em>objects</em> defined by OpenCL. An instance of a |
| <em>resource</em> is an <em>object</em>. The most common <em>resources</em> are the |
| <em>context</em>, <em>command-queue</em>, <em>program objects</em>, <em>kernel objects</em>, and |
| <em>memory objects</em>. Computational resources are hardware elements that |
| participate in the action of advancing a program counter. Examples |
| include the <em>host</em>, <em>devices</em>, <em>compute units</em> and <em>processing |
| elements</em>. |
| <br> |
| <br> |
<strong>Retain</strong>, Release: The action of incrementing (retain) and decrementing
(release) the reference count of an OpenCL <em>object</em>. This is a
bookkeeping mechanism to make sure the system doesn't remove an <em>object</em>
| before all instances that use this <em>object</em> have finished. Refer to |
| <em>Reference Count</em>. |
| <br> |
| <br> |
| <strong>Sampler</strong>: An <em>object</em> that describes how to sample an image when the |
| image is read in the <em>kernel</em>. The image read functions take a |
| <em>sampler</em> as an argument. The <em>sampler</em> specifies the image |
| addressing-mode i.e. how out-of-range image coordinates are handled, the |
| filter mode, and whether the input image coordinate is a normalized or |
| unnormalized value. |
| <br> |
| <br> |
| <strong>Scope inclusion</strong>: Two actions <strong>A</strong> and <strong>B</strong> are defined to have an |
| inclusive scope if they have the same scope <strong>P</strong> such that: (1) if <strong>P</strong> is |
| memory_scope_sub_group, and <strong>A</strong> and <strong>B</strong> are executed by work-items |
| within the same sub-group, or (2) if <strong>P</strong> is memory_scope_work_group, and |
| <strong>A</strong> and <strong>B</strong> are executed by work-items within the same work-group, or |
| (3) if <strong>P</strong> is memory_scope_device, and <strong>A</strong> and <strong>B</strong> are executed by |
| work-items on the same device, or (4) if <strong>P</strong> is |
memory_scope_all_svm_devices, and <strong>A</strong> and <strong>B</strong> are executed by host
| threads or by work-items on one or more devices that can share SVM |
| memory with each other and the host process. |
| <br> |
| <br> |
| <strong>Sequenced before</strong>: A relation between evaluations executed by a single |
| unit of execution. Sequenced-before is an asymmetric, transitive, |
| pair-wise relation that induces a partial order between evaluations. |
| Given any two evaluations A and B, if A is sequenced-before B, then the |
| execution of A shall precede the execution of B. |
| <br> |
| <br> |
| <strong>Sequential consistency</strong>: Sequential consistency interleaves the steps |
| executed by each unit of execution. Each access to a memory location |
| sees the last assignment to that location in that interleaving. |
| <br> |
| <br> |
| <strong>Sequentially consistent semantics</strong>: One of the memory order semantics |
| defined for synchronization operations. When using |
| sequentially-consistent synchronization operations, the loads and stores |
| within one unit of execution appear to execute in program order (i.e., |
| the sequenced-before order), and loads and stores from different units |
| of execution appear to be simply interleaved. |
| <br> |
| <br> |
| <strong>Shared Virtual Memory (SVM)</strong>: An address space exposed to both the host |
| and the devices within a context. SVM causes addresses to be meaningful |
| between the host and all of the devices within a context and therefore |
| supports the use of pointer based data structures in OpenCL kernels. It |
| logically extends a portion of the global memory into the host address |
| space therefore giving work-items access to the host address space. |
There are three types of SVM in OpenCL. <strong>Coarse-Grained buffer SVM</strong>:
| Sharing occurs at the granularity of regions of OpenCL buffer memory |
| objects. <strong>Fine-Grained buffer SVM</strong>: Sharing occurs at the granularity |
| of individual loads/stores into bytes within OpenCL buffer memory |
| objects. <strong>Fine-Grained system SVM</strong>: Sharing occurs at the granularity of |
| individual loads/stores into bytes occurring anywhere within the host |
| memory. |
| <br> |
| <br> |
| <strong>SIMD</strong>: Single Instruction Multiple Data. A programming model where a |
| <em>kernel</em> is executed concurrently on multiple <em>processing elements</em> each |
| with its own data and a shared program counter. All <em>processing |
| elements</em> execute a strictly identical set of instructions. |
| <br> |
| <br> |
| <strong>Specialization constants</strong>: Specialization is intended for constant |
| objects that will not have known constant values until after initial |
| generation of a SPIR-V module. Such objects are called specialization |
constants. An application might provide values for
the specialization constants that will be used when the SPIR-V program is
built. Specialization constants that do not receive a value from the
application shall use the default value defined in the SPIR-V specification.
| <br> |
| <br> |
| <strong>SPMD</strong>: Single Program Multiple Data. A programming model where a |
| <em>kernel</em> is executed concurrently on multiple <em>processing elements</em> each |
| with its own data and its own program counter. Hence, while all |
| computational resources run the same <em>kernel</em> they maintain their own |
| instruction counter and due to branches in a <em>kernel</em>, the actual |
| sequence of instructions can be quite different across the set of |
| <em>processing elements</em>. |
| <br> |
| <br> |
| <strong>Sub-device</strong>: An OpenCL <em>device</em> can be partitioned into multiple |
| <em>sub-devices</em>. The new <em>sub-devices</em> alias specific collections of |
| compute units within the parent <em>device</em>, according to a partition |
| scheme. The <em>sub-devices</em> may be used in any situation that their |
| parent <em>device</em> may be used. Partitioning a <em>device</em> does not destroy |
the parent <em>device</em>, which may continue to be used alongside and
| intermingled with its child <em>sub-devices</em>. Also see <em>device</em>, <em>parent |
| device</em> and <em>root device</em>. |
| <br> |
| <br> |
| <strong>Sub-group</strong>: Sub-groups are an implementation-dependent grouping of |
| work-items within a work-group.  The size and number of sub-groups is |
| implementation-defined. |
| <br> |
| <br> |
<strong>Sub-group Barrier</strong>: See <em>Barrier</em>.
| <br> |
| <br> |
| <strong>Submitted</strong>: The second state in the six state model for the execution |
| of a command. The transition into this state occurs when the command is |
| flushed from the command-queue and submitted for execution on the |
| device. Once submitted, a programmer can assume a command will execute |
| once its prerequisites have been met. |
| <br> |
| <br> |
| <strong>SVM Buffer</strong>: A memory allocation enabled to work with Shared Virtual |
| Memory (SVM). Depending on how the SVM buffer is created, it can be a |
| coarse-grained or fine-grained SVM buffer. Optionally it may be wrapped |
| by a Buffer Object. See <em>Shared Virtual Memory (SVM)</em>. |
| <br> |
| <br> |
| <strong>Synchronization</strong>: Synchronization refers to mechanisms that constrain |
| the order of execution and the visibility of memory operations between |
| two or more units of execution. |
| <br> |
| <br> |
| <strong>Synchronization operations</strong>: Operations that define memory order |
| constraints in a program. They play a special role in controlling how |
| memory operations in one unit of execution (such as work-items or, when |
using SVM, a host thread) are made visible to another. Synchronization
| operations in OpenCL include <em>atomic operations</em> and <em>fences</em>. |
| <br> |
| <br> |
| <strong>Synchronization point</strong>: A synchronization point between a pair of |
commands (A and B) assures that the results of command A happen-before
command B is launched (i.e. enters the ready state).
| <br> |
| <br> |
| <strong>Synchronizes with</strong>: A relation between operations in two different |
| units of execution that defines a memory order constraint in global |
| memory (<em>global-synchronizes-with</em>) or local memory |
| (<em>local-synchronizes-with</em>). |
| <br> |
| <br> |
| <strong>Task Parallel Programming Model</strong>: A programming model in which |
| computations are expressed in terms of multiple concurrent tasks |
| executing in one or more <em>command-queues</em>. The concurrent tasks can be |
| running different <em>kernels</em>. |
| <br> |
| <br> |
| <strong>Thread-safe</strong>: An OpenCL API call is considered to be <em>thread-safe</em> if |
| the internal state as managed by OpenCL remains consistent when called |
| simultaneously by multiple <em>host</em> threads. OpenCL API calls that are |
| <em>thread-safe</em> allow an application to call these functions in multiple |
| <em>host</em> threads without having to implement mutual exclusion across these |
<em>host</em> threads, i.e. they are also re-entrant-safe.
| <br> |
| <br> |
| <strong>Undefined</strong>: The behavior of an OpenCL API call, built-in function used |
| inside a <em>kernel</em> or execution of a <em>kernel</em> that is explicitly not |
| defined by OpenCL. A conforming implementation is not required to |
| specify what occurs when an undefined construct is encountered in |
| OpenCL. |
| <br> |
| <br> |
<strong>Unit of execution</strong>: A generic term for a process, OS-managed thread
| running on the host (a host-thread), kernel-instance, host program, |
| work-item or any other executable agent that advances the work |
| associated with a program. |
| <br> |
| <br> |
| <strong>Work-group</strong>: A collection of related <em>work-items</em> that execute on a |
| single <em>compute unit</em>. The <em>work-items</em> in the group execute the same |
| <em>kernel-instance</em> and share <em>local</em> <em>memory</em> and <em>work-group functions</em>. |
| <br> |
| <br> |
<strong>Work-group Barrier</strong>: See <em>Barrier</em>.
| <br> |
| <br> |
| <strong>Work-group Function</strong>: A function that carries out collective operations |
| across all the work-items in a work-group. Available collective |
| operations are a barrier, reduction, broadcast, prefix sum, and |
| evaluation of a predicate. A work-group function must occur within a |
| <em>converged control flow</em>; i.e. all work-items in the work-group must |
| encounter precisely the same work-group function. |
| <br> |
| <br> |
| <strong>Work-group Synchronization</strong>: Constraints on the order of execution for |
| work-items in a single work-group. |
| <br> |
| <br> |
| <strong>Work-pool</strong>: A logical pool associated with a device that holds commands |
| and work-groups from kernel-instances that are ready to execute. OpenCL |
| does not constrain the order that commands and work-groups are scheduled |
| for execution from the work-pool; i.e. a programmer must assume that |
| they could be interleaved. There is one work-pool per device used by |
| all command-queues associated with that device. The work-pool may be |
| implemented in any manner as long as it assures that work-groups placed |
| in the pool will eventually execute. |
| <br> |
| <br> |
| <strong>Work-item</strong>: One of a collection of parallel executions of a <em>kernel</em> |
| invoked on a <em>device</em> by a <em>command</em>. A <em>work-item</em> is executed by one |
| or more <em>processing elements</em> as part of a <em>work-group</em> executing on a |
| <em>compute unit</em>. A <em>work-item</em> is distinguished from other work-items by |
| its <em>global ID</em> or the combination of its <em>work-group</em> ID and its <em>local |
| ID</em> within a <em>work-group</em>.</p></div> |
<div class="paragraph"><p>&#160;</p></div>
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_the_opencl_architecture">3. The OpenCL Architecture</h2> |
| <div class="sectionbody"> |
| <div class="paragraph"><p><strong>OpenCL</strong> is an open industry standard for programming a heterogeneous |
| collection of CPUs, GPUs and other discrete computing devices organized |
| into a single platform. It is more than a language. OpenCL is a |
| framework for parallel programming and includes a language, API, |
| libraries and a runtime system to support software development. Using |
| OpenCL, for example, a programmer can write general purpose programs |
| that execute on GPUs without the need to map their algorithms onto a 3D |
| graphics API such as OpenGL or DirectX. |
| <br> |
| <br> |
| The target of OpenCL is expert programmers wanting to write portable yet |
| efficient code. This includes library writers, middleware vendors, and |
| performance oriented application programmers. Therefore OpenCL provides |
| a low-level hardware abstraction plus a framework to support programming |
| and many details of the underlying hardware are exposed. |
| <br> |
| <br> |
| To describe the core ideas behind OpenCL, we will use a hierarchy of |
| models:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| Platform Model |
| </p> |
| </li> |
| <li> |
| <p> |
| Memory Model |
| </p> |
| </li> |
| <li> |
| <p> |
| Execution Model |
| </p> |
| </li> |
| <li> |
| <p> |
| Programming Model |
| </p> |
| </li> |
| </ul></div> |
| <div class="sect2"> |
| <h3 id="_platform_model">3.1. Platform Model</h3> |
| <div class="paragraph"><p>The Platform model for OpenCL is defined in <em>figure 3.1</em>. The model |
| consists of a <strong>host</strong> connected to one or more <strong>OpenCL devices</strong>. An OpenCL |
| device is divided into one or more <strong>compute units</strong> (CUs) which are further |
| divided into one or more <strong>processing elements</strong> (PEs). Computations on a |
| device occur within the processing elements. |
| <br> |
| <br> |
| An OpenCL application is implemented as both host code and device kernel |
| code. The host code portion of an OpenCL application runs on a host |
| processor according to the models native to the host platform. The |
| OpenCL application host code submits the kernel code as commands from |
the host to OpenCL devices. An OpenCL device executes the command's
computation on the processing elements within the device.
| <br> |
| <br> |
| An OpenCL device has considerable latitude on how computations are |
mapped onto the device's processing elements. When processing elements
| within a compute unit execute the same sequence of statements across the |
| processing elements, the control flow is said to be <em>converged.</em> |
| Hardware optimized for executing a single stream of instructions over |
| multiple processing elements is well suited to converged control |
| flows. When the control flow varies from one processing element to |
| another, it is said to be <em>diverged.</em> While a kernel always begins |
| execution with a converged control flow, due to branching statements |
| within a kernel, converged and diverged control flows may occur within a |
| single kernel. This provides a great deal of flexibility in the |
| algorithms that can be implemented with OpenCL. |
| <br> |
| <br></p></div> |
| <div class="paragraph"><p><span class="image"> |
| <img src="opencl22-API_files/image004_new.png" alt="opencl22-API_files/image004_new.png" width="320" height="180"> |
| </span></p></div> |
| <div class="paragraph"><p><strong>Figure 3.1</strong>: <em>Platform model … one host plus one or more compute devices each |
| with one or more compute units composed of one or more processing elements</em>. |
| <br> |
| <br> |
Programmers provide programs in the form of SPIR-V binaries, OpenCL C or
OpenCL C++ source strings, or implementation-defined binary objects. The
| OpenCL platform provides a compiler to translate program input of either |
| form into executable program objects. The device code compiler may be |
| <em>online</em> or <em>offline</em>. An <em>online</em> <em>compiler</em> is available during host |
| program execution using standard APIs. An <em>offline compiler</em> is |
| invoked outside of host program control, using platform-specific |
methods. The OpenCL runtime allows developers to obtain a previously
compiled device program executable and to load and execute it.
| <br> |
| <br> |
| OpenCL defines two kinds of platform profiles: a <em>full profile</em> and a |
| reduced-functionality <em>embedded profile</em>. A full profile platform must |
| provide an online compiler for all its devices. An embedded platform |
| may provide an online compiler, but is not required to do so. |
| <br> |
| <br> |
| A device may expose special purpose functionality as a <em>built-in |
| function</em>. The platform provides APIs for enumerating and invoking the |
| built-in functions offered by a device, but otherwise does not define |
| their construction or semantics. A <em>custom device</em> supports only |
| built-in functions, and cannot be programmed via a kernel language. |
| <br> |
| <br> |
| All device types support the OpenCL execution model, the OpenCL memory |
| model, and the APIs used in OpenCL to manage devices. |
| <br> |
| <br> |
| The platform model is an abstraction describing how OpenCL views the |
| hardware. The relationship between the elements of the platform model |
| and the hardware in a system may be a fixed property of a device or it |
| may be a dynamic feature of a program dependent on how a compiler |
| optimizes code to best utilize physical hardware.</p></div> |
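<div class="paragraph"><p>As a hedged host-side sketch (error handling omitted, array size chosen
arbitrarily), the elements of the platform model can be discovered through the
API as follows:</p></div>
<div class="listingblock">
<div class="content">
<pre><code>#include &lt;CL/cl.h&gt;

cl_platform_id platform;
cl_device_id   devices[8];
cl_uint        num_devices, num_cus;

/* One host, one or more platforms, each exposing one or more devices. */
clGetPlatformIDs(1, &amp;platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &amp;num_devices);

/* Each device reports how many compute units it is divided into. */
clGetDeviceInfo(devices[0], CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(num_cus), &amp;num_cus, NULL);</code></pre>
</div>
</div>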
| </div> |
| <div class="sect2"> |
| <h3 id="_execution_model">3.2. Execution Model</h3> |
| <div class="paragraph"><p>The OpenCL execution model is defined in terms of two distinct units of |
| execution: <strong>kernels</strong> that execute on one or more OpenCL devices and a |
| <strong>host program</strong> that executes on the host. With regard to OpenCL, the |
| kernels are where the "work" associated with a computation occurs. This |
| work occurs through <strong>work-items</strong> that execute in groups (<strong>work-groups</strong>). |
| <br> |
| <br> |
| A kernel executes within a well-defined context managed by the host. |
| The context defines the environment within which kernels execute. It |
| includes the following resources:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>Devices</strong>: One or |
| more devices exposed by the OpenCL platform. |
| </p> |
| </li> |
| <li> |
| <p> |
<strong>Kernel Objects</strong>: The
| OpenCL functions with their associated argument values that run on |
| OpenCL devices. |
| </p> |
| </li> |
| <li> |
| <p> |
<strong>Program Objects</strong>: The
| program source and executable that implement the kernels. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Memory |
Objects</strong>: Variables visible to the host and the OpenCL devices.
| Instances of kernels operate on these objects as they execute. |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>The host program uses the OpenCL API to create and manage the context. |
| Functions from the OpenCL API enable the host to interact with a device |
| through a <em>command-queue</em>. Each command-queue is associated with a |
| single device. The commands placed into the command-queue fall into |
| one of three types:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>Kernel-enqueue commands</strong>: |
| Enqueue a kernel for execution on a device. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Memory commands</strong>: |
| Transfer data between the host and device memory, between memory |
| objects, or map and unmap memory objects from the host address space. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Synchronization |
| commands</strong>: Explicit synchronization points that define order constraints |
| between commands. |
| </p> |
| </li> |
| </ul></div> |
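<div class="paragraph"><p>A hedged sketch of the three command types listed above, assuming an existing
context <span class="monospaced">ctx</span>, device <span class="monospaced">dev</span>, kernel <span class="monospaced">k</span>, buffer object <span class="monospaced">buf</span> and a host array
<span class="monospaced">host_data</span> of <span class="monospaced">N</span> floats (all hypothetical names):</p></div>
<div class="listingblock">
<div class="content">
<pre><code>cl_int err;
cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, &amp;err);

/* Memory command: transfer data from the host into a buffer object. */
clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, N * sizeof(float),
                     host_data, 0, NULL, NULL);

/* Kernel-enqueue command: execute the kernel over a 1-D NDRange of N work-items. */
size_t gws = N;
clSetKernelArg(k, 0, sizeof(cl_mem), &amp;buf);
clEnqueueNDRangeKernel(q, k, 1, NULL, &amp;gws, NULL, 0, NULL, NULL);

/* Synchronization command: commands enqueued after this barrier cannot
 * launch until the commands enqueued before it have completed. */
clEnqueueBarrierWithWaitList(q, 0, NULL, NULL);</code></pre>
</div>
</div>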
| <div class="paragraph"><p>In addition to commands submitted from the host command-queue, a kernel |
| running on a device can enqueue commands to a device-side command queue. |
| This results in <em>child kernels</em> enqueued by a kernel executing on a |
| device (the <em>parent kernel</em>). Regardless of whether the command-queue |
| resides on the host or a device, each command passes through six states.</p></div> |
| <div class="olist arabic"><ol class="arabic"> |
| <li> |
| <p> |
| <strong>Queued</strong>: The command is enqueued to a command-queue. A |
| command may reside in the queue until it is flushed either explicitly (a |
| call to clFlush) or implicitly by some other command. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Submitted</strong>: The command is flushed from the command-queue and |
| submitted for execution on the device. Once flushed from the |
| command-queue, a command will execute after any prerequisites for |
| execution are met. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Ready</strong>: All prerequisites constraining execution of a command |
| have been met. The command, or for a kernel-enqueue command the |
| collection of work groups associated with a command, is placed in a |
| device work-pool from which it is scheduled for execution. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Running</strong>: Execution of the command starts. For the case of a |
| kernel-enqueue command, one or more work-groups associated with the |
| command start to execute. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Ended</strong>: Execution of a command ends. When a Kernel-enqueue |
| command ends, all of the work-groups associated with that command have |
| finished their execution. <em>Immediate side effects</em>, i.e. those |
| associated with the kernel but not necessarily with its child kernels, |
| are visible to other units of execution. These side effects include |
| updates to values in global memory. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Complete</strong>: The command and its child commands have finished |
| execution and the status of the event object, if any, associated with |
| the command is set to CL_COMPLETE. |
| </p> |
| </li> |
| </ol></div> |
| <div class="paragraph"><p>The execution states and the transitions between them are summarized in |
| Figure 3-2. These states and the concept of a device work-pool are |
| conceptual elements of the execution model. An implementation of OpenCL |
| has considerable freedom in how these are exposed to a program. Five of |
| the transitions, however, are directly observable through a profiling |
| interface. These profiled states are shown in Figure 3-2.</p></div> |
| <div class="paragraph"><p><span class="image"> |
| <img src="opencl22-API_files/image006.jpg" alt="image"> |
| </span></p></div> |
| <div class="paragraph"><p><strong>Figure 3-2: The states and transitions between states defined in the |
| OpenCL execution model. A subset of these transitions is exposed |
| through the profiling interface (see section 5.14).</strong></p></div> |
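<div class="paragraph"><p>A hedged sketch of observing these transitions through the profiling
interface. It assumes a command-queue created with
<span class="monospaced">CL_QUEUE_PROFILING_ENABLE</span> and an event <span class="monospaced">ev</span> (hypothetical name) returned by a
kernel-enqueue command:</p></div>
<div class="listingblock">
<div class="content">
<pre><code>cl_ulong t_queued, t_submit, t_start, t_end;

clWaitForEvents(1, &amp;ev);   /* block until the command reaches the complete state */

/* Timestamps for the queued, submitted, running and ended transitions. */
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED, sizeof(t_queued), &amp;t_queued, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_SUBMIT, sizeof(t_submit), &amp;t_submit, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,  sizeof(t_start),  &amp;t_start,  NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,    sizeof(t_end),    &amp;t_end,    NULL);

cl_ulong run_ns = t_end - t_start;   /* device time (ns) spent in the running state */</code></pre>
</div>
</div>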
| <div class="paragraph"><p>Commands communicate their status through <em>Event objects</em>. Successful |
| completion is indicated by setting the event status associated with a |
| command to CL_COMPLETE. Unsuccessful completion results in abnormal |
| termination of the command which is indicated by setting the event |
| status to a negative value. In this case, the command-queue associated |
| with the abnormally terminated command and all other command-queues in |
| the same context may no longer be available and their behavior is |
| implementation defined. |
| <br> |
| <br> |
| A command submitted to a device will not launch until prerequisites that |
| constrain the order of commands have been resolved. These |
| prerequisites have three sources:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| They may arise from |
| commands submitted to a command-queue that constrain the order in which |
| commands are launched. For example, commands that follow a command queue |
| barrier will not launch until all commands prior to the barrier are |
| complete. |
| </p> |
| </li> |
| <li> |
| <p> |
| The second source of |
| prerequisites is dependencies between commands expressed through events. |
| A command may include an optional list of events. The command will wait |
and not launch until all the events in the list are in the state
CL_COMPLETE. By this mechanism, event objects define order constraints
| between commands and coordinate execution between the host and one or |
| more devices. |
| </p> |
| </li> |
| <li> |
| <p> |
The third source of
prerequisites can be the presence of non-trivial C initializers or C++
constructors for program scope global variables. In this case, the OpenCL
C/C++ compiler shall generate program initialization kernels that
perform C initialization or C++ construction. These kernels must be
executed by the OpenCL runtime on a device before any kernel from the same
program can be executed on the same device. The ND-range for any program
initialization kernel is (1,1,1). When multiple programs are linked
together, the order of execution of program initialization kernels
that belong to different programs is undefined.
<br>
<br>
Program clean up may result in the execution of one or more program
clean up kernels by the OpenCL runtime. This is due to the presence of
non-trivial C++ destructors for program scope variables. The ND-range
for executing any program clean up kernel is (1,1,1). The order of
execution of clean up kernels from different programs (that are linked
together) is undefined.
<br>
<br>
Note that C initializers, C++ constructors, or C++ destructors for
program scope variables cannot use pointers to coarse grain and fine
grain SVM allocations.
| <br> |
| <br> |
| A command may be submitted to a device and yet have no visible side effects |
| outside of waiting on and satisfying event dependences. Examples include |
| markers, kernels executed over ranges of no work-items or copy |
| operations with zero sizes. Such commands may pass directly from the |
| <em>ready</em> state to the <em>ended</em> state. |
| <br> |
| <br> |
| Command execution can be blocking or non-blocking. Consider a sequence |
| of OpenCL commands. For blocking commands, the OpenCL API functions |
| that enqueue commands don’t return until the command has completed. |
| Alternatively, OpenCL functions that enqueue non-blocking commands |
| return immediately and require that a programmer defines dependencies |
| between enqueued commands to ensure that enqueued commands are not |
| launched before needed resources are available. In both cases, the |
| actual execution of the command may occur asynchronously with execution |
| of the host program. |
| <br> |
| <br> |
| Commands within a single command-queue execute relative to each other in |
| one of two modes: |
| </p> |
| </li> |
| </ul></div> |
<div class="paragraph"><p>&#160;</p></div>
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>In-order Execution</strong>: |
| Commands and any side effects associated with commands appear to the |
| OpenCL application as if they execute in the same order they are |
| enqueued to a command-queue. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Out-of-order Execution</strong>: |
| Commands execute in any order constrained only by explicit |
| synchronization points (e.g. through command queue barriers) or explicit |
| dependencies on events. |
| <br> |
| <br> |
| Multiple command-queues can be present within a single context. |
| Multiple command-queues execute commands independently. Event objects |
| visible to the host program can be used to define synchronization points |
| between commands in multiple command queues. If such synchronization |
| points are established between commands in multiple command-queues, an |
| implementation must assure that the command-queues progress concurrently |
| and correctly account for the dependencies established by the |
| synchronization points. For a detailed explanation of synchronization |
| points, see section 3.2.4. |
| <br> |
| <br> |
| The core of the OpenCL execution model is defined by how the kernels |
| execute. When a kernel-enqueue command submits a kernel for execution, |
| an index space is defined. The kernel, the argument values associated |
| with the arguments to the kernel, and the parameters that define the |
| index space define a <em>kernel-instance</em>. When a kernel-instance executes |
| on a device, the kernel function executes for each point in the defined |
| index space. Each of these executing kernel functions is called a |
| <em>work-item</em>. The work-items associated with a given kernel-instance are |
| managed by the device in groups called <em>work-groups</em>. These work-groups |
| define a coarse grained decomposition of the Index space. Work-groups |
| are further divided into <em>sub-groups</em>, which provide an additional level |
| of control over execution. |
| <br> |
| <br> |
| Work-items have a global ID based on their coordinates within the Index |
| space. They can also be defined in terms of their work-group and the |
| local ID within a work-group. The details of this mapping are described |
| in the following section. |
| </p> |
| </li> |
| </ul></div> |
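<div class="paragraph"><p>Returning to the two command-queue modes above, a hedged sketch of an
out-of-order queue in which an event provides the only required ordering
between two kernel-enqueue commands (context <span class="monospaced">ctx</span>, device <span class="monospaced">dev</span> and kernels
<span class="monospaced">producer</span> and <span class="monospaced">consumer</span> are assumed to exist; all names are hypothetical):</p></div>
<div class="listingblock">
<div class="content">
<pre><code>cl_queue_properties props[] = {
    CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
};
cl_int err;
cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, props, &amp;err);

size_t gws = 1024;
cl_event produced;

/* Without the event dependency the two kernel-instances could launch in any order. */
clEnqueueNDRangeKernel(q, producer, 1, NULL, &amp;gws, NULL, 0, NULL, &amp;produced);
clEnqueueNDRangeKernel(q, consumer, 1, NULL, &amp;gws, NULL, 1, &amp;produced, NULL);

clReleaseEvent(produced);</code></pre>
</div>
</div>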
| <div class="sect3"> |
| <h4 id="_execution_model_mapping_work_items_onto_an_ndrange">3.2.1. Execution Model: Mapping work-items onto an NDRange</h4> |
| <div class="paragraph"><p>The index space supported by OpenCL is called an NDRange. An NDRange is |
| an N-dimensional index space, where N is one, two or three. The NDRange |
| is decomposed into work-groups forming blocks that cover the Index |
| space. An NDRange is defined by three integer arrays of length N:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| The extent of the index |
| space (or global size) in each dimension. |
| </p> |
| </li> |
| <li> |
| <p> |
| An offset index F |
| indicating the initial value of the indices in each dimension (zero by |
| default). |
| </p> |
| </li> |
| <li> |
| <p> |
| The size of a work-group |
| (local size) in each dimension. |
| </p> |
| </li> |
| </ul></div> |
<div class="paragraph"><p>&#160;</p></div>
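<div class="paragraph"><p>A hedged sketch of supplying these three arrays for a 2-dimensional NDRange
when a kernel-instance is enqueued (queue <span class="monospaced">q</span> and kernel <span class="monospaced">k</span> are assumed to
exist; the sizes are arbitrary):</p></div>
<div class="listingblock">
<div class="content">
<pre><code>/* 2-D NDRange: global size 1024 x 768, offset (0, 0), work-group size 16 x 16. */
size_t global_offset[2] = {0, 0};       /* F: initial value of the indices   */
size_t global_size[2]   = {1024, 768};  /* extent of the index space         */
size_t local_size[2]    = {16, 16};     /* size of each work-group           */

clEnqueueNDRangeKernel(q, k, 2, global_offset, global_size, local_size,
                       0, NULL, NULL);</code></pre>
</div>
</div>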
| <div class="paragraph"><p>Each work-items global ID is an N-dimensional tuple. The global ID |
| components are values in the range from F, to F plus the number of |
| elements in that dimension minus one. |
| <br> |
| <br> |
| If a kernel is created from OpenCL C 2.0 or SPIR-V, the size of work-groups |
| in an NDRange (the local size) need not be the same for all work-groups. |
| In this case, any single dimension for which the global size is not |
| divisible by the local size will be partitioned into two regions. One |
| region will have work-groups that have the same number of work items as |
| was specified for that dimension by the programmer (the local size). The |
| other region will have work-groups with less than the number of work |
| items specified by the local size parameter in that dimension (the |
| <em>remainder work-groups</em>). Work-group sizes could be non-uniform in |
| multiple dimensions, potentially producing work-groups of up to 4 |
| different sizes in a 2D range and 8 different sizes in a 3D range. |
| <br> |
| <br> |
| Each work-item is assigned to a work-group and given a local ID to |
| represent its position within the work-group. A work-item’s local ID is |
| an N-dimensional tuple with components in the range from zero to the |
| size of the work-group in that dimension minus one. |
| <br> |
| <br> |
| Work-groups are assigned IDs similarly. The number of work-groups in |
| each dimension is not directly defined but is inferred from the local |
| and global NDRanges provided when a kernel-instance is enqueued. A |
| work-group’s ID is an N-dimensional tuple with components in the range 0 |
| to the ceiling of the global size in that dimension divided by the local |
| size in the same dimension. As a result, the combination of a |
| work-group ID and the local-ID within a work-group uniquely defines a |
| work-item. Each work-item is identifiable in two ways; in terms of a |
| global index, and in terms of a work-group index plus a local index |
| within a work group. |
| <br> |
| <br> |
| For example, consider the 2-dimensional index space in figure 3-3. We |
| input the index space for the work-items (G<sub>x</sub>, G<sub>y</sub>), the size of each |
| work-group (S<sub>x</sub>, S<sub>y</sub>) and the global ID offset (F<sub>x</sub>, F<sub>y</sub>). The |
global indices define a G<sub>x</sub> by G<sub>y</sub> index space where the total number
| of work-items is the product of G<sub>x</sub> and G<sub>y</sub>. The local indices define |
| an S<sub>x</sub> by S<sub>y</sub> index space where the number of work-items in a single |
| work-group is the product of S<sub>x</sub> and S<sub>y</sub>. Given the size of each |
| work-group and the total number of work-items we can compute the number |
| of work-groups. A 2-dimensional index space is used to uniquely identify |
| a work-group. Each work-item is identified by its global ID (<em>g</em><sub>x</sub>, |
| <em>g</em><sub>y</sub>) or by the combination of the work-group ID (<em>w</em><sub>x</sub>, <em>w</em><sub>y</sub>), the |
| size of each work-group (S<sub>x</sub>,S<sub>y</sub>) and the local ID (s<sub>x</sub>, s<sub>y</sub>) inside |
| the work-group such that |
| <br></p></div> |
| <div class="paragraph"><p>        (g<sub>x</sub> , g<sub>y</sub>) = (w<sub>x</sub> * S<sub>x</sub> + s<sub>x</sub> + F<sub>x</sub>, w<sub>y</sub> * S<sub>y</sub> + s<sub>y</sub> + F<sub>y</sub>) |
| <br> |
| <br> |
| The number of work-groups can be computed as: |
| <br></p></div> |
| <div class="paragraph"><p>        (W<sub>x</sub>, W<sub>y</sub>) = (ceil(G<sub>x</sub> / S<sub>x</sub>),ceil( G<sub>y</sub> / S<sub>y</sub>)) |
| <br> |
| <br> |
| Given a global ID and the work-group size, the work-group ID for a |
| work-item is computed as: |
| <br></p></div> |
| <div class="paragraph"><p>        (w<sub>x</sub>, w<sub>y</sub>) = ( (g<sub>x</sub> s<sub>x</sub> F<sub>x</sub>) / S<sub>x</sub>, (g<sub>y</sub> s<sub>y</sub> F<sub>y</sub>) / |
| S<sub>y</sub> )</p></div> |
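<div class="paragraph"><p>These relationships correspond to the work-item built-in functions of the
kernel language; a hedged OpenCL C sketch (the kernel and argument names are
hypothetical):</p></div>
<div class="listingblock">
<div class="content">
<pre><code>// For every dimension the built-in functions satisfy g = w * S + s + F.
kernel void show_ids(global int *group_of)
{
    size_t gx = get_global_id(0);       // g_x (includes the offset F_x)
    size_t sx = get_local_id(0);        // s_x
    size_t Sx = get_local_size(0);      // S_x
    size_t Fx = get_global_offset(0);   // F_x

    // For uniform work-group sizes this equals get_group_id(0), i.e. w_x.
    size_t wx = (gx - sx - Fx) / Sx;

    group_of[gx - Fx] = (int)wx;
}</code></pre>
</div>
</div>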
| <div class="paragraph"><p><span class="image"> |
| <img src="opencl22-API_files/image007.jpg" alt="image"> |
| </span></p></div> |
| <div class="paragraph"><p><strong>Figure 3-3: An example of an NDRange index space showing work-items, |
| their global IDs and their mapping onto the pair of work-group and local |
| IDs. In this case, we assume that in each dimension, the size of the |
| work-group evenly divides the global NDRange size (i.e. all work-groups |
| have the same size) and that the offset is equal to zero.</strong> |
| <br> |
| <br> |
| Within a work-group work-items may be divided into sub-groups. The |
| mapping of work-items to sub-groups is implementation-defined and may be |
| queried at runtime. While sub-groups may be used in multi-dimensional |
| work-groups, each sub-group is 1-dimensional and any given work-item may |
| query which sub-group it is a member of. |
| <br> |
| <br> |
| Work items are mapped into sub-groups through a combination of |
| compile-time decisions and the parameters of the dispatch. The mapping |
to sub-groups is invariant for the duration of a kernel's execution,
| across dispatches of a given kernel with the same work-group dimensions, |
| between dispatches and query operations consistent with the dispatch |
| parameterization, and from one work-group to another within the dispatch |
| (excluding the trailing edge work-groups in the presence of non-uniform |
| work-group sizes). In addition, all sub-groups within a work-group will |
| be the same size, apart from the sub-group with the maximum index which |
| may be smaller if the size of the work-group is not evenly divisible by |
| the size of the sub-groups. |
| <br> |
| <br> |
| In the degenerate case, a single sub-group must be supported for each |
| work-group. In this situation all sub-group scope functions are |
| equivalent to their work-group level equivalents.</p></div> |
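<div class="paragraph"><p>Where an implementation exposes the sub-group built-in functions (for example
via OpenCL C 2.0 with sub-group support or the <span class="monospaced">cl_khr_subgroups</span> extension), a
work-item can query its sub-group membership at runtime; a hedged OpenCL C
sketch with hypothetical kernel and argument names:</p></div>
<div class="listingblock">
<div class="content">
<pre><code>kernel void subgroup_info(global uint *out)
{
    uint sgid = get_sub_group_id();        // index of this work-item's sub-group
    uint slid = get_sub_group_local_id();  // position within the sub-group
    uint msz  = get_max_sub_group_size();  // maximum sub-group size for this dispatch

    // A work-item is identified within its work-group by
    // (sub-group id, sub-group local id).
    out[get_global_id(0)] = (sgid * msz) + slid;
}</code></pre>
</div>
</div>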
| </div> |
| <div class="sect3"> |
| <h4 id="_execution_model_execution_of_kernel_instances">3.2.2. Execution Model: Execution of kernel-instances</h4> |
| <div class="paragraph"><p>The work carried out by an OpenCL program occurs through the execution |
| of kernel-instances on compute devices. To understand the details of |
OpenCL's execution model, we need to consider how a kernel object moves
| from the kernel-enqueue command, into a command-queue, executes on a |
| device, and completes. |
| <br> |
| <br> |
| A kernel-object is defined from a function within the program object and |
| a collection of arguments connecting the kernel to a set of argument |
| values. The host program enqueues a kernel-object to the command queue |
| along with the NDRange, and the work-group decomposition. These define |
| a <em>kernel-instance</em>. In addition, an optional set of events may be |
| defined when the kernel is enqueued. The events associated with a |
| particular kernel-instance are used to constrain when the |
| kernel-instance is launched with respect to other commands in the queue |
| or to commands in other queues within the same context. |
| <br> |
| <br> |
| A kernel-instance is submitted to a device. For an in-order command |
| queue, the kernel instances appear to launch and then execute in that |
same order; where we use the term "appear" to emphasize that when there
| are no dependencies between commands and hence differences in the order |
| that commands execute cannot be observed in a program, an implementation |
| can reorder commands even in an in-order command queue. For an out of |
| order command-queue, kernel-instances wait to be launched until:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| Synchronization commands |
| enqueued prior to the kernel-instance are satisfied. |
| </p> |
| </li> |
| <li> |
| <p> |
| Each of the events in an |
| optional event list defined when the kernel-instance was enqueued are |
| set to CL_COMPLETE. |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>Once these conditions are met, the kernel-instance is launched and the |
| work-groups associated with the kernel-instance are placed into a pool |
| of ready to execute work-groups. This pool is called a <em>work-pool</em>. |
| The work-pool may be implemented in any manner as long as it assures |
| that work-groups placed in the pool will eventually execute. The |
| device schedules work-groups from the work-pool for execution on the |
| compute units of the device. The kernel-enqueue command is complete when |
| all work-groups associated with the kernel-instance end their execution, |
| updates to global memory associated with a command are visible globally, |
| and the device signals successful completion by setting the event |
| associated with the kernel-enqueue command to CL_COMPLETE. |
| <br> |
| <br> |
| While a command-queue is associated with only one device, a single |
| device may be associated with multiple command-queues all feeding into |
| the single work-pool. A device may also be associated with command |
| queues associated with different contexts within the same platform, |
| again all feeding into the single work-pool. The device will pull |
| work-groups from the work-pool and execute them on one or several |
| compute units in any order; possibly interleaving execution of |
| work-groups from multiple commands. A conforming implementation may |
| choose to serialize the work-groups so a correct algorithm cannot assume |
| that work-groups will execute in parallel. There is no safe and |
| portable way to synchronize across the independent execution of |
| work-groups since once in the work-pool, they can execute in any order. |
| <br> |
| <br> |
| The work-items within a single sub-group execute concurrently but not |
| necessarily in parallel (i.e. they are not guaranteed to make |
| independent forward progress). Therefore, only high-level |
| synchronization constructs (e.g. sub-group functions such as barriers) |
| that apply to all the work-items in a sub-group are well defined and |
| included in OpenCL. |
| <br> |
| <br> |
| Sub-groups execute concurrently within a given work-group and with |
appropriate device support (<em>see Section 4.2</em>) may make independent
| forward progress with respect to each other, with respect to host |
| threads and with respect to any entities external to the OpenCL system |
| but running on an OpenCL device, even in the absence of work-group |
| barrier operations. In this situation, sub-groups are able to internally |
| synchronize using barrier operations without synchronizing with each |
| other and may perform operations that rely on runtime dependencies on |
| operations other sub-groups perform. |
| <br> |
| <br> |
| The work-items within a single work-group execute concurrently but are |
| only guaranteed to make independent progress in the presence of |
| sub-groups and device support. In the absence of this capability, only |
| high-level synchronization constructs (e.g. work-group functions such as |
| barriers) that apply to all the work-items in a work-group are well |
| defined and included in OpenCL for synchronization within the |
| work-group. |
| <br> |
| <br> |
| In the absence of synchronization functions (e.g. a barrier), work-items |
within a sub-group may be serialized. In the presence of sub-group
functions, work-items within a sub-group may be serialized before any
given sub-group function, between dynamically encountered pairs of
sub-group functions and between a work-group function and the end of the
| kernel. |
| <br> |
| <br> |
| In the absence of independent forward progress of constituent |
| sub-groups, work-items within a work-group may be serialized before, |
| after or between work-group synchronization functions.</p></div> |
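<div class="paragraph"><p>A hedged OpenCL C sketch of such a high-level construct: a work-group barrier
that makes partial results in local memory visible to every work-item of the
work-group before they are consumed (kernel and argument names are
hypothetical):</p></div>
<div class="listingblock">
<div class="content">
<pre><code>kernel void stage_and_sum(global const float *in, global float *out,
                          local float *scratch)
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    scratch[lid] = in[get_global_id(0)];

    // Every work-item of the work-group must reach the barrier before any
    // may continue; the local-memory writes above become visible to all of them.
    work_group_barrier(CLK_LOCAL_MEM_FENCE);

    if (lid == 0) {
        float sum = 0.0f;
        for (size_t i = 0; i &lt; lsz; ++i)
            sum += scratch[i];
        out[get_group_id(0)] = sum;
    }
}</code></pre>
</div>
</div>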
| </div> |
| <div class="sect3"> |
| <h4 id="_execution_model_device_side_enqueue">3.2.3. Execution Model: Device-side enqueue</h4> |
| <div class="paragraph"><p>Algorithms may need to generate additional work as they execute. In |
| many cases, this additional work cannot be determined statically; so the |
| work associated with a kernel only emerges at runtime as the |
| kernel-instance executes. This capability could be implemented in logic |
| running within the host program, but involvement of the host may add |
| significant overhead and/or complexity to the application control |
| flow. A more efficient approach would be to nest kernel-enqueue |
| commands from inside other kernels. This <strong>nested parallelism</strong> can be |
| realized by supporting the enqueuing of kernels on a device without |
| direct involvement by the host program; so-called <strong>device-side |
| enqueue</strong>. |
| <br> |
| <br> |
| Device-side kernel-enqueue commands are similar to host-side |
| kernel-enqueue commands. The kernel executing on a device (the <strong>parent |
| kernel</strong>) enqueues a kernel-instance (the <strong>child kernel</strong>) to a |
| device-side command queue. This is an out-of-order command-queue and |
| follows the same behavior as the out-of-order command-queues exposed to |
| the host program. Commands enqueued to a device side command-queue |
| generate and use events to enforce order constraints just as for the |
| command-queue on the host. These events, however, are only visible to |
| the parent kernel running on the device. When these prerequisite |
| events take on the value CL_COMPLETE, the work-groups associated with |
the child kernel are launched into the device's work-pool. The device
| then schedules them for execution on the compute units of the device. |
| Child and parent kernels execute asynchronously. However, a parent will |
| not indicate that it is complete by setting its event to CL_COMPLETE |
| until all child kernels have ended execution and have signaled |
| completion by setting any associated events to the value CL_COMPLETE. |
| Should any child kernel complete with an event status set to a negative |
| value (i.e. abnormally terminate), the parent kernel will abnormally |
terminate and propagate the child's negative event value as the value of
the parent's event. If there are multiple children that have an event
status set to a negative value, the selection of which child's negative
| event value is propagated is implementation-defined.</p></div> |
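<div class="paragraph"><p>A hedged OpenCL C 2.0 sketch of device-side enqueue: a parent kernel enqueues
a child kernel, expressed as a block, to the default device queue; the amount
of child work is only known at runtime (kernel and argument names are
hypothetical):</p></div>
<div class="listingblock">
<div class="content">
<pre><code>kernel void parent(global float *data, global const uint *count)
{
    if (get_global_id(0) == 0) {
        uint n = count[0];                /* extra work discovered at runtime */
        queue_t q = get_default_queue();

        /* Enqueue the child kernel; with this flag it may not launch until
         * the parent kernel-instance has ended. */
        enqueue_kernel(q,
                       CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                       ndrange_1D(n),
                       ^{ data[get_global_id(0)] *= 2.0f; });
    }
}</code></pre>
</div>
</div>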
| </div> |
| <div class="sect3"> |
| <h4 id="_execution_model_synchronization">3.2.4. Execution Model: Synchronization</h4> |
| <div class="paragraph"><p>Synchronization refers to mechanisms that constrain the order of |
| execution between two or more units of execution. Consider the |
| following three domains of synchronization in OpenCL:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| Work-group |
| synchronization: Constraints on the order of execution for work-items in |
| a single work-group |
| </p> |
| </li> |
| <li> |
| <p> |
| Sub-group synchronization: |
Constraints on the order of execution for work-items in a single
| sub-group |
| </p> |
| </li> |
| <li> |
| <p> |
| Command synchronization: |
| Constraints on the order of commands launched for execution |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>Synchronization across all work-items within a single work-group is |
| carried out using a <em>work-group function</em>. These functions carry out |
| collective operations across all the work-items in a work-group. |
| Available collective operations are: barrier, reduction, broadcast, |
| prefix sum, and evaluation of a predicate. A work-group function must |
| occur within a converged control flow; i.e. all work-items in the |
| work-group must encounter precisely the same work-group function. For |
| example, if a work-group function occurs within a loop, the work-items |
| must encounter the same work-group function in the same loop |
| iterations. All the work-items of a work-group must execute the |
| work-group function and complete reads and writes to memory before any |
| are allowed to continue execution beyond the work-group function. |
| Work-group functions that apply between work-groups are not provided in |
| OpenCL since OpenCL does not define forward-progress or ordering |
| relations between work-groups, hence collective synchronization |
| operations are not well defined. |
| <br> |
| <br> |
| Synchronization across all work-items within a single sub-group is |
| carried out using a <em>sub-group function</em>. These functions carry out |
| collective operations across all the work-items in a sub-group. |
| Available collective operations are: barrier, reduction, broadcast, |
| prefix sum, and evaluation of a predicate. A sub-group function must |
| occur within a converged control flow; i.e. all work-items in the |
| sub-group must encounter precisely the same sub-group function. For |
example, if a sub-group function occurs within a loop, the work-items
| must encounter the same sub-group function in the same loop iterations. |
| All the work-items of a sub-group must execute the sub-group function |
| and complete reads and writes to memory before any are allowed to |
| continue execution beyond the sub-group function. Synchronization |
| between sub-groups must either be performed using work-group functions, |
| or through memory operations. Using memory operations for sub-group |
| synchronization should be used carefully as forward progress of |
| sub-groups relative to each other is only supported optionally by OpenCL |
| implementations. |
| <br> |
| <br> |
| Command synchronization is defined in terms of distinct <strong>synchronization |
| points</strong>. The synchronization points occur between commands in host |
| command-queues and between commands in device-side command-queues. The |
| synchronization points defined in OpenCL include:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>Launching a command:</strong> A |
kernel-instance is launched onto a device after all events that the kernel
is waiting on have been set to CL_COMPLETE.
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Ending a command:</strong> Child |
| kernels may be enqueued such that they wait for the parent kernel to |
reach the <em>ended</em> state before they can be launched. In this case, the
| ending of the parent command defines a synchronization point. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Completion of a command:</strong> |
| A kernel-instance is complete after all of the work-groups in the kernel |
| and all of its child kernels have completed. This is signaled to the |
| host, a parent kernel or other kernels within command queues by setting |
| the value of the event associated with a kernel to CL_COMPLETE. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Blocking Commands:</strong> A |
| blocking command defines a synchronization point between the unit of |
| execution that calls the blocking API function and the enqueued command |
| reaching the complete state. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Command-queue barrier:</strong> |
| The command-queue barrier ensures that all previously enqueued commands |
| have completed before subsequently enqueued commands can be launched. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>clFinish:</strong> This function |
| blocks until all previously enqueued commands in the command queue have |
| completed after which clFinish defines a synchronization point and the |
| clFinish function returns. |
| </p> |
| </li> |
| </ul></div> |
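<div class="paragraph"><p>A hedged host-side sketch of two of these synchronization points, the
command-queue barrier and <span class="monospaced">clFinish</span> (queue <span class="monospaced">q</span> and kernels <span class="monospaced">step1</span> and
<span class="monospaced">step2</span> are assumed to exist; all names are hypothetical):</p></div>
<div class="listingblock">
<div class="content">
<pre><code>size_t gws = 4096;

clEnqueueNDRangeKernel(q, step1, 1, NULL, &amp;gws, NULL, 0, NULL, NULL);

/* Command-queue barrier: commands enqueued after it cannot launch until
 * every command enqueued before it has completed. */
clEnqueueBarrierWithWaitList(q, 0, NULL, NULL);

clEnqueueNDRangeKernel(q, step2, 1, NULL, &amp;gws, NULL, 0, NULL, NULL);

/* clFinish blocks the host until every command in q reaches the complete state. */
clFinish(q);</code></pre>
</div>
</div>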
| <div class="paragraph"><p>A synchronization point between a pair of commands (A and B) assures |
that the results of command A happen-before command B is launched. This
| requires that any updates to memory from command A complete and are made |
| available to other commands before the synchronization point completes. |
| Likewise, this requires that command B waits until after the |
| synchronization point before loading values from global memory. The |
| concept of a synchronization point works in a similar fashion for |
| commands such as a barrier that apply to two sets of commands. All the |
| commands prior to the barrier must complete and make their results |
| available to following commands. Furthermore, any commands following |
| the barrier must wait for the commands prior to the barrier before |
| loading values and continuing their execution. |
| <br> |
| <br> |
| These <em>happens-before</em> relationships are a fundamental part of the |
| OpenCL memory model. When applied at the level of commands, they are |
| straightforward to define at a language level in terms of ordering |
| relationships between different commands. Ordering memory operations |
| inside different commands, however, requires rules more complex than can |
| be captured by the high level concept of a synchronization point. |
| These rules are described in detail in section 3.3.6.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_execution_model_categories_of_kernels">3.2.5. Execution Model: Categories of Kernels</h4> |
| <div class="paragraph"><p>The OpenCL execution model supports three types of kernels:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>OpenCL kernels</strong> are |
| managed by the OpenCL API as kernel-objects associated with kernel |
| functions within program-objects. OpenCL kernels are provided via a |
| kernel language. |
| All OpenCL implementations must support OpenCL kernels supplied in the
| standard SPIR-V intermediate language with the appropriate environment
| specification, as well as OpenCL kernels written in the OpenCL C
| programming language defined in earlier versions of the OpenCL
| specification. SPIR-V binaries may be generated from an OpenCL kernel
| language or by a third party compiler from an alternative input.
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Native kernels</strong> are |
| accessed through a host function pointer. Native kernels are queued for |
| execution along with OpenCL kernels on a device and share memory objects |
| with OpenCL kernels. For example, these native kernels could be |
| functions defined in application code or exported from a library. The |
| ability to execute native kernels is optional within OpenCL and the |
| semantics of native kernels are implementation-defined. The OpenCL API
| includes functions to query the capabilities of a device and determine
| whether this capability is supported.
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Built-in kernels</strong> are tied |
| to particular device and are not built at runtime from source code in a |
| program object. The common use of built in kernels is to expose |
| fixed-function hardware or firmware associated with a particular OpenCL |
| device or custom device. The semantics of a built-in kernel may be |
| defined outside of OpenCL and hence are implementation defined. |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>All three types of kernels are manipulated through the OpenCL command |
| queues and must conform to the synchronization points defined in the |
| OpenCL execution model.</p></div> |
| </div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_memory_model">3.3. Memory Model</h3> |
| <div class="paragraph"><p>The OpenCL memory model describes the structure, contents, and behavior |
| of the memory exposed by an OpenCL platform as an OpenCL program runs. |
| The model allows a programmer to reason about values in memory as the |
| host program and multiple kernel-instances execute. |
| <br> |
| <br> |
| An OpenCL program defines a context that includes a host, one or more |
| devices, command-queues, and memory exposed within the context. |
| Consider the units of execution involved with such a program. The host |
| program runs as one or more host threads managed by the operating system |
| running on the host (the details of which are defined outside of |
| OpenCL). There may be multiple devices in a single context which all |
| have access to memory objects defined by OpenCL. On a single device, |
| multiple work-groups may execute in parallel with potentially |
| overlapping updates to memory. Finally, within a single work-group, |
| multiple work-items concurrently execute, once again with potentially |
| overlapping updates to memory. |
| <br> |
| <br> |
| The memory model must precisely define how the values in memory as seen |
| from each of these units of execution interact so a programmer can |
| reason about the correctness of OpenCL programs. We define the memory |
| model in four parts.</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| Memory regions: The |
| distinct memories visible to the host and the devices that share a |
| context. |
| </p> |
| </li> |
| <li> |
| <p> |
| Memory objects: The |
| objects defined by the OpenCL API and their management by the host and |
| devices. |
| </p> |
| </li> |
| <li> |
| <p> |
| Shared Virtual Memory: A |
| virtual address space exposed to both the host and the devices within a |
| context. |
| </p> |
| </li> |
| <li> |
| <p> |
| Consistency Model: Rules |
| that define which values are observed when multiple units of execution |
| load data from memory plus the atomic/fence operations that constrain |
| the order of memory operations and define synchronization relationships. |
| </p> |
| </li> |
| </ul></div> |
| <div class="sect3"> |
| <h4 id="_memory_model_fundamental_memory_regions">3.3.1. Memory Model: Fundamental Memory Regions</h4> |
| <div class="paragraph"><p>Memory in OpenCL is divided into two parts.</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>Host Memory:</strong> The memory |
| directly available to the host. The detailed behavior of host memory is |
| defined outside of OpenCL. Memory objects move between the Host and the |
| devices through functions within the OpenCL API or through a shared |
| virtual memory interface. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Device Memory:</strong> Memory |
| directly available to kernels executing on OpenCL devices. |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>Device memory consists of four named address spaces or <em>memory regions</em>:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>Global Memory:</strong> This |
| memory region permits read/write access to all work-items in all |
| work-groups running on any device within a context. Work-items can read |
| from or write to any element of a memory object. Reads and writes to |
| global memory may be cached depending on the capabilities of the device. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Constant Memory</strong>: A |
| region of global memory that remains constant during the execution of a |
| kernel-instance. The host allocates and initializes memory objects |
| placed into constant memory. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Local Memory</strong>: A memory |
| region local to a work-group. This memory region can be used to allocate |
| variables that are shared by all work-items in that work-group. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Private Memory</strong>: A region |
| of memory private to a work-item. Variables defined in one work-item's
| private memory are not visible to another work-item.
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>Â </p></div> |
| <div class="paragraph"><p>The memory regions and their relationship to the OpenCL Platform model |
| are summarized in figure 3-4. Local and private memories are always |
| associated with a particular device. The global and constant memories, |
| however, are shared between all devices within a given context. An |
| OpenCL device may include a cache to support efficient access to these |
| shared memories |
| <br> |
| <br> |
| To understand memory in OpenCL, it is important to appreciate the |
| relationships between these named address spaces.  The four named |
| address spaces available to a device are disjoint meaning they do not |
| overlap.  This is a logical relationship, however, and an |
| implementation may choose to let these disjoint named address spaces |
| share physical memory. |
| <br> |
| <br> |
| Programmers often need functions callable from kernels where the |
| pointers manipulated by those functions can point to multiple named |
| address spaces. This saves a programmer from the error-prone and |
| wasteful practice of creating multiple copies of functions; one for each |
| named address space. Therefore the global, local and private address |
| spaces belong to a single <em>generic address space</em>. This is closely |
| modeled after the concept of a generic address space used in the |
| embedded C standard (ISO/IEC 9899:1999). Since they all belong to a |
| single generic address space, the following properties are supported for |
| pointers to named address spaces in device memory:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| A pointer to the generic |
| address space can be cast to a pointer to a global, local or private |
| address space.
| </p> |
| </li> |
| <li> |
| <p> |
| A pointer to a global, |
| local or private address space can be cast to a pointer to the generic |
| address space. |
| </p> |
| </li> |
| <li> |
| <p> |
| A pointer to a global, |
| local or private address space can be implicitly converted to a pointer |
| to the generic address space, but the converse is not allowed. |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>Â </p></div> |
| <div class="paragraph"><p>The constant address space is disjoint from the generic address space. |
| <br> |
| <br> |
| The addresses of memory associated with memory objects in Global memory |
| are not preserved between kernel instances, between a device and the |
| host, and between devices. In this regard global memory acts as a global |
| pool of memory objects rather than an address space. This restriction is |
| relaxed when shared virtual memory (SVM) is used. |
| <br> |
| <br> |
| SVM causes addresses to be meaningful between the host and all of the |
| devices within a context hence supporting the use of pointer based data |
| structures in OpenCL kernels. It logically extends a portion of the |
| global memory into the host address space giving work-items access to |
| the host address space. On platforms with hardware support for a shared |
| address space between the host and one or more devices, SVM may also |
| provide a more efficient way to share data between devices and the host. |
| Details about SVM are presented in section 3.3.3.</p></div> |
| <div class="paragraph"><p><span class="image"> |
| <img src="opencl22-API_files/image008.jpg" alt="image"> |
| </span></p></div> |
| <div class="paragraph"><p><strong>Figure 3-4: The named address spaces exposed in an OpenCL Platform. |
| Global and Constant memories are shared between the one or more devices |
| within a context, while local and private memories are associated with a |
| single device. Each device may include an optional cache to support |
| efficient access to their view of the global and constant address |
| spaces.</strong></p></div> |
| <div class="paragraph"><p>A programmer may use the features of the memory consistency model |
| (section 3.3.4) to manage safe access to global memory from multiple |
| work-items potentially running on one or more devices. In addition, when |
| using shared virtual memory (SVM), the memory consistency model may also |
| be used to ensure that host threads safely access memory locations in |
| the shared memory region.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_memory_model_memory_objects">3.3.2. Memory Model: Memory Objects</h4> |
| <div class="paragraph"><p>The contents of global memory are <em>memory objects</em>. A memory object is |
| a handle to a reference counted region of global memory. Memory objects |
| use the OpenCL type <em>cl_mem</em> and fall into three distinct classes.</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>Buffer</strong>: A memory object |
| stored as a block of contiguous memory and used as a general purpose |
| object to hold data used in an OpenCL program. The types of the values |
| within a buffer may be any of the built-in types (such as int, float),
| vector types, or user-defined structures. The buffer can be |
| manipulated through pointers much as one would with any block of memory |
| in C. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Image</strong>: An image memory |
| object holds one, two or three dimensional images. The formats are |
| based on the standard image formats used in graphics applications. An |
| image is an opaque data structure managed by functions defined in the |
| OpenCL API. To optimize the manipulation of images stored in the |
| texture memories found in many GPUs, OpenCL kernels have traditionally |
| been disallowed from both reading and writing a single image. In OpenCL |
| 2.0, however, we have relaxed this restriction by providing |
| synchronization and fence operations that let programmers properly |
| synchronize their code to safely allow a kernel to read and write a |
| single image. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Pipe</strong>: The <em>pipe</em> memory |
| object conceptually is an ordered sequence of data items. A pipe has |
| two endpoints: a write endpoint into which data items are inserted, and |
| a read endpoint from which data items are removed. At any one time, |
| only one kernel instance may write into a pipe, and only one kernel |
| instance may read from a pipe. To support the producer consumer design |
| pattern, one kernel instance connects to the write endpoint (the |
| producer) while another kernel instance connects to the reading endpoint |
| (the consumer). |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>Â </p></div> |
| <div class="paragraph"><p>Memory objects are allocated by host APIs. The host program can provide |
| the runtime with a pointer to a block of contiguous memory to hold the
| memory object when the object is created (CL_MEM_USE_HOST_PTR).
| Alternatively, the physical memory can be managed by the OpenCL runtime |
| and not be directly accessible to the host program. |
| <br> |
| <br> |
| Allocation and access to memory objects within the different memory |
| regions varies between the host and work-items running on a device. |
| This is summarized in table 3-1, which describes whether the kernel or
| the host can allocate from a memory region, the type of allocation |
| (static at compile time vs. dynamic at runtime) and the type of access |
| allowed (i.e. whether the kernel or the host can read and/or write to a |
| memory region).</p></div> |
| <div style="page-break-after:always"></div> |
| <table class="tableblock frame-all grid-all" |
| style=" |
| width:80%; |
| "> |
| <col style="width:20%;"> |
| <col style="width:20%;"> |
| <col style="width:20%;"> |
| <col style="width:20%;"> |
| <col style="width:20%;"> |
| <tbody> |
| <tr> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock"></p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">Global</p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">Constant</p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">Local</p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">Private</p></td> |
| </tr> |
| <tr> |
| <td class="tableblock halign-left valign-top" rowspan="2" ><p class="tableblock">Host</p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">Dynamic Allocation</p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">Dynamic Allocation</p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">Dynamic Allocation</p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">No Allocation</p></td> |
| </tr> |
| <tr> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">Read/Write access to buffers and images but not pipes</p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">Read/Write access</p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">No access</p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">No access</p></td> |
| </tr> |
| <tr> |
| <td class="tableblock halign-left valign-top" rowspan="2" ><p class="tableblock">Kernel</p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">Static Allocation for program scope variables</p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">Static Allocation</p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">Static Allocation. Dynamic allocation for child kernel</p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">Static Allocation</p></td> |
| </tr> |
| <tr> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">Read/Write access</p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">Read-only access</p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">Read/Write access. No access to child’s local memory.</p></td> |
| <td class="tableblock halign-left valign-top" ><p class="tableblock">Read/Write access</p></td> |
| </tr> |
| </tbody> |
| </table> |
| <div class="paragraph"><p>Â </p></div> |
| <div class="paragraph"><p><strong>Table 3 1: The different memory regions in |
| OpenCL and how memory objects are allocated and accessed by the host and |
| by an executing instance of a kernel. For the case of kernels, we |
| distinguish between the behavior of local memory with respect to a |
| kernel (self) and its child kernels.</strong></p></div> |
| <div class="paragraph"><p>Once allocated, a memory object is made available to kernel-instances |
| running on one or more devices. In addition to shared virtual memory |
| (section 3.3.3) there are three basic ways to manage the contents of |
| buffers between the host and devices.</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>Read/Write/Fill |
| commands</strong>: The data associated with a memory object is explicitly read |
| and written between the host and global memory regions using commands |
| enqueued to an OpenCL command queue. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Map/Unmap commands</strong>: Data |
| from the memory object is mapped into a contiguous block of memory |
| accessed through a host accessible pointer. The host program enqueues a |
| <em>map</em> command on block of a memory object before it can be safely |
| manipulated by the host program. When the host program is finished |
| working with the block of memory, the host program enqueues an <em>unmap</em> |
| command to allow a kernel-instance to safely read and/or write the |
| buffer.** |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Copy commands:</strong> The data |
| associated with a memory object is copied between two buffers, each of |
| which may reside either on the host or on the device. |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>Â </p></div> |
| <div class="paragraph"><p>In both cases, the commands to transfer data between devices and the |
| host can be blocking or non-blocking operations. The OpenCL function |
| call for a blocking memory transfer returns once the associated memory |
| resources on the host can be safely reused. For a non-blocking memory |
| transfer, the OpenCL function call returns as soon as the command is |
| enqueued. |
| <br> |
| <br> |
| Memory objects are bound to a context and hence can appear in multiple |
| kernel-instances running on more than one physical device. The OpenCL |
| platform must support a large range of hardware platforms including |
| systems that do not support a single shared address space in hardware; |
| hence the ways memory objects can be shared between kernel-instances is |
| restricted. The basic principle is that multiple read operations on |
| memory objects from multiple kernel-instances that overlap in time are |
| allowed, but mixing overlapping reads and writes into the same memory |
| objects from different kernel instances is only allowed when fine |
| grained synchronization is used with shared virtual memory (see section |
| 3.3.3). |
| <br> |
| <br> |
| When global memory is manipulated by multiple kernel-instances running |
| on multiple devices, the OpenCL runtime system must manage the |
| association of memory objects with a given device. In most cases the |
| OpenCL runtime will implicitly associate a memory object with a device. |
| A kernel instance is naturally associated with the command queue to |
| which the kernel was submitted. Since a command-queue can only access a |
| single device, the queue uniquely defines which device is involved with |
| any given kernel-instance; hence defining a clear association between |
| memory objects, kernel-instances and devices. Programmers may |
| anticipate these associations in their programs and explicitly manage |
| association of memory objects with devices in order to improve |
| performance.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_memory_model_shared_virtual_memory">3.3.3. Memory Model: Shared Virtual Memory</h4> |
| <div class="paragraph"><p>OpenCL extends the global memory region into the host memory region |
| through a shared virtual memory (SVM) mechanism. There are three types |
| of SVM in OpenCL</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>Coarse-Grained buffer |
| SVM</strong>: Sharing occurs at the granularity of regions of OpenCL buffer |
| memory objects. Consistency is enforced at synchronization points and |
| with map/unmap commands to drive updates between the host and the |
| device. This form of SVM is similar to non-SVM use of memory; however, |
| it lets kernel-instances share pointer-based data structures (such as |
| linked-lists) with the host program. Program scope global variables are |
| treated as per-device coarse-grained SVM for addressing and sharing |
| purposes. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Fine-Grained buffer |
| SVM</strong>: Sharing occurs at the granularity of individual loads/stores into |
| bytes within OpenCL buffer memory objects. Loads and stores may be
| cached, so consistency is guaranteed at synchronization points.
| If the optional OpenCL atomics are supported, they can be used to |
| provide fine-grained control of memory consistency. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>Fine-Grained system SVM</strong>: |
| Sharing occurs at the granularity of individual loads/stores into bytes |
| occurring anywhere within the host memory. Loads and stores may be |
| cached so consistency is guaranteed at synchronization points. If the |
| optional OpenCL atomics are supported, they can be used to provide |
| fine-grained control of memory consistency. |
| </p> |
| </li> |
| </ul></div> |
| <table class="tableblock frame-all grid-all" |
| style=" |
| width:100%; |
| "> |
| <caption class="title">Table 1. <strong>A summary of shared virtual memory (SVM) options in OpenCL</strong></caption> |
| <col style="width:20%;"> |
| <col style="width:20%;"> |
| <col style="width:20%;"> |
| <col style="width:20%;"> |
| <col style="width:20%;"> |
| <tbody> |
| <tr> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock"></p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">Granularity of sharing</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">Memory Allocation</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">Mechanisms to enforce Consistency</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">Explicit updates |
| between host and device</p></td> |
| </tr> |
| <tr> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">Non-SVM buffers</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">OpenCL Memory objects(buffer)</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">clCreateBuffer</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">Host synchronization points on the same or between |
| devices.</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">yes, through Map and Unmap commands.</p></td> |
| </tr> |
| <tr> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">Coarse-Grained buffer SVM</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">OpenCL Memory objects (buffer)</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">clSVMAlloc</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">Host synchronization points |
| between devices</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">yes, through Map and Unmap commands.</p></td> |
| </tr> |
| <tr> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">Fine Grained buffer SVM</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">Bytes within OpenCL Memory objects (buffer)</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">clSVMAlloc</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">Synchronization points plus atomics (if supported)</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">No</p></td> |
| </tr> |
| <tr> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">Fine-Grained system SVM</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">Bytes within Host memory (system)</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">Host memory allocation mechanisms (e.g. malloc)</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">Synchronization points plus atomics (if |
| supported)</p></td> |
| <td class="tableblock halign-center valign-top" ><p class="tableblock">No</p></td> |
| </tr> |
| </tbody> |
| </table> |
| <div class="paragraph"><p>Coarse-Grained buffer SVM is required in the core OpenCL specification. |
| The two finer grained approaches are optional features in OpenCL. The |
| various SVM mechanisms to access host memory from the work-items |
| associated with a kernel instance are summarized in table 3-2.</p></div> |
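| <div class="paragraph"><p>As a sketch of the required coarse-grained buffer SVM path
| (<em>context</em>, <em>queue</em> and <em>kernel</em> are assumed to exist; error
| handling is omitted):</p></div>
| <div class="listingblock">
| <div class="content">
| <pre><code>/* Allocate a coarse-grained SVM buffer visible to the host and devices. */
| float *svm = (float *)clSVMAlloc(context, CL_MEM_READ_WRITE,
|                                  1024 * sizeof(float), 0);
| 
| /* Map before the host touches the allocation... */
| clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, svm,
|                 1024 * sizeof(float), 0, NULL, NULL);
| for (size_t i = 0; i &lt; 1024; ++i)
|     svm[i] = (float)i;
| /* ...and unmap before a kernel-instance uses it. */
| clEnqueueSVMUnmap(queue, svm, 0, NULL, NULL);
| 
| /* The same pointer is meaningful on the device, so pointer-based data
|  * structures can be shared directly with the kernel. */
| size_t gws = 1024;
| clSetKernelArgSVMPointer(kernel, 0, svm);
| clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &amp;gws, NULL, 0, NULL, NULL);
| 
| clFinish(queue);
| clSVMFree(context, svm);</code></pre>
| </div></div>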
| </div> |
| <div class="sect3"> |
| <h4 id="_memory_model_memory_consistency_model">3.3.4. Memory Model: Memory Consistency Model</h4> |
| <div class="paragraph"><p>The OpenCL memory model tells programmers what they can expect from an |
| OpenCL implementation; which memory operations are guaranteed to happen |
| in which order and which memory values each read operation will return. |
| The memory model tells compiler writers which restrictions they must |
| follow when implementing compiler optimizations; which variables they |
| can cache in registers and when they can move reads or writes around a |
| barrier or atomic operation. The memory model also tells hardware |
| designers about limitations on hardware optimizations; for example, when |
| they must flush or invalidate hardware caches. |
| <br> |
| <br> |
| The memory consistency model in OpenCL is based on the memory model from |
| the ISO C11 programming language. To help make the presentation more |
| precise and self-contained, we include modified paragraphs taken |
| verbatim from the ISO C11 international standard. When a paragraph is |
| taken or modified from the C11 standard, it is identified as such along |
| with its original location in the C11 standard. |
| <br> |
| <br> |
| For programmers, the most intuitive model is the <em>sequential |
| consistency</em> memory model. Sequential consistency interleaves the steps |
| executed by each of the units of execution. Each access to a memory |
| location sees the last assignment to that location in that |
| interleaving. While sequential consistency is relatively |
| straightforward for a programmer to reason about, implementing |
| sequential consistency is expensive. Therefore, OpenCL implements a |
| relaxed memory consistency model; i.e. it is possible to write programs |
| where the loads from memory violate sequential consistency. Fortunately, |
| if a program does not contain any races and if the program only uses |
| atomic operations that utilize the sequentially consistent memory order |
| (the default memory ordering for OpenCL), OpenCL programs appear to |
| execute with sequential consistency. |
| <br> |
| <br> |
| Programmers can, to some degree, control how the memory model is relaxed by choosing the memory order for synchronization operations. The precise semantics of synchronization and the memory orders are formally defined in section 3.3.6. Here, we give a high-level description of how these memory orders apply to atomic operations on atomic objects shared between units of execution. OpenCL memory_order choices are based on those from the ISO C11 standard memory model. They are specified in certain OpenCL functions through the following enumeration constants:</p></div>
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>memory_order_relaxed</strong>: |
| implies no order constraints. This memory order can be used safely to |
| increment counters that are concurrently incremented, but it doesn't
| guarantee anything about the ordering with respect to operations to |
| other memory locations. It can also be used, for example, to do ticket |
| allocation and by expert programmers implementing lock-free algorithms. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>memory_order_acquire</strong>: A |
| synchronization operation (fence or atomic) that has acquire semantics |
| "acquires" side-effects from a release operation that synchronises with |
| it: if an acquire synchronises with a release, the acquiring unit of |
| execution will see all side-effects preceding that release (and possibly |
| subsequent side-effects.) As part of carefully-designed protocols, |
| programmers can use an "acquire" to safely observe the work of another |
| unit of execution. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>memory_order_release</strong>: A |
| synchronization operation (fence or atomic operation) that has release |
| semantics "releases" side effects to an acquire operation that |
| synchronises with it. All side effects that precede the release are |
| included in the release. As part of carefully-designed protocols, |
| programmers can use a "release" to make changes made in one unit of |
| execution visible to other units of execution. |
| </p> |
| </li> |
| </ul></div> |
| <div class="admonitionblock"> |
| <table><tr> |
| <td class="icon"> |
| <div class="title">Note</div> |
| </td> |
| <td class="content">In general, no acquire must <em>always</em> synchronise with any |
| particular release. However, synchronisation can be forced by certain |
| executions. See 3.3.6.2 for detailed rules for when synchronisation |
| must occur.</td> |
| </tr></table> |
| </div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| <strong>memory_order_acq_rel</strong>: A |
| synchronization operation with acquire-release semantics has the |
| properties of both the acquire and release memory orders. It is |
| typically used to order read-modify-write operations. |
| </p> |
| </li> |
| <li> |
| <p> |
| <strong>memory_order_seq_cst</strong>: |
| The loads and stores of each unit of execution appear to execute in |
| program (i.e., sequenced-before) order, and the loads and stores from |
| different units of execution appear to be simply interleaved. |
| <br> |
| <br> |
| Regardless of which memory_order is specified, resolving constraints on |
| memory operations across a heterogeneous platform adds considerable |
|