-
Notifications
You must be signed in to change notification settings - Fork 102
/
index.html
683 lines (668 loc) · 25.1 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<link rel="stylesheet" href="../common-revealjs/css/reveal.css">
<link rel="stylesheet" href="../common-revealjs/css/theme/white.css">
<link rel="stylesheet" href="../common-revealjs/css/custom.css">
<script>
// This is needed when printing the slides to pdf
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? '../common-revealjs/css/print/pdf.css' : '../common-revealjs/css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
</script>
<script>
// This is used to display the static images on each slide,
// See global-images in this html file and custom.css
(function() {
if(window.addEventListener) {
window.addEventListener('load', () => {
let slides = document.getElementsByClassName("slide-background");
if (slides.length === 0) {
slides = document.getElementsByClassName("pdf-page")
}
// Insert global images on each slide
for(let i = 0, max = slides.length; i < max; i++) {
let cln = document.getElementById("global-images").cloneNode(true);
cln.removeAttribute("id");
slides[i].appendChild(cln);
}
// Remove top level global images
let elem = document.getElementById("global-images");
elem.parentElement.removeChild(elem);
}, false);
}
})();
</script>
</head>
<body>
<div class="reveal">
<div class="slides">
<div id="global-images" class="global-images">
<img src="../common-revealjs/images/sycl_academy.png" />
<img src="../common-revealjs/images/sycl_logo.png" />
<img src="../common-revealjs/images/trademarks.png" />
<img src="../common-revealjs/images/codeplay.png" />
</div>
<!--Slide 1-->
<section class="hbox">
<div class="hbox" data-markdown>
## A Quick Introduction to SYCL
</div>
</section>
<!--Slide 2-->
<section class="hbox">
<div class="container" data-markdown>
## Learning Objectives
</div>
<div class="container" data-markdown>
* Quick SYCL introduction
* Writing a one-page application
</div>
</section>
<!--Slide 3-->
<section>
<div class="hbox" data-markdown>
#### What is SYCL?
</div>
<div class="container" data-markdown>
![SYCL](../common-revealjs/images/sycl_and_cpp.png "SYCL")
</div>
<div class="hbox" data-markdown>
SYCL is a single source, high-level, standard C++ programming model, that can target a range of heterogeneous platforms
</div>
</section>
<!--Slide 4-->
<section>
<div class="hbox" data-markdown>
#### What is SYCL?
</div>
<div class="container" data-markdown>
A first example of SYCL code.
</div>
<object class="r-stretch" data="../common-revealjs/images/sycl_first_sample_code.svg"></object>
</section>
<!--Slide 5-->
<section>
<div class="hbox" data-markdown>
#### SYCL Key concepts
</div>
<div class="container" data-markdown>
* SYCL is a C++-based programming model:
* Device code and host code exist in the same file
* Device code can use templates and other C++ features
* Designed with "modern" C++ in mind
* SYCL provides high-level abstractions over common boilerplate code
* Platform/device selection
* Buffer creation and data movement
* Kernel function compilation
* Dependency management and scheduling
</div>
</section>
<!--Slide 6-->
<section>
<div class="hbox" data-markdown>
#### Image convolution sample application
</div>
<div class="container" data-markdown>
* The algorithm is **embarrassingly parallel**.
* Each work-item in the computation can be calculated entirely independently.
* The algorithm is computation heavy.
* A large number of operations are performed for each work-item in the computation, particularly when using large filters.
* The algorithm requires a large bandwidth.
* A lot of data must be passed through the GPU to process an image, particularly if the image is very high resolution.
</div>
</section>
<!--Slide 7-->
<section>
<div class="hbox" data-markdown>
#### Image convolution definition
</div>
<div class="container">
<div class="col" data-markdown>
![SYCL](../common-revealjs/images/image_convolution_definition.png "SYCL")
</div>
<div class="col" data-markdown>
* A filter of a given size is applied as a stencil to the position of each pixel in the input image.
* Each pixel covered by the filter is then multiples with the corresponding element in the filter.
* The result of these multiplications is then summed to give the resulting output pixel.
* Here we have a 3x3 gaussian blur approximation as an example.
</div>
</div>
</section>
<!--Slide 8-->
<section>
<div class="hbox" data-markdown>
#### SYCL System Topology
</div>
<div class="container" data-markdown>
* A SYCL application can execute work across a range of different heterogeneous devices.
* The devices that are available in any given system are determined at runtime through topology discovery.
</div>
</section>
<!--Slide 9-->
<section>
<div class="hbox" data-markdown>
#### Platforms and devices
</div>
<div class="container" data-markdown>
* The SYCL runtime will discover a set of platforms that are available in the system.
* Each platform represents a backend implementation such as Intel OpenCL or Nvidia CUDA.
* The SYCL runtime will also discover all the devices available for each of those platforms.
* CPU, GPU, FPGA, and other kinds of accelerators.
</div>
<div class="col" data-markdown>
![SYCL](../common-revealjs/images/devices-1.png "SYCL-Devices")
</div>
</section>
<!--Slide 10-->
<section>
<div class="hbox" data-markdown>
#### Querying with a device selector
</div>
<div class="container">
<div class="col-left-1" data-markdown>
![Device Topology](../common-revealjs/images/topology-1.png "Device-Topology")
</div>
<div class="col-right-1" data-markdown>
* To simplify the process of traversing the system topology SYCL provides device selectors.
* A device selector is is a callable C++ object which defines a heuristic for scoring devices.
* SYCL provides a number of standard device selectors, e.g. `default_selector_v`, `gpu_selector_v`, etc.
* Users can also create their own device selectors.
</div>
</div>
</section>
<!--Slide 11-->
<section>
<div class="hbox" data-markdown>
#### Querying with a device selector
</div>
<div class="container">
<div class="col-left-1">
<code><pre>
auto gpuDevice = device(gpu_selector_v);
</code></pre>
</div>
<div class="col-right-1" data-markdown>
![Device Topology](../common-revealjs/images/topology-1.png "Device-Topology")
</div>
</div>
<div class="container" data-markdown>
* A device selector takes a parameter of type `const device &` and gives it a "score".
* Used to query all devices and return the one with the highest "score".
* A device with a negative score will never be chosen.
</div>
</section>
<!--Slide 12-->
<section>
<div class="hbox" data-markdown>
#### SYCL Queues
</div>
<div class="container">
<div class="col" data-markdown>
![SYCL](../common-revealjs/images/out_of_order_execution.png "SYCL")
</div>
<div class="col" data-markdown>
* Commands are submitted to devices in SYCL by means of a Queue
* SYCL `queue`s are by default out-of-order.
* This means commands are allowed to be overlapped, re-ordered, and executed concurrently, providing dependencies are honoured to ensure consistency.
</div>
</div>
</section>
<!--Slide 13-->
<section>
<div class="hbox" data-markdown>
#### In-of-order execution
</div>
<div class="container">
<div class="col" data-markdown>
![SYCL](../common-revealjs/images/in_order_execution.png "SYCL")
</div>
<div class="col" data-markdown>
* SYCL `queue`s can be configured to be in-order.
* This mean commands must execute strictly in the order they were enqueued.
</div>
</div>
</section>
<!--Slide 14-->
<section>
<div class="hbox" data-markdown>
#### Memory Models
</div>
<div class="container" data-markdown>
* In SYCL there are two models for managing data:
* The buffer/accessor model.
* The USM (unified shared memory) model.
* Which model you choose can have an effect on how you enqueue kernel functions.
</div>
</section>
<!--Slide 15-->
<section>
<div class="hbox" data-markdown>
#### CPU and GPU Memory
</div>
<div class="container" data-markdown>
* A GPU has its own memory, separate to CPU memory.
* In order for the GPU to use memory from the CPU, the following actions must take place (either explicitly or implicitly):
* Memory allocation on the GPU.
* Data migration from the CPU to the allocation on the GPU.
* Some computation on the GPU.
* Migration of the result back to the CPU.
</div>
<img src="../common-revealjs/images/gpu_cpu_memory.png" height="400">
</section>
<!--Slide 16-->
<section>
<div class="hbox" data-markdown>
#### SYCL Buffers & Accessors
</div>
<div class="container" data-markdown>
* The buffer/accessor model separates the storage and access of data
* A SYCL buffer manages data across the host and any number of devices
* A SYCL accessor requests access to data on the host or on a device for a specific SYCL kernel function
* Accessors are also used to access data within a SYCL kernel function
* This means they are declared in the host code but captured by and then accessed within a SYCL kernel function
</div>
</section>
<!--Slide 17-->
<section>
<div class="hbox" data-markdown>
#### SYCL Buffers & Accessors
</div>
<div class="container">
<div class="col" data-markdown>
* When a buffer object is constructed it will not allocate or copy to device memory at first
* This will only happen once the SYCL runtime knows the data needs to be accessed and where it needs to be accessed
</div>
<div class="col" data-markdown>
![Buffer Host Memory](../common-revealjs/images/buffer-hostmemory.png "Buffer Host Memory")
</div>
</div>
</section>
<!--Slide 18-->
<section>
<div class="hbox" data-markdown>
#### SYCL Buffers & Accessors
</div>
<div class="container">
<div class="col" data-markdown>
* Constructing an accessor specifies a request to access the data managed by the buffer
* There are a range of different types of accessor which provide different ways to access data
</div>
<div class="col" data-markdown>
![Buffer Host Memory Accessor](../common-revealjs/images/buffer-hostmemory-accessor.png "Buffer Host Memory Accessor")
</div>
</div>
</section>
<!--Slide 19-->
<section>
<div class="hbox" data-markdown>
#### Accessing Data With Accessors
</div>
<div class="container">
<div class="col-left-3">
<code><pre>
buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
buffer<float, 1> bufB(dB.data(), range<1>(dB.size()));
buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));
gpuQueue.submit([&](handler &cgh){
sycl::accessor inA{bufA, cgh, sycl::read_only};
sycl::accessor inB{bufB, cgh, sycl::read_only};
sycl::accessor out{bufO, cgh, sycl::write_only};
cgh.parallel_for<add>(range<1>(dA.size()),
[=](id<1> i){
<mark>out[i] = inA[i] + inB[i];</mark>
});
});
</code></pre>
</div>
<div class="col-right-2" data-markdown>
* Here we access the data of the `accessor` by
passing in the `id` passed to the SYCL kernel
function.
</div>
</div>
</section>
<!--Slide 20-->
<section>
<div class="hbox" data-markdown>
#### USM: Malloc_device
</div>
<div class="container">
<code class="code-100pc"><pre>
void* malloc_device(size_t numBytes, const queue& syclQueue, const property_list &propList = {});
template <typename T>
T* malloc_device(size_t count, const queue& syclQueue, const property_list &propList = {});
</code></pre>
</div>
<div class="container" data-markdown>
* A USM device allocation is performed by calling one of the `malloc_device` functions.
* Both of these functions allocate the specified region of memory on the `device` associated with the specified `queue`.
* The pointer returned is only accessible in a kernel function running on that `device`.
* Synchronous exception if the device does not have aspect::usm_device_allocations
* This is a blocking operation.
</div>
</section>
<!--Slide 22-->
<section>
<div class="hbox" data-markdown>
#### USM: Free
</div>
<div class="container">
<code class="code-100pc"><pre>
void free(void* ptr, queue& syclQueue);
</code></pre>
</div>
<div class="container" data-markdown>
* In order to prevent memory leaks USM device allocations must be free by calling the `free` function.
* The `queue` must be the same as was used to allocate the memory.
* This is a blocking operation.
</div>
</section>
<!--Slide 23-->
<section>
<div class="hbox" data-markdown>
#### USM: Memcpy
</div>
<div class="container">
<code class="code-100pc"><pre>
event queue::memcpy(void* dest, const void* src, size_t numBytes, const std::vector<event> &depEvents);
</code></pre>
</div>
<div class="container" data-markdown>
* Data can be copied to and from a USM device allocation by calling the `queue`'s `memcpy` member function.
* The source and destination can be either a host application pointer or a USM device allocation.
* This is an asynchronous operation enqueued to the `queue`.
* An `event` is returned which can be used to synchronize with the completion of copy operation.
* May depend on other events via `depEvents`
</div>
</section>
<!--Slide 24-->
<section>
<div class="hbox" data-markdown>
#### SYCL execution model
</div>
<div class="container">
<div class="col-left" data-markdown>
* SYCL kernel functions are executed by **work-items**
* You can think of a work-item as a thread of execution
* Each work-item will execute a SYCL kernel function from start to end
* A work-item can run on CPU threads, SIMD lanes, GPU threads, or any other kind of processing element
</div>
<div class="col-right" data-markdown>
![Work-Item](../common-revealjs/images/workitem.png "Work-Item")
</div>
</div>
</section>
<!--Slide 25-->
<section>
<div class="hbox" data-markdown>
#### SYCL execution model
</div>
<div class="container">
<div class="col" data-markdown>
* Work-items are collected together into **work-groups**
* The size of work-groups is generally relative to what is optimal on the device being targeted
* It can also be affected by the resources used by each work-item
</div>
<div class="col" data-markdown>
![Work-Group](../common-revealjs/images/workgroup.png "Work-Group")
</div>
</div>
</section>
<!--Slide 26-->
<section>
<div class="hbox" data-markdown>
#### SYCL execution model
</div>
<div class="container">
<div class="col" data-markdown>
* SYCL kernel functions are invoked within an **nd-range**
* An nd-range has a number of work-groups and subsequently a number of work-items
* Work-groups always have the same number of work-items
</div>
<div class="col" data-markdown>
![ND-Range](../common-revealjs/images/ndrange.png "ND-Range")
</div>
</div>
</section>
<!--Slide 27-->
<section>
<div class="hbox" data-markdown>
#### SYCL execution model
</div>
<div class="container">
<div class="col" data-markdown>
* The nd-range describes an **iteration space**: how it is composed in terms of work-groups and work-items
* An nd-range can be 1, 2 or 3 dimensions
* An nd-range has two components
* The **global-range** describes the total number of work-items in each dimension
* The **local-range** describes the number of work-items in a work-group in each dimension
</div>
<div class="col" data-markdown>
![ND-Range](../common-revealjs/images/ndrange-example.png "ND-Range")
</div>
</div>
</section>
<!--Slide 28-->
<section>
<div class="hbox" data-markdown>
#### Expressing parallelism
</div>
<div class="container">
<div class="col">
<code><pre>
cgh.parallel_for<kernel>(<mark>range<1>(1024)</mark>,
[=](<mark>id<1> idx</mark>){
/* kernel function code */
});
</code></pre>
<code><pre>
cgh.parallel_for<kernel>(<mark>range<1>(1024)</mark>,
[=](<mark>item<1> item</mark>){
/* kernel function code */
});
</code></pre>
<code><pre>
cgh.parallel_for<kernel>(nd_range<1>(<mark>range<1>(1024),
range<1>(32))</mark>,[=](<mark>nd_item<1> ndItem</mark>){
/* kernel function code */
});
</code></pre>
</div>
<div class="col" data-markdown>
* Overload taking a **range** object specifies the global range, runtime decides local range
* An **id** parameter represents the index within the global range
____________________________________________________________________________________________
* Overload taking a **range** object specifies the global range, runtime decides local range
* An **item** parameter represents the global range and the index within the global range
____________________________________________________________________________________________
* Overload taking an **nd_range** object specifies the global and local range
* An **nd_item** parameter represents the global and local range and index
</div>
</div>
</section>
<!--Slide 30-->
<section>
<div class="hbox" data-markdown>
#### SYCL Kernels
</div>
<div class="container" data-markdown>
* SYCL kernels (i.e. the device function the programmer wants
executed) are expressed either as C++ function objects or lambdas.
* Comparing with other GPU paradigms, kernel arguments are either
data members or lambda captures, respectively
* For function objects, the member function `operator()(sycl::id)`
is the compute function, which is equivalent to the lambda style
// then add slides explaining the practical work and how to
// glue it all together from scratch
</div>
</section>
<!--Slide 31-->
<section>
<div class="hbox" data-markdown>
#### Function object
</div>
<div class="container" data-markdown>
```c++
class MyKernel {
sycl::accessor<float> input_;
float* output_;
MyKernel(sycl::buffer<float> buf, float* output, sycl::handler& h)
: input{buf.get_access(h)}, output_{output} {}
// const is required
void operator()(sycl::item<1> i) const {
; // computation here
}
};
```
</div>
<div class="container" data-markdown>
The members are accessible on the device inside the function call
operator.
</div>
</section>
<!--Slide 32-->
<section>
<div class="hbox" data-markdown>
#### Lambda function
</div>
<div class="container" data-markdown>
```c++
sycl::buffer<float, 1> buf = /* normal init */;
float * output = sycl::malloc_device(/* params */);
... queue submit as normal ...
auto acc = buf.get_access(h);
auto func = [=](sycl::item<1> i) {
acc[i] = someVal;
output[i.get_global_linear_id()] = someOtherVal;
});
handler.parallel_for(range, func);
```
</div>
<div class="container" data-markdown>
The variables used implicitly are captured by value and are usable in
the kernel.
</div>
</section>
<!--Slide 33-->
<section>
<div class="hbox" data-markdown>
#### SYCL Kernels
</div>
<div class="container" data-markdown>
* These forms are equivalent (as in normal C++) and which one to use
depends on preference and use case
* Each instance of the kernel has a uniquely valued `sycl::item`
describing its position in the iteration space as covered in
"SYCL execution model"
* Can be used to index into accessors, pointers etc.
</div>
</section>
<!--Slide 34-->
<section>
<div class="hbox" data-markdown>
#### First Exercise
</div>
<div class="container" data-markdown>
* Use Data Parallelism and write a SYCL kernel
* The exercise README is in the "Code_Exercises/Data_Parallelism" folder
* Follow the guidelines in the README and comments in the source.cpp file
* There is a solution file but only use it if you need to
</div>
</section>
<!--Slide 35-->
<section>
<div class="hbox" data-markdown>
#### Image convolution example
</div>
<div class="container" data-markdown>
![SYCL](../common-revealjs/images/image_convolution_example.png "SYCL")
</div>
</section>
<!--Slide 36-->
<section>
<div class="hbox" data-markdown>
#### Image convolution data flow
</div>
<div class="container">
<div class="col" data-markdown>
![SYCL](../common-revealjs/images/image_convolution_data_flow.png "SYCL")
</div>
<div class="col" data-markdown>
* The kernel must read from the input image data and writes to the output image data.
* It must also read from the filter.
* The input image data and the filter don't need to be copied back to the host.
* The output image data can be uninitialized.
</div>
</div>
</section>
<!--Slide 37-->
<section>
<div class="hbox" data-markdown>
#### Gaussian Blur
</div>
<div class="container" data-markdown>
* The blur is implementable in two passes: one vertical, one horizontal
* Each item in the iteration space would load a series of pixels to
the left and right (or top and bottom) of its position in the
image
* Then, multiply with the corresponding element in the filter
* Then sum these intermediate results and store them to the output
* A dscrete convolution, in other words
</div>
</section>
<!--Slide 38-->
<section>
<div class="hbox" data-markdown>
#### Walkthrough
</div>
<div class="container" data-markdown>
* Explain a more complex example through image convolution
* Solution available in `Code_Exercises/Functors/solution.cpp`
</div>
</section>
<!--Slide 39-->
<section>
<div class="hbox" data-markdown>
#### Reference image
</div>
<div class="col" data-markdown>
![SYCL](../common-revealjs/images/dogs.png "SYCL")
</div>
<div class="container" data-markdown>
* We provide a reference image to use in the exercise.
* This is in Code_Exercises/Images
* This image is a 512x512 RGBA png.
* Feel free to use your own image but we recommend keeping to this format.
* The read and write functions can be found in `image_conv.h`.
</div>
</section>
<!--Slide 40-->
<section>
<div class="hbox" data-markdown>
#### Convolution filters
</div>
<div class="container" data-markdown>
* `auto filter = util::generate_filter(util::filter_type filterType, int width);`
* The utility for generating the filter data takes a `filter_type` enum which
can be either `identity` or `blur` and a width.
* Feel free to experiment with different variations.
* Note that the filter width should always be an odd value.
* The function can be found in `image_conv.h`
</div>
</section>
</div>
</div>
<script src="../common-revealjs/js/reveal.js"></script>
<script src="../common-revealjs/plugin/markdown/marked.js"></script>
<script src="../common-revealjs/plugin/markdown/markdown.js"></script>
<script src="../common-revealjs/plugin/notes/notes.js"></script>
<script>
Reveal.initialize({mouseWheel: true, defaultNotes: true});
Reveal.configure({ slideNumber: true });
</script>
</body>
</html>