GPU Programming in Modern C++ is a three-day online training course with programming exercises taught by Gordon Brown and Michael Wong. It is offered online from 11AM to 5PM Eastern Time (EDT), Monday September 21st through Wednesday September 23rd, 2020 (after the conference).
Course Description
Parallel programming can be used to take advantage of heterogeneous architectures such as GPUs to significantly increase the performance of applications. It has gained a reputation for being difficult, but is it really? Modern C++ has gone a long way toward making parallel programming easier and more accessible, and the introduction of the SYCL programming model means heterogeneous programming is now more accessible than ever.
This course will teach you the fundamentals of parallelism: how to recognize when to use parallelism, how to make the best design choices, and which common parallel patterns can be used again and again. It will teach you how to use modern C++ and the SYCL programming model to create parallel algorithms for heterogeneous devices such as GPUs. Finally, it will teach you how to apply common GPU optimizations.
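To give a flavour of what this looks like in practice, below is a minimal sketch of a SYCL data-parallel kernel (a simple vector addition) written against the SYCL 2020 API. The exact header, device selection, and error handling vary between implementations and are covered step by step in the course; the variable names here are purely illustrative.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t n = 1024;
  std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

  // Create a queue on the most capable device available (typically a GPU if present).
  sycl::queue q{sycl::default_selector_v};

  {
    // Buffers manage data across host and device; accessors request access inside a kernel.
    sycl::buffer<float> bufA{a.data(), sycl::range<1>{n}};
    sycl::buffer<float> bufB{b.data(), sycl::range<1>{n}};
    sycl::buffer<float> bufC{c.data(), sycl::range<1>{n}};

    q.submit([&](sycl::handler& cgh) {
      sycl::accessor accA{bufA, cgh, sycl::read_only};
      sycl::accessor accB{bufB, cgh, sycl::read_only};
      sycl::accessor accC{bufC, cgh, sycl::write_only};

      // Enqueue a data-parallel kernel: one work-item per element.
      cgh.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        accC[i] = accA[i] + accB[i];
      });
    });
  } // Buffer destruction waits for the kernel and copies the results back to the host vectors.

  return c[0] == 3.0f ? 0 : 1;
}
```

The buffer/accessor model shown here lets the runtime track data dependencies automatically; Day 2 of the course introduces Unified Shared Memory (USM) as a pointer-based alternative.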
Prerequisites
This course requires the following:
- Working knowledge of C++11.
- Working knowledge of Git.
- Working knowledge of CMake.
We will also encourage attendees to install and configure a SYCL implementation and any dependencies on the computer they will be attending from. Attendees will be contacted about this before the class.
Course Schedule
Day 1
- Importance of Parallelism & Heterogeneity
- Intro to SYCL
- Enqueuing a Kernel
- Managing Data
- Handling Errors
- Topology & Device Discovery
- Configuring Queues and Contexts
- Data Parallelism
Day 2
- Fundamentals of Parallelism
- Intro to USM
- Using USM
- Asynchronous Execution
- Data & Dependencies
- In-order Queues
- Advanced Dataflow
- ND Range Kernels
Day 3
- GPU Optimization Principles
- Image Convolution Case Study
- Global Memory Coalescing
- Vectorization
- Local Memory
- Optimizing for Occupancy & Throughput
Course Topics
The aim of this course is to provide students with an understanding of parallelism and of how to develop for heterogeneous architectures such as GPUs. Students will gain an understanding of the fundamentals of parallelism and GPU architectures, as well as practical experience writing parallel applications using modern C++ and the SYCL programming model and applying common GPU optimisations.
Course Outcomes
- Understanding of why parallelism is important.
  - Understand the current landscape of computer architectures and their limitations.
  - Understand the performance benefits of parallelism.
  - Understand when and where parallelism is appropriate.
- Understanding of parallelism fundamentals.
  - Understand the difference between parallelism and concurrency.
  - Understand the difference between task parallelism and data parallelism.
  - Understand the balance of productivity, efficiency and portability.
- Understanding of parallel patterns.
  - Understand the importance of parallel patterns.
  - Understand common parallel patterns such as map, scatter, gather and stencil.
- Understanding of heterogeneous system architectures.
  - Understand the program execution and memory model of non-CPU architectures such as GPUs.
  - Understand SIMD execution and its benefits and limitations.
- Understanding of asynchronous programming (illustrated by the USM sketch after this list).
  - Understand how to execute work asynchronously.
  - Understand how to wait for the completion of asynchronous work.
  - Understand how to execute both task-parallel and data-parallel work.
- Understanding of the challenges of programming heterogeneous systems.
  - Understand the challenges of executing code on a remote device.
  - Understand how code can be offloaded to a remote co-processor.
  - Understand the effects of latency between different memory regions and important considerations for data movement.
  - Understand the importance of coalesced data access.
- Understanding of the SYCL programming model.
  - Understand the SYCL ecosystem and available implementations.
  - Understand how to install and configure a SYCL implementation.
  - Understand how to discover the device topology and create a queue.
  - Understand how to enqueue kernels to a queue.
  - Understand how to manage data using buffers and accessors.
  - Understand how to use a variety of other SYCL features for achieving performance on a GPU.
- Understanding of common GPU optimisations (illustrated by the local-memory sketch after this list).
  - Understand techniques for coalescing global memory access.
  - Understand techniques for utilising vectorisation.
  - Understand techniques for utilising local memory.
  - Understand techniques for hiding the latency of data movement.
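As a hedged illustration of the asynchronous programming outcomes above, the sketch below chains a copy to the device, a kernel, and a copy back, using events to express the dependencies between them. It assumes a SYCL 2020 implementation; the variable names and the trivial doubling kernel are purely illustrative.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t n = 1024;
  std::vector<float> host(n, 1.0f);

  sycl::queue q;

  // Unified Shared Memory (USM): allocate memory resident on the device.
  float* dev = sycl::malloc_device<float>(n, q);

  // Each operation is asynchronous and returns an event describing its completion.
  sycl::event copyIn = q.memcpy(dev, host.data(), n * sizeof(float));

  // Express the dependency explicitly: the kernel must wait for the copy to finish.
  sycl::event compute = q.parallel_for(
      sycl::range<1>{n}, copyIn, [=](sycl::id<1> i) { dev[i] *= 2.0f; });

  // The copy back depends on the kernel; waiting on it waits on the whole chain.
  q.memcpy(host.data(), dev, n * sizeof(float), compute).wait();

  sycl::free(dev, q);
  return host[0] == 2.0f ? 0 : 1;
}
```

An in-order queue (constructed with sycl::property::queue::in_order) removes the need to pass events explicitly, at the cost of less scheduling freedom; this trade-off corresponds to the "In-order Queues" and "Advanced Dataflow" sessions on Day 2.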
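Similarly, the following fragment sketches the shape of an ND-range kernel that stages data in local memory, one of the GPU optimisations listed above. It assumes SYCL 2020, that in and out are USM device pointers, and that n is a multiple of the work-group size; the staging adds nothing to this trivial computation and is only there to show the mechanics, while Day 3 applies these ideas in an image convolution case study.

```cpp
#include <sycl/sycl.hpp>

// Illustrative only: each work-group stages a tile of the input in fast local
// (on-chip) memory before computing. Requires n to be a multiple of wgSize.
void scale(sycl::queue& q, const float* in, float* out, size_t n, size_t wgSize) {
  q.submit([&](sycl::handler& cgh) {
    // One tile of local memory per work-group.
    sycl::local_accessor<float, 1> tile{sycl::range<1>{wgSize}, cgh};

    cgh.parallel_for(
        sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{wgSize}},
        [=](sycl::nd_item<1> item) {
          size_t gid = item.get_global_id(0);
          size_t lid = item.get_local_id(0);

          // Contiguous (coalesced) load from global memory into the local tile.
          tile[lid] = in[gid];

          // Make the tile visible to every work-item in the group before use.
          sycl::group_barrier(item.get_group());

          out[gid] = tile[lid] * 2.0f;
        });
  });
}
```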