GPU Programming in Modern C++ is a three-day online training course with programming exercises taught by Gordon Brown and Michael Wong. It is offered online from 11AM to 5PM Eastern Time (EDT), Monday September 21st through Wednesday September 23rd, 2020 (after the conference).
Course Description
Parallel programming can be used to take advantage of heterogeneous architectures such as GPUs to significantly increase the performance of applications. It has gained a reputation for being difficult, but is it really? Modern C++ has gone a long way toward making parallel programming easier and more accessible, and the introduction of the SYCL programming model means heterogeneous programming is now more accessible than ever.
This course will teach you the fundamentals of parallelism: how to recognize when to use parallelism, how to make the best design choices, and which common parallel patterns can be used again and again. It will teach you how to use modern C++ and the SYCL programming model to create parallel algorithms for heterogeneous devices such as GPUs. Finally, it will teach you how to apply common GPU optimizations.
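To give a flavour of what this looks like in practice, below is a minimal sketch of a SYCL data-parallel kernel (a simple vector addition) written against the SYCL 2020 API. The exact header, device selection, and error handling vary between implementations and are covered step by step in the course; the variable names here are purely illustrative.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t n = 1024;
  std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

  // Create a queue on the most capable device available (typically a GPU if present).
  sycl::queue q{sycl::default_selector_v};

  {
    // Buffers manage data across host and device; accessors request access inside a kernel.
    sycl::buffer<float> bufA{a.data(), sycl::range<1>{n}};
    sycl::buffer<float> bufB{b.data(), sycl::range<1>{n}};
    sycl::buffer<float> bufC{c.data(), sycl::range<1>{n}};

    q.submit([&](sycl::handler& cgh) {
      sycl::accessor accA{bufA, cgh, sycl::read_only};
      sycl::accessor accB{bufB, cgh, sycl::read_only};
      sycl::accessor accC{bufC, cgh, sycl::write_only};

      // Enqueue a data-parallel kernel: one work-item per element.
      cgh.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        accC[i] = accA[i] + accB[i];
      });
    });
  } // Buffer destruction waits for the kernel and copies the results back to the host vectors.

  return c[0] == 3.0f ? 0 : 1;
}
```

The buffer/accessor model shown here lets the runtime track data dependencies automatically; Day 2 of the course introduces Unified Shared Memory (USM) as a pointer-based alternative.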
Prerequisites
This course requires the following:
- Working knowledge of C++11.
- Working knowledge of Git.
- Working knowledge of CMake.
We will also encourage attendees to install and configure a SYCL implementation and any dependencies on the computer they will be attending from. Attendees will be contacted about this before the class.
Course Schedule
Day 1
- Importance of Parallelism & Heterogeneity
- Intro to SYCL
- Enqueuing a Kernel
- Managing Data
- Handling Errors
- Topology & Device Discovery
- Configuring Queues and Contexts
- Data Parallelism
Day 2
- Fundamentals of Parallelism
- Intro to USM
- Using USM
- Asynchronous Execution
- Data & Dependencies
- In-order Queues
- Advanced Dataflow
- ND Range Kernels
Day 3
- GPU Optimization Principles
- Image Convolution Case Study
- Global Memory Coalescing
- Vectorization
- Local Memory
- Optimizing for Occupancy & Throughput
Course Topics
The aim of this course is to provide students with an understanding of parallelism and of how to develop for heterogeneous architectures such as GPUs. Students will gain an understanding of the fundamentals of parallelism and GPU architectures, as well as practical experience writing parallel applications using modern C++ and the SYCL programming model and applying common GPU optimisations.
Course Outcomes
- Understanding of why parallelism is important.
  - Understand the current landscape of computer architectures and their limitations.
  - Understand the performance benefits of parallelism.
  - Understand when and where parallelism is appropriate.
- Understanding of parallelism fundamentals.
  - Understand the difference between parallelism and concurrency.
  - Understand the difference between task parallelism and data parallelism.
  - Understand the balance of productivity, efficiency and portability.
- Understanding of parallel patterns.
  - Understand the importance of parallel patterns.
  - Understand common parallel patterns such as map, scatter, gather and stencil.
- Understanding of heterogeneous system architectures.
  - Understand the program execution and memory model of non-CPU architectures such as GPUs.
  - Understand SIMD execution and its benefits and limitations.
- Understanding of asynchronous programming (illustrated by the USM sketch after this list).
  - Understand how to execute work asynchronously.
  - Understand how to wait for the completion of asynchronous work.
  - Understand how to execute both task-parallel and data-parallel work.
- Understanding of the challenges of programming heterogeneous systems.
  - Understand the challenges of executing code on a remote device.
  - Understand how code can be offloaded to a remote co-processor.
  - Understand the effects of latency between different memory regions and important considerations for data movement.
  - Understand the importance of coalesced data access.
- Understanding of the SYCL programming model.
  - Understand the SYCL ecosystem and available implementations.
  - Understand how to install and configure a SYCL implementation.
  - Understand how to discover the device topology and create a queue.
  - Understand how to enqueue kernels to a queue.
  - Understand how to manage data using buffers and accessors.
  - Understand how to use a variety of other SYCL features for achieving performance on a GPU.
- Understanding of common GPU optimisations (illustrated by the local-memory sketch after this list).
  - Understand techniques for coalescing global memory access.
  - Understand techniques for utilising vectorisation.
  - Understand techniques for utilising local memory.
  - Understand techniques for hiding the latency of data movement.
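As a hedged illustration of the asynchronous programming outcomes above, the sketch below chains a copy to the device, a kernel, and a copy back, using events to express the dependencies between them. It assumes a SYCL 2020 implementation; the variable names and the trivial doubling kernel are purely illustrative.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t n = 1024;
  std::vector<float> host(n, 1.0f);

  sycl::queue q;

  // Unified Shared Memory (USM): allocate memory resident on the device.
  float* dev = sycl::malloc_device<float>(n, q);

  // Each operation is asynchronous and returns an event describing its completion.
  sycl::event copyIn = q.memcpy(dev, host.data(), n * sizeof(float));

  // Express the dependency explicitly: the kernel must wait for the copy to finish.
  sycl::event compute = q.parallel_for(
      sycl::range<1>{n}, copyIn, [=](sycl::id<1> i) { dev[i] *= 2.0f; });

  // The copy back depends on the kernel; waiting on it waits on the whole chain.
  q.memcpy(host.data(), dev, n * sizeof(float), compute).wait();

  sycl::free(dev, q);
  return host[0] == 2.0f ? 0 : 1;
}
```

An in-order queue (constructed with sycl::property::queue::in_order) removes the need to pass events explicitly, at the cost of less scheduling freedom; this trade-off corresponds to the "In-order Queues" and "Advanced Dataflow" sessions on Day 2.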
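Similarly, the following fragment sketches the shape of an ND-range kernel that stages data in local memory, one of the GPU optimisations listed above. It assumes SYCL 2020, that in and out are USM device pointers, and that n is a multiple of the work-group size; the staging adds nothing to this trivial computation and is only there to show the mechanics, while Day 3 applies these ideas in an image convolution case study.

```cpp
#include <sycl/sycl.hpp>

// Illustrative only: each work-group stages a tile of the input in fast local
// (on-chip) memory before computing. Requires n to be a multiple of wgSize.
void scale(sycl::queue& q, const float* in, float* out, size_t n, size_t wgSize) {
  q.submit([&](sycl::handler& cgh) {
    // One tile of local memory per work-group.
    sycl::local_accessor<float, 1> tile{sycl::range<1>{wgSize}, cgh};

    cgh.parallel_for(
        sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{wgSize}},
        [=](sycl::nd_item<1> item) {
          size_t gid = item.get_global_id(0);
          size_t lid = item.get_local_id(0);

          // Contiguous (coalesced) load from global memory into the local tile.
          tile[lid] = in[gid];

          // Make the tile visible to every work-item in the group before use.
          sycl::group_barrier(item.get_group());

          out[gid] = tile[lid] * 2.0f;
        });
  });
}
```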