Heterogeneous Programming in Modern C++ with SYCL [2021 class archive]

Heterogeneous Programming in Modern C++ with SYCL is a three-day online training course with programming exercises taught by Aksel Alpay, Gordon Brown, James Reinders, Michael Wong, Peter Zuzek, Rod Burns, and Ronan Keryell. It is offered online from 09:00 to 15:00 Aurora time (MDT), 11:00 to 17:00 EDT, 17:00 to 23:00 CET, Monday, November 1st through Wednesday, November 3rd, 2021 (after the conference).

Course Description

Parallel programming can be used to take advantage of heterogeneous architectures such as GPUs, FPGAs, ASICs, XPUs, IPUs, TPUs, or special units on CPUs to significantly increase the performance of applications. It has gained a reputation for being difficult, but is it really? Modern C++ has gone a long way toward making parallel programming easier and more accessible, and the introduction of the SYCL programming model means heterogeneous programming is now more accessible than ever.

This course will teach you the fundamentals of heterogeneous parallelism: how to recognize when to use parallelism, how to make the best choices, and which common parallel patterns can be used again and again. It will teach you how to use modern C++ and the SYCL programming model to create parallel algorithms for heterogeneous devices. Most of the programming focus will be on GPUs, but some time will be spent applying the techniques to simple FPGA examples. The course will also teach you how to apply common GPU optimizations.

The challenges and general approaches for heterogeneous programming are well covered in this tutorial. Heterogeneous programming is an incredibly important topic, adding important dimensions to parallel programming, which gained widespread use after the arrival of multicore processors (c. 2006) and the rise of GPU compute. The future is heterogeneous programming, with numerous device types from many vendors. This course will give attendees a deep appreciation of the challenge and a solid understanding of the programming techniques available to meet it.

SYCL is Modern C++. SYCL is a programming model that lets developers support a wide variety of devices (CPUs, GPUs, FPGAs, and more) from a single code base. Given the growing heterogeneity of processor roadmaps, moving to an open-standard, platform-independent model such as SYCL (one without vendor lock-in) is essential for modern software developers. SYCL has the advantage of supporting a single-source style of programming from completely standard C++. Attendees will gain a strong understanding of how the designers of the SYCL standard have addressed heterogeneous programming in C++. SYCL has gained widespread support in recent years, from HPC for Exascale computing to desktops, embedded systems, FPGAs, machine learning, and AI for self-driving cars. Whatever particular constructs emerge in the future, the material in this course will prove timeless.

Multiple SYCL implementations are available, and in this tutorial you will see instructors representing multiple vendors, academia, and research: Codeplay’s ComputeCpp, Intel’s Clang-based DPC++, Xilinx’s triSYCL, and the open-source hipSYCL project. This tutorial will provide a way for developers to gain expertise in this programming model in a practical environment.

Prerequisites

This course requires the following:

Working knowledge of C++11.
Working knowledge of Git.
Working knowledge of CMake.

We also encourage attendees to set up a SYCL implementation and any dependencies on the computer they will be attending from. Attendees will be contacted about this before the class.

Course Schedule

Day 1

Importance of Parallelism & Heterogeneity
Intro to SYCL
Enqueuing a Kernel
Managing Data
Handling Errors
Topology & Device Discovery
Configuring Queues and Contexts
Data Parallelism
How to think about targeting GPUs

Day 2

Fundamentals of Parallelism
Unified Shared Memory
Asynchronous Execution
Data & Dependencies
In-order Queues
Advanced Dataflow
ND Range Kernels
How to think about targeting FPGAs

Day 3

GPU Optimization Principles
Image Convolution Case Study
Global Memory Coalescing
Vectorization
Local Memory
Performance Portability in Practice
Future of SYCL; Learning more

Course Topics

The aim of this course is to provide students with an understanding of parallelism and how to develop for heterogeneous architectures such as GPUs. Students will gain an understanding of the fundamentals of parallelism and GPU architectures, as well as practical experience writing parallel applications using modern C++ and the SYCL programming model and applying common GPU optimizations.

Course outcomes

Understanding of why parallelism is important.
1. Understand the current landscape of computer architectures and their limitations.
2. Understand the performance benefits of parallelism.
3. Understand when and where parallelism is appropriate.
4. Understand abstract models for GPUs and FPGAs as key examples in thinking about heterogeneous targets.
Understanding of parallelism fundamentals.
1. Understand the difference between parallelism and concurrency.
2. Understand the difference between task parallelism and data parallelism.
3. Understand the balance of productivity, efficiency and portability.
Understanding of parallel patterns.
1. Understand the importance of parallel patterns.
2. Understand common parallel patterns such as map, scatter, gather and stencil.
Understanding of heterogeneous system architectures.
1. Understand the program execution and memory model of non-CPU architectures, like GPUs and FPGAs.
2. Understand SIMD execution and its benefits and limitations.
Understanding of asynchronous programming.
1. Understand how to execute work asynchronously.
2. Understand how to wait for the completion of asynchronous work.
3. Understand how to execute both task and data-parallel work.
Understanding of the challenges of programming heterogeneous systems.
1. Understand the challenges of executing code on a remote device.
2. Understand how code can be offloaded to a remote co-processor.
3. Understand the effects of latency between different memory regions and important considerations for data movement.
4. Understand the importance of coalesced data access.
Understanding of the SYCL programming model.
1. Understand the SYCL ecosystem and available implementations.
2. Understand how to install and configure a SYCL implementation.
3. Understand how to discover the device topology and create a queue.
4. Understand how to enqueue kernels to a queue.
5. Understand how to manage data using buffers and accessors.
6. Understand how to use a variety of other SYCL features for achieving performance on a GPU.
Understanding of common GPU optimizations.
1. Understand techniques for coalescing global memory access.
2. Understand techniques for utilizing vectorization.
3. Understand techniques for utilizing local memory.
4. Understand techniques for hiding the latency of data movement.
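The map, gather, scatter, and stencil patterns listed in the outcomes above can be illustrated with a short sequential sketch in standard C++. This is not course material: it uses plain loops over `std::vector` rather than SYCL, and the function names (`map_square`, `gather`, `scatter`, `stencil3`) are hypothetical. The point is that every loop iteration is independent, so in a SYCL kernel each iteration would typically become one work-item of a `parallel_for`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map: apply the same operation independently to every element.
std::vector<float> map_square(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * in[i];
    }
    return out;
}

// Gather: output element i reads from input position idx[i].
std::vector<float> gather(const std::vector<float>& in,
                          const std::vector<std::size_t>& idx) {
    std::vector<float> out(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i) {
        assert(idx[i] < in.size());
        out[i] = in[idx[i]];
    }
    return out;
}

// Scatter: input element i writes to output position idx[i].
// Assumes idx has no duplicates; otherwise parallel writes would race.
std::vector<float> scatter(const std::vector<float>& in,
                           const std::vector<std::size_t>& idx,
                           std::size_t out_size) {
    assert(idx.size() == in.size());
    std::vector<float> out(out_size, 0.0f);
    for (std::size_t i = 0; i < in.size(); ++i) {
        assert(idx[i] < out_size);
        out[idx[i]] = in[i];
    }
    return out;
}

// Stencil: each interior output element combines a fixed neighborhood of
// inputs (here a 3-point average); the two boundary elements are copied.
// Reading from `in` and writing to a separate `out` keeps iterations independent.
std::vector<float> stencil3(const std::vector<float>& in) {
    std::vector<float> out(in);
    for (std::size_t i = 1; i + 1 < in.size(); ++i) {
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
    }
    return out;
}
```

For example, with data `{1, 2, 6, 4}` and indices `{3, 0, 2, 1}`, `map_square` gives `{1, 4, 36, 16}`, `gather` gives `{4, 1, 6, 2}`, `scatter` gives `{2, 4, 6, 1}`, and `stencil3` gives `{1, 3, 4, 4}`. Gather and scatter differ only in whether the indirection is on the read or the write side, which is why scatter needs the no-duplicates assumption to be safe in parallel.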


Course Instructors

Gordon Brown is a principal software engineer at Codeplay Software specializing in heterogeneous programming models for C++. He has been involved in the standardization of the Khronos standard SYCL and the development of Codeplay’s implementation of the standard, ComputeCpp, from its inception. More recently he has been involved in the efforts within SG1/SG14 to standardize execution and to bring heterogeneous computing to C++, including executors, topology discovery, and affinity. Gordon is also a regular speaker at CppCon and teaches the CppCon class on parallelism and GPU programming in C++.

Michael Wong is VP of R&D at Codeplay Software. He is a current Director and VP of ISOCPP, and a senior member of the C++ Standards Committee with more than 15 years of experience. He chairs the WG21 SG5 Transactional Memory and SG14 Games Development/Low Latency/Financials C++ groups and is the co-author of a number of C++/OpenMP/Transactional Memory features, including generalized attributes, user-defined literals, inheriting constructors, weakly ordered memory models, and explicit conversion operators. He has published numerous research papers and is the author of a book on C++11. He has been an invited speaker and keynote at numerous conferences. He is currently the editor of the SG1 Concurrency TS and the SG5 Transactional Memory TS. He is also the Chair of the SYCL standard and of all Programming Languages for the Standards Council of Canada. Previously, he was CEO of OpenMP, where he was involved in taking OpenMP toward accelerator support, and the Technical Strategy Architect responsible for moving IBM’s compilers to Clang/LLVM after leading IBM’s XL C++ compiler team.

Aksel Alpay is a staff member at the Heidelberg University Computing Centre where his work focuses on HPC and parallel programming models. In particular, he is the lead developer of the hipSYCL SYCL implementation. He has also been a teaching assistant at Heidelberg University and supervisor for multiple courses on parallel computing. He is also a member of the SYCL working group.

Rod Burns has been helping developers build complex software for well over a decade. At Codeplay Software, Rod is involved in providing support and building educational materials for developers using Codeplay’s SYCL product. Most recently Rod created “SYCL Academy,” a set of materials for teaching SYCL that has already been adopted by some of the top universities in the world. Rod has been involved in writing a range of training courses for more than a decade.

Ronan Keryell is a principal software engineer at Xilinx Research Labs working on high-level programming models for FPGAs and is a member of the Khronos OpenCL & SYCL C++ committee. Ronan Keryell received his MSc in Electrical Engineering and PhD in Computer Science from École Normale Supérieure of Paris (France), on the design of a massively parallel RISC-based VLIW-SIMD graphics computer and its programming environment. He was an assistant professor in the Computer Science department at MINES ParisTech and later at Télécom Bretagne (France), working on automatic parallelization, compilation of PGAS languages (High-Performance Fortran), high-level synthesis and co-design, networking, and secure computing. He was co-founder of 3 start-ups, mainly in the area of High Performance Computing, and was the technical lead of the Par4All automatic parallelizer at SILKAN, targeting OpenMP, CUDA & OpenCL from sequential C & Fortran. Before joining Xilinx, he worked at AMD on programming models for GPUs.

James Reinders is an engineer at Intel, with parallel computing experience spanning four decades, focused on enabling parallel programming in a heterogeneous world. James is currently focused on the DPC++ project (SYCL for LLVM), and the oneAPI initiative (delivering APIs spanning compute devices of many types from many vendors). James is an author/co-author/editor of ten technical books related to parallel programming; his latest book is about SYCL (free download: https://www.apress.com/book/9781484255735). He has had the great fortune to help make key contributions to two of the world’s fastest computers (#1 on Top500 list) as well as many other supercomputers, and software developer tools. James consistently enjoys writing, teaching, programming, and consulting in areas related to parallel computing (HPC and AI).

Peter Zuzek is a Senior Software Engineer at Codeplay where he has worked on the ComputeCpp runtime and is now the Team Lead of the SYCL-ECO team responsible for maintaining ComputeCpp and providing support for customer and open-source SYCL projects. He has also contributed to the SYCL 1.2.1 and SYCL 2020 specifications and continues to be involved in the SYCL Working Group in Khronos.