Stanford CME 213/ME 339 Spring 2021 homepage
Introduction to parallel computing using MPI, OpenMP, and CUDA
This is the website for CME 213, Introduction to Parallel Computing using MPI, OpenMP, and CUDA. This material was created by Eric Darve with the help of course staff and students.
Syllabus
Policy for late assignments
Extensions can be requested in advance for exceptional circumstances (e.g., travel, sickness, injury, COVID-related issues) and for OAE-approved accommodations.
Submissions that are at most two days late (within 48 hours of the deadline) will be accepted with a 10% penalty. No submissions will be accepted more than two days after the deadline.
See Gradescope for all the current assignments and their due dates. Post on Slack if you cannot access the Gradescope class page. The 6-letter code to join the class is given on Canvas.
Datasheet on the Quadro RTX 6000
Final Project
Final project instructions and starter code:
Slides and videos explaining the final project:
- Overview of the final project; Slides
- 33 Final Project 1, Overview; Video
- 34 Final Project 2, Regularization; Video
- 35 Final Project 3, CUDA GEMM and MPI; Video
See also the Module 8 videos on MPI.
Class modules and learning material
Introduction to the class
CME 213 First Live Lecture; Video, Slides
C++ tutorial
Module 1 Introduction to Parallel Computing
- Slides
- 01 Homework 1; Video
- 02 Why Parallel Computing; Video
- 03 Top 500; Video
- 04 Example of Parallel Computation; Video
- 05 Shared memory processor; Video
- Reading assignment 1
- Homework 1; starter code
Module 2 Shared Memory Parallel Programming
- C++ threads; Slides; Code
- Introduction to OpenMP; Slides; Code
- 06 C++ threads; Video
- 07 Promise and future; Video
- 08 mutex; Video
- 09 Introduction to OpenMP; Video
- 10 OpenMP Hello World; Video
- 11 OpenMP for loop; Video
- 12 OpenMP clause; Video
- Reading assignment 2
Module 3 Shared Memory Parallel Programming, OpenMP, advanced OpenMP
- OpenMP, for loops, advanced OpenMP; Slides; Code
- OpenMP, sorting algorithms; Slides; Code
- 13 OpenMP tasks; Video
- 14 OpenMP depend; Video
- 15 OpenMP synchronization; Video
- 16 Sorting algorithms Quicksort Mergesort; Video
- 17 Sorting Algorithms Bitonic Sort; Video
- 18 Bitonic Sort Exercise; Video
- Reading assignment 3
- Homework 2; starter code; radix sort tutorial
Module 4 Introduction to CUDA programming
- Introduction to GPU computing; Slides
- Introduction to CUDA and nvcc; Slides; Code
- 19 GPU computing introduction; Video
- 20 Graphics Processing Units; Video
- 21 Introduction to GPU programming; Video
- 22 icme-gpu; Video
- 23 a First CUDA program; Video
- 23 b First CUDA program part 2; Video
- 24 nvcc CUDA compiler; Video
- Reading assignment 4
- Homework 3; starter code
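Videos 23a and 23b walk through writing a first CUDA program. A minimal sketch in that spirit (a SAXPY kernel with unified memory; illustrative, not the course starter code) shows the three essentials: a `__global__` kernel, a launch configuration, and a synchronization before reading results on the host:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// y[i] = a * x[i] + y[i], one thread per element.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];  // guard: the grid may overshoot n
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory: no explicit copies
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up so every element is covered
    saxpy<<<blocks, threads>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();  // wait for the kernel before touching y on the host

    printf("y[0] = %f\n", y[0]);  // 2*1 + 2 = 4
    cudaFree(x);
    cudaFree(y);
}
```

Compile with `nvcc saxpy.cu -o saxpy`; video 24 covers what nvcc does under the hood.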
Module 5 Code performance on NVIDIA GPUs
- GPU memory and matrix transpose; Slides; Code
- CUDA occupancy, branching, homework 4; Slides
- 25 GPU memory; Video
- 26 Matrix transpose; Video
- 27 Latency, concurrency, and occupancy; Video
- 28 CUDA branching; Video
- 29 Homework 4; Video
- Reading assignment 5
- Homework 4; starter code
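Video 26 covers matrix transpose, where naive code reads rows but writes columns, so one side of the copy always strides through memory. The GPU fix in the slides stages tiles in shared memory; the same tiling idea in plain C++ (illustrative sketch, not the course code) keeps both the reads and writes of each step inside a small working set:

```cpp
#include <algorithm>
#include <vector>

constexpr int TILE = 32;  // tile edge; on the GPU this matches the thread block

// Transpose a rows x cols row-major matrix, processing TILE x TILE blocks
// so the strided writes stay within one tile at a time (better cache reuse).
std::vector<float> transpose(const std::vector<float>& a, int rows, int cols) {
    std::vector<float> out(a.size());
    for (int bi = 0; bi < rows; bi += TILE)
        for (int bj = 0; bj < cols; bj += TILE)
            for (int i = bi; i < std::min(bi + TILE, rows); ++i)
                for (int j = bj; j < std::min(bj + TILE, cols); ++j)
                    out[j * rows + i] = a[i * cols + j];
    return out;
}
```

In the CUDA version, each block loads its tile into shared memory and writes it back transposed, so both the global-memory read and write are coalesced.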
Module 6 NVIDIA guest lectures, OpenACC, CUDA optimization
- 30 NVIDIA guest lecture, OpenACC; Video; Slides
- 31 NVIDIA guest lecture, CUDA optimization; Video; Slides
- Reading assignment 6
Module 7 NVIDIA guest lectures, CUDA profiling
- 32 NVIDIA guest lecture, CUDA profiling; Video; Slides
- Reading assignment 7
Module 8 Group activity and introduction to MPI
The slides and videos below are needed for the final project.
- Introduction to MPI; Slides; Code
- 37 MPI Introduction; Video
- 38 MPI Hello World; Video
- 39 MPI Send Recv; Video
- 40 MPI Collective Communications; Video
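Videos 37 to 39 introduce the basic MPI pattern used throughout the final project: initialize, query rank and size, exchange point-to-point messages, finalize. A minimal sketch (illustrative; run it with at least two processes) is:

```cpp
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id, 0..size-1
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

    if (rank == 0 && size >= 2) {
        int msg = 42;
        MPI_Send(&msg, 1, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int msg;
        MPI_Recv(&msg, 1, MPI_INT, /*src=*/0, /*tag=*/0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", msg);
    }

    MPI_Finalize();
}
```

Build and run with `mpic++ hello.cpp -o hello && mpirun -np 2 ./hello`. Module 9 covers why a blind exchange of blocking `MPI_Send`/`MPI_Recv` calls can deadlock and how non-blocking variants avoid it.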
Material for the May 17 group activity:
- generate_sequence.cpp
- 36 Instructions for Monday, May 17 group activity; Video; Slides
Module 9 Advanced MPI
- MPI Advanced Send and Recv; Slides; Code
- 41 MPI Process Mapping; Video
- 42 MPI Buffering; Video
- 43 MPI Send Recv Deadlocks; Video
- 44 MPI Non-blocking; Video
- 45 MPI Send Modes; Video
- Parallel efficiency and MPI communicators; Slides; Code
- 46 MPI Matrix-vector product 1D schemes; Video
- 47 MPI Matrix vector product 2D scheme; Video
- 48 Parallel Speed-up; Video
- 49 Isoefficiency; Video
- 50 MPI Communicators; Video
- Reading assignment 8
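Videos 48 and 49 cover parallel speed-up and isoefficiency. The standard definitions, stated here for reference (not taken from the course slides), are:

```latex
% Speed-up and parallel efficiency on p processes,
% where T_1 is the serial time and T_p the time on p processes:
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p}.

% Writing the total overhead as T_o(W, p) = p\,T_p - T_1 for problem size W,
E = \frac{1}{1 + T_o(W, p)/T_1(W)},

% so holding E fixed as p grows requires growing W such that
T_1(W) = \frac{E}{1 - E}\, T_o(W, p) \quad \text{(the isoefficiency relation).}
```

The faster W must grow with p to satisfy this relation, the less scalable the algorithm; the 1D and 2D matrix-vector schemes in videos 46 and 47 differ in exactly this way.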
Module 10 SLAC guest lecture, Task-based parallel programming
Reading and links
Lawrence Livermore National Lab Resources
- LLNL Tutorial and Training Materials
- LLNL Introduction to Parallel Computing tutorial
- LLNL POSIX threads programming
- LLNL OpenMP tutorial
- LLNL MPI tutorial
- LLNL Advanced MPI slides
C++ threads
OpenMP
- OpenMP LLNL guide
- OpenMP guide by Yliluoma
- OpenMP 5.0 Reference Guide
- OpenMP API Specification
- Tutorials
CUDA
- CUDA Programming Guides and References
- CUDA C++ Programming Guide
- CUDA C++ Best Practices Guide
- CUDA occupancy calculator
- CUDA compiler driver NVCC
- OpenACC
- OpenACC Programming and Best Practices Guide
- OpenACC 2.7 API Reference Card
- Compilers that support OpenACC
- OpenACC Specification (Version 3.0)
MPI
- Open MPI hwloc documentation
Task-based parallel languages and APIs
- Legion and Regent
- StarPU
- Charm++
- PaRSEC
- Chapel
- X10
- TaskTorrent and documentation