You can download all the material for this class from the GitHub repository. The command to clone it is simply:
$ git clone https://github.com/EricDarve/cme213_material_2013.git
Instructors:
Eric Darve, Stanford University
Erich Elsen, Royal Caliber
Sammy El Ghazzal
Contact: Eric Darve, darve@stanford.edu
In the GitHub repository, you will find:
- Lecture slides.
- Homeworks. There are 6 homeworks in total.
- Final Project. The final project is about writing a CUDA code to calculate connected components in images.
- CUDA sample codes.
- MPI sample codes.
Recommended reading:
- Parallel Programming for Multicore and Cluster Systems, Rauber and Rünger.
- Introduction to Parallel Computing, Grama, Gupta, Karypis, Kumar.
- Introduction to Parallel Programming, Pacheco.
- Using OpenMP: Portable Shared Memory Parallel Programming, Chapman, Jost, van der Pas.
- Parallel Programming in OpenMP, Chandra, Menon, Dagum, Kohr, Maydan, McDonald
- The Art of Multiprocessor Programming, Herlihy, Shavit.
- CUDA by Example: An Introduction to General-Purpose GPU Programming, Sanders, Kandrot
- CUDA Handbook: A Comprehensive Guide to GPU Programming, Wilt
Lecture Slides
You can find the lecture slides on GitHub.
List of topics:
Lecture 1
Topics:
introduction; syllabus; why we need parallelism; example of parallel program: summing up numbers
Shared memory and multicore processors
Introduction to Pthreads
Lecture 2
Topics:
Pthreads; creating and joining threads; example: multiplication of two matrices; Mutexes; example: dot product
Lecture 3
Topics:
condition variables; example of a pizza restaurant and delivery, with a code sample
OpenMP; introduction; parallel regions
Lecture 4
Topics:
OpenMP; parallel for loops; matrix multiplication; sections; single; tasks; master; critical; barrier; atomic; data sharing attributes; reduction clause
Lecture 5
Topics:
fast multipole method; OpenMP and Pthreads implementations
Lecture 6
Topics:
CUDA; threading model; basic commands; simple example programs; threads and blocks; timing; basic debugging techniques (printf, how nvcc works); unary functions using templates
Lecture 7
Topics:
warps; coalescing and performance impact; caching; shared memory; bank conflicts; example of matrix transpose
Lecture 8
Topics:
reduce and scan algorithms; work complexity vs. step complexity.
Students were asked to form teams and devise an efficient procedure to quickly add many numbers and to compute a scan.
There are no slides for this lecture.
Lecture 9
Topics:
CUDA; reduction algorithm; warp; thread-block; use of atomics
Lecture 10
Topics:
floating point numbers; matrix-vector products; how to optimize the memory access; study of different cases: small and large matrices; tall and fat matrices
Lecture 11
Topics:
discussion of Thrust; segmented algorithms; examples of problems that can be broken into Thrust algorithms
Lecture 12
Steve Rennich from NVIDIA. Introduction to streams; increasing concurrency; overlapping memory transfers with kernel execution.
Lecture 13
Justin Luitjens from NVIDIA. OpenACC.
Lecture 14
David Goodwin from NVIDIA. The CUDA nvvp profiler.
Lecture 15
Sean Baxter from NVIDIA. The merge step in merge-sort algorithms; merge-like operations; load-balancing search.
Lecture 16
Topics: MPI; introduction to message-passing; point-to-point communication.
Lecture 17
Topics: deadlocks; blocking vs non-blocking; synchronous vs non-synchronous; introduction to collective communication
Lecture 18
Topics: collective communication; matrix-vector product; groups, communicators
Lecture 19
Topics: virtual topologies; application to matrix-vector product with 2D partitioning; introduction to performance metrics; speed-up, efficiency; Amdahl’s law
Lecture 20
Topics: performance metrics; example: dot-product; efficiency and iso-efficiency; matrix-vector product with 1D and 2D partitioning; matrix-matrix products; Cannon and DNS algorithms.