
CME 213: Introduction to parallel computing using MPI, OpenMP, and CUDA

Eric Darve, Stanford University

CME 213, Stanford University, Spring 2013
You can download all the material for this class by going to the GitHub repository. The command to clone the repository is simply:
$ git clone https://github.com/EricDarve/cme213_material_2013.git

Instructors:
Eric Darve, Stanford University
Erich Elsen, Royal Caliber
Sammy El Ghazzal

Contact: Eric Darve, darve@stanford.edu

In the GitHub repository, you will also find the recommended reading for this class.

Lecture Slides

You can find the lecture slides on GitHub.
List of topics:

Lecture 1

Topics:
introduction; syllabus; why we need parallelism; example of a parallel program: summing up numbers

Shared memory and multicore processors

Introduction to Pthreads

Lecture 2

Topics:
Pthreads; creating and joining threads; example: multiplication of two matrices; mutexes; example: dot product
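
A minimal sketch of the dot-product pattern from this lecture: each thread sums its own chunk, then adds it to a shared accumulator under a mutex. The thread count, sizes, and names are illustrative, not taken from the lecture; compile with gcc -pthread.

#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NUM_THREADS 4

double x[N], y[N];          /* input vectors */
double dot = 0.0;           /* shared accumulator */
pthread_mutex_t dot_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Each thread sums its contiguous chunk locally, then adds the
   partial result to the shared total inside the critical section. */
void *partial_dot(void *arg) {
    long id = (long)arg;
    long chunk = N / NUM_THREADS;
    long begin = id * chunk;
    long end = (id == NUM_THREADS - 1) ? N : begin + chunk;

    double local = 0.0;
    for (long i = begin; i < end; ++i)
        local += x[i] * y[i];

    pthread_mutex_lock(&dot_mutex);
    dot += local;                 /* critical section */
    pthread_mutex_unlock(&dot_mutex);
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; ++i) { x[i] = 1.0; y[i] = 2.0; }

    pthread_t threads[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; ++t)
        pthread_create(&threads[t], NULL, partial_dot, (void *)t);
    for (long t = 0; t < NUM_THREADS; ++t)
        pthread_join(threads[t], NULL);   /* wait for all threads */

    printf("dot = %f (expected %f)\n", dot, 2.0 * N);
    return 0;
}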

Lecture 3

Topics:
condition variables; example of a pizza restaurant and delivery; example with code sample

OpenMP; introduction; parallel regions
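
A minimal sketch of the condition-variable pattern behind the pizza example: the "kitchen" thread signals when a pizza is ready, and the "delivery" thread waits. Names and counts are illustrative, not the lecture's code.

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  pizza_ready = PTHREAD_COND_INITIALIZER;
int pizzas = 0;                     /* shared state protected by lock */

void *kitchen(void *arg) {
    for (int i = 0; i < 5; ++i) {
        pthread_mutex_lock(&lock);
        ++pizzas;                   /* produce one pizza */
        pthread_cond_signal(&pizza_ready);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

void *delivery(void *arg) {
    for (int i = 0; i < 5; ++i) {
        pthread_mutex_lock(&lock);
        while (pizzas == 0)         /* re-check: wakeups can be spurious */
            pthread_cond_wait(&pizza_ready, &lock);
        --pizzas;                   /* consume one pizza */
        pthread_mutex_unlock(&lock);
        printf("delivered pizza %d\n", i + 1);
    }
    return NULL;
}

int main(void) {
    pthread_t k, d;
    pthread_create(&k, NULL, kitchen, NULL);
    pthread_create(&d, NULL, delivery, NULL);
    pthread_join(k, NULL);
    pthread_join(d, NULL);
    return 0;
}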

Lecture 4

Topics:
OpenMP; parallel for loops; matrix multiplication; sections; single; tasks; master; critical; barrier; atomic; data sharing attributes; reduction clause
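
The parallel for loop and reduction clause can be illustrated with a short sketch (sizes and names are illustrative); compile with gcc -fopenmp.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = 1.0; y[i] = 2.0; }

    double dot = 0.0;
    /* The iterations are divided among the threads; reduction(+:dot)
       gives each thread a private copy of dot and combines the copies
       at the end of the parallel region. */
    #pragma omp parallel for reduction(+:dot)
    for (int i = 0; i < N; ++i)
        dot += x[i] * y[i];

    printf("dot = %f using up to %d threads\n", dot, omp_get_max_threads());
    return 0;
}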

Lecture 5

Topics:
fast multipole method; OpenMP and Pthreads implementations

Lecture 6

Topics:
CUDA; threading model; basic commands; simple example programs; threads and blocks; timing; basic debugging techniques (printf); how nvcc works; unary function using templates
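
A minimal CUDA sketch of the kind covered here: a kernel indexed by blocks and threads, launched from the host. Sizes and names are illustrative; compile with nvcc.

#include <cstdio>

// Kernel: each thread squares one element.
__global__ void square(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) d[i] = d[i] * d[i];
}

int main() {
    const int n = 1024;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch: 4 blocks of 256 threads cover the 1024 elements.
    square<<<4, 256>>>(d, n);

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("h[3] = %f (expected 9)\n", h[3]);
    return 0;
}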

Lecture 7

Topics:
warps; coalescing and performance impact; caching; shared memory; bank conflicts; example of matrix transpose
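
A sketch of the shared-memory transpose pattern discussed here, with the usual +1 padding to avoid bank conflicts. The tile size and names are illustrative; launch with a TILE x TILE thread block and an (n/TILE) x (n/TILE) grid.

#define TILE 32

// Tiled matrix transpose: both the global reads and the global writes
// are coalesced because the actual transposition happens in shared
// memory. The +1 padding shifts each tile row to a different bank,
// so the column-wise shared-memory reads do not conflict.
__global__ void transpose(float *out, const float *in, int n) {
    __shared__ float tile[TILE][TILE + 1];   // padded to avoid bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];  // coalesced read

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;     // swap block indices
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}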

Lecture 8

Topics:
reduce and scan algorithms; work complexity vs. step complexity

Students were asked to form teams and find an efficient procedure for quickly adding many numbers and for computing a scan.

There are no slides for this lecture.
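
As a sequential illustration of the step-versus-work trade-off: a tree reduction performs the same n - 1 additions as a running sum, but groups them into log2(n) rounds of mutually independent additions, which is what a parallel machine can exploit. The sketch below is illustrative.

#include <stdio.h>

/* Tree reduction on n = 8 values: at each step, element i picks up
   the value stride positions away. Total work is still n - 1
   additions, but there are only log2(n) dependent steps. */
int main(void) {
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int n = 8;
    for (int stride = 1; stride < n; stride *= 2)   /* log2(8) = 3 steps */
        for (int i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];                  /* independent within a step */
    printf("sum = %f (expected 36)\n", a[0]);
    return 0;
}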

Lecture 9

Topics:
CUDA; reduction algorithm; warp; thread-block; use of atomics
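
A minimal sketch of a block-level reduction combining shared memory with a per-block atomic. It assumes 256-thread blocks and a device with floating-point atomicAdd (compute capability 2.0 or higher); the details are illustrative.

// Each block reduces its slice in shared memory; thread 0 then adds
// the block's partial sum to the global result with one atomicAdd.
__global__ void reduce_sum(const float *in, float *result, int n) {
    __shared__ float s[256];                 // one slot per thread
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block: log2(blockDim.x) steps.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0) atomicAdd(result, s[0]);   // one atomic per block
}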

Lecture 10

Topics:
floating point numbers; matrix-vector products; how to optimize the memory access; study of different cases: small and large matrices; tall and fat matrices
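
As a baseline for the memory-access discussion, here is a naive one-thread-per-row kernel (row-major storage; names are illustrative). With this layout, threads in a warp read addresses n apart in A, so the accesses are strided rather than coalesced, which is the starting point the lecture's optimizations improve on.

// Naive matrix-vector product y = A*x, one thread per row of A (m x n).
__global__ void matvec(const float *A, const float *x, float *y,
                       int m, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j)
            sum += A[row * n + j] * x[j];    // strided across the warp
        y[row] = sum;
    }
}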

Lecture 11

Topics:
discussion of Thrust; segmented algorithms; examples of problems that can be broken into Thrust algorithms
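
A minimal sketch of composing Thrust algorithms: a dot product written as an elementwise transform followed by a reduce. (Thrust also provides thrust::inner_product, which fuses the two; the two-step version is shown only to illustrate breaking a problem into building blocks.)

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdio>

int main() {
    int n = 1 << 20;
    thrust::device_vector<float> x(n, 1.0f);
    thrust::device_vector<float> y(n, 2.0f);
    thrust::device_vector<float> tmp(n);

    // Elementwise multiply, then sum the products on the device.
    thrust::transform(x.begin(), x.end(), y.begin(), tmp.begin(),
                      thrust::multiplies<float>());
    float dot = thrust::reduce(tmp.begin(), tmp.end(), 0.0f,
                               thrust::plus<float>());

    printf("dot = %f\n", dot);
    return 0;
}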

Lecture 12

Guest lecture by Steve Rennich (NVIDIA). Introduction to streams; increasing concurrency; running memory transfers and kernels concurrently.
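
A minimal two-stream sketch of this idea: pinned host memory plus cudaMemcpyAsync lets the copy in one stream overlap the kernel in the other. Sizes and names are illustrative.

#include <cstdio>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20, chunk = n / 2;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // pinned: required for async copies
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    // Each stream handles half the data; the copy in one stream can
    // overlap with the kernel running in the other.
    for (int i = 0; i < 2; ++i) {
        int off = i * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        scale<<<chunk / 256, 256, 0, s[i]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();                 // wait for both streams

    printf("h[0] = %f (expected 2)\n", h[0]);
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}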

Lecture 13

Guest lecture by Justin Luitjens (NVIDIA). OpenACC.
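
A minimal OpenACC sketch: a SAXPY loop offloaded with a single pragma. Names and sizes are illustrative; this requires an OpenACC-capable compiler (e.g. PGI).

#include <stdio.h>

#define N 1000000

int main(void) {
    static float x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    float a = 3.0f;
    /* The pragma asks the compiler to offload the loop; the data
       clauses manage the host-device transfers. */
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; ++i)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f (expected 5)\n", y[0]);
    return 0;
}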

Lecture 14

Guest lecture by David Goodwin (NVIDIA). The CUDA nvvp profiler.

Lecture 15

Guest lecture by Sean Baxter (NVIDIA). The merge step in merge-sort algorithms; merge-like operations; load-balancing search.

Lecture 16

Topics:
MPI; introduction to message-passing; point-to-point communication.
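
A minimal point-to-point sketch: rank 0 sends one integer to rank 1 with a matching blocking send/receive pair. Run with mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int msg = 42;
    if (rank == 0) {
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", msg);
    }

    MPI_Finalize();
    return 0;
}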

Lecture 17

Topics:
deadlocks; blocking vs. non-blocking communication; synchronous vs. asynchronous sends; introduction to collective communication
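
One standard way to avoid the exchange deadlock discussed here is to post a non-blocking receive before the blocking send; a minimal sketch (names are illustrative):

#include <mpi.h>
#include <stdio.h>

/* Pairwise exchange: if both partners called a blocking send first,
   the program could deadlock. Posting MPI_Irecv first and waiting
   afterwards breaks the cycle. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int partner = rank ^ 1;           /* pair up ranks 0-1, 2-3, ... */
    int sendbuf = rank, recvbuf = -1;
    MPI_Request req;

    if (partner < size) {
        MPI_Irecv(&recvbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &req);
        MPI_Send(&sendbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("rank %d got %d from rank %d\n", rank, recvbuf, partner);
    }

    MPI_Finalize();
    return 0;
}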

Lecture 18

Topics:
collective communication; matrix-vector product; groups, communicators
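
A minimal sketch of a 1D row-partitioned matrix-vector product built from collectives: MPI_Bcast distributes x, each rank multiplies its block of rows, and MPI_Gather assembles y on rank 0. It assumes the number of ranks divides N; the data values are illustrative.

#include <mpi.h>
#include <stdio.h>

#define N 8   /* matrix dimension */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;
    double A_local[N][N];     /* only the first `rows` rows are used */
    double x[N], y_local[N], y[N];

    /* Illustrative data: A is the all-ones matrix, x = (1, ..., 1). */
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < N; ++j) A_local[i][j] = 1.0;
    if (rank == 0)
        for (int j = 0; j < N; ++j) x[j] = 1.0;

    MPI_Bcast(x, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < rows; ++i) {
        y_local[i] = 0.0;
        for (int j = 0; j < N; ++j) y_local[i] += A_local[i][j] * x[j];
    }

    MPI_Gather(y_local, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    if (rank == 0) printf("y[0] = %f (expected %d)\n", y[0], N);
    MPI_Finalize();
    return 0;
}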

Lecture 19

Topics:
virtual topologies; application to matrix-vector product with 2D partitioning; introduction to performance metrics; speed-up, efficiency; Amdahl’s law
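
For reference, Amdahl's law in its usual form: if a fraction f of the work can be parallelized, the speed-up on p processors is bounded by S(p) = 1 / ((1 - f) + f/p), which tends to 1/(1 - f) as p grows.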

Lecture 20

Topics:
performance metrics; example: dot-product; efficiency and iso-efficiency; matrix-vector product with 1D and 2D partitioning; matrix-matrix products; Cannon and DNS algorithms.
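
A minimal sketch of the distributed dot product used in the efficiency analysis: O(n/p) local work followed by a single O(log p) reduction. MPI_Allreduce leaves the result on every rank; sizes are illustrative.

#include <mpi.h>
#include <stdio.h>

#define N_LOCAL 1000   /* elements owned by each rank */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double x[N_LOCAL], y[N_LOCAL], local = 0.0, dot;
    for (int i = 0; i < N_LOCAL; ++i) { x[i] = 1.0; y[i] = 2.0; }

    /* Local partial dot product, then one collective reduction. */
    for (int i = 0; i < N_LOCAL; ++i) local += x[i] * y[i];
    MPI_Allreduce(&local, &dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("dot = %f\n", dot);
    MPI_Finalize();
    return 0;
}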