You can download all the material for this class from the GitHub repository. The command to clone it is simply:
$ git clone https://github.com/EricDarve/cme213_material_2013.git
Instructors:
Eric Darve, Stanford University
Erich Elsen, Royal Caliber
Sammy El Ghazzal
Contact: Eric Darve, darve@stanford.edu
In the GitHub repository, you will find:
- Lecture slides.
- Homeworks. There are 6 homeworks in total.
- Final Project. The final project is about writing a CUDA code to calculate connected components in images.
- CUDA sample codes.
- MPI sample codes.
Recommended reading:
- Parallel Programming for Multicore and Cluster Systems, Rauber and Rünger.
- Introduction to Parallel Computing, Grama, Gupta, Karypis, Kumar.
- Introduction to Parallel Programming, Pacheco.
- Using OpenMP: Portable Shared Memory Parallel Programming, Chapman, Jost, van der Pas.
- Parallel Programming in OpenMP, Chandra, Menon, Dagum, Kohr, Maydan, McDonald
- The Art of Multiprocessor Programming, Herlihy, Shavit.
- CUDA by Example: An Introduction to General-Purpose GPU Programming, Sanders, Kandrot
- CUDA Handbook: A Comprehensive Guide to GPU Programming, Wilt
Lecture Slides
You can find the lecture slides on GitHub.
List of topics:
Lecture 1
Topics:
introduction; syllabus; why we need parallelism; example of parallel program: summing up numbers
Shared memory and multicore processors
Introduction to Pthreads
Lecture 2
Topics:
Pthreads; creating and joining threads; example: multiplication of two matrices; Mutexes; example: dot product
Lecture 3
Topics:
condition variables; example of a pizza restaurant and delivery, with a code sample
OpenMP; introduction; parallel regions
Lecture 4
Topics:
OpenMP; parallel for loops; matrix multiplication; sections; single; tasks; master; critical; barrier; atomic; data sharing attributes; reduction clause
Lecture 5
Topics:
fast multipole method; OpenMP and Pthreads implementations
Lecture 6
Topics:
CUDA; threading model; basic commands; simple example programs; threads and blocks; timing; basic debugging techniques (printf, how nvcc works); unary functions using templates
Lecture 7
Topics:
warps; coalescing and performance impact; caching; shared memory; bank conflicts; example of matrix transpose
Lecture 8
Topics:
reduce and scan algorithms; work complexity vs. step complexity.
Students were asked to form teams and devise an efficient procedure to quickly add many numbers and to compute a scan.
There are no slides for this lecture.
Lecture 9
Topics:
CUDA; reduction algorithm; warp; thread-block; use of atomics
Lecture 10
Topics:
floating point numbers; matrix-vector products; how to optimize the memory access; study of different cases: small and large matrices; tall and fat matrices
Lecture 11
Topics:
discussion of Thrust; segmented algorithms; examples of problems that can be broken into Thrust algorithms
Lecture 12
Steve Rennich from NVIDIA. Introduction to streams; increasing concurrency; overlapping memory transfers with kernel execution.
Lecture 13
Justin Luitjens from NVIDIA. OpenACC.
Lecture 14
David Goodwin from NVIDIA. The CUDA nvvp profiler.
Lecture 15
Sean Baxter from NVIDIA. The merge step in merge-sort algorithms; merge-like operations; load-balancing search.
Lecture 16
Topics: MPI; introduction to message-passing; point-to-point communication.
Lecture 17
Topics: deadlocks; blocking vs non-blocking; synchronous vs non-synchronous; introduction to collective communication
Lecture 18
Topics: collective communication; matrix-vector product; groups, communicators
Lecture 19
Topics: virtual topologies; application to matrix-vector product with 2D partitioning; introduction to performance metrics; speed-up, efficiency; Amdahl’s law
Lecture 20
Topics: performance metrics; example: dot-product; efficiency and iso-efficiency; matrix-vector product with 1D and 2D partitioning; matrix-matrix products; Cannon and DNS algorithms.