Reading Assignment 7
NVIDIA Guest Lecture, CUDA profiling
Write your answers in a PDF and upload the document on Gradescope for submission. The due date is given on Gradescope. Each question is worth 10 points.
32 NVIDIA guest lecture, CUDA profiling; Video; Slides
NVIDIA Developer Blog on the high-performance multigrid CUDA code (HPGMG)
- Explain the difference between kernels that are compute bound, bandwidth bound, and latency bound.
- See the right-hand figure on Slide 25. What is a long scoreboard stall? See section 4.1 of the NVIDIA document for some definitions. Scoreboarding is a method of tracking dependencies between instructions; it is used to determine when a warp is ready to run.
- Assume a pipeline with a depth of 10 cycles and 5 warps that can issue instructions in parallel. On average, how many instructions can the pipeline issue per cycle? See Slide 27.
- Consider the following two versions of a loop:
Version 1:
    float a = 0.0f;
    for( int i = 0 ; i < N ; ++i )
        a += logf(b[i]);
Version 2 (assumes N is even):
    float a, a0 = 0.0f, a1 = 0.0f;
    for( int i = 0 ; i < N ; i += 2 )
    {
        a0 += logf(b[i]);
        a1 += logf(b[i+1]);
    }
    a = a0 + a1;
Explain why version 2 is expected to run faster on a GPU.
- Explain the difference between Nsight Systems and Nsight Compute.