Reading Assignment 7

NVIDIA Guest Lecture, CUDA profiling

Write your answers in a PDF and upload the document to Gradescope for submission. The due date is given on Gradescope. Each question is worth 10 points.

Lecture 32: NVIDIA guest lecture, CUDA profiling; Video; Slides

NVIDIA Developer Blog on the high-performance multigrid CUDA code (HPGMG)

Explain the difference between kernels that are compute bound, bandwidth bound, and latency bound.
See the right figure on Slide 25. What is a long scoreboard stall? See section 4.1 of the NVIDIA document for some definitions. Scoreboarding is a method to track dependencies between instructions; it is used to determine when a warp is ready to run.
Assume that we have a pipeline with a depth of 10 cycles and 5 warps that can issue instructions in parallel. On average, how many instructions is the pipeline able to issue per cycle? See Slide 27.
Consider the following two versions of a loop:

Version 1:

float a = 0.0f;
for( int i = 0 ; i < N ; ++i )
 a += logf(b[i]);

Version 2:

// Assumes N is even.
float a, a0 = 0.0f, a1 = 0.0f;
for( int i = 0 ; i < N ; i += 2 )
{
 a0 += logf(b[i]);
 a1 += logf(b[i+1]);
}
a = a0 + a1;

Explain why version 2 is expected to run faster on a GPU.

Explain the difference between Nsight Systems and Nsight Compute.