Top frequently asked CUDA programming interview questions.
Q: How many different kinds of memory are there in a GPU?
A : CUDA-enabled GPUs have several types of memory, each with different characteristics in terms of size, speed, and scope (a short code sketch follows this list):
- Global Memory - Largest memory space, accessible by all threads. High latency (400-800 cycles). Persists for the lifetime of the application. Typically several GB in size.
- Shared Memory - On-chip memory shared among threads within a block. Much faster than global memory (low latency). Limited size (typically 48-96 KB per SM).
- Registers - Fastest memory, private to each thread. Very limited in quantity (typically 64K 32-bit registers per SM, divided among all resident threads).
- Local Memory - Private to each thread but physically resides in global memory. Used for register spilling and large structures. Same latency as global memory.
- Constant Memory - Read-only memory cached on-chip. Total size of 64KB. Optimized for broadcast access patterns where all threads read the same address.
- Texture Memory - Read-only memory with special caching optimized for 2D spatial locality. Useful for image processing.
- L1/L2 Cache - Hardware-managed caches for global and local memory accesses. L2 is shared across all SMs.
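As a quick illustration, here is a minimal sketch (hypothetical kernel and variable names) showing how several of these memory spaces appear in code; local memory is not declared explicitly, since the compiler uses it automatically when registers overflow:
__constant__ float coeffs[256]; // Constant memory: file scope, read-only on the device

__global__ void memorySpacesDemo(const float *globalIn, float *globalOut) {
    // Shared memory: one copy per block, visible to all of its threads
    __shared__ float tile[256]; // assumes blockDim.x <= 256
    // Registers: ordinary scalar locals normally live here
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Global memory: reached through pointers passed from the host
    tile[threadIdx.x] = globalIn[idx];
    __syncthreads();
    globalOut[idx] = tile[threadIdx.x] * coeffs[threadIdx.x % 256];
}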
Q: What does coalesced/uncoalesced memory access mean?
A : Memory coalescing is a technique where the GPU combines multiple memory accesses from the threads of a warp into fewer transactions to improve memory bandwidth utilization.
Coalesced Access:
When consecutive threads in a warp access consecutive memory addresses, the memory controller can combine these accesses into a single transaction. This is the most efficient access pattern.
Example of coalesced access:
__global__ void coalescedAccess(float *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread accesses consecutive addresses
    float value = data[idx]; // COALESCED
}
Uncoalesced Access:
When threads in a warp access scattered or misaligned memory addresses, each access may require a separate transaction, significantly reducing memory bandwidth efficiency.
Example of uncoalesced access:
__global__ void uncoalescedAccess(float *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Strided access with a large stride
    float value = data[idx * 32]; // UNCOALESCED
}
Performance Impact:
Uncoalesced memory accesses can reduce memory bandwidth utilization by 10x or more. Modern GPUs are more tolerant of some access patterns, but coalescing remains critical for performance.
Best Practices:
- Align data structures to 128-byte boundaries
- Access memory in sequential patterns within a warp
- Avoid scattered or strided access patterns when possible (one common fix is sketched after this list)
- Use shared memory to reorganize data for coalesced global memory access
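For example, converting an array-of-structures layout to a structure-of-arrays layout (a minimal sketch with a hypothetical Particle type) turns a strided pattern into a coalesced one:
struct Particle { float x, y, z, w; };

// AoS: thread i reads particles[i].x, so consecutive threads touch
// addresses 16 bytes apart -- several memory transactions per warp
__global__ void aosRead(Particle *particles, float *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = particles[idx].x;
}

// SoA: consecutive threads read consecutive floats -- fully coalesced
__global__ void soaRead(float *xs, float *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = xs[idx];
}
With the AoS layout, a warp reading .x spreads its 32 loads over 512 bytes of memory; with SoA the same loads fall in a single contiguous 128-byte segment.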
Q: Can you implement a matrix transpose kernel?
A : Matrix transpose is a common operation that demonstrates the importance of memory access patterns in CUDA. Here's an optimized implementation using shared memory to avoid uncoalesced writes:
#define TILE_DIM 32
#define BLOCK_ROWS 8
__global__ void transposeCoalesced(float *odata, float *idata,
                                   int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1]; // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Coalesced read from global memory into shared memory
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < width && (y + j) < height) {
            tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];
        }
    }
    __syncthreads();

    // Swap block indices for the transposed output tile
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    // Coalesced write from shared memory to global memory
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < height && (y + j) < width) {
            odata[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
        }
    }
}
Key optimizations:
- Uses shared memory as an intermediate buffer
- Adds padding (+1) to shared memory to avoid bank conflicts
- Ensures both reads and writes are coalesced
- Uses tiles to improve data locality
Launch configuration:
dim3 dimGrid((width + TILE_DIM - 1) / TILE_DIM,
             (height + TILE_DIM - 1) / TILE_DIM);
dim3 dimBlock(TILE_DIM, BLOCK_ROWS);
transposeCoalesced<<<dimGrid, dimBlock>>>(d_out, d_in, width, height);
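For contrast, a naive transpose (a sketch, not part of the original answer) reads coalesced but writes with a stride of height elements, which is exactly the pattern the shared-memory tile avoids:
__global__ void transposeNaive(float *odata, float *idata,
                               int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // The read is coalesced (consecutive x), but consecutive threads
        // write addresses `height` floats apart -- uncoalesced
        odata[x * height + y] = idata[y * width + x];
    }
}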
Q: What is a warp?
A : A warp is the fundamental unit of execution in CUDA. It consists of 32 threads that execute the same instruction simultaneously in SIMT (Single Instruction, Multiple Thread) fashion.
Key Characteristics:
- Size: Always 32 threads (on current NVIDIA GPUs)
- Execution: All threads in a warp execute the same instruction at the same time
- Scheduling: The warp scheduler selects warps that are ready to execute
- Thread Grouping: Threads are grouped into warps based on their threadIdx (threads 0-31 form warp 0, threads 32-63 form warp 1, etc.)
When threads in a warp take different execution paths (e.g., due to if-else statements), this is called warp divergence. The warp must execute both paths serially, with inactive threads masked out, reducing performance.
Example of warp divergence:
__global__ void divergentKernel(int *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // This causes divergence if idx % 2 varies within a warp
    if (idx % 2 == 0) {
        data[idx] = idx * 2; // Some threads execute this
    } else {
        data[idx] = idx * 3; // Other threads execute this
    }
}
Best Practices:
- Minimize branch divergence within warps
- Organize data and thread indexing to keep threads in the same warp on the same code path (see the sketch after this list)
- Ensure memory accesses within a warp are coalesced
- Consider warp size when designing algorithms and data structures
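As an illustration of the first two practices, the divergent kernel above can be rewritten (a hypothetical sketch) to branch at warp granularity, so all 32 threads of a warp take the same path:
__global__ void warpAlignedBranch(int *data) {
    int idx  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = idx / 32; // identical for all 32 threads of a warp

    // Every thread in a given warp evaluates this condition the same way,
    // so no warp has to execute both branches
    if (warp % 2 == 0) {
        data[idx] = idx * 2;
    } else {
        data[idx] = idx * 3;
    }
}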
Q: How many warps can run simultaneously inside a multiprocessor?
A : The number of warps that can run simultaneously on a Streaming Multiprocessor (SM) depends on the GPU architecture and available resources. This is determined by several factors:
Hardware Limits (varies by architecture):
- Maximum resident warps per SM: typically 32-64 warps, depending on architecture
- Fermi: 48 warps per SM (1536 threads)
- Kepler/Maxwell/Pascal: 64 warps per SM (2048 threads)
- Volta: 64 warps per SM (2048 threads); Turing: 32 warps per SM (1024 threads)
- Ampere: 64 warps per SM (2048 threads) on GA100; 48 warps (1536 threads) on consumer GA10x parts
- Maximum thread blocks per SM: 16-32, depending on architecture
The actual number of concurrent warps is limited by whichever resource runs out first:
1. Registers: Limited register file per SM (e.g., 65,536 32-bit registers). Max warps = total registers / (registers per thread * 32).
2. Shared Memory: Limited shared memory per SM (e.g., 96 KB). Max resident blocks = total shared memory / shared memory per block, which caps warps at blocks * warps per block.
3. Thread Block Limit: Maximum number of blocks per SM.
4. Warp Limit: The hard limit on concurrent warps.
Example calculation:
// If a kernel uses 32 registers per thread:
//   Register limit: 65536 / (32 * 32) = 64 warps
// If a kernel uses 48 KB of shared memory per block, with 8 warps per block:
//   Shared memory limit: 98304 bytes / 49152 bytes = 2 blocks = 16 warps
// Shared memory becomes the bottleneck!
Occupancy:
Occupancy is the ratio of active warps to maximum possible warps. Higher occupancy generally improves performance by hiding memory latency, but 100% occupancy isn't always necessary for optimal performance.
Use the CUDA Occupancy Calculator or
cudaOccupancyMaxActiveBlocksPerMultiprocessor() to determine occupancy for your kernels.
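A minimal usage sketch (assuming a kernel myKernel launched with 256-thread blocks and no dynamic shared memory):
int numBlocksPerSm = 0;
int blockSize = 256;
// Ask the runtime how many blocks of myKernel can be resident on one SM
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, myKernel,
                                              blockSize, 0);
int activeWarps = numBlocksPerSm * blockSize / 32;
// Divide by the device's maximum resident warps (e.g., 64) for the occupancy ratio
printf("Active warps per SM: %d\n", activeWarps);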
Q: What is the difference between a block and a thread?
A : Blocks and threads are two fundamental levels of the CUDA execution hierarchy:
Thread:
- The smallest unit of execution in CUDA
- Each thread executes the kernel code independently
- Has a unique threadIdx (x, y, z) within its block
- Has private registers and local memory
- Can access shared memory within its block
- Can access global memory across all blocks
Block:
- A group of threads that execute together on the same SM
- Threads within a block can cooperate via shared memory and synchronization
- Has a unique blockIdx (x, y, z) within the grid
- Contains up to 1024 threads (architecture dependent)
- All threads in a block share the same shared memory space
- Blocks execute independently and can complete in any order
// Define block and grid dimensions
dim3 threadsPerBlock(16, 16); // 256 threads per block
dim3 numBlocks(32, 32);       // 1024 blocks in the grid

// Launch kernel with 262,144 total threads
myKernel<<<numBlocks, threadsPerBlock>>>(...);

// Inside the kernel:
__global__ void myKernel(...) {
    // Calculate global thread ID
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // A thread can access:
    // - Its own registers (private)
    // - Shared memory (block-level)
    // - Global memory (device-level)
}
Key Differences:
| Aspect | Thread | Block |
|---|---|---|
| Scope | Individual execution unit | Group of threads |
| Synchronization | Within warp (implicit) | Within block (__syncthreads()) |
| Shared Memory | Access via block | Owns shared memory space |
| Independence | Depends on block | Completely independent |
Q: Can threads communicate with each other? What about blocks?
A : Communication between threads and blocks in CUDA has different capabilities and limitations:
Thread Communication Within a Block:
Threads within the same block CAN communicate efficiently through:
1. Shared Memory:
__global__ void communicateWithinBlock() {
    __shared__ float sharedData[256]; // assumes blockDim.x <= 256
    int tid = threadIdx.x;
    // Each thread writes to shared memory
    sharedData[tid] = tid * 2.0f;
    // Synchronize to ensure all writes are visible
    __syncthreads();
    // Each thread reads data written by another thread in the block
    float neighbor = sharedData[(tid + 1) % blockDim.x];
}
2. Warp-level Primitives (within same warp):
__global__ void warpCommunication() {
    int lane = threadIdx.x % 32;
    int value = lane;
    // Shuffle data between threads in the same warp
    int neighbor = __shfl_down_sync(0xffffffff, value, 1);
    // Warp-level reduction (requires compute capability 8.0+)
    int sum = __reduce_add_sync(0xffffffff, value);
}
Block Communication:
Threads from DIFFERENT blocks CANNOT directly communicate during kernel execution because:
- Blocks may execute on different SMs
- Blocks may execute at different times
- No synchronization mechanism exists between blocks during a single kernel launch
However, blocks CAN communicate indirectly through:
1. Global Memory (with atomic operations):
__global__ void interBlockComm(int *globalCounter) {
    // Atomic operations ensure correct updates across blocks
    atomicAdd(globalCounter, 1);
    // But there is no guarantee about execution order!
}
2. Multiple Kernel Launches:
// Kernel 1: All blocks write results to global memory
kernel1<<<numBlocks, threadsPerBlock>>>(data);
cudaDeviceSynchronize(); // Wait for all blocks to complete
// Kernel 2: Read and process the results from all blocks
kernel2<<<numBlocks, threadsPerBlock>>>(data);
3. Cooperative Groups (CUDA 9+):
#include <cooperative_groups.h>

__global__ void cooperativeKernel(int *data) {
    auto grid = cooperative_groups::this_grid();
    // Do work...
    // Synchronize ALL blocks in the grid
    grid.sync();
    // Now all blocks have completed the previous work
}

// Launch with the cooperative launch API
cudaLaunchCooperativeKernel((void*)cooperativeKernel,
                            numBlocks, threadsPerBlock, args);
Summary:
- Within Block: Easy communication via shared memory and __syncthreads()
- Within Warp: Very fast communication via shuffle operations
- Between Blocks: No direct communication; use global memory, multiple kernels, or cooperative groups