Top CUDA Programming (2026) Interview Questions | JavaInUse

Top CUDA Programming frequently asked interview questions.

In this post we look at commonly asked CUDA Programming interview questions. Examples are provided with explanations.


Q: How many different kinds of memory are there in a GPU?

A : CUDA-enabled GPUs have several types of memory, each with different characteristics in terms of size, speed, and scope:

  • Global Memory - Largest memory space accessible by all threads. High latency (400-800 cycles). Persists for the lifetime of the application. Typically several GB in size.
  • Shared Memory - On-chip memory shared among threads within a block. Much faster than global memory (low latency). Limited size (typically 48-96 KB per SM).
  • Registers - Fastest memory, private to each thread. Very limited in quantity (typically a 64K-entry file of 32-bit registers per SM, divided among all resident threads).
  • Local Memory - Private to each thread but physically resides in global memory. Used for register spilling and large structures. Same latency as global memory.
  • Constant Memory - Read-only memory cached on-chip. Total size of 64KB. Optimized for broadcast access patterns where all threads read the same address.
  • Texture Memory - Read-only memory with special caching optimized for 2D spatial locality. Useful for image processing.
  • L1/L2 Cache - Hardware-managed caches for global and local memory accesses. L2 is shared across all SMs.
Understanding the memory hierarchy is crucial for optimizing CUDA applications, as memory bandwidth is often the primary performance bottleneck.
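A minimal sketch of how several of these spaces appear in a kernel (names and sizes are illustrative):
// Constant memory: declared at file scope, 64KB total
__constant__ float filter[64];

__global__ void memorySpacesDemo(float *globalIn, float *globalOut) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // scalar locals live in registers
    
    // Shared memory: visible to every thread in this block
    // (assumes blockDim.x <= 256)
    __shared__ float staging[256];
    
    // Large per-thread arrays may spill to local memory
    float scratch[64];
    scratch[idx % 64] = globalIn[idx];                // global memory read
    
    staging[threadIdx.x] = scratch[idx % 64] * filter[0];
    __syncthreads();
    
    globalOut[idx] = staging[threadIdx.x];            // global memory write
}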

Q: What does coalesced/uncoalesced memory access mean?

A : Memory coalescing is a technique where the GPU combines multiple memory accesses from threads in a warp into fewer transactions to improve memory bandwidth utilization.

Coalesced Access:
When consecutive threads in a warp access consecutive memory addresses, the memory controller can combine these accesses into a single transaction. This is the most efficient access pattern.

Example of coalesced access:
__global__ void coalescedAccess(float *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread accesses consecutive addresses
    float value = data[idx];  // COALESCED
}
Uncoalesced Access:
When threads in a warp access scattered or misaligned memory addresses, each access may require a separate transaction, significantly reducing memory bandwidth efficiency.

Example of uncoalesced access:
__global__ void uncoalescedAccess(float *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Strided access with large stride
    float value = data[idx * 32];  // UNCOALESCED
}
Performance Impact:
Uncoalesced memory accesses can reduce memory bandwidth utilization by 10x or more. Modern GPUs are more tolerant of some access patterns, but coalescing remains critical for performance.

Best Practices:
  • Align data structures to 128-byte boundaries
  • Access memory in sequential patterns within a warp
  • Avoid scattered or strided access patterns when possible
  • Use shared memory to reorganize data for coalesced global memory access
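A common application of these practices is preferring a structure-of-arrays layout over an array-of-structures; a hedged sketch (the Particle type is illustrative):
// Array-of-structures: thread i reads particles[i].x, so consecutive
// threads touch addresses 16 bytes apart - wasted bandwidth
struct Particle { float x, y, z, w; };

__global__ void aosRead(Particle *particles, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = particles[i].x;   // strided, poorly coalesced
}

// Structure-of-arrays: consecutive threads read consecutive floats
__global__ void soaRead(float *xs, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = xs[i];            // fully coalesced
}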

Q: Can you implement a matrix transpose kernel?

A : Matrix transpose is a common operation that demonstrates the importance of memory access patterns in CUDA. Here's an optimized implementation using shared memory to avoid uncoalesced writes:

#define TILE_DIM 32
#define BLOCK_ROWS 8

__global__ void transposeCoalesced(float *odata, float *idata, 
                                   int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1]; // +1 to avoid bank conflicts
    
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    
    // Coalesced read from global memory into shared memory
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < width && (y + j) < height) {
            tile[threadIdx.y + j][threadIdx.x] = 
                idata[(y + j) * width + x];
        }
    }
    
    __syncthreads();
    
    // Transpose block indices
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    
    // Coalesced write from shared memory to global memory
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < height && (y + j) < width) {
            odata[(y + j) * height + x] = 
                tile[threadIdx.x][threadIdx.y + j];
        }
    }
}
Key optimizations:
  • Uses shared memory as an intermediate buffer
  • Adds padding (+1) to shared memory to avoid bank conflicts
  • Ensures both reads and writes are coalesced
  • Uses tiles to improve data locality
Usage:
dim3 dimGrid((width + TILE_DIM - 1) / TILE_DIM, 
             (height + TILE_DIM - 1) / TILE_DIM);
dim3 dimBlock(TILE_DIM, BLOCK_ROWS);
transposeCoalesced<<<dimGrid, dimBlock>>>(d_out, d_in, width, height);
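For contrast, a minimal naive transpose (a hedged sketch): its reads are coalesced, but its writes walk down columns of the output, which is exactly the uncoalesced pattern the shared-memory tile above eliminates:
__global__ void transposeNaive(float *odata, float *idata,
                               int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (x < width && y < height) {
        // Coalesced read (consecutive x), but the write strides
        // by 'height' floats between adjacent threads: uncoalesced
        odata[x * height + y] = idata[y * width + x];
    }
}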


Q: What is a warp?

A : A warp is the fundamental unit of execution in CUDA. It consists of 32 threads that execute the same instruction simultaneously in SIMT (Single Instruction, Multiple Thread) fashion.

Key Characteristics:
  • Size: Always 32 threads (on current NVIDIA GPUs)
  • Execution: All threads in a warp execute the same instruction at the same time
  • Scheduling: The warp scheduler selects warps that are ready to execute
  • Thread Grouping: Threads are grouped into warps based on their threadIdx (threads 0-31 form warp 0, threads 32-63 form warp 1, etc.)
Warp Divergence:
When threads in a warp take different execution paths (e.g., due to if-else statements), this is called warp divergence. The warp must execute both paths serially, with inactive threads masked out, reducing performance.

Example of warp divergence:
__global__ void divergentKernel(int *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
    // This causes divergence if idx % 2 varies within a warp
    if (idx % 2 == 0) {
        data[idx] = idx * 2;      // Some threads execute this
    } else {
        data[idx] = idx * 3;      // Other threads execute this
    }
}
Best Practices:
  • Minimize branch divergence within warps
  • Organize data and thread indexing to keep threads in the same warp on the same code path
  • Ensure memory accesses within a warp are coalesced
  • Consider warp size when designing algorithms and data structures
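One way to follow the first two practices is to branch on a warp-aligned value, so every thread in a warp takes the same path; a minimal sketch (assumes blockDim.x is a multiple of 32):
__global__ void warpAlignedBranch(int *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = idx / 32;   // identical for all 32 threads of a warp
    
    // Whole warps take one branch or the other: no divergence
    if (warpId % 2 == 0) {
        data[idx] = idx * 2;
    } else {
        data[idx] = idx * 3;
    }
}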


Q: How many warps can run simultaneously inside a multiprocessor?

A : The number of warps that can run simultaneously on a Streaming Multiprocessor (SM) depends on the GPU architecture and available resources. This is determined by several factors:

Hardware Limits (varies by architecture):
  • Maximum resident warps per SM: Typically 32-64 warps depending on architecture
    • Fermi: 48 warps per SM (1536 threads)
    • Kepler/Maxwell/Pascal: 64 warps per SM (2048 threads)
    • Volta: 64 warps per SM (2048 threads); Turing: 32 warps per SM (1024 threads)
    • Ampere: 48 or 64 warps per SM (1536 or 2048 threads, depending on the chip)
  • Maximum thread blocks per SM: 16-32 depending on architecture
Resource Constraints:
The actual number of concurrent warps is limited by whichever resource runs out first:
1. Registers: Limited number of registers per SM (e.g., 65,536)
   Warps = (Total Registers) / (Registers per thread * 32)

2. Shared Memory: Limited shared memory per SM (e.g., 96 KB)
   Blocks = (Total Shared Memory) / (Shared Memory per block)
   Warps = Blocks * (Warps per block)

3. Thread Block Limit: Maximum blocks per SM

4. Warp Limit: Hard limit on concurrent warps
Example calculation:
// If a kernel uses 32 registers per thread:
// Register limit: 65536 / (32 * 32) = 64 warps ✓

// If a kernel uses 48KB shared memory per block with 8 warps per block:
// Shared memory limit: 96KB / 48KB = 2 blocks * 8 warps/block = 16 warps ✗
// Shared memory becomes the bottleneck!
Occupancy:
Occupancy is the ratio of active warps to maximum possible warps. Higher occupancy generally improves performance by hiding memory latency, but 100% occupancy isn't always necessary for optimal performance.

Use the CUDA Occupancy Calculator or cudaOccupancyMaxActiveBlocksPerMultiprocessor() to determine occupancy for your kernels.
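A short sketch of querying occupancy at runtime with the API mentioned above (myKernel and the block size are illustrative):
int blockSize = 256;
int maxBlocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM,
                                              myKernel, blockSize, 0);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

int activeWarps = maxBlocksPerSM * blockSize / 32;
int maxWarps = prop.maxThreadsPerMultiProcessor / 32;
printf("Occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);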

Q: What is the difference between a block and a thread?

A : Blocks and threads are two fundamental levels of the CUDA execution hierarchy:

Thread:
  • The smallest unit of execution in CUDA
  • Each thread executes the kernel code independently
  • Has a unique threadIdx (x, y, z) within its block
  • Has private registers and local memory
  • Can access shared memory within its block
  • Can access global memory across all blocks
Block (Thread Block):
  • A group of threads that execute together on the same SM
  • Threads within a block can cooperate via shared memory and synchronization
  • Has a unique blockIdx (x, y, z) within the grid
  • Contains up to 1024 threads (architecture dependent)
  • All threads in a block share the same shared memory space
  • Blocks execute independently and can complete in any order
Example kernel configuration:
// Define block and grid dimensions
dim3 threadsPerBlock(16, 16);  // 256 threads per block
dim3 numBlocks(32, 32);        // 1024 blocks in grid

// Launch kernel with 262,144 total threads
myKernel<<<numBlocks, threadsPerBlock>>>(...);

// Inside kernel:
__global__ void myKernel(...) {
    // Calculate global thread ID
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    
    // Thread can access:
    // - Its own registers (private)
    // - Shared memory (block-level)
    // - Global memory (device-level)
}
Key Differences:
Aspect          | Thread                    | Block
----------------|---------------------------|-------------------------------
Scope           | Individual execution unit | Group of threads
Synchronization | Within warp (implicit)    | Within block (__syncthreads())
Shared Memory   | Access via block          | Owns shared memory space
Independence    | Depends on block          | Completely independent


Q: Can threads communicate between them? What about blocks?

A : Communication between threads and blocks in CUDA has different capabilities and limitations:

Thread Communication Within a Block:
Threads within the same block CAN communicate efficiently through:

1. Shared Memory:
__global__ void communicateWithinBlock() {
    __shared__ float sharedData[256];
    
    int tid = threadIdx.x;
    
    // Thread writes to shared memory
    sharedData[tid] = tid * 2.0f;
    
    // Synchronize to ensure all writes are visible
    __syncthreads();
    
    // Thread reads data written by other threads in the block
    float neighbor = sharedData[(tid + 1) % blockDim.x];
}
2. Warp-level Primitives (within same warp):
__global__ void warpCommunication() {
    int lane = threadIdx.x % 32;
    int value = lane;
    
    // Shuffle data between threads in the same warp
    int neighbor = __shfl_down_sync(0xffffffff, value, 1);
    
    // Warp-level reduction (integer reductions require compute capability 8.0+)
    int sum = __reduce_add_sync(0xffffffff, value);
}
Block Communication:
Threads from DIFFERENT blocks CANNOT directly communicate during kernel execution because:
  • Blocks may execute on different SMs
  • Blocks may execute at different times
  • No synchronization mechanism exists between blocks during a single kernel launch
Workarounds for Inter-block Communication:

1. Global Memory (with atomic operations):
__global__ void interBlockComm(int *globalCounter) {
    // Atomic operations ensure correct updates across blocks
    atomicAdd(globalCounter, 1);
    
    // But no guarantee about execution order!
}
2. Multiple Kernel Launches:
// Kernel 1: All blocks write results to global memory
kernel1<<<grid, block>>>(data);
cudaDeviceSynchronize();  // Wait for all blocks to complete

// Kernel 2: Read and process results from all blocks
kernel2<<<grid, block>>>(data);
3. Cooperative Groups (CUDA 9+):
#include <cooperative_groups.h>

__global__ void cooperativeKernel(int *data) {
    auto grid = cooperative_groups::this_grid();
    
    // Do work...
    
    // Synchronize ALL blocks in the grid
    grid.sync();
    
    // Now all blocks have completed previous work
}

// Launch with the cooperative launch API; args is an array of
// pointers to the kernel arguments (here, the device pointer)
void *args[] = { &data };
cudaLaunchCooperativeKernel((void*)cooperativeKernel, numBlocks,
                            threadsPerBlock, args, 0, 0);
Summary:
  • Within Block: Easy communication via shared memory and __syncthreads()
  • Within Warp: Very fast communication via shuffle operations
  • Between Blocks: No direct communication; use global memory, multiple kernels, or cooperative groups






Q: How does a cache work in GPU architecture?

A : GPU caches are designed differently from CPU caches because GPUs optimize for throughput rather than latency. Understanding GPU caching is essential for performance optimization.

GPU Cache Hierarchy:

1. L1 Cache:
  • Located on each SM (Streaming Multiprocessor)
  • Caches global and local memory accesses
  • Size: Typically 16-128 KB per SM (architecture dependent)
  • Shared with shared memory in some architectures
  • Can be configured for different shared memory/L1 splits
Configuration example:
// Prefer more L1 cache
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

// Prefer more shared memory
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

// Equal split
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferEqual);
2. L2 Cache:
  • Shared by all SMs on the GPU
  • Caches global memory accesses
  • Size: Several MB (e.g., 6MB on A100)
  • Unified cache for all memory traffic
  • Persisting cache in newer architectures (Ampere+)
L2 cache persistence (Ampere+):
// Set aside part of L2 for persistent data
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 4 * 1024 * 1024);

// Mark memory region for L2 persistence
cudaStreamAttrValue stream_attribute;
stream_attribute.accessPolicyWindow.base_ptr = d_data;
stream_attribute.accessPolicyWindow.num_bytes = size;
stream_attribute.accessPolicyWindow.hitRatio = 1.0;
stream_attribute.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, 
                      &stream_attribute);
3. Texture Cache:
  • Read-only cache optimized for 2D/3D spatial locality
  • Useful for image processing and irregular access patterns
  • Can provide performance benefits for certain access patterns
4. Constant Cache:
  • Dedicated cache for constant memory
  • Extremely fast for broadcast reads (all threads read same address)
  • 64 KB constant memory space
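A small sketch of the constant-memory broadcast pattern (the coeffs array is illustrative):
// Lives in the 64 KB constant space; set from the host with
// cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(coeffs));
__constant__ float coeffs[16];

__global__ void applyCoeff(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Every thread reads the same address: a broadcast served
        // by the constant cache in a single access
        data[i] *= coeffs[0];
    }
}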
Cache Behavior Characteristics:
  • Write Policy: GPUs use write-through or write-evict policies (no write-back)
  • Coherence: No hardware cache coherence between L1 caches on different SMs
  • Line Size: Typically 128 bytes (32 floats or 16 doubles)
  • Latency Hiding: GPUs rely on thread parallelism to hide cache miss latency
Best Practices:
  • Maximize memory coalescing to utilize cache lines effectively
  • Reuse data within blocks using shared memory rather than relying on L1 cache
  • Use constant memory for read-only data accessed by all threads
  • Consider texture memory for irregular access patterns
  • Profile cache hit rates using NVIDIA Nsight Compute


Q: What is the difference between shared memory and registers?

A : Shared memory and registers are both on-chip memory types in CUDA, but they have different characteristics and use cases:

Registers:
  • Scope: Private to each thread
  • Speed: Fastest memory (1 cycle latency)
  • Size: Limited (typically 255 registers per thread max; a 64K-entry file of 32-bit registers per SM)
  • Allocation: Automatic by compiler for local variables
  • Access: No conflicts, each thread has its own registers
  • Lifetime: Duration of thread execution
Register usage example:
__global__ void registerExample() {
    // These variables are stored in registers
    int x = threadIdx.x;
    float temp = x * 2.0f;
    float result = temp + 1.0f;
    
    // Each thread has its own x, temp, result in registers
}
Shared Memory:
  • Scope: Shared among all threads in a block
  • Speed: Very fast (comparable to registers when no bank conflicts)
  • Size: Configurable (typically 48-96 KB per SM)
  • Allocation: Explicit declaration by programmer
  • Access: Can have bank conflicts if multiple threads access same bank
  • Lifetime: Duration of block execution
Shared memory usage example:
__global__ void sharedMemoryExample() {
    // Explicitly declared shared memory
    __shared__ float sharedData[256];
    
    int tid = threadIdx.x;
    
    // Each thread can read/write shared memory
    sharedData[tid] = tid * 2.0f;
    __syncthreads();
    
    // Threads can access each other's data
    float neighbor = sharedData[(tid + 1) % blockDim.x];
}
Bank Conflicts in Shared Memory:
Shared memory is organized into banks (32 banks on modern GPUs). When multiple threads in a warp access different addresses in the same bank simultaneously, a bank conflict occurs, serializing the accesses.

Avoiding bank conflicts:
// Bad: Bank conflicts
__shared__ float data[32][32];
float value = data[threadIdx.x][threadIdx.y];  // Conflict!

// Good: Add padding to avoid conflicts
__shared__ float data[32][33];  // +1 padding
float value = data[threadIdx.x][threadIdx.y];  // No conflict
Comparison Table:
Aspect     | Registers              | Shared Memory
-----------|------------------------|------------------------------
Visibility | Private to thread      | Shared within block
Speed      | Fastest                | Very fast (if no conflicts)
Size       | 64K registers per SM   | 48-96 KB per SM
Allocation | Automatic              | Explicit (__shared__)
Use Case   | Thread-local variables | Inter-thread communication
When to Use Each:
  • Registers: For thread-private frequently-accessed variables
  • Shared Memory: For data sharing and cooperative algorithms within a block
  • Excessive register usage reduces occupancy (register spilling to local memory)
  • Excessive shared memory usage also reduces occupancy


Q: Which algorithms perform better on the GPU: data-bound or compute-bound?

A : Understanding whether an algorithm is memory-bound or compute-bound is crucial for GPU optimization:

Memory-Bound (Data-Bound) Algorithms:
  • Performance limited by memory bandwidth rather than compute throughput
  • Spend most time waiting for data from memory
  • Low arithmetic intensity (few operations per byte transferred)
  • Examples: Vector addition, simple element-wise operations, memory copies
Example (memory-bound):
// Vector addition: 2 reads + 1 write = 12 bytes, 1 FLOP
// Arithmetic intensity = 1/12 ≈ 0.08 FLOP/byte (very low)
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];  // Memory-bound
    }
}
Compute-Bound Algorithms:
  • Performance limited by computational throughput
  • High arithmetic intensity (many operations per byte)
  • Keep GPU cores busy with calculations
  • Examples: Matrix multiplication, FFT, ray tracing, deep learning training
Example (compute-bound):
// Matrix multiplication: O(n³) operations, O(n²) data
// Much higher arithmetic intensity
__global__ void matMul(float *A, float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (row < n && col < n) {  // Guard against out-of-range threads
        float sum = 0.0f;
        for (int k = 0; k < n; k++) {
            sum += A[row * n + k] * B[k * n + col];  // Compute-bound
        }
        C[row * n + col] = sum;
    }
}
Which Performs Better on GPU?

Compute-bound algorithms typically see greater speedup on GPUs because:
  • GPUs have massive parallel compute capability (thousands of cores)
  • High operations-to-memory-transfer ratio maximizes GPU utilization
  • Can achieve near-theoretical peak FLOPS
  • Speedups of 10-100x over CPU are common
Memory-bound algorithms see less speedup because:
  • GPU memory bandwidth, while high, is often the bottleneck
  • Many cores remain idle waiting for data
  • Speedup limited by memory bandwidth ratio (GPU:CPU ≈ 10-20x)
  • Harder to optimize beyond coalescing and caching
Arithmetic Intensity Analysis:
Arithmetic Intensity = (Operations) / (Bytes Transferred)

High intensity (>10 FLOP/byte):  Compute-bound → GPU excels
Low intensity (<1 FLOP/byte):    Memory-bound → Limited speedup
Optimization Strategies:
For Memory-Bound:
  • Maximize memory coalescing
  • Use shared memory to reduce global memory accesses
  • Increase arithmetic intensity by fusing operations
  • Optimize memory access patterns
For Compute-Bound:
  • Maximize occupancy to hide latency
  • Use appropriate math precision (FP16 can double throughput)
  • Leverage tensor cores for matrix operations (if available)
  • Minimize divergence and control flow overhead
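As a sketch of the fusion idea listed under "For Memory-Bound" (names are illustrative): merging two element-wise passes into one kernel halves the global memory traffic and doubles arithmetic intensity:
// Unfused: kernel 1 computes y[i] = a * x[i], kernel 2 computes
// z[i] = y[i] + b - four memory operations per element in total

// Fused: one read and one write per element, same two FLOPs
__global__ void fusedScaleAdd(float *x, float *z, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        z[i] = a * x[i] + b;
    }
}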
The Roofline Model is an excellent tool for understanding whether your kernel is memory or compute-bound and identifying optimization opportunities.

Q: What are the steps to port an application to CUDA?

A : Porting an application to CUDA involves systematic analysis, implementation, and optimization. Here's a comprehensive approach:

Step 1: Profile and Identify Hotspots
  • Use profiling tools (gprof, perf, etc.) to identify compute-intensive sections
  • Focus on code that accounts for >80% of execution time
  • Look for data-parallel operations (loops without dependencies)
// Example: Identify this loop as a hotspot
for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];  // Data-parallel, good for GPU
}
Step 2: Assess Parallelism
  • Determine if the algorithm is data-parallel
  • Check for dependencies between iterations
  • Estimate potential speedup using Amdahl's Law
  • Consider data transfer overhead vs. computation time
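A quick Amdahl's Law sanity check (numbers are illustrative):
Speedup = 1 / ((1 - p) + p / s)

Example: p = 0.9 (90% of runtime is parallelizable), s = 50x on the GPU
Speedup = 1 / (0.1 + 0.9 / 50) = 1 / 0.118 ≈ 8.5x overall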
Step 3: Design Memory Management Strategy
  • Identify data that needs to be on GPU
  • Plan data transfer pattern (host-to-device, device-to-host)
  • Consider using pinned memory for faster transfers
  • Minimize transfers between CPU and GPU
// Allocate device memory
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);

// Copy data to device
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
Step 4: Implement CUDA Kernels
  • Start with a simple, naive implementation
  • Convert loops to parallel kernel launches
  • Calculate appropriate grid and block dimensions
  • Handle boundary conditions
__global__ void vectorAddKernel(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Launch kernel
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);
Step 5: Verify Correctness
  • Compare GPU results with CPU reference implementation
  • Use compute-sanitizer (formerly cuda-memcheck) for memory errors
  • Check for kernel launch errors
  • Test with different input sizes and edge cases
// Check for kernel launch errors
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    printf("Kernel launch error: %s\n", cudaGetErrorString(err));
}

// Synchronize and check for execution errors
cudaDeviceSynchronize();
Step 6: Optimize Performance
  • Profile with NVIDIA Nsight Systems and Nsight Compute
  • Optimize memory access patterns (coalescing)
  • Use shared memory for data reuse
  • Minimize divergence
  • Tune grid and block dimensions
  • Overlap computation with data transfer using streams
Step 7: Advanced Optimization
  • Consider using CUDA libraries (cuBLAS, cuFFT, Thrust)
  • Implement persistent kernel strategies
  • Use async operations and multiple streams
  • Optimize register and shared memory usage
  • Consider using unified memory for easier development
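As a taste of the library route mentioned above, a hedged Thrust sketch of vector addition (no hand-written kernel needed):
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/copy.h>

void thrustVectorAdd(const float *h_a, const float *h_b, float *h_c, int n) {
    // Constructors copy host data to the device
    thrust::device_vector<float> a(h_a, h_a + n);
    thrust::device_vector<float> b(h_b, h_b + n);
    thrust::device_vector<float> c(n);
    
    // Element-wise addition, launched as a kernel under the hood
    thrust::transform(a.begin(), a.end(), b.begin(), c.begin(),
                      thrust::plus<float>());
    
    thrust::copy(c.begin(), c.end(), h_c);  // result back to host
}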
Step 8: Iterate and Refine
  • Measure performance improvements at each stage
  • Focus on bottlenecks identified by profiling
  • Document assumptions and design decisions
  • Consider portability to other GPU architectures


Q: What is a barrier in CUDA?

A : A barrier is a synchronization point where threads must wait until all threads in a group reach that point before proceeding. In CUDA, barriers ensure memory consistency and coordinate execution among threads.

Block-Level Barrier (__syncthreads()):
The most common barrier in CUDA synchronizes all threads within a block:
__global__ void barrierExample() {
    __shared__ float sharedData[256];
    int tid = threadIdx.x;
    
    // Phase 1: Each thread writes to shared memory
    sharedData[tid] = tid * 2.0f;
    
    // Barrier: Wait for all threads to complete writes
    __syncthreads();
    
    // Phase 2: Now safe to read any location in sharedData
    float neighbor = sharedData[(tid + 1) % blockDim.x];
}
Important Rules:
  • All threads in a block must reach the same __syncthreads() call
  • Cannot place __syncthreads() in conditional code where not all threads execute it
  • Only synchronizes threads within a block (not across blocks)
Incorrect usage (causes deadlock):
__global__ void incorrectBarrier() {
    if (threadIdx.x < 128) {
        __syncthreads();  // ERROR! Only half the threads hit this
    }
}
Correct usage:
__global__ void correctBarrier() {
    __shared__ float data[256];
    
    if (threadIdx.x < 128) {
        data[threadIdx.x] = threadIdx.x;
    }
    
    // All threads hit this barrier
    __syncthreads();
    
    // Now safe to read
    float value = data[threadIdx.x];
}
Warp-Level Synchronization:
Modern CUDA provides warp-level synchronization primitives:
__global__ void warpBarrier() {
    int lane = threadIdx.x % 32;
    int value = lane;
    
    // Synchronize threads in a warp with mask
    __syncwarp(0xffffffff);  // All 32 threads in warp
    
    // Shuffle operations include implicit synchronization
    int neighbor = __shfl_down_sync(0xffffffff, value, 1);
}
Grid-Level Synchronization (Cooperative Groups):
For synchronizing across blocks (requires cooperative launch):
#include <cooperative_groups.h>

__global__ void gridBarrier() {
    auto grid = cooperative_groups::this_grid();
    
    // Do work...
    
    // Synchronize ALL threads in ALL blocks
    grid.sync();
    
    // Continue after all blocks reach this point
}
Performance Considerations:
  • Barriers cause stalls and reduce occupancy temporarily
  • Minimize the number of barriers in your kernel
  • Combine multiple synchronization points when possible
  • Consider redesigning algorithms to reduce synchronization needs
Common Use Cases:
  • Ensuring shared memory writes are visible before reads
  • Multi-phase algorithms within a kernel
  • Cooperative algorithms like parallel reduction
  • Tiled matrix operations
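A minimal sketch of the parallel-reduction use case (assumes blockDim.x is a power of two, at most 256):
__global__ void blockReduce(float *in, float *out) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();   // all loads visible before the tree starts
    
    // Each step halves the active threads; the barrier keeps
    // every level of the tree consistent
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();   // outside the if, so all threads reach it
    }
    
    if (tid == 0) {
        out[blockIdx.x] = sdata[0];   // one partial sum per block
    }
}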


Q: What is a Stream in CUDA?

A : A CUDA stream is a sequence of operations that execute in order on the GPU. Streams enable concurrent execution of kernels and memory transfers, improving overall throughput.

Key Concepts:
  • Operations in the same stream execute in order
  • Operations in different streams can execute concurrently
  • Default stream (stream 0) serializes operations
  • Useful for overlapping computation and data transfer
Creating and Using Streams:
// Create streams
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Launch operations on different streams
cudaMemcpyAsync(d_a1, h_a1, size, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_a2, h_a2, size, cudaMemcpyHostToDevice, stream2);

// Fourth launch parameter selects the stream
kernel<<<grid, block, 0, stream1>>>(d_a1, d_b1);
kernel<<<grid, block, 0, stream2>>>(d_a2, d_b2);

cudaMemcpyAsync(h_c1, d_c1, size, cudaMemcpyDeviceToHost, stream1);
cudaMemcpyAsync(h_c2, d_c2, size, cudaMemcpyDeviceToHost, stream2);

// Synchronize streams
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

// Cleanup
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
Overlapping Computation and Transfer:
const int nStreams = 4;
cudaStream_t streams[nStreams];
for (int i = 0; i < nStreams; i++) {
    cudaStreamCreate(&streams[i]);
}

int chunkSize = n / nStreams;

for (int i = 0; i < nStreams; i++) {
    int offset = i * chunkSize;
    
    // These operations can overlap across streams
    cudaMemcpyAsync(&d_data[offset], &h_data[offset], 
                   chunkSize * sizeof(float), 
                   cudaMemcpyHostToDevice, streams[i]);
                   
    processKernel<<<(chunkSize + 255) / 256, 256, 0, streams[i]>>>(
        &d_data[offset], chunkSize);
    
    // Copy the processed chunk back (in-place processing assumed)
    cudaMemcpyAsync(&h_result[offset], &d_data[offset], 
                   chunkSize * sizeof(float), 
                   cudaMemcpyDeviceToHost, streams[i]);
}

// Wait for all streams
cudaDeviceSynchronize();
Stream Priorities:
// Create high and low priority streams
int leastPriority, greatestPriority;
cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

cudaStream_t highPriorityStream;
cudaStreamCreateWithPriority(&highPriorityStream, cudaStreamNonBlocking, 
                            greatestPriority);
Benefits of Using Streams:
  • Improved GPU utilization through concurrent execution
  • Overlap data transfer with computation
  • Reduced overall execution time
  • Better resource utilization
Requirements for Concurrency:
  • Use cudaMemcpyAsync instead of cudaMemcpy
  • Use pinned (page-locked) host memory for async transfers
  • Sufficient GPU resources (multiple copy engines, enough SMs)
  • Operations must be independent
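A short sketch of the pinned-memory requirement (sizes are illustrative):
float *h_data;
// Page-locked host allocation: required for cudaMemcpyAsync to
// actually overlap with kernel execution
cudaMallocHost(&h_data, n * sizeof(float));

// ... fill h_data, issue cudaMemcpyAsync calls on streams ...

cudaFreeHost(h_data);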
Best Practices:
  • Use multiple streams to hide memory transfer latency
  • Divide work into chunks that can be processed independently
  • Profile to ensure concurrency is actually achieved
  • Balance chunk sizes to maximize overlap
  • Reuse streams rather than creating/destroying frequently
