Top CUDA Programming (2026) Interview Questions | JavaInUse

Top CUDA Programming frequently asked interview questions.

In this post we look at commonly asked CUDA Programming interview questions. Examples are provided with explanations.


Q: How many different kinds of memory are there in a GPU?

A : CUDA-enabled GPUs have several types of memory, each with different characteristics in terms of size, speed, and scope:

  • Global Memory - Largest memory space accessible by all threads. High latency (400-800 cycles). Persists for the lifetime of the application. Typically several GB in size.
  • Shared Memory - On-chip memory shared among threads within a block. Much faster than global memory (low latency). Limited size (typically 48-96 KB per SM).
  • Registers - Fastest memory, private to each thread. Very limited in quantity (typically a 64K-entry file of 32-bit registers per SM, divided among all resident threads).
  • Local Memory - Private to each thread but physically resides in global memory. Used for register spilling and large structures. Same latency as global memory.
  • Constant Memory - Read-only memory cached on-chip. Total size of 64KB. Optimized for broadcast access patterns where all threads read the same address.
  • Texture Memory - Read-only memory with special caching optimized for 2D spatial locality. Useful for image processing.
  • L1/L2 Cache - Hardware-managed caches for global and local memory accesses. L2 is shared across all SMs.
Understanding the memory hierarchy is crucial for optimizing CUDA applications, as memory bandwidth is often the primary performance bottleneck.
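A minimal sketch of how several of these spaces appear in a kernel (names and sizes are illustrative):
// Constant memory: declared at file scope, 64KB total
__constant__ float filter[64];

__global__ void memorySpacesDemo(float *globalIn, float *globalOut) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // scalar locals live in registers
    
    // Shared memory: visible to every thread in this block
    // (assumes blockDim.x <= 256)
    __shared__ float staging[256];
    
    // Large per-thread arrays may spill to local memory
    float scratch[64];
    scratch[idx % 64] = globalIn[idx];                // global memory read
    
    staging[threadIdx.x] = scratch[idx % 64] * filter[0];
    __syncthreads();
    
    globalOut[idx] = staging[threadIdx.x];            // global memory write
}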

Q: What does coalesced/uncoalesced memory access mean?

A : Memory coalescing is a technique where the GPU combines multiple memory accesses from threads in a warp into fewer transactions to improve memory bandwidth utilization.

Coalesced Access:
When consecutive threads in a warp access consecutive memory addresses, the memory controller can combine these accesses into a single transaction. This is the most efficient access pattern.

Example of coalesced access:
__global__ void coalescedAccess(float *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread accesses consecutive addresses
    float value = data[idx];  // COALESCED
}
Uncoalesced Access:
When threads in a warp access scattered or misaligned memory addresses, each access may require a separate transaction, significantly reducing memory bandwidth efficiency.

Example of uncoalesced access:
__global__ void uncoalescedAccess(float *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Strided access with large stride
    float value = data[idx * 32];  // UNCOALESCED
}
Performance Impact:
Uncoalesced memory accesses can reduce memory bandwidth utilization by 10x or more. Modern GPUs are more tolerant of some access patterns, but coalescing remains critical for performance.

Best Practices:
  • Align data structures to 128-byte boundaries
  • Access memory in sequential patterns within a warp
  • Avoid scattered or strided access patterns when possible
  • Use shared memory to reorganize data for coalesced global memory access
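A common application of these practices is preferring a structure-of-arrays layout over an array-of-structures; a hedged sketch (the Particle type is illustrative):
// Array-of-structures: thread i reads particles[i].x, so consecutive
// threads touch addresses 16 bytes apart - wasted bandwidth
struct Particle { float x, y, z, w; };

__global__ void aosRead(Particle *particles, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = particles[i].x;   // strided, poorly coalesced
}

// Structure-of-arrays: consecutive threads read consecutive floats
__global__ void soaRead(float *xs, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = xs[i];            // fully coalesced
}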

Q: Can you implement a matrix transpose kernel?

A : Matrix transpose is a common operation that demonstrates the importance of memory access patterns in CUDA. Here's an optimized implementation using shared memory to avoid uncoalesced writes:

#define TILE_DIM 32
#define BLOCK_ROWS 8

__global__ void transposeCoalesced(float *odata, float *idata, 
                                   int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1]; // +1 to avoid bank conflicts
    
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    
    // Coalesced read from global memory into shared memory
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < width && (y + j) < height) {
            tile[threadIdx.y + j][threadIdx.x] = 
                idata[(y + j) * width + x];
        }
    }
    
    __syncthreads();
    
    // Transpose block indices
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    
    // Coalesced write from shared memory to global memory
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < height && (y + j) < width) {
            odata[(y + j) * height + x] = 
                tile[threadIdx.x][threadIdx.y + j];
        }
    }
}
Key optimizations:
  • Uses shared memory as an intermediate buffer
  • Adds padding (+1) to shared memory to avoid bank conflicts
  • Ensures both reads and writes are coalesced
  • Uses tiles to improve data locality
Usage:
dim3 dimGrid((width + TILE_DIM - 1) / TILE_DIM, 
             (height + TILE_DIM - 1) / TILE_DIM);
dim3 dimBlock(TILE_DIM, BLOCK_ROWS);
transposeCoalesced<<<dimGrid, dimBlock>>>(d_out, d_in, width, height);
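For contrast, a minimal naive transpose (a hedged sketch): its reads are coalesced, but its writes walk down columns of the output, which is exactly the uncoalesced pattern the shared-memory tile above eliminates:
__global__ void transposeNaive(float *odata, float *idata,
                               int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (x < width && y < height) {
        // Coalesced read (consecutive x), but the write strides
        // by 'height' floats between adjacent threads: uncoalesced
        odata[x * height + y] = idata[y * width + x];
    }
}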


Q: What is a warp?

A : A warp is the fundamental unit of execution in CUDA. It consists of 32 threads that execute the same instruction simultaneously in SIMT (Single Instruction, Multiple Thread) fashion.

Key Characteristics:
  • Size: Always 32 threads (on current NVIDIA GPUs)
  • Execution: All threads in a warp execute the same instruction at the same time
  • Scheduling: The warp scheduler selects warps that are ready to execute
  • Thread Grouping: Threads are grouped into warps based on their threadIdx (threads 0-31 form warp 0, threads 32-63 form warp 1, etc.)
Warp Divergence:
When threads in a warp take different execution paths (e.g., due to if-else statements), this is called warp divergence. The warp must execute both paths serially, with inactive threads masked out, reducing performance.

Example of warp divergence:
__global__ void divergentKernel(int *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
    // This causes divergence if idx % 2 varies within a warp
    if (idx % 2 == 0) {
        data[idx] = idx * 2;      // Some threads execute this
    } else {
        data[idx] = idx * 3;      // Other threads execute this
    }
}
Best Practices:
  • Minimize branch divergence within warps
  • Organize data and thread indexing to keep threads in the same warp on the same code path
  • Ensure memory accesses within a warp are coalesced
  • Consider warp size when designing algorithms and data structures
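One way to follow the first two practices is to branch on a warp-aligned value, so every thread in a warp takes the same path; a minimal sketch (assumes blockDim.x is a multiple of 32):
__global__ void warpAlignedBranch(int *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = idx / 32;   // identical for all 32 threads of a warp
    
    // Whole warps take one branch or the other: no divergence
    if (warpId % 2 == 0) {
        data[idx] = idx * 2;
    } else {
        data[idx] = idx * 3;
    }
}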


Q: How many warps can run simultaneously inside a multiprocessor?

A : The number of warps that can run simultaneously on a Streaming Multiprocessor (SM) depends on the GPU architecture and available resources. This is determined by several factors:

Hardware Limits (varies by architecture):
  • Maximum resident warps per SM: Typically 32-64 warps depending on architecture
    • Fermi: 48 warps per SM (1536 threads)
    • Kepler/Maxwell/Pascal: 64 warps per SM (2048 threads)
    • Volta: 64 warps per SM (2048 threads); Turing: 32 warps per SM (1024 threads)
    • Ampere: 48 or 64 warps per SM (1536 or 2048 threads, depending on the chip)
  • Maximum thread blocks per SM: 16-32 depending on architecture
Resource Constraints:
The actual number of concurrent warps is limited by whichever resource runs out first:
1. Registers: Limited number of registers per SM (e.g., 65,536)
   Warps = (Total Registers) / (Registers per thread * 32)

2. Shared Memory: Limited shared memory per SM (e.g., 96 KB)
   Blocks = (Total Shared Memory) / (Shared Memory per block)
   Warps = Blocks * (Warps per block)

3. Thread Block Limit: Maximum blocks per SM

4. Warp Limit: Hard limit on concurrent warps
Example calculation:
// If a kernel uses 32 registers per thread:
// Register limit: 65536 / (32 * 32) = 64 warps ✓

// If a kernel uses 48KB shared memory per block with 8 warps per block:
// Shared memory limit: 96KB / 48KB = 2 blocks * 8 warps/block = 16 warps ✗
// Shared memory becomes the bottleneck!
Occupancy:
Occupancy is the ratio of active warps to maximum possible warps. Higher occupancy generally improves performance by hiding memory latency, but 100% occupancy isn't always necessary for optimal performance.

Use the CUDA Occupancy Calculator or cudaOccupancyMaxActiveBlocksPerMultiprocessor() to determine occupancy for your kernels.
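A short sketch of querying occupancy at runtime with the API mentioned above (myKernel and the block size are illustrative):
int blockSize = 256;
int maxBlocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM,
                                              myKernel, blockSize, 0);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

int activeWarps = maxBlocksPerSM * blockSize / 32;
int maxWarps = prop.maxThreadsPerMultiProcessor / 32;
printf("Occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);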

Q: What is the difference between a block and a thread?

A : Blocks and threads are two fundamental levels of the CUDA execution hierarchy:

Thread:
  • The smallest unit of execution in CUDA
  • Each thread executes the kernel code independently
  • Has a unique threadIdx (x, y, z) within its block
  • Has private registers and local memory
  • Can access shared memory within its block
  • Can access global memory across all blocks
Block (Thread Block):
  • A group of threads that execute together on the same SM
  • Threads within a block can cooperate via shared memory and synchronization
  • Has a unique blockIdx (x, y, z) within the grid
  • Contains up to 1024 threads (architecture dependent)
  • All threads in a block share the same shared memory space
  • Blocks execute independently and can complete in any order
Example kernel configuration:
// Define block and grid dimensions
dim3 threadsPerBlock(16, 16);  // 256 threads per block
dim3 numBlocks(32, 32);        // 1024 blocks in grid

// Launch kernel with 262,144 total threads
myKernel<<<numBlocks, threadsPerBlock>>>(...);

// Inside kernel:
__global__ void myKernel(...) {
    // Calculate global thread ID
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    
    // Thread can access:
    // - Its own registers (private)
    // - Shared memory (block-level)
    // - Global memory (device-level)
}
Key Differences:
Aspect          | Thread                    | Block
----------------|---------------------------|-------------------------------
Scope           | Individual execution unit | Group of threads
Synchronization | Within warp (implicit)    | Within block (__syncthreads())
Shared Memory   | Access via block          | Owns shared memory space
Independence    | Depends on block          | Completely independent


Q: Can threads communicate between them? What about blocks?

A : Communication between threads and blocks in CUDA has different capabilities and limitations:

Thread Communication Within a Block:
Threads within the same block CAN communicate efficiently through:

1. Shared Memory:
__global__ void communicateWithinBlock() {
    __shared__ float sharedData[256];
    
    int tid = threadIdx.x;
    
    // Thread writes to shared memory
    sharedData[tid] = tid * 2.0f;
    
    // Synchronize to ensure all writes are visible
    __syncthreads();
    
    // Thread reads data written by other threads in the block
    float neighbor = sharedData[(tid + 1) % blockDim.x];
}
2. Warp-level Primitives (within same warp):
__global__ void warpCommunication() {
    int lane = threadIdx.x % 32;
    int value = lane;
    
    // Shuffle data between threads in the same warp
    int neighbor = __shfl_down_sync(0xffffffff, value, 1);
    
    // Warp-level reduction (integer reductions require compute capability 8.0+)
    int sum = __reduce_add_sync(0xffffffff, value);
}
Block Communication:
Threads from DIFFERENT blocks CANNOT directly communicate during kernel execution because:
  • Blocks may execute on different SMs
  • Blocks may execute at different times
  • No synchronization mechanism exists between blocks during a single kernel launch
Workarounds for Inter-block Communication:

1. Global Memory (with atomic operations):
__global__ void interBlockComm(int *globalCounter) {
    // Atomic operations ensure correct updates across blocks
    atomicAdd(globalCounter, 1);
    
    // But no guarantee about execution order!
}
2. Multiple Kernel Launches:
// Kernel 1: All blocks write results to global memory
kernel1<<<grid, block>>>(data);
cudaDeviceSynchronize();  // Wait for all blocks to complete

// Kernel 2: Read and process results from all blocks
kernel2<<<grid, block>>>(data);
3. Cooperative Groups (CUDA 9+):
#include <cooperative_groups.h>

__global__ void cooperativeKernel(int *data) {
    auto grid = cooperative_groups::this_grid();
    
    // Do work...
    
    // Synchronize ALL blocks in the grid
    grid.sync();
    
    // Now all blocks have completed previous work
}

// Launch with the cooperative launch API; args is an array of
// pointers to the kernel arguments (here, the device pointer)
void *args[] = { &data };
cudaLaunchCooperativeKernel((void*)cooperativeKernel, numBlocks,
                            threadsPerBlock, args, 0, 0);
Summary:
  • Within Block: Easy communication via shared memory and __syncthreads()
  • Within Warp: Very fast communication via shuffle operations
  • Between Blocks: No direct communication; use global memory, multiple kernels, or cooperative groups






Q: How does a cache work in GPU architecture?

A : GPU caches are designed differently from CPU caches because GPUs optimize for throughput rather than latency. Understanding GPU caching is essential for performance optimization.

GPU Cache Hierarchy:

1. L1 Cache:
  • Located on each SM (Streaming Multiprocessor)
  • Caches global and local memory accesses
  • Size: Typically 16-128 KB per SM (architecture dependent)
  • Shared with shared memory in some architectures
  • Can be configured for different shared memory/L1 splits
Configuration example:
// Prefer more L1 cache
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

// Prefer more shared memory
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

// Equal split
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferEqual);
2. L2 Cache:
  • Shared by all SMs on the GPU
  • Caches global memory accesses
  • Size: Several MB (e.g., 6MB on A100)
  • Unified cache for all memory traffic
  • Persisting cache in newer architectures (Ampere+)
L2 cache persistence (Ampere+):
// Set aside part of L2 for persistent data
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 4 * 1024 * 1024);

// Mark memory region for L2 persistence
cudaStreamAttrValue stream_attribute;
stream_attribute.accessPolicyWindow.base_ptr = d_data;
stream_attribute.accessPolicyWindow.num_bytes = size;
stream_attribute.accessPolicyWindow.hitRatio = 1.0;
stream_attribute.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, 
                      &stream_attribute);
3. Texture Cache:
  • Read-only cache optimized for 2D/3D spatial locality
  • Useful for image processing and irregular access patterns
  • Can provide performance benefits for certain access patterns
4. Constant Cache:
  • Dedicated cache for constant memory
  • Extremely fast for broadcast reads (all threads read same address)
  • 64 KB constant memory space
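A small sketch of the constant-memory broadcast pattern (the coeffs array is illustrative):
// Lives in the 64 KB constant space; set from the host with
// cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(coeffs));
__constant__ float coeffs[16];

__global__ void applyCoeff(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Every thread reads the same address: a broadcast served
        // by the constant cache in a single access
        data[i] *= coeffs[0];
    }
}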
Cache Behavior Characteristics:
  • Write Policy: GPUs use write-through or write-evict policies (no write-back)
  • Coherence: No hardware cache coherence between L1 caches on different SMs
  • Line Size: Typically 128 bytes (32 floats or 16 doubles)
  • Latency Hiding: GPUs rely on thread parallelism to hide cache miss latency
Best Practices:
  • Maximize memory coalescing to utilize cache lines effectively
  • Reuse data within blocks using shared memory rather than relying on L1 cache
  • Use constant memory for read-only data accessed by all threads
  • Consider texture memory for irregular access patterns
  • Profile cache hit rates using NVIDIA Nsight Compute


Q: What is the difference between shared memory and registers?

A : Shared memory and registers are both on-chip memory types in CUDA, but they have different characteristics and use cases:

Registers:
  • Scope: Private to each thread
  • Speed: Fastest memory (1 cycle latency)
  • Size: Limited (typically 255 registers per thread max; a 64K-entry file of 32-bit registers per SM)
  • Allocation: Automatic by compiler for local variables
  • Access: No conflicts, each thread has its own registers
  • Lifetime: Duration of thread execution
Register usage example:
__global__ void registerExample() {
    // These variables are stored in registers
    int x = threadIdx.x;
    float temp = x * 2.0f;
    float result = temp + 1.0f;
    
    // Each thread has its own x, temp, result in registers
}
Shared Memory:
  • Scope: Shared among all threads in a block
  • Speed: Very fast (comparable to registers when no bank conflicts)
  • Size: Configurable (typically 48-96 KB per SM)
  • Allocation: Explicit declaration by programmer
  • Access: Can have bank conflicts if multiple threads access same bank
  • Lifetime: Duration of block execution
Shared memory usage example:
__global__ void sharedMemoryExample() {
    // Explicitly declared shared memory
    __shared__ float sharedData[256];
    
    int tid = threadIdx.x;
    
    // Each thread can read/write shared memory
    sharedData[tid] = tid * 2.0f;
    __syncthreads();
    
    // Threads can access each other's data
    float neighbor = sharedData[(tid + 1) % blockDim.x];
}
Bank Conflicts in Shared Memory:
Shared memory is organized into banks (32 banks on modern GPUs). When multiple threads in a warp access different addresses in the same bank simultaneously, a bank conflict occurs, serializing the accesses.

Avoiding bank conflicts:
// Bad: Bank conflicts
__shared__ float data[32][32];
float value = data[threadIdx.x][threadIdx.y];  // Conflict!

// Good: Add padding to avoid conflicts
__shared__ float data[32][33];  // +1 padding
float value = data[threadIdx.x][threadIdx.y];  // No conflict
Comparison Table:
Aspect     | Registers              | Shared Memory
-----------|------------------------|------------------------------
Visibility | Private to thread      | Shared within block
Speed      | Fastest                | Very fast (if no conflicts)
Size       | 64K registers per SM   | 48-96 KB per SM
Allocation | Automatic              | Explicit (__shared__)
Use Case   | Thread-local variables | Inter-thread communication
When to Use Each:
  • Registers: For thread-private frequently-accessed variables
  • Shared Memory: For data sharing and cooperative algorithms within a block
  • Excessive register usage reduces occupancy (register spilling to local memory)
  • Excessive shared memory usage also reduces occupancy


Q: Which algorithms perform better on the GPU: data-bound or compute-bound?

A : Understanding whether an algorithm is memory-bound or compute-bound is crucial for GPU optimization:

Memory-Bound (Data-Bound) Algorithms:
  • Performance limited by memory bandwidth rather than compute throughput
  • Spend most time waiting for data from memory
  • Low arithmetic intensity (few operations per byte transferred)
  • Examples: Vector addition, simple element-wise operations, memory copies
Example (memory-bound):
// Vector addition: 2 reads + 1 write = 12 bytes, 1 FLOP
// Arithmetic intensity = 1/12 ≈ 0.08 FLOP/byte (very low)
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];  // Memory-bound
    }
}
Compute-Bound Algorithms:
  • Performance limited by computational throughput
  • High arithmetic intensity (many operations per byte)
  • Keep GPU cores busy with calculations
  • Examples: Matrix multiplication, FFT, ray tracing, deep learning training
Example (compute-bound):
// Matrix multiplication: O(n³) operations, O(n²) data
// Much higher arithmetic intensity
__global__ void matMul(float *A, float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (row < n && col < n) {  // Guard against out-of-range threads
        float sum = 0.0f;
        for (int k = 0; k < n; k++) {
            sum += A[row * n + k] * B[k * n + col];  // Compute-bound
        }
        C[row * n + col] = sum;
    }
}
Which Performs Better on GPU?

Compute-bound algorithms typically see greater speedup on GPUs because:
  • GPUs have massive parallel compute capability (thousands of cores)
  • High operations-to-memory-transfer ratio maximizes GPU utilization
  • Can achieve near-theoretical peak FLOPS
  • Speedups of 10-100x over CPU are common
Memory-bound algorithms see less speedup because:
  • GPU memory bandwidth, while high, is often the bottleneck
  • Many cores remain idle waiting for data
  • Speedup limited by memory bandwidth ratio (GPU:CPU ≈ 10-20x)
  • Harder to optimize beyond coalescing and caching
Arithmetic Intensity Analysis:
Arithmetic Intensity = (Operations) / (Bytes Transferred)

High intensity (>10 FLOP/byte):  Compute-bound → GPU excels
Low intensity (<1 FLOP/byte):    Memory-bound → Limited speedup
Optimization Strategies:
For Memory-Bound:
  • Maximize memory coalescing
  • Use shared memory to reduce global memory accesses
  • Increase arithmetic intensity by fusing operations
  • Optimize memory access patterns
For Compute-Bound:
  • Maximize occupancy to hide latency
  • Use appropriate math precision (FP16 can double throughput)
  • Leverage tensor cores for matrix operations (if available)
  • Minimize divergence and control flow overhead
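As a sketch of the fusion idea listed under "For Memory-Bound" (names are illustrative): merging two element-wise passes into one kernel halves the global memory traffic and doubles arithmetic intensity:
// Unfused: kernel 1 computes y[i] = a * x[i], kernel 2 computes
// z[i] = y[i] + b - four memory operations per element in total

// Fused: one read and one write per element, same two FLOPs
__global__ void fusedScaleAdd(float *x, float *z, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        z[i] = a * x[i] + b;
    }
}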
The Roofline Model is an excellent tool for understanding whether your kernel is memory or compute-bound and identifying optimization opportunities.

Q: What are the steps to port an application to CUDA?

A : Porting an application to CUDA involves systematic analysis, implementation, and optimization. Here's a comprehensive approach:

Step 1: Profile and Identify Hotspots
  • Use profiling tools (gprof, perf, etc.) to identify compute-intensive sections
  • Focus on code that accounts for >80% of execution time
  • Look for data-parallel operations (loops without dependencies)
// Example: Identify this loop as a hotspot
for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];  // Data-parallel, good for GPU
}
Step 2: Assess Parallelism
  • Determine if the algorithm is data-parallel
  • Check for dependencies between iterations
  • Estimate potential speedup using Amdahl's Law
  • Consider data transfer overhead vs. computation time
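A quick Amdahl's Law sanity check (numbers are illustrative):
Speedup = 1 / ((1 - p) + p / s)

Example: p = 0.9 (90% of runtime is parallelizable), s = 50x on the GPU
Speedup = 1 / (0.1 + 0.9 / 50) = 1 / 0.118 ≈ 8.5x overall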
Step 3: Design Memory Management Strategy
  • Identify data that needs to be on GPU
  • Plan data transfer pattern (host-to-device, device-to-host)
  • Consider using pinned memory for faster transfers
  • Minimize transfers between CPU and GPU
// Allocate device memory
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);

// Copy data to device
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
Step 4: Implement CUDA Kernels
  • Start with a simple, naive implementation
  • Convert loops to parallel kernel launches
  • Calculate appropriate grid and block dimensions
  • Handle boundary conditions
__global__ void vectorAddKernel(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Launch kernel
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);
Step 5: Verify Correctness
  • Compare GPU results with CPU reference implementation
  • Use compute-sanitizer (formerly cuda-memcheck) for memory errors
  • Check for kernel launch errors
  • Test with different input sizes and edge cases
// Check for kernel launch errors
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    printf("Kernel launch error: %s\n", cudaGetErrorString(err));
}

// Synchronize and check for execution errors
cudaDeviceSynchronize();
Step 6: Optimize Performance
  • Profile with NVIDIA Nsight Systems and Nsight Compute
  • Optimize memory access patterns (coalescing)
  • Use shared memory for data reuse
  • Minimize divergence
  • Tune grid and block dimensions
  • Overlap computation with data transfer using streams
Step 7: Advanced Optimization
  • Consider using CUDA libraries (cuBLAS, cuFFT, Thrust)
  • Implement persistent kernel strategies
  • Use async operations and multiple streams
  • Optimize register and shared memory usage
  • Consider using unified memory for easier development
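As a taste of the library route mentioned above, a hedged Thrust sketch of vector addition (no hand-written kernel needed):
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/copy.h>

void thrustVectorAdd(const float *h_a, const float *h_b, float *h_c, int n) {
    // Constructors copy host data to the device
    thrust::device_vector<float> a(h_a, h_a + n);
    thrust::device_vector<float> b(h_b, h_b + n);
    thrust::device_vector<float> c(n);
    
    // Element-wise addition, launched as a kernel under the hood
    thrust::transform(a.begin(), a.end(), b.begin(), c.begin(),
                      thrust::plus<float>());
    
    thrust::copy(c.begin(), c.end(), h_c);  // result back to host
}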
Step 8: Iterate and Refine
  • Measure performance improvements at each stage
  • Focus on bottlenecks identified by profiling
  • Document assumptions and design decisions
  • Consider portability to other GPU architectures


Q: What is a barrier in CUDA?

A : A barrier is a synchronization point where threads must wait until all threads in a group reach that point before proceeding. In CUDA, barriers ensure memory consistency and coordinate execution among threads.

Block-Level Barrier (__syncthreads()):
The most common barrier in CUDA synchronizes all threads within a block:
__global__ void barrierExample() {
    __shared__ float sharedData[256];
    int tid = threadIdx.x;
    
    // Phase 1: Each thread writes to shared memory
    sharedData[tid] = tid * 2.0f;
    
    // Barrier: Wait for all threads to complete writes
    __syncthreads();
    
    // Phase 2: Now safe to read any location in sharedData
    float neighbor = sharedData[(tid + 1) % blockDim.x];
}
Important Rules:
  • All threads in a block must reach the same __syncthreads() call
  • Cannot place __syncthreads() in conditional code where not all threads execute it
  • Only synchronizes threads within a block (not across blocks)
Incorrect usage (causes deadlock):
__global__ void incorrectBarrier() {
    if (threadIdx.x < 128) {
        __syncthreads();  // ERROR! Only half the threads hit this
    }
}
Correct usage:
__global__ void correctBarrier() {
    __shared__ float data[256];
    
    if (threadIdx.x < 128) {
        data[threadIdx.x] = threadIdx.x;
    }
    
    // All threads hit this barrier
    __syncthreads();
    
    // Now safe to read
    float value = data[threadIdx.x];
}
Warp-Level Synchronization:
Modern CUDA provides warp-level synchronization primitives:
__global__ void warpBarrier() {
    int lane = threadIdx.x % 32;
    int value = lane;
    
    // Synchronize threads in a warp with mask
    __syncwarp(0xffffffff);  // All 32 threads in warp
    
    // Shuffle operations include implicit synchronization
    int neighbor = __shfl_down_sync(0xffffffff, value, 1);
}
Grid-Level Synchronization (Cooperative Groups):
For synchronizing across blocks (requires cooperative launch):
#include <cooperative_groups.h>

__global__ void gridBarrier() {
    auto grid = cooperative_groups::this_grid();
    
    // Do work...
    
    // Synchronize ALL threads in ALL blocks
    grid.sync();
    
    // Continue after all blocks reach this point
}
Performance Considerations:
  • Barriers cause stalls and reduce occupancy temporarily
  • Minimize the number of barriers in your kernel
  • Combine multiple synchronization points when possible
  • Consider redesigning algorithms to reduce synchronization needs
Common Use Cases:
  • Ensuring shared memory writes are visible before reads
  • Multi-phase algorithms within a kernel
  • Cooperative algorithms like parallel reduction
  • Tiled matrix operations
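A minimal sketch of the parallel-reduction use case (assumes blockDim.x is a power of two, at most 256):
__global__ void blockReduce(float *in, float *out) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();   // all loads visible before the tree starts
    
    // Each step halves the active threads; the barrier keeps
    // every level of the tree consistent
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();   // outside the if, so all threads reach it
    }
    
    if (tid == 0) {
        out[blockIdx.x] = sdata[0];   // one partial sum per block
    }
}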


Q: What is a Stream in CUDA?

A : A CUDA stream is a sequence of operations that execute in order on the GPU. Streams enable concurrent execution of kernels and memory transfers, improving overall throughput.

Key Concepts:
  • Operations in the same stream execute in order
  • Operations in different streams can execute concurrently
  • Default stream (stream 0) serializes operations
  • Useful for overlapping computation and data transfer
Creating and Using Streams:
// Create streams
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Launch operations on different streams
cudaMemcpyAsync(d_a1, h_a1, size, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_a2, h_a2, size, cudaMemcpyHostToDevice, stream2);

// Fourth launch parameter selects the stream
kernel<<<grid, block, 0, stream1>>>(d_a1, d_b1);
kernel<<<grid, block, 0, stream2>>>(d_a2, d_b2);

cudaMemcpyAsync(h_c1, d_c1, size, cudaMemcpyDeviceToHost, stream1);
cudaMemcpyAsync(h_c2, d_c2, size, cudaMemcpyDeviceToHost, stream2);

// Synchronize streams
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

// Cleanup
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
Overlapping Computation and Transfer:
const int nStreams = 4;
cudaStream_t streams[nStreams];
for (int i = 0; i < nStreams; i++) {
    cudaStreamCreate(&streams[i]);
}

int chunkSize = n / nStreams;

for (int i = 0; i < nStreams; i++) {
    int offset = i * chunkSize;
    
    // These operations can overlap across streams
    cudaMemcpyAsync(&d_data[offset], &h_data[offset], 
                   chunkSize * sizeof(float), 
                   cudaMemcpyHostToDevice, streams[i]);
                   
    processKernel<<<(chunkSize + 255) / 256, 256, 0, streams[i]>>>(
        &d_data[offset], chunkSize);
    
    // Copy the processed chunk back (in-place processing assumed)
    cudaMemcpyAsync(&h_result[offset], &d_data[offset], 
                   chunkSize * sizeof(float), 
                   cudaMemcpyDeviceToHost, streams[i]);
}

// Wait for all streams
cudaDeviceSynchronize();
Stream Priorities:
// Create high and low priority streams
int leastPriority, greatestPriority;
cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

cudaStream_t highPriorityStream;
cudaStreamCreateWithPriority(&highPriorityStream, cudaStreamNonBlocking, 
                            greatestPriority);
Benefits of Using Streams:
  • Improved GPU utilization through concurrent execution
  • Overlap data transfer with computation
  • Reduced overall execution time
  • Better resource utilization
Requirements for Concurrency:
  • Use cudaMemcpyAsync instead of cudaMemcpy
  • Use pinned (page-locked) host memory for async transfers
  • Sufficient GPU resources (multiple copy engines, enough SMs)
  • Operations must be independent
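A short sketch of the pinned-memory requirement (sizes are illustrative):
float *h_data;
// Page-locked host allocation: required for cudaMemcpyAsync to
// actually overlap with kernel execution
cudaMallocHost(&h_data, n * sizeof(float));

// ... fill h_data, issue cudaMemcpyAsync calls on streams ...

cudaFreeHost(h_data);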
Best Practices:
  • Use multiple streams to hide memory transfer latency
  • Divide work into chunks that can be processed independently
  • Profile to ensure concurrency is actually achieved
  • Balance chunk sizes to maximize overlap
  • Reuse streams rather than creating/destroying frequently
