GPU Programming

Neural Network Layer and low-level operators

A simple feedforward neural network for text classification consists of a sequence of standard layers. The input text is first mapped to vectors using an embedding layer, followed by linear layers with ReLU activations. An average pooling operation aggregates token representations into a single vector, and a softmax layer produces the final class probabilities.

At a lower level, these neural network layers are implemented using a small set of basic operations. Matrix multiplication is used in linear layers, element-wise operations handle activations, and reduction operations such as averaging are used for pooling. Efficient execution of these operations relies heavily on GPUs.
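Before looking at GPU execution, the three primitive operations can be sketched as a plain CPU reference. This is a minimal illustration; names such as matmul and mean_pool are made up for this sketch, not taken from any library:

```cpp
#include <cassert>
#include <vector>

// Naive matrix multiply: C[m x p] = A[m x n] * B[n x p], row-major (linear layers).
std::vector<float> matmul(const std::vector<float>& A, const std::vector<float>& B,
                          int m, int n, int p) {
    std::vector<float> C(m * p, 0.0f);
    for (int i = 0; i < m; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < p; ++j)
                C[i * p + j] += A[i * n + k] * B[k * p + j];
    return C;
}

// Element-wise operation: ReLU activation, applied in place.
void relu(std::vector<float>& x) {
    for (float& v : x) v = v > 0.0f ? v : 0.0f;
}

// Reduction: average `count` token vectors of length `dim` into one vector (pooling).
std::vector<float> mean_pool(const std::vector<float>& x, int count, int dim) {
    std::vector<float> out(dim, 0.0f);
    for (int i = 0; i < count; ++i)
        for (int j = 0; j < dim; ++j)
            out[j] += x[i * dim + j] / count;
    return out;
}
```

Each of these loops is embarrassingly parallel, which is exactly why GPUs execute them well.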

Components of a GPU Server

[Figure: components of a 4U GPU server]

This figure illustrates a modern 4U computing server designed for high-performance and AI workloads. It combines dual AMD EPYC CPUs with multiple NVIDIA GPUs connected via PCIe and NVLink, alongside high-bandwidth DDR5 memory and NVMe storage to support large-scale parallel computation.

GPU Architecture

[Figure: GPU architecture: an array of Streaming Multiprocessors]

Modern GPUs achieve high performance by prioritizing throughput-oriented parallelism rather than single-thread latency. Instead of a few complex cores like CPUs, a GPU consists of many Streaming Multiprocessors (SMs), each of which contains a large number of simple compute cores designed to execute the same operations in parallel.

[Figure: internal layout of a Streaming Multiprocessor (SM)]

An SM is the fundamental execution unit of a GPU. Each SM is divided into four partitions, and each partition contains 32 CUDA cores, matching the size of a warp, which is the basic scheduling and execution unit in NVIDIA GPUs. A warp consists of 32 threads that execute the same instruction in lockstep (SIMT: Single Instruction, Multiple Threads). As a result, a single SM contains 128 cores and can execute up to 128 FP32 operations per cycle under ideal conditions.
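The per-SM arithmetic above extends naturally to a chip-level peak estimate. A small sketch, with num_sms and clock_ghz left as illustrative parameters rather than values for any specific GPU:

```cpp
#include <cassert>

// Per-SM FP32 lanes: 4 partitions x 32 CUDA cores, as described above.
constexpr int kPartitions = 4;
constexpr int kCoresPerPartition = 32;
constexpr int kCoresPerSM = kPartitions * kCoresPerPartition;  // 128

// Peak FP32 GFLOP/s for a whole chip, counting a fused multiply-add as 2 FLOPs.
// num_sms and clock_ghz are illustrative parameters, not tied to a specific part.
double peak_fp32_gflops(int num_sms, double clock_ghz) {
    return num_sms * kCoresPerSM * 2.0 * clock_ghz;
}
```

For example, a hypothetical 100-SM part at 1 GHz would peak at 25.6 FP32 TFLOP/s under this model.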

Within each partition, a warp scheduler selects a ready warp every cycle, while the dispatch unit issues its instruction to the appropriate execution units (e.g., FP32 cores, Tensor Cores, or load/store units). GPUs do not schedule individual threads; instead, they schedule warps. When one warp stalls due to memory access, the scheduler can immediately switch to another ready warp with negligible overhead, allowing GPUs to effectively hide memory latency through massive concurrency.

Each SM also contains a large register file (the fastest on-chip storage) and a shared L1 cache / shared memory, which enable fast data reuse and communication between threads within a block. Newer architectures such as NVIDIA H100 significantly increase the L1/shared memory capacity and enhance Tensor Cores, further optimizing the SM for large-scale matrix and tensor computations commonly found in deep learning workloads.

[Figure: H100 FP8 support]

H100 introduces FP8 to further improve throughput and reduce memory bandwidth requirements for large-scale AI workloads. By supporting multiple FP8 formats and accumulating results in higher precision, the architecture achieves both high performance and numerical stability.

Transformer Engine is a hardware–software co-designed system in H100 that automatically manages precision, scaling, and accumulation to safely use FP8 for Transformer workloads.

CPU vs. GPU

[Figure: CPU vs. GPU architecture comparison]

Program Execution on GPU

In the CUDA programming model, the CPU acts as the host that controls execution and launches GPU kernels, while the GPU serves as a device optimized for massively parallel computation. Kernels are executed using a SIMT model, where threads are organized into blocks and grids and mapped onto multiple SMs for parallel execution.

[Figure: grid, block, and thread hierarchy in CUDA]

How Kernel Threads are Executed

When a CUDA kernel is launched, threads are first organized into thread blocks, which are then assigned to Streaming Multiprocessors (SMs). Within an SM, each thread block is further partitioned into warps; the warp is the fundamental unit the GPU uses to create, manage, schedule, and execute threads.

A warp consists of 32 threads. All threads in a warp start execution at the same program address, but each thread maintains its own program counter and register state. At any given cycle, a warp executes one common instruction across all 32 threads, following the SIMT (Single Instruction, Multiple Threads) execution model. Although threads within a warp can take different control-flow paths due to branching, divergent paths are serialized by the hardware, which can reduce performance.
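The cost of divergence can be made concrete with a toy model: a warp whose threads split across both sides of a branch pays for both paths serialized. This is a simplification of real hardware predication, and warp_branch_cycles is a made-up name for illustration:

```cpp
#include <cassert>
#include <vector>

// Toy cost model for SIMT branch divergence within one warp.
// Each thread takes either the "then" path (cost_then cycles) or the
// "else" path (cost_else cycles). A non-divergent warp pays only for the
// path it takes; a divergent warp pays for both paths, serialized.
int warp_branch_cycles(const std::vector<bool>& takes_then,
                       int cost_then, int cost_else) {
    bool any_then = false, any_else = false;
    for (bool t : takes_then) (t ? any_then : any_else) = true;
    if (any_then && any_else) return cost_then + cost_else;  // serialized
    return any_then ? cost_then : cost_else;
}
```

Even a single thread taking the other path forces the whole warp to execute both sides.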

Warp Execution Model on an SM

Once a warp is assigned to an SM, its execution context remains resident on the SM for the entire lifetime of the warp. This context includes the program counter, register values, and any shared memory used by the warp. Because all active warps keep their contexts on-chip, switching between warps does not require saving or restoring state, making warp-to-warp context switching effectively instantaneous.

At runtime, each SM contains one or more warp schedulers, which continuously select a warp with active and ready threads. The selected warp’s next instruction is then issued (dispatched) to the appropriate execution units, such as FP32 cores, Tensor Cores, or load/store units. If a warp stalls due to a long-latency operation (e.g., global memory access), the scheduler can immediately switch to another ready warp, allowing the SM to hide memory latency through concurrency.

[Figure: warp scheduling on an SM]

Although NVIDIA GPUs expose a SIMT programming model, warp execution is implemented using SIMD-style vector execution with predication and masking at the hardware level.

CPU-GPU Data Movement

[Figure: CPU–GPU data movement]

CUDA Kernel

A CUDA kernel is a function that runs on the GPU. From a programming perspective, a kernel describes the execution logic for a single thread, and the code itself is written in a serial manner. Parallelism does not come from the code structure, but from the fact that the kernel is executed simultaneously by a large number of threads.

When a kernel is launched, thousands of threads may execute the same kernel function concurrently. Each thread uses its thread index (e.g., threadIdx, blockIdx) to determine which portion of the data it operates on. This follows the SPMD (Single Program, Multiple Data) model: the same program is replicated across many threads, each working on different data elements.

Key characteristics of CUDA kernels:

  • The kernel code is serial per thread
  • Massive parallelism is achieved by many threads executing the same kernel
  • Thread indices are used to map threads to data
  • The programmer expresses what one thread does, not how threads are scheduled
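The SPMD pattern can be emulated on the CPU by looping over block and thread indices explicitly; the loop body below is what one CUDA thread would execute. vec_add_emulated is an illustrative name, not a CUDA API:

```cpp
#include <cassert>
#include <vector>

// CPU emulation of the SPMD pattern: the "kernel body" below is what a
// single CUDA thread would run; the nested loops stand in for the grid.
void vec_add_emulated(const std::vector<float>& a, const std::vector<float>& b,
                      std::vector<float>& c, int n,
                      int grid_dim, int block_dim) {
    for (int blockIdx = 0; blockIdx < grid_dim; ++blockIdx) {
        for (int threadIdx = 0; threadIdx < block_dim; ++threadIdx) {
            // --- per-thread kernel body ---
            int i = blockIdx * block_dim + threadIdx;  // unique global index
            if (i < n) c[i] = a[i] + b[i];             // bounds guard
        }
    }
}
```

On a GPU the two loops disappear: every iteration becomes a hardware thread, and only the body is written by the programmer.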

Compiling CUDA Code

CUDA programs follow a heterogeneous programming model, where a single source file contains both host (CPU) code and device (GPU) code. The CUDA compiler, NVCC, acts as a compiler driver that separates these two parts and compiles them differently.

[Figure: NVCC compilation flow]

During compilation:

  • Host code is passed to a standard C/C++ compiler (e.g., gcc or clang) and compiled into CPU-executable code.
  • Device code (CUDA kernels) is compiled into PTX, an intermediate representation for NVIDIA GPUs.

At runtime, the PTX code is just-in-time (JIT) compiled by the NVIDIA driver into machine code specific to the target GPU architecture. This design allows the same CUDA binary to run on different GPU generations without recompiling the source code.


Basic GPU CUDA operations

In CUDA, all GPU operations are orchestrated by the CPU. The host explicitly allocates device memory using cudaMalloc, transfers data between host and device with cudaMemcpy, launches GPU kernels, and finally releases device memory with cudaFree. This explicit memory management model reflects CUDA’s heterogeneous design, where the GPU acts as a compute accelerator controlled by the CPU.

[Figure: basic CUDA host-side workflow (cudaMalloc, cudaMemcpy, kernel launch, cudaFree)]

GPU Memory

CUDA provides a hierarchical memory model. Each thread has private registers, which are the fastest memory to access.

Threads within the same thread block can communicate through shared memory, which is explicitly declared with __shared__ and has much lower latency than global memory.

In contrast, global memory is accessible by all threads across all blocks and persists across kernel launches, but it has the highest access latency.

Data Movement

Data movement between the CPU and GPU is explicitly managed by the programmer using cudaMemcpy. Memory can be copied from host to device or from device to host, with the transfer size specified in bytes. Since CPU–GPU data transfers are relatively expensive, performance-critical applications aim to minimize such transfers and maximize data reuse on the GPU.
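A lower bound on transfer time is simply bytes divided by link bandwidth; this back-of-the-envelope check is often enough to see whether a transfer will dominate a kernel's runtime. The bandwidth is left as a parameter, since any concrete value is an assumption about the interconnect, not a property of CUDA:

```cpp
#include <cassert>

// Rough lower bound on CPU-GPU transfer time: bytes / bandwidth.
// Real transfers also pay per-call latency, so this is optimistic.
double transfer_seconds(double bytes, double bandwidth_bytes_per_sec) {
    return bytes / bandwidth_bytes_per_sec;
}
```

For instance, moving 32 GB over a link with an assumed 32 GB/s of bandwidth costs at least a full second, which dwarfs most kernel runtimes and motivates keeping data resident on the GPU.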

[Figure: host–device memory copies with cudaMemcpy]

On each Streaming Multiprocessor (SM), instructions are issued to execution units through a limited number of issue slots per cycle. An issue slot represents the hardware capacity to dispatch an instruction from a warp in a given cycle, and thus directly limits instruction throughput on the SM.

In pre-Hopper architectures (e.g., A100), warps waiting for memory operations—such as asynchronous copies into shared memory—still participate in scheduling. Even when a warp is stalled on a barrier, it repeatedly checks readiness, occupying issue slots and reducing overall SM utilization.

Hopper introduces a key change in this execution model. With the Tensor Memory Accelerator (TMA), memory transfers are offloaded to a dedicated hardware unit, and warps waiting on copy completion are put into a sleep state. Sleeping warps are removed from the scheduler’s active set and no longer consume issue slots. As a result, issue slots are exclusively used by warps that are ready to execute computation, enabling higher instruction throughput and more effective overlap between memory movement and compute.

[Figure: issue-slot usage without and with TMA]

Declaration of Host/Device function

Keyword      | Called From | Executes On  | Used For
------------ | ----------- | ------------ | ---------------------
__global__   | host (CPU)  | device (GPU) | GPU kernels
__device__   | device      | device      | GPU helper functions
__host__     | host        | host        | normal CPU functions

CUDA distinguishes host and device functions using explicit qualifiers. A __global__ function defines a GPU kernel, which is launched from the CPU but executed in parallel by many threads on the GPU. Each thread runs the same kernel code and uses its thread and block indices to determine which portion of the data it processes.

Calling Kernel at Runtime

In CUDA, a kernel’s parallel execution configuration is specified at runtime by the host (CPU) rather than being fixed at compile time. When launching a kernel, the host explicitly defines the grid–block–thread hierarchy, which determines how many GPU threads will execute the kernel.

A kernel is launched using the special syntax:

kernelFunc<<<Dg, Db>>>(args);

Here, Dg (grid dimension) and Db (block dimension) can be specified either as integers or as dim3 objects:

dim3 Dg(4, 2, 1);
dim3 Db(8, 8, 1);
kernelFunc<<<Dg, Db>>>(args);
  • Dg defines the size of the grid, i.e., the number of thread blocks.
  • Db defines the size of each block, i.e., the number of threads per block.

The total number of blocks is Dg.x * Dg.y * Dg.z, and the total number of threads per block is Db.x * Db.y * Db.z, which must not exceed 1024 on current NVIDIA GPUs.
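This launch-configuration arithmetic can be checked with a few lines of ordinary C++ (Dim3 here is a stand-in struct for illustration, not CUDA's dim3):

```cpp
#include <cassert>

// Minimal stand-in for CUDA's dim3, just for the launch arithmetic.
struct Dim3 { int x, y, z; };

int total_blocks(Dim3 Dg) { return Dg.x * Dg.y * Dg.z; }
int threads_per_block(Dim3 Db) { return Db.x * Db.y * Db.z; }

// A launch configuration is valid only if the block stays within the
// 1024-threads-per-block limit of current NVIDIA GPUs.
bool launch_ok(Dim3 Db) { return threads_per_block(Db) <= 1024; }
```

For the example above, Dg(4, 2, 1) and Db(8, 8, 1) give 8 blocks of 64 threads each, well under the 1024-thread limit.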

Device Runtime Variables

Once a kernel is launched, each thread executing on the GPU must be able to identify which block it belongs to and which thread it is within that block. CUDA provides a set of built-in device-side variables for this purpose, which are automatically generated by the compiler.

These variables include:

  • gridDim (dim3): dimensions of the grid
  • blockIdx (uint3): index of the current block within the grid
  • blockDim (dim3): dimensions of a block
  • threadIdx (uint3): index of the current thread within the block

All kernel threads execute the same code, but by using these variables, each thread can compute a unique global index. A common pattern is:

int i = blockIdx.x * blockDim.x + threadIdx.x;

This allows threads to map naturally to elements of arrays or other data structures stored in global memory.

Calling a CUDA Kernel from the CPU

In practice, the host typically computes the grid and block sizes based on the input problem size. For example, when processing a vector of length n, a common approach is to choose a fixed number of threads per block and compute the number of required blocks using integer division with rounding up:

int n = 1024;
int threads_per_block = 256;
int num_blocks = (n + threads_per_block - 1) / threads_per_block;

VecAddKernel<<<num_blocks, threads_per_block>>>(dA, dB, dC, n);

When this launch occurs:

  1. The CPU issues the kernel launch with the specified grid and block configuration.
  2. The GPU creates the grid of blocks and schedules them onto available Streaming Multiprocessors (SMs).
  3. Each block is further divided into warps, and threads execute the kernel code in parallel.
  4. Each thread determines which data element it operates on using blockIdx, blockDim, and threadIdx.

GPU Acceleration

Memory Access Efficiency

[Figure: memory access efficiency]

CUDA Device Memory Model

[Figure: CUDA device memory model]

Tiling for Matrix Multiplication

[Figure: tiling for matrix multiplication, step by step]

An implementation in CUDA C++:

__global__ void MatrixMultiplyKernel(
    float *out,
    const int *out_shape,
    const int *out_strides,
    float *a_storage,
    const int *a_shape,
    const int *a_strides,
    float *b_storage,
    const int *b_shape,
    const int *b_strides)
{
    /**
     * Multiply two (compact) matrices into an output (also compact) matrix. Matrices a and b are
     * both in batch format, with shapes [batch_size, m, n] and [batch_size, n, p].
     * Requirements:
     *   - All data must first be moved to shared memory.
     *   - Only read each cell in a and b once.
     *   - Only write to global memory once per thread.
     * It is guaranteed that a_shape[0] == b_shape[0], a_shape[2] == b_shape[1],
     * and out_shape[0] == a_shape[0], out_shape[1] == a_shape[1], out_shape[2] == b_shape[2].
     *
     * Args:
     *   out: compact 1D array of size batch_size x m x p to write the output to
     *   out_shape: shape of the output array
     *   out_strides: strides of the output array
     *   a_storage: compact 1D array of size batch_size x m x n
     *   a_shape: shape of the a array
     *   a_strides: strides of the a array
     *   b_storage: compact 1D array of size batch_size x n x p
     *   b_shape: shape of the b array
     *   b_strides: strides of the b array
     *
     * Returns:
     *   None (fills in the out array)
     */

    __shared__ float a_shared[TILE][TILE];
    __shared__ float b_shared[TILE][TILE];

    // Each block computes one tile of the output matrix for one batch element;
    // all threads in the block cooperate on that tile.
    int batch = blockIdx.z;
    int a_batch_stride = a_shape[0] > 1 ? a_strides[0] : 0;
    int b_batch_stride = b_shape[0] > 1 ? b_strides[0] : 0;

    /// BEGIN HW1_4
    // 1. Compute the row and column of the output matrix this thread will compute
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    // Matrix dimensions: A[batch, m, n] x B[batch, n, p] = C[batch, m, p]
    int m = a_shape[1];
    int n = a_shape[2];
    int p = b_shape[2];

    // 2. The position in the output array is computed in step 7, together with
    //    the bounds check.

    // Accumulator for the dot product
    float sum = 0.0f;

    // 3. Iterate over tiles of the two input matrices, reading the data into shared memory
    int num_tiles = (n + TILE - 1) / TILE;

    for (int t = 0; t < num_tiles; t++) {
        // Load A tile: A[row, t*TILE + threadIdx.x]
        int a_col = t * TILE + threadIdx.x;
        if (row < m && a_col < n) {
            a_shared[threadIdx.y][threadIdx.x] =
                a_storage[batch * a_batch_stride + row * a_strides[1] + a_col * a_strides[2]];
        } else {
            a_shared[threadIdx.y][threadIdx.x] = 0.0f;
        }

        // Load B tile: B[t*TILE + threadIdx.y, col]
        int b_row = t * TILE + threadIdx.y;
        if (b_row < n && col < p) {
            b_shared[threadIdx.y][threadIdx.x] =
                b_storage[batch * b_batch_stride + b_row * b_strides[1] + col * b_strides[2]];
        } else {
            b_shared[threadIdx.y][threadIdx.x] = 0.0f;
        }

        // 4. Synchronize to make sure the tiles are visible to all threads
        __syncthreads();

        // 5. Accumulate the partial dot product contributed by this tile
        for (int k = 0; k < TILE; k++) {
            sum += a_shared[threadIdx.y][k] * b_shared[k][threadIdx.x];
        }

        // 6. Synchronize so no thread overwrites the tiles before all threads are done
        __syncthreads();
    }

    // 7. Write the output to global memory (once per thread)
    if (row < m && col < p) {
        int out_pos = batch * out_strides[0] + row * out_strides[1] + col * out_strides[2];
        out[out_pos] = sum;
    }
    /// END HW1_4
}
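A CPU reference that mirrors the same tile-by-tile loop structure is handy for sanity-checking the kernel's output on small inputs. This is a plain C++ sketch with its own small TILE constant, independent of the kernel above:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int TILE = 4;  // small tile for illustration only

// CPU version of the same tiling scheme: walk the inner dimension in
// TILE-wide slabs, mirroring the kernel's shared-memory loop structure.
// A is m x n, B is n x p, both row-major compact; returns C = A * B.
std::vector<float> tiled_matmul(const std::vector<float>& A,
                                const std::vector<float>& B,
                                int m, int n, int p) {
    std::vector<float> C(m * p, 0.0f);
    int num_tiles = (n + TILE - 1) / TILE;  // same ceil-division as the kernel
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < p; ++j) {
            float sum = 0.0f;
            for (int t = 0; t < num_tiles; ++t)
                for (int k = t * TILE; k < std::min(n, (t + 1) * TILE); ++k)
                    sum += A[i * n + k] * B[k * p + j];
            C[i * p + j] = sum;
        }
    }
    return C;
}
```

The tile loop changes only the traversal order, not the result, which is exactly why the GPU kernel can stage tiles in shared memory without altering the mathematics.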

Matrix Transpose on GPU: Coalescing, Shared Memory, and Bank Conflicts

A Coalesced Transpose Kernel

__global__ void smem_cuda_transpose(int m, float* a, float* c) {
    __shared__ float smem[BLOCK_X][BLOCK_Y];

    int tileCol = blockDim.x * blockIdx.x;
    int tileRow = blockDim.y * blockIdx.y;

    // Load from global memory → shared memory
    smem[threadIdx.x][threadIdx.y] =
        a[(tileRow + threadIdx.y) * m + (tileCol + threadIdx.x)];

    __syncthreads();

    // Store from shared memory → global memory
    c[(tileCol + threadIdx.y) * m + (tileRow + threadIdx.x)] =
        smem[threadIdx.y][threadIdx.x];
}

What this fixes:

  • Global memory reads are coalesced
  • Global memory writes are coalesced

But shared memory bank conflicts appear.

Shared Memory Banks: The Real Bottleneck

  • Shared memory is divided into 32 banks
  • Each bank serves one 32-bit word per cycle
  • A warp can access shared memory in 1 cycle only if:
    • Each thread accesses a different bank

The write smem[threadIdx.x][threadIdx.y] is what causes the bank conflicts: consecutive threads in a warp (consecutive threadIdx.x) access words that are BLOCK_Y apart in shared memory, so when BLOCK_Y is a multiple of 32 they all map to the same bank and the accesses serialize.
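The bank mapping can be computed directly: a 32-bit word at linear word offset w lives in bank w % 32. The helper below (bank_of is an illustrative name) shows that without padding every element of a tile column maps to the same bank, while a one-element pad spreads them across all 32 banks:

```cpp
#include <cassert>

// Shared memory bank of the 32-bit word at (row, col) in a tile declared
// as float smem[DIM][DIM + pad]. Banks interleave 4-byte words, 32 banks.
int bank_of(int row, int col, int dim, int pad) {
    return (row * (dim + pad) + col) % 32;
}
```

With dim = 32 and pad = 0, bank_of(row, 0, 32, 0) is 0 for every row, i.e. a 32-way conflict; with pad = 1 each row of the column lands in a distinct bank.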

[Figure: shared memory bank conflicts in the transpose kernel]

The Fix: Padding Shared Memory

__global__ void smem_cuda_transpose(int m, float* a, float* c) {
    __shared__ float smem[BLOCK_X][BLOCK_Y + 1];  // +1 column of padding

    int tileCol = blockDim.x * blockIdx.x;
    int tileRow = blockDim.y * blockIdx.y;

    smem[threadIdx.x][threadIdx.y] =
        a[(tileRow + threadIdx.y) * m + (tileCol + threadIdx.x)];

    __syncthreads();

    c[(tileCol + threadIdx.y) * m + (tileRow + threadIdx.x)] =
        smem[threadIdx.y][threadIdx.x];
}

[Figure: conflict-free access after padding]

Sparse Matrix Multiplication

[Figure: CSR sparse matrix representation]

__global__ void SpMVCSRKernel(
    float* data,
    int* col_index,
    int* row_ptr,
    float* x,
    float* y,
    int num_rows
) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < num_rows) {
        float dot = 0.0f;

        int row_start = row_ptr[row];
        int row_end = row_ptr[row + 1];

        for (int elem = row_start; elem < row_end; ++elem) {
            dot += x[col_index[elem]] * data[elem];
        }

        y[row] += dot;
    }
}
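The same row-per-thread logic, written as a sequential CPU reference, is useful for validating the kernel on small matrices (spmv_csr is an illustrative name for this sketch):

```cpp
#include <cassert>
#include <vector>

// CPU reference for CSR sparse matrix-vector multiply: y += A * x.
// row_ptr has num_rows + 1 entries; row r owns elements
// [row_ptr[r], row_ptr[r+1]) of data/col_index.
void spmv_csr(const std::vector<float>& data,
              const std::vector<int>& col_index,
              const std::vector<int>& row_ptr,
              const std::vector<float>& x,
              std::vector<float>& y) {
    int num_rows = static_cast<int>(row_ptr.size()) - 1;
    for (int row = 0; row < num_rows; ++row) {
        float dot = 0.0f;
        for (int e = row_ptr[row]; e < row_ptr[row + 1]; ++e)
            dot += x[col_index[e]] * data[e];
        y[row] += dot;
    }
}
```

The GPU kernel simply replaces the outer row loop with one thread per row.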

cuBLAS

cuBLAS stands for CUDA Basic Linear Algebra Subroutines.

It is NVIDIA’s highly optimized GPU implementation of the classic BLAS interface, targeting operations such as:

  • Vector operations (Level-1 BLAS)
  • Matrix-vector operations (Level-2 BLAS)
  • Matrix-matrix operations (Level-3 BLAS, especially GEMM)

Programming Model

cuBLAS uses a handle-based API:

cublasHandle_t handle;
cublasCreate(&handle);

// cuBLAS calls

cublasDestroy(handle);

All cuBLAS operations require a valid cublasHandle_t, which stores execution context such as the CUDA stream and math mode.

Level-1 BLAS: Vector Dot Product

Compute the dot product of two vectors: \[ \text{result} = x^T y \]

cublasStatus_t cublasSdot(
    cublasHandle_t handle,
    int n,
    const float* x, int incx,
    const float* y, int incy,
    float* result
);

Key points:

  • n is the number of elements
  • incx, incy specify the stride between elements
  • x and y must reside in device memory
  • result is written to host memory by default (the cuBLAS pointer mode setting can direct it to device memory instead)

This is a basic building block for norms, projections, and reductions.
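On the CPU, the computation cublasSdot performs reduces to a strided loop (sdot_ref is an illustrative reference, not part of cuBLAS):

```cpp
#include <cassert>

// What cublasSdot computes: sum over i of x[i * incx] * y[i * incy].
// The strides let the same routine walk rows or columns of a matrix.
float sdot_ref(int n, const float* x, int incx, const float* y, int incy) {
    float result = 0.0f;
    for (int i = 0; i < n; ++i)
        result += x[i * incx] * y[i * incy];
    return result;
}
```

For example, incx = 2 reads every other element of x, which is how BLAS dots a row of a column-major matrix against a vector.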


Level-2 BLAS: Matrix–Vector Multiplication

Compute: \[ y = \alpha A x + \beta y \]

cublasStatus_t cublasSgemv(
    cublasHandle_t handle,
    cublasOperation_t trans,
    int m, int n,
    const float* alpha,
    const float* A, int lda,
    const float* x, int incx,
    const float* beta,
    float* y, int incy
);

Key points:

  • A is an \(m \times n\) matrix stored in column-major order
  • trans controls whether A is transposed
  • lda is the leading dimension of A
  • All vectors and matrices are in device memory

This operation is common in iterative solvers and neural network layers.
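The column-major convention is the main stumbling block here; a CPU reference for the trans = N case makes the indexing explicit (sgemv_ref is illustrative, not part of cuBLAS):

```cpp
#include <cassert>

// CPU reference for cublasSgemv with trans = N: y = alpha * A * x + beta * y.
// A is m x n, column-major, so element (i, j) lives at A[i + j * lda].
void sgemv_ref(int m, int n, float alpha, const float* A, int lda,
               const float* x, float beta, float* y) {
    for (int i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j)
            acc += A[i + j * lda] * x[j];
        y[i] = alpha * acc + beta * y[i];
    }
}
```

Note how lda, the leading dimension, is the stride between columns; it can exceed m when A is a sub-matrix of a larger array.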


Level-3 BLAS: Matrix–Matrix Multiplication (GEMM)

Compute: \[ C = \alpha A B + \beta C \]

cublasStatus_t cublasSgemm(
    cublasHandle_t handle,
    cublasOperation_t transA,
    cublasOperation_t transB,
    int m, int n, int k,
    const float* alpha,
    const float* A, int lda,
    const float* B, int ldb,
    const float* beta,
    float* C, int ldc
);

Key points:

  • This is the most important cuBLAS API
  • A, B, C are all stored in column-major layout
  • m, n, k define matrix shapes
  • Internally uses tiling, shared memory, and Tensor Cores (when available)

GEMM is the performance backbone of deep learning and scientific computing.
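Semantically, the computation cublasSgemm performs (in the transA = transB = N case) is the triple loop below; the real library replaces it with tiled, shared-memory, Tensor-Core-accelerated code. sgemm_ref is an illustrative name for this sketch:

```cpp
#include <cassert>

// CPU reference for cublasSgemm with transA = transB = N:
// C = alpha * A * B + beta * C, all matrices column-major.
// A is m x k, B is k x n, C is m x n; element (i, j) sits at ptr[i + j * ld].
void sgemm_ref(int m, int n, int k, float alpha,
               const float* A, int lda, const float* B, int ldb,
               float beta, float* C, int ldc) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int l = 0; l < k; ++l)
                acc += A[i + l * lda] * B[l + j * ldb];
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```

Everything in this reference is O(mnk) arithmetic on O(mn + mk + kn) data, which is the high arithmetic intensity that makes GEMM such a good fit for the tiling and Tensor Core techniques described earlier.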