Sunday, May 17, 2026

Working with GPUs - Part6 - H100 SXM5 architecture

Now that we have a foundational understanding of the core utilities used to monitor and manage GPUs, let's dive into the hardware architecture of the NVIDIA H100 SXM5. To truly understand GPU computing, it helps to visualize how data flows through the silicon. The following overview maps out the internal components of the H100, giving you a frame of reference for key architectural terms such as Streaming Multiprocessors (SMs), Tensor Cores, L2 cache, and High Bandwidth Memory (HBM3).

Overview

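  • The H100 was announced in March 2022 and entered full production around Sep 2022
  • Based on the Hopper architecture
  • It comes in two main form factors
    • PCIe (300-350 W, configurable)
    • SXM5 (up to 700 W)
  • The SXM5 variant has 80 GB of HBM3 memory (3.35 TB/s)
  • 132 SMs on the SXM5 variant
  • 528 Tensor cores (4 per SM)
  • 80B transistors on TSMC's custom 4N process node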

Architecture


Image ref: NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog

HBM3 - High Bandwidth Memory

  • This is the 80 GB device memory. It sits off the GPU die, on the same package.
  • The H100 SXM5 has 5 HBM3 stacks, connected via 10 independent 512-bit memory controllers (a 5120-bit bus in total).
  • Data flow: SM - L1 - L2 - memory partition/crossbar - memory controller - HBM stack
  • Each HBM3 stack is made of DRAM.
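
As a quick sanity check on the numbers above, the sketch below derives the total bus width from the controller count and then works backwards from the 3.35 TB/s figure to the per-pin data rate it implies. The function names are made up for this post, and the calculation assumes 3.35 TB/s is a decimal (10^12 bytes) peak figure.

```python
# Figures quoted in this post for the H100 SXM5 memory subsystem.
NUM_CONTROLLERS = 10             # independent memory controllers
CONTROLLER_WIDTH_BITS = 512      # width of each controller
PEAK_BANDWIDTH_BYTES = 3.35e12   # 3.35 TB/s, assumed to be decimal terabytes


def bus_width_bits(controllers: int, width_bits: int) -> int:
    """Total memory bus width across all controllers."""
    return controllers * width_bits


def per_pin_rate_gbit(bandwidth_bytes_per_s: float, bus_bits: int) -> float:
    """Data rate each bus pin must sustain to reach the given bandwidth."""
    return bandwidth_bytes_per_s * 8 / bus_bits / 1e9


bus = bus_width_bits(NUM_CONTROLLERS, CONTROLLER_WIDTH_BITS)
rate = per_pin_rate_gbit(PEAK_BANDWIDTH_BYTES, bus)

print(f"bus width: {bus} bits")
print(f"implied per-pin data rate: {rate:.2f} Gbit/s")
```

The very wide bus is the point of HBM: each pin runs at a modest rate, and the bandwidth comes from having thousands of them.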

L2 cache

  • 50 MB of L2 cache, divided into two 25 MB partitions.
  • L2 cache is SRAM.

Unified shared memory + L1 cache

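  • 256 KB per SM, split between L1 cache and shared memory.
  • Shared memory is divided into 32 banks, each 32 bits (4 bytes) wide, so one SM can serve 128 bytes per clock. Summed over all SMs, this works out to roughly 33 TB/s of bandwidth.
  • Both are built from on-chip SRAM.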
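
To make the bank layout concrete, here is a small sketch of how a byte address maps to one of the 32 banks, and how many distinct words in the same bank a warp of 32 threads touches for a few access patterns. It is a simplified model: the helper names are invented for illustration, and the 1.98 GHz boost clock used for the bandwidth estimate is an assumption, not a figure from this post.

```python
NUM_BANKS = 32
BANK_WIDTH_BYTES = 4
NUM_SMS = 132
ASSUMED_BOOST_CLOCK_HZ = 1.98e9  # assumption, used only for the estimate below


def bank_of(byte_address: int) -> int:
    """Successive 4-byte words map to successive banks, wrapping at 32."""
    return (byte_address // BANK_WIDTH_BYTES) % NUM_BANKS


def conflict_degree(byte_addresses) -> int:
    """Largest number of distinct words that fall into one bank.

    1 means conflict-free; N means the access is serialized N ways.
    Threads reading the same word count once (broadcast).
    """
    words_per_bank = {}
    for addr in byte_addresses:
        word = addr // BANK_WIDTH_BYTES
        words_per_bank.setdefault(bank_of(addr), set()).add(word)
    return max(len(words) for words in words_per_bank.values())


warp = range(32)
consecutive = [4 * t for t in warp]      # thread t reads float t
every_other = [8 * t for t in warp]      # thread t reads float 2*t
column_walk = [128 * t for t in warp]    # thread t reads float 32*t

print(conflict_degree(consecutive), conflict_degree(every_other),
      conflict_degree(column_walk))

# 128 bytes per clock per SM, summed over all SMs at the assumed clock.
aggregate_bytes_per_s = (NUM_BANKS * BANK_WIDTH_BYTES
                         * NUM_SMS * ASSUMED_BOOST_CLOCK_HZ)
print(f"aggregate shared memory bandwidth: {aggregate_bytes_per_s / 1e12:.1f} TB/s")
```

The column walk is the classic worst case: every thread lands in bank 0, so the 32 reads are served one after another.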

Registers

  • Every thread gets a private set of on-chip registers.
  • They have the highest bandwidth and lowest latency of any storage on the GPU.
  • The register file is 256 KB per SM, which is 64K 32-bit registers.
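
Because the register file is shared by all threads resident on an SM, the number of registers each thread uses limits how many threads can run at once. The sketch below estimates that limit. It is deliberately simplified: it assumes a cap of 2048 resident threads per SM and ignores the hardware's register allocation granularity, and the function names are hypothetical.

```python
REGISTER_FILE_BYTES = 256 * 1024
REGISTER_BYTES = 4            # one 32-bit register
MAX_THREADS_PER_SM = 2048     # assumption: resident-thread limit per SM
WARP_SIZE = 32


def registers_per_sm() -> int:
    """Number of 32-bit registers in one SM's register file."""
    return REGISTER_FILE_BYTES // REGISTER_BYTES


def max_resident_threads(regs_per_thread: int) -> int:
    """Threads that fit in the register file, rounded down to whole warps."""
    by_registers = registers_per_sm() // regs_per_thread
    by_registers -= by_registers % WARP_SIZE
    return min(by_registers, MAX_THREADS_PER_SM)


for regs in (32, 64, 128, 255):
    threads = max_resident_threads(regs)
    print(f"{regs:>3} registers/thread -> {threads} resident threads")
```

This is why register-heavy kernels can end up with low occupancy even when the SM has spare compute capacity.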

Gigathread engine

  • This is the hardware that takes a kernel launch and hands out thread blocks to SMs.
  • It tracks which thread blocks are not yet started, running, and finished.
  • When an SM has capacity to run another thread block, the Gigathread engine assigns the next thread block to that SM.
  • Handing out thread blocks on demand keeps the SMs busy even when blocks take different amounts of time to finish.
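
The behaviour described above can be mimicked with a toy simulation: blocks are handed out in launch order, each one going to whichever SM slot frees up first. This is only an illustrative model with made-up names (`simulate_dispatch`, `slots_per_sm`). The real scheduling policy is implemented in hardware and is not described in this post.

```python
import heapq


def simulate_dispatch(block_durations, num_sms, slots_per_sm=1):
    """Assign thread blocks to SM slots as the slots become free.

    Returns (assignment, makespan), where assignment[i] is the SM index
    that ran block i and makespan is the time the last block finishes.
    """
    # Each heap entry is (time the slot becomes free, SM index).
    free_slots = [(0.0, sm) for sm in range(num_sms) for _ in range(slots_per_sm)]
    heapq.heapify(free_slots)

    assignment = []
    makespan = 0.0
    for duration in block_durations:      # blocks are handed out in launch order
        free_at, sm = heapq.heappop(free_slots)   # earliest available slot
        finish = free_at + duration
        assignment.append(sm)
        makespan = max(makespan, finish)
        heapq.heappush(free_slots, (finish, sm))
    return assignment, makespan


# One long block and four short ones on two SMs.
assignment, makespan = simulate_dispatch([4, 1, 1, 1, 1], num_sms=2)
print(assignment, makespan)
```

In this run SM 0 is tied up with the long block while SM 1 works through all four short ones, so both finish at the same time instead of one sitting idle.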

SM - Streaming Multiprocessor

  • SMs are the fundamental execution units of the GPU. Each SM executes the thread blocks of a CUDA kernel that are assigned to it.
  • H100 SXM5 GPU has 132 SMs.
  • An SM contains the following components:
    • FP32 CUDA cores, INT32 and FP64 units
    • 4th gen Tensor cores
    • Shared memory/ L1 cache
    • L1 instruction cache
    • Warp scheduler
    • Dispatch units
    • Registers
    • L0 instruction cache
  • Each SM is divided into 4 identical sub-divisions called Quadrants or SMSPs (SM Sub Partitions).

TMA - Tensor Memory Accelerator

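  • Each SM has a TMA unit.
  • It offloads bulk tensor copies between global memory and shared memory from the SM's threads. A single thread issues the copy, and the TMA handles address generation and data movement asynchronously while the other threads keep computing.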

Tensor core

  • They are specialized units for MMA operations (Matrix Multiply Accumulate, D = A x B + C), with much higher throughput than the regular CUDA cores.
  • 4 tensor cores per SM, one in each SMSP.
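
For readers who have not seen an MMA written out, the sketch below computes D = A x B + C on small matrices in plain Python. It only shows the arithmetic that a tensor core performs on a tile. It says nothing about tile shapes, data types, or how the hardware actually carries it out, and the function name `mma` is just a label chosen here.

```python
def mma(a, b, c):
    """Return D = A x B + C for matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    d = [row[:] for row in c]          # start from the accumulator C
    for i in range(rows):
        for j in range(cols):
            acc = c[i][j]
            for k in range(inner):     # multiply-accumulate along the inner dimension
                acc += a[i][k] * b[k][j]
            d[i][j] = acc
    return d


A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
C = [[1, 0],
     [0, 1]]

D = mma(A, B, C)
print(D)
```

Large matrix multiplications are broken into many such tiles, with the accumulator carried from one tile product to the next.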

GPC - Graphics Processing Cluster

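  • It is a group of up to 18 SMs (9 TPCs) on the full GH100 die.
  • There are 8 GPCs in an H100. The H100 SXM5 enables 132 of the 144 possible SMs, so not every GPC has all 18 SMs active.
  • The L2 cache is split into two partitions, and each GPC is directly connected to one of them.
  • A thread block cluster is scheduled onto SMs within a single GPC, which is what enables distributed shared memory between those SMs.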

TPC - Texture Processing Cluster

  • A single TPC holds 2 SMs.
  • The TPC is the grouping level between the SM and the GPC. Its two SMs are physically adjacent and share some TPC-level hardware.
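
The GPC, TPC and SM counts fit together with some simple arithmetic, sketched below. The figure of 9 TPCs per GPC is for the full GH100 die as described in the NVIDIA Hopper blog referenced above. It is not stated elsewhere in this post, so treat it as an assumption of this example.

```python
GPCS = 8
TPCS_PER_GPC_FULL_DIE = 9     # assumption: full GH100 die layout
SMS_PER_TPC = 2
ENABLED_SMS = 132             # H100 SXM5
TENSOR_CORES_PER_SM = 4

full_die_sms = GPCS * TPCS_PER_GPC_FULL_DIE * SMS_PER_TPC
enabled_tpcs = ENABLED_SMS // SMS_PER_TPC
disabled_sms = full_die_sms - ENABLED_SMS
total_tensor_cores = ENABLED_SMS * TENSOR_CORES_PER_SM

print(f"SMs on the full die:       {full_die_sms}")
print(f"SMs enabled on H100 SXM5:  {ENABLED_SMS} ({enabled_tpcs} TPCs)")
print(f"SMs disabled:              {disabled_sms}")
print(f"Tensor cores:              {total_tensor_cores}")
```

Shipping with some SMs disabled is a common way to improve manufacturing yield, which is why 8 GPCs of 18 SMs do not add up to the 132 SMs you see on an H100 SXM5.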

NVLink 

  • 4th gen NVLink.
  • 18 NVLink 4.0 links, which together give 900 GB/s of total bidirectional GPU-GPU bandwidth.
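
The headline NVLink number can be broken down per link and per direction, as in the sketch below. It assumes the 900 GB/s figure is the bidirectional total, split evenly between the two directions. The 9 GB payload is a hypothetical example, and the transfer time is a theoretical lower bound that ignores protocol overhead.

```python
NUM_LINKS = 18
TOTAL_BIDIRECTIONAL_GB_S = 900.0   # assumed to count both directions


def per_link_gb_s(total_gb_s: float, links: int) -> float:
    """Bandwidth contributed by a single link (both directions)."""
    return total_gb_s / links


def transfer_seconds(size_gb: float, bandwidth_gb_s: float) -> float:
    """Ideal time to move a payload at the given bandwidth."""
    return size_gb / bandwidth_gb_s


per_link = per_link_gb_s(TOTAL_BIDIRECTIONAL_GB_S, NUM_LINKS)
one_direction = TOTAL_BIDIRECTIONAL_GB_S / 2
seconds = transfer_seconds(9.0, one_direction)   # hypothetical 9 GB payload

print(f"per link:      {per_link:.0f} GB/s bidirectional")
print(f"one direction: {one_direction:.0f} GB/s across all links")
print(f"9 GB one-way:  {seconds * 1000:.0f} ms at best")
```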

Hope it was useful. Cheers!