vineethac.blogspot.com: May 2026

Sunday, May 17, 2026

Working with GPUs - Part6 - H100 SXM5 architecture

Now that we have a foundational understanding of the core utilities used to monitor and manage GPUs, let's dive into the hardware architecture of the NVIDIA H100 SXM5. To understand GPU computing, it helps to visualize how data flows through the silicon. The following overview maps out the internal components of the H100, giving you a frame of reference for key architectural terms such as Streaming Multiprocessors (SMs), Tensor Cores, L2 cache, and High Bandwidth Memory (HBM3).

Overview

The H100 was released around September 2022.
It is based on the Hopper architecture.
It comes in two form factors:

PCIe (300 to 350 W, configurable)
SXM5 (up to 700 W)

80 GB of HBM3 memory at 3.35 TB/s (SXM5)
132 SMs (SXM5)
528 Tensor Cores (4 per SM)
80 billion transistors on TSMC's custom 4N process node

Architecture

Image ref: NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog

HBM3 - High Bandwidth Memory

This is the 80 GB device memory. It sits off the GPU die but on the same package.
H100 SXM5 has 5 HBM3 stacks, connected via 10 independent 512-bit memory controllers.
Data flow: SM - L1 - L2 - memory partition/crossbar - memory controller - HBM stack
Each HBM3 stack is DRAM.
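The memory figures above can be cross-checked with a little arithmetic. The sketch below uses only numbers from this post (5 stacks, 10 controllers of 512 bits each, 80 GB, 3.35 TB/s). The per-pin data rate it derives is an estimate implied by those numbers, not an official specification.

```python
# Figures taken from this post (H100 SXM5).
HBM_STACKS = 5
MEMORY_CONTROLLERS = 10
CONTROLLER_WIDTH_BITS = 512
CAPACITY_GB = 80
BANDWIDTH_TB_S = 3.35


def hbm_summary():
    """Derive bus width, controllers per stack, GB per stack, and Gbit/s per pin."""
    bus_width_bits = MEMORY_CONTROLLERS * CONTROLLER_WIDTH_BITS
    controllers_per_stack = MEMORY_CONTROLLERS // HBM_STACKS
    capacity_per_stack_gb = CAPACITY_GB / HBM_STACKS
    # Implied data rate per bus bit: total bits per second divided by bus width.
    gbit_per_pin = BANDWIDTH_TB_S * 1e12 * 8 / bus_width_bits / 1e9
    return bus_width_bits, controllers_per_stack, capacity_per_stack_gb, gbit_per_pin


bus, per_stack, gb_per_stack, pin_rate = hbm_summary()
print(f"bus width: {bus} bits, {per_stack} controllers per stack")
print(f"{gb_per_stack:.0f} GB per stack, ~{pin_rate:.2f} Gbit/s per pin")
```

So the 10 controllers add up to a 5120-bit memory interface, with two controllers and 16 GB per stack.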

L2 cache

50 MB of L2 cache, divided into two 25 MB partitions.
L2 cache is SRAM.

Unified shared memory + L1 cache

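256 KB of combined L1 cache and shared memory per SM, of which up to 228 KB can be configured as shared memory.
Shared memory is divided into 32 banks, each 32 bits (4 bytes) wide.
The roughly 33 TB/s bandwidth figure is the aggregate across all 132 SMs, not a per-SM number.
This is on-chip SRAM.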
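The bank layout matters because threads of a warp that hit the same bank at different addresses get serialized. Here is a minimal sketch of that mapping, assuming the usual CUDA rule that consecutive 4-byte words go to consecutive banks and a warp size of 32. The function names are made up for illustration, and the model ignores the broadcast case where several threads read the same address.

```python
NUM_BANKS = 32        # banks per SM, from this post
BANK_WIDTH_BYTES = 4  # each bank is 32 bits wide, from this post
WARP_SIZE = 32        # assumption: standard CUDA warp size


def bank_of(byte_address):
    """Bank that serves the 4-byte word containing byte_address."""
    return (byte_address // BANK_WIDTH_BYTES) % NUM_BANKS


def conflict_degree(stride_words):
    """Largest number of threads in one warp that land on the same bank
    when thread t reads the 4-byte word at index t * stride_words."""
    hits = [0] * NUM_BANKS
    for t in range(WARP_SIZE):
        hits[bank_of(t * stride_words * BANK_WIDTH_BYTES)] += 1
    return max(hits)


for stride in (1, 2, 32, 33):
    print(f"stride {stride:2d}: {conflict_degree(stride)}-way access")
```

A stride of 32 words sends every thread to bank 0, while a stride of 33 spreads them across all banks again. This is why padding a shared memory array by one element per row is a common trick.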

Registers

Every thread gets a private set of on-chip registers.
They have very high bandwidth and very low latency.
The register file is 256 KB per SM (65,536 32-bit registers).
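Register usage per thread directly limits how many threads can be resident on an SM. The sketch below works that out from the 256 KB figure above. The 2048 threads-per-SM cap is an assumption taken from the CUDA limits for compute capability 9.0, and the model ignores register allocation granularity, so real occupancy can be a little lower.

```python
REGISTER_FILE_BYTES = 256 * 1024  # per SM, from this post
REGISTER_BYTES = 4                # one 32-bit register
MAX_THREADS_PER_SM = 2048         # assumption: CUDA limit for compute capability 9.0


def registers_per_sm():
    """Number of 32-bit registers in one SM's register file."""
    return REGISTER_FILE_BYTES // REGISTER_BYTES


def max_resident_threads(regs_per_thread):
    """Threads that fit in the register file, capped by the per-SM thread limit."""
    return min(MAX_THREADS_PER_SM, registers_per_sm() // regs_per_thread)


for regs in (32, 64, 128, 255):
    print(f"{regs:3d} registers/thread -> up to {max_resident_threads(regs)} threads per SM")
```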

GigaThread engine

This is the hardware that takes a kernel launch and hands out thread blocks to SMs.
It tracks which thread blocks are not yet started, running, and finished.
When an SM has capacity to run another thread block, the GigaThread engine assigns the next thread block to that SM.
Because blocks are handed out as capacity frees up, SMs stay busy even when thread blocks take different amounts of time to finish.
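The behaviour described above can be pictured with a toy model. This is only a sketch: the real hardware scheduler is not publicly documented, and the function name, the one-block-per-SM default, and the abstract time units are all assumptions made for illustration.

```python
import heapq


def simulate_block_dispatch(block_durations, num_sms=132, slots_per_sm=1):
    """Greedy model: give the next pending block to whichever slot frees up first.

    Returns (time when the last block finishes, blocks executed per SM).
    """
    # Each heap entry is (time the slot becomes free, SM index).
    free_at = [(0.0, sm) for sm in range(num_sms) for _ in range(slots_per_sm)]
    heapq.heapify(free_at)
    blocks_per_sm = [0] * num_sms
    finish = 0.0
    for duration in block_durations:
        start, sm = heapq.heappop(free_at)  # earliest available slot
        end = start + duration
        blocks_per_sm[sm] += 1
        finish = max(finish, end)
        heapq.heappush(free_at, (end, sm))
    return finish, blocks_per_sm


# 264 equal blocks fill the 132 SMs exactly twice (two full "waves").
print(simulate_block_dispatch([1.0] * 264)[0])
# 133 blocks also take two waves, although the second wave uses a single SM.
print(simulate_block_dispatch([1.0] * 133)[0])
```

The second case shows the tail effect: one extra block beyond a full wave costs as much time as a whole additional wave, which is why grid sizes are often chosen as a multiple of the SM count.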

SM - Streaming Multiprocessor

The SM is the fundamental execution unit of the GPU; it executes the thread blocks of a CUDA kernel.
H100 SXM5 GPU has 132 SMs.
An SM contains the following components:

FP32 CUDA cores, INT32 and FP64 units
4th gen Tensor cores
Shared memory/ L1 cache
L1 instruction cache
Warp scheduler
Dispatch units
Registers
L0 instruction cache

Each SM is divided into 4 identical sub-divisions called Quadrants or SMSPs (SM Sub Partitions).
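To make the quadrant split concrete, the sketch below divides the per-SM resources by four and scales them up to the whole GPU. The SM count, Tensor Core count, and register file size come from this post. The figure of 128 FP32 cores per SM is not from this post; it is an assumption based on NVIDIA's published Hopper specifications.

```python
SMS = 132                # from this post
QUADRANTS_PER_SM = 4     # from this post
TENSOR_CORES_PER_SM = 4  # from this post
REGISTER_FILE_KB = 256   # per SM, from this post
FP32_CORES_PER_SM = 128  # assumption: NVIDIA's published Hopper SM figure


def per_quadrant(per_sm_amount):
    """Share of a per-SM resource that belongs to one quadrant (SMSP)."""
    return per_sm_amount // QUADRANTS_PER_SM


print("FP32 cores per quadrant:", per_quadrant(FP32_CORES_PER_SM))
print("Tensor Cores per quadrant:", per_quadrant(TENSOR_CORES_PER_SM))
print("Register file per quadrant (KB):", per_quadrant(REGISTER_FILE_KB))
print("FP32 cores per GPU:", SMS * FP32_CORES_PER_SM)
print("Tensor Cores per GPU:", SMS * TENSOR_CORES_PER_SM)
```

Each quadrant therefore has 32 FP32 cores, which matches the 32 threads of a warp.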

TMA - Tensor Memory Accelerator

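Each SM has a TMA unit.
It offloads tensor (multi-dimensional array) copies between global memory and shared memory from the SM's threads, so the copies run asynchronously while the threads keep computing.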

Tensor core

Tensor Cores are specialized units for fast matrix multiply-accumulate (MMA) operations of the form D = A x B + C.
4 Tensor Cores per SM.
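For readers new to the term, an MMA computes D = A x B + C on small matrix tiles. The plain-Python sketch below only shows the arithmetic. It says nothing about how the hardware does it, and the function name and the tiny 2 x 2 tile are made up for illustration.

```python
def mma(a, b, c):
    """Return D = A x B + C for matrices given as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "A must be m x k"
    assert len(c) == m and all(len(row) == n for row in c), "C must be m x n"
    return [
        [c[i][j] + sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
print(mma(a, b, c))  # [[20, 23], [44, 51]]
```

A Tensor Core performs this whole tile operation as one hardware instruction instead of looping over individual multiplies and adds.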

GPC - Graphics Processing Cluster

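A fully enabled GPC is a group of 18 SMs (9 TPCs).
There are 8 GPCs in an H100. That gives 144 SMs on the full die, of which 132 are enabled on the H100 SXM5.
Each GPC is attached to one of the two L2 cache partitions.
Thread block clusters are scheduled within a GPC, which lets the SMs in a cluster access each other's shared memory (distributed shared memory).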

TPC - Texture Processing Cluster

A single TPC holds 2 SMs.
The TPC is the grouping level between the SM and the GPC; pairing two SMs in one block keeps communication between them fast.
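The GPC, TPC, and SM counts above fit together as follows. The sketch uses only numbers from this post. The difference between the full-die count and the 132 enabled SMs is simply derived from them.

```python
GPCS = 8           # from this post
SMS_PER_GPC = 18   # from this post (a fully enabled GPC)
SMS_PER_TPC = 2    # from this post
ENABLED_SMS = 132  # from this post (H100 SXM5)

full_die_sms = GPCS * SMS_PER_GPC
tpcs_per_gpc = SMS_PER_GPC // SMS_PER_TPC
enabled_tpcs = ENABLED_SMS // SMS_PER_TPC
disabled_sms = full_die_sms - ENABLED_SMS

print(f"full die: {full_die_sms} SMs, {tpcs_per_gpc} TPCs per GPC")
print(f"H100 SXM5: {enabled_tpcs} TPCs enabled, {disabled_sms} SMs disabled")
```

So the hierarchy is GPU > GPC > TPC > SM, and not every GPC on a shipping H100 SXM5 has all 18 of its SMs active.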

NVLink

H100 uses 4th-generation NVLink.
18 NVLink 4.0 links, which together give 900 GB/s of bidirectional GPU-to-GPU bandwidth (50 GB/s per link).
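A quick back-of-the-envelope calculation shows what these numbers mean in practice. The sketch assumes the 900 GB/s figure counts both directions combined, so one direction gets half of it, and it ignores protocol overhead. The function names are made up for illustration.

```python
NVLINK_CONNECTIONS = 18  # from this post
TOTAL_GB_S = 900         # from this post; assumed to be both directions combined


def per_link_gb_s():
    """Bidirectional bandwidth of a single NVLink connection."""
    return TOTAL_GB_S / NVLINK_CONNECTIONS


def transfer_seconds(size_gb, gb_per_s=TOTAL_GB_S / 2):
    """Ideal time to move size_gb in one direction, ignoring protocol overhead."""
    return size_gb / gb_per_s


print(f"{per_link_gb_s():.0f} GB/s per connection")
print(f"copying 80 GB to a peer GPU: ~{transfer_seconds(80):.2f} s at best")
```

In other words, copying the entire 80 GB of one GPU's memory to a neighbouring GPU takes on the order of a fifth of a second in the ideal case.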

Hope it was useful. Cheers!