XID (short for eXception ID) errors are diagnostic messages emitted by the NVIDIA kernel driver (NVRM) when a GPU encounters an abnormal condition or fault. While some point to minor software glitches, others signal catastrophic hardware failures. On H100 systems, which are equipped with High Bandwidth Memory (HBM3) and NVLink interconnects, understanding these errors is critical to minimizing downtime.
How do we identify whether any GPU has XID errors?
DCGMI diag
# dcgmi diag -r 3
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.1.8                                          |
| Driver Version Detected   | 550.90.07                                      |
| GPU Device IDs Detected   | 2330,2330,2330,2330,2330,2330,2330,2330        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - All                                     |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - All                                     |
| Diagnostic                | Pass - GPUs: 1, 2, 3, 4, 5, 6, 7               |
|                           | Fail - GPU: 0                                  |
| Warning                   | GPU 0 Found 56954234 faulty memory elements o  |
|                           | n GPU 0 Run a field diagnostic on the GPU.     |
| Info                      | GPU 0 Allocated space for 137 output matricie  |
|                           | s from 75937126809 bytes available., GPU 0 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 0 GPU 0 calculated at approximately 230  |
|                           | 2.72 gigaflops during this test                |
+-----  Stress  ------------+------------------------------------------------+
| Targeted Stress           | Pass - All                                     |
| Targeted Power            | Pass - GPUs: 1, 2, 3, 4, 5, 6, 7               |
|                           | Fail - GPU: 0                                  |
| Warning                   | GPU 0 Detected 43 xid_errors for GPU 0         | < xid_error
| Info                      | GPU 0 GPU 0 power average: 161 W               |
| Info                      | GPU 1 GPU 1 power average: 170 W               |
| Info                      | GPU 2 GPU 2 power average: 164 W               |
| Info                      | GPU 3 GPU 3 power average: 158 W               |
| Info                      | GPU 4 GPU 4 power average: 154 W               |
| Info                      | GPU 5 GPU 5 power average: 169 W               |
| Info                      | GPU 6 GPU 6 power average: 158 W               |
| Info                      | GPU 7 GPU 7 power average: 159 W               |
| Memory Bandwidth          | Pass - All                                     |
| EUD Test                  | Skip - All                                     |
+---------------------------+------------------------------------------------+
DCGMI dmon
# dcgmi dmon -e 230 --count 1
#Entity   XIDER
ID
GPU 7     0
GPU 6     0
GPU 5     0
GPU 4     0
GPU 3     0
GPU 2     0
GPU 1     0
GPU 0     43
- -e 230 is the DCGM field ID that reports XID errors. The value shown in the XIDER column is the specific XID error code for that GPU.
- Note: A non-zero value means that GPU has logged an XID error. Cross-reference the specific XID code with the kernel log and the NVIDIA XID documentation to understand the nature of the fault.
- Ref: Field Identifiers, NVIDIA DCGM Documentation
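The check described above can be scripted. The following is a minimal Python sketch, assuming the dmon output has the same three-token row layout shown in this article ("GPU <index> <value>"). The function names parse_dmon_xid and gpus_with_xid are hypothetical helpers, not part of DCGM, and the sample text is copied from the output above rather than read from a live system.

```python
# Sketch: find GPUs whose DCGM field 230 (XIDER) value is non-zero.
# Assumes the text layout of `dcgmi dmon -e 230 --count 1` shown in this article.

SAMPLE = """\
#Entity   XIDER
ID
GPU 7     0
GPU 6     0
GPU 5     0
GPU 4     0
GPU 3     0
GPU 2     0
GPU 1     0
GPU 0     43
"""


def parse_dmon_xid(output):
    """Return {gpu_index: field_230_value} for every 'GPU <n> <value>' row."""
    values = {}
    for line in output.splitlines():
        parts = line.split()
        # Header rows ("#Entity XIDER", "ID") and non-numeric values are skipped.
        if (
            len(parts) == 3
            and parts[0] == "GPU"
            and parts[1].isdigit()
            and parts[2].isdigit()
        ):
            values[int(parts[1])] = int(parts[2])
    return values


def gpus_with_xid(output):
    """Return only the GPUs that report a non-zero XID value."""
    return {gpu: xid for gpu, xid in parse_dmon_xid(output).items() if xid != 0}


if __name__ == "__main__":
    print(gpus_with_xid(SAMPLE))
```

On a real node, the SAMPLE string would be replaced by the captured output of the dmon command. Any GPU returned here still needs to be checked against the kernel log, as the note above describes.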
OS logs
# dmesg | grep -i xid
[ 747.179157] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 1): Out Of Range Address
[ 747.180533] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a7b0=0x100000e 0x56a7b4=0x20 0x56a7a8=0x1f81fb60 0x56a7ac=0x1174
[ 747.209815] NVRM: Xid (PCI:0000:19:00): 43, pid=8548, name=nvvs, Ch 00000009
[ 1250.627528] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 3, SM 0): Out Of Range Address
[ 1250.629013] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x568730=0x100000e 0x568734=0x20 0x568728=0x1f81fb60 0x56872c=0x1174
[ 1250.657381] NVRM: Xid (PCI:0000:19:00): 43, pid=10603, name=nvvs, Ch 00000009
[45911.449627] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 0): Out Of Range Address
[45911.451042] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a730=0x103000e 0x56a734=0x20 0x56a728=0x1f81fb60 0x56a72c=0x1174
[45911.479823] NVRM: Xid (PCI:0000:19:00): 43, pid=79302, name=nvvs, Ch 00000009
# journalctl -k | grep -i xid
May 04 20:30:00 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 1): Out Of Range Address
May 04 20:30:00 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a7b0=0x100000e 0x56a7b4=0x20 0x56a7a8=0x1f81fb60 0x56a7ac=0x1174
May 04 20:30:00 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 43, pid=8548, name=nvvs, Ch 00000009
May 04 20:38:23 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 3, SM 0): Out Of Range Address
May 04 20:38:23 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x568730=0x100000e 0x568734=0x20 0x568728=0x1f81fb60 0x56872c=0x1174
May 04 20:38:23 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 43, pid=10603, name=nvvs, Ch 00000009
May 05 09:02:43 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 0): Out Of Range Address
May 05 09:02:43 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a730=0x103000e 0x56a734=0x20 0x56a728=0x1f81fb60 0x56a72c=0x1174
May 05 09:02:43 xx110-r113-node-02 kernel: NVRM: Xid (PCI:0000:19:00): 43, pid=79302, name=nvvs, Ch 00000009
# grep -i xid /var/log/syslog
May 4 20:30:00 xx110-r113-node-02 kernel: [ 747.179157] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 1): Out Of Range Address
May 4 20:30:00 xx110-r113-node-02 kernel: [ 747.180533] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a7b0=0x100000e 0x56a7b4=0x20 0x56a7a8=0x1f81fb60 0x56a7ac=0x1174
May 4 20:30:00 xx110-r113-node-02 kernel: [ 747.209815] NVRM: Xid (PCI:0000:19:00): 43, pid=8548, name=nvvs, Ch 00000009
May 4 20:38:23 xx110-r113-node-02 kernel: [ 1250.627528] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 3, SM 0): Out Of Range Address
May 4 20:38:23 xx110-r113-node-02 kernel: [ 1250.629013] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x568730=0x100000e 0x568734=0x20 0x568728=0x1f81fb60 0x56872c=0x1174
May 4 20:38:23 xx110-r113-node-02 kernel: [ 1250.657381] NVRM: Xid (PCI:0000:19:00): 43, pid=10603, name=nvvs, Ch 00000009
May 4 20:40:44 xx110-r113-node-02 drpcli[4139]: Starting xid error detection test...
May 4 20:40:44 xx110-r113-node-02 drpcli[4139]: [MANDATORY] test_gpu_xid_errors: PASS, GPU XID error check passed. No errors found.
May 5 09:02:43 xx110-r113-node-02 kernel: [45911.449627] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 0): Out Of Range Address
May 5 09:02:43 xx110-r113-node-02 kernel: [45911.451042] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a730=0x103000e 0x56a734=0x20 0x56a728=0x1f81fb60 0x56a72c=0x1174
May 5 09:02:43 xx110-r113-node-02 kernel: [45911.479823] NVRM: Xid (PCI:0000:19:00): 43, pid=79302, name=nvvs, Ch 00000009
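To summarize log output like the above, the NVRM lines can be grouped by PCI address and XID code. This is a minimal Python sketch, assuming the "NVRM: Xid (PCI:<address>): <code>," line format seen in these logs. The name count_xids is a hypothetical helper, and the short descriptions in XID_NAMES are abbreviated from NVIDIA's XID documentation and should be verified against the documentation for your driver version.

```python
import re
from collections import Counter

# Matches e.g. "NVRM: Xid (PCI:0000:19:00): 13, pid=..." (format assumed from the logs above).
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")

# Abbreviated descriptions for the two codes seen in this article (assumption: verify
# against NVIDIA's XID documentation for your driver version).
XID_NAMES = {
    13: "Graphics Engine Exception",
    43: "GPU stopped processing",
}

SAMPLE = """\
[ 747.179157] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 1): Out Of Range Address
[ 747.180533] NVRM: Xid (PCI:0000:19:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x56a7b0=0x100000e 0x56a7b4=0x20 0x56a7a8=0x1f81fb60 0x56a7ac=0x1174
[ 747.209815] NVRM: Xid (PCI:0000:19:00): 43, pid=8548, name=nvvs, Ch 00000009
May 4 20:40:44 xx110-r113-node-02 drpcli[4139]: Starting xid error detection test...
"""


def count_xids(lines):
    """Count NVRM Xid log lines per (PCI address, XID code)."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:  # lines that merely mention "xid" (e.g. drpcli) do not match
            counts[(match.group("pci"), int(match.group("xid")))] += 1
    return counts


if __name__ == "__main__":
    for (pci, xid), n in sorted(count_xids(SAMPLE.splitlines()).items()):
        print(f"{pci}  Xid {xid} ({XID_NAMES.get(xid, 'see NVIDIA XID documentation')}): {n} line(s)")
```

Matching on the NVRM prefix rather than a bare "xid" keeps unrelated lines, such as the drpcli test messages in the syslog output, out of the counts. Note that a single event can produce more than one log line, so the counts are lines, not distinct faults.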
