The NVIDIA Data Center GPU Manager (DCGM) is a lightweight agent that handles GPU telemetry collection, health monitoring, diagnostics, and policy management. DCGM is the underlying framework: when you install it, it runs a service called the Host Engine (nv-hostengine), which collects data, monitors health, and manages GPUs. DCGMI is simply the CLI tool (the interface) used to talk to the engine. If you want to know what the engine is seeing, or if you want to tell the engine to do something, you use dcgmi.
Install DCGM
- In my case, I am installing it on an Ubuntu 22.04.2 host.
- You can download the required binaries from this NVIDIA repository.
- If there are previous versions of the package installed, you can follow this documentation and remove them.
- For example, I am installing version 4.5.2, which is compatible with CUDA 13.0.
- Download the following packages from the repository mentioned above:
- datacenter-gpu-manager-4-core_4.5.2-1_amd64.deb
- datacenter-gpu-manager-4-cuda13_4.5.2-1_amd64.deb
- datacenter-gpu-manager-4-proprietary_4.5.2-1_amd64.deb
- datacenter-gpu-manager-4-proprietary-cuda13_4.5.2-1_amd64.deb
- Install them.
sudo DEBIAN_FRONTEND=noninteractive dpkg -i datacenter-gpu-manager-4-core_4.5.2-1_amd64.deb
sudo DEBIAN_FRONTEND=noninteractive dpkg -i datacenter-gpu-manager-4-cuda13_4.5.2-1_amd64.deb
sudo DEBIAN_FRONTEND=noninteractive dpkg -i datacenter-gpu-manager-4-proprietary_4.5.2-1_amd64.deb
sudo DEBIAN_FRONTEND=noninteractive dpkg -i datacenter-gpu-manager-4-proprietary-cuda13_4.5.2-1_amd64.deb
# Enable and start the dcgm service (enable --now does both)
sudo systemctl enable --now nvidia-dcgm
# Check if the service is active
systemctl is-active --quiet nvidia-dcgm
- Verify the installed version using: dcgmi --version
# dcgmi --version
dcgmi version: 4.5.2
- List all GPUs discovered by the host engine: dcgmi discovery -l
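In automation, the same discovery call can double as a liveness probe for the host engine. A minimal Python sketch, assuming dcgmi is installed and on PATH (the helper name is my own):

```python
import subprocess

def hostengine_reachable(timeout_s=30):
    """Return True when `dcgmi discovery -l` exits 0, i.e. the CLI could
    reach a running nv-hostengine and enumerate GPUs."""
    try:
        result = subprocess.run(
            ["dcgmi", "discovery", "-l"],
            capture_output=True, timeout=timeout_s,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # dcgmi is not installed, or the host engine is hung
        return False
    return result.returncode == 0
```

This is useful as a node-readiness check in cluster tooling before scheduling any GPU work.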
DCGM Diagnostics
Diagnostics is a subsystem within DCGM designed to stress-test and validate the physical and logical integrity of the GPU. It is a suite of automated tests that push the GPU beyond its normal idle state to uncover hidden hardware defects, driver instabilities, or environmental issues (such as poor cooling or a failing power supply). In production environments, this utility helps assess cluster readiness before a workload is deployed. It supports multiple run levels, as explained below.
- Level 1: a quick sanity check, typically run before starting a container or job to ensure the GPU is "alive."
- Level 2: used to analyze failures and gather more context about them.
- Level 3/4: extensive hardware screening (thermal behavior, bandwidth checks, sustained stress, etc.).
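The Level 1 "gate before the job" pattern can be sketched in a few lines of Python. This is only an illustration, assuming dcgmi is on PATH with the host engine running; the function names are mine, and the output scan keys off the `| <test> | Fail |` rows of the diag result table:

```python
import subprocess

def diag_output_has_failure(diag_text):
    """Scan dcgmi diag table text for any row whose result cell reports Fail,
    e.g. `| software | Fail |` or `| | GPU3: Fail |`."""
    for line in diag_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 2 and "Fail" in cells[1]:
            return True
    return False

def gpu_healthy(timeout_s=300):
    """Run a quick Level 1 sanity check and gate on the result."""
    try:
        result = subprocess.run(
            ["dcgmi", "diag", "-r", "1"],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    # dcgmi exits non-zero on failure; the text scan is a belt-and-braces check
    return result.returncode == 0 and not diag_output_has_failure(result.stdout)
```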
# dcgmi diag -r 3
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Metadata ----------+------------------------------------------------|
| DCGM Version | 4.5.2 |
| Driver Version Detected | 580.105.08 |
| GPU Device IDs Detected | 2330, 2330, 2330, 2330, 2330, 2330, 2330, 2330 |
|----- Deployment --------+------------------------------------------------|
| software | Fail |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU3: Fail |
| Warning: GPU3 | Page Retirement/Row Remap: GPU 3 had uncorrec |
| | table memory errors and row remapping failed. |
| | Run a field diagnostic on the GPU. |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
+----- Hardware ----------+------------------------------------------------+
| memory | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| diagnostic | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| nvbandwidth | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
+----- Integration -------+------------------------------------------------+
| pcie | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
+----- Stress ------------+------------------------------------------------+
| memory_bandwidth | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| targeted_stress | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| targeted_power | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
+---------------------------+------------------------------------------------+
# dcgmi diag -r 4
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Metadata ----------+------------------------------------------------|
| DCGM Version | 4.5.2 |
| Driver Version Detected | 580.105.08 |
| GPU Device IDs Detected | 2330, 2330, 2330, 2330, 2330, 2330, 2330, 2330 |
|----- Deployment --------+------------------------------------------------|
| software | Fail |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU3: Fail |
| Warning: GPU3 | Page Retirement/Row Remap: GPU 3 had uncorrec |
| | table memory errors and row remapping failed. |
| | Run a field diagnostic on the GPU. |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
+----- Hardware ----------+------------------------------------------------+
| memory | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| diagnostic | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| nvbandwidth | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| pulse_test | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
+----- Integration -------+------------------------------------------------+
| pcie | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
+----- Stress ------------+------------------------------------------------+
| memtest | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| memory_bandwidth | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| targeted_stress | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
| targeted_power | Pass |
| | GPU0: Pass |
| | GPU1: Pass |
| | GPU2: Pass |
| | GPU4: Pass |
| | GPU5: Pass |
| | GPU6: Pass |
| | GPU7: Pass |
| | GPU3: Skip |
+---------------------------+------------------------------------------------+
Here you can see that GPU 3 had uncorrectable memory errors and that row remapping failed.
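To act on a result like this programmatically (for example, to mask the bad GPU out of a job), the per-GPU rows of the table can be parsed. A rough sketch; the helper names and the CUDA_VISIBLE_DEVICES approach are my own illustration, not part of DCGM:

```python
import re

# Matches per-GPU result rows in the diag table, e.g. "| | GPU3: Fail |"
GPU_RESULT = re.compile(r"GPU(\d+):\s*(Pass|Fail|Skip)")

def failing_gpus(diag_text):
    """Return sorted indices of GPUs with at least one Fail result."""
    return sorted({int(m.group(1))
                   for m in GPU_RESULT.finditer(diag_text)
                   if m.group(2) == "Fail"})

def visible_devices(total_gpus, failed):
    """Build a CUDA_VISIBLE_DEVICES value that masks out failed GPUs."""
    return ",".join(str(i) for i in range(total_gpus) if i not in failed)
```

For the run above, `failing_gpus` would flag GPU 3, and `visible_devices(8, [3])` yields a device list that keeps jobs off it until a field diagnostic is done.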
The dcgmi diag utility consists of multiple plugins, as detailed below. Based on the selected run level, the corresponding plugins are used to conduct the tests.
- Deployment: verifies the compute environment is ready to run CUDA applications and can load the NVML library.
- Diagnostic: performs large matrix multiplications. This stresses the GPU by making it draw a large amount of power and sustain a high level of throughput for five minutes (by default). During this process, the GPU is monitored for all standard errors (XIDs, temperature violations, uncorrectable memory errors, etc.) as well as the correctness of the data being written and read.
- PCIe - GPU bandwidth: stresses communication from the host to the GPUs as well as among the GPUs on the system. It uses NVLink to communicate between GPUs when possible; otherwise, communication between GPUs occurs over PCIe.
- GPU memory: performs comprehensive memory testing to detect hardware faults, ECC errors, and memory corruption issues.
- Targeted power: sustains a high level of power usage. It relies on CUDA and performs large matrix multiplications simultaneously on each GPU in order to keep the GPUs busy and drawing power. Each GPU runs a large workload that is sustained throughout the test; the workload does not pulse.
- Targeted stress: maintains a constant stress level on the GPU by continuously queuing matrix operations and adjusting the workload to achieve the target performance.
- Memtest diagnostic: similar to memtest86; exercises GPU memory with various test patterns.
- Pulse test: fluctuates the power usage to create spikes in current flow on the board, verifying that the power supply is fully functional and can handle wide swings in current.
- NVbandwidth: performs bandwidth measurements on the NVIDIA GPUs within a single host.
- Memory bandwidth: measures how fast each GPU can read from and write to its own memory, which is critical for applications that require high memory throughput. It allocates large memory arrays on each GPU and runs intensive memory operations to stress the memory subsystem. During this process, the GPU is monitored for memory errors, CUDA errors, and performance threshold violations.
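Besides the numeric run levels, dcgmi also accepts individual test names for -r (e.g. `dcgmi diag -r pcie`), which is handy for re-checking a single subsystem after a fix. A tiny helper to build such an invocation (the function name is my own):

```python
def diag_command(tests):
    """Build a dcgmi invocation that runs specific plugins by name,
    e.g. ["pcie", "memtest"] -> dcgmi diag -r pcie,memtest."""
    return ["dcgmi", "diag", "-r", ",".join(tests)]
```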
Note: It is highly recommended to run these diagnostic tests while the node is in maintenance mode or when no active workloads (such as training jobs or inference services) are running on the GPU. Attempting to run higher-level diagnostics (especially levels 3 and 4) on an active node is a recipe for trouble: the diagnostic tests will likely fail to get the resources they need, and the contention for compute engines and VRAM may cause your production workloads to crash.
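One way to enforce that guardrail is to refuse to launch higher-level diagnostics while any GPU shows activity. A rough sketch using nvidia-smi's CSV query output; the helper names are mine, and the 5% idle threshold is an arbitrary assumption:

```python
import subprocess

def parse_utilizations(csv_text):
    """Parse `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`
    output: one integer percentage per line, one line per GPU."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]

def safe_to_run_diag(utilizations, idle_threshold=5):
    """Only allow dcgmi diag -r 3/4 when every GPU is effectively idle."""
    return all(u <= idle_threshold for u in utilizations)

def gpus_idle():
    """Query live utilization; assumes nvidia-smi is on PATH."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return safe_to_run_diag(parse_utilizations(out))
```

A pre-diag wrapper can call `gpus_idle()` and abort (or wait) instead of contending with a live training job for compute engines and VRAM.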