
Saturday, March 7, 2026

Working with GPUs – A Practical Blog Series

This blog series captures practical learnings from working with GPUs in real‑world environments, with a focus on operations, reliability, and scale. Each post deep‑dives into specific aspects of GPU systems based on hands‑on experience, incidents, and operational challenges. Together, these articles aim to share actionable insights, highlight common pitfalls, and help teams build more robust and predictable GPU operations.


Part 01: Using nvidia-smi
Part 02: Memory fault indicators
Part 03: Using dcgmi
Part 04: Thermal issues


Sunday, January 25, 2026

Working with GPUs - Part 3 - Using dcgmi

The NVIDIA Data Center GPU Manager (DCGM) is a lightweight agent that performs several functions like GPU behavior monitoring, health and diagnostics, policy management, etc. DCGM is the underlying framework, and when you install it, it runs a service called the Host Engine, which collects data, monitors health, and manages GPUs. DCGMI is simply the CLI tool (the interface) used to talk to the engine. If you want to know what the engine is seeing or if you want to tell the engine to do something, you use dcgmi.

Install DCGM

  • In my case, I am installing it on an Ubuntu 22.04.2 host.
  • You can download the required binaries from this NVIDIA repository.
  • If there are previous versions of the package installed, you can follow this documentation and remove them.
  • For example, I am installing version 4.5.2, which is compatible with CUDA 13.0.
  • Download the following packages from the above-mentioned repo:
    • datacenter-gpu-manager-4-core_4.5.2-1_amd64.deb
    • datacenter-gpu-manager-4-cuda13_4.5.2-1_amd64.deb
    • datacenter-gpu-manager-4-proprietary_4.5.2-1_amd64.deb
    • datacenter-gpu-manager-4-proprietary-cuda13_4.5.2-1_amd64.deb
  • Install them.
sudo DEBIAN_FRONTEND=noninteractive dpkg -i datacenter-gpu-manager-4-core_4.5.2-1_amd64.deb
sudo DEBIAN_FRONTEND=noninteractive dpkg -i datacenter-gpu-manager-4-cuda13_4.5.2-1_amd64.deb
sudo DEBIAN_FRONTEND=noninteractive dpkg -i datacenter-gpu-manager-4-proprietary_4.5.2-1_amd64.deb
sudo DEBIAN_FRONTEND=noninteractive dpkg -i datacenter-gpu-manager-4-proprietary-cuda13_4.5.2-1_amd64.deb

# Enable the dcgm service
systemctl --now enable nvidia-dcgm
systemctl start nvidia-dcgm

# Check if the service is active
systemctl is-active --quiet nvidia-dcgm
  • Verify the installed version using: dcgmi --version
# dcgmi --version

dcgmi  version: 4.5.2

  • List all GPUs discovered by the host engine: dcgmi discovery -l
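Beyond discovery, dcgmi can also group GPUs and run background health watches against them. A minimal sketch follows; the group name and GPU IDs are placeholders, and it assumes dcgmi health operates on the default all-GPUs group when -g is omitted:

# Create a named GPU group containing GPUs 0 and 1 (name and IDs are placeholders)
dcgmi group -c gpu-group -a 0,1

# List existing groups and their IDs
dcgmi group -l

# Enable all background health watches, then query current health status
dcgmi health -s a
dcgmi health -c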


DCGM Diagnostics

Diagnostics is a subsystem within DCGM designed to stress-test and validate the physical and logical integrity of the GPU. It is a suite of automated tests that push the GPU beyond its normal idle state to uncover hidden hardware defects, driver instabilities, or environmental issues (like poor cooling or failing power supplies). In production environments, this utility helps assess cluster readiness before a workload is deployed. It supports multiple run levels, as explained below.

  • Level 1: a quick sanity check, typically run before starting a container or job to ensure the GPU is "alive."
  • Level 2: used for analyzing failures and getting more context about them.
  • Level 3/4: extensive hardware screening (e.g., thermal throttling checks, bandwidth checks, etc.).
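For example, a minimal sketch of a level 1 pre-job check, assuming it runs in a bash prolog script and that dcgmi diag exits non-zero when a test fails:

# Hedged sketch: fail fast if the quick sanity check does not pass
dcgmi diag -r 1 || { echo "GPU sanity check failed"; exit 1; }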

Here is how you can run a level 3 test: dcgmi diag -r 3
# dcgmi diag -r 3
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 4.5.2                                          |
| Driver Version Detected   | 580.105.08                                     |
| GPU Device IDs Detected   | 2330, 2330, 2330, 2330, 2330, 2330, 2330, 2330 |
|-----  Deployment  --------+------------------------------------------------|
| software                  | Fail                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Fail                                     |
| Warning: GPU3             | Page Retirement/Row Remap: GPU 3 had uncorrec  |
|                           | table memory errors and row remapping failed.  |
|                           |  Run a field diagnostic on the GPU.            |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
+-----  Hardware  ----------+------------------------------------------------+
| memory                    | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| diagnostic                | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| nvbandwidth               | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
+-----  Integration  -------+------------------------------------------------+
| pcie                      | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
+-----  Stress  ------------+------------------------------------------------+
| memory_bandwidth          | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| targeted_stress           | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| targeted_power            | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
+---------------------------+------------------------------------------------+

# dcgmi diag -r 4
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 4.5.2                                          |
| Driver Version Detected   | 580.105.08                                     |
| GPU Device IDs Detected   | 2330, 2330, 2330, 2330, 2330, 2330, 2330, 2330 |
|-----  Deployment  --------+------------------------------------------------|
| software                  | Fail                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Fail                                     |
| Warning: GPU3             | Page Retirement/Row Remap: GPU 3 had uncorrec  |
|                           | table memory errors and row remapping failed.  |
|                           |  Run a field diagnostic on the GPU.            |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
+-----  Hardware  ----------+------------------------------------------------+
| memory                    | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| diagnostic                | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| nvbandwidth               | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| pulse_test                | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
+-----  Integration  -------+------------------------------------------------+
| pcie                      | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
+-----  Stress  ------------+------------------------------------------------+
| memtest                   | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| memory_bandwidth          | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| targeted_stress           | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
| targeted_power            | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
|                           | GPU3: Skip                                     |
+---------------------------+------------------------------------------------+

Here you can see that GPU 3 had uncorrectable memory errors and row remapping failed; the remaining tests were skipped for that GPU.

The dcgmi diag utility consists of multiple plugins, as detailed below. Based on the selected run level, the respective plugins are used to conduct the tests.

  • Deployment: verifies that the compute environment is ready to run CUDA applications and is able to load the NVML library.
  • Diagnostic: performs large matrix multiplications. This stresses the GPU by having it draw a large amount of power and sustain a high level of throughput for five minutes (by default). During this process, the GPU is monitored for all standard errors like XIDs, temperature violations, and uncorrectable memory errors, as well as the correctness of data being written and read.
  • PCIe - GPU bandwidth: stresses the communication from the host to the GPUs as well as among the GPUs on the system. It uses NVLink to communicate between GPUs when possible; otherwise, communication between GPUs occurs over PCIe.
  • GPU memory: performs comprehensive memory testing to detect hardware faults, ECC errors, and memory corruption issues.
  • Targeted power: sustains a high level of power usage. It relies on CUDA and performs large matrix multiplications simultaneously on each GPU in order to keep the GPUs busy and drawing power. Each GPU has a large workload that is sustained throughout the test; the workload does not pulse.
  • Targeted stress: maintains a constant stress level on the GPU by continuously queuing matrix operations and adjusting the workload to achieve the target performance.
  • Memtest diagnostic: similar to memtest86; exercises GPU memory with various test patterns.
  • Pulse test: fluctuates the power usage to create spikes in current flow on the board, ensuring that the power supply is fully functional and can handle wide fluctuations in current.
  • NVbandwidth: performs bandwidth measurements on NVIDIA GPUs on a single host.
  • Memory bandwidth: measures how fast each GPU can read from and write to its own memory, which is critical for applications that require high memory throughput. It allocates large memory arrays on each GPU and runs intensive memory operations to stress the memory subsystem. During this process, the GPU is monitored for memory errors, CUDA errors, and performance thresholds.
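In recent DCGM releases, individual plugins can also be run by name rather than by level; exact plugin names can vary by version, so treat these as illustrative:

# Run only the PCIe plugin
dcgmi diag -r pcie

# Run only the memtest plugin
dcgmi diag -r memtest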

Note: It is highly recommended to run these diagnostic tests while the node is in maintenance mode or when no active workloads (such as training jobs or inference services) are running on the GPU. Attempting to run higher-level diagnostics (especially levels 3 and 4) on an active node is a recipe for trouble: the diagnostic tests will likely fail to get the resources they need, and the contention for compute engines and VRAM may cause your production workloads to crash.
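Before kicking off a diagnostic, one quick way to confirm the GPUs are idle is to query active compute processes with nvidia-smi; empty output means no compute workloads were detected:

# List active compute processes on all GPUs; no output means the GPUs are idle
nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader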

Hope it was useful. Cheers!

Sunday, October 29, 2023

Kubernetes 101 - Part 12 - Debug pod

When it comes to troubleshooting application connectivity and name resolution issues in Kubernetes, having the right tools at your disposal can make all the difference. One of the most common challenges is accessing essential utilities like ping, nslookup, dig, traceroute, and more. To simplify this process, we've created a container image that packs a range of these utilities, making it easy to quickly identify and resolve connectivity issues.

 

The Container Image: A Swiss Army Knife for Troubleshooting

This container image, designed specifically for Kubernetes troubleshooting, comes pre-installed with the following essential utilities:

  1. ping: A classic network diagnostic tool for testing connectivity.
  2. dig: A DNS lookup tool for resolving domain names to IP addresses.
  3. nslookup: A network troubleshooting tool for resolving hostnames to IP addresses.
  4. traceroute: A network diagnostic tool for tracing the path of packets across a network.
  5. curl: A command-line tool for transferring data to and from a web server using HTTP, HTTPS, SCP, SFTP, TFTP, and more.
  6. wget: A command-line tool for downloading files from the web.
  7. nc: A command-line tool for reading and writing data to a network socket.
  8. netstat: A command-line tool for displaying network connections, routing tables, and interface statistics.
  9. ifconfig: A command-line tool for configuring network interfaces.
  10. route: A command-line tool for displaying and modifying the IP routing table.
  11. host: A command-line tool for performing DNS lookups and resolving hostnames.
  12. arp: A command-line tool for displaying and modifying the ARP cache.
  13. iostat: A command-line tool for displaying disk I/O statistics.
  14. top: A command-line tool for displaying system resource usage.
  15. free: A command-line tool for displaying free memory and swap space.
  16. vmstat: A command-line tool for displaying virtual memory statistics.
  17. pmap: A command-line tool for displaying process memory maps.
  18. mpstat: A command-line tool for displaying multiprocessor statistics.
  19. python3: A programming language and interpreter.
  20. pip: A package installer for Python.

 

Run as a pod on Kubernetes

kubectl run debug --image=vineethac/debug -n default -- sleep infinity
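If your cluster supports ephemeral containers, the same image can also be attached to an existing pod with kubectl debug instead of running a standalone pod. A hedged example, where my-app is a placeholder pod name:

kubectl debug -it my-app --image=vineethac/debug -n default -- bash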

 

Exec into the debug pod

kubectl exec -it debug -n default -- bash 
root@debug:/# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=46 time=49.3 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=45 time=57.4 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=46 time=49.4 ms
^C
--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 49.334/52.030/57.404/3.799 ms
root@debug:/#

root@debug:/# nslookup google.com
Server:         10.96.0.10
Address:        10.96.0.10#53

Non-authoritative answer:
Name:   google.com
Address: 142.250.72.206
Name:   google.com
Address: 2607:f8b0:4005:80c::200e
root@debug:/# exit
exit
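Once troubleshooting is done, clean up the pod:

kubectl delete pod debug -n default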

 

Reference

https://github.com/vineethac/Docker/tree/main/debug-image

By having these essential utilities at your fingertips, you'll be better equipped to quickly identify and resolve connectivity issues in your Kubernetes cluster, saving you time and reducing the complexity of troubleshooting.

Hope it was useful. Cheers!

Friday, September 22, 2023

Configure syslog forwarding in vCenter servers using Python

As a system administrator, it's essential to ensure that your vCenter servers are properly configured to collect and forward system logs to a central location for monitoring and analysis. In this blog, we'll explore how to configure syslog forwarding in vCenter servers using Python.

You can access the Python script from my GitHub repository: 
https://github.com/vineethac/VMware/tree/main/vCenter/syslog_forwarding
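As a rough, hedged sketch of one way to get, test, and set this configuration via the vCenter appliance REST API (independent of the script above; endpoint paths and payloads vary by vSphere version, and VC_HOST, USER, PASS, and the syslog server details are placeholders):

# Create an API session token (assumes vSphere 7.0+ REST endpoint paths)
TOKEN=$(curl -sk -X POST -u "USER:PASS" "https://VC_HOST/api/session" | tr -d '"')

# Get the current syslog forwarding configuration
curl -sk -H "vmware-api-session-id: $TOKEN" "https://VC_HOST/api/appliance/logging/forwarding"

# Set forwarding to a remote syslog server (hostname, port, and protocol are placeholders)
curl -sk -X PUT -H "vmware-api-session-id: $TOKEN" -H "Content-Type: application/json" \
  -d '{"cfg_list":[{"hostname":"syslog.example.com","port":514,"protocol":"UDP"}]}' \
  "https://VC_HOST/api/appliance/logging/forwarding"

# Test connectivity to the configured servers (some versions expect a JSON body here)
curl -sk -X POST -H "vmware-api-session-id: $TOKEN" "https://VC_HOST/api/appliance/logging/forwarding?action=test"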



In this blog, we've demonstrated how to get, test, and set syslog forwarding configuration in vCenter servers using Python. By following these steps, you can ensure that your vCenter servers are properly configured to collect and forward system logs to a central location for monitoring and analysis. Remember to replace the placeholders in the config file with your actual vCenter server names, syslog server IP address or hostname, port, and protocol.

Hope it was useful. Cheers!

Saturday, July 9, 2016

Anatomy of Hyper-V cluster debug log

  • Get-ClusterLog dumps the events to a text file
  • Location: C:\Windows\Cluster\Reports\Cluster.log
  • It captures the last 72 hours of logs
  • Cluster log is in GMT (because of geographically spanned multi-site clusters)
  • Usage: Get-ClusterLog -TimeSpan <x> (gives the last "x" minutes of logs)
  • You can also set the levels of logs
  • Set-ClusterLog -Level 3 (level 3 is default)
  • It can be from level 0 to level 5 (increasing the logging level has a performance impact)
  • Level 5 will provide the highest level of detail
  • Log format:
    [ProcessID] [ThreadID] [Date/Time] [INFO/WARN/ERR/DBG] [ResourceType] [ResourceName] [Description]