During routine diagnostics with the dcgmi diag -r 3 test suite, several GPU nodes began failing due to thermal throttling. This blog post outlines how the issue was identified, investigated, and resolved.
The problem
Multiple GPU servers began consistently failing level 3 dcgmi stress tests following a routine BMC (Baseboard Management Controller) firmware upgrade on the 8-GPU Supermicro servers. The diagnostic output flagged clocks_throttle_reason_sw_thermal_slowdown errors and indicated that some GPUs exceeded the user-specified maximum allowed temperature of 87 °C. Importantly, no underlying GPU hardware faults were identified. The environment was running DCGM version 3.1.8 with GPU driver version 550.90.x.
When the GPU temperature goes above 87 °C, thermal diagnostic warnings such as clocks_throttle_reason_sw_thermal_slowdown appear in the dcgmi -r 3 results.
Initial investigation
The initial investigation confirmed that no persistent hardware issues, sensor failures, or abnormal readings existed outside of the R3 test runs. The datacenter partner also confirmed there were no temperature events in the facility, and physical walkthroughs of the affected nodes found conditions normal.
This pointed towards a cooling configuration problem triggered only under the extreme thermal load of the R3 stress test. As a targeted experiment, the server fan mode was manually switched from the default "Optimal Speed" to "Full Speed", followed by a power cycle. The result was definitive: with the fans forced to maximum, all GPUs stayed within safe temperature limits and all R3 tests passed. This was validated across multiple nodes with consistent pass results. However, the datacenter partner's engineering team confirmed that the recommended fan mode for operating these H100 GPUs is "Optimal Speed", ruling out Full Speed as a viable permanent fix.
While the dcgmi R3 tests were running, we continuously gathered GPU metrics and sensor values using nvidia-smi and ipmitool for later review.
nvidia-smi --query-gpu=index,name,temperature.gpu,temperature.memory --format=csv -l 1 >> gpu_temp.log
# cat gpu_temp.log | grep "HBM3, 88" <<< you can see GPU temperature > 87
2, NVIDIA H100 80GB HBM3, 88, 90
1, NVIDIA H100 80GB HBM3, 88, 86
7, NVIDIA H100 80GB HBM3, 88, 80
6, NVIDIA H100 80GB HBM3, 88, 89
6, NVIDIA H100 80GB HBM3, 88, 91
0, NVIDIA H100 80GB HBM3, 88, 85
1, NVIDIA H100 80GB HBM3, 88, 89
5, NVIDIA H100 80GB HBM3, 88, 91
5, NVIDIA H100 80GB HBM3, 88, 92
# cat gpu_temp.log | grep "HBM3, 89" <<< you can see GPU temperature > 87
3, NVIDIA H100 80GB HBM3, 89, 85
3, NVIDIA H100 80GB HBM3, 89, 81
5, NVIDIA H100 80GB HBM3, 89, 93
5, NVIDIA H100 80GB HBM3, 89, 91
#!/usr/bin/env bash
while true; do
  sudo ipmitool sensor list >> sensor_data.log
  sleep 5   # poll periodically rather than flooding the BMC with requests
done
# cat sensor_data.log | grep Fail | grep GPU
Fail | (1210) GPU Temp | 87C/189F | 5C/41F | 87C/189F |
Fail | (1210) GPU Temp | 87C/189F | 5C/41F | 87C/189F |
Fail | (1210) GPU Temp | 87C/189F | 5C/41F | 87C/189F |
Fail | (1210) GPU Temp | 87C/189F | 5C/41F | 87C/189F |
Fail | (1210) GPU Temp | 88C/190F | 5C/41F | 87C/189F | <<< you can see GPU temperature > 87
Fail | (1210) GPU Temp | 88C/190F | 5C/41F | 87C/189F | <<<
Fail | (1210) GPU Temp | 88C/190F | 5C/41F | 87C/189F | <<<
Fail | (1210) GPU Temp | 88C/190F | 5C/41F | 87C/189F | <<<
Fail | (1210) GPU Temp | 88C/190F | 5C/41F | 87C/189F | <<<
Fail | (1210) GPU Temp | 88C/190F | 5C/41F | 87C/189F | <<<
Fail | (1210) GPU Temp | 88C/190F | 5C/41F | 87C/189F | <<<
Fail | (1210) GPU Temp | 88C/190F | 5C/41F | 87C/189F | <<<
Fail | (1210) GPU Temp | 88C/190F | 5C/41F | 87C/189F | <<<
Fail | (1210) GPU Temp | 87C/189F | 5C/41F | 87C/189F |
Fail | (1210) GPU Temp | 87C/189F | 5C/41F | 87C/189F |
Fail | (1210) GPU Temp | 87C/189F | 5C/41F | 87C/189F |
Fail | (1210) GPU Temp | 87C/189F | 5C/41F | 87C/189F |
Fail | (1210) GPU Temp | 87C/189F | 5C/41F | 87C/189F |
Fail | (1210) GPU Temp | 87C/189F | 5C/41F | 87C/189F |
Fail | (1210) GPU Temp | 87C/189F | 5C/41F | 87C/189F |
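The fixed-string greps above only catch exact readings like "HBM3, 88"; a numeric filter catches any sample over the limit regardless of the exact value. A minimal sketch (the function name and the four-column CSV layout from the nvidia-smi query above are assumptions):

```shell
# Flag any sample where GPU core or memory temperature exceeds a limit.
# Assumes the four-column CSV from the nvidia-smi query above:
#   index, name, temperature.gpu, temperature.memory
flag_over_limit() {
  awk -F',' -v limit="${1:-87}" '
    { gpu = $3 + 0; mem = $4 + 0 }          # numeric conversion strips spaces
    gpu > limit || mem > limit { print "OVER-LIMIT:" $0 }
  '
}

# Usage: flag_over_limit 87 < gpu_temp.log
```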
Root cause analysis
Working with the hardware vendor, the team uncovered the precise mechanism behind the cooling failure. The newer BMC firmware had introduced a fundamentally different fan control algorithm - a "T.Limit"-based thermal management model - replacing the legacy temperature-threshold-based fan curve. However, the SDR (Sensor Data Record), which holds the fan curve data, is intentionally preserved (by default) across BMC firmware upgrades. As a result, after the firmware update the BMC continued operating with the outdated temperature-based fan curve parameters from the previous firmware version.
The change from temperature-based to T.Limit-based fan curve is a fundamental codebase change in the BMC thermal function. The SDR holds the fan curve data and must be cleared for the new curve to take effect. In practical terms, the stale fan curve meant that under the Optimal fan mode, fans did not ramp up aggressively enough to handle the rapid GPU thermal load generated during R3 tests, causing GPUs to exceed 87 °C.
Here is a brief explanation of the difference between temperature-based and T.Limit-based fan curves. In the temperature-based model, fan speeds are tied directly to specific temperature thresholds: as the temperature of a component (like a GPU) increases, the fan speed increases in predefined steps.
- The BMC monitors the GPU temperature.
- When the temperature crosses a threshold (e.g., 60°C, 70°C, 80°C), the fan speed increases accordingly.
- The fan response is reactive - it only ramps up after the temperature rises.
| GPU Temp (°C) | Fan Speed (%) |
|---|---|
| < 60 | 30% |
| 60–70 | 50% |
| 70–80 | 70% |
| > 80 | 100% |
Note: During high-load scenarios (like DCGM R3 stress tests), the temperature can spike rapidly. The fan response may lag, allowing the GPU to overheat before the fans catch up.
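The reactive stepped curve above can be sketched as a simple lookup. This is purely illustrative (the function name and thresholds mirror the example table, not any real firmware implementation):

```shell
# Illustrative step function for the reactive, temperature-based fan curve.
# Thresholds mirror the example table above; real firmware values differ per platform.
fan_speed_reactive() {
  local temp=$1
  if   [ "$temp" -lt 60 ]; then echo 30   # idle: below 60 °C
  elif [ "$temp" -lt 70 ]; then echo 50   # 60-70 °C band
  elif [ "$temp" -lt 80 ]; then echo 70   # 70-80 °C band
  else                          echo 100  # above 80 °C: full speed
  fi
}
```

Because the speed only changes after a threshold is crossed, a rapid ramp from 60 °C to 88 °C under R3 load can outpace the fan response.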
The T.Limit-based fan curve is a newer model that takes a proactive approach. Instead of waiting for the temperature to rise, it uses the GPU’s thermal limit (T.Limit) and workload predictions to adjust fan speeds preemptively.
- The BMC reads the GPU’s T.Limit (e.g., 87°C).
- It monitors power draw, workload intensity, and thermal headroom.
- Fan speed is adjusted dynamically to prevent the GPU from ever approaching the T.Limit.
Example:
| GPU Temp (°C) | Distance from T.Limit (87°C) | Fan Speed (%) | Behavior Description |
|---|---|---|---|
| 45 | 42°C below | 20% | Idle state, minimal cooling needed |
| 60 | 27°C below | 50% | Moderate load detected, fans ramping up |
| 70 | 17°C below | 70% | High load, proactive cooling engaged |
| 80 | 7°C below | 100% | Nearing T.Limit, fans set to max speed |
| 85 | 2°C below | 100% | Critical threshold, full fan speed to prevent overheat |
| 87 | At T.Limit | 100% | Max cooling, risk of thermal throttling |
| >87 | Exceeded | 100% + Throttle | Emergency cooling + GPU throttling initiated |
Note: The system anticipates thermal load and ramps up fans early, preventing overheating during sudden spikes in GPU usage.
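The key difference is that speed is driven by headroom (distance to T.Limit) rather than by absolute thresholds. A rough sketch, again illustrative only (the function name and the 30/20/10 °C headroom cutoffs are assumptions chosen to reproduce the example table, not real firmware values):

```shell
# Illustrative headroom-based curve: fan speed is a function of distance
# to T.Limit rather than fixed absolute temperature thresholds.
fan_speed_tlimit() {
  local temp=$1 tlimit=${2:-87}           # T.Limit defaults to 87 °C
  local headroom=$(( tlimit - temp ))
  if   [ "$headroom" -ge 30 ]; then echo 20   # far from limit: minimal cooling
  elif [ "$headroom" -ge 20 ]; then echo 50   # moderate load: fans ramping
  elif [ "$headroom" -ge 10 ]; then echo 70   # high load: proactive cooling
  else                              echo 100  # near/at/over limit: max cooling
  fi
}
```

With this shape, fans already reach 100% at 80 °C (7 °C of headroom), well before the GPU touches T.Limit.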
Resolution
The SDR preservation behavior is by design. The --overwrite_sdr flag is the correct, intended mechanism for applying new thermal control parameters when the fan curve implementation changes between BMC firmware versions on Supermicro servers. If you do not want to reflash the BMC firmware with the --overwrite_sdr flag, you can try clearing the SDR using ipmitool. In my case, I used the following command to clear the SDR.
ipmitool -I lanplus -H <BMC_IP> -U <USERNAME> -P '<PASSWORD>' raw 0x30 0x44
Note: In a production environment, check with your server hardware vendor before executing raw commands like this; the exact command may also vary between vendors.
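Since raw IPMI commands are destructive and vendor-specific, it can be worth wrapping the call with a dry-run guard so the exact command can be reviewed before it fires. A sketch (the function name and DRY_RUN convention are my own; the raw 0x30 0x44 payload is the Supermicro-specific command shown above):

```shell
# Hedged wrapper around the vendor-specific SDR-clear raw command.
# Set DRY_RUN=1 to print the command instead of executing it.
clear_sdr() {
  local bmc_ip=$1 user=$2 pass=$3
  local cmd=(ipmitool -I lanplus -H "$bmc_ip" -U "$user" -P "$pass" raw 0x30 0x44)
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: ${cmd[*]}"
  else
    "${cmd[@]}"   # executes against the BMC: vendor-specific, use with care
  fi
}

# Usage: DRY_RUN=1 clear_sdr 10.0.0.1 ADMIN 'secret'
```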
After clearing the SDR and performing a server power cycle, the thermal diagnostic warnings were no longer observed, and all dcgmi diag -r 3 tests passed successfully with the fan mode set to Optimal.
# dcgmi diag -r 3
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Metadata ----------+------------------------------------------------|
| DCGM Version | 3.1.8 |
| Driver Version Detected | 550.90.07 |
| GPU Device IDs Detected | 2330,2330,2330,2330,2330,2330,2330,2330 |
|----- Deployment --------+------------------------------------------------|
| Denylist | Pass |
| NVML Library | Pass |
| CUDA Main Library | Pass |
| Permissions and OS Blocks | Pass |
| Persistence Mode | Pass |
| Environment Variables | Pass |
| Page Retirement/Row Remap | Pass |
| Graphics Processes | Pass |
| Inforom | Pass |
+----- Integration -------+------------------------------------------------+
| PCIe | Pass - All |
+----- Hardware ----------+------------------------------------------------+
| GPU Memory | Pass - All |
| Diagnostic | Pass - All |
+----- Stress ------------+------------------------------------------------+
| Targeted Stress | Pass - All |
| Targeted Power | Pass - All |
| Memory Bandwidth | Pass - All |
| EUD Test | Skip - All |
+---------------------------+------------------------------------------------+
Identifying and resolving GPU thermal issues is critical to maintaining system stability and performance, especially under high-load scenarios like training jobs. Left unaddressed, thermal throttling can degrade performance, cause test failures, and even lead to hardware damage or job interruptions. Proactive thermal management ensures reliable operation and maximizes the efficiency of GPU-intensive workloads.
Hope it was useful. Cheers!