Showing posts with label memory. Show all posts

Friday, November 14, 2025

Working with GPUs - Part2 - Memory fault indicators

AI workloads rely heavily on GPU memory reliability, and memory faults can silently degrade performance long before a GPU outright fails. Understanding which signals truly indicate GPU memory issues, and how to act on them, is essential for stable operations at scale. This post focuses on authoritative memory fault indicators on NVIDIA H100 GPUs, how they differ, and how to use them together to make correct operational decisions.

HBM3 (High Bandwidth Memory 3) memory on H100 delivers massive bandwidth, but it operates under extreme thermal, electrical, and utilization stress. When memory reliability starts degrading:

  • Model training can fail intermittently
  • NCCL performance may collapse unexpectedly
  • Silent data corruption risks increase
  • Faulty GPUs can impact entire multi‑GPU jobs

Early detection lets you drain, reset, isolate, or RMA a GPU before a customer‑visible incident occurs.

Primary indicators

ECC errors

ECC (Error Correcting Code) detects and reports bit‑level memory errors in HBM3.

ECC error types:

  • Correctable Errors (CE)
    • Single‑bit errors fixed automatically
    • Indicate beginning of memory stress or aging
  • Uncorrectable Errors (UE)
    • Multi‑bit errors not recoverable
    • High risk of data corruption
    • Immediate GPU isolation required

Check ECC mode and error counts:

nvidia-smi -q -d ECC

Query correctable (CE) volatile error counts per GPU:

nvidia-smi --query-gpu=index,name,ecc.errors.corrected.volatile.device_memory,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total --format=csv

Query uncorrectable (UE) volatile error counts per GPU:

nvidia-smi --query-gpu=index,name,ecc.errors.uncorrected.volatile.device_memory,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total --format=csv


Notes

  • Rising CE counts are an early warning. Monitor them closely.
  • If any UE count is greater than 0, drain the workload, isolate the GPU, and investigate before returning it to service.
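As an illustration, the CSV output of the uncorrectable-error query above could be parsed for alerting. This is my own sketch, not an official tool; the function name and parsing details are mine, and the column header is assumed to match the query field.

```python
import csv
import io

def gpus_with_uncorrectable_errors(csv_text):
    """Return (index, name) pairs for GPUs whose uncorrected volatile ECC
    total is greater than zero, given the CSV output of the query above."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [h.strip() for h in next(reader)]
    # Some nvidia-smi CSV headers append a unit suffix, so match on the prefix.
    total_col = next(i for i, h in enumerate(header)
                     if h.startswith("ecc.errors.uncorrected.volatile.total"))
    flagged = []
    for row in reader:
        total = row[total_col].strip()
        # nvidia-smi prints "[N/A]" when ECC is disabled; skip such rows here
        if total.isdigit() and int(total) > 0:
            flagged.append((row[0].strip(), row[1].strip()))
    return flagged
```

Any GPU this returns should be drained and isolated per the notes above.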

Remapped Rows

Row Remapping is a hardware healing mechanism in H100 HBM3.

When the GPU identifies failing memory rows:

  • Faulty rows are permanently retired
  • Spare rows are mapped in their place

Check row-remapper status:

nvidia-smi -q -d ROW_REMAPPER


Notes
  • Remapped Rows Pending = Yes
    • GPU detected bad memory rows
    • Reset required to complete remap
  • Remapped Rows > 0
    • Hardware has already consumed spare memory
    • Strong early RMA signal

Row remapping is the earliest and strongest indicator of degrading HBM3 memory, often appearing before serious ECC failures.
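A rough sketch of automating this check by parsing the `nvidia-smi -q -d ROW_REMAPPER` text output. The field names ("Pending", "Correctable Error", "Uncorrectable Error") and the return labels are my assumptions; verify them against your driver version before relying on this.

```python
def row_remap_status(nvsmi_text):
    """Classify row-remapper health from `nvidia-smi -q -d ROW_REMAPPER` text."""
    fields = {}
    for line in nvsmi_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    if fields.get("Pending") == "Yes":
        return "reset-required"      # bad rows detected; a GPU reset completes the remap
    consumed = sum(int(fields[k])
                   for k in ("Correctable Error", "Uncorrectable Error")
                   if fields.get(k, "").isdigit())
    if consumed > 0:
        return "rma-candidate"       # spare rows already consumed; early RMA signal
    return "healthy"
```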

Hope it was useful. Cheers!

Monday, July 1, 2024

vSphere with Tanzu using NSX-T - Part34 - CPU and Memory utilization of a supervisor cluster

vSphere with Tanzu is a Kubernetes-based platform for deploying and managing containerized applications. As with any cloud-native platform, it's essential to monitor the performance and utilization of the underlying infrastructure to ensure optimal resource allocation and avoid potential issues. In this blog post, we'll explore a Python script that can be used to check the CPU and memory allocation/usage of a WCP Supervisor cluster.


You can access the Python script from my GitHub repository: https://github.com/vineethac/VMware/tree/main/vSphere_with_Tanzu/wcp_cluster_util


Sample screenshot of the output


The script uses the Kubernetes Python client library (kubernetes) to connect to the Supervisor cluster using the admin kubeconfig and retrieve information about the nodes and their resource utilization. The script then calculates the average CPU and memory utilization across all nodes and prints the results to the console.
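The repository script does the full job; as a rough sketch of the core calculation only, node usage and capacity quantities can be combined like this. The function names and the simplified quantity parser below are my own, not taken from the script, and the parser covers only a subset of the real Kubernetes quantity grammar.

```python
def parse_quantity(q):
    """Parse a simplified subset of Kubernetes resource quantities:
    "250m" (CPU millicores), "64Gi"/"512Mi" (memory), or plain numbers.
    The real quantity grammar has more suffixes; this covers common cases."""
    suffixes = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
    for s, mult in suffixes.items():
        if q.endswith(s):
            return int(q[:-len(s)]) * mult
    if q.endswith("m"):
        return int(q[:-1]) / 1000.0
    return float(q)

def average_utilization(usages, capacities):
    """Average percent utilization across nodes, given parallel lists of
    usage and capacity quantity strings (e.g. from metrics and node status)."""
    percents = [100.0 * parse_quantity(u) / parse_quantity(c)
                for u, c in zip(usages, capacities)]
    return sum(percents) / len(percents)
```

For example, two nodes with 16 CPUs each, running at 8 and 4 cores of load, average out to 37.5% CPU utilization.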

Note: In my case, instead of running it as a script every time, I made it an executable plugin and copied it to a directory on the system executable path. I placed it in $HOME/.krew/bin on my laptop.

Hope it was useful. Cheers!

Saturday, August 5, 2023

vSphere with Tanzu using NSX-T - Part28 - Create a custom VM Class

A VM class is a template that defines CPU, memory, and reservations for VMs. If you want to create a custom vmclass, you can use dcli or the vSphere UI.

Following is an example using dcli:

❯ dcli +server vcenter-server-fqdn +skip-server-verification com vmware vcenter namespacemanagement virtualmachineclasses create --id best-effort-16xlarge --cpu-count 64 --memory-mb 131072

This will create a vmclass with 64 vCPUs and 128GB memory with no reservations.

❯ dcli +server vcenter-server-fqdn +skip-server-verification com vmware vcenter namespacemanagement virtualmachineclasses create --id guaranteed-16xlarge --cpu-count 64 --memory-mb 131072 --cpu-reservation 100 --memory-reservation 100

This will create a vmclass with 64 vCPUs and 128GB memory with 100% reservations.
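The --memory-mb values in the commands above are binary multiples of 1024 (128 GB = 131072 MB). A quick helper for other sizes; the function name is my own, purely illustrative:

```python
def memory_mb_for_gb(gb):
    """Convert a GB memory size to the value expected by --memory-mb,
    using binary multiples: 128 GB -> 128 * 1024 = 131072."""
    return gb * 1024
```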

Note: You will need to attach this newly created vmclass to a supervisor namespace to use it.

Here is the documentation reference: https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-with-tanzu-services-workloads/GUID-18C7B2E3-BCF5-488C-9C50-937E29BB0C48.html

Hope it was useful. Cheers!

Saturday, September 19, 2020

Performance monitoring in Linux

CPU

cd /proc/

cat cpuinfo

less cpuinfo

grep processor cpuinfo

uptime

The load average shows the average system load over the last 1 min, 5 min, and 15 min. The interpretation of the load avg value is given below.

For a single processor system:

Load avg value 1.0 = 100% CPU capacity usage

Load avg value 0.5 = 50% CPU capacity usage


This means for a 4 processor system:

Load avg value 4.0 = 100% CPU capacity usage

Load avg value 2.0 = 50% CPU capacity usage

Load avg value 1.0 = 25% CPU capacity usage
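The arithmetic above can be captured in a small helper (the function name is illustrative, not a standard API): load divided by CPU count gives the fraction of total capacity in use.

```python
import os

def load_to_percent(load_avg, cpu_count=None):
    """Convert a load-average value into percent of total CPU capacity:
    a load equal to the number of CPUs corresponds to 100%."""
    if cpu_count is None:
        cpu_count = os.cpu_count()  # default to this machine's CPU count
    return 100.0 * load_avg / cpu_count
```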


top

To get details of a process: ps -p <process ID> -f (or ps aux | grep <process name>)

To get logs of a process: journalctl _PID=<process ID>


Memory


cat /proc/meminfo
less /proc/meminfo


free -h
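The fields in /proc/meminfo can also be combined into a single used-memory percentage, the same figure free derives. A minimal sketch (the function name is my own; values in meminfo are reported in kB):

```python
def meminfo_used_percent(meminfo_text):
    """Percent of RAM in use, computed as (MemTotal - MemAvailable) / MemTotal
    from the text contents of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and rest.split():
            # keep only the numeric kB value, dropping the unit suffix
            values[key.strip()] = int(rest.split()[0])
    total = values["MemTotal"]
    available = values["MemAvailable"]
    return 100.0 * (total - available) / total
```

On a live system you would pass it open("/proc/meminfo").read().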


To view average memory usage sampled at regular intervals:

vmstat <interval> <number of samples>

Hope it was useful. Cheers!