In plain terms
DCGM (Data Center GPU Manager) is NVIDIA's official tool for monitoring and managing GPUs in a server environment. It collects detailed health metrics: how busy the GPU is, how hot it's running, how much memory is used, and whether any hardware errors have occurred. Think of it as the diagnostic dashboard for your AI hardware.
What it monitors
- GPU utilisation: % of compute being used
- Memory usage: how much VRAM is occupied vs. free
- Temperature: current and max temperatures
- Power draw: watts consumed vs. TDP
- Hardware errors: ECC errors, retired pages, Xid errors
- PCIe bandwidth: data throughput to/from the GPU
Quick health check: all GPUs
Run dcgmi diag -r 1 for a quick diagnostic that checks all GPUs for common issues. Returns PASS/FAIL per GPU per test.
View current GPU stats
Run dcgmi dmon to see a live feed of GPU metrics (utilisation, memory, temperature, and power for every GPU), updated every second.
Check GPU information and IDs
Run dcgmi discovery -l to list all GPUs on the system with their IDs, names, and PCIe slot information.
Check for hardware errors
Run dcgmi health -g 0 -c (replace 0 with your GPU group ID) to check the health status and see if any GPUs have reported hardware errors. Green = healthy.
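The one-shot commands above can be collected into a small script. The following is a minimal sketch, not an official NVIDIA utility: it assumes dcgmi is on the PATH and that group ID 0 is the group you want to check. The names COMMANDS and run_checks are illustrative. dcgmi dmon is left out because it streams output continuously.

```python
import shutil
import subprocess

# The one-shot commands described above. Group ID 0 is an assumption;
# replace it with your own GPU group ID.
COMMANDS = {
    "discovery": ["dcgmi", "discovery", "-l"],
    "quick_diag": ["dcgmi", "diag", "-r", "1"],
    "health": ["dcgmi", "health", "-g", "0", "-c"],
}


def run_checks(commands=COMMANDS):
    """Run each command and return {name: (exit_code, stdout)}.

    Returns an empty dict when dcgmi is not installed.
    """
    if shutil.which("dcgmi") is None:
        return {}
    results = {}
    for name, argv in commands.items():
        completed = subprocess.run(argv, capture_output=True, text=True)
        results[name] = (completed.returncode, completed.stdout)
    return results
```

Calling run_checks() and printing each entry gives the same text you would see by running the commands by hand, which makes it easy to save the output alongside a support ticket.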
GPU Utilisation
0–10%: GPU is idle, no model is actively processing
50–80%: Normal for active inference (queries being processed)
90–100%: Heavy load: training or many concurrent queries
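The utilisation bands above can be expressed as a small lookup. This is an illustrative sketch only; the function name classify_utilisation is hypothetical. The guide does not describe the 10–50% and 80–90% ranges, so the sketch reports them as "unclassified" rather than guessing.

```python
def classify_utilisation(percent: float) -> str:
    """Map a GPU utilisation percentage to the bands listed above."""
    if not 0 <= percent <= 100:
        raise ValueError("utilisation must be between 0 and 100")
    if percent <= 10:
        return "idle"
    if 50 <= percent <= 80:
        return "active inference"
    if percent >= 90:
        return "heavy load"
    # The guide leaves 10-50% and 80-90% undefined.
    return "unclassified"
```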
Memory Usage
Low memory free: A model is loaded and ready to serve
High memory free: No model loaded; run ollama ps to check
Memory full + errors: Model too large for available VRAM
Temperature
Under 75°C: Normal operating range
75–85°C: Warm but acceptable under load
Over 85°C: Check airflow and cooling; contact support
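The temperature thresholds above translate directly into a simple check. This is a minimal sketch with a hypothetical function name, classify_temperature. It assumes the reading is in degrees Celsius and treats exactly 75°C and 85°C as part of the "warm" band.

```python
def classify_temperature(celsius: float) -> str:
    """Map a GPU temperature in degrees Celsius to the ranges listed above."""
    if celsius < 75:
        return "normal"
    if celsius <= 85:
        return "warm but acceptable under load"
    return "check airflow and cooling; contact support"
```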
Hardware Errors
ECC errors (corrected): Normal; GPU is self-correcting
ECC errors (uncorrected): Investigate immediately
Xid errors in logs: Hardware or driver issue; contact support
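The error guidance above can be sketched as a triage function. The ErrorCounts structure and its field names are hypothetical: they stand in for counts you have already gathered from DCGM or system logs, and do not correspond to any DCGM API.

```python
from dataclasses import dataclass


@dataclass
class ErrorCounts:
    """Hypothetical container for error counts collected elsewhere."""
    corrected_ecc: int = 0
    uncorrected_ecc: int = 0
    xid_events: int = 0


def triage(counts: ErrorCounts) -> list[str]:
    """Return the actions suggested by the guidance above."""
    actions = []
    if counts.uncorrected_ecc > 0:
        actions.append("investigate immediately")
    if counts.xid_events > 0:
        actions.append("contact support")
    # Corrected ECC errors alone are normal and need no action.
    if not actions:
        actions.append("no action needed")
    return actions
```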