DCGM — Nexus One AI Portal

What is it?

In plain terms

DCGM (Data Center GPU Manager) is NVIDIA's official tool for monitoring and managing GPUs in a server environment. It collects detailed health metrics — how busy the GPU is, how hot it's running, how much memory is used, and whether any hardware errors have occurred. Think of it as the diagnostic dashboard for your AI hardware.

What it monitors

GPU utilisation — % of compute being used
Memory usage — how much VRAM is occupied vs. free
Temperature — current and max temperatures
Power draw — watts consumed vs. TDP
Hardware errors — ECC errors, retired pages, Xid errors
PCIe bandwidth — data throughput to/from the GPU

How to use it

Quick health check — all GPUs

Run dcgmi diag -r 1 for a quick diagnostic that checks all GPUs for common issues. It reports a pass or fail result for each test it runs. Level 1 is the shortest run; higher levels run longer, more thorough tests.

View current GPU stats

Run dcgmi dmon to see a live feed of GPU metrics — utilisation, memory, temperature, and power for every GPU, updated every second.

Check GPU information and IDs

Run dcgmi discovery -l to list all GPUs on the system with their GPU IDs, names, PCI bus IDs, and device UUIDs.

Check for hardware errors

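Run dcgmi health -g 0 -c (replace 0 with your GPU group ID) to check the health status and see whether any GPU has reported hardware errors. A healthy result (green) means no problems were detected. If health watches have not yet been enabled for the group, enable them first with dcgmi health -g 0 -s a, otherwise the check has nothing to report on.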

💡 For a quick, visual live view of GPU usage, use nvtop instead — it's easier to read for day-to-day checks. DCGM is better for detailed diagnostics, historical data, and integration with monitoring systems.
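
If these checks need to run from a script rather than by hand, the commands in this section can be wrapped as in the minimal sketch below. The command arguments are the ones given above; the dictionary keys, function names, and timeout are hypothetical choices made for this example. dcgmi dmon is left out because it is a continuous live feed rather than a one-shot check.

```python
import shutil
import subprocess
from typing import List, Optional

# Commands exactly as given in this section. The keys are hypothetical
# labels chosen for this sketch, not DCGM terminology.
DCGM_COMMANDS = {
    "quick_diag": ["dcgmi", "diag", "-r", "1"],
    "list_gpus": ["dcgmi", "discovery", "-l"],
}


def health_check_command(group_id: int = 0) -> List[str]:
    """Build the health-check command for one GPU group ID."""
    if group_id < 0:
        raise ValueError("group_id must be non-negative")
    return ["dcgmi", "health", "-g", str(group_id), "-c"]


def run_check(command: List[str], timeout_s: float = 120.0) -> Optional[str]:
    """Run one command and return its text output.

    Returns None when the executable is not on PATH, so the caller can
    tell "DCGM is not installed" apart from an empty report.
    """
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout_s
    )
    return result.stdout


if __name__ == "__main__":
    report = run_check(health_check_command(0))
    print(report if report is not None else "dcgmi was not found on this machine")
```

The sketch returns the raw text and does not try to parse it, because the exact layout of dcgmi output can differ between DCGM versions.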

What the numbers mean

GPU Utilisation

0–10% — GPU is idle, no model is actively processing
50–80% — Normal for active inference (queries being processed)
90–100% — Heavy load: training or many concurrent queries
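
The utilisation bands above can be written as a small lookup, sketched below. The function name and the returned labels are hypothetical. The list does not say how to read 10–50% or 80–90%, so the sketch reports those readings as falling between bands rather than guessing.

```python
def describe_utilisation(percent: float) -> str:
    """Map a GPU utilisation percentage to the bands listed above."""
    if not 0 <= percent <= 100:
        raise ValueError("utilisation must be between 0 and 100")
    if percent <= 10:
        return "idle"
    if 50 <= percent <= 80:
        return "active inference"
    if percent >= 90:
        return "heavy load"
    # 10-50% and 80-90% are not covered by the list, so no label is assigned.
    return "between bands"


if __name__ == "__main__":
    for reading in (3, 30, 65, 97):
        print(reading, describe_utilisation(reading))
```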

Memory Usage

Low memory free — A model is loaded and ready to serve
High memory free — No model loaded; run ollama ps to check
Memory full + errors — Model too large for available VRAM

Temperature

Under 75°C — Normal operating range
75–85°C — Warm but acceptable under load
Over 85°C — Check airflow and cooling; contact support
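
The same thresholds can be applied in a script, for example to flag a GPU in a nightly report. This is a minimal sketch: the function name and labels are hypothetical, and readings of exactly 75°C and 85°C are assumed to belong to the middle band.

```python
def describe_temperature(celsius: float) -> str:
    """Map a GPU temperature in degrees Celsius to the ranges listed above."""
    if celsius < 75:
        return "normal"
    if celsius <= 85:
        return "warm but acceptable under load"
    return "too hot: check airflow and cooling, contact support"


if __name__ == "__main__":
    for reading in (62, 80, 91):
        print(reading, describe_temperature(reading))
```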

Hardware Errors

ECC errors (corrected) — Normal; GPU is self-correcting
ECC errors (uncorrected) — Investigate immediately
Xid errors in logs — Hardware or driver issue; contact support
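
The triage rules above can be sketched as a function that turns error counts into recommended actions. The function name, parameter names, and message wording are hypothetical, the counts are assumed to have been collected already (for example, read from a health report), and listing uncorrected ECC errors ahead of Xid errors is an assumption of this sketch rather than something the list states.

```python
from typing import List


def triage_hardware_errors(
    corrected_ecc: int, uncorrected_ecc: int, xid_events: int
) -> List[str]:
    """Return recommended actions for the given error counts, most urgent first."""
    counts = {
        "corrected_ecc": corrected_ecc,
        "uncorrected_ecc": uncorrected_ecc,
        "xid_events": xid_events,
    }
    for name, value in counts.items():
        if value < 0:
            raise ValueError(f"{name} must be non-negative")

    actions: List[str] = []
    if uncorrected_ecc > 0:
        actions.append("investigate immediately: uncorrected ECC errors")
    if xid_events > 0:
        actions.append("contact support: Xid errors (hardware or driver issue)")
    if corrected_ecc > 0:
        actions.append("no action: corrected ECC errors are normal")
    return actions


if __name__ == "__main__":
    for action in triage_hardware_errors(corrected_ecc=4, uncorrected_ecc=0, xid_events=1):
        print(action)
```

An empty list means no errors were reported.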

Works with

📈 nvtop
🦙 Ollama