
DCGM

NVIDIA's GPU health and monitoring tool. It tracks utilisation, temperature, memory, and hardware errors across all GPUs.
Run via SSH terminal
What is it?

In plain terms

DCGM (Data Center GPU Manager) is NVIDIA's official tool for monitoring and managing GPUs in a server environment. It collects detailed health metrics: how busy the GPU is, how hot it's running, how much memory is used, and whether any hardware errors have occurred. Think of it as the diagnostic dashboard for your AI hardware.

What it monitors

  • GPU utilisation: percentage of compute in use
  • Memory usage: how much VRAM is occupied vs. free
  • Temperature: current and max temperatures
  • Power draw: watts consumed vs. TDP
  • Hardware errors: ECC errors, retired pages, Xid errors
  • PCIe bandwidth: data throughput to/from the GPU

How to use it
1. Quick health check (all GPUs)

Run dcgmi diag -r 1 for a quick diagnostic that checks all GPUs for common issues. It returns PASS/FAIL per GPU per test.
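If you want to script the quick check rather than read the table by eye, the following is a minimal Python sketch. It assumes dcgmi is on the PATH and that results appear as the words Pass or Fail in the output; the exact table layout varies between DCGM versions, so the helper names run_quick_diag and count_results are illustrative only.

```python
import re
import subprocess
from collections import Counter


def run_quick_diag() -> str:
    """Run the level-1 diagnostic from the step above and return its raw output."""
    proc = subprocess.run(
        ["dcgmi", "diag", "-r", "1"],
        capture_output=True,
        text=True,
        check=False,  # inspect the output ourselves instead of raising
    )
    return proc.stdout


def count_results(output: str) -> Counter:
    """Count whole-word Pass/Fail tokens in diagnostic output, ignoring case."""
    words = re.findall(r"\b(pass|fail)\b", output, flags=re.IGNORECASE)
    return Counter(word.lower() for word in words)
```

Calling count_results(run_quick_diag()) on the GPU host gives a tally you can alert on; any non-zero "fail" count is worth a closer look at the full output.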

2. View current GPU stats

Run dcgmi dmon to see a live feed of GPU metrics (utilisation, memory, temperature, and power for every GPU), updated every second.
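To watch only specific metrics, dcgmi dmon accepts a list of field IDs (-e), a delay in milliseconds (-d), and a sample count (-c). The sketch below builds such a command in Python. The field IDs are the ones I understand DCGM to use for these metrics; treat them as assumptions and confirm them with dcgmi dmon -l on your system. The names FIELD_IDS and dmon_command are hypothetical.

```python
import subprocess

# Assumed DCGM field IDs; verify against `dcgmi dmon -l` on your host.
FIELD_IDS = {
    "gpu_temp_c": 150,    # GPU temperature
    "power_w": 155,       # power draw
    "gpu_util_pct": 203,  # GPU utilisation
    "fb_used_mib": 252,   # framebuffer (VRAM) used
}


def dmon_command(fields, delay_ms=1000, samples=None):
    """Build a `dcgmi dmon` argument list for the named metrics."""
    ids = ",".join(str(FIELD_IDS[name]) for name in fields)
    cmd = ["dcgmi", "dmon", "-e", ids, "-d", str(delay_ms)]
    if samples is not None:
        cmd += ["-c", str(samples)]  # stop after this many samples
    return cmd


def run_dmon(fields, delay_ms=1000, samples=5) -> str:
    """Run the command and return the raw text output (requires dcgmi)."""
    cmd = dmon_command(fields, delay_ms, samples)
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout
```

For example, dmon_command(["gpu_util_pct", "gpu_temp_c"], samples=5) yields the equivalent of dcgmi dmon -e 203,150 -d 1000 -c 5.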

3. Check GPU information and IDs

Run dcgmi discovery -l to list all GPUs on the system with their IDs, names, and PCI bus IDs.

4. Check for hardware errors

Run dcgmi health -g 0 -c (replace 0 with your GPU group ID) to check the health status and see if any GPUs have reported hardware errors. Green = healthy.
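One detail worth knowing: as far as I understand DCGM, the health check only reports on watches that have been enabled for the group, which is done with dcgmi health -g <group> -s a (all watches). The sketch below pairs the two commands; health_commands and check_group_health are hypothetical helper names, and the group ID is whatever your system uses.

```python
import subprocess


def health_commands(group_id: int):
    """Return the enable-watches command followed by the check command."""
    group = str(group_id)
    return [
        ["dcgmi", "health", "-g", group, "-s", "a"],  # enable all health watches
        ["dcgmi", "health", "-g", group, "-c"],       # run the health check
    ]


def check_group_health(group_id: int) -> str:
    """Run both commands in order and return the check's raw output (requires dcgmi)."""
    output = ""
    for cmd in health_commands(group_id):
        output = subprocess.run(cmd, capture_output=True, text=True, check=False).stdout
    return output
```

Enabling watches is normally a one-off per group; after that, the check command alone is enough.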

Tip: For a quick, visual live view of GPU usage, use nvtop instead; it's easier to read for day-to-day checks. DCGM is better for detailed diagnostics, historical data, and integration with monitoring systems.

What the numbers mean

GPU Utilisation

0–10%: GPU is idle, no model is actively processing
50–80%: Normal for active inference (queries being processed)
90–100%: Heavy load, such as training or many concurrent queries
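The utilisation bands above can be turned into a small lookup for scripts or alerts. This is a sketch only: the function name and labels are made up for illustration, and values that fall between the listed bands (for example 30% or 85%) are reported as "unclassified" because the guide does not assign them a meaning.

```python
def classify_utilisation(percent: float) -> str:
    """Map a GPU utilisation percentage to the bands listed in this guide."""
    if not 0 <= percent <= 100:
        raise ValueError(f"utilisation must be between 0 and 100, got {percent}")
    if percent <= 10:
        return "idle"
    if 50 <= percent <= 80:
        return "active inference"
    if percent >= 90:
        return "heavy load"
    # The guide leaves the 10-50% and 80-90% ranges undefined.
    return "unclassified"
```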

Memory Usage

Low memory free: A model is loaded and ready to serve
High memory free: No model loaded; run ollama ps to check
Memory full + errors: Model too large for available VRAM

Temperature

Under 75°C: Normal operating range
75–85°C: Warm but acceptable under load
Over 85°C: Check airflow and cooling; contact support
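If you collect per-GPU temperatures (for example from dcgmi dmon), the thresholds above are easy to apply across a whole server. The sketch below assumes you already have readings as a mapping of GPU ID to degrees Celsius; how you obtain them is up to you, and both function names are illustrative.

```python
def classify_temperature(celsius: float) -> str:
    """Apply the guide's temperature thresholds to a single reading."""
    if celsius < 75:
        return "normal"
    if celsius <= 85:
        return "warm"
    return "check cooling"


def gpus_needing_attention(readings: dict) -> list:
    """Return the IDs of GPUs running over 85 degrees C, in ascending order."""
    return sorted(
        gpu_id
        for gpu_id, celsius in readings.items()
        if classify_temperature(celsius) == "check cooling"
    )
```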

Hardware Errors

ECC errors (corrected): Normal; the GPU is self-correcting
ECC errors (uncorrected): Investigate immediately
Xid errors in logs: Hardware or driver issue; contact support
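Xid errors are written to the kernel log by the NVIDIA driver, typically on lines containing "NVRM: Xid" followed by the PCI device and a numeric code. The sketch below pulls those two values out of log lines so you can quote them to support. It assumes that common line format; how you read the log (dmesg, journalctl, a log file) is left to you, and find_xid_events is a hypothetical helper name.

```python
import re

# Matches lines such as: "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")


def find_xid_events(log_lines):
    """Return (device, xid_code) tuples for every Xid line in the input."""
    events = []
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group("device"), int(match.group("code"))))
    return events
```

An empty result means no Xid lines were found in the text you supplied, not that the hardware is guaranteed healthy; older log entries may already have rotated out.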

Works with
nvtop, Ollama