<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>DCGM — Nexus One AI Portal</title>
  <link rel="stylesheet" href="style.css?v=4">
</head>
<body>

<header class="topnav">
  <a href="index.html" class="brand">Nexus One <span>AI</span></a>
  <nav>
    <a href="index.html">Home</a>
    <a href="quickstart.html">Quick Start</a>
    <a href="prompts.html">Prompt Library</a>
    <a href="usecases.html">Use Cases</a>
    <span class="nav-sep"></span>
    <div class="nav-dropdown">
      <button class="nav-drop-btn">Help ▾</button>
      <div class="nav-drop-menu">
        <span class="nav-drop-cat">LEARN /</span>
        <a href="quickstart.html">Quick Start</a>
        <a href="models.html">Models</a>
        <span class="nav-drop-cat">SUPPORT /</span>
        <a href="troubleshooting.html">Troubleshoot</a>
        <a href="faq.html">FAQ</a>
        <span class="nav-drop-cat">MORE /</span>
        <a href="glossary.html">Glossary</a>
        <a href="whats-new.html">What's New</a>
      </div>
    </div>
    <div class="nav-dropdown">
      <button class="nav-drop-btn">Admin ▾</button>
      <div class="nav-drop-menu nav-drop-menu-wide">
        <span class="nav-drop-cat">DOCS /</span>
        <a href="security.html">Security &amp; Privacy</a>
        <a href="admin.html">Admin Guide</a>
        <span class="nav-drop-cat">MONITOR /</span>
        <a href="dashboard.html">Dashboard</a>
        <a href="analytics.html">Usage Analytics</a>
        <a href="audit.html">Audit Log</a>
        <a href="feedback.html">Feedback &amp; Ratings</a>
        <span class="nav-drop-cat">MANAGE /</span>
        <a href="users.html">Users</a>
        <a href="teams.html">Teams</a>
        <a href="models-admin.html">Model Manager</a>
        <a href="training.html">Training</a>
        <a href="knowledge.html">Knowledge Base</a>
        <span class="nav-drop-cat">TOOLS /</span>
        <a href="apikeys.html">API Keys</a>
        <a href="benchmark.html">Benchmarking</a>
        <a href="model-compare.html">Model Compare</a>
        <a href="api-playground.html">API Playground</a>
        <a href="guardrails.html">Guardrails</a>
        <a href="rag-quality.html">RAG Quality</a>
        <a href="router.html">Model Router</a>
        <a href="connectors.html">Connectors</a>
        <span class="nav-drop-cat">SYSTEM /</span>
        <a href="console.html">Console</a>
        <a href="settings.html">Settings</a>
      </div>
    </div>
    <div class="nav-dropdown">
      <button class="nav-drop-btn">AI Tools ▾</button>
      <div class="nav-drop-menu">
        <span class="nav-drop-cat">INTELLIGENCE /</span>
        <a href="documents.html">Document Intelligence</a>
        <a href="chat-multi.html">Multimodal Chat</a>
        <a href="prompt-studio.html">Prompt Studio</a>
        <a href="meeting.html">Meeting Assistant</a>
        <span class="nav-drop-cat">AUTOMATION /</span>
        <a href="agents.html">Agent Builder</a>
        <a href="schedules.html">Scheduled Jobs</a>
        <a href="workflows.html">Workflow Automation</a>
        <span class="nav-drop-cat">QUALITY /</span>
        <a href="evals.html">AI Eval Suite</a>
        <a href="chatrooms.html">Chat Rooms</a>
      </div>
    </div>
  </nav>
  <a href="notifications.html" style="position:relative">🔔</a>
  <span class="badge" data-brand="tier">Basic Tier</span>
  <div id="nav-org-logo" class="nav-org-logo"></div>
</header>

<div class="tool-hero">
  <div class="tool-icon">📡</div>
  <div class="tool-meta">
    <a href="index.html#tools" class="back-link">← All Tools</a>
    <h1>DCGM</h1>
    <div class="tagline">NVIDIA's GPU health and monitoring tool — tracks utilisation, temperature, memory, and hardware errors across all GPUs.</div>
    <div class="hero-actions">
      <button class="btn-primary" onclick="copyCmd('dcgmi diag -r 1', this)">Copy Health Check Command</button>
      <button class="btn-secondary" onclick="copyCmd('dcgmi dmon', this)">Copy Live Monitor Command</button>
      <span class="access-pill">💻 Run via SSH terminal</span>
    </div>
  </div>
</div>
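<script>
// Copy a shell command to the clipboard and briefly show a confirmation on the button.
function copyCmd(cmd, btn) {
  navigator.clipboard.writeText(cmd).then(() => {
    // If a confirmation is already showing, do nothing; otherwise the
    // confirmation text would be saved as the button's original label.
    if (btn.dataset.copied) return;
    btn.dataset.copied = 'true';
    const orig = btn.textContent;
    btn.textContent = '✓ Copied!';
    btn.style.background = '#059669';
    // Restore the original label and colour after two seconds.
    setTimeout(() => { btn.textContent = orig; btn.style.background = ''; delete btn.dataset.copied; }, 2000);
  });
}
</script>
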
<div class="content">

<div class="section-title">What is it?</div>
<div class="info-grid">
  <div class="info-card">
    <h4>In plain terms</h4>
    <p>DCGM (Data Center GPU Manager) is NVIDIA's official tool for monitoring and managing GPUs in a server environment. It collects detailed health metrics — how busy the GPU is, how hot it's running, how much memory is used, and whether any hardware errors have occurred. Think of it as the diagnostic dashboard for your AI hardware.</p>
  </div>
  <div class="info-card">
    <h4>What it monitors</h4>
    <ul>
      <li><strong>GPU utilisation</strong> — % of compute being used</li>
      <li><strong>Memory usage</strong> — how much VRAM is occupied vs. free</li>
      <li><strong>Temperature</strong> — current and max temperatures</li>
      <li><strong>Power draw</strong> — watts consumed vs. TDP</li>
      <li><strong>Hardware errors</strong> — ECC errors, retired pages, Xid errors</li>
      <li><strong>PCIe bandwidth</strong> — data throughput to/from the GPU</li>
    </ul>
  </div>
</div>
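<p>As an illustration, the first four metrics listed above can be watched from the terminal by passing field IDs to <code>dcgmi dmon</code>. This is a minimal sketch rather than a recommended setup: the function name <code>watch_gpu_metrics</code> is hypothetical, and the field IDs (203 GPU utilisation, 252 framebuffer memory used, 150 GPU temperature, 155 power usage) are assumed from DCGM's field list, so confirm them with <code>dcgmi dmon -l</code> on your system.</p>

```shell
#!/bin/sh
# Sketch: print a few samples of core GPU metrics with dcgmi dmon.
# The field IDs below are assumptions; verify them with `dcgmi dmon -l`.
#   203 = GPU utilisation, 252 = framebuffer memory used,
#   150 = GPU temperature, 155 = power usage
FIELDS="203,252,150,155"

watch_gpu_metrics() {
    # $1 = number of samples (default 5), $2 = delay in milliseconds (default 1000)
    samples="${1:-5}"
    delay_ms="${2:-1000}"
    if ! command -v dcgmi >/dev/null 2>&1; then
        echo "dcgmi not found; is DCGM installed on this host?"
        return 1
    fi
    dcgmi dmon -e "$FIELDS" -d "$delay_ms" -c "$samples"
}
```

<p>Calling <code>watch_gpu_metrics 10 2000</code> would request ten samples, two seconds apart, and then exit.</p>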

<div class="section-title">How to use it</div>
<div class="steps">
  <div class="step">
    <div class="step-num">1</div>
    <div>
      <h4>Quick health check — all GPUs</h4>
      <p>Run <code>dcgmi diag -r 1</code> for a quick diagnostic that checks all GPUs for common problems. It normally finishes within seconds and reports a result (such as Pass, Fail, or Skip) for each test it runs.</p>
    </div>
  </div>
  <div class="step">
    <div class="step-num">2</div>
    <div>
      <h4>View current GPU stats</h4>
      <p>Run <code>dcgmi dmon</code> to see a live feed of GPU metrics such as utilisation, memory, temperature, and power for every GPU. The feed refreshes once per second by default; add <code>-d</code> with an interval in milliseconds to change that, and press Ctrl+C to stop. Some DCGM versions may ask you to name the fields to display; in that case, add <code>-e</code> with a comma-separated list of field IDs (<code>dcgmi dmon -l</code> lists them).</p>
    </div>
  </div>
  <div class="step">
    <div class="step-num">3</div>
    <div>
      <h4>Check GPU information and IDs</h4>
      <p>Run <code>dcgmi discovery -l</code> to list all GPUs on the system with their IDs, names, PCI bus IDs, and UUIDs.</p>
    </div>
  </div>
  <div class="step">
    <div class="step-num">4</div>
    <div>
      <h4>Check for hardware errors</h4>
      <p>Run <code>dcgmi health -g 0 -c</code> (replace 0 with your GPU group ID) to check the health status and see whether any GPU has reported hardware errors. An overall result of Healthy means no problems were found. The check only covers health watches that are enabled; to enable all of them for the group, run <code>dcgmi health -g 0 -s a</code> first.</p>
    </div>
  </div>
</div>
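<p>Steps 1, 3, and 4 above can be strung together into one routine check. The sketch below assumes that group ID 0 is the default group covering all GPUs and that enabling every health watch with <code>-s a</code> is acceptable on your system; the function name <code>gpu_checkup</code> is hypothetical.</p>

```shell
#!/bin/sh
# Sketch: run the discovery, diagnostic, and health steps in order,
# stopping at the first command that exits with a non-zero status.
# Assumes group 0 is the default group containing all GPUs.
gpu_checkup() {
    group_id="${1:-0}"
    if ! command -v dcgmi >/dev/null 2>&1; then
        echo "dcgmi not found; is DCGM installed on this host?"
        return 1
    fi
    echo "== GPUs on this system =="
    dcgmi discovery -l || return 1
    echo "== Quick diagnostic (level 1) =="
    dcgmi diag -r 1 || return 1
    echo "== Health check for group $group_id =="
    dcgmi health -g "$group_id" -s a || return 1   # enable all health watches
    dcgmi health -g "$group_id" -c || return 1     # then check them
}
```

<p>Run <code>gpu_checkup</code> for the default group, or pass a group ID such as <code>gpu_checkup 2</code>.</p>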

<div class="notice">
  💡 For a quick, visual live view of GPU usage, use <a href="tool-nvtop.html">nvtop</a> instead — it's easier to read for day-to-day checks. DCGM is better for detailed diagnostics, historical data, and integration with monitoring systems.
</div>

<div class="section-title">What the numbers mean</div>
<div class="info-grid">
  <div class="info-card">
    <h4>GPU Utilisation</h4>
    <p><strong>0–10%</strong> — GPU is idle; no model is actively processing<br>
    <strong>50–80%</strong> — Normal for active inference (queries being processed)<br>
    <strong>90–100%</strong> — Heavy load: training or many concurrent queries</p>
  </div>
  <div class="info-card">
    <h4>Memory Usage</h4>
    <p><strong>Little memory free</strong> — A model is loaded and ready to serve<br>
    <strong>Most memory free</strong> — No model is loaded; run <code>ollama ps</code> to check<br>
    <strong>Memory full and errors reported</strong> — The model is likely too large for the available VRAM</p>
  </div>
  <div class="info-card">
    <h4>Temperature</h4>
    <p><strong>Under 75°C</strong> — Normal operating range<br>
    <strong>75–85°C</strong> — Warm but acceptable under load<br>
    <strong>Over 85°C</strong> — Check airflow and cooling; contact support</p>
  </div>
  <div class="info-card">
    <h4>Hardware Errors</h4>
    <p><strong>ECC errors (corrected)</strong> — Occasional errors are normal; the GPU corrects them itself<br>
    <strong>ECC errors (uncorrected)</strong> — Investigate immediately<br>
    <strong>Xid errors in logs</strong> — A fault reported by the NVIDIA driver, which can point to a hardware, driver, or application problem; contact support</p>
  </div>
</div>
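<p>The temperature bands above translate directly into a small helper, for example when scripting an alert. This is a sketch only: <code>classify_temp</code> is a hypothetical name, and it expects a whole number of degrees Celsius as its only argument.</p>

```shell
#!/bin/sh
# Sketch: map a GPU temperature (whole degrees Celsius) to the bands
# in the Temperature card: under 75, 75 to 85, and over 85.
classify_temp() {
    temp_c="$1"
    if [ "$temp_c" -lt 75 ]; then
        echo "normal"
    elif [ "$temp_c" -le 85 ]; then
        echo "warm"
    else
        echo "hot: check airflow and cooling"
    fi
}
```

<p>For example, <code>classify_temp 82</code> prints <code>warm</code>.</p>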

<div class="section-title">Works with</div>
<div class="works-with">
  <a href="tool-nvtop.html">📈 nvtop</a>
  <a href="tool-ollama.html">🦙 Ollama</a>
</div>

</div>

<footer>
  <p>Nexus One AI · Powered by Cezen · Basic Tier</p>
  <p>Questions? <a href="mailto:support@cezentech.com">support@cezentech.com</a></p>
</footer>

<script src="auth.js"></script>
<script src="branding.js"></script>
</body>
</html>