1. What is Fabric Manager?
One-line definition
Fabric Manager is an NVIDIA software service that configures and manages the NVSwitch fabric linking multiple GPUs over NVLink, so that the GPUs can communicate at high speed and behave like one large GPU.
Easy Explanation
Imagine a server with 8 GPUs:
- Without Fabric Manager → the NVSwitch fabric is not configured, so the GPUs cannot work together over it
- With Fabric Manager + NVSwitch → GPUs work as a single unified system
Key Technologies
- NVIDIA GPUs
- NVLink → High-speed GPU-to-GPU connection
- NVSwitch → Switch that connects all GPUs together
- Fabric Manager → Controls and manages this entire network
Why is it important?
It is essential for:
- NVSwitch-based H100 / H200 / A100 GPU servers (for example, HGX and DGX systems)
- Distributed AI training (PyTorch / TensorFlow)
- NCCL communication (e.g., all_reduce)
Without it:
- GPU communication becomes slow
- Multi-GPU jobs may fail
- NCCL timeouts can occur
2. What does this command mean?
Command
systemctl status nvidia-fabricmanager
Meaning:
“Check whether the Fabric Manager service is running correctly”
Command Breakdown
| Component | Description |
|---|---|
| systemctl | Linux service management tool |
| status | Check current state |
| nvidia-fabricmanager | Fabric Manager service name |
3. Understanding the Output (Very Important ⭐)
Example (Healthy State)
● nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/usr/lib/systemd/system/nvidia-fabricmanager.service; enabled)
Active: active (running)
Main PID: 2939 (nv-fabricmanager)
Key Fields Explained
✅ 1. Loaded
Loaded: loaded (...; enabled)
- Service file is loaded correctly
- enabled → Starts automatically at boot
✅ 2. Active (Most Important)
Active: active (running)
| Status | Meaning |
|---|---|
| active (running) | ✅ Healthy |
| inactive | ❌ Stopped |
| failed | ❌ Error occurred |
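For scripts, the same three states can be read without parsing the full status text. The sketch below assumes `systemctl is-active`, which prints a single state word such as `active`, `inactive`, or `failed`. The function name `describe_fm_state` is made up for illustration.

```shell
# Hypothetical helper: translate a systemd state word into the meanings
# from the table above. Any other word is reported as unknown.
describe_fm_state() {
    case "$1" in
        active)   echo "healthy: Fabric Manager is running" ;;
        inactive) echo "stopped: Fabric Manager is not running" ;;
        failed)   echo "error: Fabric Manager failed" ;;
        *)        echo "unknown state: $1" ;;
    esac
}

# Usage on a real node (needs systemd and the nvidia-fabricmanager unit):
#   describe_fm_state "$(systemctl is-active nvidia-fabricmanager)"
```

Keeping the mapping in a function that takes the state word as an argument means it can be tried out on any machine, even one without GPUs.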
✅ 3. Main PID
Main PID: 2939
- Process ID of the running service
✅ 4. Tasks / Memory
Tasks: 18
Memory: 50MB
- Resource usage of the service
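If you want these fields in a script, `systemctl show` prints them as `KEY=VALUE` lines, which are easier to parse than the human-readable status. This is a minimal sketch that assumes the standard systemd property names `ActiveState`, `SubState`, `MainPID`, and `UnitFileState`; `get_prop` is a hypothetical helper name.

```shell
# Hypothetical helper: read KEY=VALUE lines on stdin and print the value
# of the requested key (first match only). Prints nothing if the key is absent.
get_prop() {
    awk -v key="$1" 'index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }'
}

# Usage on a real node:
#   props=$(systemctl show nvidia-fabricmanager \
#       -p ActiveState -p SubState -p MainPID -p UnitFileState)
#   printf '%s\n' "$props" | get_prop ActiveState
#   printf '%s\n' "$props" | get_prop MainPID
```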
⚠️ 4. Common Problem States
❌ 1. Service is stopped
Active: inactive (dead)
Meaning:
- Fabric Manager is not running
- The NVSwitch fabric is not configured for GPU-to-GPU communication
❌ 2. Service failure
Active: failed
Possible causes:
- Driver issues (for example, a Fabric Manager version that does not match the installed NVIDIA driver)
- NVSwitch errors
- GPU hardware problems
- Kernel conflicts
❌ 3. Status check fails
Failed to retrieve unit state: Connection timed out
This is critical
Possible causes:
- systemd issue
- Node is hanging
- Network problem
- Kernel lockup
- Fabric Manager deadlock
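Because a hanging node can make the status check itself block, a monitoring script may want to put a deadline on it. The sketch below relies on the coreutils `timeout` command, which exits with status 124 when the deadline passes. The name `check_with_deadline` and the labels `ok`, `hung`, and `not-ok` are assumptions made for this example.

```shell
# Hypothetical helper: run a command with a deadline (in seconds) and
# classify the outcome.
#   ok      the command exited with status 0
#   hung    the deadline passed (timeout exits with 124)
#   not-ok  the command finished with any other non-zero status
check_with_deadline() {
    seconds="$1"
    shift
    rc=0
    timeout "$seconds" "$@" >/dev/null 2>&1 || rc=$?
    case "$rc" in
        0)   echo "ok" ;;
        124) echo "hung" ;;
        *)   echo "not-ok" ;;
    esac
}

# Usage on a real node:
#   check_with_deadline 10 systemctl is-active --quiet nvidia-fabricmanager
```

A `hung` result points at the node or systemd rather than at Fabric Manager alone, which matches the causes listed above.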
5. Troubleshooting Steps (Practical Guide)
✅ Step 1: Restart the service
systemctl restart nvidia-fabricmanager
✅ Step 2: Check status again
systemctl status nvidia-fabricmanager
✅ Step 3: Check logs
journalctl -u nvidia-fabricmanager -n 100
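One hundred log lines can be a lot to read, so it can help to filter for lines that look like problems. This is only a sketch: `fm_log_problems` is a hypothetical name, and the keyword list is a guess that should be tuned to the messages your driver version actually writes.

```shell
# Hypothetical helper: read log text on stdin and keep only lines that
# contain a problem-like keyword (case-insensitive). The keyword list is
# an assumption, not an official list of Fabric Manager messages.
fm_log_problems() {
    grep -iE 'error|fail|fatal|timed out' || true   # no match is not an error
}

# Usage on a real node:
#   journalctl -u nvidia-fabricmanager -n 100 --no-pager | fm_log_problems
```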
✅ Step 4: Check GPU status
nvidia-smi
Look for:
- GPUs detected correctly?
- Any error messages?
- NVLink status
✅ Step 5: Check topology
nvidia-smi topo -m
Verify NVLink/NVSwitch connections
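In the topology matrix, GPU pairs connected over NVLink are marked with `NV` followed by a number. A quick sanity check is to count those entries. The sketch below assumes that each GPU row starts with `GPU<n>` at the beginning of the line and that NVLink cells look like `NV12` or `NV18`; the exact layout can differ between driver versions, and `count_nvlink_cells` is a hypothetical name.

```shell
# Hypothetical helper: read `nvidia-smi topo -m` output on stdin and count
# the NV<number> cells in rows that start with GPU<n>.
count_nvlink_cells() {
    grep -E '^GPU[0-9]+' | grep -oE 'NV[0-9]+' | wc -l | tr -d ' '
}

# Usage on a real node:
#   nvidia-smi topo -m | count_nvlink_cells
```

On a node where every GPU reaches every other GPU over NVLink, the count should equal N × (N − 1) for N GPUs, so 56 for 8 GPUs. A lower number suggests that some pairs are falling back to PCIe or system paths.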
Step 6: If everything fails
reboot
Why?
- Fabric Manager works closely with the kernel driver and the NVSwitch hardware
- Many issues are resolved after reboot
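Steps 1 to 5 can be collected into one small script. This is a sketch, not an official NVIDIA tool: `run_step`, `fm_triage`, and the `DRY_RUN` variable are hypothetical names, and the reboot from Step 6 is left out on purpose so that it stays a manual decision.

```shell
# Hypothetical helper: print a command, and run it only when DRY_RUN=0.
# With DRY_RUN unset or set to 1, nothing is executed.
run_step() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Steps 1 to 5 from this section, in order. A failing step does not stop
# the sequence, so the later diagnostics are still collected.
fm_triage() {
    run_step systemctl restart nvidia-fabricmanager
    run_step systemctl status nvidia-fabricmanager --no-pager
    run_step journalctl -u nvidia-fabricmanager -n 100 --no-pager
    run_step nvidia-smi
    run_step nvidia-smi topo -m
}

# Usage:
#   fm_triage              # dry run: only prints the commands
#   DRY_RUN=0 fm_triage    # actually runs them (needs root for the restart)
```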
⚡ 6. Real-World Impact (Very Important)
In Slurm / Kubernetes environments
If Fabric Manager fails:
- NCCL timeouts occur
- Distributed training fails
- GPU communication slows down drastically
Typical symptoms
- BROADCAST timeout
- NCCL WARN
- Sudden performance drop
In many cases → Fabric Manager or NVSwitch issue
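To see quickly whether a job log shows these symptoms, you can count the matching lines. This is a rough sketch: `count_nccl_symptoms` is a hypothetical name, the patterns are taken from the symptom list in this section, and a match only hints at a fabric problem. It does not prove one.

```shell
# Hypothetical helper: read a job log on stdin and print how many lines
# mention "NCCL WARN" or a timeout (case-insensitive).
count_nccl_symptoms() {
    grep -ciE 'NCCL WARN|timeout|timed out' || true   # grep -c still prints 0 on no match
}

# Usage (the log file name is only an example):
#   count_nccl_symptoms < training-job.log
```

If the count is above zero, checking the Fabric Manager status on the nodes that ran the job is a sensible next step.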
7. Quick Summary
✔️ Key Points
- Fabric Manager = GPU communication controller
- Required for NVSwitch systems
- Check status with:
systemctl status nvidia-fabricmanager
✔️ Healthy state
Active: active (running)
✔️ Troubleshooting flow
- Restart service
- Check logs
- Run nvidia-smi
- Reboot if needed
Final Takeaway
Fabric Manager is a critical service that enables multiple GPUs to operate as one unified system. If it fails, distributed GPU workloads will likely break.