A Beginner-Friendly Guide for GPU and HPC Environments
When operating GPU servers or HPC clusters, you may sometimes see error messages like:
Lustre RDMA failure
LustreError
LNetError
Request sent has timed out
connection lost to ... @o2ib
At first glance, these messages can look complicated.
However, the core meaning is simple:
A Lustre RDMA failure means that the high-speed communication path between a GPU server and a Lustre storage system has failed.
In most GPU cluster environments, Lustre storage is accessed over InfiniBand using RDMA.
So when this error occurs, it usually means there is a problem in the storage network path, not just a simple file error.
1. What Is Lustre? 📦
Lustre is a high-performance parallel file system commonly used in large-scale HPC and AI/GPU environments.
It allows many servers to access the same shared storage system at the same time with very high throughput.
For example, GPU servers may use Lustre-mounted paths such as:
/mnt/lustre
/mnt/ddn
/mnt/a2wbl
/home/user/data
Compared to regular storage:
| Storage Type | Description |
|---|---|
| Local Disk | Disk installed directly inside one server |
| NFS | Shared network storage for general workloads |
| Lustre | High-performance parallel storage for large clusters |
In AI training environments, Lustre is often used to store:
- Training datasets
- Checkpoints
- Model outputs
- Logs
- Shared project files
2. What Is RDMA? ⚡
RDMA stands for Remote Direct Memory Access.
In simple terms:
RDMA allows one server to transfer data directly to another server’s memory with very little CPU involvement.
A normal network data flow looks like this:
Application
↓
Operating System
↓
CPU Processing
↓
Network Card
↓
Remote Server
With RDMA, the path is much shorter, which is what makes it faster:
Server A Memory
↓
RDMA / InfiniBand
↓
Server B Memory
This is why RDMA is widely used in:
- GPU clusters
- AI model training
- HPC workloads
- Parallel storage systems
- Low-latency network environments
3. What Does @o2ib Mean? 🌐
In Lustre logs or mount information, you may see something like this:
100.100.100.105@o2ib:/a2wbl /mnt/a2wbl lustre
The important part is:
@o2ib
o2ib is the name of the LNet network type for InfiniBand, so in practice it means Lustre over InfiniBand.
| Term | Meaning |
|---|---|
| o2ib | Lustre communication over InfiniBand |
| InfiniBand | High-speed network used in HPC/GPU clusters |
| RDMA | Direct memory-to-memory data transfer |
| Lustre RDMA | Lustre storage traffic using RDMA |
So when you see @o2ib, it means the Lustre client is communicating with the Lustre server through an InfiniBand/RDMA network path.
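As a rough illustration of how the mount source above breaks down, the sketch below splits it into address, network type, and filesystem name using plain shell parameter expansion. The function name `parse_lustre_source` is made up for this example, and it assumes the simple single-server form shown above (one address, one `@network`, one `:/fsname`).

```shell
#!/bin/sh
# Hypothetical helper (not a Lustre tool): split ADDRESS@NETWORK:/FSNAME.
# Assumes a single server NID, as in the example mount line.
parse_lustre_source() {
    src=$1
    nid=${src%%:/*}      # text before ":/"  -> 100.100.100.105@o2ib
    fsname=${src#*:/}    # text after ":/"   -> a2wbl
    addr=${nid%@*}       # text before "@"   -> 100.100.100.105
    net=${nid#*@}        # text after "@"    -> o2ib
    printf 'address=%s network=%s fsname=%s\n' "$addr" "$net" "$fsname"
}

parse_lustre_source "100.100.100.105@o2ib:/a2wbl"
# prints: address=100.100.100.105 network=o2ib fsname=a2wbl
```

The `address@network` pair is what Lustre calls a NID, and it is the same string that appears in `connection lost to ...@o2ib` log lines.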
4. What Is a Lustre RDMA Failure? 🚨
A Lustre RDMA failure means:
GPU Server ↔ Lustre Storage Server RDMA communication failed
In simpler words:
The GPU server tried to read from or write to Lustre storage, but the high-speed RDMA communication path failed.
This failure can happen between:
GPU Server
↓
Lustre Client
↓
LNet
↓
o2ib / RDMA
↓
InfiniBand Switch
↓
Lustre Storage Server
So the issue may be caused by:
- The GPU server
- The InfiniBand adapter
- The InfiniBand switch
- The RDMA driver
- The Lustre client
- The Lustre server
- The DDN/storage backend
5. Simple Architecture View 🧭
A normal Lustre over RDMA path looks like this:
GPU Server
↓
Lustre Client
↓
LNet
↓
o2ib / RDMA
↓
InfiniBand Switch
↓
DDN / Lustre Storage Server
↓
MDT / OST
When a Lustre RDMA failure happens, the issue is commonly around this area:
LNet
↓
o2ib / RDMA ❌ Failure point
↓
InfiniBand Network
However, the actual root cause can still be on the client side, the network side, or the storage side.
6. What Are MDT and OST? 🧱
Lustre is made of several important components.
You may see names like:
MDT0006
OST0012
a2wbl02-MDT0006-mdc
Here is what they mean:
| Component | Role | If It Has a Problem |
|---|---|---|
| MDT | Stores metadata such as file names, directories, and permissions | ls, stat, mkdir, rm may become slow |
| OST | Stores actual file data | File read/write may become slow or fail |
| MGS | Stores Lustre configuration information | Mount or configuration issues may occur |
| Client | GPU or compute server accessing Lustre | User sees file access issues |
For example, if logs repeatedly mention MDT0006, the problem may affect directory listing or metadata operations.
If logs repeatedly mention an OST, the problem may affect actual file reading or writing.
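To make the naming concrete, here is a small sketch that takes a device name like the one above and reports which filesystem and target it refers to. The function name `describe_target` is hypothetical, the sketch assumes the `fsname-TARGETnnnn-suffix` form with no hyphen inside the filesystem name, and it treats the four-digit index as hexadecimal, which is how Lustre numbers its targets.

```shell
#!/bin/sh
# Hypothetical helper: explain a Lustre device name such as a2wbl02-MDT0006-mdc.
# Assumes the form FSNAME-TARGETnnnn-SUFFIX with no hyphen inside FSNAME.
describe_target() {
    name=$1
    fs=${name%%-*}                                # a2wbl02
    rest=${name#*-}                               # MDT0006-mdc
    target=${rest%%-*}                            # MDT0006
    kind=$(printf '%s' "$target" | cut -c1-3)     # MDT
    index=$(printf '%s' "$target" | cut -c4-)     # 0006 (hexadecimal)
    case $kind in
        MDT) impact="metadata operations (ls, stat, mkdir, rm)" ;;
        OST) impact="file data read/write" ;;
        *)   impact="unknown" ;;
    esac
    printf '%s %s index=%d: %s\n' "$fs" "$kind" "0x$index" "$impact"
}

describe_target "a2wbl02-MDT0006-mdc"
# prints: a2wbl02 MDT index=6: metadata operations (ls, stat, mkdir, rm)
```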
7. Common Error Messages and Their Meanings 🧾
1) Request sent has timed out
Example:
ptlrpc_expire_one_request()
Request sent has timed out
Meaning:
The Lustre client sent a request to the storage server but did not receive a response within the expected time.
Simple explanation:
Client: “Please give me this file information.”
Storage: No response.
Client: Timeout.
2) connection lost
Example:
connection lost to 100.100.100.105@o2ib
Meaning:
The Lustre client lost its connection to the Lustre server through the InfiniBand/RDMA path.
This may indicate an issue with:
- InfiniBand link
- HCA adapter
- Switch port
- Lustre server
- LNet/RDMA layer
3) rc = -5
Example:
LustreError: rc = -5
In Linux, a return code of -5 corresponds to EIO (errno 5), which means input/output error.
Simple explanation:
The client experienced an input/output error while communicating with the Lustre storage system.
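Other negative `rc` values follow the same convention: the number is a Linux errno with the sign flipped. The sketch below is a small lookup for a handful of codes that tend to be relevant to network and I/O problems. The function name `explain_rc` is made up, the list is deliberately short, and the numbers are the common Linux values (they can differ on other operating systems).

```shell
#!/bin/sh
# Hypothetical helper: translate a few Linux errno values seen as "rc = -N".
# The table is intentionally incomplete; values are the common Linux ones.
explain_rc() {
    rc=${1#-}    # drop a leading minus sign if present
    case $rc in
        5)   echo "EIO: Input/output error" ;;
        11)  echo "EAGAIN: Resource temporarily unavailable" ;;
        107) echo "ENOTCONN: Transport endpoint is not connected" ;;
        108) echo "ESHUTDOWN: Cannot send after transport endpoint shutdown" ;;
        110) echo "ETIMEDOUT: Connection timed out" ;;
        113) echo "EHOSTUNREACH: No route to host" ;;
        *)   echo "errno $rc: not in this short table" ;;
    esac
}

explain_rc -5
# prints: EIO: Input/output error
```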
4) LNetError
Example:
LNetError
LNet is the network layer used by Lustre.
The simplified stack looks like this:
Lustre
↓
LNet
↓
o2ib
↓
InfiniBand / RDMA
So LNetError usually means there is a problem in the Lustre networking layer.
8. Common Causes of Lustre RDMA Failure 🔍
1) InfiniBand Port Issue
If an InfiniBand port is not active, Lustre RDMA communication may fail.
Check with:
ibstat
A healthy state usually looks like:
State: Active
Physical state: LinkUp
A problematic state may look like:
State: Down
Physical state: Polling
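On a node with several adapters, reading the full ibstat output by eye gets tedious. The following is a minimal sketch that reads ibstat-style text on standard input and prints only the ports that are not Active/LinkUp. The function name `unhealthy_ib_ports` is hypothetical, and the parsing assumes the usual layout of a `CA '<name>'` line followed by `Port N:`, `State:`, and `Physical state:` lines.

```shell
#!/bin/sh
# Hypothetical helper: list ports whose state is not Active/LinkUp.
# Reads ibstat-style text from stdin; assumes the common ibstat layout.
unhealthy_ib_ports() {
    awk '
        /^CA / {
            ca = $2
            gsub(/[^A-Za-z0-9_]/, "", ca)        # strip the quotes around the name
        }
        /^[ \t]*Port [0-9]+:/ { port = $2; sub(/:/, "", port) }
        /^[ \t]*State:/       { state = $2 }
        /^[ \t]*Physical state:/ {
            phys = $3
            if (state != "Active" || phys != "LinkUp")
                printf "%s port %s: State=%s Physical=%s\n", ca, port, state, phys
        }
    '
}

# Typical use on a cluster node:
#   ibstat | unhealthy_ib_ports
```

No output means every port that was parsed looked healthy.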
2) Mellanox / NVIDIA HCA Issue
The InfiniBand adapter is often called an HCA.
If the HCA driver, firmware, or hardware has a problem, RDMA communication can fail.
Check kernel logs:
dmesg -T | grep -i mlx5
Important keywords to look for:
error
timeout
reset
port error
async error
3) InfiniBand Switch Issue
If many GPU nodes experience Lustre RDMA failures at the same time, the issue may not be on a single server.
In that case, you should suspect the shared network path:
Multiple GPU Servers
↓
InfiniBand Switch
↓
Lustre / DDN Storage
If multiple nodes show connection lost, timeout, or LNetError at the same time, the issue may be related to the InfiniBand fabric or storage backend.
4) Lustre Server or DDN Storage Delay
Even if the network is healthy, the Lustre server may be slow or overloaded.
For example:
- MDT response delay
- OST response delay
- DDN controller issue
- Storage backend overload
- Too many simultaneous I/O requests
From the client side, this may appear as a timeout or RDMA failure.
5) Heavy I/O Load
Large AI workloads can generate massive I/O traffic.
Common examples include:
- Many jobs reading the same dataset
- Many nodes writing checkpoints at the same time
- Too many small files
- Heavy find, du, or ls operations
- Large-scale distributed training jobs
This can overload Lustre metadata or data servers and cause timeouts.
9. What Symptoms Can Users See? 💥
When a Lustre RDMA failure occurs, users may experience:
| Symptom | Description |
|---|---|
| ls command hangs | Directory metadata cannot be retrieved |
| df -h hangs | Filesystem status query waits for Lustre response |
| File read/write fails | Data cannot be accessed properly |
| Input/output error | I/O request failed |
| Notebook hangs | Home or shared storage path becomes unresponsive |
| Training job fails | Dataset loading or checkpoint writing fails |
| Slurm job stuck | Job may remain in COMPLETING or abnormal state |
| NCCL timeout | Storage delay may indirectly affect distributed training |
Commands like these may hang during Lustre issues:
ls /mnt/lustre
df -h
du -sh /mnt/lustre/*
find /mnt/lustre
A safer way to test is to use timeout:
timeout 5 ls /mnt/lustre
timeout 5 df -h /mnt/lustre
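The timeout approach can be wrapped so that the result is easy to read in a script. This sketch assumes GNU coreutils `timeout`, which exits with status 124 when the time limit is hit. The function name `probe` is made up for this example. One caveat: a process blocked deep inside kernel I/O may not exit promptly even after it is signaled, so treat this as a best-effort check.

```shell
#!/bin/sh
# Hypothetical helper: run a command with a time limit and classify the result.
# Assumes GNU coreutils timeout (exit status 124 means the limit was reached).
probe() {
    limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    status=$?
    case $status in
        0)   echo "ok" ;;
        124) echo "timed out after ${limit}s" ;;
        *)   echo "failed (exit $status)" ;;
    esac
}

# Typical use on a cluster node:
#   probe 5 ls /mnt/lustre
#   probe 5 df -h /mnt/lustre
```

A "timed out" result points toward an unresponsive mount, while "failed" usually means the command returned an error quickly, for example an input/output error.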
10. Single Node Issue vs Cluster-Wide Issue 🧪
The most important step is to identify the scope of impact.
Case 1: Only One Node Has the Issue
If only one GPU server shows the error, the issue may be local to that node.
Possible causes:
| Cause | Description |
|---|---|
| IB port problem | InfiniBand link is not active |
| HCA issue | Adapter hardware, firmware, or driver problem |
| Cable issue | Physical link instability |
| Driver issue | mlx5 or RDMA driver problem |
| Kernel state issue | Temporary OS or module issue |
Check commands:
hostname
ibstat
ibdev2netdev
dmesg -T | grep -i mlx5
dmesg -T | egrep -i "lustre|lnet|o2ib|rdma|timeout"
Case 2: Many Nodes Have the Issue at the Same Time
If many servers report the same issue at the same time, you should suspect a shared component.
Possible causes:
| Cause | Description |
|---|---|
| InfiniBand switch issue | Shared network path problem |
| IB fabric issue | Routing or link instability |
| Lustre server issue | MDT/OST not responding properly |
| DDN storage issue | Storage controller or backend problem |
| Heavy I/O storm | Too many clients accessing storage at once |
In this case, rebooting individual nodes is usually not the first priority.
You should check the InfiniBand fabric and Lustre/DDN storage backend first.
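The single-node versus multi-node decision can be sketched as a tiny script. It assumes you have already collected error lines from several nodes into one stream, with the hostname as the first field of each line; that input format and the function name `classify_scope` are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper: decide the scope of impact from collected error lines.
# Assumed input format: "<hostname> <log message...>", one line per error.
classify_scope() {
    awk '
        NF && !seen[$1]++ { n++ }      # count each hostname once
        END {
            if (n == 0)
                print "no affected nodes"
            else if (n == 1)
                print "single-node: check local IB port, HCA, cable, driver"
            else
                printf "multi-node (%d nodes): check IB fabric and Lustre/DDN backend first\n", n
        }
    '
}

printf '%s\n' \
    "gpu01 LustreError: Request sent has timed out" \
    "gpu02 LNetError: connection lost" \
    "gpu01 LustreError: rc = -5" | classify_scope
# prints: multi-node (2 nodes): check IB fabric and Lustre/DDN backend first
```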
11. Useful Commands for Troubleshooting ✅
Check Lustre-related logs
dmesg -T | egrep -i "lustre|lnet|o2ib|rdma|ptlrpc|timeout|connection"
Check recent kernel logs:
journalctl -k --since "1 hour ago" | egrep -i "lustre|lnet|o2ib|rdma|ptlrpc|timeout|connection"
Check InfiniBand port status
ibstat
Expected healthy output:
State: Active
Physical state: LinkUp
Map IB devices to network interfaces
ibdev2netdev
Example:
mlx5_0 port 1 ==> ib0 (Up)
mlx5_1 port 1 ==> ib1 (Up)
Check Mellanox/NVIDIA HCA logs
dmesg -T | grep -i mlx5
Or filter important keywords:
dmesg -T | grep -i mlx5 | egrep -i "error|timeout|reset|port"
Check Lustre mount information
mount | grep lustre
Or:
df -hT | grep lustre
During a Lustre issue, df can hang, so this may be safer:
timeout 5 df -hT | grep lustre
Check Lustre device list
lctl dl
This shows Lustre devices such as:
- MDC
- OSC
- MDT
- OST
- MGC
Ping Lustre NID
If the log shows a target like:
100.100.100.105@o2ib
You can test it with:
lctl ping 100.100.100.105@o2ib
If this fails or times out, the Lustre LNet/RDMA path may have a problem.
12. Key Points to Check During Analysis 🔎
When analyzing a Lustre RDMA failure, check these five points:
1) Which node reported the error?
hostname
Determine whether the issue is isolated or widespread.
2) When did it happen?
Compare timestamps across multiple nodes.
Example:
Fri May 22 04:16:59 2026
If many nodes show errors at the same time, it may be a shared infrastructure issue.
3) Is the target MDT or OST?
Example:
MDT0006
OST0012
| Target | Possible Impact |
|---|---|
| MDT | Directory listing, file creation, permission checks |
| OST | File read/write operations |
| MGS | Mount or configuration operations |
4) Is the same IP repeated?
Example:
100.100.100.105@o2ib
If the same IP appears repeatedly, focus on that Lustre server or network path.
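One way to spot a repeated target is to pull every `address@o2ib` string out of the logs and count it. The sketch below does that for IPv4-style NIDs read from standard input. The function name `count_nids` is hypothetical, and it relies on `grep -o`, which GNU and BSD grep both provide.

```shell
#!/bin/sh
# Hypothetical helper: count how often each IPv4 o2ib NID appears in log text.
# Reads log lines from stdin; prints "<count> <nid>", most frequent first.
count_nids() {
    grep -oE '[0-9]+(\.[0-9]+){3}@o2ib[0-9]*' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}

# Typical use on a cluster node:
#   dmesg -T | count_nids
```

If one NID dominates the list, that server or the path to it is the natural place to start looking.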
5) Are the same error keywords repeated?
Important keywords include:
LustreError
LNetError
RDMA failure
connection lost
Request sent has timed out
rc = -5
o2ib
Repeated patterns are more meaningful than a single isolated log line.
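To see which patterns actually repeat, a short counter over the kernel log can help. This is only a sketch: the function name `count_keywords` is made up, the keyword list is a subset of the one above, and matching is a plain case-sensitive substring check, so a line containing two keywords is counted once for each.

```shell
#!/bin/sh
# Hypothetical helper: count log lines containing each keyword.
# Reads log text from stdin; matching is a case-sensitive substring test.
count_keywords() {
    awk '
        BEGIN {
            n = split("LustreError|LNetError|connection lost|timed out|rc = -5", kw, "|")
        }
        {
            for (i = 1; i <= n; i++)
                if (index($0, kw[i])) count[kw[i]]++
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s: %d\n", kw[i], count[kw[i]]
        }
    '
}

# Typical use on a cluster node:
#   dmesg -T | count_keywords
```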
13. Example Incident Report Statement 📌
You can use the following sentence in an operation report:
The Lustre RDMA failure indicates a communication failure between the GPU node and the Lustre storage system over the o2ib/RDMA network path.
The client sent requests to the Lustre MDT or OST server, but the request either timed out or the connection became unstable at the LNet/RDMA layer.
If the issue occurs on a single node, the InfiniBand HCA, port, cable, and driver state of that node should be checked first. If the issue occurs across multiple nodes at the same time, a shared InfiniBand fabric or Lustre/DDN storage-side issue should be investigated.
14. Simple Analogy for Beginners 🚗
Think of Lustre storage as a large warehouse.
GPU Server = Factory
Lustre Storage = Warehouse
InfiniBand/RDMA = High-speed highway
File Request = Delivery request
Normal situation:
Factory → Highway → Warehouse
Data is delivered quickly.
Failure situation:
Factory → Highway ❌ → Warehouse
The road is blocked or unstable.
So the user may experience:
Files are slow to open
Directory listing hangs
Training jobs fail
Notebook becomes unresponsive
Input/output errors appear
In this analogy, a Lustre RDMA failure means that the high-speed road between the GPU server and storage system is not working properly.
15. Summary 🧠
| Item | Explanation |
|---|---|
| Error Name | Lustre RDMA failure |
| Meaning | Failure in RDMA/InfiniBand communication between Lustre client and server |
| Main Path | GPU Server → LNet → o2ib → InfiniBand → Lustre Server |
| Common Logs | LustreError, LNetError, timeout, connection lost, rc=-5 |
| Single Node Cause | IB port, HCA, cable, driver, local kernel issue |
| Multi-Node Cause | IB switch, fabric, Lustre server, DDN storage issue |
| User Impact | File access delay, I/O error, job failure, notebook hang |
| Useful Commands | ibstat, dmesg, lctl ping, lctl dl, mount |
Conclusion ✅
A Lustre RDMA failure is not just a normal file error.
It usually means there is a problem in the high-speed communication path between the GPU server and the Lustre/DDN storage system.
If the error occurs on only one node, check that node’s:
- InfiniBand port
- HCA card
- Cable
- Driver
- Kernel logs
If the error occurs on many nodes at the same time, investigate shared infrastructure such as:
- InfiniBand switch
- IB fabric
- Lustre MDT/OST servers
- DDN storage backend
In short:
A Lustre RDMA failure means the storage highway between your GPU servers and Lustre storage is unstable, delayed, or temporarily broken.