What Does “Lustre RDMA Failure” Mean? 🚨

A Beginner-Friendly Guide for GPU and HPC Environments

When operating GPU servers or HPC clusters, you may sometimes see error messages like:


Lustre RDMA failure
LustreError
LNetError
Request sent has timed out
connection lost to ... @o2ib

At first glance, these messages can look complicated.
However, the core meaning is simple:

A Lustre RDMA failure means that the high-speed communication path between a GPU server and a Lustre storage system has failed.

In many GPU cluster environments, Lustre storage is accessed over InfiniBand using RDMA.
So when this error occurs, it usually means there is a problem in the storage network path, not just a simple file error.

1. What Is Lustre? 📦

Lustre is a high-performance parallel file system commonly used in large-scale HPC and AI/GPU environments.

It allows many servers to access the same shared storage system at the same time with very high throughput.

For example, GPU servers may use Lustre-mounted paths such as:


/mnt/lustre
/mnt/ddn
/mnt/a2wbl
/home/user/data

Compared to regular storage:

| Storage Type | Description |
| --- | --- |
| Local Disk | Disk installed directly inside one server |
| NFS | Shared network storage for general workloads |
| Lustre | High-performance parallel storage for large clusters |

In AI training environments, Lustre is often used to store:

Training datasets
Checkpoints
Model outputs
Logs
Shared project files

2. What Is RDMA? ⚡

RDMA stands for Remote Direct Memory Access.

In simple terms:

RDMA allows one server to transfer data directly to another server’s memory with very little CPU involvement.

A normal network data flow looks like this:


Application
  ↓
Operating System
  ↓
CPU Processing
  ↓
Network Card
  ↓
Remote Server

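With RDMA, the network adapter moves data between the two servers' memory directly, skipping most of the operating system and CPU work on the data path: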


Server A Memory
  ↓
RDMA / InfiniBand
  ↓
Server B Memory

This is why RDMA is widely used in:

GPU clusters
AI model training
HPC workloads
Parallel storage systems
Low-latency network environments

3. What Does `@o2ib` Mean? 🌐

In Lustre logs or mount information, you may see something like this:


100.100.100.105@o2ib:/a2wbl /mnt/a2wbl lustre

The important part is:


@o2ib

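o2ib is the LNet network type that Lustre uses for RDMA-capable fabrics, most commonly InfiniBand. The full string `100.100.100.105@o2ib` is a NID (network identifier): the server address, followed by `@`, followed by the network type.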

| Term | Meaning |
| --- | --- |
| `o2ib` | LNet network type for RDMA fabrics, most commonly InfiniBand |
| InfiniBand | High-speed network used in HPC/GPU clusters |
| RDMA | Direct memory-to-memory data transfer |
| Lustre RDMA | Lustre storage traffic using RDMA |

So when you see @o2ib, it means the Lustre client is communicating with the Lustre server through an InfiniBand/RDMA network path.
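
As a minimal sketch of how such an address breaks down, the snippet below splits the `address@network` string into its two parts. The function name `split_nid` is a hypothetical helper for illustration, not a Lustre command, and the sketch assumes a single IPv4-style address as in the example above.

```shell
#!/bin/sh
# split_nid is a hypothetical helper (not part of Lustre).
# It accepts either a bare NID or a mount source such as NID:/fsname.
split_nid() {
    nid=${1%%:*}                 # drop a trailing ":/fsname" if present
    case $nid in
        *@*) ;;                  # looks like address@network
        *) echo "not a NID: $1" >&2; return 1 ;;
    esac
    addr=${nid%@*}               # text before the last "@"
    net=${nid##*@}               # text after the last "@"
    echo "address=$addr network=$net"
}

split_nid "100.100.100.105@o2ib:/a2wbl"
```

If the network part is `o2ib`, the traffic to that server is expected to use the RDMA path described in this section.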

4. What Is a Lustre RDMA Failure? 🚨

A Lustre RDMA failure means:


GPU Server ↔ Lustre Storage Server RDMA communication failed

In simpler words:

The GPU server tried to read from or write to Lustre storage, but the high-speed RDMA communication path failed.

This failure can happen at any layer along this path:


GPU Server
  ↓
Lustre Client
  ↓
LNet
  ↓
o2ib / RDMA
  ↓
InfiniBand Switch
  ↓
Lustre Storage Server

So the issue may be caused by:

The GPU server
The InfiniBand adapter
The InfiniBand switch
The RDMA driver
The Lustre client
The Lustre server
The DDN/storage backend

5. Simple Architecture View 🧭

A normal Lustre over RDMA path looks like this:


GPU Server
  ↓
Lustre Client
  ↓
LNet
  ↓
o2ib / RDMA
  ↓
InfiniBand Switch
  ↓
DDN / Lustre Storage Server
  ↓
MDT / OST

When a Lustre RDMA failure happens, the issue is commonly around this area:


LNet
  ↓
o2ib / RDMA   ❌ Failure point
  ↓
InfiniBand Network

However, the actual root cause can still be on the client side, the network side, or the storage side.

6. What Are MDT and OST? 🧱

Lustre is made of several important components.

You may see names like:


MDT0006
OST0012
a2wbl02-MDT0006-mdc

Here is what they mean:

| Component | Role | If It Has a Problem |
| --- | --- | --- |
| MDT | Stores metadata such as file names, directories, and permissions | `ls`, `stat`, `mkdir`, `rm` may become slow |
| OST | Stores actual file data | File read/write may become slow or fail |
| MGS | Stores Lustre configuration information | Mount or configuration issues may occur |
| Client | GPU or compute server accessing Lustre | User sees file access issues |

For example, if logs repeatedly mention MDT0006, the problem may affect directory listing or metadata operations.

If logs repeatedly mention an OST, the problem may affect actual file reading or writing.
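
The mapping from a target name to its likely impact can be sketched as a small lookup. The function name `target_kind` is hypothetical, and the pattern matching is a simplification that only looks for `MDT`, `OST`, or `MGS`/`MGC` inside the name.

```shell
#!/bin/sh
# target_kind is a hypothetical helper: it classifies a Lustre target or
# device name by the component type embedded in the name.
target_kind() {
    case $1 in
        *MDT[0-9a-fA-F]*) echo "MDT: metadata operations (ls, stat, mkdir, rm)" ;;
        *OST[0-9a-fA-F]*) echo "OST: file data read/write" ;;
        *MGS*|*MGC*)      echo "MGS: mount and configuration" ;;
        *)                echo "unknown target" ;;
    esac
}

# Names taken from the examples in this section.
for t in a2wbl02-MDT0006-mdc OST0012; do
    printf '%s -> %s\n' "$t" "$(target_kind "$t")"
done
```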

7. Common Error Messages and Their Meanings 🧾

1) `Request sent has timed out`

Example:


ptlrpc_expire_one_request()
Request sent has timed out

Meaning:

The Lustre client sent a request to the storage server but did not receive a response within the expected time.

Simple explanation:


Client: “Please give me this file information.”
Storage: No response.
Client: Timeout.

2) `connection lost`

Example:


connection lost to 100.100.100.105@o2ib

Meaning:

The Lustre client lost its connection to the Lustre server through the InfiniBand/RDMA path.

This may indicate an issue with:

InfiniBand link
HCA adapter
Switch port
Lustre server
LNet/RDMA layer

3) `rc = -5`

Example:


LustreError: rc = -5

In Linux, a return code of -5 is the negated errno value EIO, which means I/O error.

Simple explanation:

The client experienced an input/output error while communicating with the Lustre storage system.
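
Other negative return codes follow the same convention: drop the minus sign and look up the Linux errno. The sketch below pulls the code out of a log line and translates a few values that tend to show up with network trouble. The function name `rc_name` and the choice of codes are assumptions for illustration; the list is not exhaustive.

```shell
#!/bin/sh
# rc_name is a hypothetical helper: it maps a negative return code to the
# Linux errno name it represents. Only a handful of codes are listed.
rc_name() {
    case $1 in
        -5)   echo "EIO (I/O error)" ;;
        -107) echo "ENOTCONN (transport endpoint is not connected)" ;;
        -108) echo "ESHUTDOWN (cannot send after transport endpoint shutdown)" ;;
        -110) echo "ETIMEDOUT (connection timed out)" ;;
        -113) echo "EHOSTUNREACH (no route to host)" ;;
        *)    echo "unlisted code: $1" ;;
    esac
}

# Extract the number after "rc = " from the example line and translate it.
line="LustreError: rc = -5"
rc=$(printf '%s\n' "$line" | sed -n 's/.*rc = \(-[0-9][0-9]*\).*/\1/p')
rc_name "$rc"
```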

4) `LNetError`

Example:


LNetError

LNet is the network layer used by Lustre.

The simplified stack looks like this:


Lustre
  ↓
LNet
  ↓
o2ib
  ↓
InfiniBand / RDMA

So LNetError usually means there is a problem in the Lustre networking layer.

8. Common Causes of Lustre RDMA Failure 🔍

1) InfiniBand Port Issue

If an InfiniBand port is not active, Lustre RDMA communication may fail.

Check with:


ibstat

A healthy state usually looks like:


State: Active
Physical state: LinkUp

A problematic state may look like:


State: Down
Physical state: Polling
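
On nodes with several adapters, reading every port by eye gets tedious. The following is a minimal sketch that condenses `ibstat`-style output into one line per port and flags anything that is not Active and LinkUp. The function name `check_ib_ports` is hypothetical, and the sample input is an abbreviated illustration of the output layout rather than a capture from a real system.

```shell
#!/bin/sh
# check_ib_ports is a hypothetical helper. It reads ibstat-style text on
# stdin and prints "<CA> port <N> <State>/<Physical state> OK|CHECK".
check_ib_ports() {
    awk '
        # Header line of each adapter; the name sits between quote marks (\047).
        $1 == "CA" && index($0, "\047") { split($0, q, "\047"); ca = q[2] }
        $1 == "Port" && $2 ~ /^[0-9]+:$/ { port = $2; sub(/:$/, "", port) }
        $1 == "State:" { state = $2 }
        $1 == "Physical" && $2 == "state:" {
            verdict = (state == "Active" && $3 == "LinkUp") ? "OK" : "CHECK"
            print ca, "port", port, state "/" $3, verdict
        }
    '
}

# Illustrative, abbreviated sample input (not real output).
check_ib_ports <<'EOF'
CA 'mlx5_0'
    Port 1:
        State: Active
        Physical state: LinkUp
CA 'mlx5_1'
    Port 1:
        State: Down
        Physical state: Polling
EOF
```

On a real node the same function would be fed from the command itself, for example `ibstat | check_ib_ports`.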

2) Mellanox / NVIDIA HCA Issue

The InfiniBand adapter is often called an HCA (Host Channel Adapter).

If the HCA driver, firmware, or hardware has a problem, RDMA communication can fail.

Check kernel logs:


dmesg -T | grep -i mlx5

Important keywords to look for:


error
timeout
reset
port error
async error

3) InfiniBand Switch Issue

If many GPU nodes experience Lustre RDMA failures at the same time, the issue may not be on a single server.

In that case, you should suspect the shared network path:


Multiple GPU Servers
  ↓
InfiniBand Switch
  ↓
Lustre / DDN Storage

If multiple nodes show connection lost, timeout, or LNetError at the same time, the issue may be related to the InfiniBand fabric or storage backend.

4) Lustre Server or DDN Storage Delay

Even if the network is healthy, the Lustre server may be slow or overloaded.

For example:

MDT response delay
OST response delay
DDN controller issue
Storage backend overload
Too many simultaneous I/O requests

From the client side, this may appear as a timeout or RDMA failure.

5) Heavy I/O Load

Large AI workloads can generate massive I/O traffic.

Common examples include:

Many jobs reading the same dataset
Many nodes writing checkpoints at the same time
Too many small files
Heavy find, du, or ls operations
Large-scale distributed training jobs

This can overload Lustre metadata or data servers and cause timeouts.

9. What Symptoms Can Users See? 💥

When a Lustre RDMA failure occurs, users may experience:

| Symptom | Description |
| --- | --- |
| `ls` command hangs | Directory metadata cannot be retrieved |
| `df -h` hangs | Filesystem status query waits for Lustre response |
| File read/write fails | Data cannot be accessed properly |
| `Input/output error` | I/O request failed |
| Notebook hangs | Home or shared storage path becomes unresponsive |
| Training job fails | Dataset loading or checkpoint writing fails |
| Slurm job stuck | Job may remain in `COMPLETING` or abnormal state |
| NCCL timeout | Storage delay may indirectly affect distributed training |

Commands like these may hang during Lustre issues:


ls /mnt/lustre
df -h
du -sh /mnt/lustre/*
find /mnt/lustre

A safer way to test is to use timeout:


timeout 5 ls /mnt/lustre
timeout 5 df -h /mnt/lustre

10. Single Node Issue vs Cluster-Wide Issue 🧪

The most important step is to identify the scope of impact.

Case 1: Only One Node Has the Issue

If only one GPU server shows the error, the issue may be local to that node.

Possible causes:

| Cause | Description |
| --- | --- |
| IB port problem | InfiniBand link is not active |
| HCA issue | Adapter hardware, firmware, or driver problem |
| Cable issue | Physical link instability |
| Driver issue | `mlx5` or RDMA driver problem |
| Kernel state issue | Temporary OS or module issue |

Check commands:


hostname
ibstat
ibdev2netdev
dmesg -T | grep -i mlx5
dmesg -T | grep -Ei "lustre|lnet|o2ib|rdma|timeout"

Case 2: Many Nodes Have the Issue at the Same Time

If many servers report the same issue at the same time, you should suspect a shared component.

Possible causes:

| Cause | Description |
| --- | --- |
| InfiniBand switch issue | Shared network path problem |
| IB fabric issue | Routing or link instability |
| Lustre server issue | MDT/OST not responding properly |
| DDN storage issue | Storage controller or backend problem |
| Heavy I/O storm | Too many clients accessing storage at once |

In this case, rebooting individual nodes is usually not the first priority.
You should check the InfiniBand fabric and Lustre/DDN storage backend first.
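
The single-node versus multi-node decision can be sketched as a count of distinct hosts that logged a matching error. The input format here (one line per log entry, with the hostname as the first word) is an assumption about how logs were collected, the hostnames `gpu01` and `gpu02` are made up, and treating two or more nodes as "many" is a simplification. The function name `classify_scope` is hypothetical.

```shell
#!/bin/sh
# classify_scope is a hypothetical helper. Input: lines of the form
# "<hostname> <log message>" on stdin. Output: a one-line scope verdict.
classify_scope() {
    awk '
        /LustreError|LNetError|connection lost|timed out/ { seen[$1] = 1 }
        END {
            n = 0
            for (h in seen) n++
            if (n == 0)
                print "none: no matching errors"
            else if (n == 1)
                print "single-node: check the HCA, port, cable, and driver on that node"
            else
                print "multi-node (" n " nodes): check the IB fabric and Lustre/DDN backend first"
        }
    '
}

# Hypothetical collected log lines from two nodes.
classify_scope <<'EOF'
gpu01 Request sent has timed out
gpu02 connection lost to 100.100.100.105@o2ib
gpu02 LNetError
EOF
```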

11. Useful Commands for Troubleshooting ✅

Check Lustre-related logs


dmesg -T | grep -Ei "lustre|lnet|o2ib|rdma|ptlrpc|timeout|connection"

Check recent kernel logs:


journalctl -k --since "1 hour ago" | grep -Ei "lustre|lnet|o2ib|rdma|ptlrpc|timeout|connection"

Check InfiniBand port status


ibstat

Expected healthy output:


State: Active
Physical state: LinkUp

Map IB devices to network interfaces


ibdev2netdev

Example:


mlx5_0 port 1 ==> ib0 (Up)
mlx5_1 port 1 ==> ib1 (Up)

Check Mellanox/NVIDIA HCA logs


dmesg -T | grep -i mlx5

Or filter important keywords:


dmesg -T | grep -i mlx5 | grep -Ei "error|timeout|reset|port"

Check Lustre mount information


mount | grep lustre

Or:


df -hT | grep lustre

During a Lustre issue, df can hang, so this may be safer:


timeout 5 df -hT | grep lustre

Check Lustre device list


lctl dl

This lists the Lustre devices configured on the client, such as the MGC, MDC, and OSC entries (for example, a2wbl02-MDT0006-mdc).

Ping Lustre NID

If the log shows a target like:


100.100.100.105@o2ib

You can test it with:


lctl ping 100.100.100.105@o2ib

If this fails or times out, the Lustre LNet/RDMA path may have a problem.

12. Key Points to Check During Analysis 🔎

When analyzing a Lustre RDMA failure, check these five points:

1) Which node reported the error?


hostname

Determine whether the issue is isolated or widespread.

2) When did it happen?

Compare timestamps across multiple nodes.

Example:


Fri May 22 04:16:59 2026

If many nodes show errors at the same time, it may be a shared infrastructure issue.

3) Is the target MDT or OST?

Example:


MDT0006
OST0012

| Target | Possible Impact |
| --- | --- |
| MDT | Directory listing, file creation, permission checks |
| OST | File read/write operations |
| MGS | Mount or configuration operations |

4) Is the same IP repeated?

Example:


100.100.100.105@o2ib

If the same IP appears repeatedly, focus on that Lustre server or network path.
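
Counting how often each address appears makes a repeated target easy to spot. The sketch below extracts `address@o2ib` strings from log text and prints them by frequency. It assumes IPv4-style addresses and a `grep` that supports `-o`. The function name `count_nids` is hypothetical, and the second address in the sample is made up for contrast.

```shell
#!/bin/sh
# count_nids is a hypothetical helper: it reads log text on stdin and
# prints "<count> <nid>" lines, most frequent first.
count_nids() {
    grep -oE '[0-9]+(\.[0-9]+){3}@o2ib[0-9]*' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'        # normalise the padding added by uniq -c
}

# Sample lines; 100.100.100.106 is an invented address for illustration.
count_nids <<'EOF'
connection lost to 100.100.100.105@o2ib
connection lost to 100.100.100.105@o2ib
connection lost to 100.100.100.106@o2ib
EOF
```

On a live node the input would come from the kernel log, for example `dmesg -T | count_nids`.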

5) Are the same error keywords repeated?

Important keywords include:


LustreError
LNetError
RDMA failure
connection lost
Request sent has timed out
rc = -5
o2ib

Repeated patterns are more meaningful than a single isolated log line.

13. Example Incident Report Statement 📌

You can use the following wording in an operations report:


The Lustre RDMA failure indicates a communication failure between the GPU node and the Lustre storage system over the o2ib/RDMA network path.

The client sent requests to the Lustre MDT or OST server, but the request either timed out or the connection became unstable at the LNet/RDMA layer.

If the issue occurs on a single node, the InfiniBand HCA, port, cable, and driver state of that node should be checked first. If the issue occurs across multiple nodes at the same time, a shared InfiniBand fabric or Lustre/DDN storage-side issue should be investigated.

14. Simple Analogy for Beginners 🚗

Think of Lustre storage as a large warehouse.


GPU Server = Factory
Lustre Storage = Warehouse
InfiniBand/RDMA = High-speed highway
File Request = Delivery request

Normal situation:


Factory → Highway → Warehouse
Data is delivered quickly.

Failure situation:


Factory → Highway ❌ → Warehouse
The road is blocked or unstable.

So the user may experience:


Files are slow to open
Directory listing hangs
Training jobs fail
Notebook becomes unresponsive
Input/output errors appear

In this analogy, a Lustre RDMA failure means that the high-speed road between the GPU server and storage system is not working properly.

15. Summary 🧠

| Item | Explanation |
| --- | --- |
| Error Name | Lustre RDMA failure |
| Meaning | Failure in RDMA/InfiniBand communication between Lustre client and server |
| Main Path | GPU Server → LNet → o2ib → InfiniBand → Lustre Server |
| Common Logs | `LustreError`, `LNetError`, `timeout`, `connection lost`, `rc=-5` |
| Single Node Cause | IB port, HCA, cable, driver, local kernel issue |
| Multi-Node Cause | IB switch, fabric, Lustre server, DDN storage issue |
| User Impact | File access delay, I/O error, job failure, notebook hang |
| Useful Commands | `ibstat`, `dmesg`, `lctl ping`, `lctl dl`, `mount` |

Conclusion ✅

A Lustre RDMA failure is not just a normal file error.

It usually means there is a problem in the high-speed communication path between the GPU server and the Lustre/DDN storage system.

If the error occurs on only one node, check that node’s:

InfiniBand port
HCA card
Cable
Driver
Kernel logs

If the error occurs on many nodes at the same time, investigate shared infrastructure such as:

InfiniBand switch
IB fabric
Lustre MDT/OST servers
DDN storage backend

In short:

A Lustre RDMA failure means the storage highway between your GPU servers and Lustre storage is unstable, delayed, or temporarily broken.
