What Does “Lustre RDMA Failure” Mean? 🚨


A Beginner-Friendly Guide for GPU and HPC Environments

When operating GPU servers or HPC clusters, you may sometimes see error messages like:

Lustre RDMA failure
LustreError
LNetError
Request sent has timed out
connection lost to ... @o2ib

At first glance, these messages can look complicated.
However, the core meaning is simple:

A Lustre RDMA failure means that the high-speed communication path between a GPU server and a Lustre storage system has failed.

In most GPU cluster environments, Lustre storage is accessed over InfiniBand using RDMA.
So when this error occurs, it usually means there is a problem in the storage network path, not just a simple file error.


1. What Is Lustre? 📦

Lustre is a high-performance parallel file system commonly used in large-scale HPC and AI/GPU environments.

It allows many servers to access the same shared storage system at the same time with very high throughput.

For example, GPU servers may use Lustre-mounted paths such as:

/mnt/lustre
/mnt/ddn
/mnt/a2wbl
/home/user/data

Compared to regular storage:

Storage Type | Description
Local Disk | Disk installed directly inside one server
NFS | Shared network storage for general workloads
Lustre | High-performance parallel storage for large clusters

In AI training environments, Lustre is often used to store:

  • Training datasets
  • Checkpoints
  • Model outputs
  • Logs
  • Shared project files

2. What Is RDMA? ⚡

RDMA stands for Remote Direct Memory Access.

In simple terms:

RDMA allows one server to transfer data directly to another server’s memory with very little CPU involvement.

A normal network data flow looks like this:

Application → Operating System → CPU Processing → Network Card → Remote Server

With RDMA, the flow becomes much faster:

Server A Memory → RDMA / InfiniBand → Server B Memory

This is why RDMA is widely used in:

  • GPU clusters
  • AI model training
  • HPC workloads
  • Parallel storage systems
  • Low-latency network environments

3. What Does @o2ib Mean? 🌐

In Lustre logs or mount information, you may see something like this:

100.100.100.105@o2ib:/a2wbl /mnt/a2wbl lustre

The important part is:

@o2ib

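o2ib is the name of the LNet network type that Lustre uses for InfiniBand, so in practice it means Lustre over InfiniBand. The full string 100.100.100.105@o2ib is a Lustre network identifier (NID): an address, followed by @ and the network type.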

Term | Meaning
o2ib | Lustre communication over InfiniBand
InfiniBand | High-speed network used in HPC/GPU clusters
RDMA | Direct memory-to-memory data transfer
Lustre RDMA | Lustre storage traffic using RDMA

So when you see @o2ib, it means the Lustre client is communicating with the Lustre server through an InfiniBand/RDMA network path.
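To make the address@network structure concrete, here is a minimal shell sketch that splits a NID at the @ sign. The function name split_nid is hypothetical (it is not a Lustre tool), and the sketch assumes the simple address@network form shown above.

```shell
# split_nid is a hypothetical helper name, not part of Lustre.
# It splits a NID such as 100.100.100.105@o2ib into address and network.
split_nid() {
    nid=$1
    case $nid in
        *@*) ;;                                   # looks like address@network
        *)   echo "not a NID: $nid" >&2; return 1 ;;
    esac
    addr=${nid%%@*}   # everything before the first @
    net=${nid#*@}     # everything after the first @
    printf 'address=%s network=%s\n' "$addr" "$net"
}

# Example: split_nid 100.100.100.105@o2ib
```

If the network part is o2ib, the traffic to that address is expected to go over the InfiniBand/RDMA path.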


4. What Is a Lustre RDMA Failure? 🚨

A Lustre RDMA failure means:

GPU Server ↔ Lustre Storage Server RDMA communication failed

In simpler words:

The GPU server tried to read from or write to Lustre storage, but the high-speed RDMA communication path failed.

This failure can happen between:

GPU Server → Lustre Client → LNet → o2ib / RDMA → InfiniBand Switch → Lustre Storage Server

So the issue may be caused by:

  • The GPU server
  • The InfiniBand adapter
  • The InfiniBand switch
  • The RDMA driver
  • The Lustre client
  • The Lustre server
  • The DDN/storage backend

5. Simple Architecture View 🧭

A normal Lustre over RDMA path looks like this:

GPU Server → Lustre Client → LNet → o2ib / RDMA → InfiniBand Switch → DDN / Lustre Storage Server → MDT / OST

When a Lustre RDMA failure happens, the issue is commonly around this area:

LNet → o2ib / RDMA (❌ failure point) → InfiniBand Network

However, the actual root cause can still be on the client side, the network side, or the storage side.


6. What Are MDT and OST? 🧱

Lustre is made of several important components.

You may see names like:

MDT0006
OST0012
a2wbl02-MDT0006-mdc

Here is what they mean:

Component | Role | If It Has a Problem
MDT | Stores metadata such as file names, directories, and permissions | ls, stat, mkdir, rm may become slow
OST | Stores actual file data | File read/write may become slow or fail
MGS | Stores Lustre configuration information | Mount or configuration issues may occur
Client | GPU or compute server accessing Lustre | User sees file access issues

For example, if logs repeatedly mention MDT0006, the problem may affect directory listing or metadata operations.

If logs repeatedly mention an OST, the problem may affect actual file reading or writing.
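One way to see which targets the logs keep mentioning is to count them. The sketch below is only an illustration: count_targets is a hypothetical helper name, and it assumes target names follow the MDTxxxx / OSTxxxx pattern (four hexadecimal digits) seen in the examples above.

```shell
# count_targets is a hypothetical helper: it reads log text on stdin and
# prints each MDT/OST name, how often it appears, and what kind of
# operations it serves. Most frequently mentioned targets come first.
count_targets() {
    grep -oE '(MDT|OST)[0-9a-fA-F]{4}' | sort | uniq -c | sort -rn |
        awk '{ kind = ($2 ~ /^MDT/) ? "metadata" : "data"; print $2, $1, kind }'
}

# Example: dmesg -T | count_targets
```

A target that dominates the list is a reasonable place to start looking, although a high count alone does not prove that target is the root cause.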


7. Common Error Messages and Their Meanings 🧾

1) Request sent has timed out

Example:

ptlrpc_expire_one_request()
Request sent has timed out

Meaning:

The Lustre client sent a request to the storage server but did not receive a response within the expected time.

Simple explanation:

Client: “Please give me this file information.”
Storage: No response.
Client: Timeout.

2) connection lost

Example:

connection lost to 100.100.100.105@o2ib

Meaning:

The Lustre client lost its connection to the Lustre server through the InfiniBand/RDMA path.

This may indicate an issue with:

  • InfiniBand link
  • HCA (InfiniBand adapter)
  • Switch port
  • Lustre server
  • LNet/RDMA layer

3) rc = -5

Example:

LustreError: rc = -5

On Linux, error code 5 is EIO (input/output error). Kernel code reports errors as negative numbers, which is why the log shows -5.

Simple explanation:

The client experienced an input/output error while communicating with the Lustre storage system.
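Other negative rc values can be read the same way. The sketch below is a small lookup for a few error numbers; decode_rc is a hypothetical helper name, the table is deliberately incomplete, and the numbers are the generic Linux values (as used on x86-64 and arm64).

```shell
# decode_rc is a hypothetical helper covering only a handful of Linux
# errno values; it is not a complete table.
decode_rc() {
    case ${1#-} in       # ${1#-} strips a leading minus sign
        4)   echo "EINTR: interrupted system call" ;;
        5)   echo "EIO: input/output error" ;;
        11)  echo "EAGAIN: resource temporarily unavailable" ;;
        19)  echo "ENODEV: no such device" ;;
        107) echo "ENOTCONN: transport endpoint is not connected" ;;
        108) echo "ESHUTDOWN: cannot send after transport endpoint shutdown" ;;
        110) echo "ETIMEDOUT: connection timed out" ;;
        113) echo "EHOSTUNREACH: no route to host" ;;
        *)   echo "unknown: look up errno ${1#-}" ;;
    esac
}

# Example: decode_rc -5
```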


4) LNetError

Example:

LNetError

LNet is the network layer used by Lustre.

The simplified stack looks like this:

Lustre → LNet → o2ib → InfiniBand / RDMA

So LNetError usually means there is a problem in the Lustre networking layer.


8. Common Causes of Lustre RDMA Failure 🔍

1) InfiniBand Port Issue

If an InfiniBand port is not active, Lustre RDMA communication may fail.

Check with:

ibstat

A healthy state usually looks like:

State: Active
Physical state: LinkUp

A problematic state may look like:

State: Down
Physical state: Polling
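On a node with several adapters, reading the full ibstat output by eye is easy to get wrong. The sketch below filters it down to the ports that are not Active/LinkUp. check_ib_ports is a hypothetical helper name, and the parsing assumes the usual ibstat layout (a CA line, then Port N: blocks with State: followed by Physical state:).

```shell
# check_ib_ports is a hypothetical helper: it reads ibstat output on stdin
# and prints every port that is not Active/LinkUp.
# It returns non-zero if at least one such port was found.
check_ib_ports() {
    awk '
        # Adapter header line, e.g. CA mlx5_0 (quotes are stripped)
        $1 == "CA" && NF == 2 { ca = $2; gsub(/[^A-Za-z0-9_]/, "", ca) }
        # Port header line, e.g. Port 1:
        $1 == "Port" && $2 ~ /^[0-9]+:$/ { port = $2; sub(/:/, "", port) }
        $1 == "State:" { state = $2 }
        $1 == "Physical" && $2 == "state:" {
            if (state != "Active" || $3 != "LinkUp") {
                printf "%s port %s: state=%s physical=%s\n", ca, port, state, $3
                bad = 1
            }
        }
        END { exit bad + 0 }
    '
}

# Example: ibstat | check_ib_ports && echo "all IB ports healthy"
```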

2) Mellanox / NVIDIA HCA Issue

The InfiniBand adapter is often called an HCA (Host Channel Adapter).

If the HCA driver, firmware, or hardware has a problem, RDMA communication can fail.

Check kernel logs:

dmesg -T | grep -i mlx5

Important keywords to look for:

error
timeout
reset
port error
async error

3) InfiniBand Switch Issue

If many GPU nodes experience Lustre RDMA failures at the same time, the issue may not be on a single server.

In that case, you should suspect the shared network path:

Multiple GPU Servers → InfiniBand Switch → Lustre / DDN Storage

If multiple nodes show connection lost, timeout, or LNetError at the same time, the issue may be related to the InfiniBand fabric or storage backend.


4) Lustre Server or DDN Storage Delay

Even if the network is healthy, the Lustre server may be slow or overloaded.

For example:

  • MDT response delay
  • OST response delay
  • DDN controller issue
  • Storage backend overload
  • Too many simultaneous I/O requests

From the client side, this may appear as a timeout or RDMA failure.


5) Heavy I/O Load

Large AI workloads can generate massive I/O traffic.

Common examples include:

  • Many jobs reading the same dataset
  • Many nodes writing checkpoints at the same time
  • Too many small files
  • Heavy find, du, or ls operations
  • Large-scale distributed training jobs

This can overload Lustre metadata or data servers and cause timeouts.


9. What Symptoms Can Users See? 💥

When a Lustre RDMA failure occurs, users may experience:

Symptom | Description
ls command hangs | Directory metadata cannot be retrieved
df -h hangs | Filesystem status query waits for Lustre response
File read/write fails | Data cannot be accessed properly
Input/output error | I/O request failed
Notebook hangs | Home or shared storage path becomes unresponsive
Training job fails | Dataset loading or checkpoint writing fails
Slurm job stuck | Job may remain in COMPLETING or abnormal state
NCCL timeout | Storage delay may indirectly affect distributed training

Commands like these may hang during Lustre issues:

ls /mnt/lustre
df -h
du -sh /mnt/lustre/*
find /mnt/lustre

A safer way to test is to wrap the command in timeout, so it is stopped after a few seconds instead of hanging your shell:

timeout 5 ls /mnt/lustre
timeout 5 df -h /mnt/lustre
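The timeout approach can be wrapped in a small function that also tells you why the command stopped. This is a sketch under assumptions: probe is a hypothetical helper name, and it relies on GNU timeout, which exits with status 124 when the time limit is reached. A process blocked in uninterruptible I/O may not exit even when signalled, so treat the result as a hint rather than a guarantee.

```shell
# probe is a hypothetical helper: it runs a command under timeout and
# reports whether it finished, failed, or hit the time limit.
# Usage: probe SECONDS COMMAND [ARGS...]
probe() {
    secs=$1
    shift
    rc=0
    timeout "$secs" "$@" >/dev/null 2>&1 || rc=$?
    case $rc in
        0)   echo "OK: $*" ;;
        124) echo "TIMED OUT after ${secs}s: $*" ;;   # GNU timeout exit status on expiry
        *)   echo "FAILED (exit $rc): $*" ;;
    esac
    return $rc
}

# Example: probe 5 ls /mnt/lustre
```

A TIMED OUT result points to a hang (the storage did not answer), while FAILED means the command returned an error quickly, which is a different symptom.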

10. Single Node Issue vs Cluster-Wide Issue 🧪

The most important step is to identify the scope of impact.


Case 1: Only One Node Has the Issue

If only one GPU server shows the error, the issue may be local to that node.

Possible causes:

Cause | Description
IB port problem | InfiniBand link is not active
HCA issue | Adapter hardware, firmware, or driver problem
Cable issue | Physical link instability
Driver issue | mlx5 or RDMA driver problem
Kernel state issue | Temporary OS or module issue

Check commands:

hostname
ibstat
ibdev2netdev
dmesg -T | grep -i mlx5
dmesg -T | grep -Ei "lustre|lnet|o2ib|rdma|timeout"

Case 2: Many Nodes Have the Issue at the Same Time

If many servers report the same issue at the same time, you should suspect a shared component.

Possible causes:

Cause | Description
InfiniBand switch issue | Shared network path problem
IB fabric issue | Routing or link instability
Lustre server issue | MDT/OST not responding properly
DDN storage issue | Storage controller or backend problem
Heavy I/O storm | Too many clients accessing storage at once

In this case, rebooting individual nodes is usually not the first priority.
You should check the InfiniBand fabric and Lustre/DDN storage backend first.
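If you have already collected log lines from several nodes, a short script can tell the two cases apart. The sketch below is illustrative only: error_scope is a hypothetical helper name, and it assumes each input line is prefixed with the reporting host in the form "host: message" (for example, output gathered with a parallel shell tool).

```shell
# error_scope is a hypothetical helper. Input on stdin: log lines in the
# assumed form "host: message". It reports how many distinct hosts show
# Lustre/LNet error keywords.
error_scope() {
    grep -Ei 'lustreerror|lneterror|timed out|connection lost' |
        cut -d: -f1 | sort -u |
        awk '{ hosts = hosts (NR > 1 ? "," : "") $0 }
             END {
                 if (NR == 0)      print "no matching errors"
                 else if (NR == 1) print "single node: " hosts
                 else              print "multiple nodes (" NR "): " hosts
             }'
}

# Example: error_scope < collected-logs.txt   (file name is a placeholder)
```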


11. Useful Commands for Troubleshooting ✅

Check Lustre-related logs

dmesg -T | grep -Ei "lustre|lnet|o2ib|rdma|ptlrpc|timeout|connection"

Check recent kernel logs:

journalctl -k --since "1 hour ago" | grep -Ei "lustre|lnet|o2ib|rdma|ptlrpc|timeout|connection"

Check InfiniBand port status

ibstat

Expected healthy output:

State: Active
Physical state: LinkUp

Map IB devices to network interfaces

ibdev2netdev

Example:

mlx5_0 port 1 ==> ib0 (Up)
mlx5_1 port 1 ==> ib1 (Up)
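On nodes with many adapters, it can help to print only the links that are not up. A minimal sketch, assuming the line format shown above with the state in parentheses at the end (down_ib_links is a hypothetical helper name):

```shell
# down_ib_links is a hypothetical helper: it reads ibdev2netdev output on
# stdin and prints only the lines whose last field is not "(Up)".
down_ib_links() {
    awk 'NF > 0 && $NF != "(Up)"'
}

# Example: ibdev2netdev | down_ib_links
```

No output means every listed interface reported (Up).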

Check Mellanox/NVIDIA HCA logs

dmesg -T | grep -i mlx5

Or filter important keywords:

dmesg -T | grep -i mlx5 | grep -Ei "error|timeout|reset|port"

Check Lustre mount information

mount | grep lustre

Or:

df -hT | grep lustre

During a Lustre issue, df can hang, so this may be safer:

timeout 5 df -hT | grep lustre
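Another option is to read the kernel's mount table, which lists mounts without querying the storage servers. The sketch below assumes input in /proc/mounts format and a mount source of the form NID:/fsname, as in the earlier example; lustre_mounts is a hypothetical helper name.

```shell
# lustre_mounts is a hypothetical helper: it reads mount-table text in
# /proc/mounts format on stdin and prints "mountpoint server-NID fsname"
# for every entry whose filesystem type is lustre.
lustre_mounts() {
    awk '$3 == "lustre" {
        n = split($1, part, ":/")   # mount source looks like NID:/fsname
        print $2, part[1], part[n]
    }'
}

# Example: lustre_mounts < /proc/mounts
```

The NID printed here is the one you would expect to see in connection lost messages for that filesystem's management server.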

Check Lustre device list

lctl dl

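This shows the Lustre devices configured on the node. On a client, you will typically see client-side devices such as:

  • MGC (connection to the MGS)
  • MDC (connection to an MDT)
  • OSC (connection to an OST)

The MDT and OST devices themselves are listed when the command is run on the Lustre servers.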

Ping Lustre NID

If the log shows a target like:

100.100.100.105@o2ib

You can test it with:

lctl ping 100.100.100.105@o2ib

If this fails or times out, the Lustre LNet/RDMA path may have a problem.
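When several server NIDs appear in the logs, the same test can be repeated in a loop. This is a sketch, not a standard tool: ping_nids and the PING_CMD variable are hypothetical names, the 10-second limit is an arbitrary choice, and lctl may need root privileges on your system.

```shell
# ping_nids is a hypothetical helper: it pings each NID given as an
# argument and prints whether it answered.
# PING_CMD defaults to "lctl ping"; overriding it is only meant for dry runs.
ping_nids() {
    failed=0
    for nid in "$@"; do
        # PING_CMD is left unquoted on purpose so "lctl ping" splits into two words
        if timeout 10 ${PING_CMD:-lctl ping} "$nid" >/dev/null 2>&1; then
            echo "reachable:   $nid"
        else
            echo "unreachable: $nid"
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]   # non-zero return if any NID did not answer
}

# Example: ping_nids 100.100.100.105@o2ib
```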


12. Key Points to Check During Analysis 🔎

When analyzing a Lustre RDMA failure, check these five points:

1) Which node reported the error?

hostname

Determine whether the issue is isolated or widespread.


2) When did it happen?

Compare timestamps across multiple nodes.

Example:

Fri May 22 04:16:59 2026

If many nodes show errors at the same time, it may be a shared infrastructure issue.


3) Is the target MDT or OST?

Example:

MDT0006
OST0012

Target | Possible Impact
MDT | Directory listing, file creation, permission checks
OST | File read/write operations
MGS | Mount or configuration operations

4) Is the same IP repeated?

Example:

100.100.100.105@o2ib

If the same IP appears repeatedly, focus on that Lustre server or network path.
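Counting how often each NID appears makes the repetition obvious. A minimal sketch, assuming IPv4 addresses on an o2ib network as in the example (count_nids is a hypothetical helper name):

```shell
# count_nids is a hypothetical helper: it reads log text on stdin and prints
# each IPv4 o2ib NID followed by how many times it occurs, most frequent first.
count_nids() {
    grep -oE '[0-9]{1,3}(\.[0-9]{1,3}){3}@o2ib[0-9]*' |
        sort | uniq -c | sort -rn |
        awk '{ print $2, $1 }'
}

# Example: dmesg -T | count_nids
```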


5) Are the same error keywords repeated?

Important keywords include:

LustreError
LNetError
RDMA failure
connection lost
Request sent has timed out
rc = -5
o2ib

Repeated patterns are more meaningful than a single isolated log line.
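To check for repeated patterns quickly, you can count how many log lines contain each keyword. The sketch below is one possible way to do it; keyword_counts is a hypothetical helper name and the keyword list is just a subset of the one above.

```shell
# keyword_counts is a hypothetical helper: it reads log text on stdin and
# prints, for each keyword, the number of lines that contain it.
keyword_counts() {
    log=$(cat)   # keep the whole input so it can be scanned once per keyword
    for kw in 'LustreError' 'LNetError' 'connection lost' 'timed out' 'rc = -5'; do
        # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true"
        n=$(printf '%s\n' "$log" | grep -cF -- "$kw" || true)
        printf '%s: %s\n' "$kw" "$n"
    done
}

# Example: dmesg -T | keyword_counts
```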


13. Example Incident Report Statement 📌

You can use the following wording in an operations report:

The Lustre RDMA failure indicates a communication failure between the GPU node and the Lustre storage system over the o2ib/RDMA network path.

The client sent requests to the Lustre MDT or OST server, but the request either timed out or the connection became unstable at the LNet/RDMA layer.

If the issue occurs on a single node, the InfiniBand HCA, port, cable, and driver state of that node should be checked first. If the issue occurs across multiple nodes at the same time, a shared InfiniBand fabric or Lustre/DDN storage-side issue should be investigated.

14. Simple Analogy for Beginners 🚗

Think of Lustre storage as a large warehouse.

GPU Server = Factory
Lustre Storage = Warehouse
InfiniBand/RDMA = High-speed highway
File Request = Delivery request

Normal situation:

Factory → Highway → Warehouse
Data is delivered quickly.

Failure situation:

Factory → Highway ❌ → Warehouse
The road is blocked or unstable.

So the user may experience:

Files are slow to open
Directory listing hangs
Training jobs fail
Notebook becomes unresponsive
Input/output errors appear

In this analogy, a Lustre RDMA failure means that the high-speed road between the GPU server and storage system is not working properly.


15. Summary 🧠

Item | Explanation
Error Name | Lustre RDMA failure
Meaning | Failure in RDMA/InfiniBand communication between Lustre client and server
Main Path | GPU Server → LNet → o2ib → InfiniBand → Lustre Server
Common Logs | LustreError, LNetError, timeout, connection lost, rc=-5
Single Node Cause | IB port, HCA, cable, driver, local kernel issue
Multi-Node Cause | IB switch, fabric, Lustre server, DDN storage issue
User Impact | File access delay, I/O error, job failure, notebook hang
Useful Commands | ibstat, dmesg, lctl ping, lctl dl, mount

Conclusion ✅

A Lustre RDMA failure is not just a normal file error.

It usually means there is a problem in the high-speed communication path between the GPU server and the Lustre/DDN storage system.

If the error occurs on only one node, check that node’s:

  • InfiniBand port
  • HCA card
  • Cable
  • Driver
  • Kernel logs

If the error occurs on many nodes at the same time, investigate shared infrastructure such as:

  • InfiniBand switch
  • IB fabric
  • Lustre MDT/OST servers
  • DDN storage backend

In short:

A Lustre RDMA failure means the storage highway between your GPU servers and Lustre storage is unstable, delayed, or temporarily broken.



 
