What Is NVIDIA GPU Direct Storage?
GPU Direct Storage enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU. Using this direct path can relieve system bandwidth bottlenecks and decrease the latency and utilization load on the CPU.
Assumptions
This document assumes that an OpenShift cluster has already been deployed and that the operators required for GPU Direct Storage are installed: the Node Feature Discovery Operator (already configured), along with the base installations of the NVIDIA Network Operator (no NicClusterPolicy yet) and the NVIDIA GPU Operator (no ClusterPolicy yet).
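If you want to confirm those operators are in place before proceeding, listing the installed ClusterServiceVersions is a quick sanity check (a sketch; operator namespaces and package names may differ in your environment):
$ oc get csv -A | grep -iE 'nfd|node-feature|network-operator|gpu-operator'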
Considerations
If any of the NVMe devices in the system are used by the operating system or by other services (machine configs for LVM or other customized access), the NVMe kernel modules cannot be unloaded cleanly, even with the workaround described in this document. GDS requires that the NVMe drives not be in use while the Network Operator is being deployed, so that the operator can unload the in-tree drivers and load NVIDIA's out-of-tree drivers in their place.
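A quick way to confirm the NVMe drives are idle before deploying the Network Operator is to check that they expose no mountpoints or holders (a sketch; the node and device names below are examples):
$ ssh core@<node> "lsblk -o NAME,TYPE,MOUNTPOINTS /dev/nvme0n1; ls /sys/block/nvme0n1/holders/"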
NVIDIA Network Operator Configuration
We assume the Network Operator has already been installed on the cluster but the NicClusterPolicy still needs to be created. The following NicClusterPolicy example provides the configuration needed to ensure RDMA is properly enabled for NVMe. The key option in this policy is the ENABLE_NFSRDMA variable, which must be set to true. Note that this policy also optionally defines an rdmaSharedDevicePlugin resource and sets ENTRYPOINT_DEBUG to true for more verbose logging.
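One thing to verify before applying the policy is that the ifNames selector matches the actual ConnectX interface name on your worker nodes (ens1f0np0 is specific to this lab). A quick way to check is with a debug pod (a sketch; substitute your own node name):
$ oc debug node/<worker-node> -- chroot /host ip -br link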
$ cat <<EOF > network-sharedrdma-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  nicFeatureDiscovery:
    image: nic-feature-discovery
    repository: ghcr.io/mellanox
    version: v0.0.1
  docaTelemetryService:
    image: doca_telemetry
    repository: nvcr.io/nvidia/doca
    version: 1.16.5-doca2.6.0-host
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ens1f0np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: 'sha256:9f468fdc4449e65e4772575f83aa85840a00f97165f9a00ba34695c91d610fbd'
  secondaryNetwork:
    ipoib:
      image: ipoib-cni
      repository: ghcr.io/mellanox
      version: v1.2.0
  nvIpam:
    enableWebhook: false
    image: nvidia-k8s-ipam
    repository: ghcr.io/mellanox
    version: v0.2.0
  ofedDriver:
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    forcePrecompiled: false
    terminationGracePeriodSeconds: 300
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
        podSelector: ''
      maxParallelUpgrades: 1
      safeLoad: false
      waitForCompletion:
        timeoutSeconds: 0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.01-0.6.0.0-0
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
    - name: ENABLE_NFSRDMA
      value: "true"
    - name: ENTRYPOINT_DEBUG
      value: 'true'
EOF
Before creating the NicClusterPolicy on the cluster, we need to prepare a script that works around an issue with GPU Direct Storage in the NVIDIA Network Operator. When run right after the NicClusterPolicy is created, the script determines which nodes have mofed pods running on them, SSHes into each of those nodes as the core user, and unloads the following modules: nvme, nvme_tcp, nvme_fabrics and nvme_core. Unloading the modules while the mofed container is busy building the DOCA drivers avoids a failure that otherwise occurs when the container tries to load the compiled drivers. This issue is being investigated by NVIDIA.
$ cat <<'EOF' > nvme-fixer.sh
#!/bin/bash
### Set array of modules to be unloaded
declare -a modarr=("nvme" "nvme_tcp" "nvme_fabrics" "nvme_core")
### Determine which hosts have a mofed container running on them
declare -a hostarr=($(oc get pods -n nvidia-network-operator -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName --no-headers | grep mofed | awk '{print $2}'))
### Iterate through the modules on each host and unload them
for host in "${hostarr[@]}"
do
  echo "Unloading nvme dependencies on $host..."
  for module in "${modarr[@]}"
  do
    echo "Unloading module $module..."
    ssh core@$host sudo rmmod $module
  done
done
EOF
Change the execute bit on the script.
$ chmod +x nvme-fixer.sh
Now we are ready to create the NicClusterPolicy on the cluster and immediately follow it up by running the nvme-fixer.sh script. Any rmmod "not currently loaded" errors can safely be ignored; they simply mean the module was not loaded to begin with. In the example below, two worker nodes had mofed pods running on them, so the script unloaded the nvme modules on each.
$ oc create -f network-sharedrdma-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created
$ ./nvme-fixer.sh
Unloading nvme dependencies on nvd-srv-22.nvidia.eng.rdu2.dc.redhat.com...
Unloading module nvme...
Unloading module nvme_tcp...
rmmod: ERROR: Module nvme_tcp is not currently loaded
Unloading module nvme_fabrics...
rmmod: ERROR: Module nvme_fabrics is not currently loaded
Unloading module nvme_core...
Unloading nvme dependencies on nvd-srv-23.nvidia.eng.rdu2.dc.redhat.com...
Unloading module nvme...
Unloading module nvme_tcp...
Unloading module nvme_fabrics...
Unloading module nvme_core...
$
Now we wait for the mofed pods to finish compiling and installing the GPU Direct Storage modules. We will know it is complete when the pods are in a Running state like below:
$ oc get pods -n nvidia-network-operator
NAME READY STATUS RESTARTS AGE
kube-ipoib-cni-ds-5f8wk 1/1 Running 0 38s
kube-ipoib-cni-ds-956nv 1/1 Running 0 38s
kube-ipoib-cni-ds-jpbph 1/1 Running 0 38s
kube-ipoib-cni-ds-jwtw2 1/1 Running 0 38s
kube-ipoib-cni-ds-v4sb8 1/1 Running 0 38s
mofed-rhcos4.17-69fb4cd685-ds-j77vl 2/2 Running 0 37s
mofed-rhcos4.17-69fb4cd685-ds-lw7t9 2/2 Running 0 37s
nic-feature-discovery-ds-527wc 1/1 Running 0 36s
nic-feature-discovery-ds-fnn9v 1/1 Running 0 36s
nic-feature-discovery-ds-l9lkf 1/1 Running 0 36s
nic-feature-discovery-ds-qn4m9 1/1 Running 0 36s
nic-feature-discovery-ds-w7vw4 1/1 Running 0 36s
nv-ipam-controller-67556c846b-c4sfq 1/1 Running 0 36s
nv-ipam-controller-67556c846b-wvm59 1/1 Running 0 36s
nv-ipam-node-22rw9 1/1 Running 0 36s
nv-ipam-node-6w4x4 1/1 Running 0 36s
nv-ipam-node-f2p96 1/1 Running 0 36s
nv-ipam-node-jssjh 1/1 Running 0 36s
nv-ipam-node-z2mws 1/1 Running 0 36s
nvidia-network-operator-controller-manager-57c7cfddc8-6nw6j 1/1 Running 16 (10h ago) 14d
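Rather than polling oc get pods, you can also block until the driver pods report Ready. The label selector below is an assumption about the labels the mofed DaemonSet pods carry; confirm the right label with oc get pods -n nvidia-network-operator --show-labels first:
$ oc wait pod -l nvidia.com/ofed-driver -n nvidia-network-operator --for=condition=Ready --timeout=20m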
We can validate that things look correct from a module perspective by logging into one of the nodes, either via SSH or a debug pod, and listing the nvme modules. The results should look like the output below. Note that I also ran lsblk to show that my NVMe device is visible.
$ ssh core@nvd-srv-23.nvidia.eng.rdu2.dc.redhat.com
Red Hat Enterprise Linux CoreOS 417.94.202502051822-0
Part of OpenShift 4.17, RHCOS is a Kubernetes-native operating system
managed by the Machine Config Operator (`clusteroperator/machine-config`).
WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
https://docs.openshift.com/container-platform/4.17/architecture/architecture-rhcos.html
Last login: Fri Mar 21 17:48:41 2025 from 10.22.81.26
[systemd]
Failed Units: 1
NetworkManager-wait-online.service
[core@nvd-srv-23 ~]$ sudo bash
[root@nvd-srv-23 core]# lsmod|grep nvme
nvme_rdma 57344 0
nvme_fabrics 45056 1 nvme_rdma
nvme 73728 0
nvme_core 204800 3 nvme,nvme_rdma,nvme_fabrics
rdma_cm 155648 3 rpcrdma,nvme_rdma,rdma_ucm
ib_core 557056 10 rdma_cm,ib_ipoib,rpcrdma,nvme_rdma,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
mlx_compat 20480 17 rdma_cm,ib_ipoib,mlxdevm,rpcrdma,nvme,nvme_rdma,mlxfw,iw_cm,nvme_core,nvme_fabrics,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
nvme_common 24576 0
t10_pi 24576 2 sd_mod,nvme_core
[root@nvd-srv-23 core]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 1.5T 0 disk
├─sda1 8:1 0 1M 0 part
├─sda2 8:2 0 127M 0 part
├─sda3 8:3 0 384M 0 part /boot
└─sda4 8:4 0 1.5T 0 part /var
/sysroot/ostree/deploy/rhcos/var
/usr
/etc
/
/sysroot
sdb 8:16 0 1.5T 0 disk
sdc 8:32 0 1.5T 0 disk
sdd 8:48 0 1.5T 0 disk
nvme0n1 259:1 0 894.2G 0 disk
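If SSH access to the node is not convenient, the same check can be run through a debug pod instead (the node name here is from this example environment):
$ oc debug node/nvd-srv-23.nvidia.eng.rdu2.dc.redhat.com -- chroot /host sh -c 'lsmod | grep nvme; lsblk'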
This completes the NVIDIA Network Operator portion of the configuration for GPU Direct Storage.
NVIDIA GPU Operator Configuration
Now that the NicClusterPolicy is defined and the proper NVMe modules have been loaded, we can move on to configuring our GPU ClusterPolicy. The example below enables GPU Direct Storage on the worker nodes that have a supported NVIDIA GPU.
$ cat <<EOF > gpu-cluster-policy.yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    serviceMonitor:
      enabled: true
    enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      nlsEnabled: true
      configMapName: ''
    certConfig:
      name: ''
    rdma:
      enabled: true
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    virtualTopology:
      config: ''
    enabled: true
    useNvidiaDriverCRD: false
    useOpenKernelModules: true
  devicePlugin:
    config:
      name: ''
      default: ''
    mps:
      root: /run/nvidia/mps
    enabled: true
  gdrcopy:
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
      - name: WITH_WORKLOAD
        value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: true
    image: 'nvcr.io/nvidia/cloud-native/nvidia-fs:2.20.5'
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    installDir: /usr/local/nvidia
    enabled: true
EOF
Now let's create the policy on the cluster.
$ oc create -f gpu-cluster-policy.yaml
clusterpolicy.nvidia.com/gpu-cluster-policy created
Once the policy is created, let's validate that the pods are running before we move on to the next step.
$ oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-499wh 1/1 Running 0 18h
gpu-feature-discovery-m68bn 1/1 Running 0 18h
gpu-operator-c9ccd586d-htl5q 1/1 Running 0 19h
nvidia-container-toolkit-daemonset-8m4r5 1/1 Running 0 18h
nvidia-container-toolkit-daemonset-ld7qz 1/1 Running 0 18h
nvidia-cuda-validator-fddq7 0/1 Completed 0 18h
nvidia-cuda-validator-mdk6b 0/1 Completed 0 18h
nvidia-dcgm-565tj 1/1 Running 0 18h
nvidia-dcgm-exporter-jtgt6 1/1 Running 1 (18h ago) 18h
nvidia-dcgm-exporter-znpgh 1/1 Running 1 (18h ago) 18h
nvidia-dcgm-xpxbx 1/1 Running 0 18h
nvidia-device-plugin-daemonset-2vn52 1/1 Running 0 18h
nvidia-device-plugin-daemonset-kjzjz 1/1 Running 0 18h
nvidia-driver-daemonset-417.94.202502051822-0-pj7hk 5/5 Running 2 (18h ago) 18h
nvidia-driver-daemonset-417.94.202502051822-0-qp8xb 5/5 Running 5 (18h ago) 18h
nvidia-node-status-exporter-48cx7 1/1 Running 0 18h
nvidia-node-status-exporter-dpmsr 1/1 Running 0 18h
nvidia-operator-validator-fmcz4 1/1 Running 0 18h
nvidia-operator-validator-g2fbt 1/1 Running 0 18h
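Another quick check is the ClusterPolicy status itself, which the GPU Operator sets to ready once all of its components, including GDS, have been deployed (the status.state field name is an assumption based on recent GPU Operator releases):
$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'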
With the NVIDIA GPU Operator pods running, we can rsh into the driver daemonset pods and confirm GDS is enabled by running the lsmod command (note the nvidia_fs module) and by viewing the /proc/driver/nvidia-fs/stats file.
$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-417.94.202502051822-0-pj7hk
sh-4.4# lsmod|grep nvidia
nvidia_fs 327680 0
nvidia_peermem 24576 0
nvidia_modeset 1507328 0
video 73728 1 nvidia_modeset
nvidia_uvm 6889472 8
nvidia 8810496 43 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
ib_uverbs 217088 19 nvidia_peermem,rdma_ucm,mlx5_ib
drm 741376 5 drm_kms_helper,drm_shmem_helper,nvidia,mgag200
$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-417.94.202502051822-0-pj7hk
sh-4.4# cat /proc/driver/nvidia-fs/stats
GDS Version: 1.10.0.4
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.20.5)
Mellanox PeerDirect Supported: True
IO stats: Disabled, peer IO stats: Disabled
Logging level: info
Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads : err=0 io_state_err=0
Sparse Reads : n=0 io=0 holes=0 pages=0
Writes : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap : n=0 ok=0 err=0 munmap=0
Bar1-map : n=0 ok=0 err=0 free=0 callbacks=0 active=0 delay-frees=0
Error : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops : Read=0 Write=0 BatchIO=0
If everything looks good, we can move on to an additional step to confirm GDS is ready for workload consumption.
GDS CUDA Workload Container
Once the GPU Direct Storage drivers are loaded, we can use one more tool to confirm GDS capability. This involves building a container that includes the CUDA packages and then running it on a node. The following pod YAML defines this configuration.
$ cat <<EOF > gds-check-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gds-check-workload
  namespace: default
spec:
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.3
    name: gds-check-workload
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    volumeMounts:
    - name: udev
      mountPath: /run/udev
    - name: kernel-config
      mountPath: /sys/kernel/config
    - name: dev
      mountPath: /run/dev
    - name: sys
      mountPath: /sys
    - name: results
      mountPath: /results
    - name: lib
      mountPath: /lib/modules
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
  volumes:
  - name: udev
    hostPath:
      path: /run/udev
  - name: kernel-config
    hostPath:
      path: /sys/kernel/config
  - name: dev
    hostPath:
      path: /run/dev
  - name: sys
    hostPath:
      path: /sys
  - name: results
    hostPath:
      path: /results
  - name: lib
    hostPath:
      path: /lib/modules
EOF
Now let's generate a service account to use in the default namespace.
$ cat <<EOF > default-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rdma
  namespace: default
EOF
Next we can create it on our cluster.
$ oc create -f default-serviceaccount.yaml
serviceaccount/rdma created
Finally, with the service account created, we can add privileges to it.
$ oc -n default adm policy add-scc-to-user privileged -z rdma
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "rdma"
With the service account defined and our pod YAML ready, we can create the pod on the cluster.
$ oc create -f gds-check-workload.yaml
pod/gds-check-workload created
$ oc get pods
NAME READY STATUS RESTARTS AGE
gds-check-workload 1/1 Running 0 3s
Once the pod is up and running, we can rsh into the pod and run the gdscheck tool to confirm the capabilities and configuration of GPU Direct Storage.
$ oc rsh gds-check-workload
sh-5.1# /usr/local/cuda/gds/tools/gdscheck -p
GDS release version: 1.13.1.3
nvidia_fs version: 2.20 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe P2PDMA : Unsupported
NVMe : Supported
NVMeOF : Supported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Supported
BeeGFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Enabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_pci_p2pdma : false
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 64
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
fs.gpfs.gds_async_support: true
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 4
execution.max_io_queue_depth : 128
execution.parallel_io : true
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 4
properties.force_odirect_mode : false
properties.prefer_iouring : false
=========
GPU INFO:
=========
GPU index 0 NVIDIA A40 bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
==============
PLATFORM INFO:
==============
Found ACS enabled for switch 0000:e0:01.0
IOMMU: Pass-through or enabled
WARN: GDS is not guaranteed to work functionally or in a performant way with iommu=on/pt
Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
Cuda Driver Version Installed: 12040
Platform: PowerEdge R760xa, Arch: x86_64(Linux 5.14.0-427.50.1.el9_4.x86_64)
Platform verification succeeded
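As an optional final step, the gdsio benchmark that ships alongside gdscheck can push real I/O through the GDS path. The example below is only a sketch: /mnt/nvme/gdsio-test is a hypothetical file on a filesystem backed by the NVMe device, and the flag meanings (-x 0 for GPU Direct transfers, -I 1 for writes, -d for the GPU index) should be confirmed against gdsio -h for your CUDA version.
sh-5.1# /usr/local/cuda/gds/tools/gdsio -f /mnt/nvme/gdsio-test -d 0 -w 4 -s 1G -i 1M -x 0 -I 1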
Hopefully this provides enough detail to enable GPU Direct Storage on OpenShift.