Tuesday, April 01, 2025

NVIDIA GPU Direct Storage on OpenShift


Welcome to the NVIDIA GPU Direct Storage on OpenShift workflow. The goal of this workflow is to understand and configure NVIDIA GPU Direct Storage (GDS) for NVMe devices in the worker nodes of an OpenShift cluster.

What Is NVIDIA GPU Direct Storage?

GPU Direct Storage enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU. Using this direct path can relieve system bandwidth bottlenecks and decrease the latency and utilization load on the CPU.

Assumptions

This document assumes that an OpenShift cluster has already been deployed and that the operators required for GPU Direct Storage are installed: the Node Feature Discovery Operator (which should also be configured), the base installation of the NVIDIA Network Operator (no NicClusterPolicy yet) and the NVIDIA GPU Operator (no ClusterPolicy yet).
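
If you want to confirm those pieces are in place before proceeding, a quick look at the installed ClusterServiceVersions and the NodeFeatureDiscovery instance works well. The namespaces below are the defaults assumed in this workflow (openshift-nfd, nvidia-network-operator and nvidia-gpu-operator); adjust them if your operators were installed elsewhere.

$ oc get csv -n openshift-nfd
$ oc get csv -n nvidia-network-operator
$ oc get csv -n nvidia-gpu-operator
$ oc get nodefeaturediscovery -n openshift-nfd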

Considerations

If any of the NVMe devices in the system participate in either the operating system or other services (machine configs for LVM or other customized access), the NVMe kernel modules will not unload properly, even with the workaround described in this document. Any use of GDS requires that the NVMe drives are not in use during the deployment of the Network Operator, so that the operator can unload the in-tree drivers and load NVIDIA's out-of-tree drivers in their place.
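
A quick way to confirm the NVMe devices are idle before deploying the NicClusterPolicy is to check for mounts and holders on each worker node. The commands below are a sketch; substitute your own node and device names.

$ ssh core@<worker-node>
$ lsblk /dev/nvme0n1                 # should list no mountpoints
$ findmnt --source /dev/nvme0n1      # should return nothing
$ ls /sys/block/nvme0n1/holders/     # empty if no LVM/RAID sits on top of the device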

NVIDIA Network Operator Configuration

We assume the Network Operator has already been installed on the cluster, but the NicClusterPolicy still needs to be created. The following NicClusterPolicy example provides the configuration needed to ensure RDMA is properly enabled for NVMe. The key option in this policy is the ENABLE_NFSRDMA environment variable, which must be set to true. Note that the policy also optionally defines an rdmaSharedDevicePlugin and sets ENTRYPOINT_DEBUG to true for more verbose logging.

$ cat <<EOF > network-sharedrdma-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  nicFeatureDiscovery:
    image: nic-feature-discovery
    repository: ghcr.io/mellanox
    version: v0.0.1
  docaTelemetryService:
    image: doca_telemetry
    repository: nvcr.io/nvidia/doca
    version: 1.16.5-doca2.6.0-host
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ens1f0np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: 'sha256:9f468fdc4449e65e4772575f83aa85840a00f97165f9a00ba34695c91d610fbd'
  secondaryNetwork:
    ipoib:
      image: ipoib-cni
      repository: ghcr.io/mellanox
      version: v1.2.0
  nvIpam:
    enableWebhook: false
    image: nvidia-k8s-ipam
    repository: ghcr.io/mellanox
    version: v0.2.0
  ofedDriver:
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    forcePrecompiled: false
    terminationGracePeriodSeconds: 300
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
        podSelector: ''
      maxParallelUpgrades: 1
      safeLoad: false
      waitForCompletion:
        timeoutSeconds: 0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.01-0.6.0.0-0
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
    - name: ENABLE_NFSRDMA
      value: "true"
    - name: ENTRYPOINT_DEBUG
      value: 'true'
EOF

Before creating the NicClusterPolicy on the cluster, we need to prepare a script that lets us work around an issue with GPU Direct Storage in the NVIDIA Network Operator. The script, run right after the NicClusterPolicy is created, determines which nodes have mofed pods running on them and then, based on that node list, SSHes into each node as the core user and unloads the following modules: nvme, nvme_tcp, nvme_fabrics and nvme_core. By unloading the modules while the mofed container is still busy building the DOCA drivers, we eliminate an issue where the compiled drivers fail to load when the mofed container goes to install them. This issue is being investigated by NVIDIA.

$ cat <<'EOF' > nvme-fixer.sh
#!/bin/bash
### Set array of modules to be unloaded
declare -a modarr=("nvme" "nvme_tcp" "nvme_fabrics" "nvme_core")

### Determine which hosts have the mofed container running on them
declare -a hostarr=($(oc get pods -n nvidia-network-operator -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName --no-headers | grep mofed | awk '{print $2}'))

### Iterate through the modules on each host and unload them
for host in "${hostarr[@]}"
do
  echo "Unloading nvme dependencies on $host..."
  for module in "${modarr[@]}"
  do
    echo "Unloading module $module..."
    ssh core@"$host" sudo rmmod "$module"
  done
done
EOF

Change the execute bit on the script.

$ chmod +x nvme-fixer.sh

Now we are ready to create the NicClusterPolicy on the cluster and follow up immediately by running the nvme-fixer.sh script. Any rmmod "not currently loaded" errors can safely be ignored; the module simply was not loaded to begin with. In the example below, two worker nodes had mofed pods running on them, so the script unloaded the NVMe modules on each.

$ oc create -f network-sharedrdma-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created

$ ./nvme-fixer.sh
Unloading nvme dependencies on nvd-srv-22.nvidia.eng.rdu2.dc.redhat.com...
Unloading module nvme...
Unloading module nvme_tcp...
rmmod: ERROR: Module nvme_tcp is not currently loaded
Unloading module nvme_fabrics...
rmmod: ERROR: Module nvme_fabrics is not currently loaded
Unloading module nvme_core...
Unloading nvme dependencies on nvd-srv-23.nvidia.eng.rdu2.dc.redhat.com...
Unloading module nvme...
Unloading module nvme_tcp...
Unloading module nvme_fabrics...
Unloading module nvme_core...
$

Now we wait for the mofed pods to finish compiling and installing the GPU Direct Storage modules. We will know the process is complete when the pods are in a Running state, as shown below:

$ oc get pods -n nvidia-network-operator
NAME                                                          READY   STATUS    RESTARTS       AGE
kube-ipoib-cni-ds-5f8wk                                       1/1     Running   0              38s
kube-ipoib-cni-ds-956nv                                       1/1     Running   0              38s
kube-ipoib-cni-ds-jpbph                                       1/1     Running   0              38s
kube-ipoib-cni-ds-jwtw2                                       1/1     Running   0              38s
kube-ipoib-cni-ds-v4sb8                                       1/1     Running   0              38s
mofed-rhcos4.17-69fb4cd685-ds-j77vl                           2/2     Running   0              37s
mofed-rhcos4.17-69fb4cd685-ds-lw7t9                           2/2     Running   0              37s
nic-feature-discovery-ds-527wc                                1/1     Running   0              36s
nic-feature-discovery-ds-fnn9v                                1/1     Running   0              36s
nic-feature-discovery-ds-l9lkf                                1/1     Running   0              36s
nic-feature-discovery-ds-qn4m9                                1/1     Running   0              36s
nic-feature-discovery-ds-w7vw4                                1/1     Running   0              36s
nv-ipam-controller-67556c846b-c4sfq                           1/1     Running   0              36s
nv-ipam-controller-67556c846b-wvm59                           1/1     Running   0              36s
nv-ipam-node-22rw9                                            1/1     Running   0              36s
nv-ipam-node-6w4x4                                            1/1     Running   0              36s
nv-ipam-node-f2p96                                            1/1     Running   0              36s
nv-ipam-node-jssjh                                            1/1     Running   0              36s
nv-ipam-node-z2mws                                            1/1     Running   0              36s
nvidia-network-operator-controller-manager-57c7cfddc8-6nw6j   1/1     Running   16 (10h ago)   14d
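
If you would rather follow the rollout than poll, watching the pods and tailing the mofed pod logs is another option. The mofed-container name below is an assumption based on this release of the Network Operator's driver DaemonSet and may differ in others.

$ oc get pods -n nvidia-network-operator -w
$ oc logs -n nvidia-network-operator mofed-rhcos4.17-69fb4cd685-ds-j77vl -c mofed-container -f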

We can validate that things look correct from a module perspective by logging into one of the nodes, either via SSH or a debug pod, and listing the NVMe modules. The results should look like the output below. Note that I also ran lsblk to show that my NVMe device is visible.

$ ssh core@nvd-srv-23.nvidia.eng.rdu2.dc.redhat.com
Red Hat Enterprise Linux CoreOS 417.94.202502051822-0
  Part of OpenShift 4.17, RHCOS is a Kubernetes-native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.17/architecture/architecture-rhcos.html

Last login: Fri Mar 21 17:48:41 2025 from 10.22.81.26
[systemd]
Failed Units: 1
  NetworkManager-wait-online.service

[core@nvd-srv-23 ~]$ sudo bash
[root@nvd-srv-23 core]# lsmod|grep nvme
nvme_rdma              57344  0
nvme_fabrics           45056  1 nvme_rdma
nvme                   73728  0
nvme_core             204800  3 nvme,nvme_rdma,nvme_fabrics
rdma_cm               155648  3 rpcrdma,nvme_rdma,rdma_ucm
ib_core               557056  10 rdma_cm,ib_ipoib,rpcrdma,nvme_rdma,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
mlx_compat             20480  17 rdma_cm,ib_ipoib,mlxdevm,rpcrdma,nvme,nvme_rdma,mlxfw,iw_cm,nvme_core,nvme_fabrics,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
nvme_common            24576  0
t10_pi                 24576  2 sd_mod,nvme_core

[root@nvd-srv-23 core]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda       8:0    0   1.5T  0 disk
├─sda1    8:1    0     1M  0 part
├─sda2    8:2    0   127M  0 part
├─sda3    8:3    0   384M  0 part /boot
└─sda4    8:4    0   1.5T  0 part /var
                                  /sysroot/ostree/deploy/rhcos/var
                                  /usr
                                  /etc
                                  /
                                  /sysroot
sdb       8:16   0   1.5T  0 disk
sdc       8:32   0   1.5T  0 disk
sdd       8:48   0   1.5T  0 disk
nvme0n1 259:1    0 894.2G  0 disk

This completes the NVIDIA Network Operator portion of the configuration for GPU Direct Storage.

NVIDIA GPU Operator Configuration

Now that the NicClusterPolicy is defined and the proper NVMe modules have been loaded, we can move on to configuring the GPU operator's ClusterPolicy. The example below is a policy that enables GPU Direct Storage on the worker nodes that have a supported NVIDIA GPU.

$ cat <<EOF > gpu-cluster-policy.yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    serviceMonitor:
      enabled: true
    enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      nlsEnabled: true
      configMapName: ''
    certConfig:
      name: ''
    rdma:
      enabled: true
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    virtualTopology:
      config: ''
    enabled: true
    useNvidiaDriverCRD: false
    useOpenKernelModules: true
  devicePlugin:
    config:
      name: ''
      default: ''
    mps:
      root: /run/nvidia/mps
    enabled: true
  gdrcopy:
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
      - name: WITH_WORKLOAD
        value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: true
    image: 'nvcr.io/nvidia/cloud-native/nvidia-fs:2.20.5'
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    installDir: /usr/local/nvidia
    enabled: true
EOF

Now let's create the policy on the cluster.

$ oc create -f gpu-cluster-policy.yaml
clusterpolicy.nvidia.com/gpu-cluster-policy created

Once the policy is created, let's validate that the pods are running before we move on to the next step.

$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-499wh                           1/1     Running     0             18h
gpu-feature-discovery-m68bn                           1/1     Running     0             18h
gpu-operator-c9ccd586d-htl5q                          1/1     Running     0             19h
nvidia-container-toolkit-daemonset-8m4r5              1/1     Running     0             18h
nvidia-container-toolkit-daemonset-ld7qz              1/1     Running     0             18h
nvidia-cuda-validator-fddq7                           0/1     Completed   0             18h
nvidia-cuda-validator-mdk6b                           0/1     Completed   0             18h
nvidia-dcgm-565tj                                     1/1     Running     0             18h
nvidia-dcgm-exporter-jtgt6                            1/1     Running     1 (18h ago)   18h
nvidia-dcgm-exporter-znpgh                            1/1     Running     1 (18h ago)   18h
nvidia-dcgm-xpxbx                                     1/1     Running     0             18h
nvidia-device-plugin-daemonset-2vn52                  1/1     Running     0             18h
nvidia-device-plugin-daemonset-kjzjz                  1/1     Running     0             18h
nvidia-driver-daemonset-417.94.202502051822-0-pj7hk   5/5     Running     2 (18h ago)   18h
nvidia-driver-daemonset-417.94.202502051822-0-qp8xb   5/5     Running     5 (18h ago)   18h
nvidia-node-status-exporter-48cx7                     1/1     Running     0             18h
nvidia-node-status-exporter-dpmsr                     1/1     Running     0             18h
nvidia-operator-validator-fmcz4                       1/1     Running     0             18h
nvidia-operator-validator-g2fbt                       1/1     Running     0             18h
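
Besides checking the pods, the ClusterPolicy resource itself reports an overall state once the GPU Operator has finished reconciling; as a rough additional check, the command below should return ready.

$ oc get clusterpolicies.nvidia.com gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'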

With the NVIDIA GPU Operator pods running, we can rsh into one of the driver daemonset pods and confirm GDS is enabled by running the lsmod command (note the nvidia_fs module) and by reading the /proc/driver/nvidia-fs/stats file.

$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-417.94.202502051822-0-pj7hk
sh-4.4# lsmod|grep nvidia
nvidia_fs             327680  0
nvidia_peermem         24576  0
nvidia_modeset       1507328  0
video                  73728  1 nvidia_modeset
nvidia_uvm           6889472  8
nvidia               8810496  43 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
ib_uverbs             217088  19 nvidia_peermem,rdma_ucm,mlx5_ib
drm                   741376  5 drm_kms_helper,drm_shmem_helper,nvidia,mgag200

$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-417.94.202502051822-0-pj7hk
sh-4.4# cat /proc/driver/nvidia-fs/stats
GDS Version: 1.10.0.4
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.20.5)
Mellanox PeerDirect Supported: True
IO stats: Disabled, peer IO stats: Disabled
Logging level: info

Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads        : err=0 io_state_err=0
Sparse Reads : n=0 io=0 holes=0 pages=0
Writes       : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap         : n=0 ok=0 err=0 munmap=0
Bar1-map     : n=0 ok=0 err=0 free=0 callbacks=0 active=0 delay-frees=0
Error        : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops          : Read=0 Write=0 BatchIO=0

If everything looks good, we can move on to an additional step to confirm GDS is ready for workload consumption.

GDS CUDA Workload Container

Once the GPU Direct Storage drivers are loaded, we can use one more tool to check and confirm GDS capability. This involves running a container image that contains the CUDA packages on one of the nodes. The following pod YAML defines this configuration.

$ cat <<EOF > gds-check-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gds-check-workload
  namespace: default
spec:
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.3
    name: gds-check-workload
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    volumeMounts:
    - name: udev
      mountPath: /run/udev
    - name: kernel-config
      mountPath: /sys/kernel/config
    - name: dev
      mountPath: /run/dev
    - name: sys
      mountPath: /sys
    - name: results
      mountPath: /results
    - name: lib
      mountPath: /lib/modules
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
  volumes:
  - name: udev
    hostPath:
      path: /run/udev
  - name: kernel-config
    hostPath:
      path: /sys/kernel/config
  - name: dev
    hostPath:
      path: /run/dev
  - name: sys
    hostPath:
      path: /sys
  - name: results
    hostPath:
      path: /results
  - name: lib
    hostPath:
      path: /lib/modules
EOF

Now let's create a service account to use in the default namespace.

$ cat <<EOF > default-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rdma
  namespace: default
EOF

Next we can create it on our cluster.

$ oc create -f default-serviceaccount.yaml
serviceaccount/rdma created

Finally, with the service account created, we can add privileges to it.

$ oc -n default adm policy add-scc-to-user privileged -z rdma
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "rdma"

With the service account defined and our pod YAML ready, we can create the pod on the cluster.

$ oc create -f gds-check-workload.yaml
pod/gds-check-workload created

$ oc get pods
NAME                 READY   STATUS    RESTARTS   AGE
gds-check-workload   1/1     Running   0          3s

Once the pod is up and running, we can rsh into it and run the gdscheck tool to confirm the capabilities and configuration of GPU Direct Storage.

$ oc rsh gds-check-workload
sh-5.1# /usr/local/cuda/gds/tools/gdscheck -p
 GDS release version: 1.13.1.3
 nvidia_fs version:  2.20 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe P2PDMA        : Unsupported
 NVMe               : Supported
 NVMeOF             : Supported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Supported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_pci_p2pdma : false
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 64
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 fs.gpfs.gds_async_support: true
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA A40 bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
 ==============
 PLATFORM INFO:
 ==============
 Found ACS enabled for switch 0000:e0:01.0
 IOMMU: Pass-through or enabled
 WARN: GDS is not guaranteed to work functionally or in a performant way with iommu=on/pt
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed: 12040
 Platform: PowerEdge R760xa, Arch: x86_64(Linux 5.14.0-427.50.1.el9_4.x86_64)
 Platform verification succeeded

Hopefully this provides enough detail to enable GPU Direct Storage on OpenShift.