
Friday, July 11, 2025

NVIDIA GPU Direct Storage on OpenShift

GPU Direct Storage enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU. Using this direct path can relieve system bandwidth bottlenecks and decrease the latency and utilization load on the CPU. GPU Direct Storage can be used with NVMe or even with NFS on a NetApp filer, the latter of which this blog covers.

Workflow

This blog is laid out in the following sections, each of which builds on the previous one, to reach the goal of working GPU Direct Storage over NFS.

  • Assumptions
  • Considerations
  • Architecture
  • SRIOV Operator Configuration
  • NetApp VServer Setup
  • NetApp Trident CSI Operator Configuration
  • NVIDIA Network Operator Configuration
  • NVIDIA GPU Operator Configuration
  • GDS Cuda Workload Container

Assumptions

This document assumes that we have already deployed an OpenShift cluster and installed the operators required for GPU Direct Storage. Those operators are Node Feature Discovery (which should also be configured), the base installation of the NVIDIA Network Operator (no NicClusterPolicy yet), the NVIDIA GPU Operator (no GPU ClusterPolicy yet), the SRIOV Operator (no SRIOV policies or instances) and the Trident CSI Operator (no orchestrators or backends configured yet).

Considerations

If any of the NVMe devices in the system are used by the operating system or by other services (machine configs for LVMs or other customized access), the nvme kernel modules will not be able to unload properly, even with the workaround defined in this documentation. Any use of GDS requires that the NVMe drives are not in use during the deployment of the Network Operator, so that the Network Operator can unload the in-tree drivers and then load NVIDIA's out-of-tree drivers in their place.

Architecture

Below is a diagram of how the environment was architected from a networking perspective.

SRIOV Operator Configuration

For GPU Direct Storage over NFS to make performance sense we will need to use SRIOV here. So we first need to configure the SRIOV Operator assuming the SRIOV Operator is installed. The first step is to generate a basic SriovOperatorConfig custom resource file.

$ cat <<EOF > sriov-operator-config.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  enableInjector: true
  enableOperatorWebhook: true
  logLevel: 2
EOF

Next we create the SriovOperatorConfig on the cluster.

$ oc create -f sriov-operator-config.yaml
sriovoperatorconfig.sriovnetwork.openshift.io/default created

Now one key step here is to patch the SriovOperatorConfig so that it is aware of the NVIDIA Network Operator.

$ oc patch sriovoperatorconfig default --type=merge -n openshift-sriov-network-operator --patch '{ "spec": { "configDaemonNodeSelector": { "network.nvidia.com/operator.mofed.wait": "false", "node-role.kubernetes.io/worker": "", "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true" } } }'
sriovoperatorconfig.sriovnetwork.openshift.io/default patched

Now we can move onto generating a SriovNetworkNodePolicy which will define the interface that we want to have VFs. In the case of multiple interfaces we would want to create multiple SriovNetworkNodePolicy files. The example below demonstrates how to configure an interface with an MTU of 9000 and generate 8 VFs.

$ cat <<EOF > sriov-network-node-policy.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-legacy-policy
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 9000
  nicSelector:
    vendor: "15b3"
    pfNames: ["enp55s0np0#0-7"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriovlegacy
EOF

With the SriovNetworkNodePolicy generated we can create it on the cluster which will cause the worker nodes where it is applied to reboot.

$ oc create -f sriov-network-node-policy.yaml
sriovnetworknodepolicy.sriovnetwork.openshift.io/sriov-legacy-policy created
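
To make the relationship between pfNames and numVfs concrete, here is a small sketch that counts the VFs selected by a pfNames entry. It assumes the name#first-last form used in the policy above. The vf_count helper is hypothetical and is not part of the SRIOV Operator.

```bash
#!/bin/bash
# Hypothetical helper: count the VFs selected by a pfNames entry
# written as "<pf-name>#<first>-<last>".
vf_count() {
  case "$1" in
    *'#'*-*) ;;                       # has a "#first-last" suffix
    *) echo "no VF range in '$1'" >&2; return 1 ;;
  esac
  local range=${1#*\#}                # e.g. "0-7"
  local first=${range%-*} last=${range#*-}
  echo $(( last - first + 1 ))
}

# The entry from the policy above; the count should match numVfs.
vf_count 'enp55s0np0#0-7'
```

If the printed count is lower than numVfs, the policy only claims part of the VFs on that physical function, which can be intentional when several policies share one interface.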

Once the node has rebooted we can optionally open a debug pod on the worker nodes and use ip link to confirm the VF interfaces were created. If we are ready to move forward we can next generate the SriovNetwork for the resource we created in the SriovNetworkNodePolicy. Again, if we have multiple SriovNetworkNodePolicy files we will also have multiple SriovNetwork files. These define the network space for the VF interfaces. I should note that these networks need to have access to the NetApp data LIF as well in order for RDMA to function. In my example below I excluded IP addresses in the 192.168.10.100-110 range because my NetApp data LIF will have IP addresses in that space. Note that the exclude entries as written cover 192.168.10.100-103 and 192.168.10.110.

$ cat <<EOF > sriov-network.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-network
  namespace: openshift-sriov-network-operator
spec:
  vlan: 0
  networkNamespace: "default"
  resourceName: "sriovlegacy"
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.10.0/24",
      "exclude": [
        "192.168.10.100/30",
        "192.168.10.110/32"
      ]
    }
EOF

Now we can create the SriovNetwork custom resource on the cluster.

$ oc create -f sriov-network.yaml
sriovnetwork.sriovnetwork.openshift.io/sriov-network created
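
Since the whereabouts exclude list is written in CIDR notation, it is worth checking which addresses each entry really covers. The following is a minimal sketch in plain shell arithmetic; the cidr_range and int_to_ip helpers are my own names and are not part of whereabouts.

```bash
#!/bin/bash
# Hypothetical helpers: print the first and last IPv4 address of a CIDR block.
int_to_ip() {
  printf '%d.%d.%d.%d\n' \
    $(( ($1 >> 24) & 255 )) $(( ($1 >> 16) & 255 )) \
    $(( ($1 >> 8) & 255 ))  $(( $1 & 255 ))
}

cidr_range() {
  local ip=${1%/*} prefix=${1#*/} old_ifs=$IFS
  IFS=.
  set -- $ip                          # split the dotted quad into $1..$4
  IFS=$old_ifs
  local n=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
  local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  local first=$(( n & mask ))
  local last=$(( first | (mask ^ 0xFFFFFFFF) ))
  printf '%s-%s\n' "$(int_to_ip "$first")" "$(int_to_ip "$last")"
}

# The two exclude entries from the SriovNetwork above.
for entry in 192.168.10.100/30 192.168.10.110/32; do
  printf '%s covers %s\n' "$entry" "$(cidr_range "$entry")"
done
```

If the intent is to reserve the whole 100-110 span for the filer, additional entries such as 192.168.10.104/30 and 192.168.10.108/31 would be needed to cover the gap.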

At this point we have configured everything we need for SRIOV and can move onto the next section of the documentation.

NetApp VServer Setup

This section is really just to cover a few items of importance from the NetApp vserver perspective. It does not aim to be a comprehensive guide on how to set up a NetApp MetroCluster or the vservers within it. In our example environment we had a vserver created, and that vserver has two logical interfaces: management and data. With the management interface we can access the vserver and look at a few things. Depending on the environment this may or may not be accessible to the OpenShift administrator. In my case the storage team gave me access. To get on the vserver we can ssh to the vserver IP address, or to its FQDN if it exists in DNS.

$ ssh trident@10.6.136.110
(trident@10.6.136.110) Password:

Last login time: 5/7/2025 19:31:11

Once we are logged in I want to confirm that NFS 4 is enabled along with RDMA by using vserver nfs show.

ntap-rdu3-nv01-nvidia::> vserver nfs show

               Vserver: ntap-rdu3-nv01-nvidia
        General Access: true
                    v3: enabled
                  v4.0: enabled
                   4.1: enabled
                   UDP: enabled
                   TCP: enabled
                  RDMA: enabled
  Default Windows User: -
 Default Windows Group: -

The above output looks good for my needs when doing GPU Direct Storage. Another item we can check is the export-policies with vserver export-policy show.

ntap-rdu3-nv01-nvidia::> vserver export-policy show
Vserver         Policy Name
--------------- -------------------
ntap-rdu3-nv01-nvidia default
ntap-rdu3-nv01-nvidia trident-8d6b2406-551a-416b-bcce-22626ed60242
2 entries were displayed.

And finally I wanted to confirm that my data interfaces connected to the NVIDIA high speed switch were indeed operating with jumbo frames. I can see that with the network port show command. Because this is a MetroCluster pair setup we can see the interfaces on both nodes are set appropriately.

ntap-rdu3-nv01-nvidia::> network port show

Node: ntap-rdu3-nv01-a
                                                  Speed(Mbps) Health
Port                   Broadcast Domain Link MTU  Admin/Oper  Status
--------- ------------ ---------------- ---- ---- ----------- --------
e0M                    Management       up   1500 auto/1000   healthy
e1b                    -                down 1500 auto/-      -
e2a                    nvidia           up   9000 auto/200000 healthy
e2b                    -                up   1500 auto/100000 healthy
e2b-710                nfs              up   1500 -/-         healthy
e6a                    -                down 1500 auto/-      -
e6b                    -                down 1500 auto/-      -
e7b                    -                down 1500 auto/-      -
e8a                    -                down 1500 auto/-      -
e8b                    -                down 1500 auto/-      -

Node: ntap-rdu3-nv01-b
                                                  Speed(Mbps) Health
Port                   Broadcast Domain Link MTU  Admin/Oper  Status
--------- ------------ ---------------- ---- ---- ----------- --------
e0M                    Management       up   1500 auto/1000   healthy
e1b                    -                down 1500 auto/-      -
e2a                    nvidia           up   9000 auto/200000 healthy
e2b                    -                up   1500 auto/100000 healthy
e2b-710                nfs              up   1500 -/-         healthy
e6a                    -                down 1500 auto/-      -
e6b                    -                down 1500 auto/-      -
e7b                    -                down 1500 auto/-      -
e8a                    -                down 1500 auto/-      -
e8b                    -                down 1500 auto/-      -
20 entries were displayed.
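
The ONTAP output only confirms the filer side of the jumbo frame setup. As a sketch of a client-side check, assuming a Linux node or pod with iputils ping on the 192.168.10.0/24 network, we can send a non-fragmentable ICMP packet sized for a 9000 byte MTU. The payload is the MTU minus 28 bytes (20 for the IPv4 header and 8 for the ICMP header). The icmp_payload helper name is my own.

```bash
#!/bin/bash
# Hypothetical helper: largest ICMP echo payload that fits in one
# unfragmented IPv4 packet for a given MTU (20 byte IP + 8 byte ICMP header).
icmp_payload() {
  echo $(( $1 - 28 ))
}

icmp_payload 9000

# Usage on a host attached to the storage network (not run here):
#   ping -M do -s "$(icmp_payload 9000)" -c 3 192.168.10.101
# -M do sets the don't-fragment bit, so the ping fails if any hop in the
# path has an MTU smaller than 9000.
```

A failure here with a working small ping usually points at an interface or switch port in the path that is still at an MTU of 1500.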

At this point we can exit out of the vserver and move onto configuring the Netapp Trident CSI operator.

NetApp Trident CSI Operator Configuration

Trident is an open-source and fully supported storage orchestrator for containers and Kubernetes distributions, including Red Hat OpenShift. Trident works with the entire NetApp storage portfolio, including the NetApp ONTAP and Element storage systems, and it also supports NFS and iSCSI connections. Trident accelerates the DevOps workflow by allowing end users to provision and manage storage from their NetApp storage systems without requiring intervention from a storage administrator.

We have made the assumption that the Trident Operator and the default Trident Orchestrator have already been deployed. Our next step will be to configure the secret for the NetApp vfiler with the credentials so that Trident knows which username and password to connect with.

$ cat <<EOF > netapp-phy-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: netapp-phy-secret
  namespace: trident
type: Opaque
stringData:
  username: vserver-user
  password: vserver-password
EOF

Once we have our custom resource file generated we can create it on the cluster.

$ oc create -f netapp-phy-secret.yaml
secret/netapp-phy-secret created

Next we need to configure the TridentBackendConfig so that Trident knows how to communicate with the Netapp from both a management and data perspective.  Note the credentials we created are referenced here.

$ cat <<EOF > netapp-phy-tridentbackendconfig.yaml
apiVersion: trident.netapp.io/v1
kind: TridentBackendConfig
metadata:
  name: netapp-phy-nfs-backend
  namespace: trident
spec:
  version: 1
  storageDriverName: ontap-nas-flexgroup
  managementLIF: 10.6.136.110
  dataLIF: 192.168.10.101
  backendName: phy-nfs-backend
  svm: ntap-rdu3-nv01-nvidia
  autoExportPolicy: true
  credentials:
    name: netapp-phy-secret
EOF

With the custom resource file generated we can create it on the cluster.

$ oc create -f netapp-phy-tridentbackendconfig.yaml
tridentbackendconfig.trident.netapp.io/netapp-phy-nfs-backend created

We can validate the backend is there with the following check.

$ oc get tridentbackend -n trident
NAME        BACKEND           BACKEND UUID
tbe-n59xq   phy-nfs-backend   8d6b2406-551a-416b-bcce-22626ed60242

We can also describe the backend as well.

$ oc describe tridentbackend tbe-n59xq -n trident
Name:          tbe-n59xq
Namespace:     trident
Labels:        <none>
Annotations:   <none>
API Version:   trident.netapp.io/v1
Backend Name:  phy-nfs-backend
Backend UUID:  8d6b2406-551a-416b-bcce-22626ed60242
Config:
  ontap_config:
    Aggregate:
    Auto Export CID Rs:
      0.0.0.0/0
      ::/0
    Auto Export Policy:  true
    Backend Name:  phy-nfs-backend
    Backend Pools:
      eyJzdm1VVUlEIjoiNjE2OTg1YTYtMjlkZi0xMWYwLWI4YzctZDAzOWVhYzA0MDUzIn0=
    Chap Initiator Secret:
    Chap Target Initiator Secret:
    Chap Target Username:
    Chap Username:
    Client Certificate:
    Client Private Key:
    Clone Split Delay:  10
    Credentials:
      Name:  netapp-phy-secret
    Data LIF:  192.168.10.101
    Debug:  false
    Debug Trace Flags:  <nil>
    Defaults:
      LUKS Encryption:  false
      Adaptive Qos Policy:
      Encryption:
      Export Policy:  <automatic>
      File System Type:  ext4
      Format Options:
      Mirroring:  false
      Name Template:
      Qos Policy:
      Security Style:  unix
      Size:  1G
      Skip Recovery Queue:  false
      Snapshot Dir:  false
      Snapshot Policy:  none
      Snapshot Reserve:
      Space Allocation:  true
      Space Reserve:  none
      Split On Clone:  false
      Tiering Policy:
      Unix Permissions:  ---rwxrwxrwx
    Deny New Volume Pools:  false
    Disable Delete:  false
    Empty Flexvol Deferred Delete Period:
    Flags:
      Disaggregated:  false
      Personality:  Unified
      San Optimized:  false
    Flexgroup Aggregate List:
    Igroup Name:
    Labels:  <nil>
    Limit Aggregate Usage:
    Limit Volume Pool Size:
    Limit Volume Size:
    Luns Per Flexvol:
    Management LIF:  10.6.136.110
    Nas Type:  nfs
    Nfs Mount Options:
    Password:  secret:netapp-phy-secret
    Qtree Prune Flexvols Period:
    Qtree Quota Resize Period:
    Qtrees Per Flexvol:
    Region:
    Replication Policy:
    Replication Schedule:
    San Type:  iscsi
    Smb Share:
    Storage:  <nil>
    Storage Driver Name:  ontap-nas-flexgroup
    Storage Prefix:
    Supported Topologies:  <nil>
    Svm:  ntap-rdu3-nv01-nvidia
    Trusted CA Certificate:
    Usage Heartbeat:
    Use CHAP:  false
    Use REST:  <nil>
    User State:
    Username:  secret:netapp-phy-secret
    Version:  1
    Zone:
Config Ref:  9e1ff3f2-8a2d-4efa-859c-712b920d269b
Kind:  TridentBackend
Metadata:
  Creation Timestamp:  2025-05-07T19:31:56Z
  Finalizers:
    trident.netapp.io
  Generate Name:  tbe-
  Generation:  1
  Resource Version:  38713504
  UID:  6536970f-b10e-4e04-8a37-8da56deaf69e
Online:  true
State:  online
User State:  normal
Version:  1
Events:  <none>

We can also use the tridentctl command to validate the backend and confirm it's online.

$ ./trident-installer/tridentctl get backend -n trident
+-----------------+---------------------+--------------------------------------+--------+------------+---------+
|      NAME       |   STORAGE DRIVER    |                 UUID                 | STATE  | USER-STATE | VOLUMES |
+-----------------+---------------------+--------------------------------------+--------+------------+---------+
| phy-nfs-backend | ontap-nas-flexgroup | 8d6b2406-551a-416b-bcce-22626ed60242 | online | normal     |       0 |
+-----------------+---------------------+--------------------------------------+--------+------------+---------+

With the Trident backend configured we can move onto generating a storageclass resource file. Note that while this looks just like a standard Trident NFS storageclass, the proto=rdma mount option is what makes it special.

$ cat <<EOF > netapp-phy-rdma-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: netapp-phy-nfs
provisioner: csi.trident.netapp.io
parameters:
  backendType: "ontap-nas-flexgroup"
mountOptions:
  - vers=4.1
  - proto=rdma
  - max_connect=16
  - rsize=262144
  - wsize=262144
  - write=eager
EOF

Once we have generated the custom resource file we can create it on the cluster.

$ oc create -f netapp-phy-rdma-storageclass.yaml
storageclass.storage.k8s.io/netapp-phy-nfs created

We can validate the storageclass by looking at the storage classes available.

$ oc get sc
NAME             PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
netapp-phy-nfs   csi.trident.netapp.io   Delete          Immediate           false                  4s

Now with the storageclass configured we can generate a persistent volume claim resource file.

$ cat <<EOF > netapp-phy-pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-netapp-phy-test
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 850Gi
  storageClassName: netapp-phy-nfs
EOF

We can take the persistent volume claim resource and create it on the cluster.

$ oc create -f netapp-phy-pvc.yaml
persistentvolumeclaim/pvc-netapp-phy-test created

We can validate the persistent volume claim by looking at the pvc.

$ oc get pvc
NAME                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS     VOLUMEATTRIBUTESCLASS   AGE
pvc-netapp-phy-test   Bound    pvc-ae477c5c-cf10-4bc0-bb71-39d214a237f0   850Gi      RWO            netapp-phy-nfs   <unset>                 45s

At this point we have completed the setup of the Trident storage side in preparation for GPU Direct Storage.

NVIDIA Network Operator Configuration

We assume the Network Operator has already been installed on the cluster but the NicClusterPolicy still needs to be created. The following NicClusterPolicy example provides the configuration needed to ensure RDMA is properly loaded for NFS. The key option in this policy is the ENABLE_NFSRDMA variable, which must be set to true. I want to note that this policy can also optionally include an rdmaSharedDevice, and it sets ENTRYPOINT_DEBUG to true for more verbose logging.

$ cat <<EOF > network-sriovleg-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.01-0.6.0.0-0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
    - name: ENABLE_NFSRDMA
      value: "true"
    - name: ENTRYPOINT_DEBUG
      value: 'true'
EOF

Before creating the NicClusterPolicy on the cluster we need to prepare a script which will allow us to work around an issue with GPU Direct Storage in the NVIDIA Network Operator. This script, when run right after creating the NicClusterPolicy, determines which nodes have mofed pods running on them and, based on that node list, will ssh as the core user into each node and unload the following modules: nvme, nvme_tcp, nvme_fabrics, nvme_core. By using the script to unload the modules while the mofed container is busy building the DOCA drivers, we avoid a failure to load when the mofed container goes to install the compiled DOCA drivers. One might ask what NVMe has to do with NFS. Unfortunately GPU Direct Storage enablement covers both, so we have to work around this issue.

$ cat <<'EOF' > nvme-fixer.sh
#!/bin/bash

### Set array of modules to be unloaded
declare -a modarr=("nvme" "nvme_tcp" "nvme_fabrics" "nvme_core")

### Determine which hosts have mofed container running on them
declare -a hostarr=($(oc get pods -n nvidia-network-operator -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName --no-headers | grep mofed | awk '{print $2}'))

### Iterate through modules on each host and unload them
for host in "${hostarr[@]}"
do
  echo "Unloading nvme dependencies on $host..."
  for module in "${modarr[@]}"
  do
    echo "Unloading module $module..."
    ssh core@"$host" sudo rmmod "$module"
  done
done
EOF

Change the execute bit on the file.

$ chmod +x nvme-fixer.sh

Now we are ready to create the NicClusterPolicy on the cluster and follow it up by running the nvme-fixer.sh script. Any rmmod errors saying a module is not currently loaded can safely be ignored, as the module was not loaded to start with. In the example below we had two worker nodes that had mofed pods running on them, so the script went ahead and unloaded the nvme modules.

$ oc create -f network-sriovleg-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created

$ ./nvme-fixer.sh
Unloading nvme dependencies on nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com...
Unloading module nvme...
Unloading module nvme_tcp...
rmmod: ERROR: Module nvme_tcp is not currently loaded
Unloading module nvme_fabrics...
rmmod: ERROR: Module nvme_fabrics is not currently loaded
Unloading module nvme_core...
Unloading nvme dependencies on nvd-srv-29.nvidia.eng.rdu2.dc.redhat.com...
Unloading module nvme...
Unloading module nvme_tcp...
Unloading module nvme_fabrics...
Unloading module nvme_core...
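
Because only the "not currently loaded" errors are harmless, it can help to separate them from anything else rmmod reports. Below is a minimal sketch of such a filter. The is_benign_rmmod_error helper is hypothetical, and the second message in the demo is only an illustrative example of an error that should not be ignored.

```bash
#!/bin/bash
# Hypothetical helper: succeed only for the rmmod error that means the
# module was never loaded, which the workaround can safely ignore.
is_benign_rmmod_error() {
  case "$1" in
    *"is not currently loaded"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Demo with one message from the run above and one illustrative failure.
for msg in \
  "rmmod: ERROR: Module nvme_tcp is not currently loaded" \
  "rmmod: ERROR: Module nvme is in use"
do
  if is_benign_rmmod_error "$msg"; then
    echo "ignore: $msg"
  else
    echo "investigate: $msg"
  fi
done
```

Anything flagged for investigation most likely means an NVMe device is still in use on that node, which is the situation described in the Considerations section.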

Now we wait for the mofed pods to finish compiling and installing the GPU Direct Storage modules. We will know it's complete when the pods are in a running state like below:

$ oc get pods -n nvidia-network-operator
NAME                                                          READY   STATUS    RESTARTS        AGE
mofed-rhcos4.16-56c9d799bf-ds-bvhmj                           2/2     Running   0               20h
mofed-rhcos4.16-56c9d799bf-ds-jdzxj                           2/2     Running   0               20h
nvidia-network-operator-controller-manager-85b78c49f6-9lchx   1/1     Running   4 (3h26m ago)   3d14h
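
Rather than eyeballing the pod list, the readiness check can be scripted. This is a rough sketch that reads oc get pods style output on stdin and succeeds only when every pod whose name starts with a given prefix is Running with all containers ready. The pods_ready function is my own helper, not an oc feature.

```bash
#!/bin/bash
# Hypothetical helper: read "NAME READY STATUS ..." lines on stdin and
# succeed only if at least one pod matches the prefix and every match
# is fully ready (n/n) and in the Running state.
pods_ready() {
  awk -v prefix="$1" '
    index($1, prefix) == 1 {
      seen++
      split($2, r, "/")
      if (r[1] != r[2] || $3 != "Running") bad++
    }
    END { exit ((seen > 0 && bad == 0) ? 0 : 1) }
  '
}

# Usage against a live cluster (not run here):
#   until oc get pods -n nvidia-network-operator --no-headers | pods_ready mofed
#   do
#     sleep 30
#   done
```

The driver build can take a while, so a polling loop like the one in the usage comment is more practical than rerunning the command by hand.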

This completes the NVIDIA Network Operator portion of the configuration for GPU Direct Storage.

NVIDIA GPU Operator Configuration

Now that the NicClusterPolicy is defined and the proper nvme modules have been loaded we can move onto configuring our GPU ClusterPolicy. The below example is a policy that will enable GPU Direct Storage on the worker nodes that have a proper NVIDIA GPU.

$ cat <<EOF > gpu-cluster-policy.yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      configMapName: ''
      nlsEnabled: true
    enabled: true
    kernelModuleType: open
    certConfig:
      name: ''
    useNvidiaDriverCRD: false
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
    mps:
      root: /run/nvidia/mps
  gdrcopy:
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
      - name: WITH_WORKLOAD
        value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: true
    image: nvidia-fs
    repository: nvcr.io/nvidia/cloud-native
    version: 2.25.7
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
EOF

Now let's create the policy on the cluster.

$ oc create -f gpu-cluster-policy.yaml
clusterpolicy.nvidia.com/gpu-cluster-policy created

Once the policy is created let's validate the pods are running before we move onto the next step.

$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS      RESTARTS       AGE
gpu-feature-discovery-nttht                           1/1     Running     0              20h
gpu-feature-discovery-r4ktv                           1/1     Running     0              20h
gpu-operator-7d7f694bfb-957mv                         1/1     Running     0              20h
nvidia-container-toolkit-daemonset-h96t6              1/1     Running     0              20h
nvidia-container-toolkit-daemonset-hqtrl              1/1     Running     0              20h
nvidia-cuda-validator-66ml7                           0/1     Completed   0              20h
nvidia-dcgm-exporter-hbk4r                            1/1     Running     0              20h
nvidia-dcgm-exporter-pgh4q                            1/1     Running     0              20h
nvidia-dcgm-nttds                                     1/1     Running     0              20h
nvidia-dcgm-zb4fl                                     1/1     Running     0              20h
nvidia-device-plugin-daemonset-d99md                  1/1     Running     0              20h
nvidia-device-plugin-daemonset-w7tc4                  1/1     Running     0              20h
nvidia-driver-daemonset-416.94.202504151456-0-8bdl5   4/4     Running     26 (20h ago)   2d2h
nvidia-driver-daemonset-416.94.202504151456-0-j8gps   4/4     Running     20 (20h ago)   2d2h
nvidia-node-status-exporter-b22hk                     1/1     Running     4              2d2h
nvidia-node-status-exporter-lwqhb                     1/1     Running     3              2d2h
nvidia-operator-validator-cvqn5                       1/1     Running     0              20h
nvidia-operator-validator-zxrpb                       1/1     Running     0              20h

With the NVIDIA GPU Operator pods running we can rsh into the daemonset pods and confirm GDS is enabled by running the lsmod command and cat out the /proc/driver/nvidia-fs/stats file.

$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-416.94.202504151456-0-8bdl5
sh-4.4# lsmod|grep nvidia
nvidia_fs             327680  0
nvidia_modeset       1720320  0
video                  73728  1 nvidia_modeset
nvidia_uvm           4087808  12
nvidia              11665408  36 nvidia_uvm,nvidia_fs,gdrdrv,nvidia_modeset
drm                   741376  5 drm_kms_helper,drm_shmem_helper,nvidia,mgag200
sh-4.4# cat /proc/driver/nvidia-fs/stats
GDS Version: 1.10.0.4
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.20.5)
Mellanox PeerDirect Supported: False
IO stats: Disabled, peer IO stats: Disabled
Logging level: info

Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads : err=0 io_state_err=0
Sparse Reads : n=0 io=0 holes=0 pages=0
Writes : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap : n=0 ok=0 err=0 munmap=0
Bar1-map : n=0 ok=0 err=0 free=0 callbacks=0 active=0 delay-frees=0
Error : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops : Read=0 Write=0 BatchIO=0

If everything looks good we can move onto an additional step to confirm GDS is ready for workload consumption.

GDS Cuda Workload Container

Once the GPU Direct Storage drivers are loaded we can use one more additional tool to check and confirm GDS capability. This involves building a container that contains the CUDA packages and then running it on a node.

Now let's generate a service account resource file to use in the default namespace.

$ cat <<EOF > nvidiatools-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nvidiatools
  namespace: default
EOF

Next we can create it on our cluster.

$ oc create -f nvidiatools-serviceaccount.yaml
serviceaccount/nvidiatools created

Finally, with the service account created, we can add privileges to it.

$ oc -n default adm policy add-scc-to-user privileged -z nvidiatools
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "nvidiatools"

With the service account defined we can generate our pod yaml. The following pod yaml defines this configuration. Note that it references a second claim, pvc-netapp-phy-nordma-test, mounted at /nfsslow. The creation of that claim is not shown in this post.

$ cat <<EOF > nvidiatools-30-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidiatools-30-workload
  namespace: default
  annotations:
    # JSON list is the canonical form; adjust if your NAD lives in another namespace
    k8s.v1.cni.cncf.io/networks: '[{ "name": "sriov-network" }]'
spec:
  serviceAccountName: nvidiatools
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com
  volumes:
  - name: rdma-pv-storage
    persistentVolumeClaim:
      claimName: pvc-netapp-phy-test
  - name: nordma-pv-storage
    persistentVolumeClaim:
      claimName: pvc-netapp-phy-nordma-test
  containers:
  - name: nvidiatools-30-workload
    image: quay.io/redhat_emp1/ecosys-nvidia/nvidia-tools:0.0.3
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
      capabilities:
        add: ["IPC_LOCK"]
    resources:
      limits:
        nvidia.com/gpu: 1
        openshift.io/sriovlegacy: 1
      requests:
        nvidia.com/gpu: 1
        openshift.io/sriovlegacy: 1
    volumeMounts:
    - name: rdma-pv-storage
      mountPath: /nfsfast
    - name: nordma-pv-storage
      mountPath: /nfsslow
EOF

With the pod yaml ready we can create it on the cluster.

$ oc create -f nvidiatools-30-workload.yaml
nvidiatools-30-workload created
$ oc get pods
NAME                      READY   STATUS    RESTARTS   AGE
nvidiatools-30-workload   1/1     Running   0          3s

Once the pod is up and running we can rsh into the pod and run the gdscheck tool to confirm capabilities and configuration of GPU Direct Storage.

$ oc rsh nvidiatools-30-workload
sh-5.1# /usr/local/cuda/gds/tools/gdscheck -p
 GDS release version: 1.13.1.3
 nvidia_fs version: 2.20 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe P2PDMA        : Unsupported
 NVMe               : Supported
 NVMeOF             : Supported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Supported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_pci_p2pdma : false
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 64
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 fs.gpfs.gds_async_support: true
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA L40S bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed: 12080
 Platform: PowerEdge R760xa, Arch: x86_64(Linux 5.14.0-427.65.1.el9_4.x86_64)
 Platform verification succeeded
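
The gdscheck report is long, so for a quick pass or fail on the one line we care about it can be filtered. The following sketch assumes the "name : status" layout shown in the driver configuration section. The gds_driver_status function is a hypothetical helper and not part of the GDS tools.

```bash
#!/bin/bash
# Hypothetical helper: read gdscheck -p style output on stdin and print
# the status for an exactly matching entry name, e.g. "NFS".
gds_driver_status() {
  awk -F':' -v name="$1" '
    {
      key = $1; val = $2
      gsub(/^[ \t-]+|[ \t]+$/, "", key)   # trim spaces and leading dashes
      gsub(/^[ \t]+|[ \t]+$/, "", val)
      if (key == name) { print val; found = 1; exit }
    }
    END { exit !found }
  '
}

# Usage inside the workload pod (not run here):
#   /usr/local/cuda/gds/tools/gdscheck -p | gds_driver_status NFS
```

For this walkthrough the entry that matters is NFS, which should report Supported before moving on to the benchmarks.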

Now let's confirm our GPU Direct NFS mount is mounted. Notice in the output the proto is rdma.

sh-5.1# mount|grep nfs
192.168.10.101:/trident_pvc_ae477c5c_cf10_4bc0_bb71_39d214a237f0 on /mnt type nfs4 (rw,relatime,vers=4.1,rsize=262144,wsize=262144,namlen=255,hard,proto=rdma,max_connect=16,port=20049,timeo=600,retrans=2,sec=sys,clientaddr=192.168.10.30,local_lock=none,write=eager,addr=192.168.10.101)
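
To check the transport in a script instead of by eye, the option string can be split and searched. This is a small sketch with a hypothetical mount_opt helper. The /nfsfast path in the usage comment is the mount path from the pod yaml and may differ in other setups.

```bash
#!/bin/bash
# Hypothetical helper: print the value of one key=value mount option
# from a comma-separated option string; fails if the option is absent.
mount_opt() {
  printf '%s\n' "$1" | tr ',' '\n' |
    awk -F= -v key="$2" '$1 == key { print $2; found = 1 } END { exit !found }'
}

# Options copied from the mount output above (shortened).
opts='rw,relatime,vers=4.1,rsize=262144,wsize=262144,hard,proto=rdma,max_connect=16,port=20049'
echo "proto=$(mount_opt "$opts" proto) port=$(mount_opt "$opts" port)"

# Usage inside the pod (not run here):
#   opts=$(awk '$2 == "/nfsfast" { print $4 }' /proc/mounts)
#   [ "$(mount_opt "$opts" proto)" = "rdma" ] || echo "mount is not using RDMA"
```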

Next we can use gdsio to run some benchmarks across the GPU Direct NFS mount. Before we run the benchmarks let's familiarize ourselves with all the gdsio switches and what they mean.

sh-5.1# /usr/local/cuda-12.8/gds/tools/gdsio -h
gdsio version :1.12
Usage [using config file]: gdsio rw-sample.gdsio
Usage [using cmd line options]:/usr/local/cuda-12.8/gds/tools/gdsio
         -f <file name>
         -D <directory name>
         -d <gpu_index (refer nvidia-smi)>
         -n <numa node>
         -m <memory type(0 - (cudaMalloc), 1 - (cuMem), 2 - (cudaMallocHost), 3 - (malloc) 4 - (mmap))>
         -w <number of threads for a job>
         -s <file size(K|M|G)>
         -o <start offset(K|M|G)>
         -i <io_size(K|M|G)> <min_size:max_size:step_size>
         -p <enable nvlinks>
         -b <skip bufregister>
         -V <verify IO>
         -x <xfer_type> [0(GPU_DIRECT), 1(CPU_ONLY), 2(CPU_GPU), 3(CPU_ASYNC_GPU), 4(CPU_CACHED_GPU), 5(GPU_DIRECT_ASYNC), 6(GPU_BATCH), 7(GPU_BATCH_STREAM)]
         -B <batch size>
         -I <(read) 0|(write)1| (randread) 2| (randwrite) 3>
         -T <duration in seconds>
         -k <random_seed> (number e.g. 3456) to be used with random read/write>
         -U <use unaligned(4K) random offsets>
         -R <fill io buffer with random data>
         -F <refill io buffer with random data during each write>
         -a <alignment size in case of random IO>
         -M <mixed_rd_wr_percentage in case of regular batch mode>
         -P <rdma url>
         -J <per job statistics>

xfer_type:
0 - Storage->GPU (GDS)
1 - Storage->CPU
2 - Storage->CPU->GPU
3 - Storage->CPU->GPU_ASYNC
4 - Storage->PAGE_CACHE->CPU->GPU
5 - Storage->GPU_ASYNC
6 - Storage->GPU_BATCH
7 - Storage->GPU_BATCH_STREAM

Note:
read test (-I 0) with verify option (-V) should be used with files written (-I 1) with -V option
read test (-I 2) with verify option (-V) should be used with files written (-I 3) with -V option, using same random seed (-k), same number of threads(-w), offset(-o), and data size(-s)
write test (-I 1/3) with verify option (-V) will perform writes followed by read

Before we begin running tests I want to note that they are being run from a standard Dell R760xa. From the nvidia-smi topo output we can see we are dealing with a non-optimal NODE topology between the GPU and the NICs, where the connection traverses PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node. Ideally for performant numbers we would want to run this on an H100 or B200 system where the GPU and NIC are connected to the same PCIe switch and yield a PHB, PXB or PIX connection.

sh-5.1# nvidia-smi topo -mp
      GPU0  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  NIC8  NIC9  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NODE  NODE  NODE  NODE  NODE  NODE  NODE  NODE  NODE  NODE  0,2,4,6,8,10  0              N/A
NIC0  NODE  X     NODE  NODE  NODE  NODE  NODE  NODE  NODE  NODE  NODE
NIC1  NODE  NODE  X     PIX   PIX   PIX   PIX   PIX   PIX   PIX   PIX
NIC2  NODE  NODE  PIX   X     PIX   PIX   PIX   PIX   PIX   PIX   PIX
NIC3  NODE  NODE  PIX   PIX   X     PIX   PIX   PIX   PIX   PIX   PIX
NIC4  NODE  NODE  PIX   PIX   PIX   X     PIX   PIX   PIX   PIX   PIX
NIC5  NODE  NODE  PIX   PIX   PIX   PIX   X     PIX   PIX   PIX   PIX
NIC6  NODE  NODE  PIX   PIX   PIX   PIX   PIX   X     PIX   PIX   PIX
NIC7  NODE  NODE  PIX   PIX   PIX   PIX   PIX   PIX   X     PIX   PIX
NIC8  NODE  NODE  PIX   PIX   PIX   PIX   PIX   PIX   PIX   X     PIX
NIC9  NODE  NODE  PIX   PIX   PIX   PIX   PIX   PIX   PIX   PIX   X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
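
When there are many NICs it can be easier to look up a single GPU to NIC relationship than to read the whole matrix. Below is a rough sketch that does this with awk. It assumes the matrix arrives as plain whitespace-separated text with the header as the first non-empty line and without terminal escape sequences. The topo_link function is a hypothetical helper.

```bash
#!/bin/bash
# Hypothetical helper: read an "nvidia-smi topo" style matrix on stdin and
# print the link type between a row device and a column device.
topo_link() {
  awk -v row="$1" -v col="$2" '
    !hdr && NF {
      # Header has no row label, so header field i maps to data field i + 1.
      for (i = 1; i <= NF; i++) if ($i == col) c = i + 1
      hdr = 1
      next
    }
    $1 == row && c { print $c; found = 1; exit }
    END { exit !found }
  '
}

# Usage inside the workload pod (not run here):
#   nvidia-smi topo -mp | topo_link GPU0 NIC1
```

On the system in this post every GPU0 to NIC lookup returns NODE, which matches the non-optimal placement described above.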

Now let's run a few gdsio tests across our RDMA NFS mount. Please note these runs were not performance tuned in any way. This is merely a demonstration to show the feature functionality.

In this first example, gdsio is used to generate a random write load of small IOs (4K) against the NFS mount point.

sh-5.1# /usr/local/cuda-12.8/gds/tools/gdsio -D /nfsfast -d 0 -w 32 -s 500M -i 4K -x 0 -I 3 -T 120
IoType: RANDWRITE XferType: GPUD Threads: 32 DataSetSize: 43222136/16384000(KiB) IOSize: 4(KiB) Throughput: 0.344940 GiB/sec, Avg_Latency: 352.314946 usecs ops: 10805534 total_time 119.498576 secs

Next we will repeat the same test but for random reads.

sh-5.1# /usr/local/cuda-12.8/gds/tools/gdsio -D /nfsfast -d 0 -w 32 -s 500M -i 4K -x 0 -I 2 -T 120
IoType: RANDREAD XferType: GPUD Threads: 32 DataSetSize: 71313540/16384000(KiB) IOSize: 4(KiB) Throughput: 0.569229 GiB/sec, Avg_Latency: 214.448246 usecs ops: 17828385 total_time 119.477201 secs

Small and random IOs are all about IOPS and latency. For our next test we will measure throughput, using a larger file size and a much larger IO size.

sh-5.1# /usr/local/cuda-12.8/gds/tools/gdsio -D /nfsfast -d 0 -w 32 -s 1G -i 1M -x 0 -I 1 -T 120
IoType: WRITE XferType: GPUD Threads: 32 DataSetSize: 320301056/33554432(KiB) IOSize: 1024(KiB) Throughput: 2.547637 GiB/sec, Avg_Latency: 12487.658159 usecs ops: 312794 total_time 119.900455 secs
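gdsio reports throughput and average latency but not IOPS directly. Since the summary line includes both the operation count and the elapsed time, IOPS can be derived as ops divided by total_time. The snippet below is a small sketch of that arithmetic; gdsio_iops is a hypothetical helper name, and it assumes the summary line format shown in the runs above.

```shell
#!/usr/bin/env bash
# gdsio_iops (hypothetical helper): read gdsio summary lines on stdin and
# print the rounded IOPS figure (ops / total_time) for each one.
gdsio_iops() {
  awk '
    /ops:/ && /total_time/ {
      for (i = 1; i < NF; i++) {
        if ($i == "ops:")       ops  = $(i + 1)   # number of completed IOs
        if ($i == "total_time") secs = $(i + 1)   # elapsed seconds
      }
      if (secs > 0) printf "%.0f\n", ops / secs
    }
  '
}

# Demo using the 4K random write summary line from this post.
echo 'IoType: RANDWRITE XferType: GPUD Threads: 32 DataSetSize: 43222136/16384000(KiB) IOSize: 4(KiB) Throughput: 0.344940 GiB/sec, Avg_Latency: 352.314946 usecs ops: 10805534 total_time 119.498576 secs' | gdsio_iops
```

For the 4K random write run this works out to roughly 90 thousand write IOPS, which also agrees with the reported throughput divided by the 4 KiB IO size.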

This concludes the workflow of configuring and testing GPU Direct Storage on OpenShift over an RDMA NFS mount.

Monday, June 09, 2025

NVIDIA RDMA in OpenShift Virtualization


In this blog we want to explore using NVIDIA GPU Direct RDMA with OpenShift Virtualization. Why would we want to do so? Some legacy applications cannot run inside a container and therefore need to run in a virtual machine. These virtual machines will most likely run on OpenShift, given it can host both container and virtual machine workloads. This unified environment makes managing IT infrastructure more efficient because customers are not managing disparate systems for their various workloads.

Assumptions

We assume we already have an OpenShift cluster running with some kind of backend storage and OpenShift Virtualization installed. In our example environment we have 3 control plane nodes and 3 worker nodes. The worker nodes have Mellanox BF3s and NVIDIA A40 GPUs. We are using OpenShift Data Foundation as the backing storage where needed, which is useful when live migration is a requirement. With the assumptions covered we can begin configuring the system for GPU Direct RDMA.

Enable Device PassThrough

Before we can consume the devices in a virtual machine we need to enable device passthrough on the worker nodes for the Mellanox cards and the GPU devices so they can be used directly by the virtual machines. First we need to enable intel_iommu, which we can do by creating the following MachineConfig.

$ cat <<EOF > 100-worker-iommu.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 100-worker-iommu
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - intel_iommu=on
EOF

Next we will create a butane file that contains the vendor/pci ids of the devices we wish to bind to vfio which enables them for passthrough.

$ cat <<EOF > 100-worker-vfiopci.bu
variant: openshift
version: 4.16.0
metadata:
  name: 100-worker-vfiopci
  labels:
    machineconfiguration.openshift.io/role: worker
storage:
  files:
  - path: /etc/modprobe.d/vfio.conf
    mode: 0644
    overwrite: true
    contents:
      inline: |
        options vfio-pci ids=10de:2235,10de:145a,15b3:a2dc,15b3:c2d5,15b3:1021,15b3:0237,15b3:0016
  - path: /etc/modules-load.d/vfio-pci.conf
    mode: 0644
    overwrite: true
    contents:
      inline: vfio-pci
EOF

After building the butane file above we can pass it through the butane command to generate the corresponding custom resource file.

$ butane 100-worker-vfiopci.bu -o 100-worker-vfiopci.yaml
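The vendor:device IDs in the vfio.conf entry above come from the hardware in the worker nodes. One way to collect them is from `lspci -nn`, which prints each ID in square brackets. The sketch below builds the options line from that output; vfio_ids_option is a hypothetical helper name, the default vendor filter (10de for NVIDIA, 15b3 for Mellanox) is an assumption matching this environment, and the sample lspci lines are illustrative. Review the resulting list before using it, since every matching device would be bound to vfio-pci.

```shell
#!/usr/bin/env bash
# vfio_ids_option (hypothetical helper): read `lspci -nn` output on stdin and
# print an "options vfio-pci ids=..." line for the given vendor IDs.
vfio_ids_option() {
  local vendors="${1:-10de|15b3}"   # assumption: NVIDIA and Mellanox vendor IDs
  grep -oE "\[(${vendors}):[0-9a-f]{4}\]" \
    | tr -d '[]' \
    | sort -u \
    | awk '{ ids = ids (NR > 1 ? "," : "") $0 }
           END { if (NR) print "options vfio-pci ids=" ids }'
}

# Demo with illustrative lspci -nn lines (device names are examples only).
vfio_ids_option <<'EOF'
09:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0a:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A40] [10de:2235] (rev a1)
EOF
```

On a worker node the input would come from `lspci -nn | vfio_ids_option`.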

Next we need to generate an mlx5_core blacklist so the driver does not load on the worker nodes. We do not need to do this for the GPU drivers because the nouveau driver is blacklisted by default in OpenShift.

$ cat <<EOF > 99-machine-config-blacklist-mlx5_core.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-blacklist-mlx5-core
spec:
  kernelArguments:
    - "module_blacklist=mlx5_core"
EOF

With the MachineConfigs generated we can go ahead and create them on the cluster.

$ oc create -f 100-worker-iommu.yaml
$ oc create -f 100-worker-vfiopci.yaml
$ oc create -f 99-machine-config-blacklist-mlx5_core.yaml

One by one the nodes will reboot as the MachineConfigs are applied to them. Wait for all the worker nodes in the cluster to reboot before proceeding and confirm with the output of the oc get mcp command.

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-08f7504a24cb5e9734f3cfe995db08c6   True      False      False      3              3                   3                     0                      122d
worker   rendered-worker-8c3ff0c3b0d16b30f7eb76992fd7d3b1   True      False      False      3              3                   3                     0                      122d
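If this check needs to be scripted rather than eyeballed, the table can be tested for the healthy state: UPDATED is True while UPDATING and DEGRADED are False for every pool. The following is a sketch under the assumption that the column order matches the output above; mcp_all_updated is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# mcp_all_updated (hypothetical helper): read `oc get mcp` output on stdin.
# Exits 0 when every pool is UPDATED=True, UPDATING=False, DEGRADED=False;
# otherwise prints the pools that are not ready and exits 1.
mcp_all_updated() {
  awk '
    NR == 1 { next }                                   # skip the header row
    $3 != "True" || $4 != "False" || $5 != "False" {
      print "not ready: " $1
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  '
}

# Demo with the pool status shown in this post.
if mcp_all_updated <<'EOF'
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-08f7504a24cb5e9734f3cfe995db08c6   True      False      False      3              3                   3                     0                      122d
worker   rendered-worker-8c3ff0c3b0d16b30f7eb76992fd7d3b1   True      False      False      3              3                   3                     0                      122d
EOF
then
  echo "all machine config pools are updated"
fi
```

Against a live cluster this would be `oc get mcp | mcp_all_updated`, run in a loop until it succeeds.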

Once confirmed we can proceed to the next section about exposing the devices to OpenShift.

Expose Devices to OpenShift

Now that devices have been configured for passthrough we need to expose them to the kubevirt-hyperconverged configuration. We can do this by editing that configuration.

$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv

Once we are in edit mode we can add our devices. Our example environment looks like the following, where we have A40 GPUs, Mellanox CX7s and Mellanox BF3s. The resourceName is arbitrary after the nvidia.com portion. Since both BF3 and CX7 cards show up as CX7 when looking at them via lspci, I put a BF3 prefix on the ones from a BF3 card so I could tell the difference. Another thing to note is that the cards in the workers should be configured consistently: all ports should be set to either ethernet or infiniband, because the resource names give no way to tell the two apart.

  permittedHostDevices:
    pciHostDevices:
    - pciDeviceSelector: 10de:2235
      resourceName: nvidia.com/GA102GL_A40
    - pciDeviceSelector: 15b3:a2dc
      resourceName: nvidia.com/BF3_CX7
    - pciDeviceSelector: 15b3:1021
      resourceName: nvidia.com/CX7
    - pciDeviceSelector: 15b3:c2d5
      resourceName: nvidia.com/BF3_DMA
  resourceRequirements:

Once the lines are added we can save and exit the edit command. We can use the following command to check that the devices are showing up properly. Note that some of my nodes have BF3 cards and some just have vanilla CX7 cards.

$ oc describe node | grep -E 'Capacity:|Allocatable:' -A14
(...)
--
Allocatable:
  cpu:                            127500m
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              1438028263499
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         262445728Ki
  nvidia.com/BF3_CX7:             2
  nvidia.com/BF3_DMA:             2
  nvidia.com/GA102GL_A40:         2
--
Capacity:
  cpu:                            128
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              1561525616Ki
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         263596676Ki
  nvidia.com/BF3_CX7:             2
  nvidia.com/BF3_DMA:             2
  nvidia.com/GA102GL_A40:         2
(...)
Capacity:
  cpu:                            128
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              1561525616Ki
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         263596668Ki
  nvidia.com/CX7:                 2
  nvidia.com/GA102GL_A40:         2
  pods:                           250
--
Allocatable:
  cpu:                            127500m
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              1438028263499
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         262445692Ki
  nvidia.com/CX7:                 2
  nvidia.com/GA102GL_A40:         2
  pods:                           250
--

If everything looks good we can proceed to launching our virtual machines.

Launch Virtual Machines

We need to launch a few virtual machines in order to test GPUDirect RDMA with our passthrough devices. The virtual machine custom resource files will look something like the examples below, though they could differ depending on what one plans to test and what workloads will run inside the VM. We need two virtual machines running on different compute nodes, so each yaml defines a nodeSelector. Note that in these examples we reference one of the Mellanox devices and one of the NVIDIA GPU devices. The first machine is defined below.

$ cat <<EOF > rhel9-rdma1.yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  annotations:
    kubevirt.io/latest-observed-api-version: v1
    kubevirt.io/storage-observed-api-version: v1
    kubevirt.io/vm-generation: "5"
    vm.kubevirt.io/flavor: small
    vm.kubevirt.io/os: rhel9
    vm.kubevirt.io/workload: server
  labels:
    kubevirt.io/domain: rhel9-lavender-ocelot-28
    kubevirt.io/nodeName: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
    kubevirt.io/size: small
    network.kubevirt.io/headlessService: headless
  name: rhel9-rdma1
  namespace: default
spec:
  architecture: amd64
  domain:
    cpu:
      cores: 1
      maxSockets: 16
      model: host-model
      sockets: 4
      threads: 1
    devices:
      disks:
      - disk:
          bus: virtio
        name: rootdisk
      - disk:
          bus: virtio
        name: cloudinitdisk
      gpus:
      - deviceName: nvidia.com/GA102GL_A40
        name: gpus-orange-porpoise-63
      hostDevices:
      - deviceName: nvidia.com/BF3_CX7
        name: hostDevices-turquoise-hornet-42
      interfaces:
      - macAddress: 02:23:fc:00:00:11
        masquerade: {}
        model: virtio
        name: default
      rng: {}
    features:
      acpi:
        enabled: true
      smm:
        enabled: true
    firmware:
      bootloader:
        efi:
          secureBoot: false
      uuid: e2ff2b46-096e-521f-8680-c99c6bbae5d8
    machine:
      type: pc-q35-rhel9.4.0
    memory:
      guest: 16Gi
      maxGuest: 64Gi
    resources:
      requests:
        memory: 16Gi
  evictionStrategy: LiveMigrate
  networks:
  - name: default
    pod: {}
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
  terminationGracePeriodSeconds: 180
  volumes:
  - dataVolume:
      name: rhel9-lavender-ocelot-28
    name: rootdisk
  - cloudInitNoCloud:
      userData: |
        #cloud-config
        user: cloud-user
        password: password
        chpasswd:
          expire: false
    name: cloudinitdisk
EOF

And then we have our second virtual machine defined.

$ cat <<EOF > rhel9-rdma2.yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  annotations:
    kubevirt.io/latest-observed-api-version: v1
    kubevirt.io/storage-observed-api-version: v1
    kubevirt.io/vm-generation: "3"
    vm.kubevirt.io/flavor: small
    vm.kubevirt.io/os: rhel9
    vm.kubevirt.io/workload: server
  labels:
    kubevirt.io/domain: rhel9-rdma2
    kubevirt.io/nodeName: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
    kubevirt.io/size: small
    network.kubevirt.io/headlessService: headless
  name: rhel9-rdma2
  namespace: default
spec:
  architecture: amd64
  domain:
    cpu:
      cores: 1
      maxSockets: 16
      model: host-model
      sockets: 4
      threads: 1
    devices:
      disks:
      - disk:
          bus: virtio
        name: rootdisk
      - disk:
          bus: virtio
        name: cloudinitdisk
      gpus:
      - deviceName: nvidia.com/GA102GL_A40
        name: gpus-amaranth-dormouse-37
      hostDevices:
      - deviceName: nvidia.com/BF3_CX7
        name: hostDevices-turquoise-reptile-50
      interfaces:
      - macAddress: 02:23:fc:00:00:12
        masquerade: {}
        model: virtio
        name: default
      rng: {}
    features:
      acpi:
        enabled: true
      smm:
        enabled: true
    firmware:
      bootloader:
        efi:
          secureBoot: false
      uuid: 875d71cc-b337-5209-ad37-b1611fa77ec2
    machine:
      type: pc-q35-rhel9.4.0
    memory:
      guest: 16Gi
      maxGuest: 64Gi
    resources:
      requests:
        memory: 16Gi
  evictionStrategy: LiveMigrate
  networks:
  - name: default
    pod: {}
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  terminationGracePeriodSeconds: 180
  volumes:
  - dataVolume:
      name: rhel9-rdma2
    name: rootdisk
  - cloudInitNoCloud:
      userData: |
        #cloud-config
        user: cloud-user
        password: password
        chpasswd:
          expire: false
    name: cloudinitdisk
EOF

Once we have generated the virtual machine custom resource files we can create them on the cluster.

$ oc create -f rhel9-rdma1.yaml
$ oc create -f rhel9-rdma2.yaml

We can validate they are running by using oc get vmi.

$ oc get vmi
NAME          AGE    PHASE     IP            NODENAME                                   READY
rhel9-rdma1   115m   Running   10.128.2.66   nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com   True
rhel9-rdma2   114m   Running   10.131.0.50   nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com   True

If everything looks good we can proceed to configuring the NVIDIA drivers on the virtual machines.

Prepare for NVIDIA DOCA and GPU Drivers

Now that our virtual machines are up and running we will need to configure the NVIDIA DOCA and GPU drivers to take advantage of the devices we have passed up to them.

$ oc get vmi
NAME          AGE    PHASE     IP            NODENAME                                   READY
rhel9-rdma1   115m   Running   10.128.2.66   nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com   True
rhel9-rdma2   114m   Running   10.131.0.50   nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com   True

We can use virtctl to access the console of the virtual machines and login.

$ virtctl console rhel9-rdma1
Successfully connected to rhel9-rdma1 console. The escape sequence is ^]

rhel9-rdma1 login:
rhel9-rdma1 login: cloud-user
Password:
Last login: Fri Apr 11 18:55:51 on ttyS0
[cloud-user@rhel9-rdma1 ~]$

Once we are logged in we need to register the host to Red Hat.

$ sudo subscription-manager register
Registering to: subscription.rhsm.redhat.com:443/subscription
Username: schmaustech
Password:
The system has been registered with ID: 8d91ad2e-8d3a-4919-9030-4bd32292cc5b
The registered system name is: rhel9-rdma3
$

Next we need to enable the CodeReady repository.

$ sudo subscription-manager repos --enable=codeready-builder-for-rhel-9-x86_64-rpms
Repository 'codeready-builder-for-rhel-9-x86_64-rpms' is enabled for this system.

NVIDIA's drivers have dependencies on EPEL, so we will need to enable that repository as well.

$ sudo dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm -y
Red Hat CodeReady Linux Builder for RHEL 9 x86_  19 MB/s |  12 MB     00:00
Last metadata expiration check: 0:00:01 ago on Fri Apr 11 19:04:49 2025.
epel-release-latest-9.noarch.rpm                346 kB/s |  19 kB     00:00
Dependencies resolved.
================================================================================
 Package             Architecture  Version          Repository            Size
================================================================================
Installing:
 epel-release        noarch        9-9.el9          @commandline          19 k

Transaction Summary
================================================================================
Install  1 Package

Total size: 19 k
Installed size: 26 k
Downloading Packages:
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                        1/1
  Installing       : epel-release-9-9.el9.noarch                            1/1
  Running scriptlet: epel-release-9-9.el9.noarch                            1/1
Many EPEL packages require the CodeReady Builder (CRB) repository.
It is recommended that you run /usr/bin/crb enable to enable the CRB repository.

[ 7451.044946] systemd-rc-local-generator[5104]: /etc/rc.d/rc.local is not marked executable, skipping.
  Verifying        : epel-release-9-9.el9.noarch                            1/1
Installed products updated.

Installed:
  epel-release-9-9.el9.noarch

Complete!
$

Next we need to enable the NVIDIA CUDA repository.

$ sudo tee /etc/yum.repos.d/cuda-rhel9.repo > /dev/null <<EOF
[cuda-rhel9-x86_64]
name=cuda-rhel9-x86_64
baseurl=https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64
enabled=1
gpgcheck=1
gpgkey=https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/D42D0685.pub
EOF

We will also need to enable the NVIDIA DOCA repository.

$ sudo tee /etc/yum.repos.d/doca-rhel9.repo > /dev/null <<EOF
[doca-rhel9-x86_64]
name=doca-rhel9-x86_64
baseurl=https://linux.mellanox.com/public/repo/doca/2.10.0/rhel9.4/x86_64
enabled=1
gpgcheck=0
EOF

After adding the required repositories we also need to make sure the nouveau driver is properly blacklisted. To do this we will need to edit the default grub file.

$ sudo vi /etc/default/grub

We just want to append modprobe.blacklist=nouveau to the GRUB_CMDLINE_LINUX line like the example below.

GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="crashkernel=auto resume=/dev/mapper/rhel-swap rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet modprobe.blacklist=nouveau"
GRUB_DISABLE_RECOVERY="true"
GRUB_ENABLE_BLSCFG=true

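We also need to create a denylist.conf file with two entries under /etc/modprobe.d. The output is piped through sudo tee because a plain shell redirect would run as the unprivileged user.

$ echo "blacklist nouveau" | sudo tee /etc/modprobe.d/denylist.conf
$ echo "options nouveau modeset=0" | sudo tee -a /etc/modprobe.d/denylist.conf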

Finally we can rebuild the dracut image and generate the new grub file.

$ sudo dracut --force
$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg
Generating grub configuration file ...
Adding boot menu entry for UEFI Firmware Settings ...
done

With our repos and blacklists in place we can validate the repolist looks correct.

$ sudo dnf repolist
Updating Subscription Management repositories.
repo id                                    repo name
codeready-builder-for-rhel-9-x86_64-rpms   Red Hat CodeReady Linux Builder for RHEL 9 x86_64 (RPMs)
cuda-rhel9-x86_64                          cuda-rhel9-x86_64
doca-rhel9-x86_64                          doca-rhel9-x86_64
epel                                       Extra Packages for Enterprise Linux 9 - x86_64
epel-cisco-openh264                        Extra Packages for Enterprise Linux 9 openh264 (From Cisco) - x86_64
rhel-9-for-x86_64-appstream-rpms           Red Hat Enterprise Linux 9 for x86_64 - AppStream (RPMs)
rhel-9-for-x86_64-baseos-rpms              Red Hat Enterprise Linux 9 for x86_64 - BaseOS (RPMs)

Before we proceed to installing the NVIDIA DOCA drivers we should reboot the VMs so our blacklist of the nouveau driver takes effect. Once rebooted we can proceed.
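After the reboot it is worth confirming that nouveau really did stay out of the kernel before installing anything else. The sketch below checks the module list for a given name; module_loaded is a hypothetical helper name, and it assumes the standard /proc/modules format where the module name is the first field of each line.

```shell
#!/usr/bin/env bash
# module_loaded (hypothetical helper): succeed if the named kernel module
# appears in a /proc/modules style file (defaults to /proc/modules).
module_loaded() {
  local mod="$1" src="${2:-/proc/modules}"
  awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$src"
}

# Report whether the blacklist took effect on this machine.
if module_loaded nouveau; then
  echo "nouveau is still loaded, recheck the blacklist entries"
else
  echo "nouveau is not loaded"
fi
```

The same helper can be reused later in the workflow, for example to confirm that the nvidia or mlx5_core modules are present once their drivers are installed.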

Install NVIDIA DOCA Drivers

Now that we have our repositories set up we can begin to install the DOCA drivers. This is done with the following command.

$ sudo dnf install doca-all -y Updating Subscription Management repositories. cuda-rhel9-x86_64 7.4 MB/s | 2.6 MB 00:00 doca-rhel9-x86_64 208 kB/s | 214 kB 00:01 Extra Packages for Enterprise Linux 9 - x86_64 33 MB/s | 23 MB 00:00 Extra Packages for Enterprise Linux 9 openh264 8.5 kB/s | 2.5 kB 00:00 Dependencies resolved. ========================================================================================================================================= Package Arch Version Repository Size ========================================================================================================================================= Installing: doca-all x86_64 2.10.0-0.5.2 doca-rhel9-x86_64 6.6 k doca-sosreport noarch 4.8.1-1.el9 doca-rhel9-x86_64 862 k replacing sos.noarch 4.7.2-3.el9 kernel-core x86_64 5.14.0-427.42.1.el9_4 rhel-9-for-x86_64-baseos-rpms 19 M kernel-core x86_64 5.14.0-503.35.1.el9_5 rhel-9-for-x86_64-baseos-rpms 18 M kernel-modules x86_64 5.14.0-503.35.1.el9_5 rhel-9-for-x86_64-baseos-rpms 37 M (...) 
unbound-libs x86_64 1.16.2-8.el9_5.1 rhel-9-for-x86_64-appstream-rpms 552 k vim-filesystem noarch 2:8.2.2637-21.el9 rhel-9-for-x86_64-baseos-rpms 17 k xpmem x86_64 2.7.4-1.2501056.rhel9u4 doca-rhel9-x86_64 20 k xz-devel x86_64 5.2.5-8.el9_0 rhel-9-for-x86_64-appstream-rpms 59 k zlib-devel x86_64 1.2.11-40.el9 rhel-9-for-x86_64-appstream-rpms 47 k Installing weak dependencies: perl-NDBM_File x86_64 1.15-481.el9 rhel-9-for-x86_64-appstream-rpms 23 k python3-boto3 noarch 1.28.62-1.el9 epel 164 k Transaction Summary ========================================================================================================================================= Install 223 Packages Upgrade 1 Package Total download size: 452 M Is this ok [y/N]: y Downloading Packages: (1/224): collectx_1.20.2-23151356-rhel9.1-x86_6 323 kB/s | 222 kB 00:00 (2/224): clusterkit-1.15.469-1.2501056.x86_64.r 200 kB/s | 138 kB 00:00 (3/224): doca-all-2.10.0-0.5.2.x86_64.rpm 19 kB/s | 6.6 kB 00:00 (...) (221/224): meson-0.63.3-1.el9.noarch.rpm 11 MB/s | 1.5 MB 00:00 (222/224): ninja-build-1.10.2-6.el9.x86_64.rpm 1.0 MB/s | 150 kB 00:00 (223/224): libzip-devel-1.7.3-8.el9.x86_64.rpm 3.1 MB/s | 212 kB 00:00 (224/224): libibverbs-2501mlnx56-1.2501056.x86_ 474 kB/s | 358 kB 00:00 -------------------------------------------------------------------------------- Total 14 MB/s | 452 MB 00:33 Extra Packages for Enterprise Linux 9 - x86_64 1.6 MB/s | 1.6 kB 00:00 Running transaction check Transaction check succeeded. Running transaction test Transaction test succeeded. Running transaction Preparing : 1/1 Upgrading : libibverbs-2501mlnx56-1.2501056.x86_64 1/226 Running scriptlet: libibverbs-2501mlnx56-1.2501056.x86_64 1/226 Installing : doca-sdk-common-2.10.0087-1.el9.x86_64 2/226 Running scriptlet: doca-sdk-common-2.10.0087-1.el9.x86_64 2/226 Installing : ucx-1.18.0-1.2501056.x86_64 3/226 Running scriptlet: ucx-1.18.0-1.2501056.x86_64 3/226 Installing : libibumad-2501mlnx56-1.2501056.x86_64 4/226 (...) 
Installing : doca-all-2.10.0-0.5.2.x86_64 224/226 Obsoleting : sos-4.7.2-3.el9.noarch 225/226 Cleanup : libibverbs-51.0-1.el9.x86_64 226/226 Running scriptlet: kernel-modules-core-5.14.0-427.42.1.el9_4.x86_64 226/226 Running scriptlet: kernel-core-5.14.0-427.42.1.el9_4.x86_64 226/226 Running scriptlet: kernel-modules-core-5.14.0-503.35.1.el9_5.x86_64 226/226 Running scriptlet: kernel-core-5.14.0-503.35.1.el9_5.x86_64 226/226 Running scriptlet: kernel-modules-5.14.0-503.35.1.el9_5.x86_64 226/226 Running scriptlet: mlnx-ofa_kernel-devel-25.01-OFED.25.01.0.5.6.1.r 226/226 Running scriptlet: libibverbs-51.0-1.el9.x86_64 226/226 [ 1456.075588] systemd-rc-local-generator[76691]: /etc/rc.d/rc.local is not marked executable, skipping. Verifying : clusterkit-1.15.469-1.2501056.x86_64 1/226 Verifying : collectx-clxapi-1.20.2-1.x86_64 2/226 Verifying : collectx-clxapidev-1.20.2-1.x86_64 3/226 Verifying : doca-all-2.10.0-0.5.2.x86_64 4/226 (...) Verifying : meson-0.63.3-1.el9.noarch 223/226 Verifying : libzip-devel-1.7.3-8.el9.x86_64 224/226 Verifying : libibverbs-2501mlnx56-1.2501056.x86_64 225/226 Verifying : libibverbs-51.0-1.el9.x86_64 226/226 Installed products updated. Upgraded: libibverbs-2501mlnx56-1.2501056.x86_64 Installed: bzip2-devel-1.0.8-8.el9.x86_64 clusterkit-1.15.469-1.2501056.x86_64 (..) unbound-1.16.2-8.el9_5.1.x86_64 unbound-libs-1.16.2-8.el9_5.1.x86_64 vim-filesystem-2:8.2.2637-21.el9.noarch xpmem-2.7.4-1.2501056.rhel9u4.x86_64 xz-devel-5.2.5-8.el9_0.x86_64 zlib-devel-1.2.11-40.el9.x86_64 Complete!

Once the drivers are installed we can confirm that mst status is reporting properly.

$ sudo mst status -v
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module is not loaded
PCI devices:
------------
DEVICE_TYPE             MST      PCI       RDMA      NET         NUMA
BlueField3(rev:1)       NA       09:00.0   mlx5_0    net-eth1    -1

Install RHEL Dependency Packages

There are some dependencies from RHEL that will need to be installed so we can do that now.

$ dnf install wget procps-ng pciutils jq iputils ethtool net-tools git autoconf automake libtool pciutils-devel -y Updating Subscription Management repositories Last metadata expiration check: 0:59:19 ago on Sat Apr 12 18:06:50 2025. Package procps-ng-3.3.17-14.el9.x86_64 is already installed. Package pciutils-3.7.0-5.el9.x86_64 is already installed. Package jq-1.6-17.el9.x86_64 is already installed. Package iputils-20210202-9.el9.x86_64 is already installed. Package ethtool-2:6.2-1.el9.x86_64 is already installed. Dependencies resolved. =============================================================================================== Package Arch Version Repository Size =============================================================================================== Installing: autoconf noarch 2.69-39.el9 rhel-9-for-x86_64-appstream-rpms 685 k automake noarch 1.16.2-8.el9 rhel-9-for-x86_64-appstream-rpms 693 k git x86_64 2.43.5-2.el9_5 rhel-9-for-x86_64-appstream-rpms 55 k libtool x86_64 2.4.6-46.el9 rhel-9-for-x86_64-appstream-rpms 585 k net-tools x86_64 2.0-0.64.20160912git.el9 rhel-9-for-x86_64-baseos-rpms 312 k wget x86_64 1.21.1-8.el9_4 rhel-9-for-x86_64-appstream-rpms 789 k Upgrading: iputils x86_64 20210202-10.el9_5 rhel-9-for-x86_64-baseos-rpms 179 k Installing dependencies: cpp x86_64 11.5.0-2.el9 rhel-9-for-x86_64-appstream-rpms 11 M gcc x86_64 11.5.0-2.el9 rhel-9-for-x86_64-appstream-rpms 32 M git-core x86_64 2.43.5-2.el9_5 rhel-9-for-x86_64-appstream-rpms 4.4 M git-core-doc noarch 2.43.5-2.el9_5 rhel-9-for-x86_64-appstream-rpms 2.9 M glibc-devel x86_64 2.34-125.el9_5.1 rhel-9-for-x86_64-appstream-rpms 37 k glibc-headers x86_64 2.34-125.el9_5.1 rhel-9-for-x86_64-appstream-rpms 543 k libmpc x86_64 1.2.1-4.el9 rhel-9-for-x86_64-appstream-rpms 65 k libxcrypt-devel x86_64 4.4.18-3.el9 rhel-9-for-x86_64-appstream-rpms 32 k m4 x86_64 1.4.19-1.el9 rhel-9-for-x86_64-appstream-rpms 304 k make x86_64 1:4.3-8.el9 rhel-9-for-x86_64-baseos-rpms 541 k perl-DynaLoader 
x86_64 1.47-481.el9 rhel-9-for-x86_64-appstream-rpms 26 k perl-Error noarch 1:0.17029-7.el9 rhel-9-for-x86_64-appstream-rpms 46 k perl-File-Compare noarch 1.100.600-481.el9 rhel-9-for-x86_64-appstream-rpms 14 k perl-File-Copy noarch 2.34-481.el9 rhel-9-for-x86_64-appstream-rpms 20 k perl-File-Find noarch 1.37-481.el9 rhel-9-for-x86_64-appstream-rpms 26 k perl-Git noarch 2.43.5-2.el9_5 rhel-9-for-x86_64-appstream-rpms 39 k perl-TermReadKey x86_64 2.38-11.el9 rhel-9-for-x86_64-appstream-rpms 40 k perl-Thread-Queue noarch 3.14-460.el9 rhel-9-for-x86_64-appstream-rpms 24 k perl-lib x86_64 0.65-481.el9 rhel-9-for-x86_64-appstream-rpms 15 k perl-threads x86_64 1:2.25-460.el9 rhel-9-for-x86_64-appstream-rpms 61 k perl-threads-shared x86_64 1.61-460.el9 rhel-9-for-x86_64-appstream-rpms 48 k Transaction Summary =============================================================================================== Install 27 Packages Upgrade 1 Package Total download size: 56 M Downloading Packages: (1/28): net-tools-2.0-0.64.20160912git.el9.x86_ 2.0 MB/s | 312 kB 00:00 (2/28): perl-Error-0.17029-7.el9.noarch.rpm 296 kB/s | 46 kB 00:00 (3/28): make-4.3-8.el9.x86_64.rpm 2.9 MB/s | 541 kB 00:00 (4/28): libmpc-1.2.1-4.el9.x86_64.rpm 950 kB/s | 65 kB 00:00 (5/28): perl-TermReadKey-2.38-11.el9.x86_64.rpm 403 kB/s | 40 kB 00:00 (6/28): perl-threads-2.25-460.el9.x86_64.rpm 743 kB/s | 61 kB 00:00 (7/28): m4-1.4.19-1.el9.x86_64.rpm 2.4 MB/s | 304 kB 00:00 (8/28): libxcrypt-devel-4.4.18-3.el9.x86_64.rpm 162 kB/s | 32 kB 00:00 (9/28): perl-threads-shared-1.61-460.el9.x86_64 630 kB/s | 48 kB 00:00 (10/28): perl-Thread-Queue-3.14-460.el9.noarch. 
159 kB/s | 24 kB 00:00 (11/28): perl-File-Compare-1.100.600-481.el9.no 191 kB/s | 14 kB 00:00 (12/28): automake-1.16.2-8.el9.noarch.rpm 3.2 MB/s | 693 kB 00:00 (13/28): perl-File-Copy-2.34-481.el9.noarch.rpm 144 kB/s | 20 kB 00:00 (14/28): perl-File-Find-1.37-481.el9.noarch.rpm 120 kB/s | 26 kB 00:00 (15/28): perl-lib-0.65-481.el9.x86_64.rpm 85 kB/s | 15 kB 00:00 (16/28): perl-DynaLoader-1.47-481.el9.x86_64.rp 134 kB/s | 26 kB 00:00 (17/28): wget-1.21.1-8.el9_4.x86_64.rpm 5.1 MB/s | 789 kB 00:00 (18/28): autoconf-2.69-39.el9.noarch.rpm 2.8 MB/s | 685 kB 00:00 (19/28): glibc-devel-2.34-125.el9_5.1.x86_64.rp 319 kB/s | 37 kB 00:00 (20/28): libtool-2.4.6-46.el9.x86_64.rpm 7.5 MB/s | 585 kB 00:00 (21/28): glibc-headers-2.34-125.el9_5.1.x86_64. 9.0 MB/s | 543 kB 00:00 (22/28): cpp-11.5.0-2.el9.x86_64.rpm 59 MB/s | 11 MB 00:00 (23/28): git-2.43.5-2.el9_5.x86_64.rpm 508 kB/s | 55 kB 00:00 (24/28): gcc-11.5.0-2.el9.x86_64.rpm 55 MB/s | 32 MB 00:00 (25/28): git-core-doc-2.43.5-2.el9_5.noarch.rpm 19 MB/s | 2.9 MB 00:00 (26/28): git-core-2.43.5-2.el9_5.x86_64.rpm 17 MB/s | 4.4 MB 00:00 (27/28): perl-Git-2.43.5-2.el9_5.noarch.rpm 451 kB/s | 39 kB 00:00 (28/28): iputils-20210202-10.el9_5.x86_64.rpm 2.5 MB/s | 179 kB 00:00 -------------------------------------------------------------------------------- Total 38 MB/s | 56 MB 00:01 Running transaction check Transaction check succeeded. Running transaction test Transaction test succeeded. 
Running transaction Preparing : 1/1 Installing : perl-DynaLoader-1.47-481.el9.x86_64 1/29 Installing : git-core-2.43.5-2.el9_5.x86_64 2/29 Installing : perl-File-Find-1.37-481.el9.noarch 3/29 Installing : perl-File-Copy-2.34-481.el9.noarch 4/29 Installing : perl-File-Compare-1.100.600-481.el9.noarch 5/29 Installing : perl-threads-1:2.25-460.el9.x86_64 6/29 Installing : libmpc-1.2.1-4.el9.x86_64 7/29 Installing : cpp-11.5.0-2.el9.x86_64 8/29 Installing : perl-threads-shared-1.61-460.el9.x86_64 9/29 Installing : perl-Thread-Queue-3.14-460.el9.noarch 10/29 Installing : git-core-doc-2.43.5-2.el9_5.noarch 11/29 Installing : perl-TermReadKey-2.38-11.el9.x86_64 12/29 Installing : glibc-headers-2.34-125.el9_5.1.x86_64 13/29 Installing : glibc-devel-2.34-125.el9_5.1.x86_64 14/29 Installing : libxcrypt-devel-4.4.18-3.el9.x86_64 15/29 Installing : perl-lib-0.65-481.el9.x86_64 16/29 Installing : m4-1.4.19-1.el9.x86_64 17/29 Installing : autoconf-2.69-39.el9.noarch 18/29 Installing : automake-1.16.2-8.el9.noarch 19/29 Installing : perl-Error-1:0.17029-7.el9.noarch 20/29 Installing : git-2.43.5-2.el9_5.x86_64 21/29 Installing : perl-Git-2.43.5-2.el9_5.noarch 22/29 Installing : make-1:4.3-8.el9.x86_64 23/29 Installing : gcc-11.5.0-2.el9.x86_64 24/29 Installing : libtool-2.4.6-46.el9.x86_64 25/29 Upgrading : iputils-20210202-10.el9_5.x86_64 26/29 Running scriptlet: iputils-20210202-10.el9_5.x86_64 26/29 Installing : wget-1.21.1-8.el9_4.x86_64 27/29 Installing : net-tools-2.0-0.64.20160912git.el9.x86_64 28/29 Running scriptlet: net-tools-2.0-0.64.20160912git.el9.x86_64 28/29 Running scriptlet: iputils-20210202-9.el9.x86_64 29/29 Cleanup : iputils-20210202-9.el9.x86_64 29/29 Running scriptlet: iputils-20210202-9.el9.x86_64 29/29 [ 4396.017542] systemd-rc-local-generator[23097]: /etc/rc.d/rc.local is not marked executable, skipping. 
Verifying : make-1:4.3-8.el9.x86_64 1/29 Verifying : net-tools-2.0-0.64.20160912git.el9.x86_64 2/29 Verifying : perl-Error-1:0.17029-7.el9.noarch 3/29 Verifying : perl-TermReadKey-2.38-11.el9.x86_64 4/29 Verifying : libmpc-1.2.1-4.el9.x86_64 5/29 Verifying : libxcrypt-devel-4.4.18-3.el9.x86_64 6/29 Verifying : perl-threads-1:2.25-460.el9.x86_64 7/29 Verifying : m4-1.4.19-1.el9.x86_64 8/29 Verifying : perl-Thread-Queue-3.14-460.el9.noarch 9/29 Verifying : perl-threads-shared-1.61-460.el9.x86_64 10/29 Verifying : automake-1.16.2-8.el9.noarch 11/29 Verifying : perl-File-Compare-1.100.600-481.el9.noarch 12/29 Verifying : perl-File-Copy-2.34-481.el9.noarch 13/29 Verifying : perl-File-Find-1.37-481.el9.noarch 14/29 Verifying : perl-lib-0.65-481.el9.x86_64 15/29 Verifying : perl-DynaLoader-1.47-481.el9.x86_64 16/29 Verifying : wget-1.21.1-8.el9_4.x86_64 17/29 Verifying : autoconf-2.69-39.el9.noarch 18/29 Verifying : gcc-11.5.0-2.el9.x86_64 19/29 Verifying : glibc-devel-2.34-125.el9_5.1.x86_64 20/29 Verifying : libtool-2.4.6-46.el9.x86_64 21/29 Verifying : cpp-11.5.0-2.el9.x86_64 22/29 Verifying : glibc-headers-2.34-125.el9_5.1.x86_64 23/29 Verifying : git-2.43.5-2.el9_5.x86_64 24/29 Verifying : git-core-2.43.5-2.el9_5.x86_64 25/29 Verifying : git-core-doc-2.43.5-2.el9_5.noarch 26/29 Verifying : perl-Git-2.43.5-2.el9_5.noarch 27/29 Verifying : iputils-20210202-10.el9_5.x86_64 28/29 Verifying : iputils-20210202-9.el9.x86_64 29/29 Installed products updated. 
Upgraded: iputils-20210202-10.el9_5.x86_64 Installed: autoconf-2.69-39.el9.noarch automake-1.16.2-8.el9.noarch cpp-11.5.0-2.el9.x86_64 gcc-11.5.0-2.el9.x86_64 git-2.43.5-2.el9_5.x86_64 git-core-2.43.5-2.el9_5.x86_64 git-core-doc-2.43.5-2.el9_5.noarch glibc-devel-2.34-125.el9_5.1.x86_64 glibc-headers-2.34-125.el9_5.1.x86_64 libmpc-1.2.1-4.el9.x86_64 libtool-2.4.6-46.el9.x86_64 libxcrypt-devel-4.4.18-3.el9.x86_64 m4-1.4.19-1.el9.x86_64 make-1:4.3-8.el9.x86_64 net-tools-2.0-0.64.20160912git.el9.x86_64 perl-DynaLoader-1.47-481.el9.x86_64 perl-Error-1:0.17029-7.el9.noarch perl-File-Compare-1.100.600-481.el9.noarch perl-File-Copy-2.34-481.el9.noarch perl-File-Find-1.37-481.el9.noarch perl-Git-2.43.5-2.el9_5.noarch perl-TermReadKey-2.38-11.el9.x86_64 perl-Thread-Queue-3.14-460.el9.noarch perl-lib-0.65-481.el9.x86_64 perl-threads-1:2.25-460.el9.x86_64 perl-threads-shared-1.61-460.el9.x86_64 wget-1.21.1-8.el9_4.x86_64 Complete!

Install NVIDIA GPU Drivers

Next we need to install the NVIDIA GPU drivers.

$ sudo dnf -y module install nvidia-driver:570-open Updating Subscription Management repositories. Last metadata expiration check: 2:17:53 ago on Sat Apr 12 17:25:39 2025. Dependencies resolved. ========================================================================================================== Package Arch Version Repository Size ========================================================================================================== Installing group/module packages: kmod-nvidia-open-dkms noarch 3:570.124.06-1.el9 cuda-rhel9-x86_64 12 M libnvidia-cfg x86_64 3:570.124.06-1.el9 cuda-rhel9-x86_64 151 k libnvidia-fbc x86_64 3:570.124.06-1.el9 cuda-rhel9-x86_64 102 k (...) xorg-x11-drv-libinput x86_64 1.0.1-3.el9 rhel-9-for-x86_64-appstream-rpms 49 k xorg-x11-nvidia x86_64 3:570.124.06-1.el9 cuda-rhel9-x86_64 2.4 M xorg-x11-proto-devel noarch 2024.1-1.el9 rhel-9-for-x86_64-appstream-rpms 314 k xorg-x11-server-Xorg x86_64 1.20.11-26.el9 rhel-9-for-x86_64-appstream-rpms 1.5 M xorg-x11-server-common x86_64 1.20.11-26.el9 rhel-9-for-x86_64-appstream-rpms 37 k Installing module profiles: nvidia-driver/default Transaction Summary ========================================================================================================== Install 51 Packages Total download size: 335 M Installed size: 1.1 G Downloading Packages: (1/51): egl-gbm-1.1.2.1-1.el9.x86_64.rpm 90 kB/s | 22 kB 00:00 (2/51): egl-wayland-1.1.19~20250313gitf1fd514-1 162 kB/s | 44 kB 00:00 (3/51): egl-x11-1.0.1~20250324git0558d54-5.el9. 206 kB/s | 56 kB 00:00 (...) (50/51): info-6.7-15.el9.x86_64.rpm 1.5 MB/s | 228 kB 00:00 (51/51): kernel-devel-matched-5.14.0-503.35.1.e 5.8 MB/s | 2.0 MB 00:00 -------------------------------------------------------------------------------- Total 68 MB/s | 335 MB 00:04 Running transaction check Transaction check succeeded. Running transaction test Transaction test succeeded. 
Running transaction Preparing : 1/1 Installing : libnvidia-ml-3:570.124.06-1.el9.x86_64 1/51 (...) Installing : nvidia-modprobe-3:570.124.06-1.el9.x86_64 38/51 Installing : nvidia-kmod-common-3:570.124.06-1.el9.noarch 39/51 Running scriptlet: nvidia-kmod-common-3:570.124.06-1.el9.noarch 39/51 Installing : kmod-nvidia-open-dkms-3:570.124.06-1.el9.noarch 40/51 Running scriptlet: kmod-nvidia-open-dkms-3:570.124.06-1.el9.noarch 40/51 [ 474.894705] nvidia: loading out-of-tree module taints kernel. [ 474.894734] nvidia: module verification failed: signature and/or required key missing - tainting kernel [ 474.934517] nvidia-nvlink: Nvlink Core is being initialized, major device number 235 [ 474.934584] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 570.124.06 Release Build (root@rhel9-rdma3) Sat Apr 12 19:44:54 EDT 2025 [ 475.033439] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64 570.124.06 Release Build (root@rhel9-rdma3) Sat Apr 12 19:44:24 EDT 2025 [ 475.041093] [drm] [nvidia-drm] [GPU ID 0x00000a00] Loading driver [ 475.119117] ACPI Warning: \_SB.PCI0.S19.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230331/nsarguments-61) [ 476.482834] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0a:00.0 on minor 1 [ 476.482930] nvidia 0000:0a:00.0: [drm] No compatible format found [ 476.482938] nvidia 0000:0a:00.0: [drm] Cannot find any crtc or sizes [ 476.679230] nvidia-uvm: Loaded the UVM driver, major device number 511. 
Installing : egl-x11-1.0.1~20250324git0558d54-5.el9.x86_64 41/51 Installing : egl-wayland-1.1.19~20250313gitf1fd514-1.el9.x86_64 42/51 Installing : egl-gbm-2:1.1.2.1-1.el9.x86_64 43/51 Installing : nvidia-driver-libs-3:570.124.06-1.el9.x86_64 44/51 Installing : nvidia-driver-3:570.124.06-1.el9.x86_64 45/51 Running scriptlet: nvidia-driver-3:570.124.06-1.el9.x86_64 45/51 Created symlink /etc/systemd/system/systemd-hibernate.service.wants/nvidia-hibernate.service → /usr/lib/systemd/system/nvidia-hibernate.service. Created symlink /etc/systemd/system/multi-user.target.wants/nvidia-powerd.service → /usr/lib/systemd/system/nvidia-powerd.service. Created symlink /etc/systemd/system/systemd-suspend.service.wants/nvidia-resume.service → /usr/lib/systemd/system/nvidia-resume.service. Created symlink /etc/systemd/system/systemd-hibernate.service.wants/nvidia-resume.service → /usr/lib/systemd/system/nvidia-resume.service. Created symlink /etc/systemd/system/systemd-suspend-then-hibernate.service.wants/nvidia-resume.service → /usr/lib/systemd/system/nvidia-resume.service. Created symlink /etc/systemd/system/systemd-suspend.service.wants/nvidia-suspend.service → /usr/lib/systemd/system/nvidia-suspend.service. Created symlink /etc/systemd/system/systemd-suspend-then-hibernate.service.wants/nvidia-suspend-then-hibernate.service → /usr/lib/systemd/system/nvidia-suspend-then-hibernate.service. Installing : xorg-x11-nvidia-3:570.124.06-1.el9.x86_64 46/51 Installing : nvidia-xconfig-3:570.124.06-1.el9.x86_64 47/51 Installing : nvidia-settings-3:570.124.06-1.el9.x86_64 48/51 Installing : nvidia-driver-cuda-3:570.124.06-1.el9.x86_64 49/51 Installing : nvidia-libXNVCtrl-devel-3:570.124.06-1.el9.x86_64 50/51 Installing : libnvidia-fbc-3:570.124.06-1.el9.x86_64 51/51 Running scriptlet: libnvidia-fbc-3:570.124.06-1.el9.x86_64 51/51 [ 478.773624] systemd-rc-local-generator[42506]: /etc/rc.d/rc.local is not marked executable, skipping. 
Verifying : egl-gbm-2:1.1.2.1-1.el9.x86_64 1/51 Verifying : egl-wayland-1.1.19~20250313gitf1fd514-1.el9.x86_64 2/51 Verifying : egl-x11-1.0.1~20250324git0558d54-5.el9.x86_64 3/51 Verifying : kmod-nvidia-open-dkms-3:570.124.06-1.el9.noarch 4/51 (...) Verifying : kernel-devel-5.14.0-503.35.1.el9_5.x86_64 48/51 Verifying : kernel-devel-matched-5.14.0-503.35.1.el9_5.x86_64 49/51 Verifying : ed-1.14.2-12.el9.x86_64 50/51 Verifying : info-6.7-15.el9.x86_64 51/51 Installed products updated. Installed: bison-3.7.4-5.el9.x86_64 dkms-3.1.6-1.el9.noarch ed-1.14.2-12.el9.x86_64 egl-gbm-2:1.1.2.1-1.el9.x86_64 egl-wayland-1.1.19~20250313gitf1fd514-1.el9.x86_64 (...) xorg-x11-nvidia-3:570.124.06-1.el9.x86_64 xorg-x11-proto-devel-2024.1-1.el9.noarch xorg-x11-server-Xorg-1.20.11-26.el9.x86_64 xorg-x11-server-common-1.20.11-26.el9.x86_64 Complete!

Validate GPU Drivers

We can validate the NVIDIA GPU Drivers installed by running the nvidia-smi command and listing out the modules.

$ sudo nvidia-smi
Sat Apr 12 19:46:06 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     Off |   00000000:0A:00.0 Off |                    0 |
|  0%   27C    P0             70W /  300W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

$ sudo lsmod|grep nvidia
nvidia_uvm           4100096  0
nvidia_drm            143360  0
nvidia_modeset       1720320  1 nvidia_drm
nvidia              11669504  2 nvidia_uvm,nvidia_modeset
video                  73728  1 nvidia_modeset
drm_kms_helper        274432  4 bochs,drm_vram_helper,nvidia_drm
drm                   782336  8 drm_kms_helper,bochs,drm_vram_helper,nvidia,drm_ttm_helper,nvidia_drm,ttm
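If we want to script this check rather than read the lsmod output by eye, a small helper can report which modules are absent. This is a minimal sketch: the function name missing_nvidia_modules is hypothetical, and the list of expected modules (nvidia, nvidia_uvm, nvidia_modeset, nvidia_drm) is an assumption taken from the lsmod output above, not an official requirement list.

```shell
# Report which of the expected NVIDIA kernel modules are missing.
# Reads lsmod-style text on stdin so it can also be run against saved output.
# The expected module list is an assumption based on the lsmod output in this post.
missing_nvidia_modules() {
    awk '
        { loaded[$1] = 1 }
        END {
            n = split("nvidia nvidia_uvm nvidia_modeset nvidia_drm", want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in loaded))
                    print want[i]
        }
    '
}

# Sample input: module lines copied from the lsmod output above.
sample='nvidia_uvm 4100096 0
nvidia_drm 143360 0
nvidia_modeset 1720320 1 nvidia_drm
nvidia 11669504 2 nvidia_uvm,nvidia_modeset'

missing=$(printf '%s\n' "$sample" | missing_nvidia_modules)
if [ -z "$missing" ]; then
    echo "all expected NVIDIA modules are loaded"
else
    echo "missing modules: $missing"
fi
```

On a live system the same function can be fed directly with `lsmod | missing_nvidia_modules`.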

Install CUDA Libraries

Next we need to install the NVIDIA CUDA libraries.

$ sudo dnf -y install cuda-toolkit-12-8 Updating Subscription Management repositories. Last metadata expiration check: 1:01:06 ago on Sat Apr 12 18:06:50 2025. Dependencies resolved. =========================================================================================================================== Package Arch Version Repository Size =========================================================================================================================== Installing: cuda-toolkit-12-8 x86_64 12.8.1-1 cuda-rhel9-x86_64 8.8 k Installing dependencies: ModemManager-glib x86_64 1.20.2-1.el9 rhel-9-for-x86_64-baseos-rpms 337 k adwaita-cursor-theme noarch 40.1.1-3.el9 rhel-9-for-x86_64-appstream-rpms 655 k adwaita-icon-theme noarch 40.1.1-3.el9 rhel-9-for-x86_64-appstream-rpms 12 M alsa-lib x86_64 1.2.12-1.el9 rhel-9-for-x86_64-appstream-rpms 527 k at-spi2-atk x86_64 2.38.0-4.el9 rhel-9-for-x86_64-appstream-rpms 90 k (...) pipewire-alsa x86_64 1.0.1-1.el9 rhel-9-for-x86_64-appstream-rpms 59 k pipewire-jack-audio-connection-kit x86_64 1.0.1-1.el9 rhel-9-for-x86_64-appstream-rpms 9.4 k pipewire-pulseaudio x86_64 1.0.1-1.el9 rhel-9-for-x86_64-appstream-rpms 196 k tracker-miners x86_64 3.1.2-4.el9_3 rhel-9-for-x86_64-appstream-rpms 942 k xdg-desktop-portal-gtk x86_64 1.12.0-3.el9 rhel-9-for-x86_64-appstream-rpms 139 k Transaction Summary =========================================================================================================================== Install 232 Packages Total download size: 5.1 G Installed size: 9.7 G Downloading Packages: (1/232): cuda-compiler-12-8-12.8.1-1.x86_64.rpm 34 kB/s | 7.4 kB 00:00 (2/232): cuda-command-line-tools-12-8-12.8.1-1. 29 kB/s | 7.5 kB 00:00 (3/232): cuda-cccl-12-8-12.8.90-1.x86_64.rpm 4.2 MB/s | 1.6 MB 00:00 (4/232): cuda-crt-12-8-12.8.93-1.x86_64.rpm 705 kB/s | 118 kB 00:00 (5/232): cuda-cudart-12-8-12.8.90-1.x86_64.rpm 1.0 MB/s | 233 kB 00:00 (6/232): cuda-cuobjdump-12-8-12.8.90-1.x86_64.r 1.5 MB/s | 265 kB 00:00 (...) 
(228/232): ostree-libs-2024.9-1.el9_5.x86_64.rp 6.0 MB/s | 470 kB 00:00 (229/232): nss-util-3.101.0-10.el9_2.x86_64.rpm 719 kB/s | 92 kB 00:00 (230/232): tzdata-java-2025b-1.el9.noarch.rpm 3.6 MB/s | 228 kB 00:00 (231/232): libxslt-1.1.34-9.el9_5.1.x86_64.rpm 3.5 MB/s | 245 kB 00:00 (232/232): java-17-openjdk-headless-17.0.14.0.7 58 MB/s | 45 MB 00:00 -------------------------------------------------------------------------------- Total 112 MB/s | 5.1 GB 00:46 cuda-rhel9-x86_64 8.4 kB/s | 1.6 kB 00:00 Importing GPG key 0xD42D0685: Userid : "cudatools <cudatools@nvidia.com>" Fingerprint: 610C 7B14 E068 A878 070D A4E9 9CD0 A493 D42D 0685 From : https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/D42D0685.pub Key imported successfully Running transaction check Transaction check succeeded. Running transaction test Transaction test succeeded. Running transaction Running scriptlet: copy-jdk-configs-4.0-3.el9.noarch 1/1 Running scriptlet: java-17-openjdk-headless-1:17.0.14.0.7-2.el9.x86_64 1/1 Preparing : 1/1 Installing : cuda-toolkit-config-common-12.8.90-1.noarch 1/232 Installing : cuda-toolkit-12-config-common-12.8.90-1.noarch 2/232 Installing : cuda-toolkit-12-8-config-common-12.8.90-1.noarch 3/232 Installing : nspr-4.35.0-17.el9_2.x86_64 4/232 Installing : alsa-lib-1.2.12-1.el9.x86_64 5/232 Installing : libogg-2:1.3.4-6.el9.x86_64 6/232 Installing : avahi-libs-0.8-21.el9.x86_64 7/232 Installing : libvorbis-1:1.3.7-5.el9.x86_64 8/232 (...) Running scriptlet: copy-jdk-configs-4.0-3.el9.noarch 232/232 Running scriptlet: wireplumber-0.4.14-1.el9.x86_64 232/232 Created symlink /etc/systemd/user/pipewire-session-manager.service → /usr/lib/systemd/user/wireplumber.service. Created symlink /etc/systemd/user/pipewire.service.wants/wireplumber.service → /usr/lib/systemd/user/wireplumber.service. 
Running scriptlet: java-17-openjdk-headless-1:17.0.14.0.7-2.el9.x86 232/232 Running scriptlet: fontconfig-2.14.0-2.el9_1.x86_64 232/232 Running scriptlet: java-17-openjdk-1:17.0.14.0.7-2.el9.x86_64 232/232 Running scriptlet: cuda-nvvp-12-8-12.8.93-1.x86_64 232/232 Running scriptlet: nsight-compute-2025.1.1-2025.1.1.2-1.x86_64 232/232 Running scriptlet: pipewire-pulseaudio-1.0.1-1.el9.x86_64 232/232 [ 4710.758477] systemd-rc-local-generator[27383]: /etc/rc.d/rc.local is not marked executable, skipping. Verifying : cuda-cccl-12-8-12.8.90-1.x86_64 1/232 Verifying : cuda-command-line-tools-12-8-12.8.1-1.x86_64 2/232 Verifying : cuda-compiler-12-8-12.8.1-1.x86_64 3/232 Verifying : cuda-crt-12-8-12.8.93-1.x86_64 4/232 Verifying : cuda-cudart-12-8-12.8.90-1.x86_64 5/232 Verifying : cuda-cudart-devel-12-8-12.8.90-1.x86_64 6/232 Verifying : cuda-cuobjdump-12-8-12.8.90-1.x86_64 7/232 (...) Verifying : nss-util-3.101.0-10.el9_2.x86_64 229/232 Verifying : ostree-libs-2024.9-1.el9_5.x86_64 230/232 Verifying : libxslt-1.1.34-9.el9_5.1.x86_64 231/232 Verifying : tzdata-java-2025b-1.el9.noarch 232/232 Installed products updated. Installed: ModemManager-glib-1.20.2-1.el9.x86_64 adwaita-cursor-theme-40.1.1-3.el9.noarch adwaita-icon-theme-40.1.1-3.el9.noarch alsa-lib-1.2.12-1.el9.x86_64 at-spi2-atk-2.38.0-4.el9.x86_64 at-spi2-core-2.40.3-1.el9.x86_64 (...) xcb-util-keysyms-0.4.0-17.el9.x86_64 xcb-util-renderutil-0.3.9-20.el9.x86_64 xcb-util-wm-0.4.1-22.el9.x86_64 xdg-dbus-proxy-0.1.3-1.el9.x86_64 xdg-desktop-portal-1.12.6-1.el9.x86_64 xdg-desktop-portal-gtk-1.12.0-3.el9.x86_64 xkeyboard-config-2.33-2.el9.noarch xml-common-0.6.3-58.el9.noarch xorg-x11-fonts-Type1-7.5-33.el9.noarch Complete!

Build Perftest with CUDA Libraries

Now we can move on to building the perftest binaries by first cloning the repository.

$ git clone https://github.com/linux-rdma/perftest.git
Cloning into 'perftest'...
remote: Enumerating objects: 6237, done.
remote: Counting objects: 100% (2711/2711), done.
remote: Compressing objects: 100% (163/163), done.
remote: Total 6237 (delta 2600), reused 2548 (delta 2548), pack-reused 3526 (from 2)

Next we need to export our paths for the CUDA libraries.

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
$ export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
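One detail worth knowing about these exports: if the variable was previously unset, the value ends with a trailing colon, and an empty element in these search paths can be interpreted as the current directory. The sketch below shows one way to avoid that. The helper name prepend_path is hypothetical and not part of the original steps.

```shell
# Prepend a directory to a colon-separated path value without leaving an
# empty element behind when the current value is unset or empty.
prepend_path() {
    # $1 = directory to add, $2 = current value (may be empty)
    printf '%s%s\n' "$1" "${2:+:$2}"
}

LD_LIBRARY_PATH=$(prepend_path /usr/local/cuda/lib64 "${LD_LIBRARY_PATH:-}")
LIBRARY_PATH=$(prepend_path /usr/local/cuda/lib64 "${LIBRARY_PATH:-}")
export LD_LIBRARY_PATH LIBRARY_PATH

echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
echo "LIBRARY_PATH=$LIBRARY_PATH"
```

The result is the same as the plain exports when the variables already hold a value, and a clean single entry when they do not.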

Now we can change directories into the perftest project and run autogen.sh.

$ cd ./perftest
$ ./autogen.sh
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, 'config'.
libtoolize: copying file 'config/ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'.
libtoolize: copying file 'm4/libtool.m4'
libtoolize: copying file 'm4/ltoptions.m4'
libtoolize: copying file 'm4/ltsugar.m4'
libtoolize: copying file 'm4/ltversion.m4'
libtoolize: copying file 'm4/lt~obsolete.m4'
libtoolize: 'AC_PROG_RANLIB' is rendered obsolete by 'LT_INIT'
configure.ac:55: installing 'config/compile'
configure.ac:59: installing 'config/config.guess'
configure.ac:59: installing 'config/config.sub'
configure.ac:36: installing 'config/install-sh'
configure.ac:36: installing 'config/missing'
Makefile.am: installing 'config/depcomp'

Next we need to run the configure script and pass it the path to the CUDA header file.

$ ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h configure: loading site script /usr/share/config.site checking for a BSD-compatible install... /bin/install -c checking whether build environment is sane... yes checking for a thread-safe mkdir -p... /bin/mkdir -p checking for gawk... gawk checking whether make sets $(MAKE)... yes checking whether make supports nested variables... yes checking whether make supports nested variables... (cached) yes checking for gcc... gcc checking whether the C compiler works... yes checking for C compiler default output file name... a.out checking for suffix of executables... checking whether we are cross compiling... no checking for suffix of object files... o checking whether we are using the GNU C compiler... yes checking whether gcc accepts -g... yes checking for gcc option to accept ISO C89... none needed checking whether gcc understands -c and -o together... yes checking whether make supports the include directive... yes (GNU style) checking dependency style of gcc... gcc3 checking for g++... g++ checking whether we are using the GNU C++ compiler... yes checking whether g++ accepts -g... yes checking dependency style of g++... gcc3 checking dependency style of gcc... gcc3 checking build system type... x86_64-pc-linux-gnu checking host system type... x86_64-pc-linux-gnu checking how to print strings... printf checking for a sed that does not truncate output... /bin/sed checking for grep that handles long lines and -e... /bin/grep checking for egrep... /bin/grep -E checking for fgrep... /bin/grep -F checking for ld used by gcc... /bin/ld checking if the linker (/bin/ld) is GNU ld... yes checking for BSD- or MS-compatible name lister (nm)... /bin/nm -B checking the name lister (/bin/nm -B) interface... BSD nm checking whether ln -s works... yes checking the maximum length of command line arguments... 1572864 checking how to convert x86_64-pc-linux-gnu file names to x86_64-pc-linux-gnu format... 
func_convert_file_noop checking how to convert x86_64-pc-linux-gnu file names to toolchain format... func_convert_file_noop checking for /bin/ld option to reload object files... -r checking for objdump... objdump checking how to recognize dependent libraries... pass_all checking for dlltool... no checking how to associate runtime and link libraries... printf %s\n checking for ar... ar checking for archiver @FILE support... @ checking for strip... strip checking for ranlib... ranlib checking command to parse /bin/nm -B output from gcc object... ok checking for sysroot... no checking for a working dd... /bin/dd checking how to truncate binary pipes... /bin/dd bs=4096 count=1 checking for mt... no checking if : is a manifest tool... no checking how to run the C preprocessor... gcc -E checking for ANSI C header files... yes checking for sys/types.h... yes checking for sys/stat.h... yes checking for stdlib.h... yes checking for string.h... yes checking for memory.h... yes checking for strings.h... yes checking for inttypes.h... yes checking for stdint.h... yes checking for unistd.h... yes checking for dlfcn.h... yes checking for objdir... .libs checking if gcc supports -fno-rtti -fno-exceptions... no checking for gcc option to produce PIC... -fPIC -DPIC checking if gcc PIC flag -fPIC -DPIC works... yes checking if gcc static flag -static works... no checking if gcc supports -c -o file.o... yes checking if gcc supports -c -o file.o... (cached) yes checking whether the gcc linker (/bin/ld -m elf_x86_64) supports shared libraries... yes checking whether -lc should be explicitly linked in... no checking dynamic linker characteristics... GNU/Linux ld.so checking how to hardcode library paths into programs... immediate checking whether stripping libraries is possible... yes checking if libtool supports shared libraries... yes checking whether to build shared libraries... yes checking whether to build static libraries... yes checking how to run the C++ preprocessor... 
g++ -E checking for ld used by g++... /bin/ld -m elf_x86_64 checking if the linker (/bin/ld -m elf_x86_64) is GNU ld... yes checking whether the g++ linker (/bin/ld -m elf_x86_64) supports shared libraries... yes checking for g++ option to produce PIC... -fPIC -DPIC checking if g++ PIC flag -fPIC -DPIC works... yes checking if g++ static flag -static works... no checking if g++ supports -c -o file.o... yes checking if g++ supports -c -o file.o... (cached) yes checking whether the g++ linker (/bin/ld -m elf_x86_64) supports shared libraries... yes checking dynamic linker characteristics... (cached) GNU/Linux ld.so checking how to hardcode library paths into programs... immediate checking for ranlib... (cached) ranlib checking for ANSI C header files... (cached) yes checking infiniband/verbs.h usability... yes checking infiniband/verbs.h presence... yes checking for infiniband/verbs.h... yes checking for ibv_get_device_list in -libverbs... yes checking for rdma_create_event_channel in -lrdmacm... yes checking for umad_init in -libumad... yes checking for log in -lm... yes checking for ibv_reg_dmabuf_mr in -libverbs... yes checking pci/pci.h usability... yes checking pci/pci.h presence... yes checking for pci/pci.h... yes checking for pci_init in -lpci... yes checking for cuMemGetHandleForAddressRange in -lcuda... yes checking for efadv_create_qp_ex in -lefa... yes checking for mlx5dv_create_qp in -lmlx5... yes checking for hnsdv_query_device in -lhns... no checking that generated files are newer than configure... done configure: creating ./config.status config.status: creating Makefile config.status: creating config.h config.status: executing depfiles commands config.status: executing libtool commands config.status: executing man commands
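The two lines in that output that matter most for the later GPU tests are the checks for ibv_reg_dmabuf_mr and cuMemGetHandleForAddressRange, which both need to answer yes. A minimal sketch for checking a saved configure log follows. The function name configure_checks_ok and the idea of saving the output to a file are assumptions for illustration, not part of the original build steps.

```shell
# Succeed only if each named library check in a configure log ended in "yes".
# The log is read from stdin; the symbol names are passed as arguments.
configure_checks_ok() {
    log=$(cat)
    for sym in "$@"; do
        if ! printf '%s\n' "$log" | grep -q "checking for $sym in .*\.\.\. yes"; then
            echo "configure did not find $sym" >&2
            return 1
        fi
    done
    return 0
}

# Sample input: two lines copied from the configure output above.
sample='checking for ibv_reg_dmabuf_mr in -libverbs... yes
checking for cuMemGetHandleForAddressRange in -lcuda... yes'

if printf '%s\n' "$sample" | configure_checks_ok ibv_reg_dmabuf_mr cuMemGetHandleForAddressRange; then
    echo "CUDA and DMA-BUF checks passed"
fi
```

On a real build we could pipe the configure run through tee to keep a log and then feed that log to the function on stdin.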

Finally we can run make to build the binaries.

$ make -j make all-am make[1]: Entering directory '/home/cloud-user/perftest' ln -s .././man/perftest.1 man/ib_read_bw.1 ln -s .././man/perftest.1 man/ib_write_bw.1 ln -s .././man/perftest.1 man/ib_send_bw.1 ln -s .././man/perftest.1 man/ib_atomic_bw.1 ln -s .././man/perftest.1 man/ib_read_lat.1 ln -s .././man/perftest.1 man/ib_write_lat.1 ln -s .././man/perftest.1 man/ib_send_lat.1 ln -s .././man/perftest.1 man/raw_ethernet_bw.1 ln -s .././man/perftest.1 man/ib_atomic_lat.1 ln -s .././man/perftest.1 man/raw_ethernet_lat.1 ln -s .././man/perftest.1 man/raw_ethernet_burst_lat.1 CC src/send_bw.o ln -s .././man/perftest.1 man/raw_ethernet_fs_rate.1 CC src/multicast_resources.o CC src/perftest_communication.o CC src/get_clock.o CC src/perftest_parameters.o CC src/perftest_resources.o CC src/perftest_counters.o CC src/host_memory.o CC src/mmap_memory.o CC src/cuda_memory.o CC src/raw_ethernet_resources.o CC src/send_lat.o CC src/write_lat.o CC src/write_bw.o CC src/read_lat.o CC src/read_bw.o CC src/atomic_lat.o CC src/atomic_bw.o CC src/raw_ethernet_send_bw.o CC src/raw_ethernet_send_lat.o CC src/raw_ethernet_send_burst_lat.o CC src/raw_ethernet_fs_rate.o AR libperftest.a CCLD ib_send_bw CCLD ib_send_lat CCLD ib_write_bw CCLD ib_write_lat CCLD ib_read_lat CCLD ib_read_bw CCLD ib_atomic_bw CCLD raw_ethernet_bw CCLD ib_atomic_lat CCLD raw_ethernet_lat CCLD raw_ethernet_burst_lat CCLD raw_ethernet_fs_rate make[1]: Leaving directory '/home/cloud-user/perftest'

Configure Secondary Interface in Virtual Machines

Inside our virtual machines we need to confirm the device is visible and then configure IP addresses on the interfaces. First we can look at the mst status.

$ mst status -v
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module is not loaded
PCI devices:
------------
DEVICE_TYPE          MST   PCI       RDMA     NET        NUMA
BlueField3(rev:1)    NA    09:00.0   mlx5_0   net-eth1   -1

Next we can find our eth1 interface.

$ nmcli con show
NAME                UUID                                  TYPE      DEVICE
System eth0         5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03  ethernet  eth0
Wired connection 1  6ca36168-6830-3427-8853-c89c61c8b70b  ethernet  eth1
lo                  7d903415-466e-41c0-9e52-062f9a33270c  loopback  lo

We will bring down the interface.

$ nmcli con down "Wired connection 1"
Connection 'Wired connection 1' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/3)

Modify the interface to add a static IP address and an MTU of 9000. The example below is from the second VM; the first VM uses 192.168.12.1/24.

$ nmcli con modify "Wired connection 1" ipv4.method manual ipv4.addresses 192.168.12.2/24 mtu 9000

Then bring the interface back up.

$ nmcli con up "Wired connection 1"
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/4)

This completes setting up the network connectivity inside the virtual machines.
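Before moving on, it can be worth confirming that the 9000-byte MTU really works end to end between the two VMs. A common way is a ping with the Don't Fragment bit set and the largest payload that fits in one packet. The sketch below only works out the payload size and prints the command to run. The helper name icmp_payload_for_mtu is hypothetical, and the arithmetic assumes an IPv4 header without options.

```shell
# Largest ICMP echo payload that fits in one IPv4 packet of the given MTU:
# MTU minus 20 bytes of IPv4 header (no options) minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

peer=192.168.12.1   # address of the other VM, as used in this post
size=$(icmp_payload_for_mtu 9000)

# -M do sets the Don't Fragment bit, so the ping fails if the path MTU is smaller.
echo "ping -M do -s $size -c 3 $peer"
```

For an MTU of 9000 the payload works out to 8972 bytes. If the printed ping command fails while a default-size ping succeeds, something along the path is not passing jumbo frames.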

Run Performance Tests

Now that we have configured our virtual machines with all the requirements, we can run some tests to confirm that GPUDirect RDMA is working properly. To do this we will use the perftest tooling we built and run the ib_write_bw command. This requires a console session in each of the two virtual machines. In the first console session we will run the ib_write_bw listener and in the second we will run the initiator. In the first VM we will run the following command.

$ sudo /home/cloud-user/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_0 -p 10000 --source_ip 192.168.12.1
WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
************************************
* Waiting for client to connect... *
************************************

The second VM should have the following command.

$ sudo /home/cloud-user/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_0 -p 10000 --source_ip 192.168.12.2 192.168.12.1

If we go back to the first VM's console screen we should see output similar to the results of our run below.

************************************ * Waiting for client to connect... * ************************************ --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_0 Number of qps : 16 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON Lock-free : OFF ibv_wr* API : ON Using DDP : OFF CQ Moderation : 1 CQE Poll Batch : 16 Mtu : 4096[B] Link type : Ethernet GID index : 3 Max inline data : 0[B] rdma_cm QPs : ON Data ex. method : rdma_cm TOS : 41 --------------------------------------------------------------------------------------- Waiting for client rdma_cm QP to connect Please run the same command with the IB/RoCE interface IP --------------------------------------------------------------------------------------- local address: LID 0000 QPN 0x0129 PSN 0x40432d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01 local address: LID 0000 QPN 0x012a PSN 0xfa7213 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01 local address: LID 0000 QPN 0x012b PSN 0x152561 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01 local address: LID 0000 QPN 0x012d PSN 0x28af9c GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01 local address: LID 0000 QPN 0x012e PSN 0x5aa37f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01 local address: LID 0000 QPN 0x012f PSN 0x280c75 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01 local address: LID 0000 QPN 0x0130 PSN 0x42d30 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01 local address: LID 0000 QPN 0x0131 PSN 0x659969 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01 local address: LID 0000 QPN 0x0132 PSN 0xb18159 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01 local address: LID 0000 QPN 0x0133 PSN 0x9c8667 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01 local address: LID 0000 QPN 0x0134 PSN 0x6af97f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01 local 
address: LID 0000 QPN 0x0135 PSN 0x315a6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01 local address: LID 0000 QPN 0x0136 PSN 0xd4499a GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01 local address: LID 0000 QPN 0x0137 PSN 0xc79c3f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01 local address: LID 0000 QPN 0x0138 PSN 0xf1a591 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01 local address: LID 0000 QPN 0x0139 PSN 0x999481 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01 remote address: LID 0000 QPN 0x004b PSN 0xe23c56 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02 remote address: LID 0000 QPN 0x004c PSN 0x985e88 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02 remote address: LID 0000 QPN 0x004d PSN 0x9ff132 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02 remote address: LID 0000 QPN 0x004e PSN 0xb3d99 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02 remote address: LID 0000 QPN 0x0050 PSN 0x9e6638 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02 remote address: LID 0000 QPN 0x0051 PSN 0x802b3a GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02 remote address: LID 0000 QPN 0x0052 PSN 0xca4511 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02 remote address: LID 0000 QPN 0x0053 PSN 0x9dea36 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02 remote address: LID 0000 QPN 0x0054 PSN 0x4016a2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02 remote address: LID 0000 QPN 0x0055 PSN 0xbdac7c GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02 remote address: LID 0000 QPN 0x0056 PSN 0x1b0e70 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02 remote address: LID 0000 QPN 0x0057 PSN 0xf88643 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02 remote address: LID 0000 QPN 0x0058 PSN 0x3e4a73 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02 remote address: LID 0000 QPN 0x0059 PSN 0xb8eea4 GID: 
00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02 remote address: LID 0000 QPN 0x005a PSN 0xd47892 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02 remote address: LID 0000 QPN 0x005b PSN 0xac51ee GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 65536 22422716 0.00 391.87 0.747423 ---------------------------------------------------------------------------------------
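The number we care about in that output is the BW average column in the final results row. If we capture the run to a file, a short awk filter can pull that figure out for later comparison. This is a minimal sketch: the function name bw_average is hypothetical, and it assumes the results row directly follows the #bytes header line as in the run above.

```shell
# Print the "BW average[Gb/sec]" value from ib_write_bw output read on stdin.
# Assumes the results row is the first row with at least five fields after
# the header line that starts with "#bytes".
bw_average() {
    awk '
        $1 == "#bytes"  { want = 1; next }
        want && NF >= 5 { print $4; exit }
    '
}

# Sample input: header and results row copied from the run above.
sample=' #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
 65536 22422716 0.00 391.87 0.747423'

echo "BW average: $(printf '%s\n' "$sample" | bw_average) Gb/sec"
```

With the sample rows from our run this reports 391.87 Gb/sec.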

Next we can run another test in our virtual machines, this time adding the following switches to our original command on both sides: --use_cuda=0 --use_cuda_dmabuf. This ensures we are now testing with GPU memory and using DMA-BUF. Note that DMA-BUF is now preferred over using the nvidia-peermem module. The first VM should have the following command.

$ sudo /home/cloud-user/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_0 -p 10000 --source_ip 192.168.12.1 --use_cuda=0 --use_cuda_dmabuf
WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
************************************
* Waiting for client to connect... *
************************************

The second VM should have the following command.

$ sudo /home/cloud-user/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_0 -p 10000 --source_ip 192.168.12.2 192.168.12.1 --use_cuda=0 --use_cuda_dmabuf

If we go back to the first VM's console screen we should see output similar to the results of our run below.

************************************ * Waiting for client to connect... * ************************************ initializing CUDA Listing all CUDA devices in system: CUDA device 0: PCIe address is 0A:00 Picking device No. 0 [pid = 2206, dev = 0] device name = [NVIDIA A40] creating CUDA Ctx making it the current CUDA Ctx CUDA device integrated: 0 cuMemAlloc() of a 2097152 bytes GPU buffer allocated GPU buffer address at 00007fcdd8600000 pointer=0x7fcdd8600000 using DMA-BUF for GPU buffer address at 0x7fcdd8600000 aligned at 0x7fcdd8600000 with aligned size 2097152 Calling ibv_reg_dmabuf_mr(offset=0, size=2097152, addr=0x7fcdd8600000, fd=40) for QP #0 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_0 Number of qps : 16 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON Lock-free : OFF ibv_wr* API : ON Using DDP : OFF CQ Moderation : 1 CQE Poll Batch : 16 Mtu : 4096[B] Link type : Ethernet GID index : 3 Max inline data : 0[B] rdma_cm QPs : ON Data ex. 
method : rdma_cm TOS : 41
---------------------------------------------------------------------------------------
Waiting for client rdma_cm QP to connect
Please run the same command with the IB/RoCE interface IP
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x013c PSN 0xcf0fd GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01
local address: LID 0000 QPN 0x013d PSN 0x953a3 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01
local address: LID 0000 QPN 0x013e PSN 0xd28fb1 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01
local address: LID 0000 QPN 0x013f PSN 0x12d3ac GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01
local address: LID 0000 QPN 0x0140 PSN 0x325e4f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01
local address: LID 0000 QPN 0x0141 PSN 0x997705 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01
local address: LID 0000 QPN 0x0142 PSN 0xcbec80 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01
local address: LID 0000 QPN 0x0143 PSN 0x13ee79 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01
local address: LID 0000 QPN 0x0144 PSN 0x181929 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01
local address: LID 0000 QPN 0x0145 PSN 0x7009f7 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01
local address: LID 0000 QPN 0x0146 PSN 0x6d5dcf GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01
local address: LID 0000 QPN 0x0147 PSN 0xe7abb6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01
local address: LID 0000 QPN 0x0148 PSN 0xba8e6a GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01
local address: LID 0000 QPN 0x0149 PSN 0x65c8cf GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01
local address: LID 0000 QPN 0x014a PSN 0x13fee1 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01
local address: LID 0000 QPN 0x014b PSN 0x377b91 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:01
remote address: LID 0000 QPN 0x004a PSN 0x70c5e6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02
remote address: LID 0000 QPN 0x004b PSN 0xa050d8 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02
remote address: LID 0000 QPN 0x004c PSN 0x75fd42 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02
remote address: LID 0000 QPN 0x004d PSN 0xc4c069 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02
remote address: LID 0000 QPN 0x004e PSN 0xe1f8c8 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02
remote address: LID 0000 QPN 0x0050 PSN 0xb2f28a GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02
remote address: LID 0000 QPN 0x0051 PSN 0x3b0221 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02
remote address: LID 0000 QPN 0x0052 PSN 0x3aca06 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02
remote address: LID 0000 QPN 0x0053 PSN 0xe04232 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02
remote address: LID 0000 QPN 0x0054 PSN 0xd398cc GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02
remote address: LID 0000 QPN 0x0055 PSN 0x808c80 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02
remote address: LID 0000 QPN 0x0056 PSN 0x319313 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02
remote address: LID 0000 QPN 0x0057 PSN 0xcb9f03 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02
remote address: LID 0000 QPN 0x0058 PSN 0x9f4ff4 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02
remote address: LID 0000 QPN 0x0059 PSN 0x19c7a2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02
remote address: LID 0000 QPN 0x005a PSN 0xf75bbe GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:12:02
---------------------------------------------------------------------------------------
#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
65536      10383932       0.00               181.47               0.346131
---------------------------------------------------------------------------------------
deallocating GPU buffer 00007fcdd8600000
destroying current CUDA Ctx
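
To help read the results row above, here is a small Python sketch that converts the reported average bandwidth into gigabytes per second and checks that the reported message rate agrees with the bandwidth and message size. The numbers are copied from the output above. The variable and function names are purely illustrative, and the sketch assumes that Gb/sec in the report means 10^9 bits per second.

```python
# Values copied from the results row of the test output above.
msg_size_bytes = 65536            # "#bytes" column
bw_average_gbps = 181.47          # "BW average[Gb/sec]" column
reported_msg_rate_mpps = 0.346131 # "MsgRate[Mpps]" column


def gbps_to_gigabytes_per_sec(gbps):
    """Convert gigabits per second to gigabytes per second.

    Assumption: 1 Gb = 10**9 bits, so dividing by 8 gives 10**9-byte units.
    """
    return gbps / 8


def derived_msg_rate_mpps(gbps, msg_size):
    """Derive millions of messages per second from bandwidth and message size."""
    bytes_per_sec = gbps * 1e9 / 8
    return bytes_per_sec / msg_size / 1e6


bw_gigabytes = gbps_to_gigabytes_per_sec(bw_average_gbps)
derived_mpps = derived_msg_rate_mpps(bw_average_gbps, msg_size_bytes)

print(f"Average bandwidth: {bw_gigabytes:.2f} GB/s")
print(f"Derived message rate: {derived_mpps:.4f} Mpps")
print(f"Reported message rate: {reported_msg_rate_mpps:.4f} Mpps")
```

Under that assumption the average works out to roughly 22.68 GB/s, and the derived message rate matches the reported one to within the rounding of the bandwidth figure, which suggests the row is internally consistent.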

This concludes the workflow for testing RDMA inside OpenShift Virtualization virtual machines.