
Tuesday, February 24, 2026

OpenShift Network Card Rail Mapping

The goal of this writeup is to provide a simple mechanism to map which GPUs are associated with which NICs on the same PCIe switch inside a physical system. This mapping can then be used to generate an OpenShift MachineConfig that identifies one network card per GPU on the same PCIe root complex and persistently names that network device rail<N>, while marking any others as secondary. This is primarily for NVIDIA's Spectrum-X stack but could be used on any platform where GPU-to-NIC locality matters for OpenShift configuration.

Why?

For optimal cluster performance and minimal latency, it’s essential to align each GPU with its nearest high-speed network card, ideally on the same NUMA node and PCIe root complex. This ensures that data traveling to and from each GPU takes the shortest, most efficient path, which is especially critical for GPUDirect RDMA and high-throughput AI/HPC workloads.

While there are tools that can provide pieces of this view, all the commands have to be run manually, and then it's up to the user to fit it all together. Ideally there should be one solution that can provide all the details in a concise manner.

Hwloc

The Portable Hardware Locality (hwloc) software package provides a portable abstraction of the hierarchical topology of modern architectures, including NUMA memory nodes (DRAM, HBM, non-volatile memory, CXL, etc.), processor packages, shared caches, cores and simultaneous multithreading. It also gathers various system attributes such as cache and memory information as well as the locality of I/O devices such as network interfaces, InfiniBand HCAs or GPUs.  A sample image that it can generate is shown below.

Hwloc primarily aims at helping applications with gathering information about increasingly complex parallel computing platforms so as to exploit them accordingly and efficiently. For instance, two tasks that tightly cooperate should probably be placed onto cores sharing a cache. However, two independent memory-intensive tasks should better be spread out onto different processor packages so as to maximize their memory throughput.

However, hwloc does not ship in OpenShift today. Furthermore, it does not generate udev rules or MachineConfigs, and it seems heavy-handed for the task at hand.

Rail Mappings

The gpu-nic-rail-mapping script aims to provide a simple example that identifies the GPU-to-NIC relationship and then generates the MachineConfig for OpenShift so that exactly one rail is marked per GPU. Below is an example run on a Dell 9680 (H200) system with the following devices in it:

  • 8 x H200 GPUs - Device ID 10de:2335
  • 14 x BF3 Cards - Device ID 15b3:a2dc

sh-5.1# ./gpu-nic-rail-mapping -g 10de:2335 -n 15b3:a2dc -u 70-persistent-net.rules -r worker
GPU BusAddr  NIC BusAddr  PCIe Switch      NIC Slot  NIC Port  UDEV Eth   UDEV IB
====================================================================================================
1b:00.0      18:00.0      15:01.0/16:00.0  40        1         eth_rail0  roce_rail0
1b:00.0      1a:00.0      15:01.0/16:00.0  42        1         eth_sec0   roce_sec0
3c:00.0      3a:00.0      37:01.0/38:00.0  41        1         eth_rail1  roce_rail1
4b:00.0      4d:00.0      48:01.0/49:00.0  38        1         eth_rail2  roce_rail2
5c:00.0      5d:00.0      59:01.0/5a:00.0  37        1         eth_rail3  roce_rail3
5c:00.0      5f:00.0      59:01.0/5a:00.0  39        1         eth_sec1   roce_sec1
5c:00.0      5f:00.1      59:01.0/5a:00.0  39        2         eth_sec2   roce_sec2
9a:00.0      9b:00.0      97:01.0/98:00.0  32        1         eth_rail4  roce_rail4
bb:00.0      ba:00.0      b7:01.0/b8:00.0  31        1         eth_rail5  roce_rail5
bb:00.0      bc:00.0      b7:01.0/b8:00.0  33        1         eth_sec3   roce_sec3
bb:00.0      bc:00.1      b7:01.0/b8:00.0  33        2         eth_sec4   roce_sec4
cd:00.0      ca:00.0      c7:01.0/c8:00.0  36        1         eth_rail6  roce_rail6
cd:00.0      cc:00.0      c7:01.0/c8:00.0  34        1         eth_sec5   roce_sec5
dc:00.0      db:00.0      d7:01.0/d8:00.0  35        1         eth_rail7  roce_rail7
Generated 99-machine-config-udev-network.yaml file for OpenShift
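The run above suggests a naming pattern: for each GPU, the first NIC in bus-address order becomes the rail and any further NICs behind the same switch become secondary. The script's real selection logic is not shown here, so the following is only a minimal sketch of that pattern under that assumption. The function name assign_rail_names and its "GPU NIC" input format are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of gpu-nic-rail-mapping).
# Reads "GPU_BDF NIC_BDF" pairs on stdin, one per line, and prints
# "NIC_BDF NAME". The first NIC seen for a GPU gets the next rail index;
# every additional NIC for that GPU gets the next secondary index.
assign_rail_names() {
  # Sort by GPU address, then NIC address, using byte order so hex sorts predictably.
  LC_ALL=C sort -k1,1 -k2,2 | awk '
    !($1 in seen) { seen[$1] = 1; printf "%s eth_rail%d\n", $2, rail++; next }
                  { printf "%s eth_sec%d\n", $2, sec++ }
  '
}

# Example:
#   printf "%s\n" "1b:00.0 18:00.0" "1b:00.0 1a:00.0" | assign_rail_names
```

Sorting before numbering keeps the rail indices stable across runs as long as the PCI addresses themselves do not change.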

Here is the generated 70-persistent-net.rules file.

sh-5.1# cat 70-persistent-net.rules
ACTION=="add", KERNELS=="0000:18:00.0", SUBSYSTEM=="net", NAME="eth_rail0"
ACTION=="add", KERNELS=="0000:18:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail0"
ACTION=="add", KERNELS=="0000:1a:00.0", SUBSYSTEM=="net", NAME="eth_sec0"
ACTION=="add", KERNELS=="0000:1a:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec0"
ACTION=="add", KERNELS=="0000:3a:00.0", SUBSYSTEM=="net", NAME="eth_rail1"
ACTION=="add", KERNELS=="0000:3a:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail1"
ACTION=="add", KERNELS=="0000:4d:00.0", SUBSYSTEM=="net", NAME="eth_rail2"
ACTION=="add", KERNELS=="0000:4d:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail2"
ACTION=="add", KERNELS=="0000:5d:00.0", SUBSYSTEM=="net", NAME="eth_rail3"
ACTION=="add", KERNELS=="0000:5d:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail3"
ACTION=="add", KERNELS=="0000:5f:00.0", SUBSYSTEM=="net", NAME="eth_sec1"
ACTION=="add", KERNELS=="0000:5f:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec1"
ACTION=="add", KERNELS=="0000:5f:00.1", SUBSYSTEM=="net", NAME="eth_sec2"
ACTION=="add", KERNELS=="0000:5f:00.1", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec2"
ACTION=="add", KERNELS=="0000:9b:00.0", SUBSYSTEM=="net", NAME="eth_rail4"
ACTION=="add", KERNELS=="0000:9b:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail4"
ACTION=="add", KERNELS=="0000:ba:00.0", SUBSYSTEM=="net", NAME="eth_rail5"
ACTION=="add", KERNELS=="0000:ba:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail5"
ACTION=="add", KERNELS=="0000:bc:00.0", SUBSYSTEM=="net", NAME="eth_sec3"
ACTION=="add", KERNELS=="0000:bc:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec3"
ACTION=="add", KERNELS=="0000:bc:00.1", SUBSYSTEM=="net", NAME="eth_sec4"
ACTION=="add", KERNELS=="0000:bc:00.1", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec4"
ACTION=="add", KERNELS=="0000:ca:00.0", SUBSYSTEM=="net", NAME="eth_rail6"
ACTION=="add", KERNELS=="0000:ca:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail6"
ACTION=="add", KERNELS=="0000:cc:00.0", SUBSYSTEM=="net", NAME="eth_sec5"
ACTION=="add", KERNELS=="0000:cc:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec5"
ACTION=="add", KERNELS=="0000:db:00.0", SUBSYSTEM=="net", NAME="eth_rail7"
ACTION=="add", KERNELS=="0000:db:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail7"
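Each NIC in the rules file gets a pair of rules: one naming the net device and one renaming the RDMA device through rdma_rename. As a minimal sketch of how such a pair could be produced, the hypothetical helper below prints both rules for one PCI address. The function name and the assumption that every device sits in PCI domain 0000 are mine, not taken from the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of gpu-nic-rail-mapping).
#   $1 = PCI address without the domain, e.g. 18:00.0
#   $2 = name suffix, e.g. rail0 or sec0
# Assumes PCI domain 0000, as in the rules shown above.
emit_udev_rules() {
  local bdf="$1" suffix="$2"
  printf 'ACTION=="add", KERNELS=="0000:%s", SUBSYSTEM=="net", NAME="eth_%s"\n' \
    "$bdf" "$suffix"
  # %%k emits a literal %k, which udev expands to the kernel device name.
  printf 'ACTION=="add", KERNELS=="0000:%s", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %%k NAME_FIXED roce_%s"\n' \
    "$bdf" "$suffix"
}

# Example: emit_udev_rules 18:00.0 rail0 >> 70-persistent-net.rules
```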

And finally the OpenShift MachineConfig 99-machine-config-udev-network.yaml for the udev rule naming.

sh-5.1# cat 99-machine-config-udev-network.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-machine-config-udev-network
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,QUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6MTg6MDAuMCIsIFNVQlNZU1RFTT09Im5ldCIsIE5BTUU9ImV0aF9yYWlsMCIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6MTg6MDAuMCIsIFNVQlNZU1RFTT09ImluZmluaWJhbmQiLCBQUk9HUkFNPSJyZG1hX3JlbmFtZSAlayBOQU1FX0ZJWEVEIHJvY2VfcmFpbDAiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjFhOjAwLjAiLCBTVUJTWVNURU09PSJuZXQiLCBOQU1FPSJldGhfc2VjMCIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6MWE6MDAuMCIsIFNVQlNZU1RFTT09ImluZmluaWJhbmQiLCBQUk9HUkFNPSJyZG1hX3JlbmFtZSAlayBOQU1FX0ZJWEVEIHJvY2Vfc2VjMCIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6M2E6MDAuMCIsIFNVQlNZU1RFTT09Im5ldCIsIE5BTUU9ImV0aF9yYWlsMSIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6M2E6MDAuMCIsIFNVQlNZU1RFTT09ImluZmluaWJhbmQiLCBQUk9HUkFNPSJyZG1hX3JlbmFtZSAlayBOQU1FX0ZJWEVEIHJvY2VfcmFpbDEiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjRkOjAwLjAiLCBTVUJTWVNURU09PSJuZXQiLCBOQU1FPSJldGhfcmFpbDIiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjRkOjAwLjAiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3JhaWwyIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDo1ZDowMC4wIiwgU1VCU1lTVEVNPT0ibmV0IiwgTkFNRT0iZXRoX3JhaWwzIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDo1ZDowMC4wIiwgU1VCU1lTVEVNPT0iaW5maW5pYmFuZCIsIFBST0dSQU09InJkbWFfcmVuYW1lICVrIE5BTUVfRklYRUQgcm9jZV9yYWlsMyIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6NWY6MDAuMCIsIFNVQlNZU1RFTT09Im5ldCIsIE5BTUU9ImV0aF9zZWMxIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDo1ZjowMC4wIiwgU1VCU1lTVEVNPT0iaW5maW5pYmFuZCIsIFBST0dSQU09InJkbWFfcmVuYW1lICVrIE5BTUVfRklYRUQgcm9jZV9zZWMxIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDo1ZjowMC4xIiwgU1VCU1lTVEVNPT0ibmV0IiwgTkFNRT0iZXRoX3NlYzIiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjVmOjAwLjEiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3NlYzIiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjliOjAwLjAiLCBTVUJTWVNURU09PSJuZXQiLCBOQU1FPSJldGhfcmFpbDQiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjliOjAwLjAiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3JhaWw0IgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDpiYTowMC4wIiwgU1VCU1lTVEVNPT0ibmV0IiwgTkFNRT0iZXRoX3JhaWw1IgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDpiYTowMC4wIiwgU1VCU1lTVEVNPT0iaW5maW5pYmFuZCIsIFBST0dSQU09InJkbWFfcmVuYW1lICVrIE5BTUVfRklYRUQgcm9jZV9yYWlsNSIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6YmM6MDAuMCIsIFNVQlNZU1RFTT09Im5ldCIsIE5BTUU9ImV0aF9zZWMzIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDpiYzowMC4wIiwgU1VCU1lTVEVNPT0iaW5maW5pYmFuZCIsIFBST0dSQU09InJkbWFfcmVuYW1lICVrIE5BTUVfRklYRUQgcm9jZV9zZWMzIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDpiYzowMC4xIiwgU1VCU1lTVEVNPT0ibmV0IiwgTkFNRT0iZXRoX3NlYzQiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmJjOjAwLjEiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3NlYzQiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmNhOjAwLjAiLCBTVUJTWVNURU09PSJuZXQiLCBOQU1FPSJldGhfcmFpbDYiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmNhOjAwLjAiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3JhaWw2IgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDpjYzowMC4wIiwgU1VCU1lTVEVNPT0ibmV0IiwgTkFNRT0iZXRoX3NlYzUiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmNjOjAwLjAiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3NlYzUiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmRiOjAwLjAiLCBTVUJTWVNURU09PSJuZXQiLCBOQU1FPSJldGhfcmFpbDciCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmRiOjAwLjAiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3JhaWw3Igo=
        filesystem: root
        mode: 420
        path: /etc/udev/rules.d/70-persistent-net.rules
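The source field in the MachineConfig is a data URL whose payload is the base64-encoded rules file. The sketch below shows one way to build that URL and to decode it again for inspection. Both function names are hypothetical, and base64 -w0 assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of gpu-nic-rail-mapping).

# Wrap a rules file as a data URL suitable for contents.source.
# base64 -w0 (GNU coreutils) disables line wrapping so the URL stays on one line.
rules_to_data_url() {
  printf 'data:text/plain;charset=utf-8;base64,%s\n' "$(base64 -w0 < "$1")"
}

# Inverse: strip everything up to ";base64," and decode the payload.
data_url_to_rules() {
  printf '%s' "${1#*;base64,}" | base64 -d
}

# Example:
#   rules_to_data_url 70-persistent-net.rules
```

Decoding the source value of a generated MachineConfig and comparing it against the rules file is a quick sanity check before applying it to a cluster.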

The above MachineConfig can now be applied to the worker nodes of an OpenShift cluster of homogeneous nodes to persistently name the rail devices mapped to their corresponding GPUs.

In this next example we tried this on a Supermicro AMD Instinct system which had the following devices in it:

  • 8 x MI325X - Device ID 1002:74a5
  • 7 x AMD Pensando Systems POLLARA-1Q400 100/200/400G 1-port Card - Device ID 1dd8:1002
  • 1 x NVIDIA ConnectX-7 - Device ID 15b3:1021

This system was interesting because it had multiple network card types associated with GPUs, which allowed us to test the script's behavior in that scenario. One caveat on this system was that dmidecode and lspci both failed to show the physical slot number for the Pollara cards, so the NIC Slot column reads NA for them, while the CX7 card showed its physical slot just fine.

# ./gpu-nic-rail-mapping -g 1002:74a5 -n 1dd8:1002,15b3:1021 -u 70-persistent-net.rules -r worker
GPU BusAddr  NIC BusAddr  PCIe Switch      NIC Slot  NIC Port  UDEV Eth   UDEV IB
====================================================================================================
05:00.0      09:00.0      00:01.1/01:00.0  NA        1         eth_rail0  roce_rail0
15:00.0      19:00.0      10:01.1/11:00.0  NA        1         eth_rail1  roce_rail1
65:00.0      69:00.0      60:01.1/61:00.0  NA        1         eth_rail2  roce_rail2
75:00.0      79:00.0      70:01.1/71:00.0  NA        1         eth_rail3  roce_rail3
85:00.0      89:00.0      80:01.1/81:00.0  NA        1         eth_rail4  roce_rail4
95:00.0      99:00.0      90:01.1/91:00.0  NA        1         eth_rail5  roce_rail5
e5:00.0      e6:00.0      e0:01.1/e1:00.0  1         1         eth_rail6  roce_rail6
f5:00.0      f9:00.0      f0:01.1/f1:00.0  NA        1         eth_rail7  roce_rail7
Generated 99-machine-config-udev-network.yaml file for OpenShift
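The PCIe Switch column pairs each GPU and NIC with the bridge they share. One way such a relationship could be found, offered here as an assumption rather than a description of the script, is to resolve each device under /sys/bus/pci/devices to its real sysfs path, which lists every upstream bridge, and take the deepest directory the two paths have in common. The function name below is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of gpu-nic-rail-mapping).
# Prints the deepest directory shared by two resolved sysfs device paths,
# i.e. the closest upstream bridge both devices sit behind.
common_pci_ancestor() {
  local a="$1" b="$2"
  # Walk up from a until "a/" is a prefix of b (or we hit the filesystem root).
  while [ "$a" != "/" ] && [ "${b#"$a"/}" = "$b" ]; do
    a="$(dirname "$a")"
  done
  printf '%s\n' "$a"
}

# Example on a node (addresses taken from the H200 run earlier in this post):
#   gpu="$(readlink -f /sys/bus/pci/devices/0000:1b:00.0)"
#   nic="$(readlink -f /sys/bus/pci/devices/0000:18:00.0)"
#   common_pci_ancestor "$gpu" "$nic"
```

If the result is only the root complex directory (for example /sys/devices/pci0000:15), the two devices share a root complex but not a switch.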

A 70-persistent-net.rules file and a 99-machine-config-udev-network.yaml MachineConfig were generated here as well, but they are omitted because they look very much like the ones in the H200 example.

The overall idea here was to automate an otherwise tedious task: identifying and mapping the GPUs and network devices that sit on the same PCIe root complex. Hopefully this provided a simple example of how to accomplish that. For those interested in seeing the script the repository is here.

Tuesday, April 01, 2025

NVIDIA GPU Direct Storage on OpenShift


Welcome to the NVIDIA GPU Direct Storage on OpenShift workflow. The goal of this workflow is to understand and configure NVIDIA GPU Direct Storage for NVMe devices in worker nodes of an OpenShift cluster.

What Is NVIDIA GPU Direct Storage?

GPU Direct Storage enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU. Using this direct path can relieve system bandwidth bottlenecks and decrease the latency and utilization load on the CPU.

Assumptions

This document assumes that we have already deployed an OpenShift cluster and have installed the operators required for GPU Direct Storage. Those operators are Node Feature Discovery, which should also be configured, along with the base installation of the NVIDIA Network Operator (no NicClusterPolicy yet) and the NVIDIA GPU Operator (no ClusterPolicy yet).

Considerations

If any of the NVMe devices in the system are used by the operating system or by other services (machine configs for LVMs or other customized access), the NVMe kernel modules will not be able to unload properly even with the workaround defined in this documentation. Any use of GDS requires that the NVMe drives are not in use during the deployment of the Network Operator, so that the Network Operator can unload the in-tree drivers and then load NVIDIA's out-of-tree drivers in their place.
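A quick pre-check for the condition described above is to list NVMe block devices that currently have a mount point. The following is a minimal sketch with a hypothetical function name. It only catches direct mounts, so NVMe devices used as LVM physical volumes or opened by other services would need a separate check.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Reads "NAME MOUNTPOINT" rows on stdin and prints the
# NVMe block devices that have a non-empty mount point.
nvme_in_use() {
  awk '$1 ~ /^nvme/ && NF >= 2 { print $1 }'
}

# Usage on a node (lsblk: -n no header, -r raw rows, -o select columns):
#   lsblk -nro NAME,MOUNTPOINT | nvme_in_use
```

An empty result means no NVMe namespace or partition is mounted directly.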

NVIDIA Network Operator Configuration

We assume the Network Operator has already been installed on the cluster but the NicClusterPolicy still needs to be created. The following NicClusterPolicy example provides the configuration needed to ensure the RDMA modules are properly loaded for NVMe. The key option in this policy is the ENABLE_NFSRDMA variable, which must be set to true. Note that this policy also optionally defines an rdmaSharedDevicePlugin and sets ENTRYPOINT_DEBUG to true for more verbose logging.

$ cat <<EOF > network-sharedrdma-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  nicFeatureDiscovery:
    image: nic-feature-discovery
    repository: ghcr.io/mellanox
    version: v0.0.1
  docaTelemetryService:
    image: doca_telemetry
    repository: nvcr.io/nvidia/doca
    version: 1.16.5-doca2.6.0-host
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ens1f0np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: 'sha256:9f468fdc4449e65e4772575f83aa85840a00f97165f9a00ba34695c91d610fbd'
  secondaryNetwork:
    ipoib:
      image: ipoib-cni
      repository: ghcr.io/mellanox
      version: v1.2.0
  nvIpam:
    enableWebhook: false
    image: nvidia-k8s-ipam
    repository: ghcr.io/mellanox
    version: v0.2.0
  ofedDriver:
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    forcePrecompiled: false
    terminationGracePeriodSeconds: 300
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
        podSelector: ''
      maxParallelUpgrades: 1
      safeLoad: false
      waitForCompletion:
        timeoutSeconds: 0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.01-0.6.0.0-0
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
    - name: ENABLE_NFSRDMA
      value: "true"
    - name: ENTRYPOINT_DEBUG
      value: 'true'
EOF

Before creating the NicClusterPolicy on the cluster we need to prepare a script which will allow us to work around an issue with GPU Direct Storage in the NVIDIA Network Operator. When run right after creating the NicClusterPolicy, this script determines which nodes have mofed pods running on them and then, for each node in that list, connects over SSH as the core user and unloads the following modules: nvme, nvme_tcp, nvme_fabrics, nvme_core. Unloading the modules while the mofed container is busy building the DOCA drivers avoids a failure that otherwise occurs when the mofed container tries to load the compiled DOCA drivers. This issue is being investigated by NVIDIA.

$ cat <<'EOF' > nvme-fixer.sh
#!/bin/bash
### Set array of modules to be unloaded
declare -a modarr=("nvme" "nvme_tcp" "nvme_fabrics" "nvme_core")

### Determine which hosts have mofed container running on them
declare -a hostarr=($(oc get pods -n nvidia-network-operator -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName --no-headers | grep mofed | awk '{print $2}'))

### Iterate through modules on each host and unload them
for host in "${hostarr[@]}"
do
  echo "Unloading nvme dependencies on $host..."
  for module in "${modarr[@]}"
  do
    echo "Unloading module $module..."
    ssh "core@$host" sudo rmmod "$module"
  done
done
EOF

Set the execute bit on the script.

$ chmod +x nvme-fixer.sh

Now we are ready to create the NicClusterPolicy on the cluster and follow it up by running the nvme-fixer.sh script. Any rmmod "not currently loaded" errors can safely be ignored, as they only mean the module was not loaded to begin with. In the example below we had two worker nodes with mofed pods running on them, so the script went ahead and unloaded the nvme modules.

$ oc create -f network-sharedrdma-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created
$ ./nvme-fixer.sh
Unloading nvme dependencies on nvd-srv-22.nvidia.eng.rdu2.dc.redhat.com...
Unloading module nvme...
Unloading module nvme_tcp...
rmmod: ERROR: Module nvme_tcp is not currently loaded
Unloading module nvme_fabrics...
rmmod: ERROR: Module nvme_fabrics is not currently loaded
Unloading module nvme_core...
Unloading nvme dependencies on nvd-srv-23.nvidia.eng.rdu2.dc.redhat.com...
Unloading module nvme...
Unloading module nvme_tcp...
Unloading module nvme_fabrics...
Unloading module nvme_core...
$

Now we wait for the mofed pods to finish compiling and installing the GPU Direct Storage modules. We will know it's complete when the pods are in a running state like below:

$ oc get pods -n nvidia-network-operator
NAME                                                          READY   STATUS    RESTARTS       AGE
kube-ipoib-cni-ds-5f8wk                                       1/1     Running   0              38s
kube-ipoib-cni-ds-956nv                                       1/1     Running   0              38s
kube-ipoib-cni-ds-jpbph                                       1/1     Running   0              38s
kube-ipoib-cni-ds-jwtw2                                       1/1     Running   0              38s
kube-ipoib-cni-ds-v4sb8                                       1/1     Running   0              38s
mofed-rhcos4.17-69fb4cd685-ds-j77vl                           2/2     Running   0              37s
mofed-rhcos4.17-69fb4cd685-ds-lw7t9                           2/2     Running   0              37s
nic-feature-discovery-ds-527wc                                1/1     Running   0              36s
nic-feature-discovery-ds-fnn9v                                1/1     Running   0              36s
nic-feature-discovery-ds-l9lkf                                1/1     Running   0              36s
nic-feature-discovery-ds-qn4m9                                1/1     Running   0              36s
nic-feature-discovery-ds-w7vw4                                1/1     Running   0              36s
nv-ipam-controller-67556c846b-c4sfq                           1/1     Running   0              36s
nv-ipam-controller-67556c846b-wvm59                           1/1     Running   0              36s
nv-ipam-node-22rw9                                            1/1     Running   0              36s
nv-ipam-node-6w4x4                                            1/1     Running   0              36s
nv-ipam-node-f2p96                                            1/1     Running   0              36s
nv-ipam-node-jssjh                                            1/1     Running   0              36s
nv-ipam-node-z2mws                                            1/1     Running   0              36s
nvidia-network-operator-controller-manager-57c7cfddc8-6nw6j   1/1     Running   16 (10h ago)   14d
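Rather than eyeballing the pod list, the same check can be scripted. The sketch below uses a hypothetical function name and assumes the default column order of oc get pods --no-headers (name, ready count, status). It succeeds only when every pod is either Completed or Running with all of its containers ready.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Reads `oc get pods --no-headers` rows on stdin.
# Prints each pod that is not ready and returns non-zero if any were found.
all_pods_ready() {
  awk '
    $3 == "Completed" { next }                 # finished jobs are fine
    { split($2, r, "/") }                      # READY column, e.g. 2/2
    $3 != "Running" || r[1] != r[2] { bad = 1; print "not ready: " $1 }
    END { exit bad + 0 }
  '
}

# Usage:
#   oc get pods -n nvidia-network-operator --no-headers | all_pods_ready
```

Wrapping the usage line in a loop with a sleep gives a simple wait-until-ready step for automation.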

We can validate that things look correct from a module perspective by logging into one of the nodes, either via SSH or a debug pod, and listing the nvme modules. The results should look like the output below. Note that I also ran lsblk to show that my NVMe device is visible.

$ ssh core@nvd-srv-23.nvidia.eng.rdu2.dc.redhat.com
Red Hat Enterprise Linux CoreOS 417.94.202502051822-0
  Part of OpenShift 4.17, RHCOS is a Kubernetes-native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.17/architecture/architecture-rhcos.html

Last login: Fri Mar 21 17:48:41 2025 from 10.22.81.26
[systemd]
Failed Units: 1
  NetworkManager-wait-online.service
[core@nvd-srv-23 ~]$ sudo bash
[root@nvd-srv-23 core]# lsmod|grep nvme
nvme_rdma              57344  0
nvme_fabrics           45056  1 nvme_rdma
nvme                   73728  0
nvme_core             204800  3 nvme,nvme_rdma,nvme_fabrics
rdma_cm               155648  3 rpcrdma,nvme_rdma,rdma_ucm
ib_core               557056  10 rdma_cm,ib_ipoib,rpcrdma,nvme_rdma,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
mlx_compat             20480  17 rdma_cm,ib_ipoib,mlxdevm,rpcrdma,nvme,nvme_rdma,mlxfw,iw_cm,nvme_core,nvme_fabrics,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
nvme_common            24576  0
t10_pi                 24576  2 sd_mod,nvme_core
[root@nvd-srv-23 core]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda       8:0    0   1.5T  0 disk
├─sda1    8:1    0     1M  0 part
├─sda2    8:2    0   127M  0 part
├─sda3    8:3    0   384M  0 part /boot
└─sda4    8:4    0   1.5T  0 part /var
                                  /sysroot/ostree/deploy/rhcos/var
                                  /usr
                                  /etc
                                  /
                                  /sysroot
sdb       8:16   0   1.5T  0 disk
sdc       8:32   0   1.5T  0 disk
sdd       8:48   0   1.5T  0 disk
nvme0n1 259:1    0 894.2G  0 disk

This completes the NVIDIA Network Operator portion of the configuration for GPU Direct Storage.

NVIDIA GPU Operator Configuration

Now that the NicClusterPolicy is defined and the proper NVMe modules have been loaded we can move on to configuring our GPU ClusterPolicy. The example below is a policy that will enable GPU Direct Storage on the worker nodes that have a supported NVIDIA GPU.

$ cat <<EOF > gpu-cluster-policy.yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    serviceMonitor:
      enabled: true
    enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      nlsEnabled: true
      configMapName: ''
    certConfig:
      name: ''
    rdma:
      enabled: true
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    virtualTopology:
      config: ''
    enabled: true
    useNvidiaDriverCRD: false
    useOpenKernelModules: true
  devicePlugin:
    config:
      name: ''
      default: ''
    mps:
      root: /run/nvidia/mps
    enabled: true
  gdrcopy:
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
      - name: WITH_WORKLOAD
        value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: true
    image: 'nvcr.io/nvidia/cloud-native/nvidia-fs:2.20.5'
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    installDir: /usr/local/nvidia
    enabled: true
EOF

Now let's create the policy on the cluster.

$ oc create -f gpu-cluster-policy.yaml
clusterpolicy.nvidia.com/gpu-cluster-policy created

Once the policy is created let's validate that the pods are running before we move on to the next step.

$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-499wh                           1/1     Running     0             18h
gpu-feature-discovery-m68bn                           1/1     Running     0             18h
gpu-operator-c9ccd586d-htl5q                          1/1     Running     0             19h
nvidia-container-toolkit-daemonset-8m4r5              1/1     Running     0             18h
nvidia-container-toolkit-daemonset-ld7qz              1/1     Running     0             18h
nvidia-cuda-validator-fddq7                           0/1     Completed   0             18h
nvidia-cuda-validator-mdk6b                           0/1     Completed   0             18h
nvidia-dcgm-565tj                                     1/1     Running     0             18h
nvidia-dcgm-exporter-jtgt6                            1/1     Running     1 (18h ago)   18h
nvidia-dcgm-exporter-znpgh                            1/1     Running     1 (18h ago)   18h
nvidia-dcgm-xpxbx                                     1/1     Running     0             18h
nvidia-device-plugin-daemonset-2vn52                  1/1     Running     0             18h
nvidia-device-plugin-daemonset-kjzjz                  1/1     Running     0             18h
nvidia-driver-daemonset-417.94.202502051822-0-pj7hk   5/5     Running     2 (18h ago)   18h
nvidia-driver-daemonset-417.94.202502051822-0-qp8xb   5/5     Running     5 (18h ago)   18h
nvidia-node-status-exporter-48cx7                     1/1     Running     0             18h
nvidia-node-status-exporter-dpmsr                     1/1     Running     0             18h
nvidia-operator-validator-fmcz4                       1/1     Running     0             18h
nvidia-operator-validator-g2fbt                       1/1     Running     0             18h

With the NVIDIA GPU Operator pods running we can rsh into one of the driver daemonset pods and confirm GDS is enabled by running the lsmod command (note the nvidia_fs module) and printing the /proc/driver/nvidia-fs/stats file.

$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-417.94.202502051822-0-pj7hk
sh-4.4# lsmod|grep nvidia
nvidia_fs             327680  0
nvidia_peermem         24576  0
nvidia_modeset       1507328  0
video                  73728  1 nvidia_modeset
nvidia_uvm           6889472  8
nvidia               8810496  43 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
ib_uverbs             217088  19 nvidia_peermem,rdma_ucm,mlx5_ib
drm                   741376  5 drm_kms_helper,drm_shmem_helper,nvidia,mgag200

$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-417.94.202502051822-0-pj7hk
sh-4.4# cat /proc/driver/nvidia-fs/stats
GDS Version: 1.10.0.4
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.20.5)
Mellanox PeerDirect Supported: True
IO stats: Disabled, peer IO stats: Disabled
Logging level: info
Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads        : err=0 io_state_err=0
Sparse Reads : n=0 io=0 holes=0 pages=0
Writes       : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap         : n=0 ok=0 err=0 munmap=0
Bar1-map     : n=0 ok=0 err=0 free=0 callbacks=0 active=0 delay-frees=0
Error        : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops          : Read=0 Write=0 BatchIO=0
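The two fields in the stats file that matter most for this check are the NVFS driver version and the PeerDirect line. The sketch below, with a hypothetical function name, pulls both out of the stats text. It assumes the file has one field per line, as the kernel module normally prints it.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Reads /proc/driver/nvidia-fs/stats content on stdin,
# reports the NVFS driver version, and fails if PeerDirect is not supported.
check_nvidia_fs_stats() {
  local stats version
  stats="$(cat)"
  version="$(printf '%s\n' "$stats" | sed -n 's/^NVFS Driver(version: \(.*\))$/\1/p')"
  if printf '%s\n' "$stats" | grep -q '^Mellanox PeerDirect Supported: True'; then
    echo "nvidia-fs ${version:-unknown}: PeerDirect supported"
  else
    echo "nvidia-fs ${version:-unknown}: PeerDirect NOT supported"
    return 1
  fi
}

# Usage from inside the driver daemonset pod:
#   check_nvidia_fs_stats < /proc/driver/nvidia-fs/stats
```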

If everything looks good we can move on to an additional step to confirm GDS is ready for workload consumption.

GDS CUDA Workload Container

Once the GPU Direct Storage drivers are loaded we can use one more tool to confirm GDS capability. This involves building a container that contains the CUDA packages and then running it on a node. The following pod YAML defines this configuration.

$ cat <<EOF > gds-check-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gds-check-workload
  namespace: default
spec:
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.3
    name: gds-check-workload
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    volumeMounts:
    - name: udev
      mountPath: /run/udev
    - name: kernel-config
      mountPath: /sys/kernel/config
    - name: dev
      mountPath: /run/dev
    - name: sys
      mountPath: /sys
    - name: results
      mountPath: /results
    - name: lib
      mountPath: /lib/modules
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
  volumes:
  - name: udev
    hostPath:
      path: /run/udev
  - name: kernel-config
    hostPath:
      path: /sys/kernel/config
  - name: dev
    hostPath:
      path: /run/dev
  - name: sys
    hostPath:
      path: /sys
  - name: results
    hostPath:
      path: /results
  - name: lib
    hostPath:
      path: /lib/modules
EOF

Now let's generate a service account manifest to use in the default namespace.

$ cat <<EOF > default-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rdma
  namespace: default
EOF

Next we can create it on our cluster.

$ oc create -f default-serviceaccount.yaml
serviceaccount/rdma created

Finally, with the service account created, we can add privileges to it.

$ oc -n default adm policy add-scc-to-user privileged -z rdma
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "rdma"

With the service account defined and our pod YAML ready we can create the pod on the cluster.

$ oc create -f gds-check-workload.yaml
pod/gds-check-workload created
$ oc get pods
NAME                 READY   STATUS    RESTARTS   AGE
gds-check-workload   1/1     Running   0          3s

Once the pod is up and running we can rsh into the pod and run the gdscheck tool to confirm capabilities and configuration of GPU Direct Storage.

$ oc rsh gds-check-workload
sh-5.1# /usr/local/cuda/gds/tools/gdscheck -p
 GDS release version: 1.13.1.3
 nvidia_fs version:  2.20 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe P2PDMA        : Unsupported
 NVMe               : Supported
 NVMeOF             : Supported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Supported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_pci_p2pdma : false
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 64
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 fs.gpfs.gds_async_support: true
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA A40 bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
 ==============
 PLATFORM INFO:
 ==============
 Found ACS enabled for switch 0000:e0:01.0
 IOMMU: Pass-through or enabled
 WARN: GDS is not guaranteed to work functionally or in a performant way with iommu=on/pt
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed: 12040
 Platform: PowerEdge R760xa, Arch: x86_64(Linux 5.14.0-427.50.1.el9_4.x86_64)
 Platform verification succeeded
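The driver configuration part of the gdscheck output is the quickest place to confirm which storage paths GDS can use. As a minimal sketch, the hypothetical helper below lists every entry whose state is exactly Supported. It assumes one "name : state" entry per line, which is how gdscheck normally prints them.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Reads `gdscheck -p` output on stdin and prints the
# names of entries reported exactly as "Supported".
gds_supported_transports() {
  awk -F':' '
    {
      name = $1; state = $2
      gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim surrounding whitespace
      gsub(/^[ \t]+|[ \t]+$/, "", state)
      if (NF == 2 && state == "Supported") print name
    }
  '
}

# Usage inside the workload pod:
#   /usr/local/cuda/gds/tools/gdscheck -p | gds_supported_transports
```

For the run shown in this post the entries reported as Supported are NVMe, NVMeOF and NFS.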

Hopefully this provides enough detail to enable GPU Direct Storage on OpenShift. 

Wednesday, January 08, 2025

Build RDMA GPU-Tools Container

 


The purpose of this post is to build a container that automates building the tooling for validating RDMA connectivity and performance when used in conjunction with the NVIDIA Network Operator and NVIDIA GPU Operator. Specifically, I want to be able to use the ib_write_bw command with the --use_cuda switch to demonstrate RDMA from a GPU in one node to a GPU in another node of an OpenShift cluster. The ib_write_bw command is part of the perftest suite, a collection of tests written over uverbs and intended for use as a performance micro-benchmark. The tests may be used for HW or SW tuning as well as for functional testing.

The collection contains a set of bandwidth and latency benchmarks such as:

  • Send - ib_send_bw and ib_send_lat
  • RDMA Read - ib_read_bw and ib_read_lat
  • RDMA Write - ib_write_bw and ib_write_lat
  • RDMA Atomic - ib_atomic_bw and ib_atomic_lat
  • Native Ethernet (when working with MOFED2) - raw_ethernet_bw, raw_ethernet_lat

In previous blogs, here and here, I used a Fedora 35 container and manually added the components I wanted. Here we will instead provide the tooling to build a container that sets itself up when it is deployed. The workflow is as follows:

  • dockerfile.tools - which provides the content for the base image and the entrypoint.sh script.
  • entrypoint.sh - which provides the start-up script for the container to pull in the NVIDIA CUDA libraries and also build and deploy the perftest suite with the CUDA option available.
  • Additional RPMs - there are some packages that are not part of the UBI image repo but are dependencies for the CUDA toolkit.

The first thing we need to do is create a working directory for our files and an rpms directory for the rpms we will need for our base image. I am using root here but it could be a regular user as well.

$ mkdir -p /root/gpu-tools/rpms
$ cd /root/gpu-tools

Next we need to download the following rpms from Red Hat Package Downloads and place them into the rpms directory.

  • infiniband-diags-51.0-1.el9.x86_64.rpm
  • libglvnd-opengl-1.3.4-1.el9.x86_64.rpm
  • libibumad-51.0-1.el9.x86_64.rpm
  • librdmacm-51.0-1.el9.x86_64.rpm
  • libxcb-1.13.1-9.el9.x86_64.rpm
  • libxcb-devel-1.13.1-9.el9.x86_64.rpm
  • libxkbcommon-1.0.3-4.el9.x86_64.rpm
  • libxkbcommon-x11-1.0.3-4.el9.x86_64.rpm
  • pciutils-devel-3.7.0-5.el9.x86_64.rpm
  • rdma-core-devel-51.0-1.el9.x86_64.rpm
  • xcb-util-0.4.0-19.el9.x86_64.rpm
  • xcb-util-image-0.4.0-19.el9.x86_64.rpm
  • xcb-util-keysyms-0.4.0-17.el9.x86_64.rpm
  • xcb-util-renderutil-0.3.9-20.el9.x86_64.rpm
  • xcb-util-wm-0.4.1-22.el9.x86_64.rpm
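
Before building, it can help to confirm that every package in the list above actually landed in the rpms directory. The following is a minimal sketch, assuming the files keep their exact download names; the function name check_rpms is hypothetical and not part of the original workflow.

```shell
# Hypothetical helper: report which of the expected rpm files are missing.
# Usage: check_rpms <rpms-dir> <file> [<file> ...]
check_rpms() {
  dir="$1"
  shift
  missing=0
  for f in "$@"; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      missing=$((missing + 1))
    fi
  done
  echo "$missing missing"
  # Succeed only when nothing is missing.
  [ "$missing" -eq 0 ]
}

# Example (not run here):
# check_rpms /root/gpu-tools/rpms infiniband-diags-51.0-1.el9.x86_64.rpm libibumad-51.0-1.el9.x86_64.rpm
```

The function returns a non-zero status when anything is missing, so it can gate the build step in a script.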

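Once we have all our rpms for the base image we can move on to creating the dockerfile.tools file which we will use to build our image. The heredoc delimiter is quoted so that the shell writes the backticks into the file literally instead of running them.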

$ cat <<'EOF' >dockerfile.tools
# Start from UBI9 image
FROM registry.access.redhat.com/ubi9/ubi:latest
# Set work directory
WORKDIR /root
RUN mkdir /root/rpms
COPY ./rpms/*.rpm /root/rpms/
# DNF install packages either from repo or locally
RUN dnf install `ls -1 /root/rpms/*.rpm` -y
RUN dnf install wget procps-ng pciutils jq iputils ethtool net-tools git autoconf automake libtool -y
# Cleanup
WORKDIR /root
RUN dnf clean all
# Run container entrypoint
COPY entrypoint.sh /root/entrypoint.sh
RUN chmod +x /root/entrypoint.sh
ENTRYPOINT ["/root/entrypoint.sh"]
EOF

We also need to create the entrypoint.sh script which is referenced in the dockerfile and does the heavy lifting of pulling in the cuda toolkit and the perftest suite.

$ cat <<'EOF' > entrypoint.sh
#!/bin/bash
# Set working dir
cd /root
# Configure and install cuda-toolkit
dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
dnf clean all
dnf -y install cuda-toolkit-12-6
# Export CUDA library paths
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
# Git clone perftest repository
git clone https://github.com/linux-rdma/perftest.git
# Change into perftest directory
cd /root/perftest
# Build perftest with the cuda libraries included
./autogen.sh
./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h
make -j
make install
# Sleep container indefinitely
sleep infinity & wait
EOF

Next we can use the dockerfile we just created to build the base image.

$ podman build -f dockerfile.tools -t quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
STEP 1/10: FROM registry.access.redhat.com/ubi9/ubi:latest
STEP 2/10: WORKDIR /root
--> Using cache 75f163f12503272b83e1137f7c1903520f84493ffe5aec0ef32ece722bd0d815
--> 75f163f12503
STEP 3/10: RUN mkdir /root/rpms
--> Using cache ade32aa6605847a8b3f5c8b68cfcb64854dc01eece34868faab46137a60f946c
--> ade32aa66058
STEP 4/10: COPY ./rpms/*.rpm /root/rpms/
--> Using cache 59dcef81d6675f44d22900f13a3e5441f5073555d7d2faa0b2f261f32e4ba6cd
--> 59dcef81d667
STEP 5/10: RUN dnf install `ls -1 /root/rpms/*.rpm` -y
--> Using cache ebb8b3150056240378ac36f7aa41d7f13b13308e9353513f26a8d3d70e618e3b
--> ebb8b3150056
STEP 6/10: RUN dnf install wget procps-ng pciutils jq iputils ethtool net-tools git autoconf automake libtool -y
--> Using cache 5ca85080c103ba559994906ada0417102f54f22c182bbc3a06913109855278cc
--> 5ca85080c103
STEP 7/10: WORKDIR /root
--> Using cache 68c8cd47a41bc364a0da5790c90f9aee5f8a8c7807732f3a5138bff795834fc1
--> 68c8cd47a41b
STEP 8/10: RUN dnf clean all
Updating Subscription Management repositories.
Unable to read consumer identity
This system is not registered with an entitlement server. You can use subscription-manager to register.
26 files removed
--> a219fec5df49
STEP 9/10: COPY entrypoint.sh /root/entrypoint.sh
--> aeb03bf74673
STEP 10/10: ENTRYPOINT ["/bin/bash", "/root/entrypoint.sh"]
COMMIT quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
--> 45c2113e5082
Successfully tagged quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
45c2113e5082fb2f548b9e1b16c17524184c4079e2db77399519cf29829af1e7

Once the image is created we can push it to our favorite registry.

$ podman push quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
Getting image source signatures
Copying blob 62ee1c6c02d5 done |
Copying blob 6027214db22e done |
Copying blob 4822ebd5a418 done |
Copying blob 422a0e40f90b done |
Copying blob 5916e2b21ab2 done |
Copying blob 10bf375a4d78 done |
Copying blob ca1c18e183d5 done |
Copying config 3bbb6e1f9b done |
Writing manifest to image destination

Now that we have an image let's test it out on the cluster where we have compatible RDMA hardware configured. I am using the same setup as I used in a previous blog so I am going to skip the details about setting up a service account and providing the privileges to it. We will however create our workload pod yaml files which we will use to deploy the image.

$ cat <<EOF >rdma-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-32-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
    name: rdma-32-workload
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF
$ cat <<EOF >rdma-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-33-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
    name: rdma-33-workload
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF
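
The two manifests differ only in the pod name, the container name and the node hostname, so they could also be generated from a single template. The following is a minimal sketch of that idea; the function name render_pod and its two arguments are hypothetical, and every other value is copied from the manifests above.

```shell
# Hypothetical helper: print the workload pod manifest for one node.
# Usage: render_pod <short-id> <node-hostname>
render_pod() {
  id="$1"
  host="$2"
  # Unquoted heredoc on purpose: ${id} and ${host} must be expanded here.
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-${id}-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: ${host}
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
    name: rdma-${id}-workload
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF
}

# Example (not run here):
# render_pod 32 nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com > rdma-32-workload.yaml
```

Keeping one template means a change to the image tag or resource names only has to be made in one place.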

Next we can deploy the containers.

$ oc create -f rdma-32-workload.yaml
pod/rdma-eth-32-workload created
$ oc create -f rdma-33-workload.yaml
pod/rdma-eth-33-workload created

Validate the pods are up and running.

$ oc get pods
NAME                   READY   STATUS    RESTARTS   AGE
rdma-eth-32-workload   1/1     Running   0          51s
rdma-eth-33-workload   1/1     Running   0          47s
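
Because entrypoint.sh installs the CUDA toolkit and compiles perftest when the container starts, a pod can report Running before ib_write_bw exists. The following is a minimal sketch of a polling loop for that case; the function name wait_until and the attempt and delay values in the example are assumptions, not part of the original setup.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Run the remaining arguments as a command, discarding its output.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (not run here): poll every 10 seconds, up to 60 times, until
# ib_write_bw is on the PATH inside the pod.
# wait_until 60 10 oc exec rdma-eth-32-workload -- sh -c 'command -v ib_write_bw'
```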

Now open two terminals, rsh into one pod in each, and validate that the perftest commands are present. We can also get the IP address of each pod from inside its container.

$ oc rsh rdma-eth-32-workload sh-5.1# ib ib_atomic_bw ib_read_lat ib_write_bw ibcacheedit ibfindnodesusing.pl iblinkinfo ibping ibroute ibstatus ibtracert ib_atomic_lat ib_send_bw ib_write_lat ibccconfig ibhosts ibnetdiscover ibportstate ibrouters ibswitches ib_read_bw ib_send_lat ibaddr ibccquery ibidsverify.pl ibnodes ibqueryerrors ibstat ibsysstat sh-5.1# ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0@if96: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default link/ether 0a:58:0a:83:00:34 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.131.0.52/23 brd 10.131.1.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe83:34/64 scope link valid_lft forever preferred_lft forever 3: net1@if78: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default link/ether 32:1a:83:4a:e2:39 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 192.168.2.1/24 brd 192.168.2.255 scope global net1 valid_lft forever preferred_lft forever inet6 fe80::301a:83ff:fe4a:e239/64 scope link valid_lft forever preferred_lft forever $ oc rsh rdma-eth-33-workload sh-5.1# ib ib_atomic_bw ib_read_lat ib_write_bw ibcacheedit ibfindnodesusing.pl iblinkinfo ibping ibroute ibstatus ibtracert ib_atomic_lat ib_send_bw ib_write_lat ibccconfig ibhosts ibnetdiscover ibportstate ibrouters ibswitches ib_read_bw ib_send_lat ibaddr ibccquery ibidsverify.pl ibnodes ibqueryerrors ibstat ibsysstat sh-5.1# ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0@if105: 
<BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default link/ether 0a:58:0a:80:02:3d brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.128.2.61/23 brd 10.128.3.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe80:23d/64 scope link valid_lft forever preferred_lft forever 3: net1@if82: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default link/ether 22:3e:02:c9:d0:87 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 192.168.2.2/24 brd 192.168.2.255 scope global net1 valid_lft forever preferred_lft forever inet6 fe80::203e:2ff:fec9:d087/64 scope link valid_lft forever preferred_lft forever
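
The address on net1 is what the perftest commands need, so it can be convenient to pull it out of the ip output rather than read it by eye. The following is a minimal sketch, assuming the one-line format printed by ip -4 -o addr show; the function name iface_ipv4 is hypothetical.

```shell
# Hypothetical helper: read `ip -4 -o addr show` output on stdin and print
# the IPv4 address (without the prefix length) of the named interface.
# Usage: ip -4 -o addr show | iface_ipv4 <interface>
iface_ipv4() {
  awk -v dev="$1" '$2 == dev { split($4, a, "/"); print a[1]; exit }'
}

# Example (not run here):
# oc exec rdma-eth-32-workload -- ip -4 -o addr show | iface_ipv4 net1
```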

Now let's run the RDMA perftest with the --use_cuda switch. Again we will need two rsh sessions, one on each pod. In the first terminal we start the server side, which waits for a client to connect.

sh-5.1# ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 192.168.2.1 --use_cuda=0
WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
************************************
* Waiting for client to connect... *
************************************
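
The server and client invocations share every flag except the source address, and the client additionally names the server as its last argument. The following is a minimal sketch that keeps the shared flags in one place; the function name perftest_cmd is hypothetical, and the flag values are simply the ones used in this post.

```shell
# Hypothetical helper: print the ib_write_bw command line used in this post.
# With one argument it prints the server form; a second argument (the
# server's address) turns it into the client form.
# Usage: perftest_cmd <source-ip> [<server-ip>]
perftest_cmd() {
  src="$1"
  server="${2:-}"
  # -R: rdma_cm QPs, -T 41: TOS, -x 3: GID index, -q 16: number of QPs,
  # -d mlx5_1: device, --use_cuda=0: use CUDA device 0 for the buffer.
  echo "ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip $src --use_cuda=0${server:+ $server}"
}

# Example (not run here):
# perftest_cmd 192.168.2.1               # server side
# perftest_cmd 192.168.2.2 192.168.2.1   # client side
```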

In the second terminal we will run the following command which will dump the output.

sh-5.1# ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 192.168.2.2 --use_cuda=0 192.168.2.1 WARNING: BW peak won't be measured in this run. Perftest doesn't supports CUDA tests with inline messages: inline size set to 0 Requested mtu is higher than active mtu Changing to active mtu - 3 initializing CUDA Listing all CUDA devices in system: CUDA device 0: PCIe address is E1:00 Picking device No. 0 [pid = 4101, dev = 0] device name = [NVIDIA A40] creating CUDA Ctx making it the current CUDA Ctx CUDA device integrated: 0 cuMemAlloc() of a 2097152 bytes GPU buffer allocated GPU buffer address at 00007f3dfa600000 pointer=0x7f3dfa600000 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_1 Number of qps : 16 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON Lock-free : OFF ibv_wr* API : ON Using DDP : OFF TX depth : 128 CQ Moderation : 1 Mtu : 1024[B] Link type : Ethernet GID index : 3 Max inline data : 0[B] rdma_cm QPs : ON Data ex. 
method : rdma_cm TOS : 41 --------------------------------------------------------------------------------------- local address: LID 0000 QPN 0x00c6 PSN 0x2986aa GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00c7 PSN 0xa0ef83 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00c8 PSN 0x74badb GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00c9 PSN 0x287d57 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00ca PSN 0xf5b155 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00cb PSN 0x6cc15d GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00cc PSN 0x3730c2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00cd PSN 0x74d75d GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00ce PSN 0x51a707 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00cf PSN 0x987246 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d0 PSN 0xa334a8 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d1 PSN 0x5d8f52 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d2 PSN 0xc42ca0 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d3 PSN 0xf43696 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d4 PSN 0x43f9d2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d5 PSN 0xbc4d64 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00c6 PSN 0xb1023e GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00c7 PSN 0xc78587 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote 
address: LID 0000 QPN 0x00c8 PSN 0x5a328f GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00c9 PSN 0x582cfb GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00cb PSN 0x40d229 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00cc PSN 0x5833a1 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00cd PSN 0xcfefb6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00ce PSN 0xfd5d41 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00cf PSN 0xed811b GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d0 PSN 0x5244ca GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d1 PSN 0x946edc GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d2 PSN 0x4e0f76 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d3 PSN 0x7b13f4 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d4 PSN 0x1a2d5a GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d5 PSN 0xd22346 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d6 PSN 0x722bc8 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 65536 10384867 0.00 181.46 0.346100 --------------------------------------------------------------------------------------- deallocating GPU buffer 00007f3dfa600000 destroying current CUDA Ctx

And if we return to the first terminal we should see it updated with the same output.

sh-5.1# ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 192.168.2.1 --use_cuda=0 WARNING: BW peak won't be measured in this run. Perftest doesn't supports CUDA tests with inline messages: inline size set to 0 ************************************ * Waiting for client to connect... * ************************************ Requested mtu is higher than active mtu Changing to active mtu - 3 initializing CUDA Listing all CUDA devices in system: CUDA device 0: PCIe address is 61:00 Picking device No. 0 [pid = 4109, dev = 0] device name = [NVIDIA A40] creating CUDA Ctx making it the current CUDA Ctx CUDA device integrated: 0 cuMemAlloc() of a 2097152 bytes GPU buffer allocated GPU buffer address at 00007f8bca600000 pointer=0x7f8bca600000 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_1 Number of qps : 16 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON Lock-free : OFF ibv_wr* API : ON Using DDP : OFF CQ Moderation : 1 Mtu : 1024[B] Link type : Ethernet GID index : 3 Max inline data : 0[B] rdma_cm QPs : ON Data ex. 
method : rdma_cm TOS : 41 --------------------------------------------------------------------------------------- Waiting for client rdma_cm QP to connect Please run the same command with the IB/RoCE interface IP --------------------------------------------------------------------------------------- local address: LID 0000 QPN 0x00c6 PSN 0xb1023e GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00c7 PSN 0xc78587 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00c8 PSN 0x5a328f GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00c9 PSN 0x582cfb GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00cb PSN 0x40d229 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00cc PSN 0x5833a1 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00cd PSN 0xcfefb6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00ce PSN 0xfd5d41 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00cf PSN 0xed811b GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d0 PSN 0x5244ca GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d1 PSN 0x946edc GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d2 PSN 0x4e0f76 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d3 PSN 0x7b13f4 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d4 PSN 0x1a2d5a GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d5 PSN 0xd22346 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d6 PSN 0x722bc8 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 
0x00c6 PSN 0x2986aa GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00c7 PSN 0xa0ef83 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00c8 PSN 0x74badb GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00c9 PSN 0x287d57 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00ca PSN 0xf5b155 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00cb PSN 0x6cc15d GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00cc PSN 0x3730c2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00cd PSN 0x74d75d GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00ce PSN 0x51a707 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00cf PSN 0x987246 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d0 PSN 0xa334a8 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d1 PSN 0x5d8f52 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d2 PSN 0xc42ca0 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d3 PSN 0xf43696 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d4 PSN 0x43f9d2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d5 PSN 0xbc4d64 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 65536 10384867 0.00 181.46 0.346100 --------------------------------------------------------------------------------------- deallocating GPU buffer 00007f8bca600000 destroying current CUDA Ctx
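
For repeated runs it can be handy to extract just the average bandwidth from the results table instead of scrolling through the queue pair listing. The following is a minimal sketch, assuming the table layout shown above, where the data row follows the #bytes header and the average is the fourth column; the function name bw_average is hypothetical.

```shell
# Hypothetical helper: read ib_write_bw output on stdin and print the
# "BW average" value from the first data row after the #bytes header.
# Usage: ib_write_bw ... | bw_average
bw_average() {
  awk '
    /^ *#bytes/ { seen_header = 1; next }
    seen_header && NF >= 4 && $1 ~ /^[0-9]+$/ { print $4; exit }
  '
}

# Example (not run here), using a saved log of the client output:
# bw_average < client.log
```

Because --report_gbits is set in the commands above, the printed number is in Gb/sec.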

Hopefully this helped demonstrate a much cleaner and more automated way to build a perftest container with CUDA enabled to perform RDMA testing on OpenShift with the NVIDIA Network Operator and NVIDIA GPU Operator.