SCHMAUSTECH: ai

Tuesday, February 24, 2026

OpenShift Network Card Rail Mapping

The goal of this writeup is to provide a simple mechanism to map which GPUs are associated with which NICs on the same PCIe switch inside a physical system. This mapping can then be used to generate an OpenShift MachineConfig that identifies one network card per GPU on the same PCIe root complex and persistently names that network device as a numbered rail (for example eth_rail0), while marking any others as secondary. This is primarily for NVIDIA's Spectrum-X stack but could be used on any platform where GPU-to-NIC alignment matters for OpenShift configuration.

Why?

For optimal cluster performance and minimal latency, it’s essential to align each GPU with its nearest high-speed network card, ideally on the same NUMA node and PCIe root complex. This ensures that data traveling to and from each GPU takes the shortest, most efficient path, which is especially critical for GPUDirect RDMA and high-throughput AI/HPC workloads.

While there are tools that can provide pieces of this view, all of the commands have to be run manually, and then it's up to the user to fit it all together. Ideally there should be one solution that provides all the details in a concise manner.
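
As an illustration of one of those manual pieces, the kernel exposes the NUMA node of every PCI device in sysfs. The following is a minimal sketch, not part of the gpu-nic-rail-mapping script; the function name pci_numa_node and its optional second argument (an alternate sysfs root, handy for trying it against a copy) are my own assumptions.

```shell
# Print the NUMA node the kernel reports for a PCI device.
# $1 = full PCI address (for example 0000:1b:00.0)
# $2 = sysfs root, defaults to /sys (hypothetical convenience argument)
# The kernel reports -1 when it has no NUMA information for the device.
pci_numa_node() {
    if [ -r "${2:-/sys}/bus/pci/devices/$1/numa_node" ]; then
        cat "${2:-/sys}/bus/pci/devices/$1/numa_node"
    else
        echo "unknown"
        return 1
    fi
}
```

Running pci_numa_node 0000:1b:00.0 and pci_numa_node 0000:18:00.0 on a node should print the same number when the GPU and NIC share a NUMA node. This still has to be repeated for every device and combined with the PCIe topology by hand.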

Hwloc

The Portable Hardware Locality (hwloc) software package provides a portable abstraction of the hierarchical topology of modern architectures, including NUMA memory nodes (DRAM, HBM, non-volatile memory, CXL, etc.), processor packages, shared caches, cores and simultaneous multithreading. It also gathers various system attributes such as cache and memory information as well as the locality of I/O devices such as network interfaces, InfiniBand HCAs or GPUs. A sample image that it can generate is shown below.

Hwloc primarily aims at helping applications with gathering information about increasingly complex parallel computing platforms so as to exploit them accordingly and efficiently. For instance, two tasks that tightly cooperate should probably be placed onto cores sharing a cache. However, two independent memory-intensive tasks should better be spread out onto different processor packages so as to maximize their memory throughput.

However, hwloc does not ship in OpenShift today. Further, it does not generate udev rules or MachineConfigs, and it seems heavy-handed for the task at hand.

Rail Mappings

The gpu-nic-rail-mapping script aims to provide a simple way to identify the GPU-to-NIC relationship and then generate the MachineConfig for OpenShift so that exactly one rail is marked per GPU. Below is an example run on a Dell 9680 (H200) system with the following devices in it:

8 x H200 GPUs - Device ID 10de:2335
14 x BF3 Cards - Device ID 15b3:a2dc

sh-5.1# ./gpu-nic-rail-mapping -g 10de:2335 -n 15b3:a2dc -u 70-persistent-net.rules -r worker

 GPU BusAddr   NIC BusAddr PCIe Switch             NIC Slot    NIC Port   UDEV Eth    UDEV IB
====================================================================================================
 1b:00.0       18:00.0     15:01.0/16:00.0         40          1          eth_rail0   roce_rail0             
 1b:00.0       1a:00.0     15:01.0/16:00.0         42          1           eth_sec0    roce_sec0             
 3c:00.0       3a:00.0     37:01.0/38:00.0         41          1          eth_rail1   roce_rail1             
 4b:00.0       4d:00.0     48:01.0/49:00.0         38          1          eth_rail2   roce_rail2             
 5c:00.0       5d:00.0     59:01.0/5a:00.0         37          1          eth_rail3   roce_rail3             
 5c:00.0       5f:00.0     59:01.0/5a:00.0         39          1           eth_sec1    roce_sec1             
 5c:00.0       5f:00.1     59:01.0/5a:00.0         39          2           eth_sec2    roce_sec2             
 9a:00.0       9b:00.0     97:01.0/98:00.0         32          1          eth_rail4   roce_rail4             
 bb:00.0       ba:00.0     b7:01.0/b8:00.0         31          1          eth_rail5   roce_rail5             
 bb:00.0       bc:00.0     b7:01.0/b8:00.0         33          1           eth_sec3    roce_sec3             
 bb:00.0       bc:00.1     b7:01.0/b8:00.0         33          2           eth_sec4    roce_sec4             
 cd:00.0       ca:00.0     c7:01.0/c8:00.0         36          1          eth_rail6   roce_rail6             
 cd:00.0       cc:00.0     c7:01.0/c8:00.0         34          1           eth_sec5    roce_sec5             
 dc:00.0       db:00.0     d7:01.0/d8:00.0         35          1          eth_rail7   roce_rail7             
Generated 99-machine-config-udev-network.yaml file for OpenShift
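
The table above pairs each GPU with the NICs that sit behind the same PCIe switch. One way to derive such a pairing is to compare the resolved sysfs paths of two devices and take the deepest PCI device they share. The sketch below shows that idea only; it is not necessarily how gpu-nic-rail-mapping does it, the name common_upstream is hypothetical, and the function cannot tell a switch port from a plain root port.

```shell
# Print the deepest PCI device shared by two resolved sysfs device paths.
# Each argument is a string such as the output of:
#   readlink -f /sys/bus/pci/devices/0000:1b:00.0
# Returns non-zero (and prints nothing) when the paths share no PCI device.
common_upstream() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, pa, "/")
        nb = split(b, pb, "/")
        last = ""
        for (i = 1; i <= na && i <= nb; i++) {
            if (pa[i] != pb[i]) break
            # Only remember components that look like domain:bus:device.function
            if (pa[i] ~ /^[0-9a-fA-F]+:[0-9a-fA-F]+:[0-9a-fA-F]+\.[0-7]$/) last = pa[i]
        }
        if (last == "") exit 1
        print last
    }'
}
```

On a node this could be called as common_upstream "$(readlink -f /sys/bus/pci/devices/0000:1b:00.0)" "$(readlink -f /sys/bus/pci/devices/0000:18:00.0)". A GPU and NIC whose paths diverge directly under the PCI root yield no result and so are not on the same switch.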

Here is the 70-persistent-net.rules file that was generated.

sh-5.1# cat 70-persistent-net.rules 
ACTION=="add", KERNELS=="0000:18:00.0", SUBSYSTEM=="net", NAME="eth_rail0"
ACTION=="add", KERNELS=="0000:18:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail0"
ACTION=="add", KERNELS=="0000:1a:00.0", SUBSYSTEM=="net", NAME="eth_sec0"
ACTION=="add", KERNELS=="0000:1a:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec0"
ACTION=="add", KERNELS=="0000:3a:00.0", SUBSYSTEM=="net", NAME="eth_rail1"
ACTION=="add", KERNELS=="0000:3a:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail1"
ACTION=="add", KERNELS=="0000:4d:00.0", SUBSYSTEM=="net", NAME="eth_rail2"
ACTION=="add", KERNELS=="0000:4d:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail2"
ACTION=="add", KERNELS=="0000:5d:00.0", SUBSYSTEM=="net", NAME="eth_rail3"
ACTION=="add", KERNELS=="0000:5d:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail3"
ACTION=="add", KERNELS=="0000:5f:00.0", SUBSYSTEM=="net", NAME="eth_sec1"
ACTION=="add", KERNELS=="0000:5f:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec1"
ACTION=="add", KERNELS=="0000:5f:00.1", SUBSYSTEM=="net", NAME="eth_sec2"
ACTION=="add", KERNELS=="0000:5f:00.1", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec2"
ACTION=="add", KERNELS=="0000:9b:00.0", SUBSYSTEM=="net", NAME="eth_rail4"
ACTION=="add", KERNELS=="0000:9b:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail4"
ACTION=="add", KERNELS=="0000:ba:00.0", SUBSYSTEM=="net", NAME="eth_rail5"
ACTION=="add", KERNELS=="0000:ba:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail5"
ACTION=="add", KERNELS=="0000:bc:00.0", SUBSYSTEM=="net", NAME="eth_sec3"
ACTION=="add", KERNELS=="0000:bc:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec3"
ACTION=="add", KERNELS=="0000:bc:00.1", SUBSYSTEM=="net", NAME="eth_sec4"
ACTION=="add", KERNELS=="0000:bc:00.1", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec4"
ACTION=="add", KERNELS=="0000:ca:00.0", SUBSYSTEM=="net", NAME="eth_rail6"
ACTION=="add", KERNELS=="0000:ca:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail6"
ACTION=="add", KERNELS=="0000:cc:00.0", SUBSYSTEM=="net", NAME="eth_sec5"
ACTION=="add", KERNELS=="0000:cc:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec5"
ACTION=="add", KERNELS=="0000:db:00.0", SUBSYSTEM=="net", NAME="eth_rail7"
ACTION=="add", KERNELS=="0000:db:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail7"
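
The naming pattern in these rules can be reproduced with a few lines of awk. In the sketch below the first NIC listed for each GPU becomes a rail and any further NICs for that GPU become secondary devices, which matches the table above. The actual selection logic in the script may differ, and the function name pairs_to_udev_rules, the input format, and the fixed 0000 PCI domain are my own assumptions.

```shell
# Read "GPU_ADDR NIC_ADDR" pairs on stdin (short bus addresses, grouped by GPU)
# and print udev rules shaped like the generated 70-persistent-net.rules file.
# Assumption: every device lives in PCI domain 0000.
pairs_to_udev_rules() {
    awk '
    BEGIN { rail = 0; sec = 0 }
    NF < 2 { next }
    {
        if (!($1 in seen)) { seen[$1] = 1; suffix = "rail" rail; rail++ }
        else               { suffix = "sec" sec; sec++ }
        printf "ACTION==\"add\", KERNELS==\"0000:%s\", SUBSYSTEM==\"net\", NAME=\"eth_%s\"\n", $2, suffix
        printf "ACTION==\"add\", KERNELS==\"0000:%s\", SUBSYSTEM==\"infiniband\", PROGRAM=\"rdma_rename %%k NAME_FIXED roce_%s\"\n", $2, suffix
    }'
}
```

Feeding it the lines "1b:00.0 18:00.0" and "1b:00.0 1a:00.0" yields the eth_rail0/roce_rail0 rules for 18:00.0 followed by the eth_sec0/roce_sec0 rules for 1a:00.0.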

And finally the OpenShift MachineConfig 99-machine-config-udev-network.yaml for the udev rule naming.

sh-5.1# cat 99-machine-config-udev-network.yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
   labels:
     machineconfiguration.openshift.io/role: worker
   name: 99-machine-config-udev-network
spec:
   config:
     ignition:
       version: 3.2.0
     storage:
       files:
       - contents:
           source: data:text/plain;charset=utf-8;base64,QUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6MTg6MDAuMCIsIFNVQlNZU1RFTT09Im5ldCIsIE5BTUU9ImV0aF9yYWlsMCIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6MTg6MDAuMCIsIFNVQlNZU1RFTT09ImluZmluaWJhbmQiLCBQUk9HUkFNPSJyZG1hX3JlbmFtZSAlayBOQU1FX0ZJWEVEIHJvY2VfcmFpbDAiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjFhOjAwLjAiLCBTVUJTWVNURU09PSJuZXQiLCBOQU1FPSJldGhfc2VjMCIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6MWE6MDAuMCIsIFNVQlNZU1RFTT09ImluZmluaWJhbmQiLCBQUk9HUkFNPSJyZG1hX3JlbmFtZSAlayBOQU1FX0ZJWEVEIHJvY2Vfc2VjMCIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6M2E6MDAuMCIsIFNVQlNZU1RFTT09Im5ldCIsIE5BTUU9ImV0aF9yYWlsMSIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6M2E6MDAuMCIsIFNVQlNZU1RFTT09ImluZmluaWJhbmQiLCBQUk9HUkFNPSJyZG1hX3JlbmFtZSAlayBOQU1FX0ZJWEVEIHJvY2VfcmFpbDEiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjRkOjAwLjAiLCBTVUJTWVNURU09PSJuZXQiLCBOQU1FPSJldGhfcmFpbDIiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjRkOjAwLjAiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3JhaWwyIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDo1ZDowMC4wIiwgU1VCU1lTVEVNPT0ibmV0IiwgTkFNRT0iZXRoX3JhaWwzIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDo1ZDowMC4wIiwgU1VCU1lTVEVNPT0iaW5maW5pYmFuZCIsIFBST0dSQU09InJkbWFfcmVuYW1lICVrIE5BTUVfRklYRUQgcm9jZV9yYWlsMyIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6NWY6MDAuMCIsIFNVQlNZU1RFTT09Im5ldCIsIE5BTUU9ImV0aF9zZWMxIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDo1ZjowMC4wIiwgU1VCU1lTVEVNPT0iaW5maW5pYmFuZCIsIFBST0dSQU09InJkbWFfcmVuYW1lICVrIE5BTUVfRklYRUQgcm9jZV9zZWMxIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDo1ZjowMC4xIiwgU1VCU1lTVEVNPT0ibmV0IiwgTkFNRT0iZXRoX3NlYzIiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjVmOjAwLjEiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3NlYzIiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjliOjAwLjAiLCBTVUJTWVNURU09PSJuZXQiLCBOQU1FPSJldGhfcmFpbDQiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjliOjAwLjAiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9y
ZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3JhaWw0IgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDpiYTowMC4wIiwgU1VCU1lTVEVNPT0ibmV0IiwgTkFNRT0iZXRoX3JhaWw1IgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDpiYTowMC4wIiwgU1VCU1lTVEVNPT0iaW5maW5pYmFuZCIsIFBST0dSQU09InJkbWFfcmVuYW1lICVrIE5BTUVfRklYRUQgcm9jZV9yYWlsNSIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6YmM6MDAuMCIsIFNVQlNZU1RFTT09Im5ldCIsIE5BTUU9ImV0aF9zZWMzIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDpiYzowMC4wIiwgU1VCU1lTVEVNPT0iaW5maW5pYmFuZCIsIFBST0dSQU09InJkbWFfcmVuYW1lICVrIE5BTUVfRklYRUQgcm9jZV9zZWMzIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDpiYzowMC4xIiwgU1VCU1lTVEVNPT0ibmV0IiwgTkFNRT0iZXRoX3NlYzQiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmJjOjAwLjEiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3NlYzQiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmNhOjAwLjAiLCBTVUJTWVNURU09PSJuZXQiLCBOQU1FPSJldGhfcmFpbDYiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmNhOjAwLjAiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3JhaWw2IgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDpjYzowMC4wIiwgU1VCU1lTVEVNPT0ibmV0IiwgTkFNRT0iZXRoX3NlYzUiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmNjOjAwLjAiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3NlYzUiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmRiOjAwLjAiLCBTVUJTWVNURU09PSJuZXQiLCBOQU1FPSJldGhfcmFpbDciCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmRiOjAwLjAiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3JhaWw3Igo=
         filesystem: root
         mode: 420
         path: /etc/udev/rules.d/70-persistent-net.rules

The above MachineConfig can now be applied to the worker nodes of an OpenShift cluster of homogeneous nodes to persistently name the rail devices mapped to the corresponding GPUs.
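
For anyone who wants to build or inspect such a MachineConfig by hand, the payload is simply the rules file base64-encoded into a data URL. The following is a minimal sketch under that assumption; the function name render_machineconfig is hypothetical, and it leaves out the filesystem field, which as far as I know belongs to the older Ignition 2.x file schema rather than 3.x.

```shell
# Print a MachineConfig that embeds a udev rules file as a base64 data URL.
# $1 = path to the rules file, $2 = MachineConfig role label (for example worker)
render_machineconfig() {
    cat <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: $2
  name: 99-machine-config-udev-network
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,$(base64 < "$1" | tr -d '\n')
        mode: 420
        path: /etc/udev/rules.d/70-persistent-net.rules
EOF
}
```

A call such as render_machineconfig 70-persistent-net.rules worker > 99-machine-config-udev-network.yaml writes the file, which can then be created with oc create -f. To double-check an existing MachineConfig, the text after "base64," can be piped through base64 -d and compared with the rules file.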

In this next example we tried the script on a Supermicro AMD Instinct system which had the following devices in it:

8 x MI325X - Device ID 1002:74a5
7 x AMD Pensando Systems POLLARA-1Q400 100/200/400G 1-port Card - Device ID 1dd8:1002
1 x NVIDIA ConnectX-7 - Device ID 15b3:1021

This system was interesting because it had multiple network card types associated with GPUs, which allowed us to test the script's behavior in that scenario. One caveat on this system was that dmidecode and lspci both failed to show the physical slot number for the Pollara cards, which is why the NIC Slot column reads NA for them, while the CX7 card showed its physical slot just fine.

# ./gpu-nic-rail-mapping -g 1002:74a5 -n 1dd8:1002,15b3:1021 -u 70-persistent-net.rules -r worker

 GPU BusAddr   NIC BusAddr PCIe Switch             NIC Slot    NIC Port   UDEV Eth    UDEV IB
====================================================================================================
 05:00.0       09:00.0     00:01.1/01:00.0         NA          1          eth_rail0   roce_rail0             
 15:00.0       19:00.0     10:01.1/11:00.0         NA          1          eth_rail1   roce_rail1             
 65:00.0       69:00.0     60:01.1/61:00.0         NA          1          eth_rail2   roce_rail2             
 75:00.0       79:00.0     70:01.1/71:00.0         NA          1          eth_rail3   roce_rail3             
 85:00.0       89:00.0     80:01.1/81:00.0         NA          1          eth_rail4   roce_rail4             
 95:00.0       99:00.0     90:01.1/91:00.0         NA          1          eth_rail5   roce_rail5             
 e5:00.0       e6:00.0     e0:01.1/e1:00.0          1          1          eth_rail6   roce_rail6             
 f5:00.0       f9:00.0     f0:01.1/f1:00.0         NA          1          eth_rail7   roce_rail7             
Generated 99-machine-config-udev-network.yaml file for OpenShift

A 70-persistent-net.rules file and a 99-machine-config-udev-network.yaml MachineConfig were generated here as well, but they are omitted because they look very much like those in the H200 example.
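
The -n option in the run above takes a comma-separated list of vendor:device IDs. Selecting devices by such a list is straightforward to sketch against numeric lspci output (lspci -Dn prints the address, the class, and then the vendor:device ID). The function name filter_pci_ids is my own, and this is only an illustration of the idea rather than the script's implementation.

```shell
# Read `lspci -Dn`-style lines on stdin and print the address of every device
# whose vendor:device ID appears in the comma-separated list passed as $1.
filter_pci_ids() {
    awk -v ids="$1" '
    BEGIN {
        n = split(tolower(ids), want, ",")
        for (i = 1; i <= n; i++) ok[want[i]] = 1
    }
    {
        id = tolower($3)          # third field is vendor:device
        if (id in ok) print $1    # first field is the PCI address
    }'
}
```

On a node, lspci -Dn | filter_pci_ids 1dd8:1002,15b3:1021 would list both the Pollara and ConnectX-7 addresses in one pass.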

The overall idea here was to automate the otherwise tedious task of identifying and mapping GPUs and network devices that share the same PCIe root complex. Hopefully this provided a simple example of how to accomplish that. For those interested in seeing the script the repository is here.

Tuesday, April 01, 2025

NVIDIA GPU Direct Storage on OpenShift

Welcome to the NVIDIA GPU Direct Storage on OpenShift workflow. The goal of this workflow is to understand and configure NVIDIA GPU Direct Storage for NVMe devices in worker nodes of an OpenShift cluster.

What Is NVIDIA GPU Direct Storage?

GPU Direct Storage enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU. Using this direct path can relieve system bandwidth bottlenecks and decrease the latency and utilization load on the CPU.

Assumptions

This document assumes that we have already deployed an OpenShift cluster and have installed the operators required for GPU Direct Storage: Node Feature Discovery (which should also be configured), the base installation of the NVIDIA Network Operator (no NicClusterPolicy yet), and the NVIDIA GPU Operator (no ClusterPolicy yet).

Considerations

If any of the NVMe devices in the system are used by the operating system or by other services (machine configs for LVM or other customized access), the NVMe kernel modules will not be able to unload properly, even with the workaround defined in this documentation. Any use of GDS requires that the NVMe drives are not in use during the deployment of the Network Operator, so that the Network Operator can unload the in-tree drivers and then load NVIDIA's out-of-tree drivers in their place.

NVIDIA Network Operator Configuration

We assume the Network Operator has already been installed on the cluster but the NicClusterPolicy still needs to be created. The following NicClusterPolicy example provides the configuration needed to ensure RDMA is properly loaded for NVMe. The key option in this policy is the ENABLE_NFSRDMA variable, which must be set to true. I want to note that this policy also optionally configures an rdmaSharedDevicePlugin and sets ENTRYPOINT_DEBUG to true for more verbose logging.

$ cat <<EOF > network-sharedrdma-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  nicFeatureDiscovery:
    image: nic-feature-discovery
    repository: ghcr.io/mellanox
    version: v0.0.1
  docaTelemetryService:
    image: doca_telemetry
    repository: nvcr.io/nvidia/doca
    version: 1.16.5-doca2.6.0-host
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ens1f0np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: 'sha256:9f468fdc4449e65e4772575f83aa85840a00f97165f9a00ba34695c91d610fbd'
  secondaryNetwork:
    ipoib:
      image: ipoib-cni
      repository: ghcr.io/mellanox
      version: v1.2.0
  nvIpam:
    enableWebhook: false
    image: nvidia-k8s-ipam
    repository: ghcr.io/mellanox
    version: v0.2.0
  ofedDriver:
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    forcePrecompiled: false
    terminationGracePeriodSeconds: 300
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
        podSelector: ''
      maxParallelUpgrades: 1
      safeLoad: false
      waitForCompletion:
        timeoutSeconds: 0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.01-0.6.0.0-0
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
    - name: ENABLE_NFSRDMA
      value: "true"
    - name: ENTRYPOINT_DEBUG
      value: 'true'
EOF

Before creating the NicClusterPolicy on the cluster we need to prepare a script which will allow us to work around an issue with GPU Direct Storage in the NVIDIA Network Operator. When run right after creating the NicClusterPolicy, this script determines which nodes have mofed pods running on them and then, for each node in that list, connects over SSH as the core user and unloads the following modules: nvme, nvme_tcp, nvme_fabrics, nvme_core. Unloading the modules while the mofed container is busy building the DOCA drivers avoids a failure that otherwise occurs when the mofed container goes to load the compiled DOCA drivers. This issue is being investigated by NVIDIA.

$ cat <<'EOF' > nvme-fixer.sh
#!/bin/bash

### Set array of modules to be unloaded (dependent modules first, nvme_core last)
declare -a modarr=("nvme" "nvme_tcp" "nvme_fabrics" "nvme_core")

### Determine which hosts have a mofed container running on them
declare -a hostarr=($(oc get pods -n nvidia-network-operator -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName --no-headers | grep mofed | awk '{print $2}'))

### Iterate through modules on each host and unload them
for host in "${hostarr[@]}"
do
    echo "Unloading nvme dependencies on $host..."
    for module in "${modarr[@]}"
    do
       echo "Unloading module $module..."
       ssh "core@$host" sudo rmmod "$module"
    done
done
EOF

Set the execute bit on the script.

$ chmod +x nvme-fixer.sh

Now we are ready to create the NicClusterPolicy on the cluster and follow it up by running the nvme-fixer.sh script. If there are any rmmod "not currently loaded" errors, they can safely be ignored because the module was not loaded to begin with. In the example below we had two worker nodes that had mofed pods running on them, so the script went ahead and unloaded the nvme modules.

$ oc create -f network-sharedrdma-nic-cluster-policy.yaml 
nicclusterpolicy.mellanox.com/nic-cluster-policy created

$ ./nvme-fixer.sh 
Unloading nvme dependencies on nvd-srv-22.nvidia.eng.rdu2.dc.redhat.com...
Unloading module nvme...
Unloading module nvme_tcp...
rmmod: ERROR: Module nvme_tcp is not currently loaded
Unloading module nvme_fabrics...
rmmod: ERROR: Module nvme_fabrics is not currently loaded
Unloading module nvme_core...
Unloading nvme dependencies on nvd-srv-23.nvidia.eng.rdu2.dc.redhat.com...
Unloading module nvme...
Unloading module nvme_tcp...
Unloading module nvme_fabrics...
Unloading module nvme_core...
$
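
If the "not currently loaded" noise is unwanted, the module list can be checked before calling rmmod. The helper below is a small sketch of that check and is not part of nvme-fixer.sh; the name module_loaded and the optional file argument are assumptions made for illustration.

```shell
# Succeed if the named kernel module appears in a /proc/modules-style listing.
# $1 = module name, $2 = file to read (defaults to /proc/modules)
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}
```

On a node this could be used as module_loaded nvme_tcp && sudo rmmod nvme_tcp, so rmmod only runs for modules that are actually present.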

Now we wait for the mofed pods to finish compiling and installing the GPU Direct Storage modules. We will know it's complete when the pods are in a Running state like below:

$ oc get pods -n nvidia-network-operator
NAME                                                          READY   STATUS    RESTARTS       AGE
kube-ipoib-cni-ds-5f8wk                                       1/1     Running   0              38s
kube-ipoib-cni-ds-956nv                                       1/1     Running   0              38s
kube-ipoib-cni-ds-jpbph                                       1/1     Running   0              38s
kube-ipoib-cni-ds-jwtw2                                       1/1     Running   0              38s
kube-ipoib-cni-ds-v4sb8                                       1/1     Running   0              38s
mofed-rhcos4.17-69fb4cd685-ds-j77vl                           2/2     Running   0              37s
mofed-rhcos4.17-69fb4cd685-ds-lw7t9                           2/2     Running   0              37s
nic-feature-discovery-ds-527wc                                1/1     Running   0              36s
nic-feature-discovery-ds-fnn9v                                1/1     Running   0              36s
nic-feature-discovery-ds-l9lkf                                1/1     Running   0              36s
nic-feature-discovery-ds-qn4m9                                1/1     Running   0              36s
nic-feature-discovery-ds-w7vw4                                1/1     Running   0              36s
nv-ipam-controller-67556c846b-c4sfq                           1/1     Running   0              36s
nv-ipam-controller-67556c846b-wvm59                           1/1     Running   0              36s
nv-ipam-node-22rw9                                            1/1     Running   0              36s
nv-ipam-node-6w4x4                                            1/1     Running   0              36s
nv-ipam-node-f2p96                                            1/1     Running   0              36s
nv-ipam-node-jssjh                                            1/1     Running   0              36s
nv-ipam-node-z2mws                                            1/1     Running   0              36s
nvidia-network-operator-controller-manager-57c7cfddc8-6nw6j   1/1     Running   16 (10h ago)   14d
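
Rather than eyeballing the listing, the same check can be scripted against the pod table. This is a rough sketch that only looks at the READY and STATUS columns of pods whose names start with mofed-; the function name mofed_pods_ready is hypothetical.

```shell
# Read `oc get pods --no-headers` output on stdin. Succeed only if at least one
# pod named mofed-* exists and every such pod is Running with all of its
# containers ready (READY column shows n/n).
mofed_pods_ready() {
    awk '
    $1 ~ /^mofed-/ {
        seen++
        split($2, r, "/")
        if ($3 != "Running" || r[1] != r[2]) bad++
    }
    END { exit !(seen > 0 && bad == 0) }'
}
```

A simple wait loop could then be: until oc get pods -n nvidia-network-operator --no-headers | mofed_pods_ready; do sleep 30; done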

We can validate that things look correct from a module perspective by logging into one of the nodes, either via SSH or a debug pod, and listing the nvme modules. The results should look like the following output. Note that I also ran lsblk to show that my NVMe device is visible.

$ ssh core@nvd-srv-23.nvidia.eng.rdu2.dc.redhat.com
Red Hat Enterprise Linux CoreOS 417.94.202502051822-0
  Part of OpenShift 4.17, RHCOS is a Kubernetes-native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.17/architecture/architecture-rhcos.html
Last login: Fri Mar 21 17:48:41 2025 from 10.22.81.26
[systemd]
Failed Units: 1
  NetworkManager-wait-online.service
[core@nvd-srv-23 ~]$ sudo bash
[root@nvd-srv-23 core]# lsmod|grep nvme
nvme_rdma              57344  0
nvme_fabrics           45056  1 nvme_rdma
nvme                   73728  0
nvme_core             204800  3 nvme,nvme_rdma,nvme_fabrics
rdma_cm               155648  3 rpcrdma,nvme_rdma,rdma_ucm
ib_core               557056  10 rdma_cm,ib_ipoib,rpcrdma,nvme_rdma,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
mlx_compat             20480  17 rdma_cm,ib_ipoib,mlxdevm,rpcrdma,nvme,nvme_rdma,mlxfw,iw_cm,nvme_core,nvme_fabrics,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
nvme_common            24576  0
t10_pi                 24576  2 sd_mod,nvme_core

[root@nvd-srv-23 core]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda       8:0    0   1.5T  0 disk 
├─sda1    8:1    0     1M  0 part 
├─sda2    8:2    0   127M  0 part 
├─sda3    8:3    0   384M  0 part /boot
└─sda4    8:4    0   1.5T  0 part /var
                                  /sysroot/ostree/deploy/rhcos/var
                                  /usr
                                  /etc
                                  /
                                  /sysroot
sdb       8:16   0   1.5T  0 disk 
sdc       8:32   0   1.5T  0 disk 
sdd       8:48   0   1.5T  0 disk 
nvme0n1 259:1    0 894.2G  0 disk

This completes the NVIDIA Network Operator portion of the configuration for GPU Direct Storage.

NVIDIA GPU Operator Configuration

Now that the NicClusterPolicy is defined and the proper NVMe modules have been loaded we can move on to configuring our GPU ClusterPolicy. The example below is a policy that will enable GPU Direct Storage on the worker nodes that have a proper NVIDIA GPU.

$ cat <<EOF > gpu-cluster-policy.yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    serviceMonitor:
      enabled: true
    enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      nlsEnabled: true
      configMapName: ''
    certConfig:
      name: ''
    rdma:
      enabled: true
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    virtualTopology:
      config: ''
    enabled: true
    useNvidiaDriverCRD: false
    useOpenKernelModules: true
  devicePlugin:
    config:
      name: ''
      default: ''
    mps:
      root: /run/nvidia/mps
    enabled: true
  gdrcopy:
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: true
    image: 'nvcr.io/nvidia/cloud-native/nvidia-fs:2.20.5'
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    installDir: /usr/local/nvidia
    enabled: true
EOF

Now let's create the policy on the cluster.

$ oc create -f gpu-cluster-policy.yaml 
clusterpolicy.nvidia.com/gpu-cluster-policy created

Once the policy is created let's validate the pods are running before we move onto the next step.

$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-499wh                           1/1     Running     0             18h
gpu-feature-discovery-m68bn                           1/1     Running     0             18h
gpu-operator-c9ccd586d-htl5q                          1/1     Running     0             19h
nvidia-container-toolkit-daemonset-8m4r5              1/1     Running     0             18h
nvidia-container-toolkit-daemonset-ld7qz              1/1     Running     0             18h
nvidia-cuda-validator-fddq7                           0/1     Completed   0             18h
nvidia-cuda-validator-mdk6b                           0/1     Completed   0             18h
nvidia-dcgm-565tj                                     1/1     Running     0             18h
nvidia-dcgm-exporter-jtgt6                            1/1     Running     1 (18h ago)   18h
nvidia-dcgm-exporter-znpgh                            1/1     Running     1 (18h ago)   18h
nvidia-dcgm-xpxbx                                     1/1     Running     0             18h
nvidia-device-plugin-daemonset-2vn52                  1/1     Running     0             18h
nvidia-device-plugin-daemonset-kjzjz                  1/1     Running     0             18h
nvidia-driver-daemonset-417.94.202502051822-0-pj7hk   5/5     Running     2 (18h ago)   18h
nvidia-driver-daemonset-417.94.202502051822-0-qp8xb   5/5     Running     5 (18h ago)   18h
nvidia-node-status-exporter-48cx7                     1/1     Running     0             18h
nvidia-node-status-exporter-dpmsr                     1/1     Running     0             18h
nvidia-operator-validator-fmcz4                       1/1     Running     0             18h
nvidia-operator-validator-g2fbt                       1/1     Running     0             18h

With the NVIDIA GPU Operator pods running we can rsh into one of the driver daemonset pods and confirm GDS is enabled by running the lsmod command (note the nvidia_fs module) and displaying the /proc/driver/nvidia-fs/stats file.

$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-417.94.202502051822-0-pj7hk
sh-4.4# lsmod|grep nvidia
nvidia_fs             327680  0
nvidia_peermem         24576  0
nvidia_modeset       1507328  0
video                  73728  1 nvidia_modeset
nvidia_uvm           6889472  8
nvidia               8810496  43 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
ib_uverbs             217088  19 nvidia_peermem,rdma_ucm,mlx5_ib
drm                   741376  5 drm_kms_helper,drm_shmem_helper,nvidia,mgag200

$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-417.94.202502051822-0-pj7hk
sh-4.4# cat /proc/driver/nvidia-fs/stats
GDS Version: 1.10.0.4 
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.20.5)
Mellanox PeerDirect Supported: True
IO stats: Disabled, peer IO stats: Disabled
Logging level: info

Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads                : err=0 io_state_err=0
Sparse Reads                : n=0 io=0 holes=0 pages=0 
Writes                : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap                : n=0 ok=0 err=0 munmap=0
Bar1-map            : n=0 ok=0 err=0 free=0 callbacks=0 active=0 delay-frees=0
Error                : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops                : Read=0 Write=0 BatchIO=0
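
The two lines I look at first in that file are the NVFS driver version and the PeerDirect support flag. A quick scripted version of that glance might look like the sketch below; the function name gds_stats_ok is hypothetical and treating these two lines as the pass criteria is my own assumption, not an official health check.

```shell
# Succeed if a nvidia-fs stats file reports an NVFS driver version and
# PeerDirect support. $1 = file to read (defaults to the real proc path).
gds_stats_ok() {
    grep -q '^NVFS Driver(version: ' "${1:-/proc/driver/nvidia-fs/stats}" &&
    grep -q '^Mellanox PeerDirect Supported: True' "${1:-/proc/driver/nvidia-fs/stats}"
}
```

Inside the driver daemonset pod, gds_stats_ok && echo "GDS driver loaded" gives a one-line confirmation.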

If everything looks good we can move onto an additional step to confirm GDS is ready for workload consumption.

GDS CUDA Workload Container

Once the GPU Direct Storage drivers are loaded we can use one more additional tool to check and confirm GDS capability. This involves building a container that contains the CUDA packages and then running it on a node. The following pod yaml defines this configuration.

$ cat <<EOF > gds-check-workload.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: gds-check-workload
  namespace: default
spec:
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.3
    name: gds-check-workload
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    volumeMounts:
    - name: udev
      mountPath: /run/udev
    - name: kernel-config
      mountPath: /sys/kernel/config
    - name: dev
      mountPath: /run/dev
    - name: sys
      mountPath: /sys
    - name: results
      mountPath: /results
    - name: lib
      mountPath: /lib/modules
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
  volumes:
    - name: udev
      hostPath:
        path: /run/udev
    - name: kernel-config
      hostPath:
        path: /sys/kernel/config
    - name: dev
      hostPath:
        path: /run/dev
    - name: sys
      hostPath:
        path: /sys
    - name: results
      hostPath:
        path: /results
    - name: lib
      hostPath:
        path: /lib/modules
EOF

Now let's generate the manifest for the rdma service account that the pod references in the default namespace.

$ cat <<EOF > default-serviceaccount.yaml 
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rdma
  namespace: default
EOF

Next we can create it on our cluster.

$ oc create -f default-serviceaccount.yaml 
serviceaccount/rdma created

Finally, with the service account created, we can add privileges to it.

$ oc -n default adm policy add-scc-to-user privileged -z rdma
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "rdma"

With the service account defined and our pod yaml ready, we can create the pod on the cluster.

$ oc create -f gds-check-workload.yaml 
pod/gds-check-workload created

$ oc get pods
NAME                 READY   STATUS    RESTARTS   AGE
gds-check-workload   1/1     Running   0          3s

Once the pod is up and running we can rsh into the pod and run the gdscheck tool to confirm capabilities and configuration of GPU Direct Storage.

$ oc rsh gds-check-workload

sh-5.1# /usr/local/cuda/gds/tools/gdscheck -p
 GDS release version: 1.13.1.3
 nvidia_fs version:  2.20 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe P2PDMA        : Unsupported
 NVMe               : Supported
 NVMeOF             : Supported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Supported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_pci_p2pdma : false
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 64 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 fs.gpfs.gds_async_support: true
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA A40 bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
 ==============
 PLATFORM INFO:
 ==============
 Found ACS enabled for switch 0000:e0:01.0
 IOMMU: Pass-through or enabled
 WARN: GDS is not guaranteed to work functionally or in a performant way with iommu=on/pt
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed:  12040
 Platform: PowerEdge R760xa, Arch: x86_64(Linux 5.14.0-427.50.1.el9_4.x86_64)
 Platform verification succeeded
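The gdscheck output is long, so it can help to reduce it to just the transports and filesystems reported as supported. Here is a minimal sketch; the function name gds_supported is hypothetical, and it assumes the section headings and "name : Supported" layout shown in the output above.

```shell
# gds_supported (hypothetical helper): read `gdscheck -p` output on stdin and
# print the DRIVER CONFIGURATION entries whose status is exactly "Supported".
gds_supported() {
  awk -F':' '
    /DRIVER CONFIGURATION:/ { in_drv = 1; next }
    /CUFILE CONFIGURATION:/ { in_drv = 0 }
    in_drv && $2 ~ /^[ \t]*Supported[ \t]*$/ {
      # Trim the padding around the entry name before printing it.
      gsub(/^[ \t]+|[ \t]+$/, "", $1)
      print $1
    }'
}
```

For the run above this would list NVMe, NVMeOF and NFS, which are the three entries marked Supported in the driver configuration section.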

Hopefully this provides enough detail to enable GPU Direct Storage on OpenShift.

Wednesday, January 08, 2025

Build RDMA GPU-Tools Container

The purpose of this blog is to build a container that automates building the testing tooling for validating RDMA connectivity and performance when used in conjunction with NVIDIA Network Operator and NVIDIA GPU Operator. Specifically I want to be able to use the ib_write_bw command with the --use_cuda switch to demonstrate RDMA from one GPU in a node to another GPU in another node in an OpenShift cluster. The ib_write_bw command is part of the perftest suite which is a collection of tests written over uverbs intended for use as a performance micro-benchmark. The tests may be used for HW or SW tuning as well as for functional testing.

The collection contains a set of bandwidth and latency benchmarks such as:

Send - ib_send_bw and ib_send_lat
RDMA Read - ib_read_bw and ib_read_lat
RDMA Write - ib_write_bw and ib_write_lat
RDMA Atomic - ib_atomic_bw and ib_atomic_lat
Native Ethernet (when working with MOFED2) - raw_ethernet_bw, raw_ethernet_lat

In previous blogs, here and here, I used a Fedora 35 container and manually added the components I wanted but here we will provide the tooling to build a container that will instantiate itself upon deployment. The workflow is as follows:

dockerfile.tools - provides the content for the base image and copies in the entrypoint.sh script.
entrypoint.sh - the startup script for the container, which pulls in the NVIDIA CUDA libraries and then builds and installs the perftest suite with the CUDA option available.
Additional RPMs - packages that are not part of the UBI image repo but are dependencies of the CUDA toolkit.

The first thing we need to do is create a working directory for our files and an rpms directory for the rpms we will need for our base image. I am using root here but it could be a regular user as well.

$ mkdir -p /root/gpu-tools/rpms
$ cd /root/gpu-tools

Next we need to download the following rpms from Red Hat Package Downloads and place them into the rpms directory.

infiniband-diags-51.0-1.el9.x86_64.rpm
libglvnd-opengl-1.3.4-1.el9.x86_64.rpm
libibumad-51.0-1.el9.x86_64.rpm
librdmacm-51.0-1.el9.x86_64.rpm
libxcb-1.13.1-9.el9.x86_64.rpm
libxcb-devel-1.13.1-9.el9.x86_64.rpm
libxkbcommon-1.0.3-4.el9.x86_64.rpm
libxkbcommon-x11-1.0.3-4.el9.x86_64.rpm
pciutils-devel-3.7.0-5.el9.x86_64.rpm
rdma-core-devel-51.0-1.el9.x86_64.rpm
xcb-util-0.4.0-19.el9.x86_64.rpm
xcb-util-image-0.4.0-19.el9.x86_64.rpm
xcb-util-keysyms-0.4.0-17.el9.x86_64.rpm
xcb-util-renderutil-0.3.9-20.el9.x86_64.rpm
xcb-util-wm-0.4.1-22.el9.x86_64.rpm
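Since the image build fails if any of these files is absent, it can be worth checking the rpms directory before building. The sketch below is an assumption about how one might do that; the function name missing_rpms is hypothetical and the file names are simply the list above.

```shell
# missing_rpms DIR (hypothetical helper): print every required RPM that is
# not present in DIR and return non-zero if at least one is missing.
missing_rpms() {
  local dir=$1 rpm rc=0
  for rpm in \
    infiniband-diags-51.0-1.el9.x86_64.rpm \
    libglvnd-opengl-1.3.4-1.el9.x86_64.rpm \
    libibumad-51.0-1.el9.x86_64.rpm \
    librdmacm-51.0-1.el9.x86_64.rpm \
    libxcb-1.13.1-9.el9.x86_64.rpm \
    libxcb-devel-1.13.1-9.el9.x86_64.rpm \
    libxkbcommon-1.0.3-4.el9.x86_64.rpm \
    libxkbcommon-x11-1.0.3-4.el9.x86_64.rpm \
    pciutils-devel-3.7.0-5.el9.x86_64.rpm \
    rdma-core-devel-51.0-1.el9.x86_64.rpm \
    xcb-util-0.4.0-19.el9.x86_64.rpm \
    xcb-util-image-0.4.0-19.el9.x86_64.rpm \
    xcb-util-keysyms-0.4.0-17.el9.x86_64.rpm \
    xcb-util-renderutil-0.3.9-20.el9.x86_64.rpm \
    xcb-util-wm-0.4.1-22.el9.x86_64.rpm
  do
    if [ ! -f "$dir/$rpm" ]; then
      echo "missing: $rpm"
      rc=1
    fi
  done
  return $rc
}
```

Running missing_rpms /root/gpu-tools/rpms prints nothing and returns 0 once every file is in place.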

Once we have all our rpms for the base image we can move onto creating the dockerfile.tools file which we will use to build our image.

$ cat <<'EOF' >dockerfile.tools
# Start from UBI9 image
FROM registry.access.redhat.com/ubi9/ubi:latest

# Set work directory
WORKDIR /root
RUN mkdir /root/rpms
COPY ./rpms/*.rpm /root/rpms/

# DNF install packages either from repo or locally
RUN dnf install `ls -1 /root/rpms/*.rpm` -y
RUN dnf install wget procps-ng pciutils jq iputils ethtool net-tools git autoconf automake libtool -y

# Cleanup 
WORKDIR /root
RUN dnf clean all

# Run container entrypoint
COPY entrypoint.sh /root/entrypoint.sh
RUN chmod +x /root/entrypoint.sh

ENTRYPOINT ["/root/entrypoint.sh"]
EOF

We also need to create the entrypoint.sh script which is referenced in the dockerfile and does the heavy lifting of pulling in the cuda toolkit and the perftest suite.

$ cat <<'EOF' > entrypoint.sh
#!/bin/bash
# Set working dir
cd /root

# Configure and install cuda-toolkit
dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
dnf clean all
dnf -y install cuda-toolkit-12-6

# Export CUDA library paths
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH

# Git clone perftest repository
git clone https://github.com/linux-rdma/perftest.git

# Change into perftest directory
cd /root/perftest

# Build perftest with the cuda libraries included
./autogen.sh
./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h
make -j
make install

# Sleep container indefinitely
sleep infinity & wait
EOF
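Because the CUDA toolkit install and the perftest build happen when the container starts, the perftest binaries are not available the moment the pod reports Running. A small polling loop can tell us when the build has finished. This is a minimal sketch under that assumption; the function name wait_for_cmd, the 5 second interval and the default timeout are all illustrative choices.

```shell
# wait_for_cmd NAME [TIMEOUT_SECONDS] (hypothetical helper): poll until NAME
# is found on PATH, or give up once TIMEOUT_SECONDS have elapsed.
wait_for_cmd() {
  local name=$1 timeout=${2:-1800} waited=0
  until command -v "$name" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for $name" >&2
      return 1
    fi
    sleep 5
    waited=$((waited + 5))
  done
  echo "$name is available"
}
```

Pasted into a shell inside the running container, wait_for_cmd ib_write_bw returns once make install has placed the binary on the PATH.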

Next we can use the dockerfile we just created to build the base image.

$ podman build -f dockerfile.tools -t quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
STEP 1/10: FROM registry.access.redhat.com/ubi9/ubi:latest
STEP 2/10: WORKDIR /root
--> Using cache 75f163f12503272b83e1137f7c1903520f84493ffe5aec0ef32ece722bd0d815
--> 75f163f12503
STEP 3/10: RUN mkdir /root/rpms
--> Using cache ade32aa6605847a8b3f5c8b68cfcb64854dc01eece34868faab46137a60f946c
--> ade32aa66058
STEP 4/10: COPY ./rpms/*.rpm /root/rpms/
--> Using cache 59dcef81d6675f44d22900f13a3e5441f5073555d7d2faa0b2f261f32e4ba6cd
--> 59dcef81d667
STEP 5/10: RUN dnf install `ls -1 /root/rpms/*.rpm` -y
--> Using cache ebb8b3150056240378ac36f7aa41d7f13b13308e9353513f26a8d3d70e618e3b
--> ebb8b3150056
STEP 6/10: RUN dnf install wget procps-ng pciutils jq iputils ethtool net-tools git autoconf automake libtool -y
--> Using cache 5ca85080c103ba559994906ada0417102f54f22c182bbc3a06913109855278cc
--> 5ca85080c103
STEP 7/10: WORKDIR /root
--> Using cache 68c8cd47a41bc364a0da5790c90f9aee5f8a8c7807732f3a5138bff795834fc1
--> 68c8cd47a41b
STEP 8/10: RUN dnf clean all
Updating Subscription Management repositories.
Unable to read consumer identity

This system is not registered with an entitlement server. You can use subscription-manager to register.

26 files removed
--> a219fec5df49
STEP 9/10: COPY entrypoint.sh /root/entrypoint.sh
--> aeb03bf74673
STEP 10/10: ENTRYPOINT ["/bin/bash", "/root/entrypoint.sh"]
COMMIT quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
--> 45c2113e5082
Successfully tagged quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
45c2113e5082fb2f548b9e1b16c17524184c4079e2db77399519cf29829af1e7

Once the image is created we can push it to our favorite registry.

$ podman push quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
Getting image source signatures
Copying blob 62ee1c6c02d5 done   | 
Copying blob 6027214db22e done   | 
Copying blob 4822ebd5a418 done   | 
Copying blob 422a0e40f90b done   | 
Copying blob 5916e2b21ab2 done   | 
Copying blob 10bf375a4d78 done   | 
Copying blob ca1c18e183d5 done   | 
Copying config 3bbb6e1f9b done   | 
Writing manifest to image destination

Now that we have an image let's test it out on the cluster where we have compatible RDMA hardware configured. I am using the same setup as I used in a previous blog so I am going to skip the details about setting up a service account and providing the privileges to it. We will however create our workload pod yaml files which we will use to deploy the image.

$ cat <<EOF >rdma-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-32-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector: 
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
    name: rdma-32-workload
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF

$ cat <<EOF >rdma-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-33-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector: 
    kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
    name: rdma-33-workload
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF
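The two manifests above differ only in the node number, so they could also be produced from a single template. The following is a minimal sketch; the function name gen_rdma_pod is hypothetical, and it assumes every node follows the nvd-srv-NN.nvidia.eng.rdu2.dc.redhat.com naming used in this environment.

```shell
# gen_rdma_pod NODE_NUM (hypothetical helper): print the workload pod
# manifest pinned to node nvd-srv-NODE_NUM.
gen_rdma_pod() {
  local n=$1
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-${n}-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-${n}.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
    name: rdma-${n}-workload
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF
}
```

For example, gen_rdma_pod 32 > rdma-32-workload.yaml writes the same content as the first manifest.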

Next we can deploy the containers.

$ oc create -f rdma-32-workload.yaml 
pod/rdma-eth-32-workload created

$ oc create -f rdma-33-workload.yaml 
pod/rdma-eth-33-workload created

Validate the pods are up and running.

$ oc get pods
NAME                   READY   STATUS    RESTARTS   AGE
rdma-eth-32-workload   1/1     Running   0          51s
rdma-eth-33-workload   1/1     Running   0          47s

Now open two terminals, rsh into one pod from each, and validate that the perftest commands are present. We can also get the IP address of each pod's net1 interface from inside the containers.

$ oc rsh rdma-eth-32-workload
sh-5.1# ib
ib_atomic_bw         ib_read_lat          ib_write_bw          ibcacheedit          ibfindnodesusing.pl  iblinkinfo           ibping               ibroute              ibstatus             ibtracert            
ib_atomic_lat        ib_send_bw           ib_write_lat         ibccconfig           ibhosts              ibnetdiscover        ibportstate          ibrouters            ibswitches           
ib_read_bw           ib_send_lat          ibaddr               ibccquery            ibidsverify.pl       ibnodes              ibqueryerrors        ibstat               ibsysstat            
sh-5.1# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0@if96: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default 
    link/ether 0a:58:0a:83:00:34 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.131.0.52/23 brd 10.131.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe83:34/64 scope link 
       valid_lft forever preferred_lft forever
3: net1@if78: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 32:1a:83:4a:e2:39 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.2.1/24 brd 192.168.2.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::301a:83ff:fe4a:e239/64 scope link 
       valid_lft forever preferred_lft forever

$ oc rsh rdma-eth-33-workload
sh-5.1# ib
ib_atomic_bw         ib_read_lat          ib_write_bw          ibcacheedit          ibfindnodesusing.pl  iblinkinfo           ibping               ibroute              ibstatus             ibtracert            
ib_atomic_lat        ib_send_bw           ib_write_lat         ibccconfig           ibhosts              ibnetdiscover        ibportstate          ibrouters            ibswitches           
ib_read_bw           ib_send_lat          ibaddr               ibccquery            ibidsverify.pl       ibnodes              ibqueryerrors        ibstat               ibsysstat            
sh-5.1# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0@if105: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default 
    link/ether 0a:58:0a:80:02:3d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.128.2.61/23 brd 10.128.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe80:23d/64 scope link 
       valid_lft forever preferred_lft forever
3: net1@if82: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 22:3e:02:c9:d0:87 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.2.2/24 brd 192.168.2.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::203e:2ff:fec9:d087/64 scope link 
       valid_lft forever preferred_lft forever
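The net1 addresses are what the perftest commands need, so it can be handy to pull them out of the ip a output instead of reading them off by hand. Here is a minimal sketch; the function name iface_ipv4 is hypothetical and it assumes the ip a layout shown above.

```shell
# iface_ipv4 IFACE (hypothetical helper): read `ip a` output on stdin and
# print the first IPv4 address assigned to IFACE, without the prefix length.
iface_ipv4() {
  awk -v ifc="$1" '
    # Interface header lines look like "3: net1@if78: <...>".
    /^[0-9]+: / { split($2, name, /[@:]/); cur = name[1] }
    cur == ifc && $1 == "inet" { sub(/\/.*/, "", $2); print $2; exit }'
}
```

For example, oc rsh rdma-eth-32-workload ip a | iface_ipv4 net1 would print 192.168.2.1 for the first pod shown above.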

Now let's run the RDMA perftest with the --use_cuda switch. Again we will need two rsh sessions, one on each pod. In the first terminal we start the server side, which waits for a client to connect.

sh-5.1# ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60  -d mlx5_1 -p 10000 --source_ip 192.168.2.1 --use_cuda=0
 WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0

************************************
* Waiting for client to connect... *
************************************
~

In the second terminal we run the client side, passing the server's address (192.168.2.1) as the last argument. It prints the results once the 60 second run completes.

sh-5.1# ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60  -d mlx5_1 -p 10000 --source_ip 192.168.2.2 --use_cuda=0 192.168.2.1
 WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
Requested mtu is higher than active mtu 
Changing to active mtu - 3
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is E1:00

Picking device No. 0
[pid = 4101, dev = 0] device name = [NVIDIA A40]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 2097152 bytes GPU buffer
allocated GPU buffer address at 00007f3dfa600000 pointer=0x7f3dfa600000
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF        Device         : mlx5_1
 Number of qps   : 16        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 PCIe relax order: ON        Lock-free      : OFF
 ibv_wr* API     : ON        Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm     TOS    : 41
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x00c6 PSN 0x2986aa
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 local address: LID 0000 QPN 0x00c7 PSN 0xa0ef83
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 local address: LID 0000 QPN 0x00c8 PSN 0x74badb
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 local address: LID 0000 QPN 0x00c9 PSN 0x287d57
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 local address: LID 0000 QPN 0x00ca PSN 0xf5b155
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 local address: LID 0000 QPN 0x00cb PSN 0x6cc15d
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 local address: LID 0000 QPN 0x00cc PSN 0x3730c2
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 local address: LID 0000 QPN 0x00cd PSN 0x74d75d
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 local address: LID 0000 QPN 0x00ce PSN 0x51a707
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 local address: LID 0000 QPN 0x00cf PSN 0x987246
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 local address: LID 0000 QPN 0x00d0 PSN 0xa334a8
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 local address: LID 0000 QPN 0x00d1 PSN 0x5d8f52
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 local address: LID 0000 QPN 0x00d2 PSN 0xc42ca0
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 local address: LID 0000 QPN 0x00d3 PSN 0xf43696
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 local address: LID 0000 QPN 0x00d4 PSN 0x43f9d2
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 local address: LID 0000 QPN 0x00d5 PSN 0xbc4d64
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 remote address: LID 0000 QPN 0x00c6 PSN 0xb1023e
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 remote address: LID 0000 QPN 0x00c7 PSN 0xc78587
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 remote address: LID 0000 QPN 0x00c8 PSN 0x5a328f
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 remote address: LID 0000 QPN 0x00c9 PSN 0x582cfb
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 remote address: LID 0000 QPN 0x00cb PSN 0x40d229
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 remote address: LID 0000 QPN 0x00cc PSN 0x5833a1
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 remote address: LID 0000 QPN 0x00cd PSN 0xcfefb6
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 remote address: LID 0000 QPN 0x00ce PSN 0xfd5d41
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 remote address: LID 0000 QPN 0x00cf PSN 0xed811b
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 remote address: LID 0000 QPN 0x00d0 PSN 0x5244ca
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 remote address: LID 0000 QPN 0x00d1 PSN 0x946edc
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 remote address: LID 0000 QPN 0x00d2 PSN 0x4e0f76
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 remote address: LID 0000 QPN 0x00d3 PSN 0x7b13f4
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 remote address: LID 0000 QPN 0x00d4 PSN 0x1a2d5a
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 remote address: LID 0000 QPN 0x00d5 PSN 0xd22346
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 remote address: LID 0000 QPN 0x00d6 PSN 0x722bc8
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      10384867         0.00               181.46              0.346100
---------------------------------------------------------------------------------------
deallocating GPU buffer 00007f3dfa600000
destroying current CUDA Ctx

And if we return to the first terminal we should see it updated with the corresponding server-side output and the same results.

sh-5.1# ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60  -d mlx5_1 -p 10000 --source_ip 192.168.2.1 --use_cuda=0
 WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0

************************************
* Waiting for client to connect... *
************************************
Requested mtu is higher than active mtu 
Changing to active mtu - 3
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 61:00

Picking device No. 0
[pid = 4109, dev = 0] device name = [NVIDIA A40]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 2097152 bytes GPU buffer
allocated GPU buffer address at 00007f8bca600000 pointer=0x7f8bca600000
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF        Device         : mlx5_1
 Number of qps   : 16        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 PCIe relax order: ON        Lock-free      : OFF
 ibv_wr* API     : ON        Using DDP      : OFF
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm     TOS    : 41
---------------------------------------------------------------------------------------
 Waiting for client rdma_cm QP to connect
 Please run the same command with the IB/RoCE interface IP
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x00c6 PSN 0xb1023e
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 local address: LID 0000 QPN 0x00c7 PSN 0xc78587
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 local address: LID 0000 QPN 0x00c8 PSN 0x5a328f
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 local address: LID 0000 QPN 0x00c9 PSN 0x582cfb
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 local address: LID 0000 QPN 0x00cb PSN 0x40d229
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 local address: LID 0000 QPN 0x00cc PSN 0x5833a1
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 local address: LID 0000 QPN 0x00cd PSN 0xcfefb6
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 local address: LID 0000 QPN 0x00ce PSN 0xfd5d41
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 local address: LID 0000 QPN 0x00cf PSN 0xed811b
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 local address: LID 0000 QPN 0x00d0 PSN 0x5244ca
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 local address: LID 0000 QPN 0x00d1 PSN 0x946edc
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 local address: LID 0000 QPN 0x00d2 PSN 0x4e0f76
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 local address: LID 0000 QPN 0x00d3 PSN 0x7b13f4
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 local address: LID 0000 QPN 0x00d4 PSN 0x1a2d5a
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 local address: LID 0000 QPN 0x00d5 PSN 0xd22346
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 local address: LID 0000 QPN 0x00d6 PSN 0x722bc8
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32
 remote address: LID 0000 QPN 0x00c6 PSN 0x2986aa
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 remote address: LID 0000 QPN 0x00c7 PSN 0xa0ef83
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 remote address: LID 0000 QPN 0x00c8 PSN 0x74badb
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 remote address: LID 0000 QPN 0x00c9 PSN 0x287d57
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 remote address: LID 0000 QPN 0x00ca PSN 0xf5b155
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 remote address: LID 0000 QPN 0x00cb PSN 0x6cc15d
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 remote address: LID 0000 QPN 0x00cc PSN 0x3730c2
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 remote address: LID 0000 QPN 0x00cd PSN 0x74d75d
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 remote address: LID 0000 QPN 0x00ce PSN 0x51a707
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 remote address: LID 0000 QPN 0x00cf PSN 0x987246
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 remote address: LID 0000 QPN 0x00d0 PSN 0xa334a8
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 remote address: LID 0000 QPN 0x00d1 PSN 0x5d8f52
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 remote address: LID 0000 QPN 0x00d2 PSN 0xc42ca0
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 remote address: LID 0000 QPN 0x00d3 PSN 0xf43696
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 remote address: LID 0000 QPN 0x00d4 PSN 0x43f9d2
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
 remote address: LID 0000 QPN 0x00d5 PSN 0xbc4d64
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      10384867         0.00               181.46              0.346100
---------------------------------------------------------------------------------------
deallocating GPU buffer 00007f8bca600000
destroying current CUDA Ctx
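When repeating these runs it is convenient to extract just the average bandwidth from the report. The sketch below is one way to do that; the function name perftest_avg_gbps is hypothetical and it assumes the five-column results table shown above.

```shell
# perftest_avg_gbps (hypothetical helper): read ib_write_bw output on stdin
# and print the "BW average[Gb/sec]" value from the first results row.
perftest_avg_gbps() {
  awk '
    /#bytes/ && /BW average/ { want = 1; next }
    # The results row starts with the message size, a plain integer.
    want && $1 ~ /^[0-9]+$/ { print $4; exit }'
}
```

Piping either of the runs above through perftest_avg_gbps would print 181.46.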

Hopefully this helped demonstrate a cleaner and more automated way to build a perftest container with CUDA enabled to perform RDMA testing on OpenShift with NVIDIA Network Operator and NVIDIA GPU Operator.