Lab Environment
The following configurations and testing were done on an OpenShift environment that consisted of the following:
- OpenShift 4.16.19 x86
- Network Operator 24.10
- All other operators used the default values for OCP 4.16.
- 3 physical nodes: 1 SNO master, 2 workers
- The workers were Dell R760xa servers, each with two NVIDIA BlueField-3 (BF3) cards.
- One BF3 card in each worker was attached to an NVIDIA Spectrum SN5600 switch for RDMA over Ethernet.
- The other BF3 card was attached to an NVIDIA Quantum QM9700 switch for RDMA over InfiniBand.
Blacklist IRDMA Module
On some systems, including the Dell R760xa servers I used for testing, the irdma kernel module causes problems for the NVIDIA Network Operator when the DOCA drivers are unloaded and reloaded, so we need to blacklist it with a machine configuration that gets applied to all worker nodes.
Generate the following MachineConfig YAML, specifying the irdma module to blacklist.
$ cat <<EOF > 99-machine-config-blacklist-irdma.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
labels:
machineconfiguration.openshift.io/role: worker
name: 99-worker-blacklist-irdma
spec:
kernelArguments:
- "module_blacklist=irdma"
EOF
Then create the machine configuration on the cluster and wait for the worker nodes to reboot.
$ oc create -f 99-machine-config-blacklist-irdma.yaml
machineconfig.machineconfiguration.openshift.io/99-worker-blacklist-irdma created
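The rollout happens one worker at a time and can take several minutes per node. A minimal way to block until the pool has settled (assuming the default worker pool) is to watch or wait on its Updated condition:
$ oc get mcp worker -w
$ oc wait mcp/worker --for condition=Updated --timeout=30m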
Validate in a debug pod on each node that the module has not loaded.
$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-32nvidiaengrdu2dcredhatcom-debug-btfj2 ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.11
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# lsmod|grep irdma
sh-5.1#
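Optionally, we can also confirm that the kernel argument itself made it onto the host command line; a quick sketch using the same node without an interactive session:
$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com -- chroot /host cat /proc/cmdline | grep module_blacklist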
At this point, if everything looks good, we can move onto the next steps of the workflow.
Persistent Naming Rules
Sometimes there is a need to make sure the device names persist across reboots. On the R760xa systems, and on nodes with a large number of network cards, I noticed the Mellanox devices were being renamed on reboots, so I decided to use a MachineConfig to set persistent names.
First, gather the MAC addresses of the relevant interfaces on the worker node(s) and decide on the names those interfaces should keep. We will call the file 70-persistent-net.rules and stash the details in it.
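If the MAC addresses are not already known, a minimal sketch for collecting them is to list the links on each worker from a debug pod (shown here for one of the lab workers):
$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com -- chroot /host ip -br link show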
$ cat <<EOF > 70-persistent-net.rules
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:28",ATTR{type}=="1",NAME="ibs2f0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:29",ATTR{type}=="1",NAME="ens8f0np0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d0",ATTR{type}=="1",NAME="ibs2f0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d1",ATTR{type}=="1",NAME="ens8f0np0"
EOF
Now we need to convert that file into a base64 string without line breaks and assign the output to the variable PERSIST.
$ PERSIST=`cat 70-persistent-net.rules| base64 -w 0`
$ echo $PERSIST
U1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIK
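If we want to double check the encoding before embedding it, the string can be decoded and compared against the original file:
$ echo $PERSIST | base64 -d | diff - 70-persistent-net.rules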
Now we can create a machine configuration and set the base64 encoding in our custom resource file. Notice how I am using the PERSIST variable in my YAML creation to mitigate copy/paste errors.
$ cat <<EOF > 99-machine-config-udev-network.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
labels:
machineconfiguration.openshift.io/role: worker
name: 99-machine-config-udev-network
spec:
config:
ignition:
version: 3.2.0
storage:
files:
- contents:
source: data:text/plain;base64,$PERSIST
filesystem: root
mode: 420
path: /etc/udev/rules.d/70-persistent-net.rules
EOF
Once we have the machine configuration we can create it on the cluster.
$ oc create -f 99-machine-config-udev-network.yaml
machineconfig.machineconfiguration.openshift.io/99-machine-config-udev-network created
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-9adfe851c2c14d9598eea5ec3df6c187 True False False 1 1 1 0 6h21m
worker rendered-worker-4568f1b174066b4b1a4de794cf538fee False True False 2 0 0 0 6h21m
The worker nodes will reboot, and once the UPDATING field goes back to False we can validate the device names on the nodes in a debug pod if we choose to do so.
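A minimal sketch of that check, using the same worker node as before, is to confirm the rules file landed and that the interfaces now carry the expected names:
$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com -- chroot /host cat /etc/udev/rules.d/70-persistent-net.rules
$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com -- chroot /host ip -br link show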
If everything looks good we can move onto configuring the operators of the OpenShift cluster.
Install and Configure Required Operators
Install and Configure NFD Operator
The Node Feature Discovery (NFD) operator manages the detection of hardware features and configuration in an OpenShift Container Platform cluster by labeling the nodes with hardware-specific information. NFD labels the host with node-specific attributes, such as PCI cards, kernel, operating system version, and so on.
To get started we will generate an NFD Operator custom resource file that will create the namespace, operator group and subscription.
$ cat <<EOF > nfd-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
name: openshift-nfd
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
name: openshift-nfd
namespace: openshift-nfd
spec:
targetNamespaces:
- openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: nfd
namespace: openshift-nfd
spec:
channel: "stable"
installPlanApproval: Automatic
name: nfd
source: redhat-operators
sourceNamespace: openshift-marketplace
EOF
Next we can create the resources on the cluster.
$ oc create -f nfd-operator.yaml
namespace/openshift-nfd created
operatorgroup.operators.coreos.com/openshift-nfd created
subscription.operators.coreos.com/nfd created
We can validate that the operator is installed and running by looking at the
pods in the openshift-nfd
namespace.
$ oc get pods -n openshift-nfd
NAME READY STATUS RESTARTS AGE
nfd-controller-manager-8698c88cdd-t8gbc 2/2 Running 0 2m
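If the controller pod does not appear, checking the Subscription and ClusterServiceVersion is a quick way to see where the install stalled:
$ oc get subscription nfd -n openshift-nfd
$ oc get csv -n openshift-nfd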
With the NFD controller running we can move onto generating the
NodeFeatureDiscovery
instance and adding it to the cluster.
The ClusterServiceVersion
specification for NFD operator provides default
values, including the NFD operand image that is part of the operator payload.
We retrieve its value with the following command line and assign it to the variable NFD_OPERAND_IMAGE.
$ NFD_OPERAND_IMAGE=`echo $(oc get csv -n openshift-nfd -o json | jq -r '.items[0].metadata.annotations["alm-examples"]') | jq -r '.[] | select(.kind == "NodeFeatureDiscovery") | .spec.operand.image'`
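We can echo the variable to confirm the lookup worked before using it; the exact image path will vary with the installed CSV version:
$ echo $NFD_OPERAND_IMAGE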
We can now create the NodeFeatureDiscovery instance. Note that we add entries to the default deviceClassWhitelist field to support more network adapters, such as the NVIDIA BlueField DPUs and the NVIDIA GPUs.
$ cat <<EOF > nfd-instance.yaml
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
name: nfd-instance
namespace: openshift-nfd
spec:
instance: ''
operand:
image: '${NFD_OPERAND_IMAGE}'
servicePort: 12000
prunerOnDelete: false
topologyUpdater: false
workerConfig:
configData: |
core:
sleepInterval: 60s
sources:
pci:
deviceClassWhitelist:
- "02"
- "03"
- "0200"
- "0207"
- "12"
deviceLabelFields:
- "vendor"
EOF
$ oc create -f nfd-instance.yaml
nodefeaturediscovery.nfd.openshift.io/nfd-instance created
Finally we can validate our instance is up and running by again looking at the
pods under the openshift-nfd
namespace.
$ oc get pods -n openshift-nfd
NAME READY STATUS RESTARTS AGE
nfd-controller-manager-7cb6d656-jcnqb 2/2 Running 0 4m
nfd-gc-7576d64889-s28k9 1/1 Running 0 21s
nfd-master-b7bcf5cfd-qnrmz 1/1 Running 0 21s
nfd-worker-96pfh 1/1 Running 0 21s
nfd-worker-b2gkg 1/1 Running 0 21s
nfd-worker-bd9bk 1/1 Running 0 21s
nfd-worker-cswf4 1/1 Running 0 21s
nfd-worker-kp6gg 1/1 Running 0 21s
After a minute or so, we can verify that NFD has added labels to the node.
The NFD labels are prefixed with feature.node.kubernetes.io
, so we can easily
filter them.
$ oc get node -o json | jq '.items[0].metadata.labels | with_entries(select(.key | startswith("feature.node.kubernetes.io")))'
{
"feature.node.kubernetes.io/cpu-cpuid.ADX": "true",
"feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
"feature.node.kubernetes.io/cpu-cpuid.CETSS": "true",
"feature.node.kubernetes.io/cpu-cpuid.CLZERO": "true",
"feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8": "true",
"feature.node.kubernetes.io/cpu-cpuid.CPBOOST": "true",
"feature.node.kubernetes.io/cpu-cpuid.EFER_LMSLE_UNS": "true",
"feature.node.kubernetes.io/cpu-cpuid.FMA3": "true",
"feature.node.kubernetes.io/cpu-cpuid.FP256": "true",
"feature.node.kubernetes.io/cpu-cpuid.FSRM": "true",
"feature.node.kubernetes.io/cpu-cpuid.FXSR": "true",
"feature.node.kubernetes.io/cpu-cpuid.FXSROPT": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBPB": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBRS": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBRS_PREFERRED": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBRS_PROVIDES_SMP": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBS": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSFFV": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBS_FETCH_CTLX": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBS_OPFUSE": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBS_PREVENTHOST": "true",
"feature.node.kubernetes.io/cpu-cpuid.INT_WBINVD": "true",
"feature.node.kubernetes.io/cpu-cpuid.INVLPGB": "true",
"feature.node.kubernetes.io/cpu-cpuid.LAHF": "true",
"feature.node.kubernetes.io/cpu-cpuid.LBRVIRT": "true",
"feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW": "true",
"feature.node.kubernetes.io/cpu-cpuid.MCOMMIT": "true",
"feature.node.kubernetes.io/cpu-cpuid.MOVBE": "true",
"feature.node.kubernetes.io/cpu-cpuid.MOVU": "true",
"feature.node.kubernetes.io/cpu-cpuid.MSRIRC": "true",
"feature.node.kubernetes.io/cpu-cpuid.MSR_PAGEFLUSH": "true",
"feature.node.kubernetes.io/cpu-cpuid.NRIPS": "true",
"feature.node.kubernetes.io/cpu-cpuid.OSXSAVE": "true",
"feature.node.kubernetes.io/cpu-cpuid.PPIN": "true",
"feature.node.kubernetes.io/cpu-cpuid.PSFD": "true",
"feature.node.kubernetes.io/cpu-cpuid.RDPRU": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_64BIT": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_ALTERNATIVE": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_DEBUGSWAP": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_ES": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_RESTRICTED": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_SNP": "true",
"feature.node.kubernetes.io/cpu-cpuid.SHA": "true",
"feature.node.kubernetes.io/cpu-cpuid.SME": "true",
"feature.node.kubernetes.io/cpu-cpuid.SME_COHERENT": "true",
"feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD": "true",
"feature.node.kubernetes.io/cpu-cpuid.SSE4A": "true",
"feature.node.kubernetes.io/cpu-cpuid.STIBP": "true",
"feature.node.kubernetes.io/cpu-cpuid.STIBP_ALWAYSON": "true",
"feature.node.kubernetes.io/cpu-cpuid.SUCCOR": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVM": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVMDA": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVMFBASID": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVML": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVMNP": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVMPF": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVMPFT": "true",
"feature.node.kubernetes.io/cpu-cpuid.SYSCALL": "true",
"feature.node.kubernetes.io/cpu-cpuid.SYSEE": "true",
"feature.node.kubernetes.io/cpu-cpuid.TLB_FLUSH_NESTED": "true",
"feature.node.kubernetes.io/cpu-cpuid.TOPEXT": "true",
"feature.node.kubernetes.io/cpu-cpuid.TSCRATEMSR": "true",
"feature.node.kubernetes.io/cpu-cpuid.VAES": "true",
"feature.node.kubernetes.io/cpu-cpuid.VMCBCLEAN": "true",
"feature.node.kubernetes.io/cpu-cpuid.VMPL": "true",
"feature.node.kubernetes.io/cpu-cpuid.VMSA_REGPROT": "true",
"feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ": "true",
"feature.node.kubernetes.io/cpu-cpuid.VTE": "true",
"feature.node.kubernetes.io/cpu-cpuid.WBNOINVD": "true",
"feature.node.kubernetes.io/cpu-cpuid.X87": "true",
"feature.node.kubernetes.io/cpu-cpuid.XGETBV1": "true",
"feature.node.kubernetes.io/cpu-cpuid.XSAVE": "true",
"feature.node.kubernetes.io/cpu-cpuid.XSAVEC": "true",
"feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT": "true",
"feature.node.kubernetes.io/cpu-cpuid.XSAVES": "true",
"feature.node.kubernetes.io/cpu-hardware_multithreading": "false",
"feature.node.kubernetes.io/cpu-model.family": "25",
"feature.node.kubernetes.io/cpu-model.id": "1",
"feature.node.kubernetes.io/cpu-model.vendor_id": "AMD",
"feature.node.kubernetes.io/kernel-config.NO_HZ": "true",
"feature.node.kubernetes.io/kernel-config.NO_HZ_FULL": "true",
"feature.node.kubernetes.io/kernel-selinux.enabled": "true",
"feature.node.kubernetes.io/kernel-version.full": "5.14.0-427.35.1.el9_4.x86_64",
"feature.node.kubernetes.io/kernel-version.major": "5",
"feature.node.kubernetes.io/kernel-version.minor": "14",
"feature.node.kubernetes.io/kernel-version.revision": "0",
"feature.node.kubernetes.io/memory-numa": "true",
"feature.node.kubernetes.io/network-sriov.capable": "true",
"feature.node.kubernetes.io/pci-102b.present": "true",
"feature.node.kubernetes.io/pci-10de.present": "true",
"feature.node.kubernetes.io/pci-10de.sriov.capable": "true",
"feature.node.kubernetes.io/pci-15b3.present": "true",
"feature.node.kubernetes.io/pci-15b3.sriov.capable": "true",
"feature.node.kubernetes.io/rdma.available": "true",
"feature.node.kubernetes.io/rdma.capable": "true",
"feature.node.kubernetes.io/storage-nonrotationaldisk": "true",
"feature.node.kubernetes.io/system-os_release.ID": "rhcos",
"feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION": "4.17",
"feature.node.kubernetes.io/system-os_release.OSTREE_VERSION": "417.94.202409121747-0",
"feature.node.kubernetes.io/system-os_release.RHEL_VERSION": "9.4",
"feature.node.kubernetes.io/system-os_release.VERSION_ID": "4.17",
"feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "4",
"feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "17"
}
Finally we can confirm that a Mellanox network device (PCI vendor ID 15b3) was discovered on each worker.
$ oc describe node | grep -E 'Roles|pci' | grep pci-15b3
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
If everything looks good we can move onto the next operator.
Install and Configure NMState Operator
There might be a need to configure network interfaces on the nodes that were not configured at initial cluster creation time, and the NMState operator is designed for those use cases. The first step is to create a custom resource file that contains the namespace, operator group and subscription.
$ cat <<EOF > nmstate-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
labels:
kubernetes.io/metadata.name: openshift-nmstate
name: openshift-nmstate
name: openshift-nmstate
spec:
finalizers:
- kubernetes
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
annotations:
olm.providedAPIs: NMState.v1.nmstate.io
name: openshift-nmstate
namespace: openshift-nmstate
spec:
targetNamespaces:
- openshift-nmstate
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
labels:
operators.coreos.com/kubernetes-nmstate-operator.openshift-nmstate: ""
name: kubernetes-nmstate-operator
namespace: openshift-nmstate
spec:
channel: stable
installPlanApproval: Automatic
name: kubernetes-nmstate-operator
source: redhat-operators
sourceNamespace: openshift-marketplace
EOF
Then we can take the custom resource file and create it on the cluster.
$ oc create -f nmstate-operator.yaml
namespace/openshift-nmstate created
operatorgroup.operators.coreos.com/openshift-nmstate created
subscription.operators.coreos.com/kubernetes-nmstate-operator created
Next we should validate the operator is up and running.
$ oc get pods -n openshift-nmstate
NAME READY STATUS RESTARTS AGE
nmstate-operator-d587966c9-qkl5m 1/1 Running 0 43s
An NMState instance is required, so we will create a custom resource file for it.
$ cat <<EOF > nmstate-instance.yaml
apiVersion: nmstate.io/v1
kind: NMState
metadata:
name: nmstate
EOF
Then we will create the instance on the cluster.
$ oc create -f nmstate-instance.yaml
nmstate.nmstate.io/nmstate created
Finally we will validate the instance is running.
$ oc get pods -n openshift-nmstate
NAME READY STATUS RESTARTS AGE
nmstate-cert-manager-6dc78dc6bf-ds7kj 1/1 Running 0 17s
nmstate-console-plugin-5b7595c56c-tgzbw 1/1 Running 0 17s
nmstate-handler-lxkd5 1/1 Running 0 17s
nmstate-operator-d587966c9-qkl5m 1/1 Running 0 3m27s
nmstate-webhook-54dbd47d9d-cvsf6 0/1 Running 0 17s
Next we can build a NodeNetworkConfigurationPolicy. The example below will configure a static IP address on the ens8f0np0 interface on nvd-srv-32.
$ cat <<EOF > nncp-static-ip.yaml
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
name: ens8f0np0-policy
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
desiredState:
interfaces:
- name: ens8f0np0
description: Configuring ens8f0np0 on nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
type: ethernet
state: up
ipv4:
dhcp: false
address:
- ip: 10.6.145.32
prefix-length: 24
enabled: true
EOF
Once we have the custom resource file we can create it on the cluster.
$ oc create -f nncp-static-ip.yaml
nodenetworkconfigurationpolicy.nmstate.io/ens8f0np0-policy created
$ oc get nncp -A
NAME STATUS REASON
ens8f0np0-policy Available SuccessfullyConfigured
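The policy status rolls up the per-node enactments. If a policy ever reports Degraded or stays in progressing, the NodeNetworkConfigurationEnactment objects usually show which node failed and why; the enactment is typically named <node>.<policy>, so the exact name below is an assumption based on this lab:
$ oc get nnce
$ oc get nnce nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com.ens8f0np0-policy -o yaml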
We can validate that the IP address is set by looking at the interface inside the node.
$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-32nvidiaengrdu2dcredhatcom-debug-8mx6q ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.11
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ip address show dev ens8f0np0
96: ens8f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 58:a2:e1:e1:42:78 brd ff:ff:ff:ff:ff:ff
altname enp160s0f0np0
inet 10.6.145.32/24 brd 10.6.145.255 scope global noprefixroute ens8f0np0
valid_lft forever preferred_lft forever
inet6 fe80::c397:5afa:d618:e752/64 scope link noprefixroute
valid_lft forever preferred_lft forever
If everything looks good we can proceed to the next operator.
Install and Configure SRIOV Operator
Now we need to create the SRIOV Operator custom resource file to create the namespace, operator group and subscription.
$ cat << EOF > openshift-sriov-network-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
name: openshift-sriov-network-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
name: sriov-network-operators
namespace: openshift-sriov-network-operator
spec:
targetNamespaces:
- openshift-sriov-network-operator
upgradeStrategy: Default
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: sriov-network-operator-subscription
namespace: openshift-sriov-network-operator
spec:
channel: stable
installPlanApproval: Automatic
name: sriov-network-operator
source: redhat-operators
sourceNamespace: openshift-marketplace
EOF
Now we can create the SRIOV resource on the cluster.
$ oc create -f openshift-sriov-network-operator.yaml
namespace/openshift-sriov-network-operator created
operatorgroup.operators.coreos.com/sriov-network-operators created
subscription.operators.coreos.com/sriov-network-operator-subscription created
We can validate the operator is running by looking at the pod output.
$ oc get pods -n openshift-sriov-network-operator
NAME READY STATUS RESTARTS AGE
sriov-network-operator-7cb6c49868-89486 1/1 Running 0 22s
Next we will need to create the default SriovOperatorConfig configuration file.
$ cat <<EOF > sriov-operator-config.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
name: default
namespace: openshift-sriov-network-operator
spec:
enableInjector: true
enableOperatorWebhook: true
logLevel: 2
EOF
Then create the resource on the cluster.
$ oc create -f sriov-operator-config.yaml
sriovoperatorconfig.sriovnetwork.openshift.io/default created
For the default SriovOperatorConfig to work with the MLNX_OFED container, please run the following patch command.
$ oc patch sriovoperatorconfig default --type=merge -n openshift-sriov-network-operator --patch '{ "spec": { "configDaemonNodeSelector": { "network.nvidia.com/operator.mofed.wait": "false", "node-role.kubernetes.io/worker": "", "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true" } } }'
sriovoperatorconfig.sriovnetwork.openshift.io/default patched
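To confirm the patch took effect and that the config daemon only targets the intended workers, we can inspect the SriovOperatorConfig and then watch for the config daemon pods to appear on the worker nodes:
$ oc get sriovoperatorconfig default -n openshift-sriov-network-operator -o yaml | grep -A4 configDaemonNodeSelector
$ oc get pods -n openshift-sriov-network-operator -o wide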
If everything looks good we can proceed to installing the next operator.
Install and Configure Network Operator
To get started we will generate an NVIDIA Network Operator custom resource file that will create the namespace, operator group and subscription.
$ cat <<EOF > network-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
name: nvidia-network-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
name: nvidia-network-operator
namespace: nvidia-network-operator
spec:
targetNamespaces:
- nvidia-network-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: nvidia-network-operator
namespace: nvidia-network-operator
spec:
channel: v24.10.0
installPlanApproval: Automatic
name: nvidia-network-operator
source: certified-operators
sourceNamespace: openshift-marketplace
EOF
Next we can create the resources on the cluster.
$ oc create -f network-operator.yaml
namespace/nvidia-network-operator created
operatorgroup.operators.coreos.com/nvidia-network-operator created
subscription.operators.coreos.com/nvidia-network-operator created
We can then validate that the network operator has installed and is running by confirming the controller is running in the nvidia-network-operator
namespace.
$ oc get pods -n nvidia-network-operator
NAME READY STATUS RESTARTS AGE
nvidia-network-operator-controller-manager-6f7d6956cd-fw5wg 1/1 Running 0 5m
With the operator up we can create the NicClusterPolicy custom resource file. Note that in this file I have hard-coded the InfiniBand interface ibs2f0 and the Ethernet interface ens8f0np0 that I will be using as my shared RDMA devices. From what I have experienced, both cannot be defined in the policy at the same time; both are shown here only to illustrate that either an Ethernet or an InfiniBand interface can be used. These could be different devices depending on the system configuration.
$ cat <<EOF > network-sharedrdma-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
name: nic-cluster-policy
spec:
nicFeatureDiscovery:
image: nic-feature-discovery
repository: ghcr.io/mellanox
version: v0.0.1
docaTelemetryService:
image: doca_telemetry
repository: nvcr.io/nvidia/doca
version: 1.16.5-doca2.6.0-host
rdmaSharedDevicePlugin:
config: |
{
"configList": [
{
"resourceName": "rdma_shared_device_ib",
"rdmaHcaMax": 63,
"selectors": {
"ifNames": ["ibs2f0"]
}
},
{
"resourceName": "rdma_shared_device_eth",
"rdmaHcaMax": 63,
"selectors": {
"ifNames": ["ens8f0np0"]
}
}
]
}
image: k8s-rdma-shared-dev-plugin
repository: ghcr.io/mellanox
version: v1.5.1
secondaryNetwork:
ipoib:
image: ipoib-cni
repository: ghcr.io/mellanox
version: v1.2.0
nvIpam:
enableWebhook: false
image: nvidia-k8s-ipam
repository: ghcr.io/mellanox
version: v0.2.0
ofedDriver:
readinessProbe:
initialDelaySeconds: 10
periodSeconds: 30
forcePrecompiled: false
terminationGracePeriodSeconds: 300
livenessProbe:
initialDelaySeconds: 30
periodSeconds: 30
upgradePolicy:
autoUpgrade: true
drain:
deleteEmptyDir: true
enable: true
force: true
timeoutSeconds: 300
podSelector: ''
maxParallelUpgrades: 1
safeLoad: false
waitForCompletion:
timeoutSeconds: 0
startupProbe:
initialDelaySeconds: 10
periodSeconds: 20
image: doca-driver
repository: nvcr.io/nvidia/mellanox
version: 24.10-0.7.0.0-0
env:
- name: UNLOAD_STORAGE_MODULES
value: "true"
- name: RESTORE_DRIVER_ON_POD_TERMINATION
value: "true"
- name: CREATE_IFNAMES_UDEV
value: "true"
EOF
Next we can create the NicClusterPolicy
custom resource on the cluster.
$ oc create -f network-sharedrdma-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created
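The DOCA driver build and rollout can take a while. Before inspecting individual pods, we can poll the policy status itself; recent operator versions expose a state field on the custom resource (treat the exact field name as an assumption for your version):
$ oc get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}{"\n"}'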
We can validate the NicClusterPolicy
by running a few commands in the DOCA/MOFED container.
$ oc get pods -n nvidia-network-operator
NAME READY STATUS RESTARTS AGE
doca-telemetry-service-hwj65 1/1 Running 2 160m
kube-ipoib-cni-ds-fsn8g 1/1 Running 2 160m
mofed-rhcos4.16-9b5ddf4c6-ds-ct2h5 2/2 Running 4 160m
nic-feature-discovery-ds-dtksz 1/1 Running 2 160m
nv-ipam-controller-854585f594-c5jpp 1/1 Running 2 160m
nv-ipam-controller-854585f594-xrnp5 1/1 Running 2 160m
nv-ipam-node-xqttl 1/1 Running 2 160m
nvidia-network-operator-controller-manager-5798b564cd-5cq99 1/1 Running 2 5d23h
rdma-shared-dp-ds-p9vvg 1/1 Running 0 85m
And we can rsh
into the mofed
container to check a few things.
$ MOFED_POD=$(oc get pods -n nvidia-network-operator -o name | grep mofed)
$ oc rsh -n nvidia-network-operator -c mofed-container ${MOFED_POD}
sh-5.1# ofed_info -s
OFED-internal-24.10-0.7.0.0-0:
sh-5.1# ibdev2netdev -v
0000:0d:00.0 mlx5_0 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket fw 32.42.1000 port 1 (ACTIVE) ==> ibs2f0 (Up)
0000:a0:00.0 mlx5_1 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket fw 32.42.1000 port 1 (ACTIVE) ==> ens8f0np0 (Up)
Now we need to create an IPoIBNetwork custom resource file (for InfiniBand-based interfaces).
$ cat <<EOF > ipoib-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
name: example-ipoibnetwork
spec:
ipam: |
{
"type": "whereabouts",
"range": "192.168.6.225/28",
"exclude": [
"192.168.6.229/30",
"192.168.6.236/32"
]
}
master: ibs2f0
networkNamespace: default
EOF
And then create the IPoIBNetwork
resource on the cluster.
$ oc create -f ipoib-network.yaml
ipoibnetwork.mellanox.com/example-ipoibnetwork created
We will do the same thing for our ethernet interface but this will be a MacvlanNetwork
custom resource file.
$ cat <<EOF > macvlan-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
name: rdmashared-net
spec:
networkNamespace: default
master: ens8f0np0
mode: bridge
mtu: 1500
ipam: '{"type": "whereabouts", "range": "192.168.2.0/24", "gateway": "192.168.2.1"}'
EOF
Then create the resource on the cluster.
$ oc create -f macvlan-network.yaml
macvlannetwork.mellanox.com/rdmashared-net created
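Both the IPoIBNetwork and MacvlanNetwork custom resources render NetworkAttachmentDefinition objects in the target namespace, and those are what the pod annotations reference later. A quick check that both exist:
$ oc get network-attachment-definitions -n default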
If everything looks good we can proceed to the next operator.
Install and Configure GPU Operator
The next operator we need to configure is the NVIDIA GPU Operator. As with most operators, we will generate a custom resource file that creates the namespace, operator group and subscription.
$ cat <<EOF > gpu-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
name: nvidia-gpu-operator
namespace: nvidia-gpu-operator
spec:
targetNamespaces:
- nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: nvidia-gpu-operator
namespace: nvidia-gpu-operator
spec:
channel: "v24.9"
installPlanApproval: Automatic
name: gpu-operator-certified
source: certified-operators
sourceNamespace: openshift-marketplace
EOF
Next we can create the resources on the cluster.
$ oc create -f gpu-operator.yaml
namespace/nvidia-gpu-operator created
operatorgroup.operators.coreos.com/nvidia-gpu-operator created
subscription.operators.coreos.com/nvidia-gpu-operator created
We can check that the operator pod is running by looking at the pods under the namespace.
$ oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-operator-b4cb7d74-zxpwq 1/1 Running 0 32s
Now that we have the operator running we need to create a GPU cluster policy custom resource file like the one below.
$ cat <<EOF > gpu-cluster-policy.yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
name: gpu-cluster-policy
spec:
vgpuDeviceManager:
config:
default: default
enabled: true
migManager:
config:
default: all-disabled
name: default-mig-parted-config
enabled: true
operator:
defaultRuntime: crio
initContainer: {}
runtimeClass: nvidia
use_ocp_driver_toolkit: true
dcgm:
enabled: true
gfd:
enabled: true
dcgmExporter:
config:
name: ''
serviceMonitor:
enabled: true
enabled: true
cdi:
default: false
enabled: false
driver:
licensingConfig:
nlsEnabled: true
configMapName: ''
certConfig:
name: ''
rdma:
enabled: true
kernelModuleConfig:
name: ''
upgradePolicy:
autoUpgrade: true
drain:
deleteEmptyDir: false
enable: false
force: false
timeoutSeconds: 300
maxParallelUpgrades: 1
maxUnavailable: 25%
podDeletion:
deleteEmptyDir: false
force: false
timeoutSeconds: 300
waitForCompletion:
timeoutSeconds: 0
repoConfig:
configMapName: ''
virtualTopology:
config: ''
enabled: true
useNvidiaDriverCRD: false
useOpenKernelModules: true
devicePlugin:
config:
name: ''
default: ''
mps:
root: /run/nvidia/mps
enabled: true
gdrcopy:
enabled: true
kataManager:
config:
artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
mig:
strategy: single
sandboxDevicePlugin:
enabled: true
validator:
plugin:
env:
- name: WITH_WORKLOAD
value: 'false'
nodeStatusExporter:
enabled: true
daemonsets:
rollingUpdate:
maxUnavailable: '1'
updateStrategy: RollingUpdate
sandboxWorkloads:
defaultWorkload: container
enabled: false
gds:
enabled: true
image: nvidia-fs
version: 2.20.5
repository: nvcr.io/nvidia/cloud-native
vgpuManager:
enabled: false
vfioManager:
enabled: true
toolkit:
installDir: /usr/local/nvidia
enabled: true
EOF
With the GPU ClusterPolicy custom resource file generated, let's create it on the cluster.
$ oc create -f gpu-cluster-policy.yaml
clusterpolicy.nvidia.com/gpu-cluster-policy created
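The driver build and validation can take several minutes. Rather than repeatedly listing pods, we can also watch the ClusterPolicy state, which flips to ready once all components are deployed. A minimal sketch, assuming a reasonably recent oc client for the jsonpath wait:
$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'
$ oc wait clusterpolicy/gpu-cluster-policy --for=jsonpath='{.status.state}'=ready --timeout=30m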
After some time, all the pods are up and running.
$ oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-d5ngn 1/1 Running 0 3m20s
gpu-feature-discovery-z42rx 1/1 Running 0 3m23s
gpu-operator-6bb4d4b4c5-njh78 1/1 Running 0 4m35s
nvidia-container-toolkit-daemonset-bkh8l 1/1 Running 0 3m20s
nvidia-container-toolkit-daemonset-c4hzm 1/1 Running 0 3m23s
nvidia-cuda-validator-4blvg 0/1 Completed 0 106s
nvidia-cuda-validator-tw8sl 0/1 Completed 0 112s
nvidia-dcgm-exporter-rrw4g 1/1 Running 0 3m20s
nvidia-dcgm-exporter-xc78t 1/1 Running 0 3m23s
nvidia-dcgm-nvxpf 1/1 Running 0 3m20s
nvidia-dcgm-snj4j 1/1 Running 0 3m23s
nvidia-device-plugin-daemonset-fk2xz 1/1 Running 0 3m23s
nvidia-device-plugin-daemonset-wq87j 1/1 Running 0 3m20s
nvidia-driver-daemonset-416.94.202410211619-0-ngrjg 4/4 Running 0 3m58s
nvidia-driver-daemonset-416.94.202410211619-0-tm4x6 4/4 Running 0 3m58s
nvidia-node-status-exporter-jlzxh 1/1 Running 0 3m57s
nvidia-node-status-exporter-zjffs 1/1 Running 0 3m57s
nvidia-operator-validator-l49hx 1/1 Running 0 3m20s
nvidia-operator-validator-n44nn 1/1 Running 0 3m23s
Once we see the pods running above, we can remote shell into the NVIDIA driver daemonset pod and confirm two items. The first is that the nvidia modules are loaded, specifically that the nvidia_peermem module is present. The second is that the nvidia-smi utility shows the details about the driver and the hardware.
$ oc rsh -n nvidia-gpu-operator $(oc -n nvidia-gpu-operator get pod -o name -l app.kubernetes.io/component=nvidia-driver)
sh-4.4# lsmod|grep nvidia
nvidia_fs 327680 0
nvidia_peermem 24576 0
nvidia_modeset 1507328 0
video 73728 1 nvidia_modeset
nvidia_uvm 6889472 8
nvidia 8810496 43 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
ib_uverbs 217088 3 nvidia_peermem,rdma_ucm,mlx5_ib
drm 741376 5 drm_kms_helper,drm_shmem_helper,nvidia,mgag200
sh-4.4# nvidia-smi
Wed Nov 6 22:03:53 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A40 On | 00000000:61:00.0 Off | 0 |
| 0% 37C P0 88W / 300W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A40 On | 00000000:E1:00.0 Off | 0 |
| 0% 28C P8 29W / 300W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
While we are in the driver pod we should also set the GPU clock to maximum using the following nvidia-smi
command. This is optional but why not have it at full speed.
$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-416.94.202410172137-0-ndhzc
sh-4.4# nvidia-smi -i 0 -lgc $(nvidia-smi -i 0 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1)
GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:61:00.0
All done.
sh-4.4# nvidia-smi -i 1 -lgc $(nvidia-smi -i 1 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1)
GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:E1:00.0
All done.
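On nodes with more GPUs, the same thing can be done in a small loop rather than per index; a minimal sketch using only nvidia-smi queries:
sh-4.4# for i in $(nvidia-smi --query-gpu=index --format=csv,noheader); do nvidia-smi -i $i -lgc $(nvidia-smi -i $i --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1); done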
One last thing we can do is validate our resources are available from a node describe perspective.
$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A9
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596712Ki
nvidia.com/gpu: 2
pods: 250
rdma/rdma_shared_device_eth: 63
rdma/rdma_shared_device_ib: 63
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445736Ki
nvidia.com/gpu: 2
pods: 250
rdma/rdma_shared_device_eth: 63
rdma/rdma_shared_device_ib: 63
--
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596672Ki
nvidia.com/gpu: 2
pods: 250
rdma/rdma_shared_device_eth: 63
rdma/rdma_shared_device_ib: 63
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445696Ki
nvidia.com/gpu: 2
pods: 250
rdma/rdma_shared_device_eth: 63
rdma/rdma_shared_device_ib: 63
If everything looks good we can proceed to actual RDMA testing.
The Shared Device RDMA Testing
This section will cover running workload pods across the nodes in the environment. We will set up the required privileges, create the workload pods, validate connectivity between the two hosts on the InfiniBand fabric, and then run a performance test.
Create Service Account
First let's generate a service account custom resource file to use in the default namespace.
$ cat <<EOF > default-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: rdma
namespace: default
EOF
Next we can create it on our cluster.
$ oc create -f default-serviceaccount.yaml
serviceaccount/rdma created
Finally, with the service account created, we can add privileges to it.
$ oc -n default adm policy add-scc-to-user privileged -z rdma
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "rdma"
If everything looks good we can move onto creating the workload pods.
Create Workload Pods for IB
With the service account setup we now need to create a workload pod that contains all the tooling for our testing. We can generate a custom pod resource file for each worker node as follows to meet that requirement.
$ cat <<EOF > rdma-ib-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
name: rdma-ib-32-workload
namespace: default
annotations:
k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
serviceAccountName: rdma
containers:
- image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
name: rdma-ib-32-workload
command:
- sh
- -c
- sleep inf
securityContext:
privileged: true
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/gpu: 1
rdma/rdma_shared_device_ib: 1
requests:
nvidia.com/gpu: 1
rdma/rdma_shared_device_ib: 1
EOF
$ cat <<EOF > rdma-ib-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
name: rdma-ib-33-workload
namespace: default
annotations:
k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
serviceAccountName: rdma
containers:
- image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
name: rdma-ib-33-workload
command:
- sh
- -c
- sleep inf
securityContext:
privileged: true
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/gpu: 1
rdma/rdma_shared_device_ib: 1
requests:
nvidia.com/gpu: 1
rdma/rdma_shared_device_ib: 1
EOF
Then we can create the pods on the cluster.
$ oc create -f rdma-ib-32-workload.yaml
pod/rdma-ib-32-workload created
$ oc create -f rdma-ib-33-workload.yaml
pod/rdma-ib-33-workload created
Let's validate the pods are running.
$ oc get pods
NAME READY STATUS RESTARTS AGE
rdma-ib-32-workload 1/1 Running 0 10s
rdma-ib-33-workload 1/1 Running 0 3s
With the pods up and running we can validate connectivity.
Validate IB Connectivity
This section will cover confirming that the InfiniBand connectivity is working between the systems. It is optional, but it provides a lot of good InfiniBand troubleshooting tips. First we should rsh into each rdma-ib workload pod.
$ oc rsh -n default rdma-ib-32-workload
sh-5.1#
The first command we can run is the ibhosts command, which shows the InfiniBand host nodes in the topology.
sh-5.1# ibhosts
Ca : 0x58a2e10300e14446 ports 1 "nvd-srv-33 mlx5_0"
Ca : 0x58a2e10300dfe416 ports 1 "nvd-srv-32 mlx5_0"
We can also run the ibnodes
command which will show not only the nodes but also switches in the topology.
sh-5.1# ibnodes
Ca : 0x58a2e10300e14446 ports 1 "nvd-srv-33 mlx5_0"
Ca : 0x58a2e10300dfe416 ports 1 "nvd-srv-32 mlx5_0"
Switch : 0xfc6a1c0300e7ecc0 ports 129 "MF0;qm9700-ib:MQM9700/U1" enhanced port 0 lid 1 lmc 0
We can look deeper into an interface's state by using the ibstatus command and passing an interface name. If no interface is passed, all interfaces are displayed.
sh-5.1# ibstatus mlx5_0
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:58a2:e103:00df:e416
base lid: 0x4
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 400 Gb/sec (4X NDR)
link_layer: InfiniBand
Now that we have familiarized ourselves with the environment, we can run ibstat and grep out only certain key elements of the output. These will be needed for the ibping test.
The first ibstat output is from our first node, which will act as the server side for the ibping command.
sh-5.1# ibstat | egrep "Port|Base|Link"
Port 1:
Physical state: LinkUp
Base lid: 4
Port GUID: 0x58a2e10300e14446
Link layer: InfiniBand
Port 1:
Physical state: LinkUp
Base lid: 0
Port GUID: 0x0000000000000000
Link layer: Ethernet
The output above shows both an InfiniBand and an Ethernet interface. We are only interested in the InfiniBand interface in this use case. Make note of the Base lid number, as that is used in the ibping command on the client side.
We can run the same command on the client side and notice that while some of the details are similar, the lid number is unique along with the port GUID.
sh-5.1# ibstat | egrep "Port|Base|Link"
Port 1:
Physical state: LinkUp
Base lid: 5
Port GUID: 0x58a2e10300e14446
Link layer: InfiniBand
Port 1:
Physical state: LinkUp
Base lid: 0
Port GUID: 0x0000000000000000
Link layer: Ethernet
Next we can run ibping with the server switch (-S) on the first workload pod.
sh-5.1# ibping -S -P 1 -d
ibdebug: [114] ibping_serv: starting to serve...
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
And on the second workload pod we can run an ibping command to ping the server side we started on the other pod.
sh-5.1# ibping -P 1 4
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.011 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.014 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Once we have completed confirming connectivity we can move onto the performance testing.
Performance Test Across IB Link
Now we want to run a test across the two pods running. We will need to rsh into the first pod and run the ib_write_bw
command. Then we will rsh into the second pod in a different terminal window and run the ib_write_bw <ipaddress>
command.
$ oc get pods -n default
NAME READY STATUS RESTARTS AGE
rdma-ib-32-workload 1/1 Running 0 8m12s
rdma-ib-33-workload 1/1 Running 0 8m5s
First let's get the IP address of the first pod.
$ oc get pod rdma-ib-32-workload -o yaml | grep -E 'default/example-ipoibnetwork' -A3
"name": "default/example-ipoibnetwork",
"interface": "net1",
"ips": [
"192.168.6.225"
Now rsh
into the first pod and run the ib_write_bw
command and leave that terminal open.
$ oc rsh -n default rdma-ib-32-workload
sh-5.1# ib_write_bw
************************************
* Waiting for client to connect... *
************************************
Then open another terminal and rsh
to the second pod and run ib_write_bw 192.168.6.225
.
$ oc rsh -n default rdma-ib-33-workload
sh-5.1# ib_write_bw 192.168.6.225
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x05 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007fcbace2f000
remote address: LID 0x04 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007f360e3d8000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 2500.000000 != 3495.887000. CPU Frequency is not max.
65536 5000 44604.62 44576.86 0.713230
---------------------------------------------------------------------------------------
If we go back to the first terminal on pod number one we should also see similar response results.
sh-5.1# ib_write_bw
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x04 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007f360e3d8000
remote address: LID 0x05 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007fcbace2f000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 5000 44604.62 44576.86 0.713230
---------------------------------------------------------------------------------------
We can now clean up the pods since testing is over and move onto the next test.
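A minimal cleanup, assuming nothing else is using these pods, is simply to delete them:
$ oc delete pod rdma-ib-32-workload rdma-ib-33-workload -n default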
Create Workload Pods for ETH
Now we need to test RDMA over the Ethernet interfaces. We can generate a custom pod resource file for both nodes as follows to meet that requirement.
$ cat <<EOF > rdma-eth-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
name: rdma-eth-32-workload
namespace: default
annotations:
k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
serviceAccountName: rdma
containers:
- image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
name: rdma-eth-32-workload
command:
- sh
- -c
- sleep inf
securityContext:
privileged: true
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/gpu: 1
rdma/rdma_shared_device_eth: 1
requests:
nvidia.com/gpu: 1
rdma/rdma_shared_device_eth: 1
EOF
$ cat <<EOF > rdma-eth-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
name: rdma-eth-33-workload
namespace: default
annotations:
k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
serviceAccountName: rdma
containers:
- image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
name: rdma-eth-33-workload
command:
- sh
- -c
- sleep inf
securityContext:
privileged: true
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/gpu: 1
rdma/rdma_shared_device_eth: 1
requests:
nvidia.com/gpu: 1
rdma/rdma_shared_device_eth: 1
EOF
Then we can create the pods on the cluster.
$ oc create -f rdma-eth-32-workload.yaml
pod/rdma-eth-32-workload created
$ oc create -f rdma-eth-33-workload.yaml
pod/rdma-eth-33-workload created
Let's validate the pods are running.
$ oc get pods -n default
NAME READY STATUS RESTARTS AGE
rdma-eth-32-workload 1/1 Running 0 25s
rdma-eth-33-workload 1/1 Running 0 22s
With the pods up and running we can move onto the actual test.
Performance Test Across ETH Link
Now we want to run a test across the two pods running. We will need to rsh into the first pod and run the ib_write_bw
command. Then we will rsh into the second pod in a different terminal window and run the ib_write_bw <ipaddress>
command.
$ oc get pods -n default
NAME READY STATUS RESTARTS AGE
rdma-eth-32-workload 1/1 Running 0 106s
rdma-eth-33-workload 1/1 Running 0 103s
First let's get the IP address of the first pod.
$ oc get pod rdma-eth-32-workload -o yaml | grep -E 'default/rdmashared' -A3
"name": "default/rdmashared-net",
"interface": "net1",
"ips": [
"192.168.2.1"
Now rsh
into the first pod and run the ib_write_bw
command and leave that terminal open.
$ oc rsh -n default rdma-eth-32-workload
sh-5.1# ib_write_bw
************************************
* Waiting for client to connect... *
************************************
Then open another terminal and rsh to the second pod and run ib_write_bw 192.168.2.1.
$ oc rsh -n default rdma-eth-33-workload
sh-5.1# ib_write_bw 192.168.2.1
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x05 QPN 0x0ce2 PSN 0x5389f7 RKey 0x1fff00 VAddr 0x007f7368df3000
remote address: LID 0x04 QPN 0x0ce2 PSN 0x81fa7f RKey 0x1fff00 VAddr 0x007f7e8c890000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 2500.000000 != 3497.359000. CPU Frequency is not max.
65536 5000 44490.32 44467.35 0.711478
---------------------------------------------------------------------------------------
If we go back to the first terminal on pod number one we should also see similar response results.
sh-5.1# ib_write_bw
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x04 QPN 0x0ce2 PSN 0x81fa7f RKey 0x1fff00 VAddr 0x007f7e8c890000
remote address: LID 0x05 QPN 0x0ce2 PSN 0x5389f7 RKey 0x1fff00 VAddr 0x007f7368df3000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 5000 44490.32 44467.35 0.711478
---------------------------------------------------------------------------------------
We can now clean up the pods since testing is over and move onto the next test.
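As before, cleanup is just a matter of deleting the two workload pods:
$ oc delete pod rdma-eth-32-workload rdma-eth-33-workload -n default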
The Host Device RDMA Testing
This section will demonstrate how to configure host device RDMA for the NVIDIA Network Operator and then how to test the per-pod configuration.
Configure Nic Cluster Policy for Host Device
The operator should already be running from the previous steps. If a NicClusterPolicy exists, we need to delete the existing one and generate a new host device NicClusterPolicy custom resource file.
$ cat <<EOF > network-hostdev-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
name: nic-cluster-policy
spec:
ofedDriver:
image: doca-driver
repository: nvcr.io/nvidia/mellanox
version: 24.10-0.7.0.0-0
startupProbe:
initialDelaySeconds: 10
periodSeconds: 20
livenessProbe:
initialDelaySeconds: 30
periodSeconds: 30
readinessProbe:
initialDelaySeconds: 10
periodSeconds: 30
env:
- name: UNLOAD_STORAGE_MODULES
value: "true"
- name: RESTORE_DRIVER_ON_POD_TERMINATION
value: "true"
- name: CREATE_IFNAMES_UDEV
value: "true"
sriovDevicePlugin:
image: sriov-network-device-plugin
repository: ghcr.io/k8snetworkplumbingwg
version: v3.7.0
config: |
{
"resourceList": [
{
"resourcePrefix": "nvidia.com",
"resourceName": "hostdev",
"selectors": {
"vendors": ["15b3"],
"isRdma": true
}
}
]
}
EOF
Next we can create the NicClusterPolicy
custom resource on the cluster.
$ oc create -f network-hostdev-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created
We can validate the host device NicClusterPolicy
by running a few commands in the DOCA/MOFED container.
$ oc get pods -n nvidia-network-operator
NAME READY STATUS RESTARTS AGE
mofed-rhcos4.16-696886fcb4-ds-9sgvd 2/2 Running 0 2m37s
mofed-rhcos4.16-696886fcb4-ds-lkjd4 2/2 Running 0 2m37s
nvidia-network-operator-controller-manager-68d547dbbd-qsdkf 1/1 Running 0 141m
sriov-device-plugin-6v2nz 1/1 Running 0 2m14s
sriov-device-plugin-hc4t8 1/1 Running 0 2m14s
We can also confirm that the resources show up in the oc describe node output.
$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A7
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596708Ki
nvidia.com/hostdev: 2
pods: 250
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445732Ki
nvidia.com/hostdev: 2
pods: 250
--
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596704Ki
nvidia.com/hostdev: 2
pods: 250
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445728Ki
nvidia.com/hostdev: 2
pods: 250
Now we need to create a HostDeviceNetwork
custom resource file.
$ cat <<EOF > hostdev-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
name: hostdev-net
spec:
networkNamespace: "default"
resourceName: "hostdev"
ipam: |
{
"type": "whereabouts",
"range": "192.168.3.225/28",
"exclude": [
"192.168.3.229/30",
"192.168.3.236/32"
]
}
EOF
And then create the HostDeviceNetwork
resource on the cluster.
$ oc create -f hostdev-network.yaml
hostdevicenetwork.mellanox.com/hostdev-net created
Let's validate our resources are showing up properly.
$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A8
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596708Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 2
pods: 250
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445732Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 2
pods: 250
--
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596680Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 2
pods: 250
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445704Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 2
pods: 250
This concludes the NicClusterPolicy configuration for the host device section.
Create Workload Pods and Perf Test Host Device
Now we need to create a workload pod that contains all the tooling for our host device testing. We can generate a custom pod file for each node as follows to meet that requirement.
$ cat << EOF > hostdev-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
name: hostdev-32-workload
namespace: default
annotations:
k8s.v1.cni.cncf.io/networks: hostdev-net
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
serviceAccountName: rdma
containers:
- image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
name: hostdev-32-workload
command:
- sh
- -c
- sleep inf
securityContext:
privileged: true
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/gpu: 1
nvidia.com/hostdev: 1
requests:
nvidia.com/gpu: 1
nvidia.com/hostdev: 1
EOF
$ cat <<EOF > hostdev-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
name: hostdev-33-workload
namespace: default
annotations:
k8s.v1.cni.cncf.io/networks: hostdev-net
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
serviceAccountName: rdma
containers:
- image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
name: hostdev-33-workload
command:
- sh
- -c
- sleep inf
securityContext:
privileged: true
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/gpu: 1
nvidia.com/hostdev: 1
requests:
nvidia.com/gpu: 1
nvidia.com/hostdev: 1
EOF
Then we can create the pods on the cluster.
$ oc create -f hostdev-32-workload.yaml
pod/hostdev-32-workload created
$ oc create -f hostdev-33-workload.yaml
pod/hostdev-33-workload created
Let's validate the pods are running.
$ oc get pods -n default
NAME READY STATUS RESTARTS AGE
hostdev-32-workload 1/1 Running 0 73s
hostdev-33-workload 1/1 Running 0 12s
First let's get the IP address of the first pod.
$ oc get pod hostdev-32-workload -o yaml | grep -E 'default/hostdev-net' -A3
"name": "default/hostdev-net",
"interface": "net1",
"ips": [
"192.168.3.225"
Now rsh
into the first pod and run the ib_write_bw
command and leave that terminal open.
$ oc rsh -n default hostdev-32-workload
sh-5.1# ib_write_bw
************************************
* Waiting for client to connect... *
************************************
Then open another terminal, rsh into the second pod, and run ib_write_bw 192.168.3.225.
$ oc rsh -n default hostdev-33-workload
sh-5.1# ib_write_bw 192.168.3.225
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x05 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007fe688c97000
remote address: LID 0x04 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007f1f0249d000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 2500.000000 != 3498.323000. CPU Frequency is not max.
65536 5000 44351.41 44328.98 0.709264
---------------------------------------------------------------------------------------
If we go back to the first terminal on pod one, we should see similar results on the server side.
sh-5.1# ib_write_bw
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x04 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007f1f0249d000
remote address: LID 0x05 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007fe688c97000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 5000 44351.41 44328.98 0.709264
---------------------------------------------------------------------------------------
We can now clean up the pods since this test is over and move on to the next test.
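For example, the workload pods can be removed using the manifests we generated earlier:
$ oc delete -f hostdev-32-workload.yaml
$ oc delete -f hostdev-33-workload.yaml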
SRIOV Legacy Mode RDMA Testing
This deployment mode supports SR-IOV in legacy mode.
Configure Nic Cluster Policy for SRIOV Legacy
First we need to create a NicClusterPolicy
which for SRIOV legacy mode is fairly generic. Generate the custom resource file below. If a NicClusterPolicy already exists on the cluster, remove it first.
$ cat <<EOF > network-sriovleg-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
name: nic-cluster-policy
spec:
ofedDriver:
image: doca-driver
repository: nvcr.io/nvidia/mellanox
version: 24.10-0.7.0.0-0
startupProbe:
initialDelaySeconds: 10
periodSeconds: 20
livenessProbe:
initialDelaySeconds: 30
periodSeconds: 30
readinessProbe:
initialDelaySeconds: 10
periodSeconds: 30
env:
- name: UNLOAD_STORAGE_MODULES
value: "true"
- name: RESTORE_DRIVER_ON_POD_TERMINATION
value: "true"
- name: CREATE_IFNAMES_UDEV
value: "true"
EOF
Now let's create the policy on the cluster.
$ oc create -f network-sriovleg-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created
Before we continue we can validate the pods are up.
$ oc get pods -n nvidia-network-operator
NAME READY STATUS RESTARTS AGE
mofed-rhcos4.16-696886fcb4-ds-4mb42 2/2 Running 0 40s
mofed-rhcos4.16-696886fcb4-ds-8knwq 2/2 Running 0 40s
nvidia-network-operator-controller-manager-68d547dbbd-qsdkf 1/1 Running 13 (4d ago) 4d21h
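We can also confirm the policy itself has reconciled. In the operator versions I tested, the NicClusterPolicy exposes an overall state in its status (field names may differ slightly between releases):
$ oc get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}{"\n"}'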
Now we need to create a SriovNetworkNodePolicy
which will generate the VFs for the device we want to operate in SRIOV legacy mode. Generate the custom resource file below. Note the pfNames value ens8f0np0#0-7, which partitions the physical function and exposes VFs 0 through 7 as the sriovlegacy resource.
$ cat <<EOF > sriov-network-node-policy.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
name: sriov-legacy-policy
namespace: openshift-sriov-network-operator
spec:
deviceType: netdevice
mtu: 1500
nicSelector:
vendor: "15b3"
pfNames: ["ens8f0np0#0-7"]
nodeSelector:
feature.node.kubernetes.io/pci-15b3.present: "true"
numVfs: 8
priority: 90
isRdma: true
resourceName: sriovlegacy
EOF
Next we can create the custom resource on the cluster. As a note, make sure SR-IOV Global Enable is enabled, as described in the Red Hat knowledge base article.
$ oc create -f sriov-network-node-policy.yaml
sriovnetworknodepolicy.sriovnetwork.openshift.io/sriov-legacy-policy created
The nodes will go through a reboot process: each one has scheduling disabled and is rebooted so the configuration takes effect.
$ oc get nodes
NAME STATUS ROLES AGE VERSION
edge-19.edge.lab.eng.rdu2.redhat.com Ready control-plane,master,worker 5d v1.29.8+632b078
nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com Ready worker 4d22h v1.29.8+632b078
nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com NotReady,SchedulingDisabled worker 4d22h v1.29.8+632b078
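While waiting, the SR-IOV Network Operator reports per-node progress through SriovNetworkNodeState objects, which is a convenient way to watch the rollout. A quick sketch (syncStatus should move from InProgress to Succeeded):
$ oc get sriovnetworknodestates -n openshift-sriov-network-operator \
    -o custom-columns=NODE:.metadata.name,SYNC:.status.syncStatus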
Once the nodes have rebooted we can validate that the VF interfaces were created by opening a debug pod on each node.
$ oc debug node/nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-33nvidiaengrdu2dcredhatcom-debug-cqfjz ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.12
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ip link show | grep ens8
26: ens8f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
42: ens8f0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
43: ens8f0v1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
44: ens8f0v2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
45: ens8f0v3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
46: ens8f0v4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
47: ens8f0v5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
48: ens8f0v6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
49: ens8f0v7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
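Still inside the debug shell, the VFs can also be viewed from the parent PF, which shows each VF index and its MAC assignment (assuming the PF name ens8f0np0 from the output above):
sh-5.1# ip link show ens8f0np0
Each of the eight VFs should appear as a vf N line under the PF.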
We can repeat the same steps on the second node for completeness.
We can also confirm via the node capabilities output.
$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A8
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596692Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 0
openshift.io/sriovlegacy: 8
--
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445716Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 0
openshift.io/sriovlegacy: 8
--
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596688Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 0
openshift.io/sriovlegacy: 8
--
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445712Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 0
openshift.io/sriovlegacy: 8
Now that the VFs for SRIOV legacy mode are in place we can generate the SriovNetwork
custom resource file.
$ cat <<EOF > sriov-network.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
name: sriov-network
namespace: openshift-sriov-network-operator
spec:
vlan: 0
networkNamespace: "default"
resourceName: "sriovlegacy"
ipam: |
{
"type": "whereabouts",
"range": "192.168.3.225/28",
"exclude": [
"192.168.3.229/30",
"192.168.3.236/32"
]
}
EOF
Then we can create the custom resource on the cluster.
$ oc create -f sriov-network.yaml
sriovnetwork.sriovnetwork.openshift.io/sriov-network created
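Behind the scenes the SR-IOV operator renders a NetworkAttachmentDefinition into the target namespace, so a quick way to confirm the network is ready for pods to attach to is:
$ oc get network-attachment-definitions -n default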
This completes the SRIOV legacy network configuration.
Create Workload and Perf Test SRIOV Legacy
Now we need to create a workload pod that contains all the tooling for our SRIOV legacy testing. We can generate a custom pod file for each node as follows to meet that requirement.
$ cat << EOF > sriovlegacy-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
name: sriovlegacy-32-workload
namespace: default
annotations:
k8s.v1.cni.cncf.io/networks: sriov-network
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
serviceAccountName: rdma
containers:
- image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
name: sriovlegacy-32-workload
command:
- sh
- -c
- sleep inf
securityContext:
privileged: true
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/gpu: 1
openshift.io/sriovlegacy: 1
requests:
nvidia.com/gpu: 1
openshift.io/sriovlegacy: 1
EOF
$ cat <<EOF > sriovlegacy-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
name: sriovlegacy-33-workload
namespace: default
annotations:
k8s.v1.cni.cncf.io/networks: sriov-network
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
serviceAccountName: rdma
containers:
- image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
name: sriovlegacy-33-workload
command:
- sh
- -c
- sleep inf
securityContext:
privileged: true
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/gpu: 1
openshift.io/sriovlegacy: 1
requests:
nvidia.com/gpu: 1
openshift.io/sriovlegacy: 1
EOF
Then we can create the pods on the cluster.
$ oc create -f sriovlegacy-32-workload.yaml
pod/sriovlegacy-32-workload created
$ oc create -f sriovlegacy-33-workload.yaml
pod/sriovlegacy-33-workload created
Let's validate the pods are running.
$ oc get pods -n default
NAME READY STATUS RESTARTS AGE
sriovlegacy-32-workload 1/1 Running 0 73s
sriovlegacy-33-workload 1/1 Running 0 12s
First let's get the IP address of the first pod.
$ oc get pod sriovlegacy-32-workload -o yaml | grep -E 'default/sriov-network' -A3
"name": "default/sriov-network",
"interface": "net1",
"ips": [
"192.168.3.225"
Now rsh into the first pod, start ib_write_bw in server mode as in the previous section, and leave that terminal open. Then open another terminal, rsh into the second pod, and run ib_write_bw against the first pod's IP address (192.168.3.225).
$ oc rsh sriovlegacy-33-workload
sh-5.1# ib_write_bw 192.168.3.225
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x05 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f397ace8000
remote address: LID 0x04 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f0eeefac000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 2500.000000 != 3491.228000. CPU Frequency is not max.
65536 5000 44414.44 44386.66 0.710187
---------------------------------------------------------------------------------------
If we go back to the first terminal on pod one, we should see similar results on the server side.
sh-5.1# ib_write_bw
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x04 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f0eeefac000
remote address: LID 0x05 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f397ace8000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 5000 44414.44 44386.66 0.710187
---------------------------------------------------------------------------------------
We can now clean up the pods since testing is over.
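Again, the workload pods can be removed using the manifests we generated:
$ oc delete -f sriovlegacy-32-workload.yaml
$ oc delete -f sriovlegacy-33-workload.yaml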
Hopefully this blog was detailed enough to provide an understanding of RDMA testing with NVIDIA and OpenShift. It provided brief examples of how to configure the different RDMA methods: Shared Device, Host Device, and SRIOV Legacy.