Wednesday, January 15, 2025

RDMA: Shared, Hostdevice, Legacy SRIOV

 
In a previous blog we discussed how to configure RDMA on OpenShift using three distinct methods: RDMA shared device, host device and legacy SR-IOV.  However, one of the biggest questions coming out of that blog was: how do I know which one to choose?  To answer that question comprehensively we should first step back and discuss RDMA and the three methods in more detail.

What is RDMA?

Remote direct memory access (RDMA) is a technology, originally developed in the 1990s, that allows computers to directly access each other's memory without involving the host's central processing unit (CPU) or operating system (OS).  RDMA is an extension of direct memory access (DMA), which allows direct access to a host's memory without the use of the CPU.  RDMA itself is geared toward high bandwidth and low latency applications, making it a valuable component in the AI space.

NVIDIA offers GPUDirect RDMA, a technology that provides a direct data path between GPU memory on two or more hosts by leveraging NVIDIA networking devices.  This configuration significantly decreases latency and offloads work from the hosts' CPUs.  When leveraging this technology from NVIDIA, the consumer can configure it in multiple ways, both to match the underlying hardware and to suit the consumer's use cases.

The three configuration methods for GPUDirect RDMA are as follows:

  • RDMA Shared Device
  • RDMA SR-IOV Legacy Device
  • RDMA Host Device
Let's take a look at each of these options and discuss why one might be used over the others depending on a consumer's use case.

RDMA Shared Device


When using the NVIDIA network operator in OpenShift there is a configuration method in the NicClusterPolicy called RDMA shared device.  This method allows an RDMA device to be shared among multiple pods on the OpenShift worker node where the device is exposed.  The user-defined networks of those pods use VXLAN or veth networking devices inside OpenShift.  Usually those devices are defined in the NicClusterPolicy by specifying the physical device name, as in the code snippet below:

  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_ib",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ibs2f0"]
            }
          },
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ens8f0np0"]
            }
          }
        ]
      }

The example above defines an RDMA shared device for an ethernet interface and another for an infiniband interface.  We also define the number of pods that can consume each interface via the rdmaHcaMax parameter.  In the NicClusterPolicy we can define as many interfaces as we have in the worker nodes, and we can set the number of pods that consume each device to different limits, which makes this method very flexible.
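For reference, a pod consumes one of these shared devices simply by requesting the resource by name.  The snippet below is a minimal sketch: the pod name, image and the rdmashared-net network annotation are placeholders that assume a matching secondary network has been defined, while the rdma/rdma_shared_device_eth resource name matches the NicClusterPolicy above.

apiVersion: v1
kind: Pod
metadata:
  name: rdma-shared-workload
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  containers:
  - name: rdma-shared-workload
    image: quay.io/example/rdma-tools:latest
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        rdma/rdma_shared_device_eth: 1
      requests:
        rdma/rdma_shared_device_eth: 1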

In an RDMA shared device configuration keep in mind that, as with any shared resource, the pods sharing the device will compete for its bandwidth, and latency will suffer under contention.  Thus an RDMA shared device is better suited for developer or application environments where performance and latency are not key, but the ability to have RDMA functionality across nodes is important.

RDMA SR-IOV Legacy Device


The Single Root I/O Virtualization (SR-IOV) specification is a standard for a type of PCI device assignment that, like an RDMA shared device, can share a single device with multiple pods.  However, the way the device is shared is very different, because SR-IOV segments a compliant network device at the hardware layer.  The network device is recognized on the node as a physical function (PF) and, when segmented, creates multiple virtual functions (VFs).  Each VF can be used like any other network device.  The SR-IOV network device driver for the device determines how the VF is exposed in the container:
  • netdevice driver: A regular kernel network device in the netns of the container
  • vfio-pci driver: A character device mounted in the container
Unlike a shared device, though, an SR-IOV device can only be shared by as many pods as the physical device has VFs.  However, since each VF behaves like direct access to the device, the performance is ideal for workloads that are latency and bandwidth sensitive.

The configuration of SR-IOV devices doesn't take place in the NVIDIA network operator NicClusterPolicy (though we still need the policy for the driver) but rather in the SriovNetworkNodePolicy for the worker node.  The example below shows how we define a vendor and pfNames for the nicSelector, along with numVfs, which defines the number of VFs to create (usually a value up to the number the device supports).

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-legacy-policy
  namespace:  openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nicSelector:
    vendor: "15b3"
    pfNames: ["ens8f0np0#0-7"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriovlegacy
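
To actually attach pods to those VFs we also need an SriovNetwork that references the resourceName from the policy and renders a network attachment definition in the target namespace.  The example below is a minimal sketch: the network name, target namespace and static IPAM address are assumptions for illustration.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-legacy-network
  namespace: openshift-sriov-network-operator
spec:
  resourceName: sriovlegacy
  networkNamespace: default
  ipam: |
    {
      "type": "static",
      "addresses": [{ "address": "192.168.3.1/24" }]
    }

A pod would then select this network with the k8s.v1.cni.cncf.io/networks annotation and request the openshift.io/sriovlegacy resource (openshift.io being the default resource prefix) in its resource limits and requests.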

Once the configuration is in place, workloads that require high bandwidth and low latency are great candidates for RDMA SR-IOV, especially where multiple pods need that level of performance from a single network device.

RDMA Host Device


Host device is in some ways a lot like SR-IOV in that a host device creates an additional network on a pod, allowing direct physical ethernet access on the worker node.  The plugin moves the network device from the host's network namespace into the pod's network namespace.  However, unlike SR-IOV, once the device is passed into a pod it is not available to any other pod on that host until the pod using it is removed from the system, which makes this method far more restrictive.

The configuration of this type of RDMA is again handled through the NVIDIA network operator NicClusterPolicy.  The irony here is that even though this is not an SR-IOV configuration, the SR-IOV network device plugin is used to do the device passing.  Below is an example of how to configure this type of RDMA, where we set a resourceName and use a vendors selector so that any NVIDIA device with RDMA capability is exposed as a host device.  If there are multiple cards in the system the configuration will expose all of them, assuming they match the vendor ID and have RDMA capabilities.


  sriovDevicePlugin:
      image: sriov-network-device-plugin
      repository: ghcr.io/k8snetworkplumbingwg
      version: v3.7.0
      config: |
        {
          "resourceList": [
              {
                  "resourcePrefix": "nvidia.com",
                  "resourceName": "hostdev",
                  "selectors": {
                      "vendors": ["15b3"],
                      "isRdma": true
                  }
              }
          ]
        }
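
With this configuration in place the device plugin advertises an nvidia.com/hostdev resource on the node, and a pod claims an entire card by requesting that resource.  The snippet below is a minimal sketch; the pod name and image are placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: hostdev-workload
spec:
  containers:
  - name: hostdev-workload
    image: quay.io/example/rdma-tools:latest
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/hostdev: 1
      requests:
        nvidia.com/hostdev: 1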

The RDMA host device is normally leveraged where the other two options above are not feasible.  For example, the use case requires performance but other requirements don't allow for the use of VFs.  Maybe the cards themselves do not support SR-IOV, there is not enough PCI Express base address register (BAR) space, or the system board does not support SR-IOV.  There are also rare cases where the SR-IOV netdevice driver does not support all the capabilities of the network device compared to the PF driver and the workload requires those features.

This blog covered what RDMA is and how to configure three different methods of RDMA with the NVIDIA network operator.  Along the way we also compared the methods and discussed why one might be used over another.  Hopefully this gives those looking to adopt this technology enough detail to pursue the right solution for their use case.

Monday, January 13, 2025

Mellanox Firmware Updates via OpenShift

 

Anyone who has worked with Mellanox/NVIDIA networking devices knows there is sometimes the necessity to upgrade the firmware, either to provide new feature functionality or to address a current bug in the firmware.  This might be trivial on a legacy package-based system where it's easy enough to install the NVIDIA Firmware Tools (MFT) packages once and be done.  However, for image-based operating systems like Red Hat CoreOS, which underpins the OpenShift Container Platform, this can become cumbersome.

One of the challenges with image-based systems is that standard tooling like dnf is not available, and while rpm-ostree install is an option, it's really not meant to be used like a packaging system.  When I initially needed to update firmware I was instructed to install the MFT tools RPM inside the DOCA/MOFED container.  While this method works, the drawbacks are:
  • The container is ephemeral, so if the DOCA/MOFED container restarts and/or gets updated I have to install the MFT tools all over again.
  • I need to stage the packages and the required kernel-devel dependencies in the DOCA/MOFED container.
Given these challenges I decided to build an image that I could run on OpenShift to provide the tooling whenever I needed it, simply by spinning up a pod.  We will cover that process through the rest of this blog.

Before we begin let's first explain what the MFT package of firmware management tools is used for:

  • Generate a standard or customized NVIDIA firmware image
  • Query for firmware information
  • Burn a firmware image
  • Make configuration changes to the firmware settings

The following is a list of the available tools in MFT, together with a brief description of what each tool performs.

  • mst - Starts/stops the register access driver and lists the available mst devices
  • mlxburn - Generates a standard or customized NVIDIA firmware image for burning (.bin or .mlx) to the Flash/EEPROM attached to an NVIDIA HCA or switch device
  • flint - Burns/queries a firmware binary image or an expansion ROM image on the Flash device of an NVIDIA network adapter/gateway/switch device
  • debug utilities - A set of debug utilities (e.g., itrace, fwtrace, mlxtrace, mlxdump, mstdump, mlxmcg, wqdump, mcra, mlxi2c, i2c, mget_temp, and pckt_drop)
  • mlxup - Enables discovery of available NVIDIA adapters and indicates whether a firmware update is required for each adapter
  • mlnx-tools - Mellanox userland tools and scripts

Sources: Mlnx-tools Repo, MFT Tools, Mlxup

Prerequisites

Before we can build the container we need to set up the directory structure, gather a few packages, and create the dockerfile and entrypoint script. First let's create the directory structure. I am using root in this example but it could be a regular user.

$ mkdir -p /root/mft/rpms
$ cd /root/mft

Next we need to download the following rpms from Red Hat Package Downloads and place them into the rpms directory. The first is the kernel-devel package for the kernel of the OpenShift node this container will run on. To obtain the kernel version we can run the following oc command on our cluster.

$ oc debug node/nvd-srv-29.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-29nvidiaengrdu2dcredhatcom-debug-rhlgs ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.8
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# uname -r
5.14.0-427.47.1.el9_4.x86_64
sh-5.1#

Now that we have our kernel version we can download the two packages into our /root/mft/rpms directory.

  • kernel-devel-5.14.0-427.47.1.el9_4.x86_64.rpm
  • usbutils-017-1.el9.x86_64.rpm

Next we need to create the dockerfile.mft which will build the container.

$ cat <<EOF > dockerfile.mft
# Start from UBI9 image
FROM registry.access.redhat.com/ubi9/ubi:latest

# Set work directory
WORKDIR /root/mft

# Copy in packages not available in UBI repo
COPY ./rpms/*.rpm /root/rpms/
RUN dnf install /root/rpms/usbutils*.rpm -y

# DNF install packages either from repo or locally
RUN dnf install wget procps-ng pciutils yum jq iputils ethtool net-tools kmod systemd-udev rpm-build gcc make -y

# Cleanup
WORKDIR /root
RUN dnf clean all

# Run container entrypoint
COPY entrypoint.sh /root/entrypoint.sh
ENTRYPOINT ["/bin/bash", "/root/entrypoint.sh"]
EOF

The container file references an entrypoint.sh script, so we need to create that under /root/mft/.

$ cat <<'EOF' > entrypoint.sh
#!/bin/bash
# Set working dir
cd /root

# Set tool versions
MLNXTOOLVER=23.07-1.el9
MFTTOOLVER=4.30.0-139
MLXUPVER=4.30.0

# Set architecture
ARCH=`uname -m`

# Pull mlnx-tools from EPEL
wget https://dl.fedoraproject.org/pub/epel/9/Everything/$ARCH/Packages/m/mlnx-tools-$MLNXTOOLVER.noarch.rpm

# Arm architecture fixup for mft-tools
if [ "$ARCH" == "aarch64" ]; then export ARCH="arm64"; fi

# Pull mft-tools
wget https://www.mellanox.com/downloads/MFT/mft-$MFTTOOLVER-$ARCH-rpm.tgz

# Install mlnx-tools into container
dnf install mlnx-tools-$MLNXTOOLVER.noarch.rpm -y

# Install kernel-devel package supplied in container
rpm -ivh /root/rpms/kernel-devel-*.rpm --nodeps
mkdir /lib/modules/$(uname -r)/
ln -s /usr/src/kernels/$(uname -r) /lib/modules/$(uname -r)/build

# Install mft-tools into container
tar -xzf mft-$MFTTOOLVER-$ARCH-rpm.tgz
cd /root/mft-$MFTTOOLVER-$ARCH-rpm
#./install.sh --without-kernel
./install.sh

# Change back to root workdir
cd /root

# x86 fixup for mlxup binary
if [ "$ARCH" == "x86_64" ]; then export ARCH="x64"; fi

# Pull and place mlxup binary
wget https://www.mellanox.com/downloads/firmware/mlxup/$MLXUPVER/SFX/linux_$ARCH/mlxup
mv mlxup /usr/local/bin
chmod +x /usr/local/bin/mlxup

sleep infinity & wait
EOF

Now we should have all the prerequisites and we can move on to building the container.

Building The Container

To build the container, run the podman build command on a Red Hat Enterprise Linux 9.x system using the dockerfile.mft we created above.

$ podman build . -f dockerfile.mft -t quay.io/redhat_emp1/ecosys-nvidia/mfttools:1.0.0 STEP 1/9: FROM registry.access.redhat.com/ubi9/ubi:latest STEP 2/9: WORKDIR /root/mft --> 6e6c9f1636c7 STEP 3/9: COPY ./rpms/*.rpm /root/rpms/ --> 30a022291bd9 STEP 4/9: RUN dnf install /root/rpms/usbutils*.rpm -y Updating Subscription Management repositories. subscription-manager is operating in container mode. Red Hat Enterprise Linux 9 for x86_64 - BaseOS 9.2 MB/s | 41 MB 00:04 Red Hat Enterprise Linux 9 for x86_64 - AppStre 9.4 MB/s | 48 MB 00:05 Red Hat Universal Base Image 9 (RPMs) - BaseOS 2.2 MB/s | 525 kB 00:00 Red Hat Universal Base Image 9 (RPMs) - AppStre 5.2 MB/s | 2.3 MB 00:00 Red Hat Universal Base Image 9 (RPMs) - CodeRea 1.7 MB/s | 281 kB 00:00 Dependencies resolved. ================================================================================ Package Arch Version Repository Size ================================================================================ Installing: usbutils x86_64 017-1.el9 @commandline 120 k Installing dependencies: hwdata noarch 0.348-9.15.el9 rhel-9-for-x86_64-baseos-rpms 1.6 M libusbx x86_64 1.0.26-1.el9 rhel-9-for-x86_64-baseos-rpms 78 k Transaction Summary ================================================================================ Install 3 Packages Total size: 1.8 M Total download size: 1.7 M Installed size: 9.8 M Downloading Packages: (1/2): libusbx-1.0.26-1.el9.x86_64.rpm 327 kB/s | 78 kB 00:00 (2/2): hwdata-0.348-9.15.el9.noarch.rpm 3.3 MB/s | 1.6 MB 00:00 -------------------------------------------------------------------------------- Total 3.4 MB/s | 1.7 MB 00:00 Running transaction check Transaction check succeeded. Running transaction test Transaction test succeeded. Running transaction Preparing : 1/1 Installing : hwdata-0.348-9.15.el9.noarch 1/3 Installing : libusbx-1.0.26-1.el9.x86_64 2/3 Installing : usbutils-017-1.el9.x86_64 3/3 Running scriptlet: usbutils-017-1.el9.x86_64 3/3 Verifying : libusbx-1.0.26-1.el9.x86_64 1/3 Verifying : hwdata-0.348-9.15.el9.noarch 2/3 Verifying : usbutils-017-1.el9.x86_64 3/3 Installed products updated. Installed: hwdata-0.348-9.15.el9.noarch libusbx-1.0.26-1.el9.x86_64 usbutils-017-1.el9.x86_64 Complete! --> 7c16c8d84152 STEP 5/9: RUN dnf install wget procps-ng pciutils yum jq iputils ethtool net-tools kmod systemd-udev rpm-build gcc make -y Updating Subscription Management repositories. subscription-manager is operating in container mode. Last metadata expiration check: 0:00:08 ago on Thu Jan 9 18:32:20 2025. Package yum-4.14.0-17.el9.noarch is already installed. Dependencies resolved. ====================================================================================================== Package Arch Version Repository Size ====================================================================================================== Installing: ethtool x86_64 2:6.2-1.el9 rhel-9-for-x86_64-baseos-rpms 234 k gcc x86_64 11.5.0-2.el9 rhel-9-for-x86_64-appstream-rpms 32 M iputils x86_64 20210202-10.el9_5 rhel-9-for-x86_64-baseos-rpms 179 k (...) unzip-6.0-57.el9.x86_64 wget-1.21.1-8.el9_4.x86_64 xz-5.2.5-8.el9_0.x86_64 zip-3.0-35.el9.x86_64 zstd-1.5.1-2.el9.x86_64 Complete! --> 862d0e2c9c6f STEP 6/9: WORKDIR /root --> 5b3ec62db585 STEP 7/9: RUN dnf clean all Updating Subscription Management repositories. subscription-manager is operating in container mode. 
43 files removed --> c14c44f59e9e STEP 8/9: COPY entrypoint.sh /root/entrypoint.sh --> d2d5192c3c57 STEP 9/9: ENTRYPOINT ["/bin/bash", "/root/entrypoint.sh"] COMMIT quay.io/redhat_emp1/ecosys-nvidia/mfttools:1.0.0 --> 1873a4483236 Successfully tagged quay.io/redhat_emp1/ecosys-nvidia/mfttools:1.0.0 1873a448323610f369a8565182a2914675f16d735ffe07f92258df89cd439224

Once the image has been built, push it up to a registry that the OpenShift cluster can access.

$ podman push quay.io/redhat_emp1/ecosys-nvidia/mfttools:1.0.0 Getting image source signatures Copying blob e5df12622381 done | Copying blob 97c1462e7c7b done | Copying blob facf1e7dd3e0 skipped: already exists Copying blob 2dca7d5c2bb7 done | Copying blob 6f64cedd7423 done | Copying blob ec465ce79861 skipped: already exists Copying blob 121c270794cd done | Copying config 1873a44832 done | Writing manifest to image destination

Running The Container

The container will need to run privileged so we can access the hardware devices. To do this we will create a ServiceAccount and Namespace for it to run in.

$ cat <<EOF > mfttool-project.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: mfttool
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mfttool
  namespace: mfttool
EOF

Once the resource file is generated create it on the cluster.

$ oc create -f mfttool-project.yaml
namespace/mfttool created
serviceaccount/mfttool created

Now that the project has been created assign the appropriate privileges to the service account.

$ oc -n mfttool adm policy add-scc-to-user privileged -z mfttool
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "mfttool"

Next we will create a pod yaml for each of our baremetal nodes that will run under the mfttool namespace and leverage the MFT tooling.

$ cat <<EOF > mfttool-pod-nvd-srv-29.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mfttool-pod-nvd-srv-29
  namespace: mfttool
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-29.nvidia.eng.rdu2.dc.redhat.com
  hostNetwork: true
  serviceAccountName: mfttool
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/mfttools:1.0.0
    name: mfttool-pod-nvd-srv-29
    securityContext:
      privileged: true
EOF

Once the custom resource file has been generated, create the resource on the cluster.

$ oc create -f mfttool-pod-nvd-srv-29.yaml
pod/mfttool-pod-nvd-srv-29 created

Validate that the pod is up and running.

$ oc get pods -n mfttool
NAME                     READY   STATUS    RESTARTS   AGE
mfttool-pod-nvd-srv-29   1/1     Running   0          28s

Next we can rsh into the pod.

$ oc rsh -n mfttool mfttool-pod-nvd-srv-29
sh-5.1#

Once inside the pod we can run an mst start and then an mst status to see the devices.

$ oc rsh -n mfttool mfttool-pod-nvd-srv-29
sh-5.1# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
[warn] mst_pciconf is already loaded, skipping
Create devices
Unloading MST PCI module (unused) - Success
sh-5.1# mst status
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded
MST devices:
------------
/dev/mst/mt4129_pciconf0    - PCI configuration cycles access.
                              domain:bus:dev.fn=0000:0d:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                              Chip revision is: 00
/dev/mst/mt4129_pciconf1    - PCI configuration cycles access.
                              domain:bus:dev.fn=0000:37:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                              Chip revision is: 00
sh-5.1#

One of the things we can do with this container is query the devices and their settings with mlxconfig. We can also change those settings, for example when we need to change a port from ethernet mode to infiniband mode.

mlxconfig -d /dev/mst/mt4129_pciconf0 query

Device #1:
----------
Device type:    ConnectX7
Name:           MCX715105AS-WEAT_Ax
Description:    NVIDIA ConnectX-7 HHHL Adapter Card; 400GbE (default mode) / NDR IB; Single-port QSFP112; Port Split Capable; PCIe 5.0 x16 with x16 PCIe extension option; Crypto Disabled; Secure Boot Enabled
Device:         /dev/mst/mt4129_pciconf0

Configurations:                     Next Boot
        MODULE_SPLIT_M0             Array[0..15]
        MEMIC_BAR_SIZE              0
        MEMIC_SIZE_LIMIT            _256KB(1)
        (...)
        ADVANCED_PCI_SETTINGS       False(0)
        SAFE_MODE_THRESHOLD         10
        SAFE_MODE_ENABLE            True(1)
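
As an example of changing a setting, switching the first port of one of these cards from ethernet mode to infiniband mode can be done with mlxconfig's set command. This is a sketch: on ConnectX cards LINK_TYPE_P1 accepts IB(1) or ETH(2), and the change only takes effect after a reboot or an mlxfwreset of the device.

mlxconfig -d /dev/mst/mt4129_pciconf0 set LINK_TYPE_P1=1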

Another tool in the container is flint which allows us to see the firmware, product version and PSID for the device. This is useful for preparing for a firmware update.

flint -d /dev/mst/mt4129_pciconf0 query
Image type:            FS4
FW Version:            28.42.1000
FW Release Date:       8.8.2024
Product Version:       28.42.1000
Rom Info:              type=UEFI version=14.35.15 cpu=AMD64,AARCH64
                       type=PXE version=3.7.500 cpu=AMD64
Description:           UID                GuidsNumber
Base GUID:             e09d730300126474        16
Base MAC:              e09d73126474            16
Image VSD:             N/A
Device VSD:            N/A
PSID:                  MT_0000001244
Security Attributes:   secure-fw

The container also includes mlxup, which provides an automated way to update the firmware. When we run mlxup it queries all devices in the system and reports back the current firmware along with any newer firmware available for each device. We can then decide to update the cards or skip for now. In the example below the node my pod is running on has two single port CX-7 cards, and I will upgrade their firmware.

$ mlxup Querying Mellanox devices firmware ... Device #1: ---------- Device Type: ConnectX7 Part Number: MCX715105AS-WEAT_Ax Description: NVIDIA ConnectX-7 HHHL Adapter Card; 400GbE (default mode) / NDR IB; Single-port QSFP112; Port Split Capable; PCIe 5.0 x16 with x16 PCIe extension option; Crypto Disabled; Secure Boot Enabled PSID: MT_0000001244 PCI Device Name: /dev/mst/mt4129_pciconf1 Base MAC: e09d73125fc4 Versions: Current Available FW 28.42.1000 28.43.1014 PXE 3.7.0500 N/A UEFI 14.35.0015 N/A Status: Update required Device #2: ---------- Device Type: ConnectX7 Part Number: MCX715105AS-WEAT_Ax Description: NVIDIA ConnectX-7 HHHL Adapter Card; 400GbE (default mode) / NDR IB; Single-port QSFP112; Port Split Capable; PCIe 5.0 x16 with x16 PCIe extension option; Crypto Disabled; Secure Boot Enabled PSID: MT_0000001244 PCI Device Name: /dev/mst/mt4129_pciconf0 Base MAC: e09d73126474 Versions: Current Available FW 28.42.1000 28.43.1014 PXE 3.7.0500 N/A UEFI 14.35.0015 N/A Status: Update required --------- Found 2 device(s) requiring firmware update... Perform FW update? [y/N]: y Device #1: Updating FW ... FSMST_INITIALIZE - OK Writing Boot image component - OK Done Device #2: Updating FW ... FSMST_INITIALIZE - OK Writing Boot image component - OK Done Restart needed for updates to take effect. Log File: /tmp/mlxup_workdir/mlxup-20250109_190606_17886.log

Notice the firmware upgrade completed but we need to restart the cards for the changes to take effect. We can use the mlxfwreset command to do this and then validate with the flint command that the card is running the new firmware.

sh-5.1# mlxfwreset -d /dev/mst/mt4129_pciconf0 reset -y The reset level for device, /dev/mst/mt4129_pciconf0 is: 3: Driver restart and PCI reset Continue with reset?[y/N] y -I- Sending Reset Command To Fw -Done -I- Stopping Driver -Done -I- Resetting PCI -Done -I- Starting Driver -Done -I- Restarting MST -Done -I- FW was loaded successfully. sh-5.1# flint -d /dev/mst/mt4129_pciconf0 query Image type: FS4 FW Version: 28.43.1014 FW Release Date: 7.11.2024 Product Version: 28.43.1014 Rom Info: type=UEFI version=14.36.16 cpu=AMD64,AARCH64 type=PXE version=3.7.500 cpu=AMD64 Description: UID GuidsNumber Base GUID: e09d730300126474 16 Base MAC: e09d73126474 16 Image VSD: N/A Device VSD: N/A PSID: MT_0000001244 Security Attributes: secure-fw

We can repeat the same steps on the second card in the system to complete the firmware update.

sh-5.1# mlxfwreset -d /dev/mst/mt4129_pciconf1 reset -y The reset level for device, /dev/mst/mt4129_pciconf1 is: 3: Driver restart and PCI reset Continue with reset?[y/N] y -I- Sending Reset Command To Fw -Done -I- Stopping Driver -Done -I- Resetting PCI -Done -I- Starting Driver -Done -I- Restarting MST -Done -I- FW was loaded successfully. sh-5.1# flint -d /dev/mst/mt4129_pciconf1 query Image type: FS4 FW Version: 28.43.1014 FW Release Date: 7.11.2024 Product Version: 28.43.1014 Rom Info: type=UEFI version=14.36.16 cpu=AMD64,AARCH64 type=PXE version=3.7.500 cpu=AMD64 Description: UID GuidsNumber Base GUID: e09d730300125fc4 16 Base MAC: e09d73125fc4 16 Image VSD: N/A Device VSD: N/A PSID: MT_0000001244 Security Attributes: secure-fw

Once the firmware update has been completed and validated we can remove the container, as this completes the firmware update example.

Hopefully this gives an idea of how this container method can simplify upgrading Mellanox/NVIDIA firmware on an image-based operating system like Red Hat CoreOS in the OpenShift Container Platform.

Friday, January 10, 2025

Understanding Ethernet and Infiniband on OpenShift

I was recently involved in a conversation around using only infiniband on an OpenShift cluster installation.  That is, the customer wanted infiniband connectivity for both the cluster APIs and the application's high speed storage access requirements.  This interaction made me realize we probably need a refresher on the difference between infiniband and ethernet, because they are not the same nor can they be swapped interchangeably.

Infiniband and ethernet are very different from a design point of view.  Infiniband was designed to provide a highly reliable, high bandwidth and low latency interconnect between nodes in a supercomputer cluster, whereas ethernet was designed around the question of how to move data between multiple systems easily.  This difference becomes more apparent in how each technology is designed to move data.

The design differences show up, for example, in how latency is handled between the two types of interconnects.  Ethernet interconnects typically use a store-and-forward, MAC-address-based transport model for communication between hosts.  This method increases processing because it has to take into account complex services like IP, MPLS and 802.1Q.  Infiniband layer 2 processing, on the other hand, uses a 16-bit LID (local ID) number, which is the only identifier needed to look up forwarding path information.  Further, the switching technology in infiniband uses a cut-through approach which reduces the forwarding delay, making it significantly faster than ethernet.

Another difference shows up in network reliability.  The infiniband protocol is a complete network protocol with its own defined layers from layer 1 to layer 4.  This end to end flow control provides the basis for infiniband's packet sending and receiving, which can provide a lossless network.  Ethernet, on the other hand, does not have a scheduling-based flow control mechanism, so there is no guarantee that a node on the other end won't be congested when packets arrive.  This is why ethernet switches are built with caches to absorb these sudden bursts of traffic.

Networking mode is another distinction between these two technologies.  A software-defined network is built into infiniband by design.  There is a subnet manager present on each layer 2 infiniband network to configure the LIDs of the nodes.  The subnet manager also calculates the forwarding paths through the control plane and issues them to the infiniband switches.  Conversely, ethernet generates MAC addresses and the IP protocol must cooperate with the ARP protocol.  Nodes in an ethernet network are required to send packets on a regular basis to guarantee that entries, in an ARP table for example, are updated in real time.  All of this leads to more overhead in an ethernet network compared to infiniband.

We can see from the above that there are significant differences between the two technologies, which makes it impossible to swap them out like for like, as in the case of our OpenShift installation request from the customer.  For example, take the OpenShift installation, which will leverage OVN/OVS for networking.  During the installation there is an expectation that a MAC address will exist on the interface marked for the cluster API.  However, in an infiniband network there is no MAC address concept.  One might see an error similar to the below:

Error: failed to modify 802-3-ethernet.cloned-mac-address: '00:00:01:49:fe:80:00:00:00:00:00:00:00:11:22:33:01:32:02:00' is not a valid Ethernet MAC.

Further, drivers also become an issue for networking devices like a Mellanox CX-7 or BlueField-3.  This is because the default mlx upstream drivers that ship with Red Hat CoreOS in OpenShift do not contain the RDMA component which is required for infiniband.  To obtain the RDMA component one needs to leverage the NVIDIA DOCA driver, which is part of the NVIDIA network operator.  However, this operator cannot be leveraged in an OpenShift day 0 installation.  Even if it could, the expectation of OVS/OVN networking is still to have a MAC address from an ethernet network to work with.

Given all these differences we had to explore how we could meet the customer's needs while still applying the correct technology to the systems.  If the customer's goal was to ensure a high speed interconnect between the nodes in the cluster, we can still do this with OpenShift.  However, we need to approach it differently and also break out the cluster APIs so they are still working over an ethernet network.  A suitable approach might look like the example below.

In the diagram we have a six node OpenShift cluster, with each node having two single port Mellanox CX-7 cards.  On each node one card is plugged into an ethernet switch and the other is plugged into an infiniband switch.  With this design we can install OpenShift using the CX-7 card operating in ethernet mode.  Once OpenShift is installed we can then layer on the NVIDIA network operator to provide the RDMA infiniband driver and leverage the second CX-7 card operating in infiniband mode.  This design enables us to not only get OpenShift installed but also provide our workloads with a secondary network that has access to the high speed infiniband network.  The same design would also work with just one dual ported CX-7 card, as we can use the Mellanox tools to configure one port for ethernet and one for infiniband, as sketched below.
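
In the dual port case, setting the port personalities can be done with the mlxconfig utility from the MFT tooling.  A sketch of putting port one in ethernet mode and port two in infiniband mode might look like the following, assuming an example device path from an mst status listing; LINK_TYPE_Px accepts IB(1) or ETH(2) and the change takes effect after a reboot or firmware reset.

mlxconfig -d /dev/mst/mt4129_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=1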

Hopefully this blog provided some insight into the difference between infiniband and ethernet and why one simply cannot swap out ethernet for infiniband on an OpenShift installation.    


Wednesday, January 08, 2025

Build RDMA GPU-Tools Container

 


The purpose of this blog is to build a container that automates building the testing tooling for validating RDMA connectivity and performance when used in conjunction with the NVIDIA Network Operator and NVIDIA GPU Operator.  Specifically, I want to be able to use the ib_write_bw command with the --use_cuda switch to demonstrate RDMA from a GPU in one node to a GPU in another node in an OpenShift cluster. The ib_write_bw command is part of the perftest suite, which is a collection of tests written over uverbs intended for use as a performance micro-benchmark. The tests may be used for HW or SW tuning as well as for functional testing.

The collection contains a set of bandwidth and latency benchmarks such as:

  • Send - ib_send_bw and ib_send_lat
  • RDMA Read - ib_read_bw and ib_read_lat
  • RDMA Write - ib_write_bw and ib_write_lat
  • RDMA Atomic - ib_atomic_bw and ib_atomic_lat
  • Native Ethernet (when working with MOFED2) - raw_ethernet_bw, raw_ethernet_lat

In previous blogs, here and here,  I used a Fedora 35 container and manually added the components I wanted but here we will provide the tooling to build a container that will instantiate itself upon deployment. The workflow is as follows:

  • Dockerfile.tools - which provides the content for the base image and the entrypoint.sh script.
  • Entrypoint.sh - which provides the start up script for the container to pull in both the NVIDIA cuda libraries and also build and deploy the perftest suite with the cuda option available.
  • Additional RPMs - there are some packages that were not part of the UBI image repo but are dependencies for CUDA toolkit.

The first thing we need to do is create a working directory for our files and an rpms directory for the rpms we will need for our base image. I am using root here but it could be a regular user as well.

$ mkdir -p /root/gpu-tools/rpms
$ cd /root/gpu-tools

Next we need to download the following rpms from Red Hat Package Downloads and place them into the rpms directory.

  • infiniband-diags-51.0-1.el9.x86_64.rpm
  • libglvnd-opengl-1.3.4-1.el9.x86_64.rpm
  • libibumad-51.0-1.el9.x86_64.rpm
  • librdmacm-51.0-1.el9.x86_64.rpm
  • libxcb-1.13.1-9.el9.x86_64.rpm
  • libxcb-devel-1.13.1-9.el9.x86_64.rpm
  • libxkbcommon-1.0.3-4.el9.x86_64.rpm
  • libxkbcommon-x11-1.0.3-4.el9.x86_64.rpm
  • pciutils-devel-3.7.0-5.el9.x86_64.rpm
  • rdma-core-devel-51.0-1.el9.x86_64.rpm
  • xcb-util-0.4.0-19.el9.x86_64.rpm
  • xcb-util-image-0.4.0-19.el9.x86_64.rpm
  • xcb-util-keysyms-0.4.0-17.el9.x86_64.rpm
  • xcb-util-renderutil-0.3.9-20.el9.x86_64.rpm
  • xcb-util-wm-0.4.1-22.el9.x86_64.rpm

Once we have all our rpms for the base image we can move onto creating the dockerfile.tools file which we will use to build our image.

$ cat <<'EOF' > dockerfile.tools
# Start from UBI9 image
FROM registry.access.redhat.com/ubi9/ubi:latest

# Set work directory
WORKDIR /root
RUN mkdir /root/rpms
COPY ./rpms/*.rpm /root/rpms/

# DNF install packages either from repo or locally
RUN dnf install `ls -1 /root/rpms/*.rpm` -y
RUN dnf install wget procps-ng pciutils jq iputils ethtool net-tools git autoconf automake libtool -y

# Cleanup
WORKDIR /root
RUN dnf clean all

# Run container entrypoint
COPY entrypoint.sh /root/entrypoint.sh
RUN chmod +x /root/entrypoint.sh
ENTRYPOINT ["/root/entrypoint.sh"]
EOF

We also need to create the entrypoint.sh script which is referenced in the dockerfile and does the heavy lifting of pulling in the cuda toolkit and the perftest suite.

$ cat <<'EOF' > entrypoint.sh
#!/bin/bash
# Set working dir
cd /root

# Configure and install cuda-toolkit
dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
dnf clean all
dnf -y install cuda-toolkit-12-6

# Export CUDA library paths
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH

# Git clone perftest repository
git clone https://github.com/linux-rdma/perftest.git

# Change into perftest directory
cd /root/perftest

# Build perftest with the cuda libraries included
./autogen.sh
./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h
make -j
make install

# Sleep container indefinitely
sleep infinity & wait
EOF

Next we can use the dockerfile we just created to build the base image.

$ podman build -f dockerfile.tools -t quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2 STEP 1/10: FROM registry.access.redhat.com/ubi9/ubi:latest STEP 2/10: WORKDIR /root --> Using cache 75f163f12503272b83e1137f7c1903520f84493ffe5aec0ef32ece722bd0d815 --> 75f163f12503 STEP 3/10: RUN mkdir /root/rpms --> Using cache ade32aa6605847a8b3f5c8b68cfcb64854dc01eece34868faab46137a60f946c --> ade32aa66058 STEP 4/10: COPY ./rpms/*.rpm /root/rpms/ --> Using cache 59dcef81d6675f44d22900f13a3e5441f5073555d7d2faa0b2f261f32e4ba6cd --> 59dcef81d667 STEP 5/10: RUN dnf install `ls -1 /root/rpms/*.rpm` -y --> Using cache ebb8b3150056240378ac36f7aa41d7f13b13308e9353513f26a8d3d70e618e3b --> ebb8b3150056 STEP 6/10: RUN dnf install wget procps-ng pciutils jq iputils ethtool net-tools git autoconf automake libtool -y --> Using cache 5ca85080c103ba559994906ada0417102f54f22c182bbc3a06913109855278cc --> 5ca85080c103 STEP 7/10: WORKDIR /root --> Using cache 68c8cd47a41bc364a0da5790c90f9aee5f8a8c7807732f3a5138bff795834fc1 --> 68c8cd47a41b STEP 8/10: RUN dnf clean all Updating Subscription Management repositories. Unable to read consumer identity This system is not registered with an entitlement server. You can use subscription-manager to register. 26 files removed --> a219fec5df49 STEP 9/10: COPY entrypoint.sh /root/entrypoint.sh --> aeb03bf74673 STEP 10/10: ENTRYPOINT ["/bin/bash", "/root/entrypoint.sh"] COMMIT quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2 --> 45c2113e5082 Successfully tagged quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2 45c2113e5082fb2f548b9e1b16c17524184c4079e2db77399519cf29829af1e7

Once the image is created we can push it to our favorite registry.

$ podman push quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2 Getting image source signatures Copying blob 62ee1c6c02d5 done | Copying blob 6027214db22e done | Copying blob 4822ebd5a418 done | Copying blob 422a0e40f90b done | Copying blob 5916e2b21ab2 done | Copying blob 10bf375a4d78 done | Copying blob ca1c18e183d5 done | Copying config 3bbb6e1f9b done | Writing manifest to image destination

Now that we have an image, let's test it out on a cluster where we have compatible RDMA hardware configured. I am using the same setup as in a previous blog, so I will mostly skip the details of setting up a service account and granting it privileges, but a quick sketch follows for reference. We will then create the workload pod yaml files which we will use to deploy the image.
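
The service account setup follows the same pattern as the mfttool example in the firmware tooling post above. A quick sketch, assuming the workload pods run in the default namespace with a service account named rdma (matching the serviceAccountName in the pod specs below):

$ oc create serviceaccount rdma -n default
$ oc -n default adm policy add-scc-to-user privileged -z rdma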

$ cat <<EOF > rdma-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-32-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
    name: rdma-32-workload
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF

$ cat <<EOF > rdma-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-33-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
    name: rdma-33-workload
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF

Next we can deploy the containers.

$ oc create -f rdma-32-workload.yaml
pod/rdma-eth-32-workload created
$ oc create -f rdma-33-workload.yaml
pod/rdma-eth-33-workload created

Validate the pods are up and running.

$ oc get pods
NAME                   READY   STATUS    RESTARTS   AGE
rdma-eth-32-workload   1/1     Running   0          51s
rdma-eth-33-workload   1/1     Running   0          47s

Now open two terminals and rsh into one pod from each terminal, then validate that the perftest commands are present. We can also get the IP address of each pod from inside the containers.

$ oc rsh rdma-eth-32-workload sh-5.1# ib ib_atomic_bw ib_read_lat ib_write_bw ibcacheedit ibfindnodesusing.pl iblinkinfo ibping ibroute ibstatus ibtracert ib_atomic_lat ib_send_bw ib_write_lat ibccconfig ibhosts ibnetdiscover ibportstate ibrouters ibswitches ib_read_bw ib_send_lat ibaddr ibccquery ibidsverify.pl ibnodes ibqueryerrors ibstat ibsysstat sh-5.1# ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0@if96: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default link/ether 0a:58:0a:83:00:34 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.131.0.52/23 brd 10.131.1.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe83:34/64 scope link valid_lft forever preferred_lft forever 3: net1@if78: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default link/ether 32:1a:83:4a:e2:39 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 192.168.2.1/24 brd 192.168.2.255 scope global net1 valid_lft forever preferred_lft forever inet6 fe80::301a:83ff:fe4a:e239/64 scope link valid_lft forever preferred_lft forever $ oc rsh rdma-eth-33-workload sh-5.1# ib ib_atomic_bw ib_read_lat ib_write_bw ibcacheedit ibfindnodesusing.pl iblinkinfo ibping ibroute ibstatus ibtracert ib_atomic_lat ib_send_bw ib_write_lat ibccconfig ibhosts ibnetdiscover ibportstate ibrouters ibswitches ib_read_bw ib_send_lat ibaddr ibccquery ibidsverify.pl ibnodes ibqueryerrors ibstat ibsysstat sh-5.1# ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0@if105: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default link/ether 0a:58:0a:80:02:3d brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.128.2.61/23 brd 10.128.3.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe80:23d/64 scope link valid_lft forever preferred_lft forever 3: net1@if82: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default link/ether 22:3e:02:c9:d0:87 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 192.168.2.2/24 brd 192.168.2.255 scope global net1 valid_lft forever preferred_lft forever inet6 fe80::203e:2ff:fec9:d087/64 scope link valid_lft forever preferred_lft forever

Now let's run the RDMA perftest with the --use_cuda switch. Again we will need two rsh sessions, one on each pod. In the first terminal we can run the following.

sh-5.1# ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 192.168.2.1 --use_cuda=0

WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0

************************************
* Waiting for client to connect... *
************************************

In the second terminal we will run the following command which will dump the output.

sh-5.1# ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 192.168.2.2 --use_cuda=0 192.168.2.1 WARNING: BW peak won't be measured in this run. Perftest doesn't supports CUDA tests with inline messages: inline size set to 0 Requested mtu is higher than active mtu Changing to active mtu - 3 initializing CUDA Listing all CUDA devices in system: CUDA device 0: PCIe address is E1:00 Picking device No. 0 [pid = 4101, dev = 0] device name = [NVIDIA A40] creating CUDA Ctx making it the current CUDA Ctx CUDA device integrated: 0 cuMemAlloc() of a 2097152 bytes GPU buffer allocated GPU buffer address at 00007f3dfa600000 pointer=0x7f3dfa600000 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_1 Number of qps : 16 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON Lock-free : OFF ibv_wr* API : ON Using DDP : OFF TX depth : 128 CQ Moderation : 1 Mtu : 1024[B] Link type : Ethernet GID index : 3 Max inline data : 0[B] rdma_cm QPs : ON Data ex. method : rdma_cm TOS : 41 --------------------------------------------------------------------------------------- local address: LID 0000 QPN 0x00c6 PSN 0x2986aa GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00c7 PSN 0xa0ef83 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00c8 PSN 0x74badb GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00c9 PSN 0x287d57 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00ca PSN 0xf5b155 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00cb PSN 0x6cc15d GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00cc PSN 0x3730c2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00cd PSN 0x74d75d GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00ce PSN 0x51a707 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00cf PSN 0x987246 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d0 PSN 0xa334a8 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d1 PSN 0x5d8f52 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d2 PSN 0xc42ca0 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d3 PSN 0xf43696 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d4 PSN 0x43f9d2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d5 PSN 0xbc4d64 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00c6 PSN 0xb1023e GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00c7 PSN 0xc78587 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00c8 PSN 0x5a328f GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00c9 PSN 0x582cfb GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00cb PSN 0x40d229 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00cc PSN 0x5833a1 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00cd 
PSN 0xcfefb6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00ce PSN 0xfd5d41 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00cf PSN 0xed811b GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d0 PSN 0x5244ca GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d1 PSN 0x946edc GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d2 PSN 0x4e0f76 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d3 PSN 0x7b13f4 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d4 PSN 0x1a2d5a GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d5 PSN 0xd22346 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d6 PSN 0x722bc8 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 65536 10384867 0.00 181.46 0.346100 --------------------------------------------------------------------------------------- deallocating GPU buffer 00007f3dfa600000 destroying current CUDA Ctx

And if we return to the first terminal we should see it updated with the same output.

sh-5.1# ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 192.168.2.1 --use_cuda=0 WARNING: BW peak won't be measured in this run. Perftest doesn't supports CUDA tests with inline messages: inline size set to 0 ************************************ * Waiting for client to connect... * ************************************ Requested mtu is higher than active mtu Changing to active mtu - 3 initializing CUDA Listing all CUDA devices in system: CUDA device 0: PCIe address is 61:00 Picking device No. 0 [pid = 4109, dev = 0] device name = [NVIDIA A40] creating CUDA Ctx making it the current CUDA Ctx CUDA device integrated: 0 cuMemAlloc() of a 2097152 bytes GPU buffer allocated GPU buffer address at 00007f8bca600000 pointer=0x7f8bca600000 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_1 Number of qps : 16 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON Lock-free : OFF ibv_wr* API : ON Using DDP : OFF CQ Moderation : 1 Mtu : 1024[B] Link type : Ethernet GID index : 3 Max inline data : 0[B] rdma_cm QPs : ON Data ex. method : rdma_cm TOS : 41 --------------------------------------------------------------------------------------- Waiting for client rdma_cm QP to connect Please run the same command with the IB/RoCE interface IP --------------------------------------------------------------------------------------- local address: LID 0000 QPN 0x00c6 PSN 0xb1023e GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00c7 PSN 0xc78587 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00c8 PSN 0x5a328f GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00c9 PSN 0x582cfb GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00cb PSN 0x40d229 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00cc PSN 0x5833a1 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00cd PSN 0xcfefb6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00ce PSN 0xfd5d41 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00cf PSN 0xed811b GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d0 PSN 0x5244ca GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d1 PSN 0x946edc GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d2 PSN 0x4e0f76 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d3 PSN 0x7b13f4 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d4 PSN 0x1a2d5a GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d5 PSN 0xd22346 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d6 PSN 0x722bc8 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00c6 PSN 0x2986aa GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00c7 PSN 0xa0ef83 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00c8 PSN 0x74badb GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00c9 PSN 0x287d57 GID: 
00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00ca PSN 0xf5b155 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00cb PSN 0x6cc15d GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00cc PSN 0x3730c2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00cd PSN 0x74d75d GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00ce PSN 0x51a707 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00cf PSN 0x987246 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d0 PSN 0xa334a8 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d1 PSN 0x5d8f52 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d2 PSN 0xc42ca0 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d3 PSN 0xf43696 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d4 PSN 0x43f9d2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d5 PSN 0xbc4d64 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 65536 10384867 0.00 181.46 0.346100 --------------------------------------------------------------------------------------- deallocating GPU buffer 00007f8bca600000 destroying current CUDA Ctx

Hopefully this helped demonstrate a much cleaner and more automated way to build a perftest container with cuda enabled to perform RDMA testing on OpenShift with the NVIDIA Network Operator and NVIDIA GPU Operator.