Monday, January 13, 2025

Mellanox Firmware Updates via OpenShift

 

Anyone who has worked with Mellanox/NVIDIA networking devices knows there is sometimes a need to upgrade the firmware, either to gain new feature functionality or to address a bug in the current firmware.  This is trivial on a legacy package-based system where it is easy enough to install the NVIDIA Firmware Tools (MFT) packages once and be done.  However, for image-based operating systems like Red Hat CoreOS, which underpins the OpenShift Container Platform, this can become cumbersome.

One of the challenges with image-based systems is that standard tooling like dnf is not available, and while rpm-ostree install is an option, it is really not meant to be used like a packaging system.   When I initially needed to update firmware I was instructed to install the MFT tools rpm inside the DOCA/MOFED container.  While this method works (a rough sketch of it appears after the list below), the drawbacks are:
  • The container is ephemeral, so if the DOCA/MOFED container restarts and/or gets updated I have to install the MFT tools all over again.
  • I need to stage the packages in the DOCA/MOFED container along with the required kernel-devel dependencies.
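For context, that manual approach looked roughly like the sketch below. Everything here is illustrative: the DOCA/MOFED pod name, namespace and package versions will differ in your environment, and the rpms still have to be copied or downloaded into the pod first.

$ oc rsh -n nvidia-network-operator mofed-rhcos4.16-xxxxxxxxxx-ds-xxxxx
sh-5.1# rpm -ivh kernel-devel-$(uname -r).rpm --nodeps
sh-5.1# tar -xzf mft-4.30.0-139-x86_64-rpm.tgz
sh-5.1# cd mft-4.30.0-139-x86_64-rpm && ./install.sh
sh-5.1# mst start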
Given these challenges I decided to build an image that I could run on OpenShift, providing the tooling whenever I needed it simply by spinning up a pod. We will cover that process through the rest of this blog.

Before we begin let's first explain what the MFT package of firmware management tools is used for:

  • Generate a standard or customized NVIDIA firmware image
  • Query for firmware information
  • Burn a firmware image
  • Make configuration changes to the firmware settings

The following is a list of the available tools in MFT, together with a brief description of what each tool performs.

Tool              Description/Function
mst               Starts/stops the register access driver; lists the available mst devices
mlxburn           Generates a standard or customized NVIDIA firmware image for burning (.bin or .mlx) to the Flash/EEPROM attached to an NVIDIA HCA or switch device
flint             Burns/queries a firmware binary image or an expansion ROM image on the Flash device of an NVIDIA network adapter/gateway/switch device
debug utilities   A set of debug utilities (e.g., itrace, fwtrace, mlxtrace, mlxdump, mstdump, mlxmcg, wqdump, mcra, mlxi2c, i2c, mget_temp, and pckt_drop)
mlxup             Discovers available NVIDIA adapters and indicates whether a firmware update is required for each adapter
mlnx-tools        Mellanox userland tools and scripts

Sources: Mlnx-tools repo, MFT Tools, Mlxup

Prerequisites

Before we can build the container we need to set up the directory structure, gather a few packages, and create the dockerfile and entrypoint script. First let's create the directory structure. I am using root in this example but it could be a regular user.

$ mkdir -p /root/mft/rpms
$ cd /root/mft

Next we need to download the following rpms from Red Hat Package Downloads and place them into the rpms directory. The first is the kernel-devel package for the kernel of the OpenShift node this container will run on. To obtain the kernel version we can run the following oc command on our cluster.

$ oc debug node/nvd-srv-29.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-29nvidiaengrdu2dcredhatcom-debug-rhlgs ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.8
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# uname -r
5.14.0-427.47.1.el9_4.x86_64
sh-5.1#

Now that we have our kernel version we can download the two packages into our /root/mft/rpms directory.

  • kernel-devel-5.14.0-427.47.1.el9_4.x86_64.rpm
  • usbutils-017-1.el9.x86_64.rpm
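If the build host is a subscribed RHEL 9 system, one way to pull these down is with the dnf download plugin. This is just a sketch: it assumes dnf-plugins-core is installed and that your enabled repositories still carry the exact kernel-devel build matching the node's kernel; otherwise grab the rpms directly from the Red Hat package download site.

$ cd /root/mft/rpms
$ dnf download kernel-devel-5.14.0-427.47.1.el9_4 usbutils
$ cd /root/mft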

Next we need to create the dockerfile.mft which will build the container.

$ cat <<EOF > dockerfile.mft
# Start from UBI9 image
FROM registry.access.redhat.com/ubi9/ubi:latest

# Set work directory
WORKDIR /root/mft

# Copy in packages not available in UBI repo
COPY ./rpms/*.rpm /root/rpms/
RUN dnf install /root/rpms/usbutils*.rpm -y

# DNF install packages either from repo or locally
RUN dnf install wget procps-ng pciutils yum jq iputils ethtool net-tools kmod systemd-udev rpm-build gcc make -y

# Cleanup
WORKDIR /root
RUN dnf clean all

# Run container entrypoint
COPY entrypoint.sh /root/entrypoint.sh
ENTRYPOINT ["/bin/bash", "/root/entrypoint.sh"]
EOF

The container file references an entrypoint.sh script, so we need to create that under /root/mft/.

$ cat <<'EOF' > entrypoint.sh
#!/bin/bash

# Set working dir
cd /root

# Set tool versions
MLNXTOOLVER=23.07-1.el9
MFTTOOLVER=4.30.0-139
MLXUPVER=4.30.0

# Set architecture
ARCH=`uname -m`

# Pull mlnx-tools from EPEL
wget https://dl.fedoraproject.org/pub/epel/9/Everything/$ARCH/Packages/m/mlnx-tools-$MLNXTOOLVER.noarch.rpm

# Arm architecture fixup for mft-tools
if [ "$ARCH" == "aarch64" ]; then export ARCH="arm64"; fi

# Pull mft-tools
wget https://www.mellanox.com/downloads/MFT/mft-$MFTTOOLVER-$ARCH-rpm.tgz

# Install mlnx-tools into container
dnf install mlnx-tools-$MLNXTOOLVER.noarch.rpm -y

# Install kernel-devel package supplied in container
rpm -ivh /root/rpms/kernel-devel-*.rpm --nodeps
mkdir /lib/modules/$(uname -r)/
ln -s /usr/src/kernels/$(uname -r) /lib/modules/$(uname -r)/build

# Install mft-tools into container
tar -xzf mft-$MFTTOOLVER-$ARCH-rpm.tgz
cd /root/mft-$MFTTOOLVER-$ARCH-rpm
#./install.sh --without-kernel
./install.sh

# Change back to root workdir
cd /root

# x86 fixup for mlxup binary
if [ "$ARCH" == "x86_64" ]; then export ARCH="x64"; fi

# Pull and place mlxup binary
wget https://www.mellanox.com/downloads/firmware/mlxup/$MLXUPVER/SFX/linux_$ARCH/mlxup
mv mlxup /usr/local/bin
chmod +x /usr/local/bin/mlxup

# Sleep container indefinitely
sleep infinity & wait
EOF

Now we should have all the prerequisites and can move on to building the container.
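At this point the working directory should look something like the listing below (the exact rpm filenames will match whatever was downloaded earlier):

$ ls -R /root/mft
/root/mft:
dockerfile.mft  entrypoint.sh  rpms

/root/mft/rpms:
kernel-devel-5.14.0-427.47.1.el9_4.x86_64.rpm  usbutils-017-1.el9.x86_64.rpm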

Building The Container

To build the container run the podman build command on a Red Hat Enterprise Linux 9.x system, using the dockerfile.mft we created above.

$ podman build . -f dockerfile.mft -t quay.io/redhat_emp1/ecosys-nvidia/mfttools:1.0.0 STEP 1/9: FROM registry.access.redhat.com/ubi9/ubi:latest STEP 2/9: WORKDIR /root/mft --> 6e6c9f1636c7 STEP 3/9: COPY ./rpms/*.rpm /root/rpms/ --> 30a022291bd9 STEP 4/9: RUN dnf install /root/rpms/usbutils*.rpm -y Updating Subscription Management repositories. subscription-manager is operating in container mode. Red Hat Enterprise Linux 9 for x86_64 - BaseOS 9.2 MB/s | 41 MB 00:04 Red Hat Enterprise Linux 9 for x86_64 - AppStre 9.4 MB/s | 48 MB 00:05 Red Hat Universal Base Image 9 (RPMs) - BaseOS 2.2 MB/s | 525 kB 00:00 Red Hat Universal Base Image 9 (RPMs) - AppStre 5.2 MB/s | 2.3 MB 00:00 Red Hat Universal Base Image 9 (RPMs) - CodeRea 1.7 MB/s | 281 kB 00:00 Dependencies resolved. ================================================================================ Package Arch Version Repository Size ================================================================================ Installing: usbutils x86_64 017-1.el9 @commandline 120 k Installing dependencies: hwdata noarch 0.348-9.15.el9 rhel-9-for-x86_64-baseos-rpms 1.6 M libusbx x86_64 1.0.26-1.el9 rhel-9-for-x86_64-baseos-rpms 78 k Transaction Summary ================================================================================ Install 3 Packages Total size: 1.8 M Total download size: 1.7 M Installed size: 9.8 M Downloading Packages: (1/2): libusbx-1.0.26-1.el9.x86_64.rpm 327 kB/s | 78 kB 00:00 (2/2): hwdata-0.348-9.15.el9.noarch.rpm 3.3 MB/s | 1.6 MB 00:00 -------------------------------------------------------------------------------- Total 3.4 MB/s | 1.7 MB 00:00 Running transaction check Transaction check succeeded. Running transaction test Transaction test succeeded. Running transaction Preparing : 1/1 Installing : hwdata-0.348-9.15.el9.noarch 1/3 Installing : libusbx-1.0.26-1.el9.x86_64 2/3 Installing : usbutils-017-1.el9.x86_64 3/3 Running scriptlet: usbutils-017-1.el9.x86_64 3/3 Verifying : libusbx-1.0.26-1.el9.x86_64 1/3 Verifying : hwdata-0.348-9.15.el9.noarch 2/3 Verifying : usbutils-017-1.el9.x86_64 3/3 Installed products updated. Installed: hwdata-0.348-9.15.el9.noarch libusbx-1.0.26-1.el9.x86_64 usbutils-017-1.el9.x86_64 Complete! --> 7c16c8d84152 STEP 5/9: RUN dnf install wget procps-ng pciutils yum jq iputils ethtool net-tools kmod systemd-udev rpm-build gcc make -y Updating Subscription Management repositories. subscription-manager is operating in container mode. Last metadata expiration check: 0:00:08 ago on Thu Jan 9 18:32:20 2025. Package yum-4.14.0-17.el9.noarch is already installed. Dependencies resolved. ====================================================================================================== Package Arch Version Repository Size ====================================================================================================== Installing: ethtool x86_64 2:6.2-1.el9 rhel-9-for-x86_64-baseos-rpms 234 k gcc x86_64 11.5.0-2.el9 rhel-9-for-x86_64-appstream-rpms 32 M iputils x86_64 20210202-10.el9_5 rhel-9-for-x86_64-baseos-rpms 179 k (...) unzip-6.0-57.el9.x86_64 wget-1.21.1-8.el9_4.x86_64 xz-5.2.5-8.el9_0.x86_64 zip-3.0-35.el9.x86_64 zstd-1.5.1-2.el9.x86_64 Complete! --> 862d0e2c9c6f STEP 6/9: WORKDIR /root --> 5b3ec62db585 STEP 7/9: RUN dnf clean all Updating Subscription Management repositories. subscription-manager is operating in container mode. 
43 files removed --> c14c44f59e9e STEP 8/9: COPY entrypoint.sh /root/entrypoint.sh --> d2d5192c3c57 STEP 9/9: ENTRYPOINT ["/bin/bash", "/root/entrypoint.sh"] COMMIT quay.io/redhat_emp1/ecosys-nvidia/mfttools:1.0.0 --> 1873a4483236 Successfully tagged quay.io/redhat_emp1/ecosys-nvidia/mfttools:1.0.0 1873a448323610f369a8565182a2914675f16d735ffe07f92258df89cd439224

Once the image has been built, push the image up to a registry that the OpenShift cluster can access.

$ podman push quay.io/redhat_emp1/ecosys-nvidia/mfttools:1.0.0
Getting image source signatures
Copying blob e5df12622381 done
Copying blob 97c1462e7c7b done
Copying blob facf1e7dd3e0 skipped: already exists
Copying blob 2dca7d5c2bb7 done
Copying blob 6f64cedd7423 done
Copying blob ec465ce79861 skipped: already exists
Copying blob 121c270794cd done
Copying config 1873a44832 done
Writing manifest to image destination

Running The Container

The container will need to run privileged so we can access the hardware devices. To do this we will create a Namespace and a ServiceAccount for it to run in.

$ cat <<EOF > mfttool-project.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: mfttool
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mfttool
  namespace: mfttool
EOF

Once the resource file is generated create it on the cluster.

$ oc create -f mfttool-project.yaml
namespace/mfttool created
serviceaccount/mfttool created

Now that the project has been created assign the appropriate privileges to the service account.

$ oc -n mfttool adm policy add-scc-to-user privileged -z mfttool
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "mfttool"

Next we will create a pod yaml for each of our baremetal nodes that will run under the mfttool namespace and leverage the MFT tooling.

$ cat <<EOF > mfttool-pod-nvd-srv-29.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mfttool-pod-nvd-srv-29
  namespace: mfttool
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-29.nvidia.eng.rdu2.dc.redhat.com
  hostNetwork: true
  serviceAccountName: mfttool
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/mfttools:1.0.0
    name: mfttool-pod-nvd-srv-29
    securityContext:
      privileged: true
EOF

Once the custom resource file has been generated, create the resource on the cluster.

$ oc create -f mfttool-pod-nvd-srv-29.yaml
pod/mfttool-pod-nvd-srv-29 created

Validate that the pod is up and running.

$ oc get pods -n mfttool
NAME                     READY   STATUS    RESTARTS   AGE
mfttool-pod-nvd-srv-29   1/1     Running   0          28s

Next we can rsh into the pod.

$ oc rsh -n mfttool mfttool-pod-nvd-srv-29
sh-5.1#

Once inside the pod we can run an mst start and then an mst status to see the devices.

$ oc rsh -n mfttool mfttool-pod-nvd-srv-29
sh-5.1# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
[warn] mst_pciconf is already loaded, skipping
Create devices
Unloading MST PCI module (unused) - Success
sh-5.1# mst status
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded

MST devices:
------------
/dev/mst/mt4129_pciconf0 - PCI configuration cycles access.
                           domain:bus:dev.fn=0000:0d:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                           Chip revision is: 00
/dev/mst/mt4129_pciconf1 - PCI configuration cycles access.
                           domain:bus:dev.fn=0000:37:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                           Chip revision is: 00
sh-5.1#

One of the things we can do with this container is query the devices and their settings with mlxconfig. We can also change those settings, for example when we need to change a port from ethernet mode to infiniband mode.

mlxconfig -d /dev/mst/mt4129_pciconf0 query

Device #1:
----------
Device type:        ConnectX7
Name:               MCX715105AS-WEAT_Ax
Description:        NVIDIA ConnectX-7 HHHL Adapter Card; 400GbE (default mode) / NDR IB; Single-port QSFP112; Port Split Capable; PCIe 5.0 x16 with x16 PCIe extension option; Crypto Disabled; Secure Boot Enabled
Device:             /dev/mst/mt4129_pciconf0

Configurations:                              Next Boot
        MODULE_SPLIT_M0                      Array[0..15]
        MEMIC_BAR_SIZE                       0
        MEMIC_SIZE_LIMIT                     _256KB(1)
        (...)
        ADVANCED_PCI_SETTINGS                False(0)
        SAFE_MODE_THRESHOLD                  10
        SAFE_MODE_ENABLE                     True(1)
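Changing a setting uses the same tool with the set operation. For example, switching this port from ethernet mode to infiniband mode might look like the sketch below (LINK_TYPE_P1 accepts IB(1) or ETH(2); the device path comes from the mst status output above, mlxconfig prompts for confirmation, and the new value only takes effect after a firmware reset or reboot):

sh-5.1# mlxconfig -d /dev/mst/mt4129_pciconf0 set LINK_TYPE_P1=1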

Another tool in the container is flint, which allows us to see the firmware version, product version and PSID of the device. This is useful when preparing for a firmware update.

flint -d /dev/mst/mt4129_pciconf0 query
Image type:            FS4
FW Version:            28.42.1000
FW Release Date:       8.8.2024
Product Version:       28.42.1000
Rom Info:              type=UEFI version=14.35.15 cpu=AMD64,AARCH64
                       type=PXE version=3.7.500 cpu=AMD64
Description:           UID                GuidsNumber
Base GUID:             e09d730300126474        16
Base MAC:              e09d73126474            16
Image VSD:             N/A
Device VSD:            N/A
PSID:                  MT_0000001244
Security Attributes:   secure-fw

Another tool in the container is mlxup, which provides an automated way to update the firmware. When we run mlxup it queries all devices on the system and reports back the current firmware along with the firmware available for each device. We can then decide to update the cards or skip them for now. In the example below I have two single-port CX-7 cards in the node my pod is running on and I will upgrade their firmware.

$ mlxup Querying Mellanox devices firmware ... Device #1: ---------- Device Type: ConnectX7 Part Number: MCX715105AS-WEAT_Ax Description: NVIDIA ConnectX-7 HHHL Adapter Card; 400GbE (default mode) / NDR IB; Single-port QSFP112; Port Split Capable; PCIe 5.0 x16 with x16 PCIe extension option; Crypto Disabled; Secure Boot Enabled PSID: MT_0000001244 PCI Device Name: /dev/mst/mt4129_pciconf1 Base MAC: e09d73125fc4 Versions: Current Available FW 28.42.1000 28.43.1014 PXE 3.7.0500 N/A UEFI 14.35.0015 N/A Status: Update required Device #2: ---------- Device Type: ConnectX7 Part Number: MCX715105AS-WEAT_Ax Description: NVIDIA ConnectX-7 HHHL Adapter Card; 400GbE (default mode) / NDR IB; Single-port QSFP112; Port Split Capable; PCIe 5.0 x16 with x16 PCIe extension option; Crypto Disabled; Secure Boot Enabled PSID: MT_0000001244 PCI Device Name: /dev/mst/mt4129_pciconf0 Base MAC: e09d73126474 Versions: Current Available FW 28.42.1000 28.43.1014 PXE 3.7.0500 N/A UEFI 14.35.0015 N/A Status: Update required --------- Found 2 device(s) requiring firmware update... Perform FW update? [y/N]: y Device #1: Updating FW ... FSMST_INITIALIZE - OK Writing Boot image component - OK Done Device #2: Updating FW ... FSMST_INITIALIZE - OK Writing Boot image component - OK Done Restart needed for updates to take effect. Log File: /tmp/mlxup_workdir/mlxup-20250109_190606_17886.log

Notice the firmware upgrade completed but we need to reset the cards for the changes to take effect. We can use the mlxfwreset command to do this and then validate with the flint command that the card is running the new firmware.

sh-5.1# mlxfwreset -d /dev/mst/mt4129_pciconf0 reset -y The reset level for device, /dev/mst/mt4129_pciconf0 is: 3: Driver restart and PCI reset Continue with reset?[y/N] y -I- Sending Reset Command To Fw -Done -I- Stopping Driver -Done -I- Resetting PCI -Done -I- Starting Driver -Done -I- Restarting MST -Done -I- FW was loaded successfully. sh-5.1# flint -d /dev/mst/mt4129_pciconf0 query Image type: FS4 FW Version: 28.43.1014 FW Release Date: 7.11.2024 Product Version: 28.43.1014 Rom Info: type=UEFI version=14.36.16 cpu=AMD64,AARCH64 type=PXE version=3.7.500 cpu=AMD64 Description: UID GuidsNumber Base GUID: e09d730300126474 16 Base MAC: e09d73126474 16 Image VSD: N/A Device VSD: N/A PSID: MT_0000001244 Security Attributes: secure-fw

We can repeat the same steps on the second card in the system to complete the firmware update.

sh-5.1# mlxfwreset -d /dev/mst/mt4129_pciconf1 reset -y The reset level for device, /dev/mst/mt4129_pciconf1 is: 3: Driver restart and PCI reset Continue with reset?[y/N] y -I- Sending Reset Command To Fw -Done -I- Stopping Driver -Done -I- Resetting PCI -Done -I- Starting Driver -Done -I- Restarting MST -Done -I- FW was loaded successfully. sh-5.1# flint -d /dev/mst/mt4129_pciconf1 query Image type: FS4 FW Version: 28.43.1014 FW Release Date: 7.11.2024 Product Version: 28.43.1014 Rom Info: type=UEFI version=14.36.16 cpu=AMD64,AARCH64 type=PXE version=3.7.500 cpu=AMD64 Description: UID GuidsNumber Base GUID: e09d730300125fc4 16 Base MAC: e09d73125fc4 16 Image VSD: N/A Device VSD: N/A PSID: MT_0000001244 Security Attributes: secure-fw

Once the firmware update has been completed and validated we can remove the container, as this completes the firmware update example.
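Cleanup is simply a matter of deleting the resources we created earlier, for example:

$ oc delete -f mfttool-pod-nvd-srv-29.yaml
$ oc delete -f mfttool-project.yaml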

Hopefully this gives an idea of what is required to build and use this container, which aims to simplify upgrading Mellanox/NVIDIA firmware on an image-based operating system like Red Hat CoreOS in OpenShift Container Platform.

Friday, January 10, 2025

Understanding Ethernet and Infiniband on OpenShift

I recently was involved in a conversation around using only infiniband on an OpenShift cluster installation.  That is, the customer wanted to have only infiniband connectivity for both the cluster APIs and the high speed storage access requirements of their application.   This interaction made me realize we probably need a refresher on the difference between infiniband and ethernet, because they are not the same nor can they be swapped interchangeably.

The difference between infiniband and ethernet is significant from a design point of view.  Infiniband was designed to provide high reliability, high bandwidth and low latency interconnects between nodes in a supercomputer cluster.  Ethernet, on the other hand, was designed with the idea of how to move data between multiple systems easily.  This difference becomes more apparent in how each technology is designed to move data.

The design differences show up, for example, in how latency is handled between the two types of interconnects.  Ethernet interconnects typically use a store-and-forward, MAC-address-based transport model for communication between hosts.  This increases processing because the switch has to take into account complex services like IP, MPLS and 802.1Q.   With infiniband, layer 2 processing uses a 16-bit LID (local ID), which is all that is needed to look up the forwarding path.  Further, infiniband switching uses a cut-through approach which reduces the forwarding delay, making it significantly faster than ethernet.

Another difference shows up in network reliability.  The infiniband protocol is a complete network protocol with its own defined layers from layer 1 to layer 4.  Its end-to-end flow control provides the basis for infiniband packet sending and receiving, which can deliver a lossless network.  Ethernet, on the other hand, does not have a scheduling-based flow control mechanism, so there is no guarantee that the node on the other end will not be congested when packets arrive.  This is why ethernet switches are built with a cache to absorb sudden bursts of traffic.

Networking mode, or method, is another distinction between these two technologies.  A software-defined network is built into infiniband by design: there is a subnet manager present on each layer 2 infiniband network to configure the LIDs of the nodes, and the subnet manager also calculates the forwarding paths through the control plane and pushes them to the infiniband switches.  Conversely, ethernet generates MAC addresses and the IP protocol must cooperate with the ARP protocol.  Nodes in an ethernet network are required to send packets on a regular basis to guarantee that entries, in an ARP table for example, are updated in real time.  All of this leads to more overhead in an ethernet network compared to infiniband.

We can see from the above that there are significant differences between the two technologies, which makes it impossible to swap them out like for like, as in the case of our OpenShift installation request from the customer.  For example, take an OpenShift installation which will leverage OVN/OVS for networking.  During the installation there is an expectation that a MAC address will exist on the interface marked for the cluster API.   However in an infiniband network there is no MAC address concept.  One might see an error similar to the below:

Error: failed to modify 802-3-ethernet.cloned-mac-address: '00:00:01:49:fe:80:00:00:00:00:00:00:00:11:22:33:01:32:02:00' is not a valid Ethernet MAC.

Further, drivers also become an issue for networking devices like the Mellanox CX-7 or BlueField-3.  This is because the default mlx upstream drivers that ship with Red Hat CoreOS in OpenShift do not contain the RDMA component which is required for infiniband.  To obtain the RDMA component one needs to leverage the NVIDIA DOCA driver which is part of the NVIDIA network operator.  However this operator cannot be leveraged in an OpenShift day 0 installation.  Even if it could, OVS/OVN networking still expects a MAC address from an ethernet network to work with.

Given all these differences we had to explore how we could meet the customer's needs while still applying the correct technology to the systems.   If the customer's goal was to ensure a high speed interconnect between the nodes in the cluster, we can still do this with OpenShift.  However we need to approach it differently and also break out the cluster APIs so they are still running over an ethernet network.   A suitable approach might look like the example below.

In the diagram we have a six node OpenShift cluster, each node with two single-port Mellanox CX-7 cards.  For each node we have one card plugged into an ethernet switch and the other plugged into an infiniband switch.   With this design we can install OpenShift using the CX-7 card operating in ethernet mode.   Once OpenShift is installed we can then layer on the NVIDIA network operator to provide the RDMA infiniband driver and leverage the second CX-7 card operating in infiniband mode.   This design enables us to not only get OpenShift installed but still provide a secondary network to our workloads with access to the high speed infiniband network.   This same design would also work if we had just one dual-ported CX-7 card, as we can use the Mellanox tools to configure one port for ethernet and one for infiniband (see the sketch below).
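To illustrate that last point, the port personalities on a dual-port card can be flipped with mlxconfig from the MFT tooling discussed elsewhere on this blog. The sketch below assumes an example device path and a VPI-capable card (LINK_TYPE values: 1=IB, 2=ETH); the change takes effect after a firmware reset or reboot:

mlxconfig -d /dev/mst/mt4129_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=1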

Hopefully this blog provided some insight into the differences between infiniband and ethernet and why one simply cannot swap out ethernet for infiniband on an OpenShift installation.


Wednesday, January 08, 2025

Build RDMA GPU-Tools Container

 


The purpose of this blog is to build a container that automates building the testing tooling for validating RDMA connectivity and performance when used in conjunction with the NVIDIA Network Operator and NVIDIA GPU Operator.  Specifically I want to be able to use the ib_write_bw command with the --use_cuda switch to demonstrate RDMA from a GPU in one node to a GPU in another node in an OpenShift cluster. The ib_write_bw command is part of the perftest suite, which is a collection of tests written over uverbs intended for use as a performance micro-benchmark. The tests may be used for hardware or software tuning as well as for functional testing.

The collection contains a set of bandwidth and latency benchmarks such as the following (a minimal invocation example follows the list):

  • Send - ib_send_bw and ib_send_lat
  • RDMA Read - ib_read_bw and ib_read_lat
  • RDMA Write - ib_write_bw and ib_write_lat
  • RDMA Atomic - ib_atomic_bw and ib_atomic_lat
  • Native Ethernet (when working with MOFED2) - raw_ethernet_bw, raw_ethernet_lat
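Each of these tests runs as a server/client pair: start the binary with no address argument on one side, then point the other side at the server's RDMA interface IP. A minimal invocation (device name and IP address are illustrative; the full CUDA-enabled run appears later in this post) looks something like:

# Server side - waits for a client
ib_write_bw -d mlx5_1

# Client side - connects to the server
ib_write_bw -d mlx5_1 192.168.2.1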

In previous blogs, here and here, I used a Fedora 35 container and manually added the components I wanted, but here we will provide the tooling to build a container that will instantiate itself upon deployment. The workflow is as follows:

  • dockerfile.tools - provides the content for the base image and copies in the entrypoint.sh script.
  • entrypoint.sh - provides the start-up script for the container, pulling in the NVIDIA CUDA libraries and building and installing the perftest suite with the CUDA option enabled.
  • Additional RPMs - some packages that are not part of the UBI image repo but are dependencies for the CUDA toolkit.

The first thing we need to do is create a working directory for our files and an rpms directory for the rpms we will need for our base image. I am using root here but it could be a regular user as well.

$ mkdir -p /root/gpu-tools/rpms
$ cd /root/gpu-tools

Next we need to download the following rpms from Red Hat Package Downloads and place them into the rpms directory.

  • infiniband-diags-51.0-1.el9.x86_64.rpm
  • libglvnd-opengl-1.3.4-1.el9.x86_64.rpm
  • libibumad-51.0-1.el9.x86_64.rpm
  • librdmacm-51.0-1.el9.x86_64.rpm
  • libxcb-1.13.1-9.el9.x86_64.rpm
  • libxcb-devel-1.13.1-9.el9.x86_64.rpm
  • libxkbcommon-1.0.3-4.el9.x86_64.rpm
  • libxkbcommon-x11-1.0.3-4.el9.x86_64.rpm
  • pciutils-devel-3.7.0-5.el9.x86_64.rpm
  • rdma-core-devel-51.0-1.el9.x86_64.rpm
  • xcb-util-0.4.0-19.el9.x86_64.rpm
  • xcb-util-image-0.4.0-19.el9.x86_64.rpm
  • xcb-util-keysyms-0.4.0-17.el9.x86_64.rpm
  • xcb-util-renderutil-0.3.9-20.el9.x86_64.rpm
  • xcb-util-wm-0.4.1-22.el9.x86_64.rpm

Once we have all our rpms for the base image we can move on to creating the dockerfile.tools file which we will use to build our image.

$ cat <<'EOF' >dockerfile.tools
# Start from UBI9 image
FROM registry.access.redhat.com/ubi9/ubi:latest

# Set work directory
WORKDIR /root
RUN mkdir /root/rpms
COPY ./rpms/*.rpm /root/rpms/

# DNF install packages either from repo or locally
RUN dnf install `ls -1 /root/rpms/*.rpm` -y
RUN dnf install wget procps-ng pciutils jq iputils ethtool net-tools git autoconf automake libtool -y

# Cleanup
WORKDIR /root
RUN dnf clean all

# Run container entrypoint
COPY entrypoint.sh /root/entrypoint.sh
RUN chmod +x /root/entrypoint.sh
ENTRYPOINT ["/root/entrypoint.sh"]
EOF

We also need to create the entrypoint.sh script which is referenced in the dockerfile and does the heavy lifting of pulling in the cuda toolkit and the perftest suite.

$ cat <<'EOF' > entrypoint.sh
#!/bin/bash

# Set working dir
cd /root

# Configure and install cuda-toolkit
dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
dnf clean all
dnf -y install cuda-toolkit-12-6

# Export CUDA library paths
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH

# Git clone perftest repository
git clone https://github.com/linux-rdma/perftest.git

# Change into perftest directory
cd /root/perftest

# Build perftest with the cuda libraries included
./autogen.sh
./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h
make -j
make install

# Sleep container indefinitely
sleep infinity & wait
EOF

Next we can use the dockerfile we just created to build the base image.

$ podman build -f dockerfile.tools -t quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2 STEP 1/10: FROM registry.access.redhat.com/ubi9/ubi:latest STEP 2/10: WORKDIR /root --> Using cache 75f163f12503272b83e1137f7c1903520f84493ffe5aec0ef32ece722bd0d815 --> 75f163f12503 STEP 3/10: RUN mkdir /root/rpms --> Using cache ade32aa6605847a8b3f5c8b68cfcb64854dc01eece34868faab46137a60f946c --> ade32aa66058 STEP 4/10: COPY ./rpms/*.rpm /root/rpms/ --> Using cache 59dcef81d6675f44d22900f13a3e5441f5073555d7d2faa0b2f261f32e4ba6cd --> 59dcef81d667 STEP 5/10: RUN dnf install `ls -1 /root/rpms/*.rpm` -y --> Using cache ebb8b3150056240378ac36f7aa41d7f13b13308e9353513f26a8d3d70e618e3b --> ebb8b3150056 STEP 6/10: RUN dnf install wget procps-ng pciutils jq iputils ethtool net-tools git autoconf automake libtool -y --> Using cache 5ca85080c103ba559994906ada0417102f54f22c182bbc3a06913109855278cc --> 5ca85080c103 STEP 7/10: WORKDIR /root --> Using cache 68c8cd47a41bc364a0da5790c90f9aee5f8a8c7807732f3a5138bff795834fc1 --> 68c8cd47a41b STEP 8/10: RUN dnf clean all Updating Subscription Management repositories. Unable to read consumer identity This system is not registered with an entitlement server. You can use subscription-manager to register. 26 files removed --> a219fec5df49 STEP 9/10: COPY entrypoint.sh /root/entrypoint.sh --> aeb03bf74673 STEP 10/10: ENTRYPOINT ["/bin/bash", "/root/entrypoint.sh"] COMMIT quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2 --> 45c2113e5082 Successfully tagged quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2 45c2113e5082fb2f548b9e1b16c17524184c4079e2db77399519cf29829af1e7

Once the image is created we can push it to our favorite registry.

$ podman push quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
Getting image source signatures
Copying blob 62ee1c6c02d5 done
Copying blob 6027214db22e done
Copying blob 4822ebd5a418 done
Copying blob 422a0e40f90b done
Copying blob 5916e2b21ab2 done
Copying blob 10bf375a4d78 done
Copying blob ca1c18e183d5 done
Copying config 3bbb6e1f9b done
Writing manifest to image destination

Now that we have an image let's test it out on a cluster where we have compatible RDMA hardware configured. I am using the same setup as in a previous blog, so I am going to skip the details about setting up a service account and granting it the required privileges. We will however create the workload pod yaml files which we will use to deploy the image.

$ cat <<EOF >rdma-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-32-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
    name: rdma-32-workload
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF

$ cat <<EOF >rdma-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-33-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.2
    name: rdma-33-workload
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF

Next we can deploy the containers.

$ oc create -f rdma-32-workload.yaml
pod/rdma-eth-32-workload created
$ oc create -f rdma-33-workload.yaml
pod/rdma-eth-33-workload created

Validate the pods are up and running.

$ oc get pods
NAME                   READY   STATUS    RESTARTS   AGE
rdma-eth-32-workload   1/1     Running   0          51s
rdma-eth-33-workload   1/1     Running   0          47s

Now open two terminals and rsh into each pod, one per terminal, and validate that the perftest commands are present. We can also get the IP addresses of our pods' interfaces inside the containers.

$ oc rsh rdma-eth-32-workload sh-5.1# ib ib_atomic_bw ib_read_lat ib_write_bw ibcacheedit ibfindnodesusing.pl iblinkinfo ibping ibroute ibstatus ibtracert ib_atomic_lat ib_send_bw ib_write_lat ibccconfig ibhosts ibnetdiscover ibportstate ibrouters ibswitches ib_read_bw ib_send_lat ibaddr ibccquery ibidsverify.pl ibnodes ibqueryerrors ibstat ibsysstat sh-5.1# ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0@if96: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default link/ether 0a:58:0a:83:00:34 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.131.0.52/23 brd 10.131.1.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe83:34/64 scope link valid_lft forever preferred_lft forever 3: net1@if78: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default link/ether 32:1a:83:4a:e2:39 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 192.168.2.1/24 brd 192.168.2.255 scope global net1 valid_lft forever preferred_lft forever inet6 fe80::301a:83ff:fe4a:e239/64 scope link valid_lft forever preferred_lft forever $ oc rsh rdma-eth-33-workload sh-5.1# ib ib_atomic_bw ib_read_lat ib_write_bw ibcacheedit ibfindnodesusing.pl iblinkinfo ibping ibroute ibstatus ibtracert ib_atomic_lat ib_send_bw ib_write_lat ibccconfig ibhosts ibnetdiscover ibportstate ibrouters ibswitches ib_read_bw ib_send_lat ibaddr ibccquery ibidsverify.pl ibnodes ibqueryerrors ibstat ibsysstat sh-5.1# ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0@if105: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default link/ether 0a:58:0a:80:02:3d brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.128.2.61/23 brd 10.128.3.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe80:23d/64 scope link valid_lft forever preferred_lft forever 3: net1@if82: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default link/ether 22:3e:02:c9:d0:87 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 192.168.2.2/24 brd 192.168.2.255 scope global net1 valid_lft forever preferred_lft forever inet6 fe80::203e:2ff:fec9:d087/64 scope link valid_lft forever preferred_lft forever

Now let's run the RDMA perftest with the --use_cuda switch. Again we will need two rsh sessions, one on each pod. In the first terminal we can run the following.

sh-5.1# ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 192.168.2.1 --use_cuda=0
WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0

************************************
* Waiting for client to connect... *
************************************

In the second terminal we will run the following command which will dump the output.

sh-5.1# ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 192.168.2.2 --use_cuda=0 192.168.2.1 WARNING: BW peak won't be measured in this run. Perftest doesn't supports CUDA tests with inline messages: inline size set to 0 Requested mtu is higher than active mtu Changing to active mtu - 3 initializing CUDA Listing all CUDA devices in system: CUDA device 0: PCIe address is E1:00 Picking device No. 0 [pid = 4101, dev = 0] device name = [NVIDIA A40] creating CUDA Ctx making it the current CUDA Ctx CUDA device integrated: 0 cuMemAlloc() of a 2097152 bytes GPU buffer allocated GPU buffer address at 00007f3dfa600000 pointer=0x7f3dfa600000 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_1 Number of qps : 16 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON Lock-free : OFF ibv_wr* API : ON Using DDP : OFF TX depth : 128 CQ Moderation : 1 Mtu : 1024[B] Link type : Ethernet GID index : 3 Max inline data : 0[B] rdma_cm QPs : ON Data ex. method : rdma_cm TOS : 41 --------------------------------------------------------------------------------------- local address: LID 0000 QPN 0x00c6 PSN 0x2986aa GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00c7 PSN 0xa0ef83 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00c8 PSN 0x74badb GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00c9 PSN 0x287d57 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00ca PSN 0xf5b155 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00cb PSN 0x6cc15d GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00cc PSN 0x3730c2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00cd PSN 0x74d75d GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00ce PSN 0x51a707 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00cf PSN 0x987246 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d0 PSN 0xa334a8 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d1 PSN 0x5d8f52 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d2 PSN 0xc42ca0 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d3 PSN 0xf43696 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d4 PSN 0x43f9d2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 local address: LID 0000 QPN 0x00d5 PSN 0xbc4d64 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00c6 PSN 0xb1023e GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00c7 PSN 0xc78587 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00c8 PSN 0x5a328f GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00c9 PSN 0x582cfb GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00cb PSN 0x40d229 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00cc PSN 0x5833a1 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00cd 
PSN 0xcfefb6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00ce PSN 0xfd5d41 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00cf PSN 0xed811b GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d0 PSN 0x5244ca GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d1 PSN 0x946edc GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d2 PSN 0x4e0f76 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d3 PSN 0x7b13f4 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d4 PSN 0x1a2d5a GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d5 PSN 0xd22346 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00d6 PSN 0x722bc8 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 65536 10384867 0.00 181.46 0.346100 --------------------------------------------------------------------------------------- deallocating GPU buffer 00007f3dfa600000 destroying current CUDA Ctx

And if we return to the first terminal we should see it updated with the same output.

sh-5.1# ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 192.168.2.1 --use_cuda=0 WARNING: BW peak won't be measured in this run. Perftest doesn't supports CUDA tests with inline messages: inline size set to 0 ************************************ * Waiting for client to connect... * ************************************ Requested mtu is higher than active mtu Changing to active mtu - 3 initializing CUDA Listing all CUDA devices in system: CUDA device 0: PCIe address is 61:00 Picking device No. 0 [pid = 4109, dev = 0] device name = [NVIDIA A40] creating CUDA Ctx making it the current CUDA Ctx CUDA device integrated: 0 cuMemAlloc() of a 2097152 bytes GPU buffer allocated GPU buffer address at 00007f8bca600000 pointer=0x7f8bca600000 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_1 Number of qps : 16 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON Lock-free : OFF ibv_wr* API : ON Using DDP : OFF CQ Moderation : 1 Mtu : 1024[B] Link type : Ethernet GID index : 3 Max inline data : 0[B] rdma_cm QPs : ON Data ex. method : rdma_cm TOS : 41 --------------------------------------------------------------------------------------- Waiting for client rdma_cm QP to connect Please run the same command with the IB/RoCE interface IP --------------------------------------------------------------------------------------- local address: LID 0000 QPN 0x00c6 PSN 0xb1023e GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00c7 PSN 0xc78587 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00c8 PSN 0x5a328f GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00c9 PSN 0x582cfb GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00cb PSN 0x40d229 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00cc PSN 0x5833a1 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00cd PSN 0xcfefb6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00ce PSN 0xfd5d41 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00cf PSN 0xed811b GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d0 PSN 0x5244ca GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d1 PSN 0x946edc GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d2 PSN 0x4e0f76 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d3 PSN 0x7b13f4 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d4 PSN 0x1a2d5a GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d5 PSN 0xd22346 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 local address: LID 0000 QPN 0x00d6 PSN 0x722bc8 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:32 remote address: LID 0000 QPN 0x00c6 PSN 0x2986aa GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00c7 PSN 0xa0ef83 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00c8 PSN 0x74badb GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00c9 PSN 0x287d57 GID: 
00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00ca PSN 0xf5b155 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00cb PSN 0x6cc15d GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00cc PSN 0x3730c2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00cd PSN 0x74d75d GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00ce PSN 0x51a707 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00cf PSN 0x987246 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d0 PSN 0xa334a8 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d1 PSN 0x5d8f52 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d2 PSN 0xc42ca0 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d3 PSN 0xf43696 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d4 PSN 0x43f9d2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 remote address: LID 0000 QPN 0x00d5 PSN 0xbc4d64 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:06:145:33 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 65536 10384867 0.00 181.46 0.346100 --------------------------------------------------------------------------------------- deallocating GPU buffer 00007f8bca600000 destroying current CUDA Ctx

Hopefully this helped demonstrate a much cleaner and more automated way to build a CUDA-enabled perftest container for RDMA testing on OpenShift with the NVIDIA Network Operator and NVIDIA GPU Operator.

Monday, January 06, 2025

RDMA+CUDA with NVIDIA on OpenShift

In a previous blog I described how to configure an OpenShift cluster with RDMA when using the NVIDIA Network Operator and NVIDIA GPU Operator.  However in that blog we only did simple RDMA testing across the network interfaces with no involvement of the GPU.   In this blog I will extend the testing so it does involve the GPU and the CUDA libraries.   Keep in mind though this testing is for validating that the configuration is set up correctly and should not replace real-world workload testing of an application.

In this example we are using the same versions of OpenShift and the operators as in the previous blog so I will not go into those details here.  What we will capture below is how to configure the container appropriately to do the RDMA+CUDA testing.

The first thing we need to do is create a ServiceAccount in the default namespace. We can do so by generating the custom resource file below and creating it on the cluster.
$ cat <<EOF > default-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rdma
  namespace: default
EOF

$ oc create -f default-serviceaccount.yaml
serviceaccount/rdma created

Now that the rdma account is created let's give it privileged access.

$ oc -n default adm policy add-scc-to-user privileged -z rdma
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "rdma"

Next we will generate two pod custom resource files to run our workload pod image on the two baremetal a100 nodes in our environment.

$ cat <<EOF > rdma-eth-a100-01-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-a100-01-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: a100-1.private.openshiftvcn.schmaustech.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-eth-a100-01-workload
    command:
    - sh
    - -c
    - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF

$ cat <<EOF > rdma-eth-a100-02-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-a100-02-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: a100-2.private.openshiftvcn.schmaustech.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-eth-a100-02-workload
    command:
    - sh
    - -c
    - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF

With the pod files generated we can create them on the cluster.

$ oc create -f rdma-eth-a100-01-workload.yaml
pod/rdma-eth-a100-01-workload created
$ oc create -f rdma-eth-a100-02-workload.yaml
pod/rdma-eth-a100-02-workload created

Validate that the pods are running.

$ oc get pods
NAME                        READY   STATUS    RESTARTS   AGE
rdma-eth-a100-01-workload   1/1     Running   0          1m
rdma-eth-a100-02-workload   1/1     Running   0          1m

Next we can rsh into each of them in separate terminal windows.

$ oc rsh rdma-eth-a100-01-workload
sh-5.1# cd /root
sh-5.1#

$ oc rsh rdma-eth-a100-02-workload
sh-5.1# cd /root
sh-5.1#

Building RDMA Validation Tests

The next steps are required on both running pods and enable perftest to be built with CUDA-capable binaries.

First we need to download the CUDA repo, and since our image is Fedora 35 based we will pull down a Fedora 35 based package with wget. Note one might have to install wget first.

sh-5.1# wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm --2024-11-20 16:06:08-- https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.199.20.126 Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.199.20.126|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 3795608809 (3.5G) [application/x-rpm] Saving to: 'cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm' cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm 100%[=========================================================================================================================================>] 3.53G 28.0MB/s in 2m 10s 2024-11-20 16:08:18 (27.9 MB/s) - 'cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm' saved [3795608809/3795608809]

Once the package is downloaded install it with rpm command.

sh-5.1# rpm -i cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm
warning: cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm: Header V4 RSA/SHA512 Signature, key ID d42d0685: NOKEY

Then clean all local repos with dnf clean all.

sh-5.1# dnf clean all
42 files removed

And finally install the CUDA toolkit.

sh-5.1# dnf -y install cuda
Fedora 35 - x86_64 - Updates                                          32 MB/s |  34 MB     00:01
Fedora Modular 35 - x86_64 - Updates                                 7.2 MB/s | 3.9 MB     00:00
Dependencies resolved.
==============================================================================================================================================================================================================================================
 Package                                  Architecture              Version                              Repository                             Size
==============================================================================================================================================================================================================================================
Installing:
 cuda                                     x86_64                    11.7.0-1                             cuda-fedora35-11-7-local              2.7 k
Upgrading:
 systemd-libs                             x86_64                    249.13-6.fc35                        updates                               599 k
Installing dependencies:
 NetworkManager-libnm                     x86_64                    1:1.32.12-2.fc35                     updates                               1.7 M
 acl                                      x86_64                    2.3.1-2.fc35                         fedora                                 71 k
(...)
  tracker-3.2.1-1.fc35.x86_64                  tracker-miners-3.2.2-1.fc35.x86_64           ttmkfdir-3.0.9-64.fc35.x86_64
  tzdata-java-2022g-1.fc35.noarch              uchardet-0.0.6-14.fc35.x86_64                upower-0.99.13-1.fc35.x86_64
  vulkan-loader-1.3.204.0-1.fc35.x86_64        which-2.21-27.fc35.x86_64                    xcb-util-0.4.0-18.fc35.x86_64
  xcb-util-image-0.4.0-18.fc35.x86_64          xcb-util-keysyms-0.4.0-16.fc35.x86_64        xcb-util-renderutil-0.3.9-19.fc35.x86_64
  xcb-util-wm-0.4.1-21.fc35.x86_64             xkbcomp-1.4.5-2.fc35.x86_64                  xkeyboard-config-2.33-2.fc35.noarch
  xml-common-0.6.3-57.fc35.noarch              xorg-x11-drv-libinput-1.2.0-1.fc35.x86_64    xorg-x11-fonts-Type1-7.5-32.fc35.noarch
  xorg-x11-proto-devel-2021.5-1.fc35.noarch    xorg-x11-server-Xorg-1.20.14-9.fc35.x86_64   xorg-x11-server-common-1.20.14-9.fc35.x86_64
  xz-5.2.5-7.fc35.x86_64

Failed:
  nvidia-driver-cuda-3:515.43.04-1.fc35.x86_64                 nvidia-persistenced-3:515.43.04-1.fc35.x86_64

Error: Transaction failed

The CUDA toolkit installation will say transaction failed but this is okay. The necessary files were installed to provide what we need for building perftest.
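As a quick sanity check before building (optional; these are the default CUDA toolkit paths, and if the /usr/local/cuda symlink was not created you may need to look under /usr/local/cuda-11.7 instead), confirm the header and libraries perftest needs are present:

sh-5.1# ls /usr/local/cuda/include/cuda.h
sh-5.1# ls /usr/local/cuda/lib64/ | head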

Set the LD_LIBRARY_PATH and LIBRARY_PATH variables below.

sh-5.1# export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
sh-5.1# export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH

Next remove the existing /root/perftest directory in the pod and git clone down the perftest repository.

sh-5.1# rm -r -f perftest
sh-5.1# git clone https://github.com/linux-rdma/perftest.git
Cloning into 'perftest'...
remote: Enumerating objects: 6077, done.
remote: Counting objects: 100% (2157/2157), done.
remote: Compressing objects: 100% (398/398), done.
remote: Total 6077 (delta 1876), reused 1920 (delta 1747), pack-reused 3920 (from 1)
Receiving objects: 100% (6077/6077), 1.89 MiB | 43.11 MiB/s, done.
Resolving deltas: 100% (4826/4826), done.

Finally change into the perftest directory and build the binaries.

sh-5.1# cd perftest/ sh-5.1# ./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, 'config'. libtoolize: copying file 'config/ltmain.sh' libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'. libtoolize: copying file 'm4/libtool.m4' libtoolize: copying file 'm4/ltoptions.m4' libtoolize: copying file 'm4/ltsugar.m4' libtoolize: copying file 'm4/ltversion.m4' libtoolize: copying file 'm4/lt~obsolete.m4' libtoolize: 'AC_PROG_RANLIB' is rendered obsolete by 'LT_INIT' configure.ac:55: installing 'config/compile' configure.ac:59: installing 'config/config.guess' configure.ac:59: installing 'config/config.sub' configure.ac:36: installing 'config/install-sh' configure.ac:36: installing 'config/missing' Makefile.am: installing 'config/depcomp' configure: loading site script /usr/share/config.site checking for a BSD-compatible install... /usr/bin/install -c checking whether build environment is sane... yes checking for a thread-safe mkdir -p... /usr/bin/mkdir -p checking for gawk... gawk checking whether make sets $(MAKE)... yes checking whether make supports nested variables... yes checking whether make supports nested variables... (cached) yes checking for gcc... gcc checking whether the C compiler works... yes checking for C compiler default output file name... a.out checking for suffix of executables... checking whether we are cross compiling... no checking for suffix of object files... o checking whether we are using the GNU C compiler... yes checking whether gcc accepts -g... yes checking for gcc option to accept ISO C89... none needed checking whether gcc understands -c and -o together... yes checking whether make supports the include directive... yes (GNU style) checking dependency style of gcc... gcc3 checking for g++... g++ checking whether we are using the GNU C++ compiler... yes checking whether g++ accepts -g... yes checking dependency style of g++... gcc3 checking dependency style of gcc... gcc3 checking build system type... x86_64-pc-linux-gnu checking host system type... x86_64-pc-linux-gnu checking how to print strings... printf checking for a sed that does not truncate output... /usr/bin/sed checking for grep that handles long lines and -e... /usr/bin/grep checking for egrep... /usr/bin/grep -E checking for fgrep... /usr/bin/grep -F checking for ld used by gcc... /usr/bin/ld checking if the linker (/usr/bin/ld) is GNU ld... yes checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B checking the name lister (/usr/bin/nm -B) interface... BSD nm checking whether ln -s works... yes checking the maximum length of command line arguments... 1572864 checking how to convert x86_64-pc-linux-gnu file names to x86_64-pc-linux-gnu format... func_convert_file_noop checking how to convert x86_64-pc-linux-gnu file names to toolchain format... func_convert_file_noop checking for /usr/bin/ld option to reload object files... -r checking for objdump... objdump checking how to recognize dependent libraries... pass_all checking for dlltool... no checking how to associate runtime and link libraries... printf %s\n checking for ar... ar checking for archiver @FILE support... @ checking for strip... strip checking for ranlib... ranlib checking command to parse /usr/bin/nm -B output from gcc object... ok checking for sysroot... no checking for a working dd... /usr/bin/dd checking how to truncate binary pipes... /usr/bin/dd bs=4096 count=1 checking for mt... no checking if : is a manifest tool... 
no checking how to run the C preprocessor... gcc -E checking for ANSI C header files... yes checking for sys/types.h... yes checking for sys/stat.h... yes checking for stdlib.h... yes checking for string.h... yes checking for memory.h... yes checking for strings.h... yes checking for inttypes.h... yes checking for stdint.h... yes checking for unistd.h... yes checking for dlfcn.h... yes checking for objdir... .libs checking if gcc supports -fno-rtti -fno-exceptions... no checking for gcc option to produce PIC... -fPIC -DPIC checking if gcc PIC flag -fPIC -DPIC works... yes checking if gcc static flag -static works... no checking if gcc supports -c -o file.o... yes checking if gcc supports -c -o file.o... (cached) yes checking whether the gcc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes checking whether -lc should be explicitly linked in... no checking dynamic linker characteristics... GNU/Linux ld.so checking how to hardcode library paths into programs... immediate checking whether stripping libraries is possible... yes checking if libtool supports shared libraries... yes checking whether to build shared libraries... yes checking whether to build static libraries... yes checking how to run the C++ preprocessor... g++ -E checking for ld used by g++... /usr/bin/ld -m elf_x86_64 checking if the linker (/usr/bin/ld -m elf_x86_64) is GNU ld... yes checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes checking for g++ option to produce PIC... -fPIC -DPIC checking if g++ PIC flag -fPIC -DPIC works... yes checking if g++ static flag -static works... no checking if g++ supports -c -o file.o... yes checking if g++ supports -c -o file.o... (cached) yes checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes checking dynamic linker characteristics... (cached) GNU/Linux ld.so checking how to hardcode library paths into programs... immediate checking for ranlib... (cached) ranlib checking for ANSI C header files... (cached) yes checking infiniband/verbs.h usability... yes checking infiniband/verbs.h presence... yes checking for infiniband/verbs.h... yes checking for ibv_get_device_list in -libverbs... yes checking for rdma_create_event_channel in -lrdmacm... yes checking for umad_init in -libumad... yes checking for log in -lm... yes checking for ibv_reg_dmabuf_mr in -libverbs... yes checking pci/pci.h usability... yes checking pci/pci.h presence... yes checking for pci/pci.h... yes checking for pci_init in -lpci... yes checking for cuMemGetHandleForAddressRange in -lcuda... yes checking for efadv_create_qp_ex in -lefa... yes checking for mlx5dv_create_qp in -lmlx5... yes checking for hnsdv_query_device in -lhns... no checking that generated files are newer than configure... 
done configure: creating ./config.status config.status: creating Makefile config.status: creating config.h config.status: executing depfiles commands config.status: executing libtool commands config.status: executing man commands make all-am make[1]: Entering directory '/root/perftest' ln -s .././man/perftest.1 man/ib_read_bw.1 ln -s .././man/perftest.1 man/ib_write_bw.1 ln -s .././man/perftest.1 man/ib_send_bw.1 ln -s .././man/perftest.1 man/ib_atomic_bw.1 ln -s .././man/perftest.1 man/ib_write_lat.1 ln -s .././man/perftest.1 man/ib_read_lat.1 ln -s .././man/perftest.1 man/ib_send_lat.1 ln -s .././man/perftest.1 man/ib_atomic_lat.1 ln -s .././man/perftest.1 man/raw_ethernet_bw.1 ln -s .././man/perftest.1 man/raw_ethernet_lat.1 CC src/send_bw.o ln -s .././man/perftest.1 man/raw_ethernet_burst_lat.1 ln -s .././man/perftest.1 man/raw_ethernet_fs_rate.1 CC src/multicast_resources.o CC src/get_clock.o CC src/perftest_communication.o CC src/perftest_parameters.o CC src/perftest_resources.o CC src/perftest_counters.o CC src/host_memory.o CC src/mmap_memory.o CC src/cuda_memory.o CC src/raw_ethernet_resources.o CC src/send_lat.o CC src/write_lat.o CC src/write_bw.o CC src/read_lat.o CC src/read_bw.o CC src/atomic_lat.o CC src/atomic_bw.o CC src/raw_ethernet_send_bw.o CC src/raw_ethernet_send_lat.o CC src/raw_ethernet_send_burst_lat.o CC src/raw_ethernet_fs_rate.o AR libperftest.a CCLD ib_send_bw CCLD ib_write_lat CCLD ib_send_lat CCLD ib_write_bw CCLD ib_read_lat CCLD ib_read_bw CCLD ib_atomic_lat CCLD ib_atomic_bw CCLD raw_ethernet_bw CCLD raw_ethernet_lat CCLD raw_ethernet_burst_lat CCLD raw_ethernet_fs_rate make[1]: Leaving directory '/root/perftest'
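
Since the build log above is fairly dense, the whole sequence boils down to these commands, the same ones chained together in the run above:

sh-5.1# cd perftest/
sh-5.1# ./autogen.sh
sh-5.1# ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h
sh-5.1# make -j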

With the binaries built we can move on to running our validation tests.

Running RDMA Validation Tests

We should already have our workload pods running on the cluster in the default namespace.

$ oc get pods
NAME                        READY   STATUS    RESTARTS   AGE
rdma-eth-a100-01-workload   1/1     Running   0          15m
rdma-eth-a100-02-workload   1/1     Running   0          15m

Next we need to open two rsh connections, one into each pod.

$ oc rsh rdma-eth-a100-01-workload
sh-5.1#

$ oc rsh rdma-eth-a100-02-workload
sh-5.1#
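
Before launching the tests it can be worth confirming, inside each pod, the RDMA device name and IP address we will pass to ib_write_bw (mlx5_1 and 172.16.0.1/172.16.0.2 in this environment). Assuming the usual rdma-core and iproute utilities are present in the workload image, a quick check looks like this:

sh-5.1# ibv_devices        # lists the RDMA devices visible in the pod (mlx5_1 here)
sh-5.1# ip -br addr show   # confirms which interface carries the 172.16.0.x address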

Then, in the rsh connection into rdma-eth-a100-01-workload, we will start ib_write_bw in server mode; it will sit and wait for the client to connect. The flags select rdma_cm connection management (-R), a TOS of 41 (-T), a 65536-byte message size (-s), GID index 3 (-x), a 4096-byte MTU (-m), 16 queue pairs (-q), a 60-second run (-D), the mlx5_1 device (-d), port 10000 (-p) and bandwidth reporting in Gb/sec.

sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 172.16.0.1

 WARNING: BW peak won't be measured in this run.

************************************
* Waiting for client to connect... *
************************************

Then, in the second rsh connection into rdma-eth-a100-02-workload, we will run the matching client command, which adds the server's address (172.16.0.1) as the final argument. Note that this first test uses host memory only (no CUDA) and will take a few minutes.

sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 172.16.0.2 172.16.0.1 WARNING: BW peak won't be measured in this run. --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_1 Number of qps : 16 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON Lock-free : OFF ibv_wr* API : ON Using DDP : OFF TX depth : 128 CQ Moderation : 1 Mtu : 4096[B] Link type : Ethernet GID index : 3 Max inline data : 0[B] rdma_cm QPs : ON Data ex. method : rdma_cm TOS : 41 --------------------------------------------------------------------------------------- local address: LID 0000 QPN 0x00bd PSN 0x6e902d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00be PSN 0xdf3b13 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00bf PSN 0x14ba61 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00c0 PSN 0xd9209c GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00c1 PSN 0xc07f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00c2 PSN 0xf06575 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00c3 PSN 0x481230 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00c4 PSN 0xc1a69 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00c5 PSN 0x7c6e59 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00c6 PSN 0xf16f67 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00c7 PSN 0xe82e7f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00c8 PSN 0xf0a6a6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00c9 PSN 0x41069a GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00ca PSN 0xe2153f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00cb PSN 0xed2a91 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00cc PSN 0x2f3581 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 remote address: LID 0000 QPN 0x00c7 PSN 0xc8665d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00c8 PSN 0xce8d83 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00c9 PSN 0x4b7411 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00ca PSN 0x3a508c GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00cb PSN 0xb3c9af GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00cc PSN 0xac6ee5 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00cd PSN 0x12b6e0 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00ce PSN 0x8a5959 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00cf PSN 0x8da89 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00d0 PSN 0x2e9fd7 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00d1 PSN 0xba6e2f GID: 
00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00d2 PSN 0xede496 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00d3 PSN 0xfa05ca GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00d4 PSN 0x2bdcaf GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00d5 PSN 0xc5b541 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00d6 PSN 0x3c6271 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 65536 5296615 0.00 92.56 0.176539 ---------------------------------------------------------------------------------------
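
As an aside, if you would rather script this exchange than type it into two rsh sessions, the same test can be driven from outside the pods with oc exec. This is only a sketch using the pod names and addresses from this example; adjust the sleep so the server side is listening before the client starts:

$ oc exec rdma-eth-a100-01-workload -- /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 172.16.0.1 &
$ sleep 5
$ oc exec rdma-eth-a100-02-workload -- /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 172.16.0.2 172.16.0.1
$ wait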

The host-memory run above averaged about 92.6 Gb/sec. Now we are going to repeat the test, but this time target GPU memory by adding the --use_cuda switch to the command. In the first rsh connection run:

sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 172.16.0.1 --use_cuda=0

 WARNING: BW peak won't be measured in this run.
 Perftest doesn't supports CUDA tests with inline messages: inline size set to 0

************************************
* Waiting for client to connect... *
************************************

Then in the second rsh connection run the matching client command, again with the server's address (172.16.0.1) as the final argument.

sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 172.16.0.2 --use_cuda=0 172.16.0.1 WARNING: BW peak won't be measured in this run. Perftest doesn't supports CUDA tests with inline messages: inline size set to 0 initializing CUDA Listing all CUDA devices in system: CUDA device 0: PCIe address is 0F:00 Picking device No. 0 [pid = 4488, dev = 0] device name = [NVIDIA A100-SXM4-80GB] creating CUDA Ctx making it the current CUDA Ctx CUDA device integrated: 0 cuMemAlloc() of a 2097152 bytes GPU buffer allocated GPU buffer address at 00007fbebf200000 pointer=0x7fbebf200000 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_1 Number of qps : 16 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON Lock-free : OFF ibv_wr* API : ON Using DDP : OFF TX depth : 128 CQ Moderation : 1 Mtu : 4096[B] Link type : Ethernet GID index : 3 Max inline data : 0[B] rdma_cm QPs : ON Data ex. method : rdma_cm TOS : 41 --------------------------------------------------------------------------------------- local address: LID 0000 QPN 0x00ce PSN 0x282aa6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00cf PSN 0x3ab698 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00d0 PSN 0x9dd002 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00d1 PSN 0x11fc29 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00d2 PSN 0x72e988 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00d3 PSN 0xb5f44a GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00d4 PSN 0x1540e1 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00d5 PSN 0x8801c6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00d6 PSN 0xd77ef2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00d7 PSN 0xacf68c GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00d8 PSN 0x47f740 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00d9 PSN 0x286d3 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00da PSN 0xc1e7c3 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00db PSN 0xd8c9b4 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00dc PSN 0xf51e62 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 local address: LID 0000 QPN 0x00dd PSN 0x4bcb7e GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100 remote address: LID 0000 QPN 0x00d8 PSN 0x533e76 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00d9 PSN 0x1f0628 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00da PSN 0xf1052 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00db PSN 0xd23e39 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00dc PSN 0x696a58 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00dd PSN 0x25acda GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00de 
PSN 0x383631 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00df PSN 0x9054d6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00e0 PSN 0xc33cc2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00e1 PSN 0x55a81c GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00e2 PSN 0x62f190 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00e3 PSN 0x22fae3 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00e4 PSN 0x99b293 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00e5 PSN 0xb10444 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00e6 PSN 0x636db2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 remote address: LID 0000 QPN 0x00e7 PSN 0x45708e GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 65536 3994687 0.00 69.79 0.133122 --------------------------------------------------------------------------------------- deallocating GPU buffer 00007fbebf200000 destroying current CUDA Ctx
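
Note that the value passed to --use_cuda is the CUDA device index; the client output above shows perftest picking device 0, the A100 at PCIe address 0F:00. On a node with multiple GPUs you can pass a different index. Assuming nvidia-smi is available inside the workload pod, the indices can be listed with:

sh-5.1# nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv,noheader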

Once the tests complete we have confirmed that RDMA is working to both host memory (about 92.6 Gb/sec average) and GPU memory via CUDA (about 69.8 Gb/sec average), and we can now move on to our real-world workloads.

Hopefully this blog was useful in showing how to run RDMA and CUDA-enabled RDMA validation tests in an OpenShift environment.