SCHMAUSTECH: mellanox


Saturday, February 28, 2026

OpenShift Passthrough For Some

I wanted to provide a simple mechanism to bind some devices of a certain device type to vfio-pci while others of that same type remain in use by the base operating system. For example, on some Grace Hopper nodes the only network devices might be BlueField-3 interfaces. If I want one BlueField-3 to provide network access to the base operating system, I need to leave the kernel driver in place. However, I might want to take the additional BlueField-3 devices and use them in passthrough mode, which requires them to be unbound from the mlx5 driver and bound to vfio-pci. The following writeup provides a working example, first manually and then automatically in the context of OpenShift.

Why

There are going to be use cases where workloads running in virtual machines on OpenShift worker nodes need network devices in passthrough mode. This is not a problem when the OpenShift worker node's cluster interface is on a different network card type than those that need to be passed to the virtual machine. It does become an issue on systems outfitted with all the same network interface type, because the device id for all the network cards is the same. That means I cannot use the traditional method of enabling passthrough for the network cards, which involves blacklisting the network kernel driver from loading and then configuring the device ids to attach to the vfio-pci driver. If we implemented that on a system where all of the network cards are the same, then when the system rebooted to apply the MachineConfig the node would come up without any networking and show as NotReady. That is why the rest of this document demonstrates a different, practical approach to the problem.

Manually Configure

Kernel driver unbinding and binding was introduced in kernel 2.6.13 back in 2005, so it's a technology that has been around for quite some time. This is the exact feature we will use to make only some of our network cards vfio-pci bound. To begin, let's look at our network interfaces via lspci, filtered by the device id 15b3:a2dc. We can see that I have 4 network card ports on an OpenShift node, viewed from a debug pod.

sh-5.2# lspci -nn |grep 15b3:a2dc
0000:01:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0000:01:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0002:01:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0002:01:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)

Now let's examine the physical interface names for these 4 ports.

sh-5.2# grep PCI_SLOT_NAME /sys/class/net/*/device/uevent
/sys/class/net/enP2s2f0np0/device/uevent:PCI_SLOT_NAME=0002:01:00.0
/sys/class/net/enP2s2f1np1/device/uevent:PCI_SLOT_NAME=0002:01:00.1
/sys/class/net/enp1s0f0np0/device/uevent:PCI_SLOT_NAME=0000:01:00.0
/sys/class/net/enp1s0f1np1/device/uevent:PCI_SLOT_NAME=0000:01:00.1
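The same lookup can be scripted for a single PCI address. This is a minimal sketch and not part of the script used later in this post: the function name iface_for_pci and its optional second argument (an alternate sysfs root, only there so the function can be tried against a fake directory tree) are my own assumptions for illustration.

```shell
#!/bin/bash
# Print the network interface name that sits on a given PCI address.
# $1 = PCI address, e.g. 0002:01:00.0
# $2 = sysfs root (hypothetical knob for testing; defaults to /sys)
iface_for_pci() {
  local addr="$1"
  local root="${2:-/sys}"
  local uevent
  for uevent in "$root"/class/net/*/device/uevent; do
    # Skip the unexpanded glob when nothing matches.
    [ -f "$uevent" ] || continue
    # -x and -F: match the whole line literally, so "." is not a wildcard.
    if grep -qxF "PCI_SLOT_NAME=$addr" "$uevent"; then
      # Path layout is <root>/class/net/<iface>/device/uevent
      basename "$(dirname "$(dirname "$uevent")")"
      return 0
    fi
  done
  return 1
}
```

On the node this would be called as iface_for_pci 0002:01:00.0, which returns nothing once the device has been unbound from its network driver.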

Now we have to see which one is already in use by OpenShift so we do not inadvertently work with the wrong card. This will always be the one where the master-

sh-5.2# ovs-vsctl --no-heading --format=table --columns=name,type find Interface type=system| awk '{print $1}'
enp1s0f0np0

We can see enp1s0f0np0, which correlates to the 0000:01:00.0 card. So we will focus on 0002:01:00.0 and 0002:01:00.1.
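Note that 0000:01:00.1 is left alone as well, since it is the second port of the card that carries the cluster interface. A quick way to express "same card" in shell is to compare everything before the trailing function number. This is only a sketch; the helper name same_card is hypothetical, and it assumes that ports of one card share the same domain:bus:device, which holds for the addresses shown here.

```shell
#!/bin/bash
# Succeed when two PCI addresses differ only in their function number,
# i.e. they share the same domain:bus:device prefix.
# $1, $2 = PCI addresses such as 0000:01:00.0
same_card() {
  # ${var%.*} strips the shortest trailing ".<function>" part.
  [ "${1%.*}" = "${2%.*}" ]
}
```

For example, same_card 0000:01:00.0 0000:01:00.1 succeeds, while same_card 0000:01:00.0 0002:01:00.0 fails.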

Now that we have determined which cards we can use we will begin the process of unbinding them from their current driver which is mlx5_core.

echo -n "0002:01:00.0" > /sys/bus/pci/drivers/mlx5_core/unbind
echo -n "0002:01:00.1" > /sys/bus/pci/drivers/mlx5_core/unbind

At this point, if we look at the lspci output, we will see that these two devices no longer have a "Kernel driver in use" line. Only the two ports on the system network card still report a driver in use.

sh-5.2# lspci -k -s 0002:01:00.0
0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
    Subsystem: Mellanox Technologies Device 0009
    Kernel modules: mlx5_core
sh-5.2# lspci -k -s 0002:01:00.1
0002:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
    Subsystem: Mellanox Technologies Device 0009
    Kernel modules: mlx5_core

We are now ready for them to use the vfio-pci driver, but first we may need to load that driver.

modprobe vfio-pci

We can validate that the vfio-pci driver is loaded with lsmod.

sh-5.2# lsmod|grep vfio
vfio_pci               16384  0
vfio_pci_core          90112  1 vfio_pci
vfio_iommu_type1       49152  0
vfio                   73728  3 vfio_pci_core,vfio_iommu_type1,vfio_pci
iommufd               131072  1 vfio
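If this check needs to happen inside a script, it can read the module list directly. The sketch below is illustrative only: the function name module_loaded and the optional second argument (an alternate module list file, for trying it without touching the real one) are assumptions. It also normalises hyphens to underscores, because modprobe accepts vfio-pci while the loaded module is listed as vfio_pci.

```shell
#!/bin/bash
# Succeed when the named kernel module appears in the module list.
# $1 = module name (hyphens are normalised to underscores)
# $2 = module list file (hypothetical knob for testing; defaults to /proc/modules)
module_loaded() {
  local name
  name=$(printf '%s' "$1" | sed 's/-/_/g')
  local list="${2:-/proc/modules}"
  # The trailing space keeps "vfio" from matching "vfio_pci".
  grep -q "^${name} " "$list"
}
```

A typical use would be module_loaded vfio-pci || modprobe vfio-pci.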

Now that we have unbound the two devices from their driver, let's override the kernel driver they should use with vfio-pci.

sh-5.2# echo vfio-pci > /sys/bus/pci/devices/0002:01:00.0/driver_override
sh-5.2# echo vfio-pci > /sys/bus/pci/devices/0002:01:00.1/driver_override

With the vfio-pci driver override in place we can now bind our two devices to that driver.

sh-5.2# echo "0002:01:00.0" > /sys/bus/pci/drivers/vfio-pci/bind
sh-5.2# echo "0002:01:00.1" > /sys/bus/pci/drivers/vfio-pci/bind
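The three manual steps (unbind, override, bind) can be folded into one function. This is a minimal sketch under stated assumptions, not the script used later in this post: the name rebind_to_vfio is hypothetical, the optional second argument is an alternate sysfs root that exists only so the function can be exercised against a fake tree, and on a real node it needs root privileges and the vfio-pci module already loaded.

```shell
#!/bin/bash
# Move one PCI function from its current driver to vfio-pci.
# $1 = PCI address, e.g. 0002:01:00.0
# $2 = sysfs root (hypothetical knob for testing; defaults to /sys)
rebind_to_vfio() {
  local dev="$1"
  local root="${2:-/sys}"
  local devdir="$root/bus/pci/devices/$dev"

  [ -d "$devdir" ] || { echo "no such PCI device: $dev" >&2; return 1; }

  # Unbind only when a driver is currently attached; the "driver" entry
  # points at whichever driver owns the device, so mlx5_core is not hardcoded.
  if [ -e "$devdir/driver/unbind" ]; then
    printf '%s' "$dev" > "$devdir/driver/unbind" || return 1
  fi

  # Steer the next bind to vfio-pci, then bind the device.
  printf '%s\n' "vfio-pci" > "$devdir/driver_override" || return 1
  printf '%s' "$dev" > "$root/bus/pci/drivers/vfio-pci/bind"
}
```

Going through the device's own driver entry rather than a fixed /sys/bus/pci/drivers/mlx5_core path is a design choice that lets the same function work for other NIC drivers.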

And finally we can validate that those devices are now using the vfio-pci driver.

sh-5.2# lspci -k -s 0002:01:00.0
0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
    Subsystem: Mellanox Technologies Device 0009
    Kernel driver in use: vfio-pci
    Kernel modules: mlx5_core
sh-5.2# lspci -k -s 0002:01:00.1
0002:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
    Subsystem: Mellanox Technologies Device 0009
    Kernel driver in use: vfio-pci
    Kernel modules: mlx5_core
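The same validation can be done without parsing lspci output, because sysfs exposes the bound driver as a symlink named driver under each device. The following sketch assumes nothing beyond that layout; the function name current_driver and the optional sysfs root argument are hypothetical and only there for illustration and testing.

```shell
#!/bin/bash
# Print the driver currently bound to a PCI function, or "none".
# $1 = PCI address, e.g. 0002:01:00.0
# $2 = sysfs root (hypothetical knob for testing; defaults to /sys)
current_driver() {
  local dev="$1"
  local root="${2:-/sys}"
  local link="$root/bus/pci/devices/$dev/driver"
  if [ -L "$link" ]; then
    # The link target ends in the driver name, e.g. .../drivers/vfio-pci
    basename "$(readlink "$link")"
  else
    echo "none"
  fi
}
```

After the steps above, current_driver 0002:01:00.0 should report vfio-pci, and a device that has been unbound but not yet rebound reports none.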

Automatically Configure

While we can manually configure vfio-pci passthrough as we did above, that does not scale in a large cluster, especially across OpenShift upgrades, so we need something more automatic. The answer is twofold: first we need a script that automates the process above, and then a mechanism for running that script on OpenShift nodes.

For the automation script we can use the example code in this repository. The script identifies all the interfaces of a certain device type and then determines which ones can be used as passthrough devices. A device is excluded from passthrough if it has an OVS bridge associated with it. Once the script has identified the list, it unbinds the kernel driver in use on each device, then overrides the driver and binds the device to vfio-pci so it is available for passthrough.

Here is a manual run on the system we had to test on.

sh-5.2# ./passthrough-some-nics.sh -n 15b3:a2dc

 NIC Name     NIC Bus ID       Kernel Driver  OCP BR NIC     PassThru Eligible
====================================================================================================
 enp1s0f0np0    0000:01:00.0   mlx5_core      Yes            No            
 enp1s0f1np1    0000:01:00.1   mlx5_core      Yes            No            
 enP2s2f0np0    0002:01:00.0   mlx5_core      No             Yes           
 enP2s2f1np1    0002:01:00.1   mlx5_core      No             Yes           

Loading vfio-pci......Done!


Unbinding device 0002:01:00.0 from mlx5_core kernel driver...
Applying driver override to device 0002:01:00.0...
Binding device 0002:01:00.0 to vfio-pci...
Device kernel driver validation...
0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
    Subsystem: Mellanox Technologies Device 0009
    Kernel driver in use: vfio-pci
    Kernel modules: mlx5_core

Unbinding device 0002:01:00.1 from mlx5_core kernel driver...
Applying driver override to device 0002:01:00.1...
Binding device 0002:01:00.1 to vfio-pci...
Device kernel driver validation...
0002:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
    Subsystem: Mellanox Technologies Device 0009
    Kernel driver in use: vfio-pci
    Kernel modules: mlx5_core

Notice the script changes the kernel driver in use for the two devices. If we run the script again we should see that no changes can be made because there are no other eligible passthrough devices.

sh-5.2# ./passthrough-some-nics.sh -n 15b3:a2dc

 NIC Name     NIC Bus ID       Kernel Driver  OCP BR NIC     PassThru Eligible
====================================================================================================
 enp1s0f0np0    0000:01:00.0   mlx5_core      Yes            No            
 enp1s0f1np1    0000:01:00.1   mlx5_core      Yes            No            
 NA             0002:01:00.0   vfio-pci       No             Complete      
 NA             0002:01:00.1   vfio-pci       No             Complete      
vfio_pci 16384 0 - Live 0xffffb968aee88000

Now that we have seen the script work, let's make this more relatable to OpenShift. First we have to base64 encode the script by piping it through the base64 command.

$ BASE64_SCRIPT=$(cat passthrough-some-nics.sh | base64 -w 0)
$ echo $BASE64_SCRIPT
IyEvYmluL2Jhc2gKIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjCiMgVGhpcyBzY3JpcHQgcGFzc2VzIHRocm91Z2ggc29tZSBvZiB0aGUgTklDcyB3aGVuIGFsbCB0aGUgTklDcyBhcmUgdGhlIHNhbWUgZGV2aWNlIHR5cGUgICAgICAgICAgICAgICAgICAgIwojIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMKCiMgSG93IHRvIHVzZSB0aGUgc2NyaXB0IGlmIHVzZXIgZG9lcyBub3Qga25vdyBob3cKaG93dG8oKXsKICBlY2hvICJVc2FnZTogcGFzc3Rocm91Z2gtc29tZS1uaWNzLnNoIC1uIDxuaWMtZGV2aWNlLWlkPiIKICBlY2hvICJFeGFtcGxlIFNpbmdsZSBEZXZpY2UgSUQ6IHBhc3N0aHJvdWdoLXNvbWUtbmljcy5zaCAtbiAxNWIzOmEyZGMiCiAgZWNobyAiRXhhbXBsZSBNdWx0aSBEZXZpY2UgSUQ6IHBhc3N0aHJvdWdoLXNvbWUtbmljcy5zaCAtbiAxZGQ4OjEwMDJ8MTViMzoxMDIxIgp9CgojIEdldG9wdHMgc2V0dXAgZm9yIHZhcmlhYmxlcyB0byBwYXNzIGZyb20gb3B0aW9ucwp3aGlsZSBnZXRvcHRzIGc6bjp1OnI6aCBvcHRpb24KZG8KY2FzZSAiJHtvcHRpb259IgppbgpuKSBuaWNpZD0ke09QVEFSR307OwpoKSBob3d0bzsgZXhpdCAwOzsKXD8pIGhvd3RvOyBleGl0IDE7Owplc2FjCmRvbmUKCiMgTWFrZSBzdXJlIHRoZSB2YXJpYWJsZXMgYXJlIHBvcHVsYXRlZCB3aXRoIHZhbHVlcyBvdGhlcndpc2Ugc2hvdyBob3d0bwppZiAoWyAteiAiJG5pY2lkIiBdKSB0aGVuCiAgIGhvd3RvCiAgIGV4aXQgMQpmaQoKIyBTZXQgdGFibGUgaGVhZGVyIGZvcm1hdCAKZGl2aWRlcj09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09CmRpdmlkZXI9JGRpdmlkZXIkZGl2aWRlciRkaXZpZGVyCmhlYWRlcj0iXG4gJS0xMnMgJS0xNnMgJS0xNHMgJS0xNHMgJS0xNHNcbiIKZm9ybWF0PSIgJS0xNHMgJS0xNHMgJS0xNHMgJS0xNHMgJS0xNHNcbiIKd2lkdGg9MTAwCgojIFNsdXJwIGluIG5pYyBkZXZpY2UgdHlwZSBpZHMgZnJvbSBsc3BjaQpuaWNpZD1gZWNobyAkbmljaWQgfHNlZCAncy8sL1x8L2cnYAptYXBmaWxlIC10IG15X25pY3MgPCA8KGxzcGNpIC1ufGdyZXAgLUUgJG5pY2lkKQoKIyBQcmludCBvdXQgaGVhZGVycyAKcHJpbnRmICIkaGVhZGVyIiAiTklDIE5hbWUiICJOSUMgQnVzIElEIiAiS2VybmVsIERyaXZlciIgIk9DUCBCUiBOSUMiICJQYXNzVGhydSBFbGlnaWJsZSIKcHJpbnRmICIlJHdpZHRoLiR7d2lkdGh9c1xuIiAiJGRpdmlkZXIiCgojIEdyYWIgaW50ZXJmYWNlIGFzc29jaWF0ZWQgdG8gb3ZzLXN5c3RlbSBicmlkZ2UuICBCb25kcyBkbyBub3Qgd29yayBoZXJlIHlldApicnBoeWludD1gb3ZzLXZz
Y3RsIC0tbm8taGVhZGluZyAtLWZvcm1hdD10YWJsZSAtLWNvbHVtbnM9bmFtZSx0eXBlIGZpbmQgSW50ZXJmYWNlIHR5cGU9c3lzdGVtfCBhd2sgJ3twcmludCAkMX0nYApicnBoeWJ1cz1gZ3JlcCBQQ0lfU0xPVF9OQU1FIC9zeXMvY2xhc3MvbmV0LyovZGV2aWNlL3VldmVudHxncmVwICRicnBoeWludHwgYXdrIC1GICI9IiAne3ByaW50ICQyfSdgCgojIERlY2xhcmUgZW1wdHkgYXJyYXkgdG8gc3RvcmUgbmljIGRldGFpbHMgb24gdGhvc2UgdGhhdCBjYW4gYmUgdW5ib3VuZApkZWNsYXJlIC1hIHBhc3N0aHJvdWdoPSgpCgpmb3IgKCggbmljPTA7IG5pYzwkeyNteV9uaWNzW0BdfTsgbmljKysgKSkKZG8KICAgbmljYnVzaWQ9YGVjaG8gJHtteV9uaWNzWyRuaWNdfSB8IGF3ayAne3ByaW50ICQxfSdgCiAgIG5pY2tkcnY9YGxzcGNpIC1rbiAtcyAkbmljYnVzaWQgfCBncmVwICJLZXJuZWwgZHJpdmVyIGluIHVzZToifCBhd2sgLUYgIjogIiAne3ByaW50ICQyfSdgCiAgIG5pY25hbWU9YGdyZXAgUENJX1NMT1RfTkFNRSAvc3lzL2NsYXNzL25ldC8qL2RldmljZS91ZXZlbnR8Z3JlcCAkbmljYnVzaWR8IGF3ayAtRiAnLycgJ3twcmludCAkNX0nYAogICBpZiBbICIkbmljbmFtZSIgPSAiIiBdOyB0aGVuCiAgICAgIG5pY25hbWU9Ik5BIgogICBmaQoKICAgIyBPYnRhaW4gZmlyc3QgMTEgY2hhcmFjdGVycyBvZiBlYWNoIHZhcmlhYmxlIHN0cmluZyB0byB1c2UgZm9yIGNvbXBhcmUKICAgc3VibmljYnVzaWQ9IiR7bmljYnVzaWQ6MDoxMX0iCiAgIHN1YmJycGh5YnVzPSIke2JycGh5YnVzOjA6MTF9IgoKICAgIyBDb21wYXJlIHRoZSBzdWJzdHJpbmdzCiAgIGlmIFtbICIkc3VibmljYnVzaWQiID09ICIkc3ViYnJwaHlidXMiIF1dOyB0aGVuCiAgICAgIHN5c25pYz0iWWVzIgogICAgICBwYXNzdGhydT0iTm8iCiAgICAgICMgRGlzcGxheSB0byBjb25zb2xlIHRoZSBkZXRhaWxzCiAgICAgIHByaW50ZiAiJGZvcm1hdCIgJG5pY25hbWUgJG5pY2J1c2lkICRuaWNrZHJ2ICRzeXNuaWMgJHBhc3N0aHJ1CiAgIGVsc2UKICAgICAgc3lzbmljPSJObyIKICAgICAgaWYgWyAiJG5pY2tkcnYiID0gInZmaW8tcGNpIiBdOyB0aGVuCiAgICAgICAgIHBhc3N0aHJ1PSJDb21wbGV0ZSIKICAgICAgZWxzZQogICAgICAgICBwYXNzdGhydT0iWWVzIgogICAgICAgICBwYXNzdGhyb3VnaCs9KCIkbmljYnVzaWR8JG5pY2tkcnYiKQogICAgICBmaQogICAgICAjIERpc3BsYXkgdG8gY29uc29sZSB0aGUgZGV0YWlscwogICAgICBwcmludGYgIiRmb3JtYXQiICRuaWNuYW1lICRuaWNidXNpZCAkbmlja2RydiAkc3lzbmljICRwYXNzdGhydQogICBmaQpkb25lCgppZiAhIGdyZXAgLUUgIl52ZmlvX3BjaSAiIC9wcm9jL21vZHVsZXM7IHRoZW4KICBlY2hvICIgIgogIGVjaG8gLW4gIkxvYWRpbmcgdmZpby1wY2kuLi4iCiAgbW9kcHJvYmUgdmZpby1wY2kKICBlY2hvICIuLi5Eb25lISIKICBlY2hvICIgIgpmaQoKCmZvciAoKCBwYXNzPTA7IHBhc3M8JHsjcGFz
c3Rocm91Z2hbQF19OyBwYXNzKysgKSkKZG8KICAgbmljYnVzaWQ9YGVjaG8gJHtwYXNzdGhyb3VnaFskcGFzc119IHwgYXdrIC1GICJ8IiAne3ByaW50ICQxfSdgCiAgIG5pY2tkcnY9YGVjaG8gJHtwYXNzdGhyb3VnaFskcGFzc119IHwgYXdrIC1GICJ8IiAne3ByaW50ICQyfSdgCiAgIGVjaG8gIiAiCiAgIGVjaG8gIlVuYmluZGluZyBkZXZpY2UgJG5pY2J1c2lkIGZyb20gJG5pY2tkcnYga2VybmVsIGRyaXZlci4uLiIKICAgZWNobyAtbiAiJG5pY2J1c2lkIiA+IC9zeXMvYnVzL3BjaS9kcml2ZXJzL21seDVfY29yZS91bmJpbmQKICAgZWNobyAiQXBwbHlpbmcgZHJpdmVyIG92ZXJyaWRlIHRvIGRldmljZSAkbmljYnVzaWQuLi4iCiAgIGVjaG8gdmZpby1wY2kgPiAvc3lzL2J1cy9wY2kvZGV2aWNlcy8kbmljYnVzaWQvZHJpdmVyX292ZXJyaWRlCiAgIGVjaG8gIkJpbmRpbmcgZGV2aWNlICRuaWNidXNpZCB0byB2ZmlvLXBjaS4uLiIKICAgZWNobyAiJG5pY2J1c2lkIiA+IC9zeXMvYnVzL3BjaS9kcml2ZXJzL3ZmaW8tcGNpL2JpbmQKICAgZWNobyAiRGV2aWNlIGtlcm5lbCBkcml2ZXIgdmFsaWRhdGlvbi4uLiIKICAgbHNwY2kgLWsgLXMgJG5pY2J1c2lkCmRvbmUKZXhpdCAwCg==
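Before embedding a string this long in a MachineConfig, it can be worth confirming that it decodes back to exactly the file we started from. This is an optional sanity check I am sketching here, not a step from the original workflow; the function name b64_matches_file is hypothetical, and it relies on the GNU coreutils base64 already used above.

```shell
#!/bin/bash
# Succeed when a base64 string decodes back to exactly the given file.
# $1 = path to the original file
# $2 = base64 string (for example the contents of BASE64_SCRIPT)
b64_matches_file() {
  # cmp -s is silent and compares stdin ("-") with the file byte for byte.
  printf '%s' "$2" | base64 -d | cmp -s - "$1"
}
```

With the variable from above this would be called as b64_matches_file passthrough-some-nics.sh "$BASE64_SCRIPT" && echo "encoding OK".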

We will also set our device id variable, which gets embedded in the MachineConfig as the argument for the script. Please note that if we wanted to use multiple device ids we would pipe-delimit them.

$ DEVICEID="15b3:a2dc" # Single device id

$ DEVICEID="1dd8:1002|15b3:1021" # Multiple device ids
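The quotes around the value matter, since an unquoted | would start a shell pipeline. The pipe-delimited form works naturally as an extended regular expression alternation. The sketch below shows that idea on lspci -n style input; it assumes the value ends up as a grep -E pattern (an assumption about how the script consumes it), and the function name filter_device_ids is hypothetical.

```shell
#!/bin/bash
# Read `lspci -n` style lines on stdin and keep only those whose
# vendor:device field matches one of the pipe-delimited ids in $1.
# $1 = one id ("15b3:a2dc") or several ("1dd8:1002|15b3:1021")
filter_device_ids() {
  # The id must be a whole field: preceded by a space and followed by a
  # space or the end of the line.
  grep -E " ($1)( |\$)"
}
```

On a node this could be tried with lspci -n | filter_device_ids "$DEVICEID".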

We also have to set how long the script should wait to allow the system to come up. 120 seconds is a good rule of thumb.

$ SLP="120"

Then we have to configure a MachineConfig that will place the base64 encoded script on the system and establish a systemd service to run the script every time the node boots.

$ cat > passthrough-for-some-machineconfig.yaml << EOF
kind: MachineConfig
apiVersion: machineconfiguration.openshift.io/v1
metadata:
  name: passthrough-for-some-systemd-service
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - name: passthrough-for-some.service
        enabled: true
        contents: |
          [Unit]
          Description=Identifies and enabled passthough on select network interfaces
          After=NetworkManager-wait-online.service openvswitch.service
          Wants=NetworkManager-wait-online.service openvswitch.service
          [Service]
          RemainAfterExit=yes
          ExecStart=/etc/scripts/passthrough-some-nics.sh -n $DEVICEID -s $SLP
          Type=oneshot
          [Install]
          WantedBy=multi-user.target
    storage:
      files:
      - filesystem: root
        path: "/etc/scripts/passthrough-some-nics.sh"
        contents:
          source: data:text/plain;charset=utf-8;base64,$BASE64_SCRIPT
          verification: {}
        mode: 0755
        overwrite: true
EOF

Now let's create the MachineConfig on the cluster.

$ oc create -f passthrough-for-some-machineconfig.yaml 
machineconfig.machineconfiguration.openshift.io/passthrough-for-some-systemd-service created

We need to wait for the node to reboot. Once oc get mcp is responsive and confirms the node is updated we can start to validate.

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c88d4164a5bd26edb3d4025d24a5d2f8   True      False      False      1              1                   1                     0                      6d7h
worker   rendered-worker-9890b2fbe760e8e731e68bf217b87278   True      False      False      0              0                   0                     0                      6d7h

Let's check the status of the service on the node. We can see from the below output it already identified the interfaces that can be made passthrough.

# systemctl status passthrough-for-some.service 
● passthrough-for-some.service - Identifies and enabled passthough on select network interfaces
     Loaded: loaded (/etc/systemd/system/passthrough-for-some.service; enabled; preset: disabled)
     Active: activating (start) since Thu 2026-02-19 22:27:01 UTC; 5min ago
        Job: 408
 Invocation: 29eaf89183be4424a9f2fb4a2bd249a4
   Main PID: 4282 (passthrough-som)
      Tasks: 1 (limit: 3084134)
     Memory: 1.5M (peak: 10.8M)
        CPU: 213ms
     CGroup: /system.slice/passthrough-for-some.service
             └─4282 /bin/bash /etc/scripts/passthrough-some-nics.sh -n 15b3:a2dc

Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: ====================================================================================================
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]:  enp1s0f0np0    0000:01:00.0   mlx5_core      Yes            No
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]:  enp1s0f1np1    0000:01:00.1   mlx5_core      Yes            No
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]:  enP2s2f0np0    0002:01:00.0   mlx5_core      No             Yes
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]:  enP2s2f1np1    0002:01:00.1   mlx5_core      No             Yes
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]:  
Feb 19 22:32:02 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: Loading vfio-pci......Done!
Feb 19 22:32:02 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]:  
Feb 19 22:32:02 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]:  
Feb 19 22:32:02 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: Unbinding device 0002:01:00.0 from mlx5_core kernel driver...

Let's look at the lspci output for the devices we saw in the logs. We can see the first two interfaces stayed bound to mlx5_core because those ports are part of the same card and associated to the OVS bridge. The last two ports though were unbound from mlx5_core and bound to vfio-pci to enable passthrough.

# lspci -k -s 0000:01:00.0
0000:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
    Subsystem: Mellanox Technologies Device 0009
    Kernel driver in use: mlx5_core
    Kernel modules: mlx5_core

# lspci -k -s 0000:01:00.1
0000:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
    Subsystem: Mellanox Technologies Device 0009
    Kernel driver in use: mlx5_core
    Kernel modules: mlx5_core

# lspci -k -s 0002:01:00.0
0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
    Subsystem: Mellanox Technologies Device 0009
    Kernel driver in use: vfio-pci
    Kernel modules: mlx5_core

# lspci -k -s 0002:01:00.1
0002:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
    Subsystem: Mellanox Technologies Device 0009
    Kernel driver in use: vfio-pci
    Kernel modules: mlx5_core

One final thing we can do is run the script manually on the node again to also confirm our findings.

# /etc/scripts/passthrough-some-nics.sh -n 15b3:a2dc

 NIC Name     NIC Bus ID       Kernel Driver  OCP BR NIC     PassThru Eligible
====================================================================================================
 enp1s0f0np0    0000:01:00.0   mlx5_core      Yes            No            
 enp1s0f1np1    0000:01:00.1   mlx5_core      Yes            No            
 NA             0002:01:00.0   vfio-pci       No             Complete      
 NA             0002:01:00.1   vfio-pci       No             Complete      
vfio_pci 16384 0 - Live 0xffffd5d69072b000

OpenShift Virtualization Passthrough

Now that our devices are set to passthrough we can configure OpenShift Virtualization to see them as an available resource. We will need to edit the HyperConverged resource on our OpenShift cluster and add the following section.

  permittedHostDevices:
    pciHostDevices:
    - pciDeviceSelector: 15b3:a2dc
      resourceName: nvidia.com/BF3_CX7
  resourceRequirements:

We can make the edit by doing the following and inserting the section above right before the resourceRequirements section of the spec file.

$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited

Then we can confirm the resources are exposed by the OpenShift node using oc describe node.

$ oc describe node | grep -E 'Capacity:|Allocatable:' -A12
Capacity:
  cpu:                            72
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              936709572Ki
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  hugepages-32Mi:                 0
  hugepages-64Ki:                 0
  memory:                         493510268Ki
  nvidia.com/BF3_CX7:             2
  pods:                           250
Allocatable:
  cpu:                            71500m
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              862197798302
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  hugepages-32Mi:                 0
  hugepages-64Ki:                 0
  memory:                         492359292Ki
  nvidia.com/BF3_CX7:             2
  pods:                           250

Now when we go launch a virtual machine in OpenShift we will want to include the following section in our virtual machine spec file nested under spec->domain->devices.

          hostDevices:
            - deviceName: nvidia.com/BF3_CX7
              name: hostDevices-turquoise-hornet-42

And if all goes well once we launch our virtual machine and it's running we should be able to see the passthrough ethernet interface.

$ oc get vmi -n openshift-cnv
NAMESPACE       NAME                  AGE   PHASE     IP            NODENAME                                   READY
openshift-cnv   rhel9-red-locust-96   10m   Running   10.128.0.49   nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com   True

$ virtctl console rhel9-red-locust-96 -n openshift-cnv
Successfully connected to rhel9-red-locust-96 console. The escape sequence is ^]

rhel9-red-locust-96 login: cloud-user
Password: 
Last login: Fri Feb 20 08:08:53 on tty1

[cloud-user@rhel9-red-locust-96 ~]$ sudo bash
[root@rhel9-red-locust-96 cloud-user]# lspci -nn|grep Mellanox
0a:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)

Hopefully this provides a decent example of enabling passthrough for a subset of devices on a server where all the devices are the same but not all can be passed through due to the need for base networking at the OS level.

Saturday, January 04, 2025

RDMA with NVIDIA on OpenShift

The rise of artificial intelligence (AI) has generated some really challenging problems with data movement. In traditional environments, if I needed to move data from one node to another, it would need to be handled by the central processor (CPU) of the host. While this was reasonable with small amounts of data, a better and more efficient method is needed for AI workloads and their large datasets.

To solve this challenge we can use RDMA or remote direct memory access which enables direct memory access from the memory of one compute node to another compute node without involving the CPU of the hosts. This enables high-throughput, low-latency networking which is especially useful in massive compute clusters with large datasets.

The rest of this blog will cover examples of using RDMA with NVIDIA's Network Operator and GPU Operator along with Red Hat OpenShift Container Platform. The three primary examples covered in this document will be: RDMA Shared Device, RDMA Host Device and RDMA in Legacy SRIOV.

Lab Environment

The following configuration and testing were done in an OpenShift environment that consisted of the following:

OpenShift 4.16.19 x86
Network Operator 24.10
All other operators used the default values for OCP 4.16.
3 physical nodes: 1 SNO master, 2 workers
The workers consisted of Dell R760xa with 2 NVIDIA BF3 cards in them.
One BF3 card was attached to the NVIDIA Spectrum SN5600 switch for RDMA over ethernet
One BF3 card was attached to the NVIDIA Quantum QM9700 switch for RDMA over infiniband

Blacklist IRDMA Module

On some systems, including the Dell R750xa I used for testing, the irdma kernel module creates problems for the NVIDIA Network Operator on unload/load of the DOCA drivers, so we need to blacklist it with a machine configuration that gets applied to all worker nodes.

Generate the following machine configuration YAML file, specifying irdma as the module to blacklist.

$ cat <<EOF > 99-machine-config-blacklist-irdma.yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-blacklist-irdma
spec:
  kernelArguments:
    - "module_blacklist=irdma"
EOF

Then create the machine configuration on the cluster and wait for the worker nodes to reboot.

$ oc create -f 99-machine-config-blacklist-irdma.yaml 
machineconfig.machineconfiguration.openshift.io/99-worker-blacklist-irdma created

Validate in a debug pod on each node that the module has not loaded.

$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-32nvidiaengrdu2dcredhatcom-debug-btfj2 ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.11
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# lsmod|grep irdma
sh-5.1#

At this point, if everything looks good, we can move onto the next steps of the workflow.

Persistent Naming Rules

Sometimes there is a need to make sure device names persist across reboots. On the R760xa systems, where nodes had a large number of networking cards, I noticed the Mellanox devices were being renamed on reboots, so I decided to use a MachineConfig to set persistence.

First gather the MAC addresses from the worker node(s) into a file, along with the names the interfaces should persist under. We will call the file 70-persistent-net.rules and stash the details in it.

$ cat <<EOF > 70-persistent-net.rules
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:28",ATTR{type}=="1",NAME="ibs2f0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:29",ATTR{type}=="1",NAME="ens8f0np0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d0",ATTR{type}=="1",NAME="ibs2f0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d1",ATTR{type}=="1",NAME="ens8f0np0"
EOF
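With more than a handful of interfaces, typing these lines by hand invites mistakes, so the rule text can be generated from MAC address and name pairs instead. This is a small sketch of that idea; the helper name udev_rule is hypothetical, and it reproduces the rule format used in the file above unchanged.

```shell
#!/bin/bash
# Print one persistent-naming udev rule in the format used above.
# $1 = MAC address, e.g. b8:3f:d2:3b:51:28
# $2 = interface name that should persist, e.g. ibs2f0
udev_rule() {
  printf 'SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="%s",ATTR{type}=="1",NAME="%s"\n' "$1" "$2"
}
```

Calling udev_rule b8:3f:d2:3b:51:28 ibs2f0 prints the first line of the file above, so a loop over a list of pairs can redirect its output into 70-persistent-net.rules.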

Now we need to convert that file into a base64 string without line breaks and set the output to the variable PERSIST.

$ PERSIST=`cat 70-persistent-net.rules| base64 -w 0`

$ echo $PERSIST
U1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIK

Now we can create a machine configuration and set the base64 encoding in our custom resource file. Notice how I am using the PERSIST variable in my yaml creation to mitigate copy/paste type errors.

$ cat <<EOF > 99-machine-config-udev-network.yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
   labels:
     machineconfiguration.openshift.io/role: worker
   name: 99-machine-config-udev-network
spec:
   config:
     ignition:
       version: 3.2.0
     storage:
       files:
       - contents:
           source: data:text/plain;base64,$PERSIST
         filesystem: root
         mode: 420
         path: /etc/udev/rules.d/70-persistent-net.rules
EOF
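A note on mode: 420 in the file above: Ignition takes the file mode as a decimal integer, and 420 is the decimal form of the familiar octal permission 0644. The conversion is easy to check in the shell. The two helper names below are hypothetical and only illustrate the arithmetic.

```shell
#!/bin/bash
# Convert between octal permission strings and the decimal value
# written in the MachineConfig.
# octal_to_decimal: $1 = octal digits such as 644 or 0644
octal_to_decimal() {
  # A leading 0 makes printf read the number as octal.
  printf '%d\n' "0$1"
}
# decimal_to_octal: $1 = decimal value such as 420
decimal_to_octal() {
  printf '%o\n' "$1"
}
```

For example, octal_to_decimal 644 gives 420, and decimal_to_octal 420 gives 644.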

Once we have the machine configuration we can create it on the cluster.

$ oc create -f 99-machine-config-udev-network.yaml 
machineconfig.machineconfiguration.openshift.io/99-machine-config-udev-network created

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-9adfe851c2c14d9598eea5ec3df6c187   True      False      False      1              1                   1                     0                      6h21m
worker   rendered-worker-4568f1b174066b4b1a4de794cf538fee   False     True       False      2              0                   0                     0                      6h21m

The worker nodes will reboot and once the updating field goes back to false we can validate on the nodes by looking at the devices in a debug pod if we chose to do so.
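Rather than eyeballing the table, the pools that are still rolling out can be picked out of the oc get mcp output with a short filter. This is a minimal sketch that assumes the column layout shown above, where UPDATED is the third column; the function name pending_pools is hypothetical.

```shell
#!/bin/bash
# Read `oc get mcp` output on stdin and print the name of every pool
# whose UPDATED column (third field) is not yet True.
pending_pools() {
  # NR > 1 skips the header row.
  awk 'NR > 1 && $3 != "True" { print $1 }'
}
```

Running oc get mcp | pending_pools against the output above would print worker, and it prints nothing once every pool reports UPDATED as True.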

If everything looks good we can move onto configuring the operators of the OpenShift cluster.

Install and Configure Required Operators

This next section will cover the installation and configurations of the required operators we need for the RDMA testing.

Install and Configure NFD Operator

The Node Feature Discovery (NFD) operator manages the detection of hardware features and configuration in an OpenShift Container Platform cluster by labeling the nodes with hardware-specific information. NFD labels the host with node-specific attributes, such as PCI cards, kernel, operating system version, and so on.

To get started we will generate an NFD Operator manifest file that will create the namespace, operator group and subscription.

$ cat <<EOF > nfd-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
    - openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: "stable"
  installPlanApproval: Automatic
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

Next we can create the resources on the cluster.

$ oc create -f nfd-operator.yaml
namespace/openshift-nfd created
operatorgroup.operators.coreos.com/openshift-nfd created
subscription.operators.coreos.com/nfd created

We can validate that the operator is installed and running by looking at the pods in the openshift-nfd namespace.

$ oc get pods -n openshift-nfd
NAME                                      READY   STATUS    RESTARTS   AGE
nfd-controller-manager-8698c88cdd-t8gbc   2/2     Running   0          2m

With the NFD controller running we can move on to generating the NodeFeatureDiscovery instance and adding it to the cluster.

The ClusterServiceVersion specification for NFD operator provides default values, including the NFD operand image that is part of the operator payload. We retrieve its value with the following command line and assign it to the variable NFD_OPERAND_IMAGE.

$ NFD_OPERAND_IMAGE=$(oc get csv -n openshift-nfd -o json | jq -r '.items[0].metadata.annotations["alm-examples"]' | jq -r '.[] | select(.kind == "NodeFeatureDiscovery") | .spec.operand.image')

We can now create the NodeFeatureDiscovery instance. Note that we add entries to the default deviceClassWhitelist field to support more devices, such as the NVIDIA BlueField DPUs and the NVIDIA GPUs.

$ cat <<EOF > nfd-instance.yaml
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  instance: ''
  operand:
    image: '${NFD_OPERAND_IMAGE}'
    servicePort: 12000
  prunerOnDelete: false
  topologyUpdater: false
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
      sources:
        pci:
          deviceClassWhitelist:
            - "02"
            - "03"
            - "0200"
            - "0207"
            - "12"
          deviceLabelFields:
            - "vendor"
EOF

$ oc create -f nfd-instance.yaml
nodefeaturediscovery.nfd.openshift.io/nfd-instance created

Finally we can validate our instance is up and running by again looking at the pods under the openshift-nfd namespace.

$ oc get pods -n openshift-nfd
NAME                                    READY   STATUS    RESTARTS   AGE
nfd-controller-manager-7cb6d656-jcnqb   2/2     Running   0          4m
nfd-gc-7576d64889-s28k9                 1/1     Running   0          21s
nfd-master-b7bcf5cfd-qnrmz              1/1     Running   0          21s
nfd-worker-96pfh                        1/1     Running   0          21s
nfd-worker-b2gkg                        1/1     Running   0          21s
nfd-worker-bd9bk                        1/1     Running   0          21s
nfd-worker-cswf4                        1/1     Running   0          21s
nfd-worker-kp6gg                        1/1     Running   0          21s

After a minute or so, we can verify that NFD has added labels to the node. The NFD labels are prefixed with feature.node.kubernetes.io, so we can easily filter them.

$ oc get node -o json | jq '.items[0].metadata.labels | with_entries(select(.key | startswith("feature.node.kubernetes.io")))'
{
  "feature.node.kubernetes.io/cpu-cpuid.ADX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
  "feature.node.kubernetes.io/cpu-cpuid.CETSS": "true",
  "feature.node.kubernetes.io/cpu-cpuid.CLZERO": "true",
  "feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8": "true",
  "feature.node.kubernetes.io/cpu-cpuid.CPBOOST": "true",
  "feature.node.kubernetes.io/cpu-cpuid.EFER_LMSLE_UNS": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FMA3": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FP256": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FSRM": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FXSR": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FXSROPT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBPB": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBRS": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBRS_PREFERRED": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBRS_PROVIDES_SMP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBS": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBSFFV": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBS_FETCH_CTLX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBS_OPFUSE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBS_PREVENTHOST": "true",
  "feature.node.kubernetes.io/cpu-cpuid.INT_WBINVD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.INVLPGB": "true",
  "feature.node.kubernetes.io/cpu-cpuid.LAHF": "true",
  "feature.node.kubernetes.io/cpu-cpuid.LBRVIRT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MCOMMIT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MOVBE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MOVU": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MSRIRC": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MSR_PAGEFLUSH": "true",
  "feature.node.kubernetes.io/cpu-cpuid.NRIPS": "true",
  "feature.node.kubernetes.io/cpu-cpuid.OSXSAVE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.PPIN": "true",
  "feature.node.kubernetes.io/cpu-cpuid.PSFD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.RDPRU": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SEV": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SEV_64BIT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SEV_ALTERNATIVE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SEV_DEBUGSWAP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SEV_ES": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SEV_RESTRICTED": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SEV_SNP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SHA": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SME": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SME_COHERENT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SSE4A": "true",
  "feature.node.kubernetes.io/cpu-cpuid.STIBP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.STIBP_ALWAYSON": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SUCCOR": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SVM": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SVMDA": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SVMFBASID": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SVML": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SVMNP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SVMPF": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SVMPFT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SYSCALL": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SYSEE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.TLB_FLUSH_NESTED": "true",
  "feature.node.kubernetes.io/cpu-cpuid.TOPEXT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.TSCRATEMSR": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VAES": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VMCBCLEAN": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VMPL": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VMSA_REGPROT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VTE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.WBNOINVD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.X87": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XGETBV1": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XSAVE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XSAVEC": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XSAVES": "true",
  "feature.node.kubernetes.io/cpu-hardware_multithreading": "false",
  "feature.node.kubernetes.io/cpu-model.family": "25",
  "feature.node.kubernetes.io/cpu-model.id": "1",
  "feature.node.kubernetes.io/cpu-model.vendor_id": "AMD",
  "feature.node.kubernetes.io/kernel-config.NO_HZ": "true",
  "feature.node.kubernetes.io/kernel-config.NO_HZ_FULL": "true",
  "feature.node.kubernetes.io/kernel-selinux.enabled": "true",
  "feature.node.kubernetes.io/kernel-version.full": "5.14.0-427.35.1.el9_4.x86_64",
  "feature.node.kubernetes.io/kernel-version.major": "5",
  "feature.node.kubernetes.io/kernel-version.minor": "14",
  "feature.node.kubernetes.io/kernel-version.revision": "0",
  "feature.node.kubernetes.io/memory-numa": "true",
  "feature.node.kubernetes.io/network-sriov.capable": "true",
  "feature.node.kubernetes.io/pci-102b.present": "true",
  "feature.node.kubernetes.io/pci-10de.present": "true",
  "feature.node.kubernetes.io/pci-10de.sriov.capable": "true",
  "feature.node.kubernetes.io/pci-15b3.present": "true",
  "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true",
  "feature.node.kubernetes.io/rdma.available": "true",
  "feature.node.kubernetes.io/rdma.capable": "true",
  "feature.node.kubernetes.io/storage-nonrotationaldisk": "true",
  "feature.node.kubernetes.io/system-os_release.ID": "rhcos",
  "feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION": "4.17",
  "feature.node.kubernetes.io/system-os_release.OSTREE_VERSION": "417.94.202409121747-0",
  "feature.node.kubernetes.io/system-os_release.RHEL_VERSION": "9.4",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID": "4.17",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "4",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "17"
}

Finally we can confirm that a network device with vendor ID 15b3 was discovered.

$ oc describe node | grep -E 'Roles|pci' | grep pci-15b3
                    feature.node.kubernetes.io/pci-15b3.present=true
                    feature.node.kubernetes.io/pci-15b3.sriov.capable=true
                    feature.node.kubernetes.io/pci-15b3.present=true
                    feature.node.kubernetes.io/pci-15b3.sriov.capable=true
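
The grep above tells us the label exists but not which nodes carry it. As a small sketch, we could build a label selector from the vendor ID and let oc list the matching nodes. The helper name pci_present_selector is hypothetical, and the label format is simply copied from the NFD output shown above.

```shell
# Hypothetical helper: build the NFD "present" label selector for a PCI vendor ID.
# The label format is taken from the NFD labels listed above.
pci_present_selector() {
  printf 'feature.node.kubernetes.io/pci-%s.present=true\n' "$1"
}

# Example (needs cluster access), with 15b3 being the vendor ID seen above:
#   oc get nodes -l "$(pci_present_selector 15b3)"
```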

If everything looks good we can move on to the next operator.

Install and Configure NMState Operator

Some network interfaces on the nodes may not have been configured at initial cluster creation time, and the NMState operator is designed for that use case. The first step is to create a custom resource file that contains the namespace, operator group and subscription.

$ cat <<EOF > nmstate-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  labels:
    kubernetes.io/metadata.name: openshift-nmstate
    name: openshift-nmstate
  name: openshift-nmstate
spec:
  finalizers:
  - kubernetes
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  annotations:
    olm.providedAPIs: NMState.v1.nmstate.io
  name: openshift-nmstate
  namespace: openshift-nmstate
spec:
  targetNamespaces:
  - openshift-nmstate
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/kubernetes-nmstate-operator.openshift-nmstate: ""
  name: kubernetes-nmstate-operator
  namespace: openshift-nmstate
spec:
  channel: stable
  installPlanApproval: Automatic
  name: kubernetes-nmstate-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

Then we can take the custom resource file and create it on the cluster.

$ oc create -f nmstate-operator.yaml 
namespace/openshift-nmstate created
operatorgroup.operators.coreos.com/openshift-nmstate created
subscription.operators.coreos.com/kubernetes-nmstate-operator created

Next we should validate the operator is up and running.

$ oc get pods -n openshift-nmstate
NAME                               READY   STATUS    RESTARTS   AGE
nmstate-operator-d587966c9-qkl5m   1/1     Running   0          43s

An NMState instance is required, so we will create a custom resource file for that.

$ cat <<EOF > nmstate-instance.yaml
apiVersion: nmstate.io/v1
kind: NMState
metadata:
  name: nmstate
EOF

Then we will create the instance on the cluster.

$ oc create -f nmstate-instance.yaml 
nmstate.nmstate.io/nmstate created

Finally we will validate the instance is running.

$ oc get pods -n openshift-nmstate
NAME                                      READY   STATUS    RESTARTS   AGE
nmstate-cert-manager-6dc78dc6bf-ds7kj     1/1     Running   0          17s
nmstate-console-plugin-5b7595c56c-tgzbw   1/1     Running   0          17s
nmstate-handler-lxkd5                     1/1     Running   0          17s
nmstate-operator-d587966c9-qkl5m          1/1     Running   0          3m27s
nmstate-webhook-54dbd47d9d-cvsf6          0/1     Running   0          17s
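
Note that in the output above the nmstate-webhook pod is Running but still shows 0/1 in the READY column, so it is worth checking again a few seconds later. A minimal sketch for spotting such pods is below. The helper name pods_not_ready is hypothetical, and it assumes the default column layout of oc get pods.

```shell
# Hypothetical helper: print the name of every pod whose READY column is not n/n.
# Reads `oc get pods` output on stdin; pods in Completed status are skipped.
pods_not_ready() {
  awk 'NR > 1 && $3 != "Completed" {
    split($2, ready, "/")
    if (ready[1] != ready[2]) print $1
  }'
}

# Example (needs cluster access):
#   oc get pods -n openshift-nmstate | pods_not_ready
```

An empty result means every listed pod has all of its containers ready.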

Next we can build a NodeNetworkConfigurationPolicy. The example below will configure a static IP address on the ens8f0np0 interface on nvd-srv-32.

$ cat <<EOF > nncp-static-ip.yaml
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: ens8f0np0-policy 
spec:
  nodeSelector: 
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  desiredState:
    interfaces:
    - name: ens8f0np0 
      description: Configuring ens8f0np0 on nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
      type: ethernet 
      state: up 
      ipv4:
        dhcp: false 
        address:
        - ip: 10.6.145.32
          prefix-length: 24
        enabled: true
EOF

Once we have the custom resource file we can create it on the cluster.

$ oc create -f nncp-static-ip.yaml 
nodenetworkconfigurationpolicy.nmstate.io/ens8f0np0-policy created

$ oc get nncp -A
NAME               STATUS      REASON
ens8f0np0-policy   Available   SuccessfullyConfigured

We can validate that the IP address is set by looking at the interface from inside the node.

$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-32nvidiaengrdu2dcredhatcom-debug-8mx6q ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.11
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host

sh-5.1# ip address show dev ens8f0np0
96: ens8f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 58:a2:e1:e1:42:78 brd ff:ff:ff:ff:ff:ff
    altname enp160s0f0np0
    inet 10.6.145.32/24 brd 10.6.145.255 scope global noprefixroute ens8f0np0
       valid_lft forever preferred_lft forever
    inet6 fe80::c397:5afa:d618:e752/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
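
The same check can be scripted instead of read by eye. The sketch below is only an illustration: has_ipv4 is a hypothetical helper that scans ip address show output for an exact address/prefix match on an inet line.

```shell
# Hypothetical helper: succeed if the given IPv4 address/prefix appears on an
# "inet" line of `ip address show` output read from stdin.
has_ipv4() {
  awk -v want="$1" '$1 == "inet" && $2 == want { found = 1 } END { exit !found }'
}

# Example, from the debug shell after `chroot /host`:
#   ip address show dev ens8f0np0 | has_ipv4 10.6.145.32/24 && echo "address configured"
```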

If everything looks good we can proceed to the next operator.

Install and Configure SRIOV Operator

Now we need to create the SRIOV Operator custom resource file to create the namespace, operator group and subscription.

$ cat << EOF > openshift-sriov-network-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-sriov-network-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: sriov-network-operators
  namespace: openshift-sriov-network-operator
spec:
  targetNamespaces:
  - openshift-sriov-network-operator
  upgradeStrategy: Default
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: sriov-network-operator-subscription
  namespace: openshift-sriov-network-operator
spec:
  channel: stable
  installPlanApproval: Automatic
  name: sriov-network-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

Now we can create the SRIOV resource on the cluster.

$ oc create -f openshift-sriov-network-operator.yaml
namespace/openshift-sriov-network-operator created
operatorgroup.operators.coreos.com/sriov-network-operators created
subscription.operators.coreos.com/sriov-network-operator-subscription created

We can validate the operator is running by looking at the pod output.

$ oc get pods -n openshift-sriov-network-operator
NAME                                      READY   STATUS    RESTARTS   AGE
sriov-network-operator-7cb6c49868-89486   1/1     Running   0          22s

Next we will need to create the default SriovOperatorConfig configuration file.

$ cat <<EOF > sriov-operator-config.yaml 
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  enableInjector: true
  enableOperatorWebhook: true
  logLevel: 2
EOF

Then create the resource on the cluster.

$ oc create -f sriov-operator-config.yaml 
sriovoperatorconfig.sriovnetwork.openshift.io/default created

For the default SriovOperatorConfig to work with the MLNX_OFED container, please run the following patch command.

$ oc patch sriovoperatorconfig default   --type=merge -n openshift-sriov-network-operator   --patch '{ "spec": { "configDaemonNodeSelector": { "network.nvidia.com/operator.mofed.wait": "false", "node-role.kubernetes.io/worker": "", "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true" } } }'
sriovoperatorconfig.sriovnetwork.openshift.io/default patched

If everything looks good we can proceed to installing the next operator.

Install and Configure Network Operator

To get started we will generate an NVIDIA Network Operator custom resource file that will create the namespace, operator group and subscription.

$ cat <<EOF > network-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-network-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-network-operator
  namespace: nvidia-network-operator
spec:
  targetNamespaces:
  - nvidia-network-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nvidia-network-operator
  namespace: nvidia-network-operator
spec:
  channel: v24.10.0
  installPlanApproval: Automatic
  name: nvidia-network-operator
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF

Next we can create the resources on the cluster.

$ oc create -f network-operator.yaml 
namespace/nvidia-network-operator created
operatorgroup.operators.coreos.com/nvidia-network-operator created
subscription.operators.coreos.com/nvidia-network-operator created

We can then validate that the network operator has installed and is running by confirming the controller is running in the nvidia-network-operator namespace.

$ oc get pods -n nvidia-network-operator
NAME                                                          READY   STATUS             RESTARTS         AGE
nvidia-network-operator-controller-manager-6f7d6956cd-fw5wg   1/1     Running            0                5m

With the operator up we can create the NicClusterPolicy custom resource file. Note that in this file I have hard coded the InfiniBand interface as ibs2f0 and the Ethernet interface as ens8f0np0, which I will be using as my shared RDMA devices. These could be different devices depending on the system configuration.

$ cat <<EOF > network-sharedrdma-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  nicFeatureDiscovery:
    image: nic-feature-discovery
    repository: ghcr.io/mellanox
    version: v0.0.1
  docaTelemetryService:
    image: doca_telemetry
    repository: nvcr.io/nvidia/doca
    version: 1.16.5-doca2.6.0-host
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_ib",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ibs2f0"]
            }
          },
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ens8f0np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: v1.5.1
  secondaryNetwork:
    ipoib:
      image: ipoib-cni
      repository: ghcr.io/mellanox
      version: v1.2.0
  nvIpam:
    enableWebhook: false
    image: nvidia-k8s-ipam
    repository: ghcr.io/mellanox
    version: v0.2.0
  ofedDriver:
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    forcePrecompiled: false
    terminationGracePeriodSeconds: 300
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
        podSelector: ''
      maxParallelUpgrades: 1
      safeLoad: false
      waitForCompletion:
        timeoutSeconds: 0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.10-0.7.0.0-0
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
EOF

Next we can create the NicClusterPolicy custom resource on the cluster.

$ oc create -f network-sharedrdma-nic-cluster-policy.yaml 
nicclusterpolicy.mellanox.com/nic-cluster-policy created

We can validate the NicClusterPolicy by first confirming that the pods it deploys, including the DOCA/MOFED driver pod, are running.

$ oc get pods -n nvidia-network-operator
NAME                                                          READY   STATUS    RESTARTS   AGE
doca-telemetry-service-hwj65                                  1/1     Running   2          160m
kube-ipoib-cni-ds-fsn8g                                       1/1     Running   2          160m
mofed-rhcos4.16-9b5ddf4c6-ds-ct2h5                            2/2     Running   4          160m
nic-feature-discovery-ds-dtksz                                1/1     Running   2          160m
nv-ipam-controller-854585f594-c5jpp                           1/1     Running   2          160m
nv-ipam-controller-854585f594-xrnp5                           1/1     Running   2          160m
nv-ipam-node-xqttl                                            1/1     Running   2          160m
nvidia-network-operator-controller-manager-5798b564cd-5cq99   1/1     Running   2         5d23h
rdma-shared-dp-ds-p9vvg                                       1/1     Running   0          85m

And we can rsh into the mofed container to check a few things.

$ MOFED_POD=$(oc get pods -n nvidia-network-operator -o name | grep mofed)
$ oc rsh -n nvidia-network-operator -c mofed-container ${MOFED_POD}
sh-5.1# ofed_info -s
OFED-internal-24.10-0.7.0.0-0:
sh-5.1# ibdev2netdev -v
0000:0d:00.0 mlx5_0 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket                                                       fw 32.42.1000 port 1 (ACTIVE) ==> ibs2f0 (Up)
0000:a0:00.0 mlx5_1 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket                                                       fw 32.42.1000 port 1 (ACTIVE) ==> ens8f0np0 (Up)
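
The long ibdev2netdev -v lines are hard to scan when all we want is to confirm that the interface names hard coded in the NicClusterPolicy (ibs2f0 and ens8f0np0) really exist and are up. As a rough sketch, the hypothetical helper below trims each line to the RDMA device, the network interface and its state. It assumes the -v layout shown above, where the RDMA device is the second field.

```shell
# Hypothetical helper: reduce `ibdev2netdev -v` output (stdin) to
# "<rdma device> <netdev> <state>" by locating the "==>" separator.
rdma_netdevs() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($i == "==>") {
        state = $(i + 2)
        gsub(/[()]/, "", state)   # "(Up)" becomes "Up"
        print $2, $(i + 1), state
      }
  }'
}

# Example, from inside the mofed container:
#   ibdev2netdev -v | rdma_netdevs
```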

Now we need to create an IPoIBNetwork custom resource file (for InfiniBand based interfaces).

$ cat <<EOF > ipoib-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: example-ipoibnetwork
spec:
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.6.225/28",
      "exclude": [
       "192.168.6.229/30",
       "192.168.6.236/32"
      ]
    }
  master: ibs2f0
  networkNamespace: default
EOF

And then create the IPoIBNetwork resource on the cluster.

$ oc create -f ipoib-network.yaml 
ipoibnetwork.mellanox.com/example-ipoibnetwork created
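
The range and exclude values in that file are written as CIDR blocks, which are not always easy to picture. The sketch below is only an aid for the arithmetic: cidr_bounds is a hypothetical helper that prints the first and last address of the block containing a given address. It says nothing about how whereabouts itself treats the ends of the block or a start address that is not aligned to the block.

```shell
# Hypothetical helper: print the first and last address of an IPv4 CIDR block.
cidr_bounds() {
  local ip prefix a b c d addr mask first last
  ip=${1%/*}
  prefix=${1#*/}
  # Split the dotted quad into its four octets.
  IFS=. read -r a b c d <<EOF
$ip
EOF
  addr=$(( ($a << 24) | ($b << 16) | ($c << 8) | $d ))
  mask=$(( (4294967295 << (32 - $prefix)) & 4294967295 ))
  first=$(( $addr & $mask ))
  last=$(( $first | (~$mask & 4294967295) ))
  printf '%d.%d.%d.%d %d.%d.%d.%d\n' \
    $(( ($first >> 24) & 255 )) $(( ($first >> 16) & 255 )) \
    $(( ($first >> 8) & 255 )) $(( $first & 255 )) \
    $(( ($last >> 24) & 255 )) $(( ($last >> 16) & 255 )) \
    $(( ($last >> 8) & 255 )) $(( $last & 255 ))
}

# Example:
#   cidr_bounds 192.168.6.225/28
```

For the values used above, the /28 block spans 192.168.6.224 through 192.168.6.239, the excluded /30 covers 192.168.6.228 through 192.168.6.231, and the excluded /32 is the single address 192.168.6.236.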

We will do the same thing for our ethernet interface but this will be a MacvlanNetwork custom resource file.

$ cat <<EOF > macvlan-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdmashared-net
spec:
  networkNamespace: default
  master: ens8f0np0
  mode: bridge
  mtu: 1500
  ipam: '{"type": "whereabouts", "range": "192.168.2.0/24", "gateway": "192.168.2.1"}'
EOF

Then create the resource on the cluster.

$ oc create -f macvlan-network.yaml 
macvlannetwork.mellanox.com/rdmashared-net created

If everything looks good we can proceed to the next operator.

Install and Configure GPU Operator

The next operator we need to configure is the NVIDIA GPU Operator. As with most operators, we will need a namespace, operator group and subscription.

To get started we will generate an NVIDIA GPU Operator custom resource file that will create all three.

$ cat <<EOF > gpu-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
    - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nvidia-gpu-operator
  namespace: nvidia-gpu-operator
spec:
  channel: "v24.9"
  installPlanApproval: Automatic
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF

Next we can create the resources on the cluster.

$ oc create -f gpu-operator.yaml
namespace/nvidia-gpu-operator created
operatorgroup.operators.coreos.com/nvidia-gpu-operator created
subscription.operators.coreos.com/nvidia-gpu-operator created

We can check that the operator pod is running by looking at the pods under the namespace.

$ oc get pods -n nvidia-gpu-operator
NAME                          READY   STATUS    RESTARTS   AGE
gpu-operator-b4cb7d74-zxpwq   1/1     Running   0          32s

Now that we have the operator running we need to create a GPU cluster policy custom resource file like the one below.

$ cat <<EOF > gpu-cluster-policy.yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    serviceMonitor:
      enabled: true
    enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      nlsEnabled: true
      configMapName: ''
    certConfig:
      name: ''
    rdma:
      enabled: true
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    virtualTopology:
      config: ''
    enabled: true
    useNvidiaDriverCRD: false
    useOpenKernelModules: true
  devicePlugin:
    config:
      name: ''
      default: ''
    mps:
      root: /run/nvidia/mps
    enabled: true
  gdrcopy:
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: true
    image: nvidia-fs
    version: 2.20.5
    repository: nvcr.io/nvidia/cloud-native
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    installDir: /usr/local/nvidia
    enabled: true
EOF

With the GPU ClusterPolicy custom resource file generated, let's create it on the cluster.

$ oc create -f gpu-cluster-policy.yaml
clusterpolicy.nvidia.com/gpu-cluster-policy created

After some time, all the pods are up and running.

$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-d5ngn                           1/1     Running     0          3m20s
gpu-feature-discovery-z42rx                           1/1     Running     0          3m23s
gpu-operator-6bb4d4b4c5-njh78                         1/1     Running     0          4m35s
nvidia-container-toolkit-daemonset-bkh8l              1/1     Running     0          3m20s
nvidia-container-toolkit-daemonset-c4hzm              1/1     Running     0          3m23s
nvidia-cuda-validator-4blvg                           0/1     Completed   0          106s
nvidia-cuda-validator-tw8sl                           0/1     Completed   0          112s
nvidia-dcgm-exporter-rrw4g                            1/1     Running     0          3m20s
nvidia-dcgm-exporter-xc78t                            1/1     Running     0          3m23s
nvidia-dcgm-nvxpf                                     1/1     Running     0          3m20s
nvidia-dcgm-snj4j                                     1/1     Running     0          3m23s
nvidia-device-plugin-daemonset-fk2xz                  1/1     Running     0          3m23s
nvidia-device-plugin-daemonset-wq87j                  1/1     Running     0          3m20s
nvidia-driver-daemonset-416.94.202410211619-0-ngrjg   4/4     Running     0          3m58s
nvidia-driver-daemonset-416.94.202410211619-0-tm4x6   4/4     Running     0          3m58s
nvidia-node-status-exporter-jlzxh                     1/1     Running     0          3m57s
nvidia-node-status-exporter-zjffs                     1/1     Running     0          3m57s
nvidia-operator-validator-l49hx                       1/1     Running     0          3m20s
nvidia-operator-validator-n44nn                       1/1     Running     0          3m23s

Once we see the pods running above, we can remote shell into the NVIDIA driver daemonset pod and confirm two items. The first is that the nvidia kernel modules are loaded, specifically nvidia_peermem. The second is the driver and hardware details, which the nvidia-smi utility will show.

$ oc rsh -n nvidia-gpu-operator $(oc -n nvidia-gpu-operator get pod -o name -l app.kubernetes.io/component=nvidia-driver)
sh-4.4# lsmod|grep nvidia
nvidia_fs             327680  0
nvidia_peermem         24576  0
nvidia_modeset       1507328  0
video                  73728  1 nvidia_modeset
nvidia_uvm           6889472  8
nvidia               8810496  43 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
ib_uverbs             217088  3 nvidia_peermem,rdma_ucm,mlx5_ib
drm                   741376  5 drm_kms_helper,drm_shmem_helper,nvidia,mgag200

sh-4.4# nvidia-smi 
Wed Nov  6 22:03:53 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     On  |   00000000:61:00.0 Off |                    0 |
|  0%   37C    P0             88W /  300W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A40                     On  |   00000000:E1:00.0 Off |                    0 |
|  0%   28C    P8             29W /  300W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
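The module check above can also be scripted so it is easy to repeat on every node. The following is a minimal sketch, not part of the operator tooling: the function name check_modules is hypothetical, and it only assumes the lsmod output format shown above, where the module name is the first column.

```shell
# Reads `lsmod` output on stdin and reports whether each module named as an
# argument is loaded. Returns non-zero if any of them is missing.
# check_modules is a hypothetical helper name, not an NVIDIA tool.
check_modules() {
  listing=$(cat)
  status=0
  for mod in "$@"; do
    # Match the module name exactly against the first column.
    if printf '%s\n' "$listing" | awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }'; then
      echo "$mod: loaded"
    else
      echo "$mod: MISSING"
      status=1
    fi
  done
  return $status
}
```

Inside the driver pod this could be run as lsmod | check_modules nvidia nvidia_uvm nvidia_peermem.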

While we are in the driver pod we should also set the GPU clock to maximum using the following nvidia-smi command. This is optional but why not have it at full speed.

$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-416.94.202410172137-0-ndhzc
sh-4.4# nvidia-smi -i 0 -lgc $(nvidia-smi -i 0 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1)
GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:61:00.0
All done.
sh-4.4# nvidia-smi -i 1 -lgc $(nvidia-smi -i 1 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1)
GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:E1:00.0
All done.
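On nodes with more than two GPUs, repeating the command per index gets tedious. Here is a small sketch that loops over every GPU index instead. The function names are made up for illustration; the nvidia-smi options are the same ones used above plus --query-gpu=index to list the GPUs.

```shell
# Picks the largest numeric value from a list on stdin (one value per line).
max_value() {
  sort -n | tail -n 1
}

# Locks every GPU that nvidia-smi reports to its highest supported graphics
# clock. Hypothetical helper; it must run where nvidia-smi is available,
# for example inside the driver daemonset pod.
lock_all_gpus_to_max_clock() {
  for idx in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
    clk=$(nvidia-smi -i "$idx" --query-supported-clocks=graphics \
            --format=csv,noheader,nounits | max_value)
    nvidia-smi -i "$idx" -lgc "$clk"
  done
}
```

Calling lock_all_gpus_to_max_clock in the driver pod should produce one "GPU clocks set to" message per GPU, like the output above.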

One last thing we can do is validate that our resources are available from a node describe perspective.

$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A9
Capacity:
  cpu:                          128
  ephemeral-storage:            1561525616Ki
  hugepages-1Gi:                0
  hugepages-2Mi:                0
  memory:                       263596712Ki
  nvidia.com/gpu:               2
  pods:                         250
  rdma/rdma_shared_device_eth:  63
  rdma/rdma_shared_device_ib:   63
Allocatable:
  cpu:                          127500m
  ephemeral-storage:            1438028263499
  hugepages-1Gi:                0
  hugepages-2Mi:                0
  memory:                       262445736Ki
  nvidia.com/gpu:               2
  pods:                         250
  rdma/rdma_shared_device_eth:  63
  rdma/rdma_shared_device_ib:   63
--
Capacity:
  cpu:                          128
  ephemeral-storage:            1561525616Ki
  hugepages-1Gi:                0
  hugepages-2Mi:                0
  memory:                       263596672Ki
  nvidia.com/gpu:               2
  pods:                         250
  rdma/rdma_shared_device_eth:  63
  rdma/rdma_shared_device_ib:   63
Allocatable:
  cpu:                          127500m
  ephemeral-storage:            1438028263499
  hugepages-1Gi:                0
  hugepages-2Mi:                0
  memory:                       262445696Ki
  nvidia.com/gpu:               2
  pods:                         250
  rdma/rdma_shared_device_eth:  63
  rdma/rdma_shared_device_ib:   63
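If we only care about one resource, the describe output can be narrowed down with a little awk. This is a sketch under the assumption that the output keeps the Capacity:/Allocatable: layout shown above; allocatable_of is a hypothetical helper name.

```shell
# Reads `oc describe node` output on stdin and prints the value of one
# resource from each Allocatable section, one line per node.
# allocatable_of is a hypothetical helper name.
allocatable_of() {
  awk -v res="$1:" '
    /^Capacity:/    { section = "cap" }
    /^Allocatable:/ { section = "alloc" }
    section == "alloc" && $1 == res { print $2 }
  '
}
```

For example, oc describe node -l node-role.kubernetes.io/worker= | allocatable_of rdma/rdma_shared_device_ib prints one count per worker node.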

If everything looks good we can proceed to actual RDMA testing.

The Shared Device RDMA Testing

This section will cover running workload pods across the nodes in the environment. We will set up the required privileges, create the workload pods, validate connectivity between the two hosts on the infiniband fabric and then run a performance test.

Create Service Account

First let's generate a service account resource file to use in the default namespace.

$ cat <<EOF > default-serviceaccount.yaml 
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rdma
  namespace: default
EOF

Next we can create it on our cluster.

$ oc create -f default-serviceaccount.yaml 
serviceaccount/rdma created

Finally, with the service account created, we can add privileges to it.

$ oc -n default adm policy add-scc-to-user privileged -z rdma
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "rdma"

If everything looks good we can move onto creating the workload pods.

Create Workload Pods for IB

With the service account set up we now need to create a workload pod that contains all the tooling for our testing. We can generate a custom pod resource file for each worker node as follows to meet that requirement.

$ cat <<EOF > rdma-ib-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-ib-32-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
spec:
  nodeSelector: 
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-ib-32-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
EOF

$ cat <<EOF > rdma-ib-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-ib-33-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
spec:
  nodeSelector: 
    kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-ib-33-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
EOF
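The two files above differ only in the pod name and the node hostname, so they could also be rendered from a single template. The sketch below shows one way to do that; render_ib_workload is a hypothetical function name and the manifest body is copied from the files above.

```shell
# Prints an IB workload pod manifest for one node.
#   $1 = short label used in the pod name (for example 32)
#   $2 = value of the node's kubernetes.io/hostname label
# render_ib_workload is a hypothetical helper, not part of any operator.
render_ib_workload() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: rdma-ib-$1-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
spec:
  nodeSelector:
    kubernetes.io/hostname: $2
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-ib-$1-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
EOF
}
```

Redirecting the output, as in render_ib_workload 33 nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com > rdma-ib-33-workload.yaml, gives the same file as the heredoc version.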

Then we can create the pods on the cluster.

$ oc create -f rdma-ib-32-workload.yaml
pod/rdma-ib-32-workload created

$ oc create -f rdma-ib-33-workload.yaml
pod/rdma-ib-33-workload created

Let's validate the pods are running.

$ oc get pods
NAME                  READY   STATUS    RESTARTS   AGE
rdma-ib-32-workload   1/1     Running   0          10s
rdma-ib-33-workload   1/1     Running   0          3s
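When this check is part of a script, the table can be evaluated instead of read by eye. A minimal sketch, assuming the default oc get pods column order shown above (NAME, READY, STATUS); all_pods_ready is a hypothetical name.

```shell
# Reads `oc get pods` output on stdin. Succeeds only if at least one pod is
# listed and every pod has STATUS Running with a READY column of n/n.
# all_pods_ready is a hypothetical helper name.
all_pods_ready() {
  awk '
    NR == 1 { next }                          # skip the header row
    { split($2, r, "/"); seen++ }             # READY column, e.g. 1/1
    $3 != "Running" || r[1] != r[2] { bad++ }
    END { exit (seen == 0 || bad > 0) }
  '
}
```

A simple retry loop such as until oc get pods -n default | all_pods_ready; do sleep 2; done would block until both pods are up. The built-in oc wait --for=condition=Ready pod/rdma-ib-32-workload is another option.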

With the pods up and running we can validate connectivity.

Validate IB Connectivity

This section will cover confirming the infiniband connectivity is working between the systems. This section is optional but provides a lot of good infiniband troubleshooting tips. First we should rsh into each workload pod.

$ oc rsh -n default rdma-ib-32-workload 
sh-5.1#

The first command we can run is the ibhosts command which shows infiniband host nodes in topology.

sh-5.1# ibhosts
Ca    : 0x58a2e10300e14446 ports 1 "nvd-srv-33 mlx5_0"
Ca    : 0x58a2e10300dfe416 ports 1 "nvd-srv-32 mlx5_0"

We can also run the ibnodes command which will show not only the nodes but also switches in the topology.

sh-5.1# ibnodes
Ca    : 0x58a2e10300e14446 ports 1 "nvd-srv-33 mlx5_0"
Ca    : 0x58a2e10300dfe416 ports 1 "nvd-srv-32 mlx5_0"
Switch    : 0xfc6a1c0300e7ecc0 ports 129 "MF0;qm9700-ib:MQM9700/U1" enhanced port 0 lid 1 lmc 0

We can look deeper into an interface's state by passing the interface name to the ibstatus command. If no interface is passed, all interfaces are displayed.

sh-5.1# ibstatus mlx5_0
Infiniband device 'mlx5_0' port 1 status:
    default gid:     fe80:0000:0000:0000:58a2:e103:00df:e416
    base lid:     0x4
    sm lid:         0x1
    state:         4: ACTIVE
    phys state:     5: LinkUp
    rate:         400 Gb/sec (4X NDR)
    link_layer:     InfiniBand

Now that we have familiarized ourselves with the environment we can run ibstat and grep out only certain key elements of the output. These will be needed for the ibping test.

The first ibstat output is that of our first node which will act as the server side for the ibping command.

sh-5.1# ibstat | egrep "Port|Base|Link"
    Port 1:
        Physical state: LinkUp
        Base lid: 4
        Port GUID: 0x58a2e10300e14446
        Link layer: InfiniBand
    Port 1:
        Physical state: LinkUp
        Base lid: 0
        Port GUID: 0x0000000000000000
        Link layer: Ethernet

The output above shows both an infiniband and ethernet interface. We are only interested in the infiniband in this use case. Make note of the lid number as that is used in the ibping command on the client side.

We can run the same command on the client side and notice while some of the details are similar the lid number is unique along with the port GUID.

sh-5.1# ibstat | egrep "Port|Base|Link"
    Port 1:
        Physical state: LinkUp
        Base lid: 5
        Port GUID: 0x58a2e10300e14446
        Link layer: InfiniBand
    Port 1:
        Physical state: LinkUp
        Base lid: 0
        Port GUID: 0x0000000000000000
        Link layer: Ethernet
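Since only the lid of the infiniband port matters for ibping, it can be pulled out of the ibstat output directly. This sketch assumes the field order shown above, where Base lid appears before Link layer for each port; ib_base_lids is a hypothetical name.

```shell
# Reads `ibstat` output on stdin and prints the Base lid of every port whose
# link layer is InfiniBand. ib_base_lids is a hypothetical helper name.
ib_base_lids() {
  awk '
    /Base lid:/   { lid = $NF }                        # remember the latest lid
    /Link layer:/ { if ($NF == "InfiniBand") print lid }
  '
}
```

Running ibstat | ib_base_lids in the server pod gives the number to pass to ibping on the client side.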

Next we can run an ibping with the server switch on the first workload pod.

sh-5.1# ibping -S -P 1 -d
ibdebug: [114] ibping_serv: starting to serve...
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)

And on the second workload pod we can run an ibping command to ping the server side we started on the other pod.

sh-5.1# ibping -P 1 4 
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.011 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.014 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
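To turn the stream of pongs into a single number, the client output can be summarized. The following is a sketch that assumes the line format shown above, with the latency as the second to last field; ibping_summary is a hypothetical name.

```shell
# Reads ibping client output on stdin and prints the number of replies and
# their mean latency in milliseconds. ibping_summary is a hypothetical helper.
ibping_summary() {
  awk '
    /^Pong from/ { sum += $(NF - 1); n++ }
    END {
      if (n > 0) printf "replies=%d avg_ms=%.4f\n", n, sum / n
      else print "replies=0"
    }
  '
}
```

For example, ibping -c 10 -P 1 4 | ibping_summary sends ten pings and prints one summary line.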

Once we have completed confirming connectivity we can move onto the performance testing.

Performance Test Across IB Link

Now we want to run a test across the two pods running. We will need to rsh into the first pod and run the ib_write_bw command. Then we will rsh into the second pod in a different terminal window and run the ib_write_bw <ipaddress> command.

$ oc get pods -n default
NAME                  READY   STATUS    RESTARTS   AGE
rdma-ib-32-workload   1/1     Running   0          8m12s
rdma-ib-33-workload   1/1     Running   0          8m5s

First let's get the ipaddress of the first pod.

$ oc get pod rdma-ib-32-workload -o yaml | grep -E 'default/example-ipoibnetwork' -A3
          "name": "default/example-ipoibnetwork",
          "interface": "net1",
          "ips": [
              "192.168.6.225"

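The address lookup above can be wrapped so the IP lands in a variable. This is a sketch that relies only on the annotation layout shown in the grep output, where the ips list follows the network name; network_ip is a hypothetical name.

```shell
# Reads a pod manifest (oc get pod -o yaml) on stdin and prints the first
# IPv4 address that follows the named network in the network-status
# annotation. network_ip is a hypothetical helper name.
network_ip() {
  awk -v net="\"name\": \"$1\"" '
    index($0, net) { found = 1; next }
    found && match($0, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/) {
      print substr($0, RSTART, RLENGTH)
      exit
    }
  '
}
```

With that in place, SERVER_IP=$(oc get pod rdma-ib-32-workload -o yaml | network_ip default/example-ipoibnetwork) captures the address for the client command.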
Now rsh into the first pod and run the ib_write_bw command and leave that terminal open.

$ oc rsh -n default rdma-ib-32-workload
sh-5.1# ib_write_bw

************************************
* Waiting for client to connect... *
************************************

Then open another terminal and rsh to the second pod and run ib_write_bw 192.168.6.225.

$ oc rsh -n default rdma-ib-33-workload
sh-5.1# ib_write_bw 192.168.6.225      
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF        Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x05 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007fcbace2f000
 remote address: LID 0x04 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007f360e3d8000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 2500.000000 != 3495.887000. CPU Frequency is not max.
 65536      5000             44604.62            44576.86           0.713230
---------------------------------------------------------------------------------------

If we go back to the first terminal on pod number one we should also see similar response results.

sh-5.1# ib_write_bw

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF        Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x04 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007f360e3d8000
 remote address: LID 0x05 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007fcbace2f000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      5000             44604.62            44576.86           0.713230
---------------------------------------------------------------------------------------
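The result is reported in MB/sec while the link rate is quoted in Gb/sec, so a conversion helps when comparing the two. The sketch below pulls the BW average column from the result row and converts it. It assumes perftest counts 1 MB as 1048576 bytes, which is my understanding of its default MB/sec report but is worth confirming for your version. Both function names are hypothetical.

```shell
# Reads ib_write_bw output on stdin and prints the BW average column (MB/sec)
# from each result row, i.e. rows with five fields starting with a byte count.
bw_average_mb() {
  awk 'NF == 5 && $1 ~ /^[0-9]+$/ && $4 ~ /^[0-9.]+$/ { print $4 }'
}

# Converts MB/sec to Gb/sec.
# ASSUMPTION: 1 MB = 1048576 bytes and 1 Gb = 1000000000 bits.
mb_to_gbit() {
  awk -v mb="$1" 'BEGIN { printf "%.2f\n", mb * 1048576 * 8 / 1000000000 }'
}
```

Under that assumption the 44576.86 MB/sec average above works out to roughly 374 Gb/sec, which is in line with the 400 Gb/sec NDR rate that ibstatus reported.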

We can now clean up the pods since testing is over and move onto the next test.
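The cleanup itself is a plain pod delete. Here is a small sketch of a helper for it; cleanup_pods is a hypothetical name, and the OC variable is only there so the command can be previewed with OC=echo before running it for real.

```shell
# Deletes the named pods from a namespace.
#   $1 = namespace, remaining arguments = pod names
# Set OC=echo to print the command instead of running it.
# cleanup_pods is a hypothetical helper name.
cleanup_pods() {
  ns=$1
  shift
  "${OC:-oc}" delete pod -n "$ns" "$@" --ignore-not-found
}
```

For the IB test that would be cleanup_pods default rdma-ib-32-workload rdma-ib-33-workload.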

Create Workload Pods for ETH

Now we need to test RDMA over the ethernet shared device. We can generate a custom pod resource file for both nodes as follows to meet that requirement.

$ cat <<EOF > rdma-eth-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-32-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector: 
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-eth-32-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1

EOF

$ cat <<EOF > rdma-eth-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-33-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector: 
    kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-eth-33-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF

Then we can create the pods on the cluster.

$ oc create -f rdma-eth-32-workload.yaml
pod/rdma-eth-32-workload created

$ oc create -f rdma-eth-33-workload.yaml
pod/rdma-eth-33-workload created

Let's validate the pods are running.

$ oc get pods -n default
NAME                   READY   STATUS    RESTARTS   AGE
rdma-eth-32-workload   1/1     Running   0          25s
rdma-eth-33-workload   1/1     Running   0          22s

With the pods up and running we can move onto the actual test.

Performance Test Across ETH Link

$ oc get pods -n default
NAME                   READY   STATUS    RESTARTS   AGE
rdma-eth-32-workload   1/1     Running   0          106s
rdma-eth-33-workload   1/1     Running   0          103s

First let's get the ipaddress of the first pod.

$ oc get pod rdma-eth-32-workload -o yaml | grep -E 'default/rdmashared' -A3
          "name": "default/rdmashared-net",
          "interface": "net1",
          "ips": [
              "192.168.2.1"

Now rsh into the first pod and run the ib_write_bw command and leave that terminal open.

$ oc rsh -n default rdma-eth-32-workload
sh-5.1# ib_write_bw

************************************
* Waiting for client to connect... *
************************************

Then open another terminal and rsh to the second pod and run ib_write_bw 192.168.2.1.

$ oc rsh -n default rdma-eth-33-workload
sh-5.1# ib_write_bw 192.168.2.1
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF        Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x05 QPN 0x0ce2 PSN 0x5389f7 RKey 0x1fff00 VAddr 0x007f7368df3000
 remote address: LID 0x04 QPN 0x0ce2 PSN 0x81fa7f RKey 0x1fff00 VAddr 0x007f7e8c890000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 2500.000000 != 3497.359000. CPU Frequency is not max.
 65536      5000             44490.32            44467.35           0.711478
---------------------------------------------------------------------------------------

If we go back to the first terminal on pod number one we should also see similar response results.

sh-5.1# ib_write_bw

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF        Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x04 QPN 0x0ce2 PSN 0x81fa7f RKey 0x1fff00 VAddr 0x007f7e8c890000
 remote address: LID 0x05 QPN 0x0ce2 PSN 0x5389f7 RKey 0x1fff00 VAddr 0x007f7368df3000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      5000             44490.32            44467.35           0.711478
---------------------------------------------------------------------------------------

We can now clean up the pods since testing is over and move onto the next test.

The Host Device RDMA Testing

This section will demonstrate how to configure host device RDMA for the NVIDIA Network Operator and then how to test the per pod configuration.

Configure Nic Cluster Policy for Host Device

The operator should be running from previous steps. If a NicClusterPolicy exists we need to delete the existing one and generate a new hostdev NicClusterPolicy custom resource file.

$ cat <<EOF > network-hostdev-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.10-0.7.0.0-0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
  sriovDevicePlugin:
      image: sriov-network-device-plugin
      repository: ghcr.io/k8snetworkplumbingwg
      version: v3.7.0
      config: |
        {
          "resourceList": [
              {
                  "resourcePrefix": "nvidia.com",
                  "resourceName": "hostdev",
                  "selectors": {
                      "vendors": ["15b3"],
                      "isRdma": true
                  }
              }
          ]
        }
EOF

Next we can create the NicClusterPolicy custom resource on the cluster.

$ oc create -f network-hostdev-nic-cluster-policy.yaml 
nicclusterpolicy.mellanox.com/nic-cluster-policy created

We can validate the host device NicClusterPolicy by confirming that the DOCA/MOFED and SR-IOV device plugin pods are running.

$ oc get pods -n nvidia-network-operator
NAME                                                          READY   STATUS    RESTARTS   AGE
mofed-rhcos4.16-696886fcb4-ds-9sgvd                           2/2     Running   0          2m37s
mofed-rhcos4.16-696886fcb4-ds-lkjd4                           2/2     Running   0          2m37s
nvidia-network-operator-controller-manager-68d547dbbd-qsdkf   1/1     Running   0          141m
sriov-device-plugin-6v2nz                                     1/1     Running   0          2m14s
sriov-device-plugin-hc4t8                                     1/1     Running   0          2m14s

We can also confirm that the resources show up in the oc describe node output.

$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A7
Capacity:
  cpu:                 128
  ephemeral-storage:   1561525616Ki
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              263596708Ki
  nvidia.com/hostdev:  2
  pods:                250
Allocatable:
  cpu:                 127500m
  ephemeral-storage:   1438028263499
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              262445732Ki
  nvidia.com/hostdev:  2
  pods:                250
--
Capacity:
  cpu:                 128
  ephemeral-storage:   1561525616Ki
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              263596704Ki
  nvidia.com/hostdev:  2
  pods:                250
Allocatable:
  cpu:                 127500m
  ephemeral-storage:   1438028263499
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              262445728Ki
  nvidia.com/hostdev:  2
  pods:                250

Now we need to create a HostDeviceNetwork custom resource file.

$ cat <<EOF >  hostdev-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdev-net
spec:
  networkNamespace: "default"
  resourceName: "hostdev"
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.3.225/28",
      "exclude": [
       "192.168.3.229/30",
       "192.168.3.236/32"
      ]
    }
EOF
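It helps to know which addresses a CIDR entry in the ipam block actually covers. The sketch below computes the first and last address of the block containing a given address using plain shell arithmetic. cidr_range is a hypothetical name, and this only shows the block boundaries, not how whereabouts picks addresses within them.

```shell
# Prints the first and last address of the CIDR block that contains the
# given address, for example 192.168.3.225/28.
# cidr_range is a hypothetical helper; it runs in a subshell so the IFS
# change and variables do not leak into the caller.
cidr_range() (
  IFS=./
  set -- $1              # split into four octets and the prefix length
  unset IFS
  ip=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
  mask=$(( (4294967295 << (32 - $5)) & 4294967295 ))
  first=$(( ip & mask ))
  last=$(( first | (4294967295 ^ mask) ))
  dotted() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
  }
  echo "$(dotted "$first") $(dotted "$last")"
)
```

For the values above, the /28 range spans 192.168.3.224 through 192.168.3.239, and the 192.168.3.229/30 exclusion covers 192.168.3.228 through 192.168.3.231.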

And then create the HostDeviceNetwork resource on the cluster.

$ oc create -f hostdev-network.yaml
hostdevicenetwork.mellanox.com/hostdev-net created

Let's validate our resources are showing up properly.

$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A8
Capacity:
  cpu:                 128
  ephemeral-storage:   1561525616Ki
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              263596708Ki
  nvidia.com/gpu:      2
  nvidia.com/hostdev:  2
  pods:                250
Allocatable:
  cpu:                 127500m
  ephemeral-storage:   1438028263499
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              262445732Ki
  nvidia.com/gpu:      2
  nvidia.com/hostdev:  2
  pods:                250
--
Capacity:
  cpu:                 128
  ephemeral-storage:   1561525616Ki
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              263596680Ki
  nvidia.com/gpu:      2
  nvidia.com/hostdev:  2
  pods:                250
Allocatable:
  cpu:                 127500m
  ephemeral-storage:   1438028263499
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              262445704Ki
  nvidia.com/gpu:      2
  nvidia.com/hostdev:  2
  pods:                250

That completes the NicClusterPolicy configuration for host device.

Create Workload Pods and Perf Test Host Device

Now we need to create a workload pod that contains all the tooling for our host device testing. We can generate a custom pod file for each node as follows to meet that requirement.

$ cat << EOF > hostdev-32-workload.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: hostdev-32-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: hostdev-net
spec:
  nodeSelector: 
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: hostdev-32-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        nvidia.com/hostdev: 1
      requests:
        nvidia.com/gpu: 1
        nvidia.com/hostdev: 1
EOF

$ cat <<EOF > hostdev-33-workload.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: hostdev-33-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: hostdev-net
spec:
  nodeSelector: 
    kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: hostdev-33-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        nvidia.com/hostdev: 1
      requests:
        nvidia.com/gpu: 1
        nvidia.com/hostdev: 1
EOF

Then we can create the pods on the cluster.

$ oc create -f hostdev-32-workload.yaml
pod/hostdev-32-workload created

$ oc create -f hostdev-33-workload.yaml
pod/hostdev-33-workload created

Let's validate the pods are running.

$ oc get pods -n default
NAME                  READY   STATUS    RESTARTS   AGE
hostdev-32-workload   1/1     Running   0          73s
hostdev-33-workload   1/1     Running   0          12s

First let's get the ipaddress of the first pod.

$ oc get pod hostdev-32-workload -o yaml | grep -E 'default/hostdev-net' -A3
          "name": "default/hostdev-net",
          "interface": "net1",
          "ips": [
              "192.168.3.225"

Now rsh into the first pod and run the ib_write_bw command and leave that terminal open.

$ oc rsh -n default hostdev-32-workload
sh-5.1# ib_write_bw

************************************
* Waiting for client to connect... *
************************************

Then open another terminal and rsh to the second pod and run ib_write_bw 192.168.3.225.

$ oc rsh -n default hostdev-33-workload
sh-5.1# ib_write_bw 192.168.3.225
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF        Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x05 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007fe688c97000
 remote address: LID 0x04 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007f1f0249d000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 2500.000000 != 3498.323000. CPU Frequency is not max.
 65536      5000             44351.41            44328.98           0.709264
---------------------------------------------------------------------------------------

If we go back to the first terminal on pod number one we should also see similar response results.

sh-5.1# ib_write_bw

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF        Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x04 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007f1f0249d000
 remote address: LID 0x05 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007fe688c97000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      5000             44351.41            44328.98           0.709264
---------------------------------------------------------------------------------------

We can now clean up the pods since testing is over and move onto the next test.

The SRIOV Legacy Mode RDMA Testing

This deployment mode supports SR-IOV in legacy mode.

Configure Nic Cluster Policy for SRIOV Legacy

First we need to create a NicClusterPolicy, which for SRIOV legacy mode is fairly generic. Generate the custom resource file below. If a NicClusterPolicy already exists, remove it first.

$ cat <<EOF > network-sriovleg-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.10-0.7.0.0-0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
EOF

Now let's create the policy on the cluster.

$ oc create -f network-sriovleg-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created

Before we continue we can validate the pods are up.

$ oc get pods -n nvidia-network-operator
NAME                                                          READY   STATUS    RESTARTS      AGE
mofed-rhcos4.16-696886fcb4-ds-4mb42                           2/2     Running   0             40s
mofed-rhcos4.16-696886fcb4-ds-8knwq                           2/2     Running   0             40s
nvidia-network-operator-controller-manager-68d547dbbd-qsdkf   1/1     Running   13 (4d ago)   4d21h

Now we need to create a SriovNetworkNodePolicy which will generate the VFs for the device we want to operate in SRIOV legacy mode. Generate the custom resource file below.

$ cat <<EOF > sriov-network-node-policy.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-legacy-policy
  namespace:  openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nicSelector:
    vendor: "15b3"
    pfNames: ["ens8f0np0#0-7"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriovlegacy
EOF
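
The pfNames entry above uses the SR-IOV operator's pf#first-last form to pick a range of VFs on the physical function. As a quick sanity check before applying the policy, a minimal Python sketch like the one below can expand the selector and confirm the range fits inside numVfs. The helper names are my own and are not part of any operator tooling.

```python
def parse_pf_selector(selector):
    """Split 'pf#first-last' into (pf_name, list of VF indices)."""
    pf_name, sep, vf_range = selector.partition("#")
    if not sep:
        # No '#' present: the selector carries no explicit VF range.
        return pf_name, []
    first_text, _, last_text = vf_range.partition("-")
    first = int(first_text)
    # A selector such as 'pf#3' is treated here as the single VF 3.
    last = int(last_text) if last_text else first
    if last < first:
        raise ValueError("VF range is reversed: " + selector)
    return pf_name, list(range(first, last + 1))


def selector_fits(selector, num_vfs):
    """True when every VF index in the selector is below num_vfs."""
    _, vfs = parse_pf_selector(selector)
    return all(0 <= vf < num_vfs for vf in vfs)


if __name__ == "__main__":
    # Values taken from the SriovNetworkNodePolicy above.
    print(parse_pf_selector("ens8f0np0#0-7"))
    print(selector_fits("ens8f0np0#0-7", 8))
```

With the values from the policy, the selector expands to VFs 0 through 7, which lines up with numVfs set to 8.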

Next we can create the custom resource on the cluster. As a note, make sure SR-IOV Global Enable is enabled as per the Red Hat Knowledge Article.

$ oc create -f sriov-network-node-policy.yaml
sriovnetworknodepolicy.sriovnetwork.openshift.io/sriov-legacy-policy created

The nodes should go through a reboot process. Each one will have scheduling disabled and then reboot so the configuration can take effect.

$ oc get nodes
NAME                                       STATUS                        ROLES                         AGE     VERSION
edge-19.edge.lab.eng.rdu2.redhat.com       Ready                         control-plane,master,worker   5d      v1.29.8+632b078
nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com   Ready                         worker                        4d22h   v1.29.8+632b078
nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com   NotReady,SchedulingDisabled   worker                        4d22h   v1.29.8+632b078
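
If we would rather not eyeball the output while the nodes cycle, a small Python sketch along these lines can pick out the nodes that are not yet plain Ready. It only parses text in the column layout shown above, and the function name is hypothetical.

```python
def nodes_not_ready(oc_get_nodes_output):
    """Return {node_name: status} for nodes whose STATUS is not exactly 'Ready'."""
    pending = {}
    lines = oc_get_nodes_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        name, status = fields[0], fields[1]
        if status != "Ready":
            pending[name] = status
    return pending


# Sample text copied from the oc get nodes output above.
SAMPLE_NODES = """\
NAME                                       STATUS                        ROLES                         AGE     VERSION
edge-19.edge.lab.eng.rdu2.redhat.com       Ready                         control-plane,master,worker   5d      v1.29.8+632b078
nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com   Ready                         worker                        4d22h   v1.29.8+632b078
nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com   NotReady,SchedulingDisabled   worker                        4d22h   v1.29.8+632b078
"""

if __name__ == "__main__":
    for name, status in nodes_not_ready(SAMPLE_NODES).items():
        print(name, status)
```

An empty result would suggest every node has finished rebooting and is schedulable again.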

Once the nodes have rebooted we can validate that the VF interfaces were created by opening up a debug pod on each node.

$ oc debug node/nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-33nvidiaengrdu2dcredhatcom-debug-cqfjz ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.12
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ip link show | grep ens8
26: ens8f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
42: ens8f0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
43: ens8f0v1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
44: ens8f0v2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
45: ens8f0v3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
46: ens8f0v4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
47: ens8f0v5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
48: ens8f0v6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
49: ens8f0v7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000

We can repeat the same steps above on the second node if we want to feel complete.

We can also confirm that the VFs are advertised as a resource by checking the Capacity and Allocatable output for the worker nodes.

$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A8
Capacity:
  cpu:                       128
  ephemeral-storage:         1561525616Ki
  hugepages-1Gi:             0
  hugepages-2Mi:             0
  memory:                    263596692Ki
  nvidia.com/gpu:            2
  nvidia.com/hostdev:        0
  openshift.io/sriovlegacy:  8
--
Allocatable:
  cpu:                       127500m
  ephemeral-storage:         1438028263499
  hugepages-1Gi:             0
  hugepages-2Mi:             0
  memory:                    262445716Ki
  nvidia.com/gpu:            2
  nvidia.com/hostdev:        0
  openshift.io/sriovlegacy:  8
--
Capacity:
  cpu:                       128
  ephemeral-storage:         1561525616Ki
  hugepages-1Gi:             0
  hugepages-2Mi:             0
  memory:                    263596688Ki
  nvidia.com/gpu:            2
  nvidia.com/hostdev:        0
  openshift.io/sriovlegacy:  8
--
Allocatable:
  cpu:                       127500m
  ephemeral-storage:         1438028263499
  hugepages-1Gi:             0
  hugepages-2Mi:             0
  memory:                    262445712Ki
  nvidia.com/gpu:            2
  nvidia.com/hostdev:        0
  openshift.io/sriovlegacy:  8
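
To check those counts without reading every block by hand, the following Python sketch pulls each value reported for a given resource out of the describe text. It is a rough illustration that assumes the "name: value" layout shown above; the function name and the trimmed sample are mine.

```python
import re


def resource_counts(describe_output, resource):
    """Return every integer value reported for `resource`, in order of appearance."""
    pattern = re.compile(
        r"^\s*" + re.escape(resource) + r":\s+(\d+)\s*$",
        re.MULTILINE,
    )
    return [int(value) for value in pattern.findall(describe_output)]


# Trimmed excerpt of the Capacity and Allocatable output above.
SAMPLE_DESCRIBE = """\
Capacity:
  nvidia.com/gpu:            2
  openshift.io/sriovlegacy:  8
Allocatable:
  nvidia.com/gpu:            2
  openshift.io/sriovlegacy:  8
"""

if __name__ == "__main__":
    counts = resource_counts(SAMPLE_DESCRIBE, "openshift.io/sriovlegacy")
    # Every entry should match numVfs from the SriovNetworkNodePolicy.
    print(counts, all(count == 8 for count in counts))
```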

Now that the VFs for SRIOV legacy mode are in place we can generate the SriovNetwork custom resource file.

$ cat <<EOF > sriov-network.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-network
  namespace:  openshift-sriov-network-operator
spec:
  vlan: 0
  networkNamespace: "default"
  resourceName: "sriovlegacy"
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.3.225/28",
      "exclude": [
       "192.168.3.229/30",
       "192.168.3.236/32"
      ]
    }
EOF
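
To get a feel for how many addresses that ipam block leaves available, here is a minimal Python sketch using the standard ipaddress module. It is plain CIDR arithmetic on the range and exclude values from the file above, under the assumption that each entry is treated as the network containing it; whereabouts itself may handle start addresses and edge cases differently.

```python
import ipaddress


def candidate_addresses(range_cidr, excludes):
    """List host addresses in range_cidr that fall outside every exclude block."""
    # strict=False lets entries such as 192.168.3.225/28 carry host bits.
    network = ipaddress.ip_network(range_cidr, strict=False)
    excluded = [ipaddress.ip_network(entry, strict=False) for entry in excludes]
    return [
        str(address)
        for address in network.hosts()
        if not any(address in block for block in excluded)
    ]


if __name__ == "__main__":
    # Values copied from the SriovNetwork ipam section above.
    addresses = candidate_addresses(
        "192.168.3.225/28",
        ["192.168.3.229/30", "192.168.3.236/32"],
    )
    print(len(addresses), addresses)
```

By this arithmetic nine addresses remain, starting at 192.168.3.225, which is more than enough for the two workload pods used in this test.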

Then we can create the custom resource on the cluster.

$ oc create -f sriov-network.yaml
sriovnetwork.sriovnetwork.openshift.io/sriov-network created

That completes the NicClusterPolicy and SR-IOV network configuration for the SRIOV legacy section.

Create Workload and Perf Test SRIOV Legacy

Now we need to create a workload pod that contains all the tooling for our SRIOV legacy testing. We can generate a custom pod file for each node as follows to meet that requirement.

$ cat << EOF > sriovlegacy-32-workload.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: sriovlegacy-32-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network
spec:
  nodeSelector: 
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: sriovlegacy-32-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        openshift.io/sriovlegacy: 1
      requests:
        nvidia.com/gpu: 1
        openshift.io/sriovlegacy: 1
EOF

$ cat <<EOF > sriovlegacy-33-workload.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: sriovlegacy-33-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network
spec:
  nodeSelector: 
    kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: sriovlegacy-33-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        openshift.io/sriovlegacy: 1
      requests:
        nvidia.com/gpu: 1
        openshift.io/sriovlegacy: 1
EOF

Then we can create the pods on the cluster.

$ oc create -f sriovlegacy-32-workload.yaml
pod/sriovlegacy-32-workload created

$ oc create -f sriovlegacy-33-workload.yaml
pod/sriovlegacy-33-workload created

Let's validate the pods are running.

$ oc get pods -n default
NAME                      READY   STATUS    RESTARTS   AGE
sriovlegacy-32-workload   1/1     Running   0          73s
sriovlegacy-33-workload   1/1     Running   0          12s

First let's get the IP address of the first pod.

$ oc get pod sriovlegacy-32-workload -o yaml | grep -E 'default/sriov-network' -A3
          "name": "default/sriov-network",
          "interface": "net1",
          "ips": [
              "192.168.3.225"

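Now rsh into the first pod, run the ib_write_bw command with no arguments so it waits for a client, and leave that terminal open. Then, in a second terminal, rsh into the second pod and run ib_write_bw against the IP address of the first pod.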

$ oc rsh sriovlegacy-33-workload
sh-5.1# ib_write_bw 192.168.3.225
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF        Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x05 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f397ace8000
 remote address: LID 0x04 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f0eeefac000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 2500.000000 != 3491.228000. CPU Frequency is not max.
 65536      5000             44414.44            44386.66           0.710187
---------------------------------------------------------------------------------------
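
The result line above is easier to compare against the link speed once it is expressed in Gb/s. The Python sketch below parses that line and converts the average bandwidth, assuming the tool's MB means 2**20 bytes. As a cross-check on that assumption it also derives the rate from MsgRate multiplied by the message size. The function names are hypothetical.

```python
MIB = 1 << 20  # assumption: "MB" in the ib_write_bw report is 2**20 bytes


def parse_result_line(line):
    """Turn the five-column ib_write_bw result row into a dict."""
    size, iterations, peak, average, msg_rate = line.split()
    return {
        "bytes": int(size),
        "iterations": int(iterations),
        "bw_peak_mb_s": float(peak),
        "bw_avg_mb_s": float(average),
        "msg_rate_mpps": float(msg_rate),
    }


def avg_gbits_per_sec(result):
    """Average bandwidth in Gb/s, derived from the BW average column."""
    return result["bw_avg_mb_s"] * MIB * 8 / 1e9


def msg_rate_gbits_per_sec(result):
    """Bandwidth in Gb/s, derived from message rate times message size."""
    return result["msg_rate_mpps"] * 1e6 * result["bytes"] * 8 / 1e9


if __name__ == "__main__":
    # Row copied from the client output above.
    row = " 65536      5000             44414.44            44386.66           0.710187"
    result = parse_result_line(row)
    print(round(avg_gbits_per_sec(result), 2))
    print(round(msg_rate_gbits_per_sec(result), 2))
```

Both calculations land at roughly 372 Gb/s for this run, which supports the unit assumption.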

If we go back to the first terminal on pod number one we should see matching results.

sh-5.1# ib_write_bw

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF        Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x04 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f0eeefac000
 remote address: LID 0x05 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f397ace8000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      5000             44414.44            44386.66           0.710187
---------------------------------------------------------------------------------------

We can now clean up the pods since testing is over.

Hopefully this blog was detailed enough to provide an understanding of RDMA testing with NVIDIA and OpenShift. It provides brief examples of how to configure the different RDMA methods: Shared, Hostdev and SRIOV Legacy.