Showing posts with label mellanox. Show all posts

Saturday, February 28, 2026

OpenShift Passthrough For Some


I wanted to provide a simple mechanism to bind only some devices of a given type to vfio-pci when other devices of that same type are in use by the base operating system. For example, on some Grace Hopper nodes the only network devices might be BlueField-3 interfaces. If I want one BlueField-3 to provide network access to the base operating system, I need to leave its kernel driver in place. However, I might want to use the additional BlueField-3 devices in passthrough mode, which requires them to be unbound from the mlx5_core driver and bound to vfio-pci. The following writeup provides a working example, first manually and then automatically in the context of OpenShift.

Why

There are going to be use cases where workloads running in virtual machines on OpenShift worker nodes need network devices in passthrough mode. This is not a problem when the OpenShift node's cluster interface is on a different network card type than the cards that need to be passed to the virtual machine. It does become an issue on systems that are outfitted with all the same network interface types, because the device ID for all the network cards is the same. It also means I cannot use the traditional method of enabling passthrough for the network cards. That method involves blacklisting the network kernel driver so it does not load and then configuring the device IDs to attach to the vfio-pci driver. If we implemented that on a system where all of the network cards are the same, then when the system rebooted to apply the MachineConfig the node would come up without any networking and show as NotReady. That is why the rest of this document demonstrates a different, practical approach to this problem.

Manually Configure

Kernel driver unbinding and binding was introduced in kernel 2.6.13 back in 2005, so it's a technology that has been around for quite some time. This is the exact feature we will use to bind only some of our network cards to vfio-pci. To begin, let's take a look at our network interfaces via lspci, filtered by the device ID 15b3:a2dc. We can see here that I have 4 network card ports on an OpenShift node, viewed from a debug pod.

sh-5.2# lspci -nn |grep 15b3:a2dc
0000:01:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0000:01:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0002:01:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0002:01:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)

Now let's examine the physical interface names for these 4 ports.

sh-5.2# grep PCI_SLOT_NAME /sys/class/net/*/device/uevent
/sys/class/net/enP2s2f0np0/device/uevent:PCI_SLOT_NAME=0002:01:00.0
/sys/class/net/enP2s2f1np1/device/uevent:PCI_SLOT_NAME=0002:01:00.1
/sys/class/net/enp1s0f0np0/device/uevent:PCI_SLOT_NAME=0000:01:00.0
/sys/class/net/enp1s0f1np1/device/uevent:PCI_SLOT_NAME=0000:01:00.1

Now we have to see which one is already in use by OpenShift so we do not inadvertently work with the wrong card. This will always be the one where the master-

sh-5.2# ovs-vsctl --no-heading --format=table --columns=name,type find Interface type=system| awk '{print $1}'
enp1s0f0np0

We can see enp1s0f0np0, which corresponds to the 0000:01:00.0 port. Because 0000:01:00.1 is the second port of that same card, we will focus on 0002:01:00.0 and 0002:01:00.1.
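The interface-to-PCI-address lookup used above can be wrapped in a small helper. This is a minimal sketch and not part of the original script: the function name `pci_addr_of_iface` and its optional sysfs root argument (handy for trying it against a fake directory tree) are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the PCI address behind a network interface.
# $1 = interface name
# $2 = sysfs root (assumed parameter for testing, defaults to /sys)
pci_addr_of_iface() {
  local iface="$1" root="${2:-/sys}"
  local uevent="$root/class/net/$iface/device/uevent"

  # Fail quietly when the interface has no backing device entry
  [ -r "$uevent" ] || return 1

  # uevent holds KEY=VALUE lines; PCI_SLOT_NAME carries the bus address
  sed -n 's/^PCI_SLOT_NAME=//p' "$uevent"
}

# Example on a node (interface name taken from the ovs-vsctl output above):
#   pci_addr_of_iface enp1s0f0np0
```

Feeding the name reported by ovs-vsctl into a helper like this avoids matching the interface to its bus address by eye.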

Now that we have determined which cards we can use we will begin the process of unbinding them from their current driver which is mlx5_core.

echo -n "0002:01:00.0" > /sys/bus/pci/drivers/mlx5_core/unbind
echo -n "0002:01:00.1" > /sys/bus/pci/drivers/mlx5_core/unbind

At this point, if we looked at the lspci output, we would see that these two devices no longer have a "Kernel driver in use" line. Rather than four such lines across the ports, we would only see two, which belong to the two ports of the system network card.

sh-5.2# lspci -k -s 0002:01:00.0
0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel modules: mlx5_core
sh-5.2# lspci -k -s 0002:01:00.1
0002:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel modules: mlx5_core

We are now ready for them to use the vfio-pci driver, but first we may need to load that driver.

modprobe vfio-pci

We can validate that the vfio-pci driver is loaded with lsmod.

sh-5.2# lsmod|grep vfio
vfio_pci               16384  0
vfio_pci_core          90112  1 vfio_pci
vfio_iommu_type1       49152  0
vfio                   73728  3 vfio_pci_core,vfio_iommu_type1,vfio_pci
iommufd               131072  1 vfio

Now that we have unbound the two devices from their driver, let's override the kernel driver they should use with vfio-pci.

sh-5.2# echo vfio-pci > /sys/bus/pci/devices/0002:01:00.0/driver_override
sh-5.2# echo vfio-pci > /sys/bus/pci/devices/0002:01:00.1/driver_override

With the vfio-pci driver override in place, we can now bind our two devices to that driver.

sh-5.2# echo "0002:01:00.0" > /sys/bus/pci/drivers/vfio-pci/bind
sh-5.2# echo "0002:01:00.1" > /sys/bus/pci/drivers/vfio-pci/bind

And finally we can validate that those devices are now using the vfio-pci driver.

sh-5.2# lspci -k -s 0002:01:00.0
0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel driver in use: vfio-pci
        Kernel modules: mlx5_core
sh-5.2# lspci -k -s 0002:01:00.1
0002:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel driver in use: vfio-pci
        Kernel modules: mlx5_core
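The unbind, override and bind steps above can be collected into one function. The following is only a sketch of that sequence, not the script used later in this post. The function name `rebind_to_vfio` and the optional sysfs root argument are assumptions for illustration, and it expects the vfio-pci module to be loaded already (modprobe vfio-pci).

```shell
#!/bin/bash
# Sketch: move one PCI device from its current driver to vfio-pci.
# $1 = PCI address, for example 0002:01:00.0
# $2 = sysfs root (assumed parameter for testing, defaults to /sys)
rebind_to_vfio() {
  local addr="$1" root="${2:-/sys}"
  local dev="$root/bus/pci/devices/$addr"

  [ -d "$dev" ] || { echo "no such device: $addr" >&2; return 1; }

  # Unbind only if a driver is currently attached to the device
  if [ -e "$dev/driver/unbind" ]; then
    echo -n "$addr" > "$dev/driver/unbind"
  fi

  # Restrict which driver may claim the device, then ask vfio-pci to bind it
  echo vfio-pci > "$dev/driver_override"
  echo -n "$addr" > "$root/bus/pci/drivers/vfio-pci/bind"
}

# Example on a node, as root:
#   modprobe vfio-pci
#   rebind_to_vfio 0002:01:00.0
#   rebind_to_vfio 0002:01:00.1
```

Unbinding through the device's own driver link, rather than a hard-coded mlx5_core path, is a design choice of this sketch so the same function would work for devices bound to other drivers.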

Automatically Configure

While one can manually configure vfio-pci passthrough as we did above, this won't scale in a large cluster, especially across OpenShift upgrades, so we need something more automatic. The answer is twofold: first we need a script that automates the process above, and then we need a mechanism for running that script on OpenShift nodes.

For the automation script we can use the example code in this repository here. This script will identify all the interfaces of a certain device type and then determine which ones can be used as passthrough devices. The factor that prohibits a device from being used for passthrough is whether it has an OVS bridge associated with it. Once the script has identified the list, it unbinds the kernel driver in use on each device, then overrides the driver and binds the device to vfio-pci so it is available for passthrough.
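The eligibility decision can be illustrated in a few lines. This is a hedged sketch rather than the script's actual code: it assumes a device is ineligible whenever it sits on the same card (same PCI domain, bus and device number) as the port behind the OVS bridge, which is what the table output in this post suggests. The function name `passthru_eligible` is hypothetical.

```shell
#!/bin/bash
# Sketch: decide whether a device may be passed through.
# $1 = candidate PCI address
# $2 = PCI address of the port attached to the OVS bridge
# Prints "No" when both are functions of the same card, otherwise "Yes".
passthru_eligible() {
  local candidate="$1" bridge="$2"

  # Drop the trailing ".function" so both ports of one card compare equal
  if [ "${candidate%.*}" = "${bridge%.*}" ]; then
    echo "No"
  else
    echo "Yes"
  fi
}

# Classify the four ports from this post against the bridge port 0000:01:00.0
for addr in 0000:01:00.0 0000:01:00.1 0002:01:00.0 0002:01:00.1; do
  printf '%s %s\n' "$addr" "$(passthru_eligible "$addr" 0000:01:00.0)"
done
```

With those inputs the two 0000:01:00.x ports come back as "No" and the two 0002:01:00.x ports as "Yes".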

Here is a manual run on the system we had available for testing.

sh-5.2# ./passthrough-some-nics.sh -n 15b3:a2dc
 NIC Name       NIC Bus ID     Kernel Driver  OCP BR NIC     PassThru Eligible
====================================================================================================
 enp1s0f0np0    0000:01:00.0   mlx5_core      Yes            No
 enp1s0f1np1    0000:01:00.1   mlx5_core      Yes            No
 enP2s2f0np0    0002:01:00.0   mlx5_core      No             Yes
 enP2s2f1np1    0002:01:00.1   mlx5_core      No             Yes
Loading vfio-pci......Done!
Unbinding device 0002:01:00.0 from mlx5_core kernel driver...
Applying driver override to device 0002:01:00.0...
Binding device 0002:01:00.0 to vfio-pci...
Device kernel driver validation...
0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel driver in use: vfio-pci
        Kernel modules: mlx5_core
Unbinding device 0002:01:00.1 from mlx5_core kernel driver...
Applying driver override to device 0002:01:00.1...
Binding device 0002:01:00.1 to vfio-pci...
Device kernel driver validation...
0002:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel driver in use: vfio-pci
        Kernel modules: mlx5_core

Notice the script changes the kernel driver in use for the two devices. If we run the script again we should see that no changes can be made because there are no other eligible passthrough devices.

sh-5.2# ./passthrough-some-nics.sh -n 15b3:a2dc
 NIC Name       NIC Bus ID     Kernel Driver  OCP BR NIC     PassThru Eligible
====================================================================================================
 enp1s0f0np0    0000:01:00.0   mlx5_core      Yes            No
 enp1s0f1np1    0000:01:00.1   mlx5_core      Yes            No
 NA             0002:01:00.0   vfio-pci       No             Complete
 NA             0002:01:00.1   vfio-pci       No             Complete
vfio_pci 16384 0 - Live 0xffffb968aee88000

Now that we have seen the script work, let's make this more relatable to OpenShift. First we will have to base64 encode the script by piping it through the base64 command.

$ BASE64_SCRIPT=$(cat passthrough-some-nics.sh | base64 -w 0) $ echo $BASE64_SCRIPT IyEvYmluL2Jhc2gKIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjCiMgVGhpcyBzY3JpcHQgcGFzc2VzIHRocm91Z2ggc29tZSBvZiB0aGUgTklDcyB3aGVuIGFsbCB0aGUgTklDcyBhcmUgdGhlIHNhbWUgZGV2aWNlIHR5cGUgICAgICAgICAgICAgICAgICAgIwojIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMKCiMgSG93IHRvIHVzZSB0aGUgc2NyaXB0IGlmIHVzZXIgZG9lcyBub3Qga25vdyBob3cKaG93dG8oKXsKICBlY2hvICJVc2FnZTogcGFzc3Rocm91Z2gtc29tZS1uaWNzLnNoIC1uIDxuaWMtZGV2aWNlLWlkPiIKICBlY2hvICJFeGFtcGxlIFNpbmdsZSBEZXZpY2UgSUQ6IHBhc3N0aHJvdWdoLXNvbWUtbmljcy5zaCAtbiAxNWIzOmEyZGMiCiAgZWNobyAiRXhhbXBsZSBNdWx0aSBEZXZpY2UgSUQ6IHBhc3N0aHJvdWdoLXNvbWUtbmljcy5zaCAtbiAxZGQ4OjEwMDJ8MTViMzoxMDIxIgp9CgojIEdldG9wdHMgc2V0dXAgZm9yIHZhcmlhYmxlcyB0byBwYXNzIGZyb20gb3B0aW9ucwp3aGlsZSBnZXRvcHRzIGc6bjp1OnI6aCBvcHRpb24KZG8KY2FzZSAiJHtvcHRpb259IgppbgpuKSBuaWNpZD0ke09QVEFSR307OwpoKSBob3d0bzsgZXhpdCAwOzsKXD8pIGhvd3RvOyBleGl0IDE7Owplc2FjCmRvbmUKCiMgTWFrZSBzdXJlIHRoZSB2YXJpYWJsZXMgYXJlIHBvcHVsYXRlZCB3aXRoIHZhbHVlcyBvdGhlcndpc2Ugc2hvdyBob3d0bwppZiAoWyAteiAiJG5pY2lkIiBdKSB0aGVuCiAgIGhvd3RvCiAgIGV4aXQgMQpmaQoKIyBTZXQgdGFibGUgaGVhZGVyIGZvcm1hdCAKZGl2aWRlcj09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09CmRpdmlkZXI9JGRpdmlkZXIkZGl2aWRlciRkaXZpZGVyCmhlYWRlcj0iXG4gJS0xMnMgJS0xNnMgJS0xNHMgJS0xNHMgJS0xNHNcbiIKZm9ybWF0PSIgJS0xNHMgJS0xNHMgJS0xNHMgJS0xNHMgJS0xNHNcbiIKd2lkdGg9MTAwCgojIFNsdXJwIGluIG5pYyBkZXZpY2UgdHlwZSBpZHMgZnJvbSBsc3BjaQpuaWNpZD1gZWNobyAkbmljaWQgfHNlZCAncy8sL1x8L2cnYAptYXBmaWxlIC10IG15X25pY3MgPCA8KGxzcGNpIC1ufGdyZXAgLUUgJG5pY2lkKQoKIyBQcmludCBvdXQgaGVhZGVycyAKcHJpbnRmICIkaGVhZGVyIiAiTklDIE5hbWUiICJOSUMgQnVzIElEIiAiS2VybmVsIERyaXZlciIgIk9DUCBCUiBOSUMiICJQYXNzVGhydSBFbGlnaWJsZSIKcHJpbnRmICIlJHdpZHRoLiR7d2lkdGh9c1xuIiAiJGRpdmlkZXIiCgojIEdyYWIgaW50ZXJmYWNlIGFzc29jaWF0ZWQgdG8g
b3ZzLXN5c3RlbSBicmlkZ2UuICBCb25kcyBkbyBub3Qgd29yayBoZXJlIHlldApicnBoeWludD1gb3ZzLXZzY3RsIC0tbm8taGVhZGluZyAtLWZvcm1hdD10YWJsZSAtLWNvbHVtbnM9bmFtZSx0eXBlIGZpbmQgSW50ZXJmYWNlIHR5cGU9c3lzdGVtfCBhd2sgJ3twcmludCAkMX0nYApicnBoeWJ1cz1gZ3JlcCBQQ0lfU0xPVF9OQU1FIC9zeXMvY2xhc3MvbmV0LyovZGV2aWNlL3VldmVudHxncmVwICRicnBoeWludHwgYXdrIC1GICI9IiAne3ByaW50ICQyfSdgCgojIERlY2xhcmUgZW1wdHkgYXJyYXkgdG8gc3RvcmUgbmljIGRldGFpbHMgb24gdGhvc2UgdGhhdCBjYW4gYmUgdW5ib3VuZApkZWNsYXJlIC1hIHBhc3N0aHJvdWdoPSgpCgpmb3IgKCggbmljPTA7IG5pYzwkeyNteV9uaWNzW0BdfTsgbmljKysgKSkKZG8KICAgbmljYnVzaWQ9YGVjaG8gJHtteV9uaWNzWyRuaWNdfSB8IGF3ayAne3ByaW50ICQxfSdgCiAgIG5pY2tkcnY9YGxzcGNpIC1rbiAtcyAkbmljYnVzaWQgfCBncmVwICJLZXJuZWwgZHJpdmVyIGluIHVzZToifCBhd2sgLUYgIjogIiAne3ByaW50ICQyfSdgCiAgIG5pY25hbWU9YGdyZXAgUENJX1NMT1RfTkFNRSAvc3lzL2NsYXNzL25ldC8qL2RldmljZS91ZXZlbnR8Z3JlcCAkbmljYnVzaWR8IGF3ayAtRiAnLycgJ3twcmludCAkNX0nYAogICBpZiBbICIkbmljbmFtZSIgPSAiIiBdOyB0aGVuCiAgICAgIG5pY25hbWU9Ik5BIgogICBmaQoKICAgIyBPYnRhaW4gZmlyc3QgMTEgY2hhcmFjdGVycyBvZiBlYWNoIHZhcmlhYmxlIHN0cmluZyB0byB1c2UgZm9yIGNvbXBhcmUKICAgc3VibmljYnVzaWQ9IiR7bmljYnVzaWQ6MDoxMX0iCiAgIHN1YmJycGh5YnVzPSIke2JycGh5YnVzOjA6MTF9IgoKICAgIyBDb21wYXJlIHRoZSBzdWJzdHJpbmdzCiAgIGlmIFtbICIkc3VibmljYnVzaWQiID09ICIkc3ViYnJwaHlidXMiIF1dOyB0aGVuCiAgICAgIHN5c25pYz0iWWVzIgogICAgICBwYXNzdGhydT0iTm8iCiAgICAgICMgRGlzcGxheSB0byBjb25zb2xlIHRoZSBkZXRhaWxzCiAgICAgIHByaW50ZiAiJGZvcm1hdCIgJG5pY25hbWUgJG5pY2J1c2lkICRuaWNrZHJ2ICRzeXNuaWMgJHBhc3N0aHJ1CiAgIGVsc2UKICAgICAgc3lzbmljPSJObyIKICAgICAgaWYgWyAiJG5pY2tkcnYiID0gInZmaW8tcGNpIiBdOyB0aGVuCiAgICAgICAgIHBhc3N0aHJ1PSJDb21wbGV0ZSIKICAgICAgZWxzZQogICAgICAgICBwYXNzdGhydT0iWWVzIgogICAgICAgICBwYXNzdGhyb3VnaCs9KCIkbmljYnVzaWR8JG5pY2tkcnYiKQogICAgICBmaQogICAgICAjIERpc3BsYXkgdG8gY29uc29sZSB0aGUgZGV0YWlscwogICAgICBwcmludGYgIiRmb3JtYXQiICRuaWNuYW1lICRuaWNidXNpZCAkbmlja2RydiAkc3lzbmljICRwYXNzdGhydQogICBmaQpkb25lCgppZiAhIGdyZXAgLUUgIl52ZmlvX3BjaSAiIC9wcm9jL21vZHVsZXM7IHRoZW4KICBlY2hvICIgIgogIGVjaG8gLW4gIkxvYWRpbmcgdmZpby1wY2kuLi4iCiAgbW9kcHJvYmUgdmZpby1w
Y2kKICBlY2hvICIuLi5Eb25lISIKICBlY2hvICIgIgpmaQoKCmZvciAoKCBwYXNzPTA7IHBhc3M8JHsjcGFzc3Rocm91Z2hbQF19OyBwYXNzKysgKSkKZG8KICAgbmljYnVzaWQ9YGVjaG8gJHtwYXNzdGhyb3VnaFskcGFzc119IHwgYXdrIC1GICJ8IiAne3ByaW50ICQxfSdgCiAgIG5pY2tkcnY9YGVjaG8gJHtwYXNzdGhyb3VnaFskcGFzc119IHwgYXdrIC1GICJ8IiAne3ByaW50ICQyfSdgCiAgIGVjaG8gIiAiCiAgIGVjaG8gIlVuYmluZGluZyBkZXZpY2UgJG5pY2J1c2lkIGZyb20gJG5pY2tkcnYga2VybmVsIGRyaXZlci4uLiIKICAgZWNobyAtbiAiJG5pY2J1c2lkIiA+IC9zeXMvYnVzL3BjaS9kcml2ZXJzL21seDVfY29yZS91bmJpbmQKICAgZWNobyAiQXBwbHlpbmcgZHJpdmVyIG92ZXJyaWRlIHRvIGRldmljZSAkbmljYnVzaWQuLi4iCiAgIGVjaG8gdmZpby1wY2kgPiAvc3lzL2J1cy9wY2kvZGV2aWNlcy8kbmljYnVzaWQvZHJpdmVyX292ZXJyaWRlCiAgIGVjaG8gIkJpbmRpbmcgZGV2aWNlICRuaWNidXNpZCB0byB2ZmlvLXBjaS4uLiIKICAgZWNobyAiJG5pY2J1c2lkIiA+IC9zeXMvYnVzL3BjaS9kcml2ZXJzL3ZmaW8tcGNpL2JpbmQKICAgZWNobyAiRGV2aWNlIGtlcm5lbCBkcml2ZXIgdmFsaWRhdGlvbi4uLiIKICAgbHNwY2kgLWsgLXMgJG5pY2J1c2lkCmRvbmUKZXhpdCAwCg==
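Before embedding the encoded string in a MachineConfig, it can be worth confirming that it decodes back to the original file. The following is a minimal sketch using GNU coreutils base64 and cmp. The function name `b64_roundtrip_ok` is an assumption for illustration and is not part of the original workflow.

```shell
#!/bin/bash
# Sketch: confirm a file survives the base64 round trip unchanged.
# $1 = path to the file to check
b64_roundtrip_ok() {
  local file="$1" encoded

  # -w 0 disables line wrapping so the result is a single-line string
  encoded=$(base64 -w 0 < "$file") || return 1

  # Decode and compare byte-for-byte with the original file
  printf '%s' "$encoded" | base64 -d | cmp -s - "$file"
}

# Example:
#   b64_roundtrip_ok passthrough-some-nics.sh && echo "encoding OK"
```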

We will also set our device ID variable, which will get embedded in the MachineConfig as the argument for the script. Please note that if we wanted to use multiple device IDs we would pipe-delimit them.

$ DEVICEID="15b3:a2dc"             # Single device id
$ DEVICEID="1dd8:1002|15b3:1021"   # Multiple device ids
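To illustrate how a pipe-delimited list of device IDs selects devices, here is a small sketch that filters lspci -n style output. It is not the script's own implementation: the function name `filter_pci_ids` is hypothetical, and it assumes the vendor:device ID is the third whitespace-separated field of each line, as in typical lspci -n output.

```shell
#!/bin/bash
# Sketch: read `lspci -n` output on stdin and print the PCI address of
# every device whose vendor:device ID is in the pipe-delimited list $1.
filter_pci_ids() {
  awk -v ids="$1" '
    BEGIN { n = split(ids, want, "|") }   # "|" is used as a literal separator
    {
      for (i = 1; i <= n; i++)
        if ($3 == want[i]) { print $1; break }
    }
  '
}

# Example on a node:
#   lspci -n | filter_pci_ids "1dd8:1002|15b3:1021"
```

Matching the ID field exactly, rather than with a loose pattern, avoids accidental hits on other parts of the line.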

We also have to set the length of time to wait so the system can finish coming up. 120 seconds is a good rule of thumb.

$ SLP="120"

Then we have to configure a MachineConfig that will place the base64 encoded script on the system and establish a systemd service to run the script every time the node boots.

$ cat > passthrough-for-some-machineconfig.yaml << EOF
kind: MachineConfig
apiVersion: machineconfiguration.openshift.io/v1
metadata:
  name: passthrough-for-some-systemd-service
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - name: passthrough-for-some.service
        enabled: true
        contents: |
          [Unit]
          Description=Identifies and enabled passthough on select network interfaces
          After=NetworkManager-wait-online.service openvswitch.service
          Wants=NetworkManager-wait-online.service openvswitch.service
          [Service]
          RemainAfterExit=yes
          ExecStart=/etc/scripts/passthrough-some-nics.sh -n $DEVICEID -s $SLP
          Type=oneshot
          [Install]
          WantedBy=multi-user.target
    storage:
      files:
      - filesystem: root
        path: "/etc/scripts/passthrough-some-nics.sh"
        contents:
          source: data:text/plain;charset=utf-8;base64,$BASE64_SCRIPT
          verification: {}
        mode: 0755
        overwrite: true
EOF

Now let's create the MachineConfig on the cluster.

$ oc create -f passthrough-for-some-machineconfig.yaml
machineconfig.machineconfiguration.openshift.io/passthrough-for-some-systemd-service created

We need to wait for the node to reboot. Once oc get mcp is responsive and confirms the node is updated we can start to validate.

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c88d4164a5bd26edb3d4025d24a5d2f8   True      False      False      1              1                   1                     0                      6d7h
worker   rendered-worker-9890b2fbe760e8e731e68bf217b87278   True      False      False      0              0                   0                      0                      6d7h
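Waiting for the pool can also be scripted instead of checked by eye. The sketch below reads oc get mcp output on standard input and succeeds only when the named pool shows UPDATED as True and UPDATING as False. The function name `mcp_is_updated` is an assumption for illustration, and it relies on the default column order shown above.

```shell
#!/bin/bash
# Sketch: succeed only when the named MachineConfigPool is fully updated.
# stdin = output of `oc get mcp`
# $1    = pool name, for example master
mcp_is_updated() {
  awk -v pool="$1" '
    # Column 3 is UPDATED and column 4 is UPDATING in the default output
    $1 == pool { found = 1; ok = ($3 == "True" && $4 == "False") }
    END { if (found && ok) exit 0; exit 1 }
  '
}

# Example polling loop from a workstation with cluster access:
#   until oc get mcp | mcp_is_updated master; do sleep 30; done
```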

Let's check the status of the service on the node. We can see from the below output it already identified the interfaces that can be made passthrough.

# systemctl status passthrough-for-some.service
● passthrough-for-some.service - Identifies and enabled passthough on select network interfaces
     Loaded: loaded (/etc/systemd/system/passthrough-for-some.service; enabled; preset: disabled)
     Active: activating (start) since Thu 2026-02-19 22:27:01 UTC; 5min ago
        Job: 408
 Invocation: 29eaf89183be4424a9f2fb4a2bd249a4
   Main PID: 4282 (passthrough-som)
      Tasks: 1 (limit: 3084134)
     Memory: 1.5M (peak: 10.8M)
        CPU: 213ms
     CGroup: /system.slice/passthrough-for-some.service
             └─4282 /bin/bash /etc/scripts/passthrough-some-nics.sh -n 15b3:a2dc
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: ====================================================================================================
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: enp1s0f0np0 0000:01:00.0 mlx5_core Yes No
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: enp1s0f1np1 0000:01:00.1 mlx5_core Yes No
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: enP2s2f0np0 0002:01:00.0 mlx5_core No Yes
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: enP2s2f1np1 0002:01:00.1 mlx5_core No Yes
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]:
Feb 19 22:32:02 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: Loading vfio-pci......Done!
Feb 19 22:32:02 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]:
Feb 19 22:32:02 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]:
Feb 19 22:32:02 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: Unbinding device 0002:01:00.0 from mlx5_core kernel driver...

Let's look at the lspci output for the devices we saw in the logs. We can see the first two interfaces stayed bound to mlx5_core because those ports are part of the same card, which is associated with the OVS bridge. The last two ports, though, were unbound from mlx5_core and bound to vfio-pci to enable passthrough.

# lspci -k -s 0000:01:00.0
0000:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel driver in use: mlx5_core
        Kernel modules: mlx5_core
# lspci -k -s 0000:01:00.1
0000:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel driver in use: mlx5_core
        Kernel modules: mlx5_core
# lspci -k -s 0002:01:00.0
0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel driver in use: vfio-pci
        Kernel modules: mlx5_core
# lspci -k -s 0002:01:00.1
0002:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel driver in use: vfio-pci
        Kernel modules: mlx5_core

One final thing we can do is run the script manually on the node again to also confirm our findings.

# /etc/scripts/passthrough-some-nics.sh -n 15b3:a2dc
 NIC Name       NIC Bus ID     Kernel Driver  OCP BR NIC     PassThru Eligible
====================================================================================================
 enp1s0f0np0    0000:01:00.0   mlx5_core      Yes            No
 enp1s0f1np1    0000:01:00.1   mlx5_core      Yes            No
 NA             0002:01:00.0   vfio-pci       No             Complete
 NA             0002:01:00.1   vfio-pci       No             Complete
vfio_pci 16384 0 - Live 0xffffd5d69072b000

OpenShift Virtualization Passthrough

Now that our devices are set to passthrough, we can configure OpenShift Virtualization to see them as an available resource. We will need to edit the hyperconverged setup on our OpenShift cluster and add the following section.

  permittedHostDevices:
    pciHostDevices:
    - pciDeviceSelector: 15b3:a2dc
      resourceName: nvidia.com/BF3_CX7
  resourceRequirements:

We can make the edit by doing the following and inserting the section above right before the resourceRequirements section of the spec file.

$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited
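For scripted setups, the same change could be expressed as a merge patch instead of an interactive edit. This is only a sketch under that assumption: the helper name `hco_pci_patch` is hypothetical, and the field names are the ones shown in the section above.

```shell
#!/bin/bash
# Sketch: build a JSON merge patch that sets one permitted PCI host device.
# $1 = vendor:device ID, for example 15b3:a2dc
# $2 = resource name to expose, for example nvidia.com/BF3_CX7
hco_pci_patch() {
  printf '{"spec":{"permittedHostDevices":{"pciHostDevices":[{"pciDeviceSelector":"%s","resourceName":"%s"}]}}}\n' "$1" "$2"
}

# Example. A merge patch replaces the whole pciHostDevices list, so any
# entries that already exist would need to be included as well:
#   oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv \
#     --type=merge -p "$(hco_pci_patch 15b3:a2dc nvidia.com/BF3_CX7)"
```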

Then we can confirm the resources are exposed by the OpenShift node using oc describe node.

$ oc describe node | grep -E 'Capacity:|Allocatable:' -A12
Capacity:
  cpu:                            72
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              936709572Ki
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  hugepages-32Mi:                 0
  hugepages-64Ki:                 0
  memory:                         493510268Ki
  nvidia.com/BF3_CX7:             2
  pods:                           250
Allocatable:
  cpu:                            71500m
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              862197798302
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  hugepages-32Mi:                 0
  hugepages-64Ki:                 0
  memory:                         492359292Ki
  nvidia.com/BF3_CX7:             2
  pods:                           250

Now when we go launch a virtual machine in OpenShift we will want to include the following section in our virtual machine spec file nested under spec->domain->devices.

hostDevices:
- deviceName: nvidia.com/BF3_CX7
  name: hostDevices-turquoise-hornet-42

And if all goes well once we launch our virtual machine and it's running we should be able to see the passthrough ethernet interface.

$ oc get vmi -n openshift-cnv
NAMESPACE       NAME                  AGE   PHASE     IP            NODENAME                                   READY
openshift-cnv   rhel9-red-locust-96   10m   Running   10.128.0.49   nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com   True
$ virtctl console rhel9-red-locust-96 -n openshift-cnv
Successfully connected to rhel9-red-locust-96 console. The escape sequence is ^]
rhel9-red-locust-96 login: cloud-user
Password:
Last login: Fri Feb 20 08:08:53 on tty1
[cloud-user@rhel9-red-locust-96 ~]$ sudo bash
[root@rhel9-red-locust-96 cloud-user]# lspci -nn|grep Mellanox
0a:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)

Hopefully this provides a decent example of enabling passthrough for a subset of devices on a server where all the devices are the same but not all can be passed through due to the need for base networking at the OS level.

Saturday, January 04, 2025

RDMA with NVIDIA on OpenShift


The rise of artificial intelligence (AI) has generated some really challenging problems with data movement. In traditional environments, if I needed to move data from one node to another it would need to be handled by the central processor (CPU) of the host. While this was reasonable with small amounts of data, a better and more efficient method is needed for AI workloads and their large datasets.

To solve this challenge we can use RDMA, or remote direct memory access, which enables direct access from the memory of one compute node to the memory of another without involving the CPU of either host. This enables high-throughput, low-latency networking, which is especially useful in massive compute clusters with large datasets.

The rest of this blog will cover examples of using RDMA with NVIDIA's Network Operator and GPU Operator along with Red Hat OpenShift Container Platform. The three primary examples covered in this document will be: RDMA Shared Device, RDMA Host Device and RDMA in Legacy SRIOV.

Lab Environment

The following configurations and testing were done in an OpenShift environment that consisted of the following:

  • OpenShift 4.16.19 x86
  • Network Operator 24.10
  • All other operators used the default values for OCP 4.16.
  • 3 physical nodes: 1 SNO master, 2 workers
  • The workers were Dell R760xa servers with 2 NVIDIA BF3 cards in each.
  • One BF3 card was attached to the NVIDIA Spectrum SN5600 switch for RDMA over Ethernet
  • One BF3 card was attached to the NVIDIA Quantum QM9700 switch for RDMA over InfiniBand

Blacklist IRDMA Module

On some systems, including the Dell R750xa I used for testing, the irdma kernel module creates problems for the NVIDIA Network Operator when the DOCA drivers are unloaded and loaded, so we need to blacklist it with a machine configuration that gets applied to all worker nodes.

Generate the following machine configuration YAML file, specifying irdma as the module to blacklist.

$ cat <<EOF > 99-machine-config-blacklist-irdma.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-blacklist-irdma
spec:
  kernelArguments:
    - "module_blacklist=irdma"
EOF

Then create the machine configuration on the cluster and wait for the worker nodes to reboot.

$ oc create -f 99-machine-config-blacklist-irdma.yaml
machineconfig.machineconfiguration.openshift.io/99-worker-blacklist-irdma created

Validate in a debug pod on each node that the module has not loaded.

$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-32nvidiaengrdu2dcredhatcom-debug-btfj2 ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.11
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# lsmod|grep irdma
sh-5.1#
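In addition to checking lsmod, one could confirm that the kernel argument itself was applied. The following is a minimal sketch: the function name `module_blacklisted` and its optional file argument are assumptions for illustration, and it relies on module_blacklist= accepting a comma-separated list of module names.

```shell
#!/bin/bash
# Sketch: succeed if the kernel command line blacklists the given module.
# $1 = module name, for example irdma
# $2 = command line file (assumed parameter for testing, defaults to /proc/cmdline)
module_blacklisted() {
  local mod="$1" cmdline="${2:-/proc/cmdline}"

  # Split the command line into one argument per line, keep the
  # module_blacklist= value, split it on commas and look for an exact match
  tr ' ' '\n' < "$cmdline" \
    | grep '^module_blacklist=' \
    | cut -d= -f2- \
    | tr ',' '\n' \
    | grep -qx "$mod"
}

# Example inside the debug pod after chroot /host:
#   module_blacklisted irdma && echo "irdma is blacklisted"
```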

At this point, if everything looks good, we can move onto the next steps of the workflow.

Persistent Naming Rules

Sometimes there is a need to make sure the device names persist across reboots. On the R760xa systems, and on nodes with a large number of network cards, I noticed the Mellanox devices were being renamed on reboots, so I decided to use a MachineConfig to set persistence.

First gather the MAC addresses of the interfaces on the worker node(s) into a file and provide the names that need to persist for those interfaces. We will call the file 70-persistent-net.rules and stash the details in it.

$ cat <<EOF > 70-persistent-net.rules
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:28",ATTR{type}=="1",NAME="ibs2f0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:29",ATTR{type}=="1",NAME="ens8f0np0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d0",ATTR{type}=="1",NAME="ibs2f0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d1",ATTR{type}=="1",NAME="ens8f0np0"
EOF
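When there are many interfaces, the rules can be generated from a list of MAC address and name pairs rather than typed by hand. This is a small sketch of that idea. The function name `udev_net_rule` is hypothetical, and the rule layout simply mirrors the lines in the file above.

```shell
#!/bin/bash
# Sketch: print one persistent-naming udev rule.
# $1 = MAC address of the interface
# $2 = interface name to assign
udev_net_rule() {
  printf 'SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="%s",ATTR{type}=="1",NAME="%s"\n' "$1" "$2"
}

# Build rules from "MAC name" pairs, one pair per line
while read -r mac name; do
  udev_net_rule "$mac" "$name"
done <<'PAIRS'
b8:3f:d2:3b:51:28 ibs2f0
b8:3f:d2:3b:51:29 ens8f0np0
PAIRS
```

Redirecting the loop's output to 70-persistent-net.rules would produce a file in the same format as the one shown above.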

Now we need to convert that file into a base64 string without line breaks and set the output to the variable PERSIST.

$ PERSIST=`cat 70-persistent-net.rules| base64 -w 0` $ echo $PERSIST U1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIK

Now we can create a machine configuration and set the base64 encoding in our custom resource file.  Notice how I am using the PERSIST variable in my yaml creation to mitigate copy/paste type errors.

$ cat <<EOF > 99-machine-config-udev-network.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-machine-config-udev-network
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;base64,$PERSIST
        filesystem: root
        mode: 420
        path: /etc/udev/rules.d/70-persistent-net.rules
EOF

Once we have the machine configuration we can create it on the cluster.

$ oc create -f 99-machine-config-udev-network.yaml
machineconfig.machineconfiguration.openshift.io/99-machine-config-udev-network created
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-9adfe851c2c14d9598eea5ec3df6c187   True      False      False      1              1                   1                     0                      6h21m
worker   rendered-worker-4568f1b174066b4b1a4de794cf538fee   False     True       False      2              0                   0                     0                      6h21m

The worker nodes will reboot, and once the UPDATING field goes back to False we can validate on the nodes by looking at the devices in a debug pod if we choose to do so.

If everything looks good we can move onto configuring the operators of the OpenShift cluster.

Install and Configure Required Operators

This next section will cover the installation and configurations of the required operators we need for the RDMA testing.

Install and Configure NFD Operator

The Node Feature Discovery (NFD) operator manages the detection of hardware features and configuration in an OpenShift Container Platform cluster by labeling the nodes with hardware-specific information. NFD labels the host with node-specific attributes, such as PCI cards, kernel, operating system version, and so on.

To get started we will generate a manifest for the NFD Operator that will create the namespace, operator group and subscription.

$ cat <<EOF > nfd-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
  - openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: "stable"
  installPlanApproval: Automatic
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

Next we can create the resources on the cluster.

$ oc create -f nfd-operator.yaml
namespace/openshift-nfd created
operatorgroup.operators.coreos.com/openshift-nfd created
subscription.operators.coreos.com/nfd created

We can validate that the operator is installed and running by looking at the pods in the openshift-nfd namespace.

$ oc get pods -n openshift-nfd
NAME                                      READY   STATUS    RESTARTS   AGE
nfd-controller-manager-8698c88cdd-t8gbc   2/2     Running   0          2m

With the NFD controller running we can move onto generating the NodeFeatureDiscovery instance and adding it to the cluster.

The ClusterServiceVersion specification for NFD operator provides default values, including the NFD operand image that is part of the operator payload. We retrieve its value with the following command line and assign it to the variable NFD_OPERAND_IMAGE.

$ NFD_OPERAND_IMAGE=`echo $(oc get csv -n openshift-nfd -o json | jq -r '.items[0].metadata.annotations["alm-examples"]') | jq -r '.[] | select(.kind == "NodeFeatureDiscovery") | .spec.operand.image'`

We can now create the NodeFeatureDiscovery instance. Note that we add entries to the default deviceClassWhitelist field to support more devices, such as the NVIDIA BlueField DPUs and the NVIDIA GPUs.

$ cat <<EOF > nfd-instance.yaml
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  instance: ''
  operand:
    image: '${NFD_OPERAND_IMAGE}'
    servicePort: 12000
  prunerOnDelete: false
  topologyUpdater: false
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
      sources:
        pci:
          deviceClassWhitelist:
            - "02"
            - "03"
            - "0200"
            - "0207"
            - "12"
          deviceLabelFields:
            - "vendor"
EOF
$ oc create -f nfd-instance.yaml
nodefeaturediscovery.nfd.openshift.io/nfd-instance created

Finally we can validate our instance is up and running by again looking at the pods under the openshift-nfd namespace.

$ oc get pods -n openshift-nfd
NAME                                    READY   STATUS    RESTARTS   AGE
nfd-controller-manager-7cb6d656-jcnqb   2/2     Running   0          4m
nfd-gc-7576d64889-s28k9                 1/1     Running   0          21s
nfd-master-b7bcf5cfd-qnrmz              1/1     Running   0          21s
nfd-worker-96pfh                        1/1     Running   0          21s
nfd-worker-b2gkg                        1/1     Running   0          21s
nfd-worker-bd9bk                        1/1     Running   0          21s
nfd-worker-cswf4                        1/1     Running   0          21s
nfd-worker-kp6gg                        1/1     Running   0          21s

After a minute or so, we can verify that NFD has added labels to the node. The NFD labels are prefixed with feature.node.kubernetes.io, so we can easily filter them.

$ oc get node -o json | jq '.items[0].metadata.labels | with_entries(select(.key | startswith("feature.node.kubernetes.io")))' { "feature.node.kubernetes.io/cpu-cpuid.ADX": "true", "feature.node.kubernetes.io/cpu-cpuid.AESNI": "true", "feature.node.kubernetes.io/cpu-cpuid.AVX": "true", "feature.node.kubernetes.io/cpu-cpuid.AVX2": "true", "feature.node.kubernetes.io/cpu-cpuid.CETSS": "true", "feature.node.kubernetes.io/cpu-cpuid.CLZERO": "true", "feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8": "true", "feature.node.kubernetes.io/cpu-cpuid.CPBOOST": "true", "feature.node.kubernetes.io/cpu-cpuid.EFER_LMSLE_UNS": "true", "feature.node.kubernetes.io/cpu-cpuid.FMA3": "true", "feature.node.kubernetes.io/cpu-cpuid.FP256": "true", "feature.node.kubernetes.io/cpu-cpuid.FSRM": "true", "feature.node.kubernetes.io/cpu-cpuid.FXSR": "true", "feature.node.kubernetes.io/cpu-cpuid.FXSROPT": "true", "feature.node.kubernetes.io/cpu-cpuid.IBPB": "true", "feature.node.kubernetes.io/cpu-cpuid.IBRS": "true", "feature.node.kubernetes.io/cpu-cpuid.IBRS_PREFERRED": "true", "feature.node.kubernetes.io/cpu-cpuid.IBRS_PROVIDES_SMP": "true", "feature.node.kubernetes.io/cpu-cpuid.IBS": "true", "feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT": "true", "feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM": "true", "feature.node.kubernetes.io/cpu-cpuid.IBSFFV": "true", "feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT": "true", "feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT": "true", "feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM": "true", "feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT": "true", "feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK": "true", "feature.node.kubernetes.io/cpu-cpuid.IBS_FETCH_CTLX": "true", "feature.node.kubernetes.io/cpu-cpuid.IBS_OPFUSE": "true", "feature.node.kubernetes.io/cpu-cpuid.IBS_PREVENTHOST": "true", "feature.node.kubernetes.io/cpu-cpuid.INT_WBINVD": "true", "feature.node.kubernetes.io/cpu-cpuid.INVLPGB": "true", 
"feature.node.kubernetes.io/cpu-cpuid.LAHF": "true", "feature.node.kubernetes.io/cpu-cpuid.LBRVIRT": "true", "feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW": "true", "feature.node.kubernetes.io/cpu-cpuid.MCOMMIT": "true", "feature.node.kubernetes.io/cpu-cpuid.MOVBE": "true", "feature.node.kubernetes.io/cpu-cpuid.MOVU": "true", "feature.node.kubernetes.io/cpu-cpuid.MSRIRC": "true", "feature.node.kubernetes.io/cpu-cpuid.MSR_PAGEFLUSH": "true", "feature.node.kubernetes.io/cpu-cpuid.NRIPS": "true", "feature.node.kubernetes.io/cpu-cpuid.OSXSAVE": "true", "feature.node.kubernetes.io/cpu-cpuid.PPIN": "true", "feature.node.kubernetes.io/cpu-cpuid.PSFD": "true", "feature.node.kubernetes.io/cpu-cpuid.RDPRU": "true", "feature.node.kubernetes.io/cpu-cpuid.SEV": "true", "feature.node.kubernetes.io/cpu-cpuid.SEV_64BIT": "true", "feature.node.kubernetes.io/cpu-cpuid.SEV_ALTERNATIVE": "true", "feature.node.kubernetes.io/cpu-cpuid.SEV_DEBUGSWAP": "true", "feature.node.kubernetes.io/cpu-cpuid.SEV_ES": "true", "feature.node.kubernetes.io/cpu-cpuid.SEV_RESTRICTED": "true", "feature.node.kubernetes.io/cpu-cpuid.SEV_SNP": "true", "feature.node.kubernetes.io/cpu-cpuid.SHA": "true", "feature.node.kubernetes.io/cpu-cpuid.SME": "true", "feature.node.kubernetes.io/cpu-cpuid.SME_COHERENT": "true", "feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD": "true", "feature.node.kubernetes.io/cpu-cpuid.SSE4A": "true", "feature.node.kubernetes.io/cpu-cpuid.STIBP": "true", "feature.node.kubernetes.io/cpu-cpuid.STIBP_ALWAYSON": "true", "feature.node.kubernetes.io/cpu-cpuid.SUCCOR": "true", "feature.node.kubernetes.io/cpu-cpuid.SVM": "true", "feature.node.kubernetes.io/cpu-cpuid.SVMDA": "true", "feature.node.kubernetes.io/cpu-cpuid.SVMFBASID": "true", "feature.node.kubernetes.io/cpu-cpuid.SVML": "true", "feature.node.kubernetes.io/cpu-cpuid.SVMNP": "true", "feature.node.kubernetes.io/cpu-cpuid.SVMPF": "true", "feature.node.kubernetes.io/cpu-cpuid.SVMPFT": "true", 
"feature.node.kubernetes.io/cpu-cpuid.SYSCALL": "true", "feature.node.kubernetes.io/cpu-cpuid.SYSEE": "true", "feature.node.kubernetes.io/cpu-cpuid.TLB_FLUSH_NESTED": "true", "feature.node.kubernetes.io/cpu-cpuid.TOPEXT": "true", "feature.node.kubernetes.io/cpu-cpuid.TSCRATEMSR": "true", "feature.node.kubernetes.io/cpu-cpuid.VAES": "true", "feature.node.kubernetes.io/cpu-cpuid.VMCBCLEAN": "true", "feature.node.kubernetes.io/cpu-cpuid.VMPL": "true", "feature.node.kubernetes.io/cpu-cpuid.VMSA_REGPROT": "true", "feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ": "true", "feature.node.kubernetes.io/cpu-cpuid.VTE": "true", "feature.node.kubernetes.io/cpu-cpuid.WBNOINVD": "true", "feature.node.kubernetes.io/cpu-cpuid.X87": "true", "feature.node.kubernetes.io/cpu-cpuid.XGETBV1": "true", "feature.node.kubernetes.io/cpu-cpuid.XSAVE": "true", "feature.node.kubernetes.io/cpu-cpuid.XSAVEC": "true", "feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT": "true", "feature.node.kubernetes.io/cpu-cpuid.XSAVES": "true", "feature.node.kubernetes.io/cpu-hardware_multithreading": "false", "feature.node.kubernetes.io/cpu-model.family": "25", "feature.node.kubernetes.io/cpu-model.id": "1", "feature.node.kubernetes.io/cpu-model.vendor_id": "AMD", "feature.node.kubernetes.io/kernel-config.NO_HZ": "true", "feature.node.kubernetes.io/kernel-config.NO_HZ_FULL": "true", "feature.node.kubernetes.io/kernel-selinux.enabled": "true", "feature.node.kubernetes.io/kernel-version.full": "5.14.0-427.35.1.el9_4.x86_64", "feature.node.kubernetes.io/kernel-version.major": "5", "feature.node.kubernetes.io/kernel-version.minor": "14", "feature.node.kubernetes.io/kernel-version.revision": "0", "feature.node.kubernetes.io/memory-numa": "true", "feature.node.kubernetes.io/network-sriov.capable": "true", "feature.node.kubernetes.io/pci-102b.present": "true", "feature.node.kubernetes.io/pci-10de.present": "true", "feature.node.kubernetes.io/pci-10de.sriov.capable": "true", "feature.node.kubernetes.io/pci-15b3.present": 
"true", "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true", "feature.node.kubernetes.io/rdma.available": "true", "feature.node.kubernetes.io/rdma.capable": "true", "feature.node.kubernetes.io/storage-nonrotationaldisk": "true", "feature.node.kubernetes.io/system-os_release.ID": "rhcos", "feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION": "4.17", "feature.node.kubernetes.io/system-os_release.OSTREE_VERSION": "417.94.202409121747-0", "feature.node.kubernetes.io/system-os_release.RHEL_VERSION": "9.4", "feature.node.kubernetes.io/system-os_release.VERSION_ID": "4.17", "feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "4", "feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "17" }

Finally we can confirm that the Mellanox/NVIDIA network devices (PCI vendor ID 15b3) were discovered.

$ oc describe node | grep -E 'Roles|pci' | grep pci-15b3
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
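Because deviceLabelFields is set to "vendor", each PCI label carries only the vendor ID. As a rough sketch, the helper below (pci_vendors is a name made up for this example, not part of NFD) pulls the distinct vendor IDs out of label text such as the jq output shown earlier. The sample input is copied from that output.

```shell
# pci_vendors: read NFD label text on stdin and print each distinct PCI
# vendor ID that has a "pci-<vendor>.present" label, one per line.
pci_vendors() {
  sed -n 's/.*feature\.node\.kubernetes\.io\/pci-\([0-9a-f]\{4\}\)\.present.*/\1/p' | sort -u
}

# Sample label lines taken from the node labels shown above.
printf '%s\n' \
  '"feature.node.kubernetes.io/pci-102b.present": "true",' \
  '"feature.node.kubernetes.io/pci-10de.present": "true",' \
  '"feature.node.kubernetes.io/pci-10de.sriov.capable": "true",' \
  '"feature.node.kubernetes.io/pci-15b3.present": "true",' | pci_vendors
```

On a live cluster the same function could be fed from the oc get node command shown earlier. In this environment 10de is the NVIDIA GPU vendor ID and 15b3 is the Mellanox/NVIDIA networking vendor ID.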

If everything looks good we can move onto the next operator.

Install and Configure NMState Operator

There might be a need to configure network interfaces on the nodes that were not configured at initial cluster creation time. The NMState operator is designed for those use cases. The first step is to create a custom resource file that contains the namespace, operator group and subscription.

$ cat <<EOF > nmstate-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  labels:
    kubernetes.io/metadata.name: openshift-nmstate
    name: openshift-nmstate
  name: openshift-nmstate
spec:
  finalizers:
  - kubernetes
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  annotations:
    olm.providedAPIs: NMState.v1.nmstate.io
  name: openshift-nmstate
  namespace: openshift-nmstate
spec:
  targetNamespaces:
  - openshift-nmstate
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/kubernetes-nmstate-operator.openshift-nmstate: ""
  name: kubernetes-nmstate-operator
  namespace: openshift-nmstate
spec:
  channel: stable
  installPlanApproval: Automatic
  name: kubernetes-nmstate-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

Then we can take the custom resource file and create it on the cluster.

$ oc create -f nmstate-operator.yaml
namespace/openshift-nmstate created
operatorgroup.operators.coreos.com/openshift-nmstate created
subscription.operators.coreos.com/kubernetes-nmstate-operator created

Next we should validate the operator is up and running.

$ oc get pods -n openshift-nmstate NAME READY STATUS RESTARTS AGE nmstate-operator-d587966c9-qkl5m 1/1 Running 0 43s

An NMState instance is required so we will create a custom resource file for that.

$ cat <<EOF > nmstate-instance.yaml
apiVersion: nmstate.io/v1
kind: NMState
metadata:
  name: nmstate
EOF

Then we will create the instance on the cluster.

$ oc create -f nmstate-instance.yaml nmstate.nmstate.io/nmstate created

Finally we will validate the instance is running.

$ oc get pods -n openshift-nmstate
NAME                                      READY   STATUS    RESTARTS   AGE
nmstate-cert-manager-6dc78dc6bf-ds7kj     1/1     Running   0          17s
nmstate-console-plugin-5b7595c56c-tgzbw   1/1     Running   0          17s
nmstate-handler-lxkd5                     1/1     Running   0          17s
nmstate-operator-d587966c9-qkl5m          1/1     Running   0          3m27s
nmstate-webhook-54dbd47d9d-cvsf6          0/1     Running   0          17s
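Note that in the listing above the webhook pod reports Running but is still 0/1 ready, so the STATUS column alone does not tell the whole story. The sketch below shows one possible way to flag such pods from the oc get pods output. The function name pods_not_ready is an assumption for this example, and the sample rows are copied from the listing above.

```shell
# pods_not_ready: read `oc get pods` rows on stdin and print the name of
# every pod that is neither Completed nor Running with all containers ready.
pods_not_ready() {
  awk '
    $1 == "NAME"      { next }          # skip the header row if present
    $3 == "Completed" { next }          # finished jobs are fine
    {
      split($2, ready, "/")             # READY column is "ready/total"
      if ($3 != "Running" || ready[1] != ready[2]) print $1
    }'
}

# Sample rows taken from the openshift-nmstate listing above.
printf '%s\n' \
  'NAME READY STATUS RESTARTS AGE' \
  'nmstate-handler-lxkd5 1/1 Running 0 17s' \
  'nmstate-operator-d587966c9-qkl5m 1/1 Running 0 3m27s' \
  'nmstate-webhook-54dbd47d9d-cvsf6 0/1 Running 0 17s' | pods_not_ready
```

Against a cluster this could be run as oc get pods -n openshift-nmstate | pods_not_ready and repeated until it prints nothing.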

Next we can build a NodeNetworkConfigurationPolicy. The example below will configure a static IP address on the ens8f0np0 interface on nvd-srv-32.

$ cat <<EOF > nncp-static-ip.yaml
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: ens8f0np0-policy
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  desiredState:
    interfaces:
    - name: ens8f0np0
      description: Configuring ens8f0np0 on nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
      type: ethernet
      state: up
      ipv4:
        dhcp: false
        address:
        - ip: 10.6.145.32
          prefix-length: 24
        enabled: true
EOF

Once we have the custom resource file we can create it on the cluster.

$ oc create -f nncp-static-ip.yaml
nodenetworkconfigurationpolicy.nmstate.io/ens8f0np0-policy created
$ oc get nncp -A
NAME               STATUS      REASON
ens8f0np0-policy   Available   SuccessfullyConfigured

We can validate that the IP address is set by looking inside the node at the interface.

$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com Starting pod/nvd-srv-32nvidiaengrdu2dcredhatcom-debug-8mx6q ... To use host binaries, run `chroot /host` Pod IP: 10.6.135.11 If you don't see a command prompt, try pressing enter. sh-5.1# chroot /host sh-5.1# ip address show dev ens8f0np0 96: ens8f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000 link/ether 58:a2:e1:e1:42:78 brd ff:ff:ff:ff:ff:ff altname enp160s0f0np0 inet 10.6.145.32/24 brd 10.6.145.255 scope global noprefixroute ens8f0np0 valid_lft forever preferred_lft forever inet6 fe80::c397:5afa:d618:e752/64 scope link noprefixroute valid_lft forever preferred_lft forever

If everything looks good we can proceed to the next operator.

Install and Configure SRIOV Operator

Now we need to create the SRIOV Operator custom resource file to create the namespace, operator group and subscription.

$ cat << EOF > openshift-sriov-network-operator.yaml apiVersion: v1 kind: Namespace metadata: name: openshift-sriov-network-operator --- apiVersion: operators.coreos.com/v1 kind: OperatorGroup metadata: name: sriov-network-operators namespace: openshift-sriov-network-operator spec: targetNamespaces: - openshift-sriov-network-operator upgradeStrategy: Default --- apiVersion: operators.coreos.com/v1alpha1 kind: Subscription metadata: name: sriov-network-operator-subscription namespace: openshift-sriov-network-operator spec: channel: stable installPlanApproval: Automatic name: sriov-network-operator source: redhat-operators sourceNamespace: openshift-marketplace EOF

Now we can create the SRIOV resource on the cluster.

$ oc create -f openshift-sriov-network-operator.yaml namespace/openshift-sriov-network-operator created operatorgroup.operators.coreos.com/sriov-network-operators created subscription.operators.coreos.com/sriov-network-operator-subscription created

We can validate the operator is running by looking at the pod output.

$ oc get pods -n openshift-sriov-network-operator NAME READY STATUS RESTARTS AGE sriov-network-operator-7cb6c49868-89486 1/1 Running 0 22s

Next we will need to create the default SriovOperatorConfig configuration file.

$ cat <<EOF > sriov-operator-config.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  enableInjector: true
  enableOperatorWebhook: true
  logLevel: 2
EOF

Then create the resource on the cluster.

$ oc create -f sriov-operator-config.yaml sriovoperatorconfig.sriovnetwork.openshift.io/default created

For the default SriovOperatorConfig to work with the MLNX_OFED container, please run the following patch command.

$ oc patch sriovoperatorconfig default --type=merge -n openshift-sriov-network-operator --patch '{
  "spec": {
    "configDaemonNodeSelector": {
      "network.nvidia.com/operator.mofed.wait": "false",
      "node-role.kubernetes.io/worker": "",
      "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true"
    }
  }
}'
sriovoperatorconfig.sriovnetwork.openshift.io/default patched

If everything looks good we can proceed to installing the next operator.

Install and Configure Network Operator

To get started we will generate an NVIDIA Network Operator custom resource file that will create the namespace, operator group and subscription.

$ cat <<EOF > network-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-network-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-network-operator
  namespace: nvidia-network-operator
spec:
  targetNamespaces:
  - nvidia-network-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nvidia-network-operator
  namespace: nvidia-network-operator
spec:
  channel: v24.10.0
  installPlanApproval: Automatic
  name: nvidia-network-operator
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF

Next we can create the resources on the cluster.

$ oc create -f network-operator.yaml namespace/nvidia-network-operator created operatorgroup.operators.coreos.com/nvidia-network-operator created subscription.operators.coreos.com/nvidia-network-operator created

We can then validate that the network operator has installed and is running by confirming the controller is running in the nvidia-network-operator namespace.

$ oc get pods -n nvidia-network-operator NAME READY STATUS RESTARTS AGE nvidia-network-operator-controller-manager-6f7d6956cd-fw5wg 1/1 Running 0 5m

With the operator up we can create the NicClusterPolicy custom resource file. Note that in this file I have hard-coded the InfiniBand interface as ibs2f0 and the Ethernet interface as ens8f0np0, which I will be using as my shared RDMA devices. These could be different devices depending on the system configuration.

$ cat <<EOF > network-sharedrdma-nic-cluster-policy.yaml apiVersion: mellanox.com/v1alpha1 kind: NicClusterPolicy metadata: name: nic-cluster-policy spec: nicFeatureDiscovery: image: nic-feature-discovery repository: ghcr.io/mellanox version: v0.0.1 docaTelemetryService: image: doca_telemetry repository: nvcr.io/nvidia/doca version: 1.16.5-doca2.6.0-host rdmaSharedDevicePlugin: config: | { "configList": [ { "resourceName": "rdma_shared_device_ib", "rdmaHcaMax": 63, "selectors": { "ifNames": ["ibs2f0"] } }, { "resourceName": "rdma_shared_device_eth", "rdmaHcaMax": 63, "selectors": { "ifNames": ["ens8f0np0"] } } ] } image: k8s-rdma-shared-dev-plugin repository: ghcr.io/mellanox version: v1.5.1 secondaryNetwork: ipoib: image: ipoib-cni repository: ghcr.io/mellanox version: v1.2.0 nvIpam: enableWebhook: false image: nvidia-k8s-ipam repository: ghcr.io/mellanox version: v0.2.0 ofedDriver: readinessProbe: initialDelaySeconds: 10 periodSeconds: 30 forcePrecompiled: false terminationGracePeriodSeconds: 300 livenessProbe: initialDelaySeconds: 30 periodSeconds: 30 upgradePolicy: autoUpgrade: true drain: deleteEmptyDir: true enable: true force: true timeoutSeconds: 300 podSelector: '' maxParallelUpgrades: 1 safeLoad: false waitForCompletion: timeoutSeconds: 0 startupProbe: initialDelaySeconds: 10 periodSeconds: 20 image: doca-driver repository: nvcr.io/nvidia/mellanox version: 24.10-0.7.0.0-0 env: - name: UNLOAD_STORAGE_MODULES value: "true" - name: RESTORE_DRIVER_ON_POD_TERMINATION value: "true" - name: CREATE_IFNAMES_UDEV value: "true" EOF

Next we can create the NicClusterPolicy custom resource on the cluster.

$ oc create -f network-sharedrdma-nic-cluster-policy.yaml nicclusterpolicy.mellanox.com/nic-cluster-policy created

We can validate the NicClusterPolicy by first confirming that its pods are running in the nvidia-network-operator namespace.

$ oc get pods -n nvidia-network-operator NAME READY STATUS RESTARTS AGE doca-telemetry-service-hwj65 1/1 Running 2 160m kube-ipoib-cni-ds-fsn8g 1/1 Running 2 160m mofed-rhcos4.16-9b5ddf4c6-ds-ct2h5 2/2 Running 4 160m nic-feature-discovery-ds-dtksz 1/1 Running 2 160m nv-ipam-controller-854585f594-c5jpp 1/1 Running 2 160m nv-ipam-controller-854585f594-xrnp5 1/1 Running 2 160m nv-ipam-node-xqttl 1/1 Running 2 160m nvidia-network-operator-controller-manager-5798b564cd-5cq99 1/1 Running 2 5d23h rdma-shared-dp-ds-p9vvg 1/1 Running 0 85m

And we can rsh into the mofed container to check a few things.

$ MOFED_POD=$(oc get pods -n nvidia-network-operator -o name | grep mofed) $ oc rsh -n nvidia-network-operator -c mofed-container ${MOFED_POD} sh-5.1# ofed_info -s OFED-internal-24.10-0.7.0.0-0: sh-5.1# ibdev2netdev -v 0000:0d:00.0 mlx5_0 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket fw 32.42.1000 port 1 (ACTIVE) ==> ibs2f0 (Up) 0000:a0:00.0 mlx5_1 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket fw 32.42.1000 port 1 (ACTIVE) ==> ens8f0np0 (Up)
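The interface names on the right-hand side of the ibdev2netdev output are the values hard-coded under ifNames in the NicClusterPolicy. As a small sketch, the helper below (up_netdevs is a name invented for this example) extracts the interface name of every port reported as Up. The first two sample lines mirror the output above with the device descriptions trimmed, and the third line is a hypothetical Down port added only to show the filtering.

```shell
# up_netdevs: read ibdev2netdev output on stdin and print the network
# interface name of every port reported as (Up), one per line.
up_netdevs() {
  sed -n 's/.*==> \([^ ]*\) (Up).*/\1/p'
}

# Sample lines; ens9f0np0 is a made-up interface used for illustration.
printf '%s\n' \
  '0000:0d:00.0 mlx5_0 port 1 (ACTIVE) ==> ibs2f0 (Up)' \
  '0000:a0:00.0 mlx5_1 port 1 (ACTIVE) ==> ens8f0np0 (Up)' \
  '0000:a1:00.0 mlx5_2 port 1 (DOWN) ==> ens9f0np0 (Down)' | up_netdevs
```

Inside the mofed container the equivalent would be ibdev2netdev -v | up_netdevs, which gives a quick way to check the names before editing the policy on a different system.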

Now we need to create an IPoIBNetwork custom resource file (for InfiniBand based interfaces).

$ cat <<EOF > ipoib-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: example-ipoibnetwork
spec:
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.6.225/28",
      "exclude": [
        "192.168.6.229/30",
        "192.168.6.236/32"
      ]
    }
  master: ibs2f0
  networkNamespace: default
EOF
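The range and exclude values in the ipam block are CIDR blocks, and it can help to see which addresses each one spans. The sketch below is plain CIDR arithmetic in POSIX shell. The function names are made up for this example, and it makes no claim about how whereabouts itself treats the host bits of a range value.

```shell
# ip_to_int: convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1                 # split the address on dots into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# int_to_ip: convert an integer back to dotted-quad form.
int_to_ip() {
  echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# cidr_bounds: print the first and last address of a CIDR block.
cidr_bounds() {
  addr=${1%/*}; bits=${1#*/}
  n=$(ip_to_int "$addr")
  size=$(( 1 << (32 - bits) ))   # number of addresses in the block
  first=$(( n - n % size ))      # round down to the block boundary
  last=$(( first + size - 1 ))
  echo "$(int_to_ip "$first") $(int_to_ip "$last")"
}

cidr_bounds 192.168.6.225/28   # the range value
cidr_bounds 192.168.6.229/30   # first exclude entry
cidr_bounds 192.168.6.236/32   # second exclude entry
```

By this arithmetic the /28 block spans 192.168.6.224 through 192.168.6.239, the /30 exclusion covers 192.168.6.228 through 192.168.6.231, and the /32 exclusion is the single address 192.168.6.236.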

And then create the IPoIBNetwork resource on the cluster.

$ oc create -f ipoib-network.yaml
ipoibnetwork.mellanox.com/example-ipoibnetwork created

We will do the same thing for our Ethernet interface, but this time with a MacvlanNetwork custom resource file.

$ cat <<EOF > macvlan-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdmashared-net
spec:
  networkNamespace: default
  master: ens8f0np0
  mode: bridge
  mtu: 1500
  ipam: '{"type": "whereabouts", "range": "192.168.2.0/24", "gateway": "192.168.2.1"}'
EOF

Then create the resource on the cluster.

$ oc create -f macvlan-network.yaml macvlannetwork.mellanox.com/rdmashared-net created

If everything looks good we can proceed to the next operator.

Install and Configure GPU Operator

The next operator we need to configure is the NVIDIA GPU Operator. As with most operators, we will start by generating a custom resource file that creates the namespace, operator group and subscription.

$ cat <<EOF > gpu-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nvidia-gpu-operator
  namespace: nvidia-gpu-operator
spec:
  channel: "v24.9"
  installPlanApproval: Automatic
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF

Next we can create the resources on the cluster.

$ oc create -f gpu-operator.yaml namespace/nvidia-gpu-operator created operatorgroup.operators.coreos.com/nvidia-gpu-operator created subscription.operators.coreos.com/nvidia-gpu-operator created

We can check that the operator pod is running by looking at the pods under the namespace.

$ oc get pods -n nvidia-gpu-operator NAME READY STATUS RESTARTS AGE gpu-operator-b4cb7d74-zxpwq 1/1 Running 0 32s

Now that we have the operator running we need to create a GPU cluster policy custom resource file like the one below.

$ cat <<EOF > gpu-cluster-policy.yaml apiVersion: nvidia.com/v1 kind: ClusterPolicy metadata: name: gpu-cluster-policy spec: vgpuDeviceManager: config: default: default enabled: true migManager: config: default: all-disabled name: default-mig-parted-config enabled: true operator: defaultRuntime: crio initContainer: {} runtimeClass: nvidia use_ocp_driver_toolkit: true dcgm: enabled: true gfd: enabled: true dcgmExporter: config: name: '' serviceMonitor: enabled: true enabled: true cdi: default: false enabled: false driver: licensingConfig: nlsEnabled: true configMapName: '' certConfig: name: '' rdma: enabled: true kernelModuleConfig: name: '' upgradePolicy: autoUpgrade: true drain: deleteEmptyDir: false enable: false force: false timeoutSeconds: 300 maxParallelUpgrades: 1 maxUnavailable: 25% podDeletion: deleteEmptyDir: false force: false timeoutSeconds: 300 waitForCompletion: timeoutSeconds: 0 repoConfig: configMapName: '' virtualTopology: config: '' enabled: true useNvidiaDriverCRD: false useOpenKernelModules: true devicePlugin: config: name: '' default: '' mps: root: /run/nvidia/mps enabled: true gdrcopy: enabled: true kataManager: config: artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses mig: strategy: single sandboxDevicePlugin: enabled: true validator: plugin: env: - name: WITH_WORKLOAD value: 'false' nodeStatusExporter: enabled: true daemonsets: rollingUpdate: maxUnavailable: '1' updateStrategy: RollingUpdate sandboxWorkloads: defaultWorkload: container enabled: false gds: enabled: true image: nvidia-fs version: 2.20.5 repository: nvcr.io/nvidia/cloud-native vgpuManager: enabled: false vfioManager: enabled: true toolkit: installDir: /usr/local/nvidia enabled: true EOF

With the GPU ClusterPolicy custom resource file generated, let's create it on the cluster.

$ oc create -f gpu-cluster-policy.yaml clusterpolicy.nvidia.com/gpu-cluster-policy created

After some time, all the pods are up and running.

$ oc get pods -n nvidia-gpu-operator NAME READY STATUS RESTARTS AGE gpu-feature-discovery-d5ngn 1/1 Running 0 3m20s gpu-feature-discovery-z42rx 1/1 Running 0 3m23s gpu-operator-6bb4d4b4c5-njh78 1/1 Running 0 4m35s nvidia-container-toolkit-daemonset-bkh8l 1/1 Running 0 3m20s nvidia-container-toolkit-daemonset-c4hzm 1/1 Running 0 3m23s nvidia-cuda-validator-4blvg 0/1 Completed 0 106s nvidia-cuda-validator-tw8sl 0/1 Completed 0 112s nvidia-dcgm-exporter-rrw4g 1/1 Running 0 3m20s nvidia-dcgm-exporter-xc78t 1/1 Running 0 3m23s nvidia-dcgm-nvxpf 1/1 Running 0 3m20s nvidia-dcgm-snj4j 1/1 Running 0 3m23s nvidia-device-plugin-daemonset-fk2xz 1/1 Running 0 3m23s nvidia-device-plugin-daemonset-wq87j 1/1 Running 0 3m20s nvidia-driver-daemonset-416.94.202410211619-0-ngrjg 4/4 Running 0 3m58s nvidia-driver-daemonset-416.94.202410211619-0-tm4x6 4/4 Running 0 3m58s nvidia-node-status-exporter-jlzxh 1/1 Running 0 3m57s nvidia-node-status-exporter-zjffs 1/1 Running 0 3m57s nvidia-operator-validator-l49hx 1/1 Running 0 3m20s nvidia-operator-validator-n44nn 1/1 Running 0 3m23s

Once we see the pods running above, we can remote shell into the NVIDIA driver daemonset pod and confirm two items. The first is that the nvidia modules are loaded, specifically that nvidia_peermem is among them. The second is the output of the nvidia-smi utility, which shows the details about the driver and the hardware.

$ oc rsh -n nvidia-gpu-operator $(oc -n nvidia-gpu-operator get pod -o name -l app.kubernetes.io/component=nvidia-driver) sh-4.4# lsmod|grep nvidia nvidia_fs 327680 0 nvidia_peermem 24576 0 nvidia_modeset 1507328 0 video 73728 1 nvidia_modeset nvidia_uvm 6889472 8 nvidia 8810496 43 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset ib_uverbs 217088 3 nvidia_peermem,rdma_ucm,mlx5_ib drm 741376 5 drm_kms_helper,drm_shmem_helper,nvidia,mgag200 sh-4.4# nvidia-smi Wed Nov 6 22:03:53 2024 +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 | |-----------------------------------------+------------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+========================+======================| | 0 NVIDIA A40 On | 00000000:61:00.0 Off | 0 | | 0% 37C P0 88W / 300W | 1MiB / 46068MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 1 NVIDIA A40 On | 00000000:E1:00.0 Off | 0 | | 0% 28C P8 29W / 300W | 1MiB / 46068MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ +-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| | No running processes found | +-----------------------------------------------------------------------------------------+

While we are in the driver pod we should also set the GPU clock to maximum using the following nvidia-smi command. This is optional, but why not have it at full speed?

$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-416.94.202410172137-0-ndhzc sh-4.4# nvidia-smi -i 0 -lgc $(nvidia-smi -i 0 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1) GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:61:00.0 All done. sh-4.4# nvidia-smi -i 1 -lgc $(nvidia-smi -i 1 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1) GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:E1:00.0 All done.
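The command above works by listing every supported graphics clock, sorting the list and taking the last entry. A minimal sketch of that selection step on its own is shown below. highest_clock is a name made up for this example, and apart from 1740 the sample clock values are invented for illustration.

```shell
# highest_clock: read one clock value (MHz) per line on stdin and print
# the largest one.
highest_clock() {
  sort -n | tail -n 1
}

# Sample values; only 1740 comes from the output above.
printf '%s\n' 210 1740 1725 | highest_clock

# Inside the driver pod the same helper could drive every GPU in one loop
# (left commented out because it needs nvidia-smi and real hardware):
# for i in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
#   nvidia-smi -i "$i" -lgc "$(nvidia-smi -i "$i" \
#     --query-supported-clocks=graphics --format=csv,noheader,nounits | highest_clock)"
# done
```

The loop form avoids repeating the command per GPU index, which becomes more convenient on nodes with more than two GPUs.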

One last thing we can do is validate that our resources are available from a node describe perspective.

$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A9 Capacity: cpu: 128 ephemeral-storage: 1561525616Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 263596712Ki nvidia.com/gpu: 2 pods: 250 rdma/rdma_shared_device_eth: 63 rdma/rdma_shared_device_ib: 63 Allocatable: cpu: 127500m ephemeral-storage: 1438028263499 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 262445736Ki nvidia.com/gpu: 2 pods: 250 rdma/rdma_shared_device_eth: 63 rdma/rdma_shared_device_ib: 63 -- Capacity: cpu: 128 ephemeral-storage: 1561525616Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 263596672Ki nvidia.com/gpu: 2 pods: 250 rdma/rdma_shared_device_eth: 63 rdma/rdma_shared_device_ib: 63 Allocatable: cpu: 127500m ephemeral-storage: 1438028263499 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 262445696Ki nvidia.com/gpu: 2 pods: 250 rdma/rdma_shared_device_eth: 63 rdma/rdma_shared_device_ib: 63
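The grep above prints a fixed number of lines after each heading, which works but depends on the resource list length. As an alternative sketch, the helper below reads oc describe node output and prints the Allocatable value of one named resource per node. The function name allocatable_of is an assumption for this example, and the sample text is a shortened copy of the output above.

```shell
# allocatable_of RESOURCE: read `oc describe node` output on stdin and
# print the Allocatable value of RESOURCE, one line per node.
allocatable_of() {
  awk -v res="$1:" '
    /^Capacity:/    { section = "capacity" }
    /^Allocatable:/ { section = "allocatable" }
    # any other unindented heading ends the current section
    /^[A-Z]/ && !/^(Capacity|Allocatable):/ { section = "" }
    section == "allocatable" && $1 == res { print $2 }'
}

# Shortened sample shaped like the describe output above.
allocatable_of rdma/rdma_shared_device_ib <<'EOF'
Capacity:
  cpu:                         128
  nvidia.com/gpu:              2
  rdma/rdma_shared_device_ib:  63
Allocatable:
  cpu:                         127500m
  nvidia.com/gpu:              2
  rdma/rdma_shared_device_ib:  63
System Info:
  Kernel Version:              5.14.0-427.35.1.el9_4.x86_64
EOF
```

On the cluster this could be fed with oc describe node -l node-role.kubernetes.io/worker= and called once per resource name, for example nvidia.com/gpu or rdma/rdma_shared_device_eth.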

If everything looks good we can proceed to actual RDMA testing.

The Shared Device RDMA Testing

This section will cover running workload pods across the nodes in the environment. We will set up the required privileges, create the workload pods, validate connectivity between the two hosts on the InfiniBand fabric and then run a performance test.

Create Service Account

First let's generate a service account custom resource file to use in the default namespace.

$ cat <<EOF > default-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rdma
  namespace: default
EOF

Next we can create it on our cluster.

$ oc create -f default-serviceaccount.yaml serviceaccount/rdma created

Finally, with the service account created, we can add privileges to it.

$ oc -n default adm policy add-scc-to-user privileged -z rdma clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "rdma"

If everything looks good we can move onto creating the workload pods.

Create Workload Pods for IB

With the service account set up we now need to create a workload pod that contains all the tooling for our testing. We can generate a custom pod resource file for each worker node as follows to meet that requirement.

$ cat <<EOF > rdma-ib-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-ib-32-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-ib-32-workload
    command:
    - sh
    - -c
    - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
EOF
$ cat <<EOF > rdma-ib-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-ib-33-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-ib-33-workload
    command:
    - sh
    - -c
    - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
EOF

Then we can create the pods on the cluster.

$ oc create -f rdma-ib-32-workload.yaml
pod/rdma-ib-32-workload created
$ oc create -f rdma-ib-33-workload.yaml
pod/rdma-ib-33-workload created

Let's validate the pods are running.

$ oc get pods
NAME                  READY   STATUS    RESTARTS   AGE
rdma-ib-32-workload   1/1     Running   0          10s
rdma-ib-33-workload   1/1     Running   0          3s

With the pods up and running we can validate connectivity.

Validate IB Connectivity

This section will cover confirming the InfiniBand connectivity is working between the systems. This section is optional but provides a lot of good InfiniBand troubleshooting tips. First we should rsh into each workload pod.

$ oc rsh -n default rdma-ib-32-workload sh-5.1#

The first command we can run is the ibhosts command, which shows the InfiniBand host nodes in the topology.

sh-5.1# ibhosts
Ca      : 0x58a2e10300e14446 ports 1 "nvd-srv-33 mlx5_0"
Ca      : 0x58a2e10300dfe416 ports 1 "nvd-srv-32 mlx5_0"

We can also run the ibnodes command which will show not only the nodes but also switches in the topology.

sh-5.1# ibnodes
Ca      : 0x58a2e10300e14446 ports 1 "nvd-srv-33 mlx5_0"
Ca      : 0x58a2e10300dfe416 ports 1 "nvd-srv-32 mlx5_0"
Switch  : 0xfc6a1c0300e7ecc0 ports 129 "MF0;qm9700-ib:MQM9700/U1" enhanced port 0 lid 1 lmc 0

We can look deeper into an interface state by passing a device name to the ibstatus command. If no device is passed, all of them are displayed.

sh-5.1# ibstatus mlx5_0
Infiniband device 'mlx5_0' port 1 status:
    default gid:   fe80:0000:0000:0000:58a2:e103:00df:e416
    base lid:      0x4
    sm lid:        0x1
    state:         4: ACTIVE
    phys state:    5: LinkUp
    rate:          400 Gb/sec (4X NDR)
    link_layer:    InfiniBand

Now that we have familiarized ourselves with the environment we can run ibstat and grep out only certain key elements of the output. These will be needed for the ibping test.

The first ibstat output is that of our first node which will act as the server side for the ibping command.

sh-5.1# ibstat | egrep "Port|Base|Link"
        Port 1:
                Physical state: LinkUp
                Base lid: 4
                Port GUID: 0x58a2e10300e14446
                Link layer: InfiniBand
        Port 1:
                Physical state: LinkUp
                Base lid: 0
                Port GUID: 0x0000000000000000
                Link layer: Ethernet

The output above shows both an infiniband and ethernet interface. We are only interested in the infiniband in this use case. Make note of the lid number as that is used in the ibping command on the client side.
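Since the lid is the one value we carry over to the ibping client, it can be handy to pull it out directly. The following is a minimal sketch; ib_base_lid is a hypothetical helper name of my own, not something shipped with the InfiniBand tools, and it assumes the ibstat field names shown above.

```shell
# ib_base_lid: read ibstat output on stdin and print the "Base lid" of every
# port whose link layer is InfiniBand. Hypothetical helper for illustration.
ib_base_lid() {
  awk -F': *' '
    /Base lid:/   { lid = $2 }                          # remember the most recent lid
    /Link layer:/ { if ($2 ~ /InfiniBand/) print lid }  # emit it only for IB ports
  '
}

# Demo using the values from the output above.
ib_base_lid <<'EOF'
Port 1:
Physical state: LinkUp
Base lid: 4
Port GUID: 0x58a2e10300e14446
Link layer: InfiniBand
Port 1:
Physical state: LinkUp
Base lid: 0
Port GUID: 0x0000000000000000
Link layer: Ethernet
EOF
```

Inside the pod the same helper could be fed with ibstat | ib_base_lid.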

We can run the same command on the client side and notice while some of the details are similar the lid number is unique along with the port GUID.

sh-5.1# ibstat | egrep "Port|Base|Link"
        Port 1:
                Physical state: LinkUp
                Base lid: 5
                Port GUID: 0x58a2e10300e14446
                Link layer: InfiniBand
        Port 1:
                Physical state: LinkUp
                Base lid: 0
                Port GUID: 0x0000000000000000
                Link layer: Ethernet

Next we can run an ibping with the server switch on the first workload pod.

sh-5.1# ibping -S -P 1 -d
ibdebug: [114] ibping_serv: starting to serve...
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)

And on the second workload pod we can run an ibping command to ping the server side we started on the other pod.

sh-5.1# ibping -P 1 4
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.011 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.014 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
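If you want a single number rather than a scrolling list of pongs, the times can be averaged. This is only a sketch: pong_avg_ms is a hypothetical helper name, and it assumes the "Pong from ... time N ms" line format shown above.

```shell
# pong_avg_ms: read ibping client output on stdin and print the mean
# round-trip time in milliseconds. Hypothetical helper for illustration.
pong_avg_ms() {
  awk '/^Pong from/ { sum += $(NF-1); n++ }        # time is the second-to-last field
       END { if (n) printf "%.3f\n", sum / n }'
}

# Demo with two of the lines from the output above.
pong_avg_ms <<'EOF'
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.014 ms
EOF
```

In the pod this could be combined with a fixed ping count, for example ibping -c 10 -P 1 4 | pong_avg_ms.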

Once we have completed confirming connectivity we can move onto the performance testing.

Now we want to run a test across the two pods running. We will need to rsh into the first pod and run the ib_write_bw command. Then we will rsh into the second pod in a different terminal window and run the ib_write_bw <ipaddress> command.

$ oc get pods -n default
NAME                  READY   STATUS    RESTARTS   AGE
rdma-ib-32-workload   1/1     Running   0          8m12s
rdma-ib-33-workload   1/1     Running   0          8m5s

First let's get the IP address of the first pod.

$ oc get pod rdma-ib-32-workload -o yaml | grep -E 'default/example-ipoibnetwork' -A3
        "name": "default/example-ipoibnetwork",
        "interface": "net1",
        "ips": [
            "192.168.6.225"
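The grep above prints the surrounding JSON along with the address. As a small sketch, assuming the output format shown above, a helper can return just the IP so it can be dropped into a variable. net_ip is a hypothetical name chosen for this example.

```shell
# net_ip: read `oc get pod <name> -o yaml` output on stdin and print the first
# IPv4 address that follows the entry for the given secondary network.
# Hypothetical helper for illustration.
net_ip() {
  awk -v net="$1" '
    index($0, "\"name\": \"" net "\"") { found = 1; next }
    found && match($0, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/) {
      print substr($0, RSTART, RLENGTH); exit
    }
  '
}

# Demo using the output shown above.
net_ip default/example-ipoibnetwork <<'EOF'
        "name": "default/example-ipoibnetwork",
        "interface": "net1",
        "ips": [
            "192.168.6.225"
EOF
```

On the cluster this would be used as oc get pod rdma-ib-32-workload -o yaml | net_ip default/example-ipoibnetwork.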

Now rsh into the first pod and run the ib_write_bw command and leave that terminal open.

$ oc rsh -n default rdma-ib-32-workload
sh-5.1# ib_write_bw

************************************
* Waiting for client to connect... *
************************************

Then open another terminal and rsh to the second pod and run ib_write_bw 192.168.6.225.

$ oc rsh -n default rdma-ib-33-workload
sh-5.1# ib_write_bw 192.168.6.225
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x05 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007fcbace2f000
 remote address: LID 0x04 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007f360e3d8000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 2500.000000 != 3495.887000. CPU Frequency is not max.
 65536      5000             44604.62            44576.86            0.713230
---------------------------------------------------------------------------------------

If we go back to the first terminal on pod number one we should see matching results.

sh-5.1# ib_write_bw

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x04 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007f360e3d8000
 remote address: LID 0x05 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007fcbace2f000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      5000             44604.62            44576.86            0.713230
---------------------------------------------------------------------------------------
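The link rate reported earlier by ibstatus is in Gb/sec while ib_write_bw reports MB/sec, so a quick conversion makes the two easier to compare. The sketch below assumes that perftest's MB means 2^20 bytes, which is my assumption and worth checking against your perftest version. mibs_to_gbps is a hypothetical helper name.

```shell
# mibs_to_gbps: convert a "BW average[MB/sec]" value to Gb/s.
# ASSUMPTION: 1 MB here is 2^20 bytes; adjust the factor if your perftest
# build reports decimal megabytes instead.
mibs_to_gbps() {
  awk -v mb="$1" 'BEGIN { printf "%.2f\n", mb * 1048576 * 8 / 1e9 }'
}

# Average bandwidth from the InfiniBand run above.
mibs_to_gbps 44576.86   # prints 373.94
```

Under that assumption the run above works out to roughly 374 Gb/s on the 400 Gb/sec NDR link. perftest can also print Gb/s itself via its --report_gbits option, which avoids the conversion entirely.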

We can now clean up the pods since testing is over and move onto the next test.

Create Workload Pods for ETH

Now we need to test RDMA over Ethernet. We can generate a custom pod resource file for each node as follows to meet that requirement.

$ cat <<EOF > rdma-eth-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-32-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-eth-32-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF

$ cat <<EOF > rdma-eth-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-33-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-eth-33-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF

Then we can create the pods on the cluster.

$ oc create -f rdma-eth-32-workload.yaml
pod/rdma-eth-32-workload created
$ oc create -f rdma-eth-33-workload.yaml
pod/rdma-eth-33-workload created

Let's validate the pods are running.

$ oc get pods -n default
NAME                   READY   STATUS    RESTARTS   AGE
rdma-eth-32-workload   1/1     Running   0          25s
rdma-eth-33-workload   1/1     Running   0          22s

With the pods up and running we can move onto the actual test.

Now we want to run a test across the two pods running. We will need to rsh into the first pod and run the ib_write_bw command. Then we will rsh into the second pod in a different terminal window and run the ib_write_bw <ipaddress> command.

$ oc get pods -n default
NAME                   READY   STATUS    RESTARTS   AGE
rdma-eth-32-workload   1/1     Running   0          106s
rdma-eth-33-workload   1/1     Running   0          103s

First let's get the IP address of the first pod.

$ oc get pod rdma-eth-32-workload -o yaml | grep -E 'default/rdmashared' -A3
        "name": "default/rdmashared-net",
        "interface": "net1",
        "ips": [
            "192.168.2.1"

Now rsh into the first pod and run the ib_write_bw command and leave that terminal open.

$ oc rsh -n default rdma-eth-32-workload
sh-5.1# ib_write_bw

************************************
* Waiting for client to connect... *
************************************

Then open another terminal and rsh to the second pod and run ib_write_bw 192.168.2.1.

$ oc rsh -n default rdma-eth-33-workload
sh-5.1# ib_write_bw 192.168.2.1
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x05 QPN 0x0ce2 PSN 0x5389f7 RKey 0x1fff00 VAddr 0x007f7368df3000
 remote address: LID 0x04 QPN 0x0ce2 PSN 0x81fa7f RKey 0x1fff00 VAddr 0x007f7e8c890000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 2500.000000 != 3497.359000. CPU Frequency is not max.
 65536      5000             44490.32            44467.35            0.711478
---------------------------------------------------------------------------------------

If we go back to the first terminal on pod number one we should see matching results.

sh-5.1# ib_write_bw

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x04 QPN 0x0ce2 PSN 0x81fa7f RKey 0x1fff00 VAddr 0x007f7e8c890000
 remote address: LID 0x05 QPN 0x0ce2 PSN 0x5389f7 RKey 0x1fff00 VAddr 0x007f7368df3000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      5000             44490.32            44467.35            0.711478
---------------------------------------------------------------------------------------

We can now clean up the pods since testing is over and move onto the next test.
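The cleanup itself is just a delete of the pod manifests we generated. A minimal sketch follows; cleanup_pods is a hypothetical wrapper name, and the OC variable exists only so the command can be swapped out (for example with echo) for a dry run.

```shell
# OC defaults to the real client; set OC=echo to preview the commands instead.
OC="${OC:-oc}"

# cleanup_pods: delete every manifest passed as an argument.
# --ignore-not-found keeps the loop going if a pod was already removed.
cleanup_pods() {
  local f
  for f in "$@"; do
    "$OC" delete -f "$f" --ignore-not-found
  done
}

# Usage on the cluster (not executed here):
#   cleanup_pods rdma-eth-32-workload.yaml rdma-eth-33-workload.yaml
```

The same wrapper applies to the workload files from the other tests in this post.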

The Host Device RDMA Testing

This section will demonstrate how to configure host device RDMA for Nvidia Network Operator and then how to test per pod configuration.

Configure Nic Cluster Policy for Host Device

The operator should be running from previous steps. If a NicClusterPolicy exists we need to delete the existing one and generate a new hostdev NicClusterPolicy custom resource file.

$ cat <<EOF > network-hostdev-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.10-0.7.0.0-0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: v3.7.0
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "isRdma": true
            }
          }
        ]
      }
EOF

Next we can create the NicClusterPolicy custom resource on the cluster.

$ oc create -f network-hostdev-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created

We can validate the host device NicClusterPolicy by checking that the DOCA/MOFED and SR-IOV device plugin pods are running.

$ oc get pods -n nvidia-network-operator
NAME                                                          READY   STATUS    RESTARTS   AGE
mofed-rhcos4.16-696886fcb4-ds-9sgvd                           2/2     Running   0          2m37s
mofed-rhcos4.16-696886fcb4-ds-lkjd4                           2/2     Running   0          2m37s
nvidia-network-operator-controller-manager-68d547dbbd-qsdkf   1/1     Running   0          141m
sriov-device-plugin-6v2nz                                     1/1     Running   0          2m14s
sriov-device-plugin-hc4t8                                     1/1     Running   0          2m14s

We can also confirm that the resources show up in the oc describe node output.

$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A7
Capacity:
  cpu:                 128
  ephemeral-storage:   1561525616Ki
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              263596708Ki
  nvidia.com/hostdev:  2
  pods:                250
Allocatable:
  cpu:                 127500m
  ephemeral-storage:   1438028263499
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              262445732Ki
  nvidia.com/hostdev:  2
  pods:                250
--
Capacity:
  cpu:                 128
  ephemeral-storage:   1561525616Ki
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              263596704Ki
  nvidia.com/hostdev:  2
  pods:                250
Allocatable:
  cpu:                 127500m
  ephemeral-storage:   1438028263499
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              262445728Ki
  nvidia.com/hostdev:  2
  pods:                250

Now we need to create a HostDeviceNetwork custom resource file.

$ cat <<EOF > hostdev-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdev-net
spec:
  networkNamespace: "default"
  resourceName: "hostdev"
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.3.225/28",
      "exclude": [
        "192.168.3.229/30",
        "192.168.3.236/32"
      ]
    }
EOF
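The range and exclude entries in the ipam block are written with host bits set, so it is not obvious at a glance which addresses each CIDR spans. The sketch below is plain CIDR arithmetic in shell. The function names are hypothetical, and how whereabouts itself treats the host bits in these entries is not something this sketch verifies.

```shell
# ip_to_int / int_to_ip: convert between dotted quad and a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1
  echo "$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))"
}
int_to_ip() {
  echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# cidr_bounds: print the first and last address covered by a CIDR string.
cidr_bounds() {
  local ip="${1%/*}" prefix="${1#*/}"
  local addr mask first last
  addr=$(ip_to_int "$ip")
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  first=$(( addr & mask ))
  last=$(( first | (~mask & 0xFFFFFFFF) ))
  echo "$(int_to_ip "$first") $(int_to_ip "$last")"
}

cidr_bounds 192.168.3.225/28   # prints 192.168.3.224 192.168.3.239
cidr_bounds 192.168.3.229/30   # prints 192.168.3.228 192.168.3.231
cidr_bounds 192.168.3.236/32   # prints 192.168.3.236 192.168.3.236
```

Read this way, the /28 covers .224 through .239, and the two exclude entries span .228 through .231 and .236.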

And then create the HostDeviceNetwork resource on the cluster.

$ oc create -f hostdev-network.yaml
hostdevicenetwork.mellanox.com/hostdev-net created

Let's validate our resources are showing up properly.

$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A8
Capacity:
  cpu:                 128
  ephemeral-storage:   1561525616Ki
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              263596708Ki
  nvidia.com/gpu:      2
  nvidia.com/hostdev:  2
  pods:                250
Allocatable:
  cpu:                 127500m
  ephemeral-storage:   1438028263499
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              262445732Ki
  nvidia.com/gpu:      2
  nvidia.com/hostdev:  2
  pods:                250
--
Capacity:
  cpu:                 128
  ephemeral-storage:   1561525616Ki
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              263596680Ki
  nvidia.com/gpu:      2
  nvidia.com/hostdev:  2
  pods:                250
Allocatable:
  cpu:                 127500m
  ephemeral-storage:   1438028263499
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              262445704Ki
  nvidia.com/gpu:      2
  nvidia.com/hostdev:  2
  pods:                250
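When the describe output gets long, it can be easier to filter for the one resource we care about. This is a small sketch with a hypothetical helper name, resource_values, that assumes the "name: value" layout of the oc describe node output above.

```shell
# resource_values: read `oc describe node` output on stdin and print the value
# of every line whose key matches the given resource name exactly.
resource_values() {
  awk -v res="$1" '$1 == res ":" { print $2 }'
}

# Demo on a fragment of the output above.
resource_values nvidia.com/hostdev <<'EOF'
Capacity:
  nvidia.com/gpu:      2
  nvidia.com/hostdev:  2
  pods:                250
Allocatable:
  nvidia.com/gpu:      2
  nvidia.com/hostdev:  2
  pods:                250
EOF
```

Each node contributes two values, one from Capacity and one from Allocatable, so two workers with the device plugin healthy should produce four lines.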

End of nic cluster policy for host device section.

Create Workload Pods and Perf Test Host Device

Now we need to create a workload pod that contains all the tooling for our host device testing. We can generate a custom pod file for each node as follows to meet that requirement.

$ cat << EOF > hostdev-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostdev-32-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: hostdev-net
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: hostdev-32-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        nvidia.com/hostdev: 1
      requests:
        nvidia.com/gpu: 1
        nvidia.com/hostdev: 1
EOF

$ cat <<EOF > hostdev-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostdev-33-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: hostdev-net
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: hostdev-33-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        nvidia.com/hostdev: 1
      requests:
        nvidia.com/gpu: 1
        nvidia.com/hostdev: 1
EOF

Then we can create the pods on the cluster.

$ oc create -f hostdev-32-workload.yaml
pod/hostdev-32-workload created
$ oc create -f hostdev-33-workload.yaml
pod/hostdev-33-workload created

Let's validate the pods are running.

$ oc get pods -n default
NAME                  READY   STATUS    RESTARTS   AGE
hostdev-32-workload   1/1     Running   0          73s
hostdev-33-workload   1/1     Running   0          12s

First let's get the IP address of the first pod.

$ oc get pod hostdev-32-workload -o yaml | grep -E 'default/hostdev-net' -A3
        "name": "default/hostdev-net",
        "interface": "net1",
        "ips": [
            "192.168.3.225"

Now rsh into the first pod and run the ib_write_bw command and leave that terminal open.

$ oc rsh -n default hostdev-32-workload
sh-5.1# ib_write_bw

************************************
* Waiting for client to connect... *
************************************

Then open another terminal and rsh to the second pod and run ib_write_bw 192.168.3.225.

$ oc rsh -n default hostdev-33-workload
sh-5.1# ib_write_bw 192.168.3.225
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x05 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007fe688c97000
 remote address: LID 0x04 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007f1f0249d000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 2500.000000 != 3498.323000. CPU Frequency is not max.
 65536      5000             44351.41            44328.98            0.709264
---------------------------------------------------------------------------------------

If we go back to the first terminal on pod number one we should see matching results.

sh-5.1# ib_write_bw

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x04 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007f1f0249d000
 remote address: LID 0x05 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007fe688c97000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      5000             44351.41            44328.98            0.709264
---------------------------------------------------------------------------------------

We can now clean up the pods since testing is over and move onto the next test.

The SRIOV Legacy Mode RDMA Testing

This deployment mode supports SR-IOV in legacy mode.

Configure Nic Cluster Policy for SRIOV Legacy

First we need to create a NicClusterPolicy, which for SRIOV legacy mode is fairly generic. Generate the custom resource file below. If a NicClusterPolicy already exists, please remove it first.

$ cat <<EOF > network-sriovleg-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.10-0.7.0.0-0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
EOF

Now let's create the policy on the cluster.

$ oc create -f network-sriovleg-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created

Before we continue we can validate the pods are up.

$ oc get pods -n nvidia-network-operator
NAME                                                          READY   STATUS    RESTARTS      AGE
mofed-rhcos4.16-696886fcb4-ds-4mb42                           2/2     Running   0             40s
mofed-rhcos4.16-696886fcb4-ds-8knwq                           2/2     Running   0             40s
nvidia-network-operator-controller-manager-68d547dbbd-qsdkf   1/1     Running   13 (4d ago)   4d21h

Now we need to create a SriovNetworkNodePolicy which will generate the VFs for the device we want to operate in SRIOV legacy mode. Generate the custom resource file below.

$ cat <<EOF > sriov-network-node-policy.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-legacy-policy
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nicSelector:
    vendor: "15b3"
    pfNames: ["ens8f0np0#0-7"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriovlegacy
EOF
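The pfNames entry carries a VF index range after the # sign, and in this policy that range lines up with numVfs. A quick sanity check of the two values can be sketched as below. vf_range_count is a hypothetical helper, and it only handles the "name#first-last" form used here.

```shell
# vf_range_count: given a pfNames selector such as "ens8f0np0#0-7", print how
# many VF indexes the range covers. Hypothetical helper for illustration.
vf_range_count() {
  local range="${1#*#}"            # strip the PF name and the '#'
  local first="${range%-*}"        # index before the dash
  local last="${range#*-}"         # index after the dash
  echo "$(( last - first + 1 ))"
}

vf_range_count "ens8f0np0#0-7"   # prints 8
```

Here the result matches numVfs: 8, so every VF created on ens8f0np0 is selected into the sriovlegacy resource.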

Next we can create the custom resource on the cluster. As a note make sure SR-IOV Global Enable is enabled as per Red Hat Knowledge Article.

$ oc create -f sriov-network-node-policy.yaml
sriovnetworknodepolicy.sriovnetwork.openshift.io/sriov-legacy-policy created

The nodes should go through a reboot process. Each one will have scheduling disabled and then reboot so the configuration can take effect.

$ oc get nodes
NAME                                       STATUS                        ROLES                         AGE     VERSION
edge-19.edge.lab.eng.rdu2.redhat.com       Ready                         control-plane,master,worker   5d      v1.29.8+632b078
nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com   Ready                         worker                        4d22h   v1.29.8+632b078
nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com   NotReady,SchedulingDisabled   worker                        4d22h   v1.29.8+632b078
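Rather than re-running oc get nodes by hand, the wait can be scripted. The following is a sketch under assumptions: the function names are hypothetical, the OC variable is there only so the client can be stubbed, and the syncStatus field on SriovNetworkNodeState is what I expect the SR-IOV operator to report, so confirm it on your cluster version.

```shell
# OC defaults to the real client; set OC=echo to preview the commands instead.
OC="${OC:-oc}"

# wait_nodes_ready: block until every node reports Ready (default timeout 30m).
wait_nodes_ready() {
  "$OC" wait --for=condition=Ready node --all --timeout="${1:-30m}"
}

# sriov_sync_status: list each node state object with its sync status.
# ASSUMPTION: .status.syncStatus is populated by the SR-IOV Network Operator.
sriov_sync_status() {
  "$OC" get sriovnetworknodestates -n openshift-sriov-network-operator \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.syncStatus}{"\n"}{end}'
}

# Usage on the cluster (not executed here):
#   wait_nodes_ready 45m && sriov_sync_status
```

A node can report Ready briefly before it is drained, so checking the sync status after the wait gives a better signal that the VF configuration has actually been applied.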

Once the nodes have rebooted we can validate that the VF interfaces were created by opening up a debug pod on each node.

$ oc debug node/nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-33nvidiaengrdu2dcredhatcom-debug-cqfjz ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.12
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ip link show | grep ens8
26: ens8f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
42: ens8f0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
43: ens8f0v1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
44: ens8f0v2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
45: ens8f0v3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
46: ens8f0v4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
47: ens8f0v5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
48: ens8f0v6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
49: ens8f0v7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
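Counting the VF lines by eye is fine for eight interfaces, but a small filter makes the check repeatable. The sketch below uses a hypothetical helper, count_vfs, and assumes the VF naming seen above, where the VFs are named after the PF prefix followed by v and an index.

```shell
# count_vfs: read `ip link show` output on stdin and count interfaces named
# <prefix>v<N>, e.g. ens8f0v0 ... ens8f0v7 for the prefix ens8f0.
count_vfs() {
  awk -v pf="$1" -F': ' '$2 ~ ("^" pf "v[0-9]+$") { n++ } END { print n + 0 }'
}

# Demo on a fragment of the output above: one PF line and two VF lines.
count_vfs ens8f0 <<'EOF'
26: ens8f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
42: ens8f0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
43: ens8f0v1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
EOF
```

On the node itself, ip link show | count_vfs ens8f0 should report 8 for the policy above. The kernel also exposes the configured count at /sys/class/net/ens8f0np0/device/sriov_numvfs.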

We can repeat the same steps above on the second node if we want to feel complete.

We can also confirm via the node capabilities output.

$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A8
Capacity:
  cpu:                       128
  ephemeral-storage:         1561525616Ki
  hugepages-1Gi:             0
  hugepages-2Mi:             0
  memory:                    263596692Ki
  nvidia.com/gpu:            2
  nvidia.com/hostdev:        0
  openshift.io/sriovlegacy:  8
--
Allocatable:
  cpu:                       127500m
  ephemeral-storage:         1438028263499
  hugepages-1Gi:             0
  hugepages-2Mi:             0
  memory:                    262445716Ki
  nvidia.com/gpu:            2
  nvidia.com/hostdev:        0
  openshift.io/sriovlegacy:  8
--
Capacity:
  cpu:                       128
  ephemeral-storage:         1561525616Ki
  hugepages-1Gi:             0
  hugepages-2Mi:             0
  memory:                    263596688Ki
  nvidia.com/gpu:            2
  nvidia.com/hostdev:        0
  openshift.io/sriovlegacy:  8
--
Allocatable:
  cpu:                       127500m
  ephemeral-storage:         1438028263499
  hugepages-1Gi:             0
  hugepages-2Mi:             0
  memory:                    262445712Ki
  nvidia.com/gpu:            2
  nvidia.com/hostdev:        0
  openshift.io/sriovlegacy:  8

Now that the VFs for SRIOV legacy mode are in place we can generate the SriovNetwork custom resource file.

$ cat <<EOF > sriov-network.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-network
  namespace: openshift-sriov-network-operator
spec:
  vlan: 0
  networkNamespace: "default"
  resourceName: "sriovlegacy"
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.3.225/28",
      "exclude": [
        "192.168.3.229/30",
        "192.168.3.236/32"
      ]
    }
EOF

Then we can create the custom resource on the cluster.

$ oc create -f sriov-network.yaml
sriovnetwork.sriovnetwork.openshift.io/sriov-network created

End of nic cluster policy for SRIOV legacy section.

Create Workload and Perf Test SRIOV Legacy

Now we need to create a workload pod that contains all the tooling for our SRIOV legacy testing. We can generate a custom pod file for each node as follows to meet that requirement.

$ cat << EOF > sriovlegacy-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: sriovlegacy-32-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: sriovlegacy-32-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        openshift.io/sriovlegacy: 1
      requests:
        nvidia.com/gpu: 1
        openshift.io/sriovlegacy: 1
EOF

$ cat <<EOF > sriovlegacy-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: sriovlegacy-33-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: sriovlegacy-33-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        openshift.io/sriovlegacy: 1
      requests:
        nvidia.com/gpu: 1
        openshift.io/sriovlegacy: 1
EOF

Then we can create the pods on the cluster.

$ oc create -f sriovlegacy-32-workload.yaml
pod/sriovlegacy-32-workload created
$ oc create -f sriovlegacy-33-workload.yaml
pod/sriovlegacy-33-workload created

Let's validate the pods are running.

$ oc get pods -n default
NAME                      READY   STATUS    RESTARTS   AGE
sriovlegacy-32-workload   1/1     Running   0          73s
sriovlegacy-33-workload   1/1     Running   0          12s

First let's get the IP address of the first pod.

$ oc get pod sriovlegacy-32-workload -o yaml | grep -E 'default/sriov-network' -A3
        "name": "default/sriov-network",
        "interface": "net1",
        "ips": [
            "192.168.3.225"

Now rsh into the first pod, run the ib_write_bw command and leave that terminal open, just as in the previous tests. Then open another terminal, rsh to the second pod and run ib_write_bw 192.168.3.225.

$ oc rsh sriovlegacy-33-workload
sh-5.1# ib_write_bw 192.168.3.225
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x05 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f397ace8000
 remote address: LID 0x04 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f0eeefac000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 2500.000000 != 3491.228000. CPU Frequency is not max.
 65536      5000             44414.44            44386.66            0.710187
---------------------------------------------------------------------------------------

If we go back to the first terminal on pod number one we should see matching results.

sh-5.1# ib_write_bw

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x04 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f0eeefac000
 remote address: LID 0x05 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f397ace8000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      5000             44414.44            44386.66            0.710187
---------------------------------------------------------------------------------------

We can now clean up the pods since testing is over.

Hopefully this blog was detailed enough to provide an understanding of RDMA testing with NVIDIA and OpenShift. It provided a brief example of how to configure the different RDMA methods: Shared, Hostdev and SRIOV Legacy.