Why
There are use cases where workloads running in virtual machines on OpenShift worker nodes need network devices in passthrough mode. This is not a problem when the OpenShift worker node cluster interface is on a different network card type than the cards that need to be passed to the virtual machine. It does become an issue on systems outfitted with all the same network interface types, because the device id for every network card is identical. That also means the traditional method of enabling passthrough cannot be used. The traditional method involves blacklisting the network kernel driver from loading and then configuring the device ids to attach to the vfio-pci driver. If we implemented that on a system where all the network cards are the same, the node would come up without any networking after rebooting to apply the MachineConfig and would show as NotReady. The rest of this document demonstrates a practical alternative to this problem.
Manually Configure
Kernel driver unbinding and binding was introduced back in kernel 2.6.13 in 2005, so it is a technology that has been around for quite some time. This is the exact feature we will use to bind only some of our network cards to vfio-pci. To begin, let's take a look at our network interfaces via lspci from a debug pod on an OpenShift node, filtering by the device id 15b3:a2dc. We can see here that the node has 4 network card ports.
sh-5.2# lspci -nn |grep 15b3:a2dc
0000:01:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0000:01:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0002:01:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0002:01:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
Now let's examine the physical interface names for these 4 ports.
sh-5.2# grep PCI_SLOT_NAME /sys/class/net/*/device/uevent
/sys/class/net/enP2s2f0np0/device/uevent:PCI_SLOT_NAME=0002:01:00.0
/sys/class/net/enP2s2f1np1/device/uevent:PCI_SLOT_NAME=0002:01:00.1
/sys/class/net/enp1s0f0np0/device/uevent:PCI_SLOT_NAME=0000:01:00.0
/sys/class/net/enp1s0f1np1/device/uevent:PCI_SLOT_NAME=0000:01:00.1
Now we have to see which one is already in use by OpenShift so we do not inadvertently work with the wrong card. This will always be the interface attached to the OVS system bridge, which we can find with ovs-vsctl.
sh-5.2# ovs-vsctl --no-heading --format=table --columns=name,type find Interface type=system| awk '{print $1}'
enp1s0f0np0
We can see enp1s0f0np0, which correlates to the 0000:01:00.0 card. So we will focus on 0002:01:00.0 and 0002:01:00.1.
Now that we have determined which cards we can use, we will begin by unbinding them from their current driver, mlx5_core.
echo -n "0002:01:00.0" > /sys/bus/pci/drivers/mlx5_core/unbind
echo -n "0002:01:00.1" > /sys/bus/pci/drivers/mlx5_core/unbind
At this point, if we look at the lspci output for these two devices, we see they no longer have a "Kernel driver in use" line; only the "Kernel modules" line remains to show which driver could claim them.
sh-5.2# lspci -k -s 0002:01:00.0
0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
Subsystem: Mellanox Technologies Device 0009
Kernel modules: mlx5_core
sh-5.2# lspci -k -s 0002:01:00.1
0002:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
Subsystem: Mellanox Technologies Device 0009
Kernel modules: mlx5_core
We are now ready to bind them to the vfio-pci driver, but first we may need to load that driver.
modprobe vfio-pci
We can validate that the vfio-pci driver is loaded with lsmod.
sh-5.2# lsmod|grep vfio
vfio_pci 16384 0
vfio_pci_core 90112 1 vfio_pci
vfio_iommu_type1 49152 0
vfio 73728 3 vfio_pci_core,vfio_iommu_type1,vfio_pci
iommufd 131072 1 vfio
Now that we have unbound the two devices from their driver, let's override the kernel driver they should use with vfio-pci.
sh-5.2# echo vfio-pci > /sys/bus/pci/devices/0002:01:00.0/driver_override
sh-5.2# echo vfio-pci > /sys/bus/pci/devices/0002:01:00.1/driver_override
With the driver override in place we can now bind our two devices to the vfio-pci driver.
sh-5.2# echo "0002:01:00.0" > /sys/bus/pci/drivers/vfio-pci/bind
sh-5.2# echo "0002:01:00.1" > /sys/bus/pci/drivers/vfio-pci/bind
And finally we can validate that those devices are now using the vfio-pci driver.
sh-5.2# lspci -k -s 0002:01:00.0
0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
Subsystem: Mellanox Technologies Device 0009
Kernel driver in use: vfio-pci
Kernel modules: mlx5_core
sh-5.2# lspci -k -s 0002:01:00.1
0002:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
Subsystem: Mellanox Technologies Device 0009
Kernel driver in use: vfio-pci
Kernel modules: mlx5_core
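The manual unbind, override, and bind steps above can be wrapped in one small helper. This is only a sketch: it writes through the device's own driver symlink rather than naming mlx5_core explicitly, so it works whatever driver currently owns the device, and the SYSFS variable exists purely so the logic can be exercised off-node. On a real OpenShift node it defaults to /sys and must be run as root.

```shell
#!/bin/bash
# Sketch: rebind a single PCI device from its current driver to vfio-pci.
# Assumes vfio-pci has already been loaded with modprobe.
rebind_to_vfio() {
  local busid="$1"
  local sysfs="${SYSFS:-/sys}"
  # Unbind from whatever driver currently owns the device, if any
  if [ -e "${sysfs}/bus/pci/devices/${busid}/driver/unbind" ]; then
    echo -n "${busid}" > "${sysfs}/bus/pci/devices/${busid}/driver/unbind"
  fi
  # Tell the kernel which driver should claim this device next
  echo vfio-pci > "${sysfs}/bus/pci/devices/${busid}/driver_override"
  # Bind the device to vfio-pci
  echo -n "${busid}" > "${sysfs}/bus/pci/drivers/vfio-pci/bind"
}

# Example on a real node (requires root):
# rebind_to_vfio "0002:01:00.0"
```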
Automatically Configure
While one can manually configure vfio-pci passthrough as we did above, this is not scalable in a large cluster, and driver bindings made through sysfs do not persist across reboots or OpenShift upgrades, so we need something more automatic. The answer is twofold: first a script that automates the process above, and then a mechanism for running that script on OpenShift nodes.
For the automation script we can use the example code in this repository here. The script identifies all the interfaces of a given device type and then determines which ones can be used as passthrough devices. A device is prohibited from being used for passthrough if it has an OVS bridge associated with it. Once the script has identified the eligible list, it unbinds the kernel driver in use on each device, sets the driver override, and binds the device to vfio-pci so it is available for passthrough.
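The heart of the eligibility check is a PCI address comparison: a port is skipped when it shares the same domain:bus:device prefix as the interface backing the OVS bridge, since both ports of one physical card share that prefix. A simplified sketch of that comparison (function name is illustrative, not taken from the script):

```shell
#!/bin/bash
# Sketch of the eligibility test: a port is not eligible for passthrough
# when its PCI address shares the domain:bus:device prefix of the NIC
# that backs the OVS system bridge.
is_passthrough_eligible() {
  local nic_busid="$1"  # candidate NIC, e.g. 0002:01:00.0
  local br_busid="$2"   # NIC behind the OVS bridge, e.g. 0000:01:00.0
  # The first 11 characters cover "dddd:bb:dd." so both ports of the
  # same physical card compare equal
  if [ "${nic_busid:0:11}" = "${br_busid:0:11}" ]; then
    echo "No"
  else
    echo "Yes"
  fi
}

is_passthrough_eligible "0002:01:00.0" "0000:01:00.0"  # prints: Yes
is_passthrough_eligible "0000:01:00.1" "0000:01:00.0"  # prints: No
```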
Here is a manual run on the system we tested with.
sh-5.2# ./passthrough-some-nics.sh -n 15b3:a2dc
NIC Name NIC Bus ID Kernel Driver OCP BR NIC PassThru Eligible
====================================================================================================
enp1s0f0np0 0000:01:00.0 mlx5_core Yes No
enp1s0f1np1 0000:01:00.1 mlx5_core Yes No
enP2s2f0np0 0002:01:00.0 mlx5_core No Yes
enP2s2f1np1 0002:01:00.1 mlx5_core No Yes
Loading vfio-pci......Done!
Unbinding device 0002:01:00.0 from mlx5_core kernel driver...
Applying driver override to device 0002:01:00.0...
Binding device 0002:01:00.0 to vfio-pci...
Device kernel driver validation...
0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
Subsystem: Mellanox Technologies Device 0009
Kernel driver in use: vfio-pci
Kernel modules: mlx5_core
Unbinding device 0002:01:00.1 from mlx5_core kernel driver...
Applying driver override to device 0002:01:00.1...
Binding device 0002:01:00.1 to vfio-pci...
Device kernel driver validation...
0002:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
Subsystem: Mellanox Technologies Device 0009
Kernel driver in use: vfio-pci
Kernel modules: mlx5_core
Notice the script changed the kernel driver in use for the two devices. If we run the script again, we should see that no changes are made because there are no remaining eligible passthrough devices.
sh-5.2# ./passthrough-some-nics.sh -n 15b3:a2dc
NIC Name NIC Bus ID Kernel Driver OCP BR NIC PassThru Eligible
====================================================================================================
enp1s0f0np0 0000:01:00.0 mlx5_core Yes No
enp1s0f1np1 0000:01:00.1 mlx5_core Yes No
NA 0002:01:00.0 vfio-pci No Complete
NA 0002:01:00.1 vfio-pci No Complete
vfio_pci 16384 0 - Live 0xffffb968aee88000
Now that we have seen the script work, let's wire it into OpenShift. First we base64 encode the script, so it can be embedded in a MachineConfig, by piping it through the base64 command.
$ BASE64_SCRIPT=$(cat passthrough-some-nics.sh | base64 -w 0)
$ echo $BASE64_SCRIPT
IyEvYmluL2Jhc2gKIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjCiMgVGhpcyBzY3JpcHQgcGFzc2VzIHRocm91Z2ggc29tZSBvZiB0aGUgTklDcyB3aGVuIGFsbCB0aGUgTklDcyBhcmUgdGhlIHNhbWUgZGV2aWNlIHR5cGUgICAgICAgICAgICAgICAgICAgIwojIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMKCiMgSG93IHRvIHVzZSB0aGUgc2NyaXB0IGlmIHVzZXIgZG9lcyBub3Qga25vdyBob3cKaG93dG8oKXsKICBlY2hvICJVc2FnZTogcGFzc3Rocm91Z2gtc29tZS1uaWNzLnNoIC1uIDxuaWMtZGV2aWNlLWlkPiIKICBlY2hvICJFeGFtcGxlIFNpbmdsZSBEZXZpY2UgSUQ6IHBhc3N0aHJvdWdoLXNvbWUtbmljcy5zaCAtbiAxNWIzOmEyZGMiCiAgZWNobyAiRXhhbXBsZSBNdWx0aSBEZXZpY2UgSUQ6IHBhc3N0aHJvdWdoLXNvbWUtbmljcy5zaCAtbiAxZGQ4OjEwMDJ8MTViMzoxMDIxIgp9CgojIEdldG9wdHMgc2V0dXAgZm9yIHZhcmlhYmxlcyB0byBwYXNzIGZyb20gb3B0aW9ucwp3aGlsZSBnZXRvcHRzIGc6bjp1OnI6aCBvcHRpb24KZG8KY2FzZSAiJHtvcHRpb259IgppbgpuKSBuaWNpZD0ke09QVEFSR307OwpoKSBob3d0bzsgZXhpdCAwOzsKXD8pIGhvd3RvOyBleGl0IDE7Owplc2FjCmRvbmUKCiMgTWFrZSBzdXJlIHRoZSB2YXJpYWJsZXMgYXJlIHBvcHVsYXRlZCB3aXRoIHZhbHVlcyBvdGhlcndpc2Ugc2hvdyBob3d0bwppZiAoWyAteiAiJG5pY2lkIiBdKSB0aGVuCiAgIGhvd3RvCiAgIGV4aXQgMQpmaQoKIyBTZXQgdGFibGUgaGVhZGVyIGZvcm1hdCAKZGl2aWRlcj09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09CmRpdmlkZXI9JGRpdmlkZXIkZGl2aWRlciRkaXZpZGVyCmhlYWRlcj0iXG4gJS0xMnMgJS0xNnMgJS0xNHMgJS0xNHMgJS0xNHNcbiIKZm9ybWF0PSIgJS0xNHMgJS0xNHMgJS0xNHMgJS0xNHMgJS0xNHNcbiIKd2lkdGg9MTAwCgojIFNsdXJwIGluIG5pYyBkZXZpY2UgdHlwZSBpZHMgZnJvbSBsc3BjaQpuaWNpZD1gZWNobyAkbmljaWQgfHNlZCAncy8sL1x8L2cnYAptYXBmaWxlIC10IG15X25pY3MgPCA8KGxzcGNpIC1ufGdyZXAgLUUgJG5pY2lkKQoKIyBQcmludCBvdXQgaGVhZGVycyAKcHJpbnRmICIkaGVhZGVyIiAiTklDIE5hbWUiICJOSUMgQnVzIElEIiAiS2VybmVsIERyaXZlciIgIk9DUCBCUiBOSUMiICJQYXNzVGhydSBFbGlnaWJsZSIKcHJpbnRmICIlJHdpZHRoLiR7d2lkdGh9c1xuIiAiJGRpdmlkZXIiCgojIEdyYWIgaW50ZXJmYWNlIGFzc29jaWF0ZWQgdG8gb3ZzLXN5c3RlbSBicmlkZ2UuICBCb25kcyBkbyBub3Qgd29yayBoZXJlIHlldApicnBoeWludD1gb3ZzLXZz
Y3RsIC0tbm8taGVhZGluZyAtLWZvcm1hdD10YWJsZSAtLWNvbHVtbnM9bmFtZSx0eXBlIGZpbmQgSW50ZXJmYWNlIHR5cGU9c3lzdGVtfCBhd2sgJ3twcmludCAkMX0nYApicnBoeWJ1cz1gZ3JlcCBQQ0lfU0xPVF9OQU1FIC9zeXMvY2xhc3MvbmV0LyovZGV2aWNlL3VldmVudHxncmVwICRicnBoeWludHwgYXdrIC1GICI9IiAne3ByaW50ICQyfSdgCgojIERlY2xhcmUgZW1wdHkgYXJyYXkgdG8gc3RvcmUgbmljIGRldGFpbHMgb24gdGhvc2UgdGhhdCBjYW4gYmUgdW5ib3VuZApkZWNsYXJlIC1hIHBhc3N0aHJvdWdoPSgpCgpmb3IgKCggbmljPTA7IG5pYzwkeyNteV9uaWNzW0BdfTsgbmljKysgKSkKZG8KICAgbmljYnVzaWQ9YGVjaG8gJHtteV9uaWNzWyRuaWNdfSB8IGF3ayAne3ByaW50ICQxfSdgCiAgIG5pY2tkcnY9YGxzcGNpIC1rbiAtcyAkbmljYnVzaWQgfCBncmVwICJLZXJuZWwgZHJpdmVyIGluIHVzZToifCBhd2sgLUYgIjogIiAne3ByaW50ICQyfSdgCiAgIG5pY25hbWU9YGdyZXAgUENJX1NMT1RfTkFNRSAvc3lzL2NsYXNzL25ldC8qL2RldmljZS91ZXZlbnR8Z3JlcCAkbmljYnVzaWR8IGF3ayAtRiAnLycgJ3twcmludCAkNX0nYAogICBpZiBbICIkbmljbmFtZSIgPSAiIiBdOyB0aGVuCiAgICAgIG5pY25hbWU9Ik5BIgogICBmaQoKICAgIyBPYnRhaW4gZmlyc3QgMTEgY2hhcmFjdGVycyBvZiBlYWNoIHZhcmlhYmxlIHN0cmluZyB0byB1c2UgZm9yIGNvbXBhcmUKICAgc3VibmljYnVzaWQ9IiR7bmljYnVzaWQ6MDoxMX0iCiAgIHN1YmJycGh5YnVzPSIke2JycGh5YnVzOjA6MTF9IgoKICAgIyBDb21wYXJlIHRoZSBzdWJzdHJpbmdzCiAgIGlmIFtbICIkc3VibmljYnVzaWQiID09ICIkc3ViYnJwaHlidXMiIF1dOyB0aGVuCiAgICAgIHN5c25pYz0iWWVzIgogICAgICBwYXNzdGhydT0iTm8iCiAgICAgICMgRGlzcGxheSB0byBjb25zb2xlIHRoZSBkZXRhaWxzCiAgICAgIHByaW50ZiAiJGZvcm1hdCIgJG5pY25hbWUgJG5pY2J1c2lkICRuaWNrZHJ2ICRzeXNuaWMgJHBhc3N0aHJ1CiAgIGVsc2UKICAgICAgc3lzbmljPSJObyIKICAgICAgaWYgWyAiJG5pY2tkcnYiID0gInZmaW8tcGNpIiBdOyB0aGVuCiAgICAgICAgIHBhc3N0aHJ1PSJDb21wbGV0ZSIKICAgICAgZWxzZQogICAgICAgICBwYXNzdGhydT0iWWVzIgogICAgICAgICBwYXNzdGhyb3VnaCs9KCIkbmljYnVzaWR8JG5pY2tkcnYiKQogICAgICBmaQogICAgICAjIERpc3BsYXkgdG8gY29uc29sZSB0aGUgZGV0YWlscwogICAgICBwcmludGYgIiRmb3JtYXQiICRuaWNuYW1lICRuaWNidXNpZCAkbmlja2RydiAkc3lzbmljICRwYXNzdGhydQogICBmaQpkb25lCgppZiAhIGdyZXAgLUUgIl52ZmlvX3BjaSAiIC9wcm9jL21vZHVsZXM7IHRoZW4KICBlY2hvICIgIgogIGVjaG8gLW4gIkxvYWRpbmcgdmZpby1wY2kuLi4iCiAgbW9kcHJvYmUgdmZpby1wY2kKICBlY2hvICIuLi5Eb25lISIKICBlY2hvICIgIgpmaQoKCmZvciAoKCBwYXNzPTA7IHBhc3M8JHsjcGFz
c3Rocm91Z2hbQF19OyBwYXNzKysgKSkKZG8KICAgbmljYnVzaWQ9YGVjaG8gJHtwYXNzdGhyb3VnaFskcGFzc119IHwgYXdrIC1GICJ8IiAne3ByaW50ICQxfSdgCiAgIG5pY2tkcnY9YGVjaG8gJHtwYXNzdGhyb3VnaFskcGFzc119IHwgYXdrIC1GICJ8IiAne3ByaW50ICQyfSdgCiAgIGVjaG8gIiAiCiAgIGVjaG8gIlVuYmluZGluZyBkZXZpY2UgJG5pY2J1c2lkIGZyb20gJG5pY2tkcnYga2VybmVsIGRyaXZlci4uLiIKICAgZWNobyAtbiAiJG5pY2J1c2lkIiA+IC9zeXMvYnVzL3BjaS9kcml2ZXJzL21seDVfY29yZS91bmJpbmQKICAgZWNobyAiQXBwbHlpbmcgZHJpdmVyIG92ZXJyaWRlIHRvIGRldmljZSAkbmljYnVzaWQuLi4iCiAgIGVjaG8gdmZpby1wY2kgPiAvc3lzL2J1cy9wY2kvZGV2aWNlcy8kbmljYnVzaWQvZHJpdmVyX292ZXJyaWRlCiAgIGVjaG8gIkJpbmRpbmcgZGV2aWNlICRuaWNidXNpZCB0byB2ZmlvLXBjaS4uLiIKICAgZWNobyAiJG5pY2J1c2lkIiA+IC9zeXMvYnVzL3BjaS9kcml2ZXJzL3ZmaW8tcGNpL2JpbmQKICAgZWNobyAiRGV2aWNlIGtlcm5lbCBkcml2ZXIgdmFsaWRhdGlvbi4uLiIKICAgbHNwY2kgLWsgLXMgJG5pY2J1c2lkCmRvbmUKZXhpdCAwCg==
We will also set a device id variable that will be embedded in the MachineConfig as the argument for the script. Please note that if we want to use multiple device ids we pipe delimit them.
$ DEVICEID="15b3:a2dc" # Single device id
$ DEVICEID="1dd8:1002|15b3:1021" # Multiple device ids
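The pipe-delimited form works because the script matches device ids with grep -E, where multiple ids form a simple alternation. A quick illustration against canned lspci-style lines:

```shell
# Two of the three lines match the alternation "1dd8:1002|15b3:a2dc"
printf '0000:01:00.0 0200: 15b3:a2dc\n0003:02:00.0 0200: 1dd8:1002\n0004:03:00.0 0200: 8086:1521\n' \
  | grep -E "1dd8:1002|15b3:a2dc"
```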
We also have to set the length of time to wait for the system to come up; 120 seconds is a good rule of thumb.
$ SLP="120"
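Rather than sleeping blindly for the full interval, the wait could be implemented as a readiness probe with a timeout. This is a hypothetical sketch of what a -s style option might do; the probe command, function name, and 2-second poll interval are all illustrative (on a real node the probe would be something like ovs-vsctl show):

```shell
#!/bin/bash
# Hypothetical sketch: wait for the network stack to settle, up to a
# timeout, before probing the interfaces.
wait_for_network() {
  local timeout="${1:-120}"        # seconds to wait, 120 by default
  local probe="${2:-ovs-vsctl show}"  # readiness probe command
  local waited=0
  while ! $probe >/dev/null 2>&1; do
    sleep 2
    waited=$((waited + 2))
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out after ${timeout}s waiting for: $probe" >&2
      return 1
    fi
  done
  return 0
}
```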
Then we have to create a MachineConfig that places the base64 encoded script on the system and establishes a systemd service to run the script every time the node boots.
$ cat > passthrough-for-some-machineconfig.yaml << EOF
kind: MachineConfig
apiVersion: machineconfiguration.openshift.io/v1
metadata:
  name: passthrough-for-some-systemd-service
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
        - name: passthrough-for-some.service
          enabled: true
          contents: |
            [Unit]
            Description=Identifies and enables passthrough on select network interfaces
            After=NetworkManager-wait-online.service openvswitch.service
            Wants=NetworkManager-wait-online.service openvswitch.service
            [Service]
            RemainAfterExit=yes
            ExecStart=/etc/scripts/passthrough-some-nics.sh -n $DEVICEID -s $SLP
            Type=oneshot
            [Install]
            WantedBy=multi-user.target
    storage:
      files:
        - filesystem: root
          path: "/etc/scripts/passthrough-some-nics.sh"
          contents:
            source: data:text/plain;charset=utf-8;base64,$BASE64_SCRIPT
            verification: {}
          mode: 0755
          overwrite: true
EOF
Now let's create the MachineConfig on the cluster.
$ oc create -f passthrough-for-some-machineconfig.yaml
machineconfig.machineconfiguration.openshift.io/passthrough-for-some-systemd-service created
We need to wait for the node to reboot. Once oc get mcp responds and confirms the node is updated, we can start to validate.
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-c88d4164a5bd26edb3d4025d24a5d2f8 True False False 1 1 1 0 6d7h
worker rendered-worker-9890b2fbe760e8e731e68bf217b87278 True False False 0 0 0 0 6d7h
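Instead of polling oc get mcp by hand, the rollout wait can be scripted. A sketch assuming a logged-in oc client; Updated is the standard MachineConfigPool condition that turns True once every node in the pool runs the new rendered config, and the helper name and 30 minute timeout are just examples:

```shell
# Sketch: block until the given MachineConfigPool finishes rolling out.
wait_for_mcp() {
  local pool="${1:-master}"
  oc wait "machineconfigpool/${pool}" --for=condition=Updated --timeout=30m
}

# wait_for_mcp master
```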
Let's check the status of the service on the node. We can see from the output below that it has already identified the interfaces that can be made passthrough.
# systemctl status passthrough-for-some.service
● passthrough-for-some.service - Identifies and enables passthrough on select network interfaces
Loaded: loaded (/etc/systemd/system/passthrough-for-some.service; enabled; preset: disabled)
Active: activating (start) since Thu 2026-02-19 22:27:01 UTC; 5min ago
Job: 408
Invocation: 29eaf89183be4424a9f2fb4a2bd249a4
Main PID: 4282 (passthrough-som)
Tasks: 1 (limit: 3084134)
Memory: 1.5M (peak: 10.8M)
CPU: 213ms
CGroup: /system.slice/passthrough-for-some.service
└─4282 /bin/bash /etc/scripts/passthrough-some-nics.sh -n 15b3:a2dc
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: ====================================================================================================
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: enp1s0f0np0 0000:01:00.0 mlx5_core Yes No
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: enp1s0f1np1 0000:01:00.1 mlx5_core Yes No
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: enP2s2f0np0 0002:01:00.0 mlx5_core No Yes
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: enP2s2f1np1 0002:01:00.1 mlx5_core No Yes
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]:
Feb 19 22:32:02 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: Loading vfio-pci......Done!
Feb 19 22:32:02 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]:
Feb 19 22:32:02 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]:
Feb 19 22:32:02 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: Unbinding device 0002:01:00.0 from mlx5_core kernel driver...
Let's look at the lspci output for the devices we saw in the logs. We can see the first two interfaces stayed bound to mlx5_core because those ports are on the card associated with the OVS bridge. The last two ports were unbound from mlx5_core and bound to vfio-pci to enable passthrough.
# lspci -k -s 0000:01:00.0
0000:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
Subsystem: Mellanox Technologies Device 0009
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
# lspci -k -s 0000:01:00.1
0000:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
Subsystem: Mellanox Technologies Device 0009
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
# lspci -k -s 0002:01:00.0
0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
Subsystem: Mellanox Technologies Device 0009
Kernel driver in use: vfio-pci
Kernel modules: mlx5_core
# lspci -k -s 0002:01:00.1
0002:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
Subsystem: Mellanox Technologies Device 0009
Kernel driver in use: vfio-pci
Kernel modules: mlx5_core
One final thing we can do is run the script manually on the node again to also confirm our findings.
# /etc/scripts/passthrough-some-nics.sh -n 15b3:a2dc
NIC Name NIC Bus ID Kernel Driver OCP BR NIC PassThru Eligible
====================================================================================================
enp1s0f0np0 0000:01:00.0 mlx5_core Yes No
enp1s0f1np1 0000:01:00.1 mlx5_core Yes No
NA 0002:01:00.0 vfio-pci No Complete
NA 0002:01:00.1 vfio-pci No Complete
vfio_pci 16384 0 - Live 0xffffd5d69072b000
OpenShift Virtualization Passthrough
Now that our devices are set up for passthrough, we can configure OpenShift Virtualization to see them as an available resource. We will need to edit the HyperConverged custom resource on our OpenShift cluster and add the following section.
  permittedHostDevices:
    pciHostDevices:
    - pciDeviceSelector: 15b3:a2dc
      resourceName: nvidia.com/BF3_CX7
  resourceRequirements:
We can make the edit with the following command, inserting the section above right before the resourceRequirements section of the spec.
$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited
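If an interactive edit is inconvenient, the same section can be applied non-interactively with a merge patch. A sketch, with one caution: a merge patch replaces the whole permittedHostDevices object, so fold in any entries that already exist before applying it.

```shell
# Sketch: the permittedHostDevices addition as a JSON merge patch payload.
PATCH='{"spec":{"permittedHostDevices":{"pciHostDevices":[{"pciDeviceSelector":"15b3:a2dc","resourceName":"nvidia.com/BF3_CX7"}]}}}'
echo "$PATCH"
# Apply it on the cluster (requires cluster-admin):
# oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type=merge -p "$PATCH"
```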
Then we can confirm the resources are exposed by the OpenShift node using oc describe node.
$ oc describe node | grep -E 'Capacity:|Allocatable:' -A12
Capacity:
cpu: 72
devices.kubevirt.io/kvm: 1k
devices.kubevirt.io/tun: 1k
devices.kubevirt.io/vhost-net: 1k
ephemeral-storage: 936709572Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
hugepages-32Mi: 0
hugepages-64Ki: 0
memory: 493510268Ki
nvidia.com/BF3_CX7: 2
pods: 250
Allocatable:
cpu: 71500m
devices.kubevirt.io/kvm: 1k
devices.kubevirt.io/tun: 1k
devices.kubevirt.io/vhost-net: 1k
ephemeral-storage: 862197798302
hugepages-1Gi: 0
hugepages-2Mi: 0
hugepages-32Mi: 0
hugepages-64Ki: 0
memory: 492359292Ki
nvidia.com/BF3_CX7: 2
pods: 250
Now when we launch a virtual machine in OpenShift we will want to include the following section in our virtual machine spec, nested under spec->domain->devices.
hostDevices:
- deviceName: nvidia.com/BF3_CX7
  name: hostDevices-turquoise-hornet-42
If all goes well, once we launch our virtual machine and it is running, we should be able to see the passthrough ethernet interface.
$ oc get vmi -n openshift-cnv
NAMESPACE NAME AGE PHASE IP NODENAME READY
openshift-cnv rhel9-red-locust-96 10m Running 10.128.0.49 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com True
$ virtctl console rhel9-red-locust-96 -n openshift-cnv
Successfully connected to rhel9-red-locust-96 console. The escape sequence is ^]
rhel9-red-locust-96 login: cloud-user
Password:
Last login: Fri Feb 20 08:08:53 on tty1
[cloud-user@rhel9-red-locust-96 ~]$ sudo bash
[root@rhel9-red-locust-96 cloud-user]# lspci -nn|grep Mellanox
0a:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
Hopefully this provides a useful example of enabling passthrough for a subset of devices on a server where all the devices are the same type but not all can be passed through, due to the need for base networking at the OS level.