Thursday, May 07, 2026

NVIDIA OVS-DOCA via On-Cluster Layer OpenShift

In a previous blog I wrote about using off-cluster layering to add NVIDIA's OVS-DOCA. In this blog I want to use on-cluster layering, which makes manageability even easier than the off-cluster approach. Before we begin, though, let's review what OVS-DOCA is all about.

What Is NVIDIA OVS-DOCA?

Open vSwitch (OVS) is a software-based network technology that enhances virtual machine (VM) communication within internal and external networks. Typically deployed in the hypervisor, OVS employs a software-based approach for packet switching, which can strain CPU resources, impacting system performance and network bandwidth utilization. Addressing this, NVIDIA's Accelerated Switching and Packet Processing (ASAP2) technology offloads OVS data-plane tasks to specialized hardware, like the embedded switch (eSwitch) within the NIC subsystem, while maintaining an unmodified OVS control-plane. This results in notably improved OVS performance without burdening the CPU.

NVIDIA's OVS-DOCA extends the traditional OVS-DPDK and OVS-Kernel data-path offload interfaces (DPIF), introducing OVS-DOCA as an additional DPIF implementation. OVS-DOCA, built upon NVIDIA's networking API, preserves the same interfaces as OVS-DPDK and OVS-Kernel while utilizing the DOCA Flow library with the additional OVS-DOCA DPIF. Unlike the other DPIFs (DPDK, Kernel), the OVS-DOCA DPIF exploits unique hardware offload mechanisms and application techniques, maximizing performance and features for NVIDIA NICs and DPUs. This mode is especially efficient due to its architecture and DOCA library integration, enhancing e-switch configuration and accelerating hardware offloads beyond what the other modes can achieve.


Disclaimer: The following workflow of replacing OVS with OVS-DOCA in OpenShift is not currently a supported exercise. While on-cluster layering itself is supported, the process of switching out OVS with OVS-DOCA is purely experimental.

Workflow

The following on-cluster experiment of layering NVIDIA OVS-DOCA was done on a single node OpenShift 4.20.15 environment on an x86 architecture. We assume in this document that the following has already been configured as it's outside the scope of this document:

  • Hugepages have been configured by a machine configuration
  • The Logical Volume Manager Storage Operator has been installed and a LVM Cluster resource has been created
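For reference, hugepages are typically configured through a MachineConfig that sets kernel arguments. Below is a minimal sketch of such a resource; the page size and count are placeholders and must be adjusted to what your environment actually needs.

```yaml
# Hypothetical hugepages MachineConfig (page size and count are illustrative)
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-master-hugepages
spec:
  kernelArguments:
  - default_hugepagesz=1G
  - hugepagesz=1G
  - hugepages=32
```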

The workflow is broken down into five sections which cover environment configuration, creating the layer and then validating that it all works.

  • Configure OpenShift Internal Registry
  • Configure ConfigMap
  • Generate MachineOSConfig Custom Resource
  • Create MachineOSConfig Layer
  • Validate MachineOSConfig Layer

Configure OpenShift Internal Registry

The first thing we need to do is configure the OpenShift Internal Registry, which is disabled by default. We will start by creating a persistent volume claim that will be consumed by the image registry. We will need to generate the following custom resource file.

$ cat <<EOF > imageregistry-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-claim
  namespace: openshift-image-registry
  annotations:
    imageregistry.openshift.io: "true"
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 100Gi
  storageClassName: lvms-vg1
EOF

Once we have the file created we can create it on the cluster.

$ oc create -f imageregistry-pvc.yaml
persistentvolumeclaim/registry-claim created

The persistent volume claim will initially show a Pending status, which is normal at this point.

$ oc get pvc -n openshift-image-registry
NAME             STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
registry-claim   Pending                                      lvms-vg1       <unset>                 22s

Next we will enable route creation for the registry and this could take a few minutes.

$ oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
config.imageregistry.operator.openshift.io/cluster patched

We can proceed to allow the image registry operator to use the PVC we created above.

$ oc patch configs.imageregistry.operator.openshift.io/cluster --type=merge --patch '{"spec": {"storage":{"pvc":{"claim":"registry-claim"}}}}'
config.imageregistry.operator.openshift.io/cluster patched

Then we can also set the rollout strategy. Since this is a single node cluster a replica count of 1 will do.

$ oc patch config.imageregistry.operator.openshift.io/cluster --type=merge -p '{"spec":{"rolloutStrategy":"Recreate","replicas":1}}'
config.imageregistry.operator.openshift.io/cluster patched

Finally we will set the image registry to managed which will signal the operator to start the registry.

$ oc patch configs.imageregistry.operator.openshift.io/cluster --type=merge --patch '{"spec":{"managementState":"Managed"}}' config.imageregistry.operator.openshift.io/cluster patched

If everything went as planned we should now see the image registry pod running.

$ oc get pods -n openshift-image-registry
NAME                                               READY   STATUS    RESTARTS   AGE
cluster-image-registry-operator-6bcb795b48-jgzw9   1/1     Running   5          7h34m
image-registry-676f99fd9c-8gf4m                    1/1     Running   0          58s
node-ca-bt8ns                                      1/1     Running   4          7h16m

We should also see that our persistent volume claim is now bound and also has an associated persistent volume from our lvms-vg1 storage class.

$ oc get pvc -n openshift-image-registry
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
registry-claim   Bound    pvc-c58510da-9eb9-4220-b101-4086f58b3311   100Gi      RWO            lvms-vg1       <unset>                 6m

$ oc get pv -n openshift-image-registry
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                     STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
pvc-c58510da-9eb9-4220-b101-4086f58b3311   100Gi      RWO            Delete           Bound    openshift-image-registry/registry-claim   lvms-vg1       <unset>                          2m34s

Now that our registry is running we need to make sure we can pull and push to the registry. First let's set some environment variables for our internal registry, the user, the namespace and the token creation.

$ export REGISTRY=image-registry.openshift-image-registry.svc:5000
$ export REGISTRY_USER=builder
$ export REGISTRY_NAMESPACE=openshift-machine-config-operator
$ export TOKEN=$(oc create token $REGISTRY_USER -n $REGISTRY_NAMESPACE --duration=$((900*24))h)
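A quick aside on the --duration value: $((900*24))h is shell arithmetic that the shell expands before oc ever sees it. A minimal sketch of what the token duration resolves to:

```shell
# $((900*24)) is evaluated by the shell, so oc receives a plain "21600h",
# i.e. a 900-day token lifetime.
DURATION=$((900*24))h
echo $DURATION   # prints 21600h
```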

Next let's create the push-secret using the variables we set in the openshift-machine-config-operator namespace.

$ oc create secret docker-registry push-secret -n openshift-machine-config-operator --docker-server=$REGISTRY --docker-username=$REGISTRY_USER --docker-password=$TOKEN
secret/push-secret created

Now we need to extract the push secret and the cluster's global pull-secret.

$ oc extract secret/push-secret -n openshift-machine-config-operator --to=- > push-secret.json
# .dockerconfigjson
$ oc extract secret/pull-secret -n openshift-config --to=- > pull-secret.json
# .dockerconfigjson

We will now merge the push secret and the global pull secret into one file.

$ jq -s '.[0] * .[1]' pull-secret.json push-secret.json > pull-and-push-secret.json
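The jq filter `.[0] * .[1]` performs a recursive merge of the two JSON documents, so the auths entries from both secrets end up in one map. Below is a minimal sketch with toy dockerconfigjson files; the registry names and auth strings here are fabricated for illustration.

```shell
# Toy stand-ins for pull-secret.json and push-secret.json
cat > pull.json <<'EOF'
{"auths":{"registry.example.com":{"auth":"cHVsbA=="}}}
EOF
cat > push.json <<'EOF'
{"auths":{"image-registry.openshift-image-registry.svc:5000":{"auth":"cHVzaA=="}}}
EOF

# Recursive merge: entries under .auths from both files survive in the result
jq -s '.[0] * .[1]' pull.json push.json
```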

This new merged secret needs to be created as well in the openshift-machine-config-operator namespace.

$ oc create secret generic pull-and-push-secret -n openshift-machine-config-operator --from-file=.dockerconfigjson=pull-and-push-secret.json --type=kubernetes.io/dockerconfigjson
secret/pull-and-push-secret created

We can also validate the secrets have been created.

$ oc get secrets -n openshift-machine-config-operator | grep push
pull-and-push-secret   kubernetes.io/dockerconfigjson   1   10s
push-secret            kubernetes.io/dockerconfigjson   1   114s

At this point our registry is configured and running and we can move on to the next workflow.

Configure ConfigMap

The Machine Config Operator will be doing the heavy lifting of building the OVS-DOCA RHCOS layer, and we can use ConfigMaps to pass values or files into that build. In this case we want to generate a ConfigMap that contains our DOCA repo. We will first generate the following custom resource file.

$ cat <<EOF > repos-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: etc-yum-repos-d
  namespace: openshift-machine-config-operator
data:
  doca.repo: |
    [doca]
    name=DOCA Online Repo
    baseurl=https://linux.mellanox.com/public/repo/doca/3.3.0/rhel9/x86_64/
    enabled=1
    gpgcheck=0
  centos.repos: |
    [baseos]
    name=CentOS Stream $releasever - BaseOS
    baseurl=https://mirror.stream.centos.org/9-stream/BaseOS/x86_64/os
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
    gpgcheck=0
    repo_gpgcheck=0
    metadata_expire=6h
    countme=1
    enabled=1
    [appstream]
    name=CentOS Stream $releasever - AppStream
    baseurl=https://mirror.stream.centos.org/9-stream/AppStream/x86_64/os
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
    gpgcheck=0
    repo_gpgcheck=0
    metadata_expire=6h
    countme=1
    enabled=1
    [CRB]
    name=CentOS Stream $releasever - CRB
    baseurl=https://mirror.stream.centos.org/9-stream/CRB/x86_64/os
    gpgcheck=1
    repo_gpgcheck=0
    metadata_expire=6h
    countme=1
    enabled=1
EOF

With the file created we will go ahead and create it on the cluster.

$ oc create -f repos-configmap.yaml
configmap/etc-yum-repos-d created

For sanity we can check that it was created.

$ oc get configmap -n openshift-machine-config-operator etc-yum-repos-d
NAME              DATA   AGE
etc-yum-repos-d   2      22s

This ConfigMap will mount the repo files into the builder container when we go to create the OVS-DOCA RHCOS layer. If everything looks good we can move on to the next step in the workflow.

Generate MachineOSConfig Custom Resource

When using on-cluster layering we have to create a MachineOSConfig that will basically do the steps that the Dockerfile would have done if we were building an off-cluster layer. For our on-cluster layer we need to ensure it does the following:

  • Installs the dependencies needed
  • Upgrades some of the existing packages to a version suitable for OVS-DOCA
  • Removes the current Red Hat based version of OpenvSwitch
  • Replaces some packages with those from the DOCA repo
  • Installs the doca-all bundle which will contain the OVS-DOCA rpms including OpenvSwitch

Note that since this is a SNO setup our metadata name and machineConfigPool is master. On a multi-node cluster this would most likely be worker or the name of a custom machineConfigPool.

Below is the custom resource file we will use.

$ cat <<EOF > on-cluster-rhcos-layer-mc.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineOSConfig
metadata:
  name: master
spec:
  machineConfigPool:
    name: master
  containerFile:
  - containerfileArch: NoArch
    content: |-
      FROM configs AS final
      RUN dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
          mkdir /var/opt && \
          dnf install -y https://mirror.stream.centos.org/9-stream/CRB/x86_64/os/Packages/libyaml-devel-0.2.5-7.el9.x86_64.rpm && \
          dnf install -y https://mirror.stream.centos.org/9-stream/CRB/x86_64/os/Packages/libpcap-devel-1.10.0-4.el9.x86_64.rpm && \
          dnf install -y https://mirror.stream.centos.org/9-stream/CRB/x86_64/os/Packages/libzip-devel-1.7.3-7.el9.x86_64.rpm && \
          dnf install -y libunwind jsoncpp openssl-devel kernel-devel kernel-headers && \
          dnf upgrade -y unbound-libs unbound bzip2-libs bzip2-devel && \
          rpm-ostree override remove openvswitch3.5 && \
          rpm-ostree override replace libibverbs rdma-core --experimental --from repo='doca' && \
          dnf install doca-all -y && \
          rm -r -f /etc/yum.repos.d/* && \
          dnf clean all -y && \
          bootc container lint
  imageBuilder:
    imageBuilderType: Job
  baseImagePullSecret:
    name: pull-and-push-secret
  renderedImagePushSecret:
    name: push-secret
  renderedImagePushSpec: image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/os-image:latest
EOF

If everything looks correct we can move on to the next section in the workflow.

Create MachineOSConfig Layer

At this point we are ready to create our MachineOSConfig which will do the following:

  • Build the OVS-DOCA RHCOS layer image based on our requirements in the MachineOSConfig
  • Push that OVS-DOCA RHCOS layer image to the local OpenShift registry
  • Apply that OVS-DOCA RHCOS layer image to the system
  • Reboot the node for the changes to take effect

To kick things off we need to create the MachineOSConfig on the cluster.

$ oc create -f on-cluster-rhcos-layer-mc.yaml
machineosconfig.machineconfiguration.openshift.io/master created

Next we can look at the state of the build process by looking at the machineOSbuild state.

$ oc get machineOSbuild
NAME                                      PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED   AGE
master-9c55a02933a10f5fc31c6bb5329e1f38   False      True       False       False         False    16s

While the machineOSbuild is building we should notice that two additional pods were created in the openshift-machine-config-operator namespace: machine-os-builder and build-master.

$ oc get pods -n openshift-machine-config-operator
NAME                                                            READY   STATUS      RESTARTS   AGE
build-master-9c55a02933a10f5fc31c6bb5329e1f38-qr262             0/1     Init:0/1    0          30s
kube-rbac-proxy-crio-nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com   1/1     Running     7          27h
machine-config-controller-847595d69d-q9jzb                      2/2     Running     9          27h
machine-config-daemon-php6s                                     2/2     Running     14         27h
machine-config-nodes-crd-cleanup-29633172-tv2ml                 0/1     Completed   0          27h
machine-config-operator-6d4cbf84b4-q4z6c                        2/2     Running     9          27h
machine-config-server-vvxkk                                     1/1     Running     4          27h
machine-os-builder-9d9d855dd-9xjjv                              1/1     Running     0          39s

The build-master pod is where the actual build happens, and we can watch the build by tailing the logs of that pod's image-build container. Below is an example snippet (the log can be quite long) from the end of a successful build.

$ oc logs -f -n openshift-machine-config-operator build-master-9c55a02933a10f5fc31c6bb5329e1f38-qr262 image-build
(...)
Copying blob sha256:25e5c12c08ced2f786717e0303aff37e3ce37f8e8171ae91fb298eee4e7af424
Copying blob sha256:7e6009a201a327b83674a7491fd077e59ecfb63b63c7a570359dbf84081c6aa0
Copying config sha256:a7f6fd648ae51526c4c113756eb1af4beac6c05c6c6f4988665f3258a89e42bf
Writing manifest to image destination

Once the image-build container has finished building the image we can go back and check the status of the machineOSbuild to see it succeeded.

$ oc get machineOSbuild
NAME                                      PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED   AGE
master-9c55a02933a10f5fc31c6bb5329e1f38   False      False      True        False         False    7m16s

We will also notice that the master machineConfigPool is now in an updating state because the new RHCOS layer is being applied.

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c4363e454702b9a652e29fb24f28c7c7   False     True       False      1              0                   0                     0                      27h
worker   rendered-worker-91833d06ff7a4b4563d843249bc12228   True      False      False      0              0                   0                     0                      27h

Remember that as the OVS-DOCA RHCOS layer is being applied the node will reboot. Once the node comes back the image should be applied and we can see the update is complete by the following.

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c4363e454702b9a652e29fb24f28c7c7   True      False      False      1              1                   1                     0                      27h
worker   rendered-worker-91833d06ff7a4b4563d843249bc12228   True      False      False      0              0                   0                     0                      27h

If everything looks good we can move on to the next section of the workflow.

Validate MachineOSConfig Layer

Now that the RHCOS layer is applied to the system we can validate that the OVS-DOCA Open vSwitch is in use. First we need to start a debug pod on the node where the image was applied.

$ oc debug node/nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-26nvidiaengrdu2dcredhatcom-debug-hlcdm ...
To use host binaries, run `chroot /host`. Instead, if you need to access host namespaces, run `nsenter -a -t 1`.
Pod IP: 10.6.135.5
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host

Next we can check for what version of openvswitch is installed. The output below shows that we do in fact have the doca-openvswitch package installed.

sh-5.1# rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-39.el9fdp.noarch
doca-openvswitch-3.3.0040-1.el9.x86_64

Next we can use the ovs-vsctl command to check further whether the running Open vSwitch is in fact the DOCA version. In the output below we can see that datapath_types has doca listed, and the dpdk_version references doca as well.

sh-5.1# ovs-vsctl list open_vswitch
_uuid               : bb0f2e33-856e-49f3-b063-a25084cb7894
bridges             : [7d465f8c-2332-4cd8-9115-f5f53161d838, e137b176-e5ee-4df8-8c79-60b364c2c368]
cur_cfg             : 703
datapath_types      : [doca, netdev, system]
datapaths           : {system=748f4bb7-442e-479a-9923-811ee91408cc}
db_version          : "8.5.1"
doca_initialized    : false
doca_version        : "3.3.0109"
dpdk_initialized    : false
dpdk_version        : "DPDK 25.11.0+doca2601"
external_ids        : {hostname=nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com, ovn-bridge-mappings="physnet:br-ex", ovn-bridge-remote-probe-interval="0", ovn-enable-lflow-cache="true", ovn-encap-ip="10.6.135.5", ovn-encap-type=geneve, ovn-is-interconn="true", ovn-memlimit-lflow-cache-kb="1048576", ovn-monitor-all="true", ovn-ofctrl-wait-before-clear="0", ovn-remote="unix:/var/run/ovn/ovnsb_db.sock", ovn-remote-probe-interval="180000", ovn-set-local-ip="true", rundir="/var/run/openvswitch", system-id="01916ed0-7268-43dd-8355-68fef87a1761"}
iface_types         : [bareudp, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, srv6, stt, system, tap, vxlan]
manager_options     : []
next_cfg            : 703
other_config        : {bundle-idle-timeout="0", ovn-chassis-idx-01916ed0-7268-43dd-8355-68fef87a1761="", vlan-limit="0"}
ovs_version         : "3.3.0040"
ssl                 : []
statistics          : {}
system_type         : rhel
system_version      : "9.6"

Notice above that doca_initialized and dpdk_initialized were false. Let's enable DOCA initialization with the following, then restart Open vSwitch on the node.

sh-5.1# ovs-vsctl --no-wait set Open_vSwitch . other_config:doca-init=true
sh-5.1# systemctl restart openvswitch

And now if we look at the open_vswitch values again we can see they are now true.

sh-5.1# ovs-vsctl list open_vswitch
_uuid               : bb0f2e33-856e-49f3-b063-a25084cb7894
bridges             : [7b6aaed2-2e60-45bf-b98c-90e6806482dc, 7d465f8c-2332-4cd8-9115-f5f53161d838]
cur_cfg             : 898
datapath_types      : [doca, netdev, system]
datapaths           : {system=748f4bb7-442e-479a-9923-811ee91408cc}
db_version          : "8.5.1"
doca_initialized    : true
doca_version        : "3.3.0109"
dpdk_initialized    : true
dpdk_version        : "DPDK 25.11.0+doca2601"
external_ids        : {hostname=nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com, ovn-bridge-mappings="physnet:br-ex", ovn-bridge-remote-probe-interval="0", ovn-enable-lflow-cache="true", ovn-encap-ip="10.6.135.5", ovn-encap-type=geneve, ovn-is-interconn="true", ovn-memlimit-lflow-cache-kb="1048576", ovn-monitor-all="true", ovn-ofctrl-wait-before-clear="0", ovn-remote="unix:/var/run/ovn/ovnsb_db.sock", ovn-remote-probe-interval="180000", ovn-set-local-ip="true", rundir="/var/run/openvswitch", system-id="01916ed0-7268-43dd-8355-68fef87a1761"}
iface_types         : [bareudp, doca, docavdpa, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, srv6, stt, system, tap, vxlan]
manager_options     : []
next_cfg            : 898
other_config        : {bundle-idle-timeout="0", doca-init="true", ovn-chassis-idx-01916ed0-7268-43dd-8355-68fef87a1761="", vlan-limit="0"}
ovs_version         : "3.3.0040"
ssl                 : []
statistics          : {}
system_type         : rhel
system_version      : "9.6"

One other thing I will note here is that in this SNO environment I went ahead and upgraded from the starting 4.20.15 version to 4.21.11 with the OVS-DOCA RHCOS on-cluster image in place. After getting to 4.21.11 the OVS-DOCA Open vSwitch was still in place and running appropriately. OpenShift showed no issues with the replacement or the upgrade process.

Hopefully this provided a good example of how to build and apply an on-cluster OVS-DOCA RHCOS layer for experimental purposes.

Saturday, February 28, 2026

OpenShift Passthrough For Some


I wanted to provide a simple mechanism to configure vfio-pci devices of a certain device type when some of those device types are in use by the base operating system. For example, on some Grace Hopper nodes the only network devices might be BlueField-3 interfaces. If I want one BlueField-3 to provide networking access to the base operating system I need to leave the kernel driver in place. However, I might want to take the additional BlueField-3 devices and use them in passthrough mode, which requires them to be unbound from the mlx5 driver and bound to vfio-pci. The following writeup provides a working example, first manually and then automatically, in the context of OpenShift.

Why

There are going to be use cases where workloads running in virtual machines on OpenShift worker nodes need to have network devices in passthrough mode. This is not a problem when the OpenShift worker node cluster interface is on a different network card type than those that need to be passed to the virtual machine. It does become an issue on systems that are outfitted with all the same network interface types, because the device id for all the network cards is the same. It also means that I cannot use the traditional method of enabling passthrough for the network cards, which involves blacklisting the network kernel driver from loading and then configuring the device ids to attach to the vfio-pci driver. If we implemented that on a system with all of the same network cards, when the system rebooted to apply the machineconfig the node would come up without any networking and show as NotReady. That is why in the rest of this document we demonstrate a different practical approach to this problem.

Manually Configure

Kernel driver unbinding and binding was introduced in kernel 2.6.13 back in 2005, so it's a technology that has been around for quite some time. This is the exact feature we will be using to bind only some of our network cards to vfio-pci. To begin, let's take a look at our network interfaces via lspci, filtered by the device id 15b3:a2dc. We can see here that I have 4 network card ports on an OpenShift node in a debug pod.

sh-5.2# lspci -nn | grep 15b3:a2dc
0000:01:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0000:01:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0002:01:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0002:01:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)

Now let's examine the physical interface names for these 4 ports.

sh-5.2# grep PCI_SLOT_NAME /sys/class/net/*/device/uevent
/sys/class/net/enP2s2f0np0/device/uevent:PCI_SLOT_NAME=0002:01:00.0
/sys/class/net/enP2s2f1np1/device/uevent:PCI_SLOT_NAME=0002:01:00.1
/sys/class/net/enp1s0f0np0/device/uevent:PCI_SLOT_NAME=0000:01:00.0
/sys/class/net/enp1s0f1np1/device/uevent:PCI_SLOT_NAME=0000:01:00.1

Now we have to see which one is already in use by OpenShift so we do not inadvertently work with the wrong card. This will be the physical interface attached to the OVS bridge, which we can find with ovs-vsctl.

sh-5.2# ovs-vsctl --no-heading --format=table --columns=name,type find Interface type=system | awk '{print $1}'
enp1s0f0np0

We can see enp1s0f0np0, which correlates to the 0000:01:00.0 card. So we will focus on 0002:01:00.0 and 0002:01:00.1.
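Note that the two ports of one physical card differ only in the PCI function digit, so comparing the domain:bus:device prefix (the first eleven characters of the bus ID) is enough to decide whether a port belongs to the card OpenShift is using. A small sketch of that comparison, using the bus IDs from above:

```shell
# Bus ID of the port backing the OVS bridge (from the ovs-vsctl output above)
in_use="0000:01:00.0"

for dev in 0000:01:00.1 0002:01:00.0 0002:01:00.1; do
  # First 11 chars = domain:bus:device; same prefix means same physical card
  if [ "$(echo "$dev" | cut -c1-11)" = "$(echo "$in_use" | cut -c1-11)" ]; then
    echo "$dev: same card as the cluster interface, leave alone"
  else
    echo "$dev: candidate for passthrough"
  fi
done
```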

Now that we have determined which cards we can use we will begin the process of unbinding them from their current driver which is mlx5_core.

echo -n "0002:01:00.0" > /sys/bus/pci/drivers/mlx5_core/unbind echo -n "0002:01:00.1" > /sys/bus/pci/drivers/mlx5_core/unbind

At this point if we look at the lspci output for these two devices we will see they no longer have a "Kernel driver in use" line, confirming they have been unbound from mlx5_core.

sh-5.2# lspci -k -s 0002:01:00.0
0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel modules: mlx5_core
sh-5.2# lspci -k -s 0002:01:00.1
0002:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel modules: mlx5_core

We are now ready for them to use the vfio-pci driver, but first we may need to load that driver.

modprobe vfio-pci

We can validate that the vfio-pci driver is loaded with lsmod.

sh-5.2# lsmod | grep vfio
vfio_pci               16384  0
vfio_pci_core          90112  1 vfio_pci
vfio_iommu_type1       49152  0
vfio                   73728  3 vfio_pci_core,vfio_iommu_type1,vfio_pci
iommufd               131072  1 vfio

Now that we have unbound the two devices from their driver, let's override the kernel driver they should use with vfio-pci.

sh-5.2# echo vfio-pci > /sys/bus/pci/devices/0002:01:00.0/driver_override
sh-5.2# echo vfio-pci > /sys/bus/pci/devices/0002:01:00.1/driver_override

With the vfio-pci driver override in place we can now bind our two devices to that driver.

sh-5.2# echo "0002:01:00.0" > /sys/bus/pci/drivers/vfio-pci/bind sh-5.2# echo "0002:01:00.1" > /sys/bus/pci/drivers/vfio-pci/bind

And finally we can validate that the driver for those devices is now using the vfio-pci driver.

sh-5.2# lspci -k -s 0002:01:00.0
0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel driver in use: vfio-pci
        Kernel modules: mlx5_core
sh-5.2# lspci -k -s 0002:01:00.1
0002:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel driver in use: vfio-pci
        Kernel modules: mlx5_core

Automatically Configure

While one can manually configure vfio-pci passthrough like we did above, this won't scale in a large cluster, especially after OpenShift upgrades, so we need something more automatic. The answer is twofold: we first need a script that automates the process above, and then a mechanism for running that script on OpenShift nodes.

For the automation script we can use the example code in this repository here. This script will identify all the interfaces of a certain device type and then determine which ones can be used as passthrough devices. A device is ineligible for passthrough if it has an OVS bridge associated with it. Once we have identified the eligible list, the script unbinds the kernel driver in use on each device, applies the driver override, and binds it to vfio-pci so it is available for passthrough.

Here is a manual run on the system we used for testing.

sh-5.2# ./passthrough-some-nics.sh -n 15b3:a2dc

 NIC Name     NIC Bus ID       Kernel Driver  OCP BR NIC     PassThru Eligible
====================================================================================================
 enp1s0f0np0  0000:01:00.0     mlx5_core      Yes            No
 enp1s0f1np1  0000:01:00.1     mlx5_core      Yes            No
 enP2s2f0np0  0002:01:00.0     mlx5_core      No             Yes
 enP2s2f1np1  0002:01:00.1     mlx5_core      No             Yes

Loading vfio-pci......Done!

Unbinding device 0002:01:00.0 from mlx5_core kernel driver...
Applying driver override to device 0002:01:00.0...
Binding device 0002:01:00.0 to vfio-pci...
Device kernel driver validation...
0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel driver in use: vfio-pci
        Kernel modules: mlx5_core

Unbinding device 0002:01:00.1 from mlx5_core kernel driver...
Applying driver override to device 0002:01:00.1...
Binding device 0002:01:00.1 to vfio-pci...
Device kernel driver validation...
0002:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel driver in use: vfio-pci
        Kernel modules: mlx5_core

Notice the script changes the kernel driver in use for the two devices. If we run the script again we should see that no changes can be made because there are no other eligible passthrough devices.

sh-5.2# ./passthrough-some-nics.sh -n 15b3:a2dc

 NIC Name     NIC Bus ID       Kernel Driver  OCP BR NIC     PassThru Eligible
====================================================================================================
 enp1s0f0np0  0000:01:00.0     mlx5_core      Yes            No
 enp1s0f1np1  0000:01:00.1     mlx5_core      Yes            No
 NA           0002:01:00.0     vfio-pci       No             Complete
 NA           0002:01:00.1     vfio-pci       No             Complete
vfio_pci 16384 0 - Live 0xffffb968aee88000

Now that we have seen the script work, let's make this more relatable to OpenShift. First we will base64 encode the script by piping it through the base64 command.

$ BASE64_SCRIPT=$(cat passthrough-some-nics.sh | base64 -w 0) $ echo $BASE64_SCRIPT IyEvYmluL2Jhc2gKIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjCiMgVGhpcyBzY3JpcHQgcGFzc2VzIHRocm91Z2ggc29tZSBvZiB0aGUgTklDcyB3aGVuIGFsbCB0aGUgTklDcyBhcmUgdGhlIHNhbWUgZGV2aWNlIHR5cGUgICAgICAgICAgICAgICAgICAgIwojIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMKCiMgSG93IHRvIHVzZSB0aGUgc2NyaXB0IGlmIHVzZXIgZG9lcyBub3Qga25vdyBob3cKaG93dG8oKXsKICBlY2hvICJVc2FnZTogcGFzc3Rocm91Z2gtc29tZS1uaWNzLnNoIC1uIDxuaWMtZGV2aWNlLWlkPiIKICBlY2hvICJFeGFtcGxlIFNpbmdsZSBEZXZpY2UgSUQ6IHBhc3N0aHJvdWdoLXNvbWUtbmljcy5zaCAtbiAxNWIzOmEyZGMiCiAgZWNobyAiRXhhbXBsZSBNdWx0aSBEZXZpY2UgSUQ6IHBhc3N0aHJvdWdoLXNvbWUtbmljcy5zaCAtbiAxZGQ4OjEwMDJ8MTViMzoxMDIxIgp9CgojIEdldG9wdHMgc2V0dXAgZm9yIHZhcmlhYmxlcyB0byBwYXNzIGZyb20gb3B0aW9ucwp3aGlsZSBnZXRvcHRzIGc6bjp1OnI6aCBvcHRpb24KZG8KY2FzZSAiJHtvcHRpb259IgppbgpuKSBuaWNpZD0ke09QVEFSR307OwpoKSBob3d0bzsgZXhpdCAwOzsKXD8pIGhvd3RvOyBleGl0IDE7Owplc2FjCmRvbmUKCiMgTWFrZSBzdXJlIHRoZSB2YXJpYWJsZXMgYXJlIHBvcHVsYXRlZCB3aXRoIHZhbHVlcyBvdGhlcndpc2Ugc2hvdyBob3d0bwppZiAoWyAteiAiJG5pY2lkIiBdKSB0aGVuCiAgIGhvd3RvCiAgIGV4aXQgMQpmaQoKIyBTZXQgdGFibGUgaGVhZGVyIGZvcm1hdCAKZGl2aWRlcj09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09CmRpdmlkZXI9JGRpdmlkZXIkZGl2aWRlciRkaXZpZGVyCmhlYWRlcj0iXG4gJS0xMnMgJS0xNnMgJS0xNHMgJS0xNHMgJS0xNHNcbiIKZm9ybWF0PSIgJS0xNHMgJS0xNHMgJS0xNHMgJS0xNHMgJS0xNHNcbiIKd2lkdGg9MTAwCgojIFNsdXJwIGluIG5pYyBkZXZpY2UgdHlwZSBpZHMgZnJvbSBsc3BjaQpuaWNpZD1gZWNobyAkbmljaWQgfHNlZCAncy8sL1x8L2cnYAptYXBmaWxlIC10IG15X25pY3MgPCA8KGxzcGNpIC1ufGdyZXAgLUUgJG5pY2lkKQoKIyBQcmludCBvdXQgaGVhZGVycyAKcHJpbnRmICIkaGVhZGVyIiAiTklDIE5hbWUiICJOSUMgQnVzIElEIiAiS2VybmVsIERyaXZlciIgIk9DUCBCUiBOSUMiICJQYXNzVGhydSBFbGlnaWJsZSIKcHJpbnRmICIlJHdpZHRoLiR7d2lkdGh9c1xuIiAiJGRpdmlkZXIiCgojIEdyYWIgaW50ZXJmYWNlIGFzc29jaWF0ZWQgdG8g
b3ZzLXN5c3RlbSBicmlkZ2UuICBCb25kcyBkbyBub3Qgd29yayBoZXJlIHlldApicnBoeWludD1gb3ZzLXZzY3RsIC0tbm8taGVhZGluZyAtLWZvcm1hdD10YWJsZSAtLWNvbHVtbnM9bmFtZSx0eXBlIGZpbmQgSW50ZXJmYWNlIHR5cGU9c3lzdGVtfCBhd2sgJ3twcmludCAkMX0nYApicnBoeWJ1cz1gZ3JlcCBQQ0lfU0xPVF9OQU1FIC9zeXMvY2xhc3MvbmV0LyovZGV2aWNlL3VldmVudHxncmVwICRicnBoeWludHwgYXdrIC1GICI9IiAne3ByaW50ICQyfSdgCgojIERlY2xhcmUgZW1wdHkgYXJyYXkgdG8gc3RvcmUgbmljIGRldGFpbHMgb24gdGhvc2UgdGhhdCBjYW4gYmUgdW5ib3VuZApkZWNsYXJlIC1hIHBhc3N0aHJvdWdoPSgpCgpmb3IgKCggbmljPTA7IG5pYzwkeyNteV9uaWNzW0BdfTsgbmljKysgKSkKZG8KICAgbmljYnVzaWQ9YGVjaG8gJHtteV9uaWNzWyRuaWNdfSB8IGF3ayAne3ByaW50ICQxfSdgCiAgIG5pY2tkcnY9YGxzcGNpIC1rbiAtcyAkbmljYnVzaWQgfCBncmVwICJLZXJuZWwgZHJpdmVyIGluIHVzZToifCBhd2sgLUYgIjogIiAne3ByaW50ICQyfSdgCiAgIG5pY25hbWU9YGdyZXAgUENJX1NMT1RfTkFNRSAvc3lzL2NsYXNzL25ldC8qL2RldmljZS91ZXZlbnR8Z3JlcCAkbmljYnVzaWR8IGF3ayAtRiAnLycgJ3twcmludCAkNX0nYAogICBpZiBbICIkbmljbmFtZSIgPSAiIiBdOyB0aGVuCiAgICAgIG5pY25hbWU9Ik5BIgogICBmaQoKICAgIyBPYnRhaW4gZmlyc3QgMTEgY2hhcmFjdGVycyBvZiBlYWNoIHZhcmlhYmxlIHN0cmluZyB0byB1c2UgZm9yIGNvbXBhcmUKICAgc3VibmljYnVzaWQ9IiR7bmljYnVzaWQ6MDoxMX0iCiAgIHN1YmJycGh5YnVzPSIke2JycGh5YnVzOjA6MTF9IgoKICAgIyBDb21wYXJlIHRoZSBzdWJzdHJpbmdzCiAgIGlmIFtbICIkc3VibmljYnVzaWQiID09ICIkc3ViYnJwaHlidXMiIF1dOyB0aGVuCiAgICAgIHN5c25pYz0iWWVzIgogICAgICBwYXNzdGhydT0iTm8iCiAgICAgICMgRGlzcGxheSB0byBjb25zb2xlIHRoZSBkZXRhaWxzCiAgICAgIHByaW50ZiAiJGZvcm1hdCIgJG5pY25hbWUgJG5pY2J1c2lkICRuaWNrZHJ2ICRzeXNuaWMgJHBhc3N0aHJ1CiAgIGVsc2UKICAgICAgc3lzbmljPSJObyIKICAgICAgaWYgWyAiJG5pY2tkcnYiID0gInZmaW8tcGNpIiBdOyB0aGVuCiAgICAgICAgIHBhc3N0aHJ1PSJDb21wbGV0ZSIKICAgICAgZWxzZQogICAgICAgICBwYXNzdGhydT0iWWVzIgogICAgICAgICBwYXNzdGhyb3VnaCs9KCIkbmljYnVzaWR8JG5pY2tkcnYiKQogICAgICBmaQogICAgICAjIERpc3BsYXkgdG8gY29uc29sZSB0aGUgZGV0YWlscwogICAgICBwcmludGYgIiRmb3JtYXQiICRuaWNuYW1lICRuaWNidXNpZCAkbmlja2RydiAkc3lzbmljICRwYXNzdGhydQogICBmaQpkb25lCgppZiAhIGdyZXAgLUUgIl52ZmlvX3BjaSAiIC9wcm9jL21vZHVsZXM7IHRoZW4KICBlY2hvICIgIgogIGVjaG8gLW4gIkxvYWRpbmcgdmZpby1wY2kuLi4iCiAgbW9kcHJvYmUgdmZpby1w
Y2kKICBlY2hvICIuLi5Eb25lISIKICBlY2hvICIgIgpmaQoKCmZvciAoKCBwYXNzPTA7IHBhc3M8JHsjcGFzc3Rocm91Z2hbQF19OyBwYXNzKysgKSkKZG8KICAgbmljYnVzaWQ9YGVjaG8gJHtwYXNzdGhyb3VnaFskcGFzc119IHwgYXdrIC1GICJ8IiAne3ByaW50ICQxfSdgCiAgIG5pY2tkcnY9YGVjaG8gJHtwYXNzdGhyb3VnaFskcGFzc119IHwgYXdrIC1GICJ8IiAne3ByaW50ICQyfSdgCiAgIGVjaG8gIiAiCiAgIGVjaG8gIlVuYmluZGluZyBkZXZpY2UgJG5pY2J1c2lkIGZyb20gJG5pY2tkcnYga2VybmVsIGRyaXZlci4uLiIKICAgZWNobyAtbiAiJG5pY2J1c2lkIiA+IC9zeXMvYnVzL3BjaS9kcml2ZXJzL21seDVfY29yZS91bmJpbmQKICAgZWNobyAiQXBwbHlpbmcgZHJpdmVyIG92ZXJyaWRlIHRvIGRldmljZSAkbmljYnVzaWQuLi4iCiAgIGVjaG8gdmZpby1wY2kgPiAvc3lzL2J1cy9wY2kvZGV2aWNlcy8kbmljYnVzaWQvZHJpdmVyX292ZXJyaWRlCiAgIGVjaG8gIkJpbmRpbmcgZGV2aWNlICRuaWNidXNpZCB0byB2ZmlvLXBjaS4uLiIKICAgZWNobyAiJG5pY2J1c2lkIiA+IC9zeXMvYnVzL3BjaS9kcml2ZXJzL3ZmaW8tcGNpL2JpbmQKICAgZWNobyAiRGV2aWNlIGtlcm5lbCBkcml2ZXIgdmFsaWRhdGlvbi4uLiIKICAgbHNwY2kgLWsgLXMgJG5pY2J1c2lkCmRvbmUKZXhpdCAwCg==

We will also set our device id variable, which will get embedded in the MachineConfig as the argument for the script. Note that if we wanted to use multiple device ids we would pipe-delimit them.

$ DEVICEID="15b3:a2dc" # Single device id
$ DEVICEID="1dd8:1002|15b3:1021" # Multiple device ids
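As a hypothetical illustration of how a pipe-delimited list like this can be consumed (the variable handling below is a sketch, not lifted from the embedded script), bash can split the value into individual vendor:device pairs:

```shell
#!/bin/bash
# Illustrative only: split a pipe-delimited device id list into an array.
DEVICEID="1dd8:1002|15b3:1021"
IFS='|' read -r -a ids <<< "$DEVICEID"
for id in "${ids[@]}"; do
   # Each entry is a vendor:device pair that could be fed to, e.g., lspci -n -d "$id"
   echo "device id: $id"
done
```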

We also have to set the length of time to wait for the system to come up. 120 seconds is a good rule of thumb.

$ SLP="120"

Then we have to create a MachineConfig that will place the base64-encoded script on the system and establish a systemd service to run the script every time the node boots.

$ cat > passthrough-for-some-machineconfig.yaml << EOF
kind: MachineConfig
apiVersion: machineconfiguration.openshift.io/v1
metadata:
  name: passthrough-for-some-systemd-service
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
        - name: passthrough-for-some.service
          enabled: true
          contents: |
            [Unit]
            Description=Identifies and enabled passthough on select network interfaces
            After=NetworkManager-wait-online.service openvswitch.service
            Wants=NetworkManager-wait-online.service openvswitch.service
            [Service]
            RemainAfterExit=yes
            ExecStart=/etc/scripts/passthrough-some-nics.sh -n $DEVICEID -s $SLP
            Type=oneshot
            [Install]
            WantedBy=multi-user.target
    storage:
      files:
        - filesystem: root
          path: "/etc/scripts/passthrough-some-nics.sh"
          contents:
            source: data:text/plain;charset=utf-8;base64,$BASE64_SCRIPT
            verification: {}
          mode: 0755
          overwrite: true
EOF
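The $BASE64_SCRIPT value referenced in the MachineConfig is simply the script's contents, base64-encoded onto a single line. A minimal sketch of producing it (the /tmp path and sample contents here are placeholders):

```shell
#!/bin/bash
# Placeholder script body; in practice this would be passthrough-some-nics.sh.
printf '#!/bin/bash\necho hello\n' > /tmp/sample-script.sh
# -w0 disables line wrapping so the value fits the Ignition data URL on one line.
BASE64_SCRIPT=$(base64 -w0 /tmp/sample-script.sh)
echo "$BASE64_SCRIPT"
```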

Now let's create the MachineConfig on the cluster.

$ oc create -f passthrough-for-some-machineconfig.yaml
machineconfig.machineconfiguration.openshift.io/passthrough-for-some-systemd-service created

We need to wait for the node to reboot. Once oc get mcp is responsive and confirms the node is updated we can start to validate.

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c88d4164a5bd26edb3d4025d24a5d2f8   True      False      False      1              1                   1                     0                      6d7h
worker   rendered-worker-9890b2fbe760e8e731e68bf217b87278   True      False      False      0              0                   0                     0                      6d7h

Let's check the status of the service on the node. We can see from the output below that it has already identified the interfaces that can be made passthrough.

# systemctl status passthrough-for-some.service
● passthrough-for-some.service - Identifies and enabled passthough on select network interfaces
     Loaded: loaded (/etc/systemd/system/passthrough-for-some.service; enabled; preset: disabled)
     Active: activating (start) since Thu 2026-02-19 22:27:01 UTC; 5min ago
        Job: 408
 Invocation: 29eaf89183be4424a9f2fb4a2bd249a4
   Main PID: 4282 (passthrough-som)
      Tasks: 1 (limit: 3084134)
     Memory: 1.5M (peak: 10.8M)
        CPU: 213ms
     CGroup: /system.slice/passthrough-for-some.service
             └─4282 /bin/bash /etc/scripts/passthrough-some-nics.sh -n 15b3:a2dc

Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: ====================================================================================================
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: enp1s0f0np0     0000:01:00.0    mlx5_core       Yes     No
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: enp1s0f1np1     0000:01:00.1    mlx5_core       Yes     No
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: enP2s2f0np0     0002:01:00.0    mlx5_core       No      Yes
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: enP2s2f1np1     0002:01:00.1    mlx5_core       No      Yes
Feb 19 22:32:01 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]:
Feb 19 22:32:02 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: Loading vfio-pci......Done!
Feb 19 22:32:02 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]:
Feb 19 22:32:02 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]:
Feb 19 22:32:02 nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com passthrough-some-nics.sh[4282]: Unbinding device 0002:01:00.0 from mlx5_core kernel driver...

Let's look at the lspci output for the devices we saw in the logs. We can see the first two interfaces stayed bound to mlx5_core because those ports are part of the same card and associated with the OVS bridge. The last two ports, though, were unbound from mlx5_core and bound to vfio-pci to enable passthrough.

# lspci -k -s 0000:01:00.0
0000:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel driver in use: mlx5_core
        Kernel modules: mlx5_core

# lspci -k -s 0000:01:00.1
0000:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel driver in use: mlx5_core
        Kernel modules: mlx5_core

# lspci -k -s 0002:01:00.0
0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel driver in use: vfio-pci
        Kernel modules: mlx5_core

# lspci -k -s 0002:01:00.1
0002:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
        Subsystem: Mellanox Technologies Device 0009
        Kernel driver in use: vfio-pci
        Kernel modules: mlx5_core
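The eligibility table the script prints is built by scraping exactly this kind of lspci output. A small self-contained sketch of that extraction, using the same grep/awk pattern as the embedded script (the sample text below is canned, standing in for a live lspci -k call):

```shell
#!/bin/bash
# Canned lspci -k style output for one device (stand-in for a live query).
sample='0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244
	Subsystem: Mellanox Technologies Device 0009
	Kernel driver in use: vfio-pci
	Kernel modules: mlx5_core'
# Pull out the active driver: match the relevant line, then take the text
# after the ": " field separator.
drv=$(echo "$sample" | grep "Kernel driver in use:" | awk -F ": " '{print $2}')
echo "active driver: $drv"
```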

One final thing we can do is run the script manually on the node again to also confirm our findings.

# /etc/scripts/passthrough-some-nics.sh -n 15b3:a2dc
NIC Name        NIC Bus ID      Kernel Driver   OCP BR NIC      PassThru Eligible
====================================================================================================
enp1s0f0np0     0000:01:00.0    mlx5_core       Yes             No
enp1s0f1np1     0000:01:00.1    mlx5_core       Yes             No
NA              0002:01:00.0    vfio-pci        No              Complete
NA              0002:01:00.1    vfio-pci        No              Complete
vfio_pci 16384 0 - Live 0xffffd5d69072b000

OpenShift Virtualization Passthrough

Now that our devices are set to passthrough, we can configure OpenShift Virtualization to see them as an available resource. We will need to edit the HyperConverged resource on our OpenShift cluster and add the following section.

permittedHostDevices:
  pciHostDevices:
    - pciDeviceSelector: 15b3:a2dc
      resourceName: nvidia.com/BF3_CX7
resourceRequirements:

We can make the edit by doing the following and inserting the section above right before the resourceRequirements section of the spec file.

$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited

Then we can confirm the resources are exposed by the OpenShift node using oc describe node.

$ oc describe node | grep -E 'Capacity:|Allocatable:' -A12
Capacity:
  cpu:                            72
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              936709572Ki
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  hugepages-32Mi:                 0
  hugepages-64Ki:                 0
  memory:                         493510268Ki
  nvidia.com/BF3_CX7:             2
  pods:                           250
Allocatable:
  cpu:                            71500m
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              862197798302
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  hugepages-32Mi:                 0
  hugepages-64Ki:                 0
  memory:                         492359292Ki
  nvidia.com/BF3_CX7:             2
  pods:                           250
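If you want to check the advertised count programmatically rather than eyeballing oc describe output, the same figure can be pulled from the node's JSON. A sketch using a canned fragment in place of a live `oc get node <name> -o json` call (assumes jq is available):

```shell
#!/bin/bash
# Canned fragment of node JSON (stand-in for `oc get node <name> -o json`).
sample='{"status":{"allocatable":{"nvidia.com/BF3_CX7":"2","cpu":"71500m"}}}'
# Extract the advertised device count; the resource name contains dots and a
# slash, so it must be quoted with jq's bracket syntax.
count=$(echo "$sample" | jq -r '.status.allocatable["nvidia.com/BF3_CX7"]')
echo "allocatable BF3_CX7 devices: $count"
```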

Now when we launch a virtual machine in OpenShift, we will want to include the following section in our virtual machine spec file, nested under spec->domain->devices.

hostDevices:
  - deviceName: nvidia.com/BF3_CX7
    name: hostDevices-turquoise-hornet-42
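For orientation, where this stanza lands depends on the object kind: in a VirtualMachine it nests under spec.template.spec.domain.devices, while in a VirtualMachineInstance it sits directly under spec.domain.devices. A minimal, hypothetical VirtualMachine fragment (the metadata and hostDevices names below are illustrative):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: rhel9-example            # illustrative name
spec:
  template:
    spec:
      domain:
        devices:
          hostDevices:
            - deviceName: nvidia.com/BF3_CX7
              name: passthrough-nic   # illustrative name
```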

And if all goes well, once we launch our virtual machine and it's running, we should be able to see the passthrough ethernet interface.

$ oc get vmi -n openshift-cnv
NAMESPACE       NAME                  AGE   PHASE     IP            NODENAME                                   READY
openshift-cnv   rhel9-red-locust-96   10m   Running   10.128.0.49   nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com   True

$ virtctl console rhel9-red-locust-96 -n openshift-cnv
Successfully connected to rhel9-red-locust-96 console. The escape sequence is ^]

rhel9-red-locust-96 login: cloud-user
Password:
Last login: Fri Feb 20 08:08:53 on tty1
[cloud-user@rhel9-red-locust-96 ~]$ sudo bash
[root@rhel9-red-locust-96 cloud-user]# lspci -nn|grep Mellanox
0a:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)

Hopefully this provides a decent example of enabling passthrough for a subset of devices on a server where all the devices are identical, but not all can be passed through because of the need for base networking at the OS level.