NVIDIA's OVS-DOCA extends the traditional OVS-Kernel and OVS-DPDK data-path offload interfaces (DPIF) by introducing OVS-DOCA as an additional DPIF implementation. Built on NVIDIA's DOCA networking API, OVS-DOCA preserves the same interfaces as OVS-DPDK and OVS-Kernel while utilizing the DOCA Flow library underneath its own DPIF. Unlike the other DPIFs (DPDK, Kernel), the OVS-DOCA DPIF exploits hardware offload mechanisms and application techniques unique to NVIDIA NICs and DPUs, maximizing performance and features. This mode is especially efficient because its architecture and DOCA library integration enhance e-switch configuration and accelerate hardware offloads beyond what the other modes can achieve.
Workflow
The following experiment, which is not supported, was done in multiple OpenShift 4.18.18 environments on x86 server architecture. I tried this first on a bare metal single node OpenShift node and then on a multi-node cluster; conceptually the steps are the same, it's just a matter of where OVS-DOCA should run. This document is broken down into four primary sections covering building the image, applying it, validating it, and rolling it back.
- Build Image Layer
- Apply Image Layer
- Validate Image Layer
- Rolling Back Image Layer
Build Image Layer
Because OpenShift uses RHCOS, an image-based operating system, as its underlying OS, we first need to create an image overlay. The first step is to get the current rhel-coreos image from the cluster where we will be applying the overlay. This image will differ depending on the version of OpenShift.
$ oc adm release info --image-for rhel-coreos
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8fea916602e93f2d504affb82cb6eceb0d45c2c80fdc26c9c363bd61ade8c064
Next we need to create a dockerfile containing the steps to generate the image overlay. This requires us to add some additional dependency packages, upgrade a few, and remove openvswitch. We will also install the DOCA packages using the doca-all meta-package. Below is the dockerfile used in my example environment.
$ cat <<EOF > dockerfile.ovs-doca
### Grab oc adm release info --image-for rhel-coreos
### This example was done with 4.18.18
FROM quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8fea916602e93f2d504affb82cb6eceb0d45c2c80fdc26c9c363bd61ade8c064
### Copy in the rpm packages from the host. Need to download them first since they are not part of this repo.
COPY *.rpm /root/
### Install the dependencies that were copied into image
RUN rpm-ostree install /root/libunwind-1.6.2-1.el9.x86_64.rpm
RUN rpm-ostree install /root/libzip-devel-1.7.3-8.el9.x86_64.rpm
RUN rpm-ostree install /root/libpcap-devel-1.10.0-4.el9.x86_64.rpm
RUN rpm-ostree install /root/jsoncpp-1.9.5-1.el9.x86_64.rpm
RUN rpm-ostree install /root/libyaml-devel-0.2.5-7.el9.x86_64.rpm
RUN rpm-ostree install /root/openssl-devel-3.0.7-29.el9_4.x86_64.rpm
### These packages need to replace existing ones with packages copied into image
RUN rpm-ostree override replace /root/unbound-libs-1.16.2-18.el9_6.x86_64.rpm
RUN rpm-ostree override replace /root/unbound-1.16.2-18.el9_6.x86_64.rpm
RUN rpm-ostree override replace /root/bzip2-libs-1.0.8-10.el9_5.x86_64.rpm
RUN rpm-ostree override replace /root/bzip2-devel-1.0.8-10.el9_5.x86_64.rpm
### Remove current openvswitch from RHCOS image
RUN rpm-ostree override remove openvswitch3.4
### Install Doca Repo Local
RUN rpm-ostree install /root/doca-host-3.0.0-058000_25.04_rhel94.x86_64.rpm
### Replace packages that come from doca repo
RUN rpm-ostree override replace libibverbs rdma-core --experimental --from repo='doca'
### Create this directory otherwise the doca-all will fail halfway through
RUN mkdir /var/opt
### Install doca-all, which includes Open vSwitch, all the drivers, etc. Maybe heavy handed, but this is a test.
RUN rpm-ostree install doca-all
### Remove the Doca Repo Local
RUN rpm-ostree override remove doca-host
### Remove the repos in image
RUN rm -r -f /etc/yum.repos.d/*
### Remove the installation rpms
RUN rm -r -f /root/*.rpm
### Create the commit
RUN ostree container commit
EOF
Notice that inside the dockerfile we reference some packages that get copied into the image and then installed or upgraded. The Red Hat packages I grabbed from the entitled RHEL 9 system where I was building the image, using a simple dnf download <packagename>. The libunwind package came from the EPEL repository, and the DOCA 3.0 host repo package came from NVIDIA's download site.
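For reference, grabbing those dependencies on the build host looked something like the sketch below, assuming the dnf download plugin (dnf-plugins-core) and the EPEL repository are already enabled on that system.
$ dnf download unbound unbound-libs bzip2-libs bzip2-devel libzip-devel \
    libpcap-devel jsoncpp libyaml-devel openssl-devel libunwind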
Once all the required packages are downloaded into the same directory as the dockerfile we can go ahead and build the image.
$ podman build -t quay.io/redhat_emp1/ecosys-nvidia/ocp-4.18-doca-all:4.18.18 -f dockerfile.ovs-doca
STEP 1/21: FROM quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8fea916602e93f2d504affb82cb6eceb0d45c2c80fdc26c9c363bd61ade8c064
STEP 2/21: COPY *.rpm /root/
--> Using cache 0338d4eddd5279dfb287838e9a5c654f597a27fbe1024abc33c65c362e7b27d3
--> 0338d4eddd52
STEP 3/21: RUN rpm-ostree install /root/libunwind-1.6.2-1.el9.x86_64.rpm
--> Using cache e769aad5beeadb1fb71e599f91e312ea4db57b266aec4a5a2b09737dbe0aa3ef
--> e769aad5beea
STEP 4/21: RUN rpm-ostree install /root/libzip-devel-1.7.3-8.el9.x86_64.rpm
--> Using cache ace83c90c17411afbc294eedb695e981d14beb15695bde824d67117b1cc076a9
--> ace83c90c174
STEP 5/21: RUN rpm-ostree install /root/libpcap-devel-1.10.0-4.el9.x86_64.rpm
--> Using cache 56ffa9d4e4aeeb8b6286caa99ac528e97f740c200abc1cdc6e9dfa7cc78ff8ae
--> 56ffa9d4e4ae
STEP 6/21: RUN rpm-ostree install /root/jsoncpp-1.9.5-1.el9.x86_64.rpm
--> Using cache 3ab7f72e664c4f0a10da85dffbc0b15e16bb248a60e323de380860fa989401e1
--> 3ab7f72e664c
STEP 7/21: RUN rpm-ostree install /root/libyaml-devel-0.2.5-7.el9.x86_64.rpm
--> Using cache ff99f8e9c0e563b30b947fa1b12dd01d8296681ed4731b18f2f091c3449ca324
--> ff99f8e9c0e5
STEP 8/21: RUN rpm-ostree install /root/openssl-devel-3.0.7-29.el9_4.x86_64.rpm
--> Using cache 0b8775c2baec4a9cea8af488ab3926184fdae74bfaf61308c622a088358260d1
--> 0b8775c2baec
STEP 9/21: RUN rpm-ostree override replace /root/unbound-libs-1.16.2-18.el9_6.x86_64.rpm
--> Using cache 74189b3a141874ad92c514f40d31a3c31136c9600eeaef809bc0b029a67dcc05
--> 74189b3a1418
STEP 10/21: RUN rpm-ostree override replace /root/unbound-1.16.2-18.el9_6.x86_64.rpm
--> Using cache 45a3d5a2c905fc5d6718af925f7d9e362450c81f92b97fe3312cded64831c6fd
--> 45a3d5a2c905
STEP 11/21: RUN rpm-ostree override replace /root/bzip2-libs-1.0.8-10.el9_5.x86_64.rpm
--> Using cache b6535e27d919fbbd0b39e9c319e30dbc05f2b2c304aa6836d3e748f83b7ef521
--> b6535e27d919
STEP 12/21: RUN rpm-ostree override replace /root/bzip2-devel-1.0.8-10.el9_5.x86_64.rpm
--> Using cache c35bdde7a95ae7d83bb53d0503dfa9fb1743bbe8db0f03e24d770d49d35997e1
--> c35bdde7a95a
STEP 13/21: RUN rpm-ostree override remove openvswitch3.4
--> Using cache 2e486cfa147a2339e3d91881938686509f73efea1621759758ab8a9cc7839f31
--> 2e486cfa147a
STEP 14/21: RUN rpm-ostree install /root/doca-host-3.0.0-058000_25.04_rhel94.x86_64.rpm
--> Using cache f71acdfb1a5d823858a6d56f32734e1dc552cd5048457fc54b6b60a29e09aaa9
--> f71acdfb1a5d
STEP 15/21: RUN rpm-ostree override replace libibverbs rdma-core --experimental --from repo='doca'
--> Using cache 5281c478b6930c903e2fc3988e2d7f05e0a365e992240594684a7c1d28c90981
--> 5281c478b693
STEP 16/21: RUN mkdir /var/opt
--> Using cache dc9c0e3fa4c63f0370826ac8bfb75ae5e4c25cce39082239f37f5a8e029bd636
--> dc9c0e3fa4c6
STEP 17/21: RUN rpm-ostree install doca-all
--> Using cache 2938b9c1b177984f200e7a694a4fd7a1072b301ff5f1436a28363940dfb700e7
--> 2938b9c1b177
STEP 18/21: RUN rpm-ostree override remove doca-host
--> Using cache e013cc9d87d83eedf30962f6849bd69f08999f5576c21ed456c567b2de406d49
--> e013cc9d87d8
STEP 19/21: RUN rm -r -f /etc/yum.repos.d/*
--> Using cache 395dce56b87285d6b76c08bd0fef64bc0c03058b2b132de7175ef4d5b9c91f11
--> 395dce56b872
STEP 20/21: RUN rm -r -f /root/*.rpm
--> Using cache 3b71079501cea2600d2068237fc60213d5a92f16bbc2e800b92bd0040fea4dea
--> 3b71079501ce
STEP 21/21: RUN ostree container commit
--> Using cache f51e6f47355f0eac8eb68a7ab4f001c1d71787b9dde6dc06307af01e338d524e
COMMIT quay.io/redhat_emp1/ecosys-nvidia/ocp-4.18-doca-all:4.18.18
--> f51e6f47355f
Successfully tagged quay.io/redhat_emp1/ecosys-nvidia/ocp-4.18-doca-all:4.18.18
Successfully tagged quay.io/redhat_emp1/ecosys-nvidia/ocp-4.18-doca-all:4.18.18-new
f51e6f47355f0eac8eb68a7ab4f001c1d71787b9dde6dc06307af01e338d524e
Once the image is created we can push it to a registry our OpenShift cluster will be able to access.
$ podman push quay.io/redhat_emp1/ecosys-nvidia/ocp-4.18-doca-all:4.18.18
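As an optional sanity check before applying anything, oc can inspect the pushed image to confirm the registry is reachable, for example:
$ oc image info quay.io/redhat_emp1/ecosys-nvidia/ocp-4.18-doca-all:4.18.18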
If everything went well we can move onto applying the image to our OpenShift cluster.
Apply Image Layer
Now that the image has been created and pushed to a registry, we can apply it to an OpenShift cluster. Remember we derived the image from an OpenShift 4.18.18 release, so make sure that is the version in use. If the version is different, go back and generate a new image with the correct base RHCOS image for that version of OpenShift. To apply the image we have to generate a machineconfig, something like the example below, that references the image we generated.
$ cat <<EOF > doca-ovs-machineconfig.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: doca-ovs-layer-machineconfig
spec:
  osImageURL: quay.io/redhat_emp1/ecosys-nvidia/ocp-4.18-doca-all:4.18.18
EOF
I accidentally found that, for worker nodes only, I needed to have hugepages configured. The rolling back section later in this blog shows how to roll off the image in the event hugepages were not enabled. We will need to create the following hugepage machineconfig and apply it to our cluster.
$ cat 50-kargs-1g-hugepages.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 50-kargs-1g-hugepages
spec:
  kernelArguments:
  - default_hugepagesz=1G
  - hugepagesz=1G
  - hugepages=16
$ oc create -f 50-kargs-1g-hugepages.yaml
machineconfig.machineconfiguration.openshift.io/50-kargs-1g-hugepages created
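After the workers reboot, one way to confirm the kernel arguments took effect is from a debug pod; the node name below is just a placeholder.
$ oc debug node/<worker-node> -- chroot /host cat /proc/cmdline
$ oc debug node/<worker-node> -- chroot /host grep -i huge /proc/meminfo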
Once the nodes have rebooted for the hugepage machineconfig, we can apply the machineconfig resource file that layers the OVS-DOCA image onto the cluster.
$ oc create -f doca-ovs-machineconfig.yaml
machineconfig.machineconfiguration.openshift.io/doca-ovs-layer-machineconfig created
Once the machineconfig is created, oc get mcp will show that the node (since this is a SNO example) is updating. This will take a bit and the node will reboot.
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-a202f531d9140f133bed92703e0f6757 False True False 1 0 0 0 173m
worker rendered-worker-7c2aa2d41cd8936b50979161c38c5eb8 True False False 0 0 0 0 173m
Once the node reboots and starts to come up we should see it come all the way back up to the point where all services are accessible. If the node does not come back then something went terribly wrong or a step was missed.
Validate Image Layer
Hopefully the node came back and once it does we can do some validation. First let's open a debug pod.
$ oc debug node/nvd-srv-28.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-28nvidiaengrdu2dcredhatcom-debug-6z55q ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.7
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
If we look for any openvswitch packages, we can see that only the doca-openvswitch one is listed.
sh-5.1# rpm -qa|grep openv
openvswitch-selinux-extra-policy-1.0-39.el9fdp.noarch
doca-openvswitch-3.0.0-0056_25.04_based_3.3.5.el9.x86_64
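As a quick additional check, we can confirm the vswitchd service is healthy and report the version of the running binary with something like the following.
sh-5.1# systemctl status ovs-vswitchd --no-pager
sh-5.1# ovs-vswitchd --version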
We can also dump the Open vSwitch configuration and see the references to DOCA. In this example the host had neither the NVIDIA Network Operator nor its NicClusterPolicy deployed.
sh-5.1# ovs-vsctl list open_vswitch
_uuid : 16de7e20-158c-47f8-94e4-05584185e14c
bridges : [322924a2-25fc-4d4a-945a-646a7b36f9f8, 9faee574-fa6b-41be-be92-e52c3a8d083e]
cur_cfg : 364
datapath_types : [doca, netdev, system]
datapaths : {system=9feef28c-87e7-4c12-9519-9fe5946093df}
db_version : "8.5.1"
doca_initialized : false
doca_version : "3.0.0058"
dpdk_initialized : false
dpdk_version : "MLNX_DPDK 22.11.2504.1.0"
external_ids : {hostname=nvd-srv-28.nvidia.eng.rdu2.dc.redhat.com, ovn-bridge-mappings="physnet:br-ex", ovn-enable-lflow-cache="true", ovn-encap-ip="10.6.135.7", ovn-encap-type=geneve, ovn-is-interconn="true", ovn-memlimit-lflow-cache-kb="1048576", ovn-monitor-all="true", ovn-ofctrl-wait-before-clear="0", ovn-openflow-probe-interval="180", ovn-remote="unix:/var/run/ovn/ovnsb_db.sock", ovn-remote-probe-interval="180000", ovn-set-local-ip="true", rundir="/var/run/openvswitch", system-id="28e16fc6-eca6-4165-bd1c-235bf5884961"}
iface_types : [bareudp, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, srv6, stt, system, tap, vxlan]
manager_options : []
next_cfg : 364
other_config : {bundle-idle-timeout="180", ovn-chassis-idx-28e16fc6-eca6-4165-bd1c-235bf5884961="", vlan-limit="0"}
ovs_version : "3.0.0-0056-25.04-based-3.3.5"
ssl : []
statistics : {}
system_type : rhcos
system_version : "4.18"
In comparison, an OpenShift worker node that does not have OVS-DOCA installed looks like this.
sh-5.1# ovs-vsctl list open_vswitch
_uuid : 1bd70b6e-60d2-45b5-9c66-8b535aa5b8ff
bridges : [5fa50e55-6306-4e3b-aee4-68e13364f861, 846a2c77-92d5-479a-8531-a1c9955c3934]
cur_cfg : 4019
datapath_types : [netdev, system]
datapaths : {system=3ec8dbee-824f-4e1b-999d-9ed77bcccee7}
db_version : "8.8.0"
dpdk_initialized : false
dpdk_version : "DPDK 23.11.3"
external_ids : {hostname=nvd-srv-29.nvidia.eng.rdu2.dc.redhat.com, ovn-bridge-mappings="physnet:br-ex", ovn-enable-lflow-cache="true", ovn-encap-ip="10.6.135.8", ovn-encap-type=geneve, ovn-is-interconn="true", ovn-memlimit-lflow-cache-kb="1048576", ovn-monitor-all="true", ovn-ofctrl-wait-before-clear="0", ovn-openflow-probe-interval="180", ovn-remote="unix:/var/run/ovn/ovnsb_db.sock", ovn-remote-probe-interval="180000", ovn-set-local-ip="true", rundir="/var/run/openvswitch", system-id="512f7a47-01d9-42fd-bdf9-906ecda172bb"}
iface_types : [bareudp, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, srv6, stt, system, tap, vxlan]
manager_options : []
next_cfg : 4019
other_config : {bundle-idle-timeout="180", doca-init="true", hw-offload="true", ovn-chassis-idx-512f7a47-01d9-42fd-bdf9-906ecda172bb="", vlan-limit="0"}
ovs_version : "3.4.3-66.el9fdp"
ssl : []
statistics : {}
system_type : rhcos
system_version : "4.18"
Here is an example on a Dell R760xa with the Network Operator and its NicClusterPolicy also deployed. We can see that both doca_initialized and dpdk_initialized are set to true. Further, in the other_config options we can see that doca-init and hw-offload are also set to true.
sh-5.1# ovs-vsctl list open_vswitch
_uuid : 901442c2-069e-424a-92b0-40d5dd785ba2
bridges : [0c21ea64-1bb4-48b5-a9c3-39f9b08bb41a, bc402e73-8dd0-494b-83fc-06b7de9ce13e]
cur_cfg : 3959
datapath_types : [doca, netdev, system]
datapaths : {system=f2f9d8f2-be83-4d04-aaab-73d9b83d3765}
db_version : "8.5.1"
doca_initialized : true
doca_version : "3.0.0058"
dpdk_initialized : true
dpdk_version : "MLNX_DPDK 22.11.2504.1.0"
external_ids : {hostname=nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com, ovn-bridge-mappings="physnet:br-ex", ovn-enable-lflow-cache="true", ovn-encap-ip="10.6.135.9", ovn-encap-type=geneve, ovn-is-interconn="true", ovn-memlimit-lflow-cache-kb="1048576", ovn-monitor-all="true", ovn-ofctrl-wait-before-clear="0", ovn-openflow-probe-interval="180", ovn-remote="unix:/var/run/ovn/ovnsb_db.sock", ovn-remote-probe-interval="180000", ovn-set-local-ip="true", rundir="/var/run/openvswitch", system-id="bfaff038-def5-433b-ac43-1cc421728f88"}
iface_types : [bareudp, doca, docavdpa, docavhostuser, docavhostuserclient, dpdk, dpdkvhostuser, dpdkvhostuserclient, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, srv6, stt, system, tap, vxlan]
manager_options : []
next_cfg : 3959
other_config : {bundle-idle-timeout="180", doca-init="true", hw-offload="true", ovn-chassis-idx-bfaff038-def5-433b-ac43-1cc421728f88="", vlan-limit="0"}
ovs_version : "3.0.0-0056-25.04-based-3.3.5"
ssl : []
statistics : {}
system_type : rhcos
system_version : "4.18"
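In my environment the NicClusterPolicy took care of enabling those options. For reference, on a host configured by hand the equivalent toggles would look roughly like the sketch below; check NVIDIA's OVS-DOCA documentation for the authoritative steps for your release.
sh-5.1# ovs-vsctl set Open_vSwitch . other_config:doca-init=true
sh-5.1# ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
sh-5.1# systemctl restart ovs-vswitchd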
Now let's see if GPUDirect RDMA works any differently with OVS-DOCA. I used the standard test from my previous blogs and got the same results on similar hardware, so for brevity I am not going to show the test steps or results here.
Rolling Back Image Layer
Sometimes things do not go the way we expect. When using a bare metal worker I found that the image layering would fail if hugepages were not enabled. This is what I saw from Open vSwitch when the OVS-DOCA layer was applied to worker nodes.
2025-07-10T14:53:22.723Z|00014|dpdk|INFO|Using MLNX_DPDK 22.11.2504.1.0
2025-07-10T14:53:22.723Z|00015|dpdk|INFO|DPDK Enabled - initializing...
2025-07-10T14:53:22.723Z|00016|dpdk|INFO|Setting max memzones to 10000
2025-07-10T14:53:22.723Z|00017|dpdk|INFO|EAL ARGS: ovs-vswitchd -a 0000:00:00.0 --file-prefix=ovs-5338 --in-memory -l 0.
2025-07-10T14:53:22.726Z|00018|dpdk|INFO|EAL: Detected CPU lcores: 128
2025-07-10T14:53:22.726Z|00019|dpdk|INFO|EAL: Detected NUMA nodes: 2
2025-07-10T14:53:22.726Z|00020|dpdk|INFO|EAL: Detected static linkage of DPDK
2025-07-10T14:53:22.727Z|00021|dpdk|INFO|EAL: rte_mem_virt2phy(): cannot open /proc/self/pagemap: Permission denied
2025-07-10T14:53:22.727Z|00022|dpdk|INFO|EAL: Selected IOVA mode 'VA'
2025-07-10T14:53:22.727Z|00023|dpdk|WARN|EAL: No free 2048 kB hugepages reported on node 0
2025-07-10T14:53:22.727Z|00024|dpdk|WARN|EAL: No free 2048 kB hugepages reported on node 1
2025-07-10T14:53:22.727Z|00025|dpdk|WARN|EAL: No free 1048576 kB hugepages reported on node 0
2025-07-10T14:53:22.727Z|00026|dpdk|WARN|EAL: No free 1048576 kB hugepages reported on node 1
2025-07-10T14:53:22.727Z|00027|dpdk|ERR|EAL: Cannot get hugepage information.
2025-07-10T14:53:22.727Z|00028|dpdk|EMER|Unable to initialize DPDK: Permission denied
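Those EAL warnings point straight at empty hugepage pools, which can be confirmed from a shell on the affected worker; without the kernel arguments, HugePages_Total and HugePages_Free both report 0.
$ grep -i huge /proc/meminfo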
We did not see this behavior on a bare metal single node OpenShift deployment, only on workers. The first step was to roll back my change, but just deleting the machineconfig was not enough because the worker node never got back to a ready state. First I ssh'd into the node as the core user, since the node did still have network connectivity.
$ ssh core@nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com
Red Hat Enterprise Linux CoreOS 418.94.202506121335-0
Part of OpenShift 4.18, RHCOS is a Kubernetes-native operating system
managed by the Machine Config Operator (`clusteroperator/machine-config`).
WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
https://docs.openshift.com/container-platform/4.18/architecture/architecture-rhcos.html
---
Last login: Thu Jul 10 14:51:08 2025 from 10.22.89.172
[systemd]
Failed Units: 2
NetworkManager-wait-online.service
ovs-vswitchd.service
[core@nvd-srv-30 ~]$ sudo bash
[systemd]
Failed Units: 2
NetworkManager-wait-online.service
ovs-vswitchd.service
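The failed ovs-vswitchd unit is where the DPDK errors shown at the top of this section came from; if needed they can be pulled from the journal with something like:
[root@nvd-srv-30 core]# journalctl -u ovs-vswitchd --no-pager | tail -n 20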
Once I became root I could see that Open vSwitch had not started, and as shown above the logs pointed to hugepages not being configured. Next I looked at the rpm-ostree status, which shows our running image and our previous image.
[root@nvd-srv-30 core]# rpm-ostree status
State: idle
Deployments:
● ostree-unverified-registry:quay.io/redhat_emp1/ecosys-nvidia/ocp-4.18-doca-all:4.18.18
Digest: sha256:9fabd9c17f9124b443aa5d43d67a7b118ef510ee938aa7970ae41bd4d8d7697e
Version: 418.94.202506121335-0 (2025-07-09T13:28:35Z)
ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8fea916602e93f2d504affb82cb6eceb0d45c2c80fdc26c9c363bd61ade8c064
Digest: sha256:8fea916602e93f2d504affb82cb6eceb0d45c2c80fdc26c9c363bd61ade8c064
Version: 418.94.202506121335-0 (2025-06-12T13:39:57Z)
I opted to roll back to the previous image, which I knew I could do.
[root@nvd-srv-30 core]# rpm-ostree rollback
Moving 'd1fb888c12bc35c6d59679aa521c0f950013f80bce397ed0af73181c33305679.0' to be first deployment
Transaction complete; bootconfig swap: no; bootversion: boot.0.0, deployment count change: 0
Downgraded:
bzip2-libs 1.0.8-10.el9_5 -> 1.0.8-8.el9_4.1
libibverbs 2501mlnx56-1.2504061 -> 48.0-1.el9
rdma-core 2501mlnx56-1.2504061 -> 48.0-1.el9
unbound-libs 1.16.2-18.el9_6 -> 1.16.2-8.el9_4.1
Removed:
bzip2-devel-1.0.8-10.el9_5.x86_64
clusterkit-1.15.470-1.2504061.20250428.80af081.x86_64
cmake-filesystem-3.26.5-2.el9.x86_64
collectx-clxapi-1.21.1-1.x86_64
collectx-clxapidev-1.21.1-1.x86_64
doca-all-3.0.0-058000.x86_64
doca-apsh-config-3.0.0058-1.el9.x86_64
doca-bench-3.0.0058-1.el9.x86_64
doca-caps-3.0.0058-1.el9.x86_64
doca-comm-channel-admin-3.0.0058-1.el9.x86_64
doca-devel-3.0.0-058000.x86_64
doca-dms-3.0.0058-1.el9.x86_64
doca-flow-tune-3.0.0058-1.el9.x86_64
doca-ofed-3.0.0-058000.x86_64
doca-openvswitch-3.0.0-0056_25.04_based_3.3.5.el9.x86_64
doca-pcc-counters-3.0.0058-1.el9.x86_64
doca-perftest-1.0.1-1.el9.x86_64
doca-runtime-3.0.0-058000.x86_64
doca-samples-3.0.0058-1.el9.x86_64
doca-sdk-aes-gcm-3.0.0058-1.el9.x86_64
doca-sdk-aes-gcm-devel-3.0.0058-1.el9.x86_64
doca-sdk-apsh-3.0.0058-1.el9.x86_64
doca-sdk-apsh-devel-3.0.0058-1.el9.x86_64
doca-sdk-argp-3.0.0058-1.el9.x86_64
doca-sdk-argp-devel-3.0.0058-1.el9.x86_64
doca-sdk-comch-3.0.0058-1.el9.x86_64
doca-sdk-comch-devel-3.0.0058-1.el9.x86_64
doca-sdk-common-3.0.0058-1.el9.x86_64
doca-sdk-common-devel-3.0.0058-1.el9.x86_64
doca-sdk-compress-3.0.0058-1.el9.x86_64
doca-sdk-compress-devel-3.0.0058-1.el9.x86_64
doca-sdk-devemu-3.0.0058-1.el9.x86_64
doca-sdk-devemu-devel-3.0.0058-1.el9.x86_64
doca-sdk-dma-3.0.0058-1.el9.x86_64
doca-sdk-dma-devel-3.0.0058-1.el9.x86_64
doca-sdk-dpa-3.0.0058-1.el9.x86_64
doca-sdk-dpa-devel-3.0.0058-1.el9.x86_64
doca-sdk-dpdk-bridge-3.0.0058-1.el9.x86_64
doca-sdk-dpdk-bridge-devel-3.0.0058-1.el9.x86_64
doca-sdk-erasure-coding-3.0.0058-1.el9.x86_64
doca-sdk-erasure-coding-devel-3.0.0058-1.el9.x86_64
doca-sdk-eth-3.0.0058-1.el9.x86_64
doca-sdk-eth-devel-3.0.0058-1.el9.x86_64
doca-sdk-flow-3.0.0058-1.el9.x86_64
doca-sdk-flow-devel-3.0.0058-1.el9.x86_64
doca-sdk-flow-trace-3.0.0058-1.el9.x86_64
doca-sdk-pcc-3.0.0058-1.el9.x86_64
doca-sdk-pcc-devel-3.0.0058-1.el9.x86_64
doca-sdk-rdma-3.0.0058-1.el9.x86_64
doca-sdk-rdma-devel-3.0.0058-1.el9.x86_64
doca-sdk-sha-3.0.0058-1.el9.x86_64
doca-sdk-sha-devel-3.0.0058-1.el9.x86_64
doca-sdk-sta-3.0.0058-1.el9.x86_64
doca-sdk-sta-devel-3.0.0058-1.el9.x86_64
doca-sdk-telemetry-3.0.0058-1.el9.x86_64
doca-sdk-telemetry-devel-3.0.0058-1.el9.x86_64
doca-sdk-telemetry-exporter-3.0.0058-1.el9.x86_64
doca-sdk-telemetry-exporter-devel-3.0.0058-1.el9.x86_64
doca-sdk-urom-3.0.0058-1.el9.x86_64
doca-sdk-urom-devel-3.0.0058-1.el9.x86_64
doca-socket-relay-3.0.0058-1.el9.x86_64
doca-sosreport-4.9.0-1.el9.noarch
doca-telemetry-utils-3.0.0058-1.el9.x86_64
dpa-gdbserver-25.04.2725-0.el9.x86_64
dpa-resource-mgmt-25.04.0169-1.el9.x86_64
dpa-stats-25.04.0169-0.el9.x86_64
dpacc-1.11.0.6-1.el9.x86_64
dpacc-extract-1.11.0.6-1.el9.x86_64
flexio-samples-25.04.2725-0.el9.noarch
flexio-sdk-25.04.2725-0.el9.x86_64
glib2-devel-2.68.4-14.el9_4.1.x86_64
hcoll-4.8.3230-1.20250428.1a4e38d7.x86_64
ibacm-2501mlnx56-1.2504061.x86_64
ibarr-0.1.3-1.2504061.x86_64
ibdump-6.0.0-1.2504061.x86_64
ibsim-0.12-1.2504061.x86_64
ibutils2-2.1.1-0.22200.MLNX20250423.g91730569c.2504061.x86_64
infiniband-diags-2501mlnx56-1.2504061.x86_64
jsoncpp-1.9.5-1.el9.x86_64
kernel-headers-5.14.0-570.25.1.el9_6.x86_64
kmod-iser-25.04-OFED.25.04.0.6.1.1.rhel9u4.x86_64
kmod-isert-25.04-OFED.25.04.0.6.1.1.rhel9u4.x86_64
kmod-kernel-mft-mlnx-4.32.0-1.rhel9u4.x86_64
kmod-knem-1.1.4.90mlnx3-OFED.23.10.0.2.1.1.rhel9u4.x86_64
kmod-mlnx-ofa_kernel-25.04-OFED.25.04.0.6.1.1.rhel9u4.x86_64
kmod-srp-25.04-OFED.25.04.0.6.1.1.rhel9u4.x86_64
kmod-xpmem-2.7.4-1.2504061.rhel9u4.rhel9u4.x86_64
libblkid-devel-2.37.4-18.el9.x86_64
libffi-devel-3.4.2-8.el9.x86_64
libgfortran-11.5.0-5.el9_5.x86_64
libibumad-2501mlnx56-1.2504061.x86_64
libibverbs-utils-2501mlnx56-1.2504061.x86_64
libmount-devel-2.37.4-18.el9.x86_64
libnl3-devel-3.9.0-1.el9.x86_64
libpcap-devel-14:1.10.0-4.el9.x86_64
libquadmath-11.5.0-5.el9_5.x86_64
librdmacm-2501mlnx56-1.2504061.x86_64
librdmacm-utils-2501mlnx56-1.2504061.x86_64
libselinux-devel-3.6-1.el9.x86_64
libsepol-devel-3.6-1.el9.x86_64
libunwind-1.6.2-1.el9.x86_64
libxpmem-2.7.4-1.2504061.rhel9u4.x86_64
libxpmem-devel-2.7.4-1.2504061.rhel9u4.x86_64
libyaml-devel-0.2.5-7.el9.x86_64
libzip-1.7.3-8.el9.x86_64
libzip-devel-1.7.3-8.el9.x86_64
meson-0.61.2-1.el9.noarch
mft-4.32.0-120.x86_64
mlnx-dpdk-22.11.0-2504.1.0.2504061.x86_64
mlnx-dpdk-devel-22.11.0-2504.1.0.2504061.x86_64
mlnx-ethtool-6.11-1.2504061.x86_64
mlnx-iproute2-6.12.0-1.2504061.x86_64
mlnx-ofa_kernel-25.04-OFED.25.04.0.6.1.1.rhel9u4.x86_64
mlnx-ofa_kernel-devel-25.04-OFED.25.04.0.6.1.1.rhel9u4.x86_64
mlnx-ofa_kernel-source-25.04-OFED.25.04.0.6.1.1.rhel9u4.x86_64
mlnx-tools-25.01-0.2504061.x86_64
mpitests_openmpi-3.2.24-2ffc2d6.2504061.x86_64
ninja-build-1.10.2-3.el9~bootstrap.x86_64
nvhws-25.04-1.el9.x86_64
nvhws-devel-25.04-1.el9.x86_64
ofed-scripts-25.04-OFED.25.04.0.6.1.x86_64
openmpi-3:4.1.7rc1-1.2504061.20250428.6d9519e4c3.x86_64
opensm-5.23.00.MLNX20250423.ac516692-0.1.2504061.x86_64
opensm-devel-5.23.00.MLNX20250423.ac516692-0.1.2504061.x86_64
opensm-libs-5.23.00.MLNX20250423.ac516692-0.1.2504061.x86_64
opensm-static-5.23.00.MLNX20250423.ac516692-0.1.2504061.x86_64
openssl-devel-1:3.0.7-29.el9_4.x86_64
pcre-cpp-8.44-3.el9.3.x86_64
pcre-devel-8.44-3.el9.3.x86_64
pcre-utf16-8.44-3.el9.3.x86_64
pcre-utf32-8.44-3.el9.3.x86_64
pcre2-devel-10.40-5.el9.x86_64
pcre2-utf16-10.40-5.el9.x86_64
pcre2-utf32-10.40-5.el9.x86_64
perftest-25.04.0-0.84.g97da83e.2504061.x86_64
python3-file-magic-5.39-16.el9.noarch
python3-pexpect-4.8.0-7.el9.noarch
python3-ptyprocess-0.6.0-12.el9.noarch
python3-pyverbs-2501mlnx56-1.2504061.x86_64
rdma-core-devel-2501mlnx56-1.2504061.x86_64
rshim-2.3.8-0.geaa5c03.x86_64
sharp-3.11.0.MLNX20250423.66d243a0-1.2504061.x86_64
srp_daemon-2501mlnx56-1.2504061.x86_64
sysprof-capture-devel-3.40.1-3.el9.x86_64
ucx-1.19.0-1.2504061.20250428.6ecd4e5ae.x86_64
ucx-cma-1.19.0-1.2504061.20250428.6ecd4e5ae.x86_64
ucx-devel-1.19.0-1.2504061.20250428.6ecd4e5ae.x86_64
ucx-ib-1.19.0-1.2504061.20250428.6ecd4e5ae.x86_64
ucx-ib-mlx5-1.19.0-1.2504061.20250428.6ecd4e5ae.x86_64
ucx-knem-1.19.0-1.2504061.20250428.6ecd4e5ae.x86_64
ucx-rdmacm-1.19.0-1.2504061.20250428.6ecd4e5ae.x86_64
ucx-xpmem-1.19.0-1.2504061.20250428.6ecd4e5ae.x86_64
unbound-1.16.2-18.el9_6.x86_64
xpmem-2.7.4-1.2504061.rhel9u4.x86_64
xz-devel-5.2.5-8.el9_0.x86_64
zlib-devel-1.2.11-40.el9.x86_64
Added:
openvswitch3.4-3.4.2-66.el9fdp.x86_64
Changes queued for next boot. Run "systemctl reboot" to start a reboot
I then checked the rpm-ostree status again to confirm that the original deployment was at the top of the list; the next step was a reboot.
[root@nvd-srv-30 core]# rpm-ostree status
State: idle
Deployments:
ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8fea916602e93f2d504affb82cb6eceb0d45c2c80fdc26c9c363bd61ade8c064
Digest: sha256:8fea916602e93f2d504affb82cb6eceb0d45c2c80fdc26c9c363bd61ade8c064
Version: 418.94.202506121335-0 (2025-06-12T13:39:57Z)
Diff: 4 downgraded, 156 removed, 1 added
● ostree-unverified-registry:quay.io/redhat_emp1/ecosys-nvidia/ocp-4.18-doca-all:4.18.18
Digest: sha256:9fabd9c17f9124b443aa5d43d67a7b118ef510ee938aa7970ae41bd4d8d7697e
Version: 418.94.202506121335-0 (2025-07-09T13:28:35Z)
[root@nvd-srv-30 core]#
[root@nvd-srv-30 core]# reboot
[root@nvd-srv-30 core]# Connection to nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com closed by remote host.
This did not completely solve my backout strategy, though. The machineconfig that applied the updated layer was gone and my worker was running the original image, but oc get mcp still showed an updating and degraded state.
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-0230114d55788bab601fd33f8c816798 True False False 3 3 3 0 20d
worker rendered-worker-c8361c0ad5c75212f16f53fa60772292 False True True 2 1 1 1 20d
This was because the Machine Config Operator still thought the system should be using the layered image, which I could see from the following.
$ oc describe mcp worker
Name: worker
Namespace:
Labels: machineconfiguration.openshift.io/mco-built-in=
pools.operator.machineconfiguration.openshift.io/worker=
Annotations: sriovnetwork.openshift.io/state: Paused
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfigPool
(...)
Last Transition Time: 2025-07-10T15:09:15Z
Message: Node nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com is reporting: "unexpected on-disk state validating against rendered-worker-0483f3f9c265f75685e1b23edf5d261d: expected target osImageURL \"quay.io/redhat_emp1/ecosys-nvidia/ocp-4.18-doca-all:4.18.18\", have \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8fea916602e93f2d504affb82cb6eceb0d45c2c80fdc26c9c363bd61ade8c064\""
Reason: 1 nodes are reporting degraded status on sync
Status: True
Type: NodeDegraded
(...)
Degraded Machine Count: 1
Machine Count: 2
Observed Generation: 13
Ready Machine Count: 1
Unavailable Machine Count: 1
Updated Machine Count: 1
Events: <none>
To resolve this I needed to find my last good rendered worker config, which was rendered-worker-c8361c0ad5c75212f16f53fa60772292.
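If it is not obvious which rendered config was the last good one, the currentConfig annotation on a still-healthy worker is one way to find it; substitute one of your own healthy worker node names below.
$ oc get node <healthy-worker-node> -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}'
Listing the rendered machineconfigs shows all of the candidates.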
$ oc get mc|grep rendered-worker
rendered-worker-0483f3f9c265f75685e1b23edf5d261d efe259e04ba98784102ba603941ecbbb75233c6b 3.4.0 4h26m
rendered-worker-0c4bed761bad4468c07325ab74dd8d7a efe259e04ba98784102ba603941ecbbb75233c6b 3.4.0 3h18m
rendered-worker-1ecf444812421713d125d8c2fab0c8b5 efe259e04ba98784102ba603941ecbbb75233c6b 3.4.0 4h14m
rendered-worker-1f9846d31d24be3134459fe31a1e3eb9 00143af1a51bedf0290496a6a97e47cf60b18693 3.4.0 21d
rendered-worker-5ace9f4a37135e9a1ad365d34e36d5a4 00143af1a51bedf0290496a6a97e47cf60b18693 3.4.0 21d
rendered-worker-770c72f8be98e98497c57232e1e284f0 00143af1a51bedf0290496a6a97e47cf60b18693 3.4.0 21d
rendered-worker-c8361c0ad5c75212f16f53fa60772292 efe259e04ba98784102ba603941ecbbb75233c6b 3.4.0 25h
rendered-worker-cf0ffc3899672cb089c88c470682ea27 00143af1a51bedf0290496a6a97e47cf60b18693 3.4.0 16d
rendered-worker-e7ff5d837ac6c17c9dc7f0417eb30ce0 00143af1a51bedf0290496a6a97e47cf60b18693 3.4.0 16d
rendered-worker-efcaf1659d34673a2efabce6dc580638 00143af1a51bedf0290496a6a97e47cf60b18693 3.4.0 21d
Then I generated a backup YAML from that rendered worker.
$ oc get mc/rendered-worker-c8361c0ad5c75212f16f53fa60772292 -o yaml > rendered-mc-backup.yaml
Next I edited rendered-mc-backup.yaml, updated the name to the rendered worker the pool expected (rendered-worker-0483f3f9c265f75685e1b23edf5d261d), and commented out three lines. Below is just the relevant section of the rendered-mc-backup.yaml that I updated.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  annotations:
    machineconfiguration.openshift.io/generated-by-controller-version: efe259e04ba98784102ba603941ecbbb75233c6b
    machineconfiguration.openshift.io/release-image-version: 4.18.18
  #creationTimestamp: "2025-07-09T19:20:50Z"
  #generation: 1
  name: rendered-worker-0483f3f9c265f75685e1b23edf5d261d
  ownerReferences:
  - apiVersion: machineconfiguration.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: MachineConfigPool
    name: worker
    uid: 1653432e-a894-4a17-92f7-d3636c82efa9
  #resourceVersion: "9616677"
  #uid: 36507fb2-934b-46ac-ae8d-8e663678ad16
Then I deleted the original rendered-worker, the one that referenced the overlay image.
$ oc delete mc rendered-worker-0483f3f9c265f75685e1b23edf5d261d
machineconfig.machineconfiguration.openshift.io "rendered-worker-0483f3f9c265f75685e1b23edf5d261d" deleted
Next I recreated the rendered-worker with the same name as the one I just deleted, but using the known good backup state.
$ oc create -f rendered-mc-backup.yaml
machineconfig.machineconfiguration.openshift.io/rendered-worker-0483f3f9c265f75685e1b23edf5d261d created
And finally I touched the machine-config-daemon-force file on the node to force a reconciliation.
$ oc debug node/nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com -- touch /host/run/machine-config-daemon-force
Starting pod/nvd-srv-30nvidiaengrdu2dcredhatcom-debug-jbbmv ...
To use host binaries, run `chroot /host`
^C
Removing debug pod ...
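After forcing the daemon, the worker pool should settle back to an updated, non-degraded state, which can be watched with:
$ oc get mcp worker -w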
This got me back to a known good state, and I could proceed with adding my hugepage machineconfig and then reapplying the OVS-DOCA image to my cluster.
Hopefully this write-up provided some insight into the experimental and unsupported exercise of getting the OVS-DOCA version of Open vSwitch running on an OpenShift cluster.

