Thursday, May 07, 2026

NVIDIA OVS-DOCA via On-Cluster Layering in OpenShift

In a previous blog I wrote about using off-cluster layering of NVIDIA's OVS-DOCA. In this blog I want to use on-cluster layering, which makes manageability even easier than the off-cluster approach. Before we begin, though, let's review what OVS-DOCA is all about.

What Is NVIDIA OVS-DOCA?

Open vSwitch (OVS) is a software-based network technology that enhances virtual machine (VM) communication within internal and external networks. Typically deployed in the hypervisor, OVS employs a software-based approach for packet switching, which can strain CPU resources, impacting system performance and network bandwidth utilization. Addressing this, NVIDIA's Accelerated Switching and Packet Processing (ASAP2) technology offloads OVS data-plane tasks to specialized hardware, like the embedded switch (eSwitch) within the NIC subsystem, while maintaining an unmodified OVS control-plane. This results in notably improved OVS performance without burdening the CPU.
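For reference, switching the kernel datapath into ASAP2 offload mode is typically just an OVS setting followed by a service restart. The sketch below is illustrative only (it assumes a host with a supported NVIDIA NIC) and is not a step in this workflow:

# Enable hardware offload of the OVS data-plane to the NIC eSwitch
$ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
$ systemctl restart openvswitch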

NVIDIA's OVS-DOCA extends the traditional OVS-DPDK and OVS-Kernel data-path offload interfaces (DPIF), introducing OVS-DOCA as an additional DPIF implementation. Built upon NVIDIA's networking API, OVS-DOCA preserves the same interfaces as OVS-DPDK and OVS-Kernel while utilizing the DOCA Flow library through the additional OVS-DOCA DPIF. Unlike the other DPIFs (DPDK, Kernel), the OVS-DOCA DPIF exploits unique hardware offload mechanisms and application techniques, maximizing performance and features for NVIDIA NICs and DPUs. This mode is especially efficient due to its architecture and DOCA library integration, enhancing e-switch configuration and accelerating hardware offloads beyond what the other modes can achieve.
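Because the interfaces are preserved, consuming the DOCA datapath looks much like OVS-DPDK from an operator's point of view. Below is a minimal sketch along the lines of NVIDIA's documentation; the bridge name and PCI address are hypothetical, and we will set doca-init ourselves on the cluster node in the validation section later:

# Turn on the DOCA datapath, then add a userspace (netdev) bridge and a DPDK-style port
$ ovs-vsctl --no-wait set Open_vSwitch . other_config:doca-init=true
$ ovs-vsctl add-br br0-ovs -- set Bridge br0-ovs datapath_type=netdev
$ ovs-vsctl add-port br0-ovs pf0 -- set Interface pf0 type=dpdk options:dpdk-devargs=0000:08:00.0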


Disclaimer: The following workflow of replacing OVS with OVS-DOCA in OpenShift is not currently supported. While on-cluster layering itself is supported, the process of switching out OVS for OVS-DOCA is purely experimental.

Workflow

The following on-cluster experiment of layering on NVIDIA OVS-DOCA was done on a single node OpenShift 4.20.15 environment on an x86 architecture. We assume in this document that the following has already been configured, as it's outside the scope of this document:

  • Hugepages have been configured by a machine configuration
  • The Logical Volume Manager Storage Operator has been installed and a LVM Cluster resource has been created

The workflow is broken down into five sections which cover environment configuration, creating the layer and then validating that it all works.

  • Configure OpenShift Internal Registry
  • Configure ConfigMap
  • Generate MachineOSConfig Custom Resource
  • Create MachineOSConfig Layer
  • Validate MachineOSConfig Layer

Configure OpenShift Internal Registry

The first thing we need to do is configure the OpenShift internal registry, which is disabled by default. We will start by creating a persistent volume claim that the image registry will consume. We will need to generate the following custom resource file.

$ cat <<EOF > imageregistry-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-claim
  namespace: openshift-image-registry
  annotations:
    imageregistry.openshift.io: "true"
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 100Gi
  storageClassName: lvms-vg1
EOF

Once we have the file we can create the resource on the cluster.

$ oc create -f imageregistry-pvc.yaml
persistentvolumeclaim/registry-claim created

The claim will show a Pending status, which is normal at this point since the storage class waits for a consumer before provisioning a volume.

$ oc get pvc -n openshift-image-registry
NAME             STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
registry-claim   Pending                                      lvms-vg1       <unset>                 22s

Next we will enable the default route for the registry; this could take a few minutes to take effect.

$ oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
config.imageregistry.operator.openshift.io/cluster patched

We can proceed to allow the image registry operator to use the PVC we created above.

$ oc patch configs.imageregistry.operator.openshift.io/cluster --type=merge --patch '{"spec":{"storage":{"pvc":{"claim":"registry-claim"}}}}'
config.imageregistry.operator.openshift.io/cluster patched

Then we can also set the rollout strategy. Since this is a single-node cluster, a replica count of 1 will do.

$ oc patch config.imageregistry.operator.openshift.io/cluster --type=merge -p '{"spec":{"rolloutStrategy":"Recreate","replicas":1}}'
config.imageregistry.operator.openshift.io/cluster patched

Finally we will set the image registry to Managed, which will signal the operator to start the registry.

$ oc patch configs.imageregistry.operator.openshift.io/cluster --type=merge --patch '{"spec":{"managementState":"Managed"}}'
config.imageregistry.operator.openshift.io/cluster patched

If everything went as planned we should now see the image registry pod running.

$ oc get pods -n openshift-image-registry
NAME                                               READY   STATUS    RESTARTS   AGE
cluster-image-registry-operator-6bcb795b48-jgzw9   1/1     Running   5          7h34m
image-registry-676f99fd9c-8gf4m                    1/1     Running   0          58s
node-ca-bt8ns                                      1/1     Running   4          7h16m

We should also see that our persistent volume claim is now bound and has an associated persistent volume from our lvms-vg1 storage class.

$ oc get pvc -n openshift-image-registry
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
registry-claim   Bound    pvc-c58510da-9eb9-4220-b101-4086f58b3311   100Gi      RWO            lvms-vg1       <unset>                 6m

$ oc get pv -n openshift-image-registry
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                      STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
pvc-c58510da-9eb9-4220-b101-4086f58b3311   100Gi      RWO            Delete           Bound    openshift-image-registry/registry-claim   lvms-vg1       <unset>                          2m34s

Now that our registry is running we need to make sure we can push and pull images. First let's set some environment variables for our internal registry, the user, the namespace, and the token creation.

$ export REGISTRY=image-registry.openshift-image-registry.svc:5000
$ export REGISTRY_USER=builder
$ export REGISTRY_NAMESPACE=openshift-machine-config-operator
$ export TOKEN=$(oc create token $REGISTRY_USER -n $REGISTRY_NAMESPACE --duration=$((900*24))h)
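Optionally, before creating any secrets, we can sanity check that the token actually authenticates against the default route we exposed earlier. This is just an illustrative check from a workstation; --tls-verify=false is used because the route serves a cluster-signed certificate in a lab like this:

$ HOST=$(oc get route default-route -n openshift-image-registry -o jsonpath='{.spec.host}')
$ podman login -u $REGISTRY_USER -p $TOKEN --tls-verify=false $HOST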

Next let's create the push-secret in the openshift-machine-config-operator namespace using the variables we just set.

$ oc create secret docker-registry push-secret -n openshift-machine-config-operator --docker-server=$REGISTRY --docker-username=$REGISTRY_USER --docker-password=$TOKEN
secret/push-secret created

Now we need to extract the push secret and the cluster's global pull secret.

$ oc extract secret/push-secret -n openshift-machine-config-operator --to=- > push-secret.json
# .dockerconfigjson
$ oc extract secret/pull-secret -n openshift-config --to=- > pull-secret.json
# .dockerconfigjson

We will now merge the push secret and the global pull secret into a single file.

$ jq -s '.[0] * .[1]' pull-secret.json push-secret.json > pull-and-push-secret.json
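The -s flag slurps both files into a single array, and .[0] * .[1] does a recursive merge of the two objects, so the resulting file carries the auths entries from both secrets. Abbreviated for illustration, the merged file looks something like this (the registry names here are examples and the credentials are truncated):

{
  "auths": {
    "quay.io": { "auth": "..." },
    "registry.redhat.io": { "auth": "..." },
    "image-registry.openshift-image-registry.svc:5000": { "auth": "..." }
  }
}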

This new merged secret also needs to be created in the openshift-machine-config-operator namespace.

$ oc create secret generic pull-and-push-secret -n openshift-machine-config-operator --from-file=.dockerconfigjson=pull-and-push-secret.json --type=kubernetes.io/dockerconfigjson
secret/pull-and-push-secret created

We can also validate that the secrets have been created.

$ oc get secrets -n openshift-machine-config-operator | grep push
pull-and-push-secret   kubernetes.io/dockerconfigjson   1      10s
push-secret            kubernetes.io/dockerconfigjson   1      114s
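Before moving on we can also confirm that the image registry cluster operator itself reports healthy; we want to see Available True and Degraded False.

$ oc get co image-registry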

At this point our registry is configured and running, and we can move on to the next workflow.

Configure ConfigMap

The Machine Config Operator will be doing the heavy lifting of building the OVS-DOCA RHCOS layer, and it lets us use ConfigMaps to map values or files into the build. In this case we want to generate a ConfigMap that contains our DOCA repo, along with CentOS Stream repos for dependencies. We will first generate the following custom resource file.

$ cat <<'EOF' > repos-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: etc-yum-repos-d
  namespace: openshift-machine-config-operator
data:
  doca.repo: |
    [doca]
    name=DOCA Online Repo
    baseurl=https://linux.mellanox.com/public/repo/doca/3.3.0/rhel9/x86_64/
    enabled=1
    gpgcheck=0
  centos.repo: |
    [baseos]
    name=CentOS Stream $releasever - BaseOS
    baseurl=https://mirror.stream.centos.org/9-stream/BaseOS/x86_64/os
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
    gpgcheck=0
    repo_gpgcheck=0
    metadata_expire=6h
    countme=1
    enabled=1
    [appstream]
    name=CentOS Stream $releasever - AppStream
    baseurl=https://mirror.stream.centos.org/9-stream/AppStream/x86_64/os
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
    gpgcheck=0
    repo_gpgcheck=0
    metadata_expire=6h
    countme=1
    enabled=1
    [CRB]
    name=CentOS Stream $releasever - CRB
    baseurl=https://mirror.stream.centos.org/9-stream/CRB/x86_64/os
    gpgcheck=1
    repo_gpgcheck=0
    metadata_expire=6h
    countme=1
    enabled=1
EOF

With the file created we will go ahead and create it on the cluster.

$ oc create -f repos-configmap.yaml
configmap/etc-yum-repos-d created

As a sanity check we can confirm that it was created.

$ oc get configmap -n openshift-machine-config-operator etc-yum-repos-d
NAME              DATA   AGE
etc-yum-repos-d   2      22s
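Since the build will pull packages from these repos over the network, it is also worth a quick spot check that the repo metadata is reachable from the environment (repodata/repomd.xml is the standard dnf metadata entry point); this is optional, not a required step:

$ curl -sI https://linux.mellanox.com/public/repo/doca/3.3.0/rhel9/x86_64/repodata/repomd.xml | head -1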

This ConfigMap will mount the repo files into the builder container when we go to create the OVS-DOCA RHCOS layer. If everything looks good we can move on to the next step in the workflow.

Generate MachineOSConfig Custom Resource

When using on-cluster layering we have to create a MachineOSConfig that performs the steps a Dockerfile would have done if we were building an off-cluster layer. For our on-cluster layer we need to ensure it does the following:

  • Installs the dependencies needed
  • Upgrades some of the existing packages to a version suitable for OVS-DOCA
  • Removes the current Red Hat based version of Open vSwitch
  • Replaces some packages with those from the DOCA repo
  • Installs the doca-all bundle which contains the OVS-DOCA rpms, including Open vSwitch
  • Note that since this is a SNO setup our metadata name and machineConfigPool are master. On a multi-node cluster this would most likely be worker or the name of a custom machineConfigPool.

Below is the custom resource file we will use.

$ cat <<EOF > on-cluster-rhcos-layer-mc.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineOSConfig
metadata:
  name: master
spec:
  machineConfigPool:
    name: master
  containerFile:
  - containerfileArch: NoArch
    content: |-
      FROM configs AS final
      RUN dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
          mkdir /var/opt && \
          dnf install -y https://mirror.stream.centos.org/9-stream/CRB/x86_64/os/Packages/libyaml-devel-0.2.5-7.el9.x86_64.rpm && \
          dnf install -y https://mirror.stream.centos.org/9-stream/CRB/x86_64/os/Packages/libpcap-devel-1.10.0-4.el9.x86_64.rpm && \
          dnf install -y https://mirror.stream.centos.org/9-stream/CRB/x86_64/os/Packages/libzip-devel-1.7.3-7.el9.x86_64.rpm && \
          dnf install -y libunwind jsoncpp openssl-devel kernel-devel kernel-headers && \
          dnf upgrade -y unbound-libs unbound bzip2-libs bzip2-devel && \
          rpm-ostree override remove openvswitch3.5 && \
          rpm-ostree override replace libibverbs rdma-core --experimental --from repo='doca' && \
          dnf install doca-all -y && \
          rm -r -f /etc/yum.repos.d/* && \
          dnf clean all -y && \
          bootc container lint
  imageBuilder:
    imageBuilderType: Job
  baseImagePullSecret:
    name: pull-and-push-secret
  renderedImagePushSecret:
    name: push-secret
  renderedImagePushSpec: image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/os-image:latest
EOF

If everything looks correct we can move on to the next section in the workflow.

Create MachineOSConfig Layer

At this point we are ready to create our MachineOSConfig which will do the following:

  • Build the OVS-DOCA RHCOS layer image based on our requirements in the MachineOSConfig
  • Push that OVS-DOCA RHCOS layer image to the local OpenShift registry
  • Apply that OVS-DOCA RHCOS layer image to the system
  • Reboot the node for the changes to take effect

To kick things off we need to create the MachineOSConfig on the cluster.

$ oc create -f on-cluster-rhcos-layer-mc.yaml
machineosconfig.machineconfiguration.openshift.io/master created

Next we can look at the state of the build process by checking the MachineOSBuild resource.

$ oc get machineosbuild
NAME                                      PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED   AGE
master-9c55a02933a10f5fc31c6bb5329e1f38   False      True       False       False         False    16s
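Rather than polling, we can also watch the resource transition between states with the standard watch flag:

$ oc get machineosbuild -w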

While the MachineOSBuild is building we should notice that two additional pods were created in the openshift-machine-config-operator namespace: machine-os-builder and build-master.

$ oc get pods -n openshift-machine-config-operator
NAME                                                            READY   STATUS      RESTARTS   AGE
build-master-9c55a02933a10f5fc31c6bb5329e1f38-qr262             0/1     Init:0/1    0          30s
kube-rbac-proxy-crio-nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com   1/1     Running     7          27h
machine-config-controller-847595d69d-q9jzb                      2/2     Running     9          27h
machine-config-daemon-php6s                                     2/2     Running     14         27h
machine-config-nodes-crd-cleanup-29633172-tv2ml                 0/1     Completed   0          27h
machine-config-operator-6d4cbf84b4-q4z6c                        2/2     Running     9          27h
machine-config-server-vvxkk                                     1/1     Running     4          27h
machine-os-builder-9d9d855dd-9xjjv                              1/1     Running     0          39s

The build-master pod is where the actual build happens, and we can watch its execution by tailing the logs of the image-build container inside it. Below is an example snippet (the full log can be incredibly long) from right at the end of a successful build.

$ oc logs -f -n openshift-machine-config-operator build-master-9c55a02933a10f5fc31c6bb5329e1f38-qr262 image-build
(...)
Copying blob sha256:25e5c12c08ced2f786717e0303aff37e3ce37f8e8171ae91fb298eee4e7af424
Copying blob sha256:7e6009a201a327b83674a7491fd077e59ecfb63b63c7a570359dbf84081c6aa0
Copying config sha256:a7f6fd648ae51526c4c113756eb1af4beac6c05c6c6f4988665f3258a89e42bf
Writing manifest to image destination

Once the image-build container has finished building the image we can go back and check the status of the MachineOSBuild to see that it succeeded.

$ oc get machineosbuild
NAME                                      PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED   AGE
master-9c55a02933a10f5fc31c6bb5329e1f38   False      False      True        False         False    7m16s

We will also notice that the master machineConfigPool is now in an updating state because the new RHCOS layer is being applied.

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c4363e454702b9a652e29fb24f28c7c7   False     True       False      1              0                   0                     0                      27h
worker   rendered-worker-91833d06ff7a4b4563d843249bc12228   True      False      False      0              0                   0                     0                      27h

Remember that as the OVS-DOCA RHCOS layer is being applied the node will reboot. Once the node comes back up the image should be applied, and we can see that the update is complete from the following.

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c4363e454702b9a652e29fb24f28c7c7   True      False      False      1              1                   1                     0                      27h
worker   rendered-worker-91833d06ff7a4b4563d843249bc12228   True      False      False      0              0                   0                     0                      27h
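If we were scripting this, an alternative to polling oc get mcp is to block until the pool reports the Updated condition; the timeout value here is arbitrary:

$ oc wait mcp/master --for=condition=Updated --timeout=30m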

If everything looks good we can move on to the next section of the workflow.

Validate MachineOSConfig Layer

Now that the RHCOS layer is applied to the system we can validate that the OVS-DOCA Open vSwitch is in use. First we need to open a debug pod on the node where the image was applied.

$ oc debug node/nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-26nvidiaengrdu2dcredhatcom-debug-hlcdm ...
To use host binaries, run `chroot /host`. Instead, if you need to access host namespaces, run `nsenter -a -t 1`.
Pod IP: 10.6.135.5
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host

Next we can check which version of Open vSwitch is installed. The output below shows that we do in fact have the doca-openvswitch package installed.

sh-5.1# rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-39.el9fdp.noarch
doca-openvswitch-3.3.0040-1.el9.x86_64
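Another quick check from the same shell is rpm-ostree status (output omitted here), which on a node with an on-cluster layer applied should show the booted deployment referencing the os-image push spec in our internal registry:

sh-5.1# rpm-ostree status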

Next we can use the ovs-vsctl command to check further whether the running Open vSwitch is in fact the DOCA version. In the output below we can see that datapath_types lists doca, and the dpdk_version references doca as well.

sh-5.1# ovs-vsctl list open_vswitch
_uuid               : bb0f2e33-856e-49f3-b063-a25084cb7894
bridges             : [7d465f8c-2332-4cd8-9115-f5f53161d838, e137b176-e5ee-4df8-8c79-60b364c2c368]
cur_cfg             : 703
datapath_types      : [doca, netdev, system]
datapaths           : {system=748f4bb7-442e-479a-9923-811ee91408cc}
db_version          : "8.5.1"
doca_initialized    : false
doca_version        : "3.3.0109"
dpdk_initialized    : false
dpdk_version        : "DPDK 25.11.0+doca2601"
external_ids        : {hostname=nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com, ovn-bridge-mappings="physnet:br-ex", ovn-bridge-remote-probe-interval="0", ovn-enable-lflow-cache="true", ovn-encap-ip="10.6.135.5", ovn-encap-type=geneve, ovn-is-interconn="true", ovn-memlimit-lflow-cache-kb="1048576", ovn-monitor-all="true", ovn-ofctrl-wait-before-clear="0", ovn-remote="unix:/var/run/ovn/ovnsb_db.sock", ovn-remote-probe-interval="180000", ovn-set-local-ip="true", rundir="/var/run/openvswitch", system-id="01916ed0-7268-43dd-8355-68fef87a1761"}
iface_types         : [bareudp, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, srv6, stt, system, tap, vxlan]
manager_options     : []
next_cfg            : 703
other_config        : {bundle-idle-timeout="0", ovn-chassis-idx-01916ed0-7268-43dd-8355-68fef87a1761="", vlan-limit="0"}
ovs_version         : "3.3.0040"
ssl                 : []
statistics          : {}
system_type         : rhel
system_version      : "9.6"

Notice above that doca_initialized and dpdk_initialized were false. Let's enable them with the following, and then restart Open vSwitch on the node.

sh-5.1# ovs-vsctl --no-wait set Open_vSwitch . other_config:doca-init=true
sh-5.1# systemctl restart openvswitch

And now if we look at the open_vswitch values again we can see they are both true.

sh-5.1# ovs-vsctl list open_vswitch
_uuid               : bb0f2e33-856e-49f3-b063-a25084cb7894
bridges             : [7b6aaed2-2e60-45bf-b98c-90e6806482dc, 7d465f8c-2332-4cd8-9115-f5f53161d838]
cur_cfg             : 898
datapath_types      : [doca, netdev, system]
datapaths           : {system=748f4bb7-442e-479a-9923-811ee91408cc}
db_version          : "8.5.1"
doca_initialized    : true
doca_version        : "3.3.0109"
dpdk_initialized    : true
dpdk_version        : "DPDK 25.11.0+doca2601"
external_ids        : {hostname=nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com, ovn-bridge-mappings="physnet:br-ex", ovn-bridge-remote-probe-interval="0", ovn-enable-lflow-cache="true", ovn-encap-ip="10.6.135.5", ovn-encap-type=geneve, ovn-is-interconn="true", ovn-memlimit-lflow-cache-kb="1048576", ovn-monitor-all="true", ovn-ofctrl-wait-before-clear="0", ovn-remote="unix:/var/run/ovn/ovnsb_db.sock", ovn-remote-probe-interval="180000", ovn-set-local-ip="true", rundir="/var/run/openvswitch", system-id="01916ed0-7268-43dd-8355-68fef87a1761"}
iface_types         : [bareudp, doca, docavdpa, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, srv6, stt, system, tap, vxlan]
manager_options     : []
next_cfg            : 898
other_config        : {bundle-idle-timeout="0", doca-init="true", ovn-chassis-idx-01916ed0-7268-43dd-8355-68fef87a1761="", vlan-limit="0"}
ovs_version         : "3.3.0040"
ssl                 : []
statistics          : {}
system_type         : rhel
system_version      : "9.6"
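As a further check once real traffic is flowing through the DOCA datapath, OVS can report which flows were actually offloaded to hardware. I did not capture that output for this post, but the standard way to look is:

sh-5.1# ovs-appctl dpctl/dump-flows type=offloaded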

One other thing I will note here: in this SNO environment I went ahead and upgraded from the starting 4.20.15 version to 4.21.11 with the OVS-DOCA RHCOS on-cluster image in place. After getting to 4.21.11 the OVS-DOCA Open vSwitch was still in place and running appropriately. OpenShift showed no issues with the package replacement or the upgrade process.

Hopefully this provided a good example of how to build and apply an on-cluster OVS-DOCA RHCOS layer for experimental purposes.