Thursday, May 07, 2026

NVIDIA OVS-DOCA via On-Cluster Layering in OpenShift

In a previous blog I wrote about using off-cluster layering of NVIDIA's OVS-DOCA. In this blog I want to use on-cluster layering, which makes manageability even easier than the off-cluster approach. Before we begin, though, let's review what OVS-DOCA is all about.

What Is NVIDIA OVS-DOCA?

Open vSwitch (OVS) is a software-based network technology that enhances virtual machine (VM) communication within internal and external networks. Typically deployed in the hypervisor, OVS employs a software-based approach for packet switching, which can strain CPU resources, impacting system performance and network bandwidth utilization. Addressing this, NVIDIA's Accelerated Switching and Packet Processing (ASAP2) technology offloads OVS data-plane tasks to specialized hardware, like the embedded switch (eSwitch) within the NIC subsystem, while maintaining an unmodified OVS control-plane. This results in notably improved OVS performance without burdening the CPU.
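For reference, switching the kernel datapath into ASAP2 offload mode is typically just an OVS setting followed by a service restart. The sketch below is illustrative only (it assumes a host with a supported NVIDIA NIC) and is not a step in this workflow:

# Enable hardware offload of the OVS data-plane to the NIC eSwitch
$ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
$ systemctl restart openvswitch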

NVIDIA's OVS-DOCA extends the traditional OVS-DPDK and OVS-Kernel data-path offload interfaces (DPIF), introducing OVS-DOCA as an additional DPIF implementation. Built upon NVIDIA's networking API, OVS-DOCA preserves the same interfaces as OVS-DPDK and OVS-Kernel while utilizing the DOCA Flow library through the additional OVS-DOCA DPIF. Unlike the other DPIFs (DPDK, Kernel), the OVS-DOCA DPIF exploits unique hardware offload mechanisms and application techniques, maximizing performance and features for NVIDIA NICs and DPUs. This mode is especially efficient due to its architecture and DOCA library integration, enhancing e-switch configuration and accelerating hardware offloads beyond what the other modes can achieve.
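Because the interfaces are preserved, consuming the DOCA datapath looks much like OVS-DPDK from an operator's point of view. Below is a minimal sketch along the lines of NVIDIA's documentation; the bridge name and PCI address are hypothetical, and we will set doca-init ourselves on the cluster node in the validation section later:

# Turn on the DOCA datapath, then add a userspace (netdev) bridge and a DPDK-style port
$ ovs-vsctl --no-wait set Open_vSwitch . other_config:doca-init=true
$ ovs-vsctl add-br br0-ovs -- set Bridge br0-ovs datapath_type=netdev
$ ovs-vsctl add-port br0-ovs pf0 -- set Interface pf0 type=dpdk options:dpdk-devargs=0000:08:00.0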


Disclaimer: The following workflow of replacing OVS with OVS-DOCA in OpenShift is not currently supported. While on-cluster layering itself is supported, the process of switching out OVS for OVS-DOCA is purely experimental.

Workflow

The following on-cluster experiment of layering on NVIDIA OVS-DOCA was done on a single node OpenShift 4.20.15 environment on an x86 architecture. We assume in this document that the following has already been configured, as it's outside the scope of this document:

  • Hugepages have been configured by a machine configuration
  • The Logical Volume Manager Storage Operator has been installed and a LVM Cluster resource has been created

The workflow is broken down into five sections which cover environment configuration, creating the layer and then validating that it all works.

  • Configure OpenShift Internal Registry
  • Configure ConfigMap
  • Generate MachineOSConfig Custom Resource
  • Create MachineOSConfig Layer
  • Validate MachineOSConfig Layer

Configure OpenShift Internal Registry

The first thing we need to do is configure the OpenShift internal registry, which is disabled by default. We will start by creating a persistent volume claim that the image registry will consume. We will need to generate the following custom resource file.

$ cat <<EOF > imageregistry-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-claim
  namespace: openshift-image-registry
  annotations:
    imageregistry.openshift.io: "true"
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 100Gi
  storageClassName: lvms-vg1
EOF

Once we have the file we can create the resource on the cluster.

$ oc create -f imageregistry-pvc.yaml
persistentvolumeclaim/registry-claim created

The claim will show a Pending status, which is normal at this point since the storage class waits for a consumer before provisioning a volume.

$ oc get pvc -n openshift-image-registry
NAME             STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
registry-claim   Pending                                      lvms-vg1       <unset>                 22s

Next we will enable the default route for the registry; this could take a few minutes to take effect.

$ oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
config.imageregistry.operator.openshift.io/cluster patched

We can proceed to allow the image registry operator to use the PVC we created above.

$ oc patch configs.imageregistry.operator.openshift.io/cluster --type=merge --patch '{"spec":{"storage":{"pvc":{"claim":"registry-claim"}}}}'
config.imageregistry.operator.openshift.io/cluster patched

Then we can also set the rollout strategy. Since this is a single-node cluster, a replica count of 1 will do.

$ oc patch config.imageregistry.operator.openshift.io/cluster --type=merge -p '{"spec":{"rolloutStrategy":"Recreate","replicas":1}}'
config.imageregistry.operator.openshift.io/cluster patched

Finally we will set the image registry to Managed, which will signal the operator to start the registry.

$ oc patch configs.imageregistry.operator.openshift.io/cluster --type=merge --patch '{"spec":{"managementState":"Managed"}}'
config.imageregistry.operator.openshift.io/cluster patched

If everything went as planned we should now see the image registry pod running.

$ oc get pods -n openshift-image-registry
NAME                                               READY   STATUS    RESTARTS   AGE
cluster-image-registry-operator-6bcb795b48-jgzw9   1/1     Running   5          7h34m
image-registry-676f99fd9c-8gf4m                    1/1     Running   0          58s
node-ca-bt8ns                                      1/1     Running   4          7h16m

We should also see that our persistent volume claim is now bound and has an associated persistent volume from our lvms-vg1 storage class.

$ oc get pvc -n openshift-image-registry
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
registry-claim   Bound    pvc-c58510da-9eb9-4220-b101-4086f58b3311   100Gi      RWO            lvms-vg1       <unset>                 6m

$ oc get pv -n openshift-image-registry
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                      STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
pvc-c58510da-9eb9-4220-b101-4086f58b3311   100Gi      RWO            Delete           Bound    openshift-image-registry/registry-claim   lvms-vg1       <unset>                          2m34s

Now that our registry is running we need to make sure we can push and pull images. First let's set some environment variables for our internal registry, the user, the namespace, and the token creation.

$ export REGISTRY=image-registry.openshift-image-registry.svc:5000
$ export REGISTRY_USER=builder
$ export REGISTRY_NAMESPACE=openshift-machine-config-operator
$ export TOKEN=$(oc create token $REGISTRY_USER -n $REGISTRY_NAMESPACE --duration=$((900*24))h)
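Optionally, before creating any secrets, we can sanity check that the token actually authenticates against the default route we exposed earlier. This is just an illustrative check from a workstation; --tls-verify=false is used because the route serves a cluster-signed certificate in a lab like this:

$ HOST=$(oc get route default-route -n openshift-image-registry -o jsonpath='{.spec.host}')
$ podman login -u $REGISTRY_USER -p $TOKEN --tls-verify=false $HOST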

Next let's create the push-secret in the openshift-machine-config-operator namespace using the variables we just set.

$ oc create secret docker-registry push-secret -n openshift-machine-config-operator --docker-server=$REGISTRY --docker-username=$REGISTRY_USER --docker-password=$TOKEN
secret/push-secret created

Now we need to extract the push secret and the cluster's global pull secret.

$ oc extract secret/push-secret -n openshift-machine-config-operator --to=- > push-secret.json
# .dockerconfigjson
$ oc extract secret/pull-secret -n openshift-config --to=- > pull-secret.json
# .dockerconfigjson

We will now merge the push secret and the global pull secret into a single file.

$ jq -s '.[0] * .[1]' pull-secret.json push-secret.json > pull-and-push-secret.json
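The -s flag slurps both files into a single array, and .[0] * .[1] does a recursive merge of the two objects, so the resulting file carries the auths entries from both secrets. Abbreviated for illustration, the merged file looks something like this (the registry names here are examples and the credentials are truncated):

{
  "auths": {
    "quay.io": { "auth": "..." },
    "registry.redhat.io": { "auth": "..." },
    "image-registry.openshift-image-registry.svc:5000": { "auth": "..." }
  }
}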

This new merged secret also needs to be created in the openshift-machine-config-operator namespace.

$ oc create secret generic pull-and-push-secret -n openshift-machine-config-operator --from-file=.dockerconfigjson=pull-and-push-secret.json --type=kubernetes.io/dockerconfigjson
secret/pull-and-push-secret created

We can also validate that the secrets have been created.

$ oc get secrets -n openshift-machine-config-operator | grep push
pull-and-push-secret   kubernetes.io/dockerconfigjson   1      10s
push-secret            kubernetes.io/dockerconfigjson   1      114s
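Before moving on we can also confirm that the image registry cluster operator itself reports healthy; we want to see Available True and Degraded False.

$ oc get co image-registry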

At this point our registry is configured and running, and we can move on to the next workflow.

Configure ConfigMap

The Machine Config Operator will be doing the heavy lifting of building the OVS-DOCA RHCOS layer, and it lets us use ConfigMaps to map values or files into the build. In this case we want to generate a ConfigMap that contains our DOCA repo, along with CentOS Stream repos for dependencies. We will first generate the following custom resource file.

$ cat <<'EOF' > repos-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: etc-yum-repos-d
  namespace: openshift-machine-config-operator
data:
  doca.repo: |
    [doca]
    name=DOCA Online Repo
    baseurl=https://linux.mellanox.com/public/repo/doca/3.3.0/rhel9/x86_64/
    enabled=1
    gpgcheck=0
  centos.repo: |
    [baseos]
    name=CentOS Stream $releasever - BaseOS
    baseurl=https://mirror.stream.centos.org/9-stream/BaseOS/x86_64/os
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
    gpgcheck=0
    repo_gpgcheck=0
    metadata_expire=6h
    countme=1
    enabled=1
    [appstream]
    name=CentOS Stream $releasever - AppStream
    baseurl=https://mirror.stream.centos.org/9-stream/AppStream/x86_64/os
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
    gpgcheck=0
    repo_gpgcheck=0
    metadata_expire=6h
    countme=1
    enabled=1
    [CRB]
    name=CentOS Stream $releasever - CRB
    baseurl=https://mirror.stream.centos.org/9-stream/CRB/x86_64/os
    gpgcheck=1
    repo_gpgcheck=0
    metadata_expire=6h
    countme=1
    enabled=1
EOF

With the file created we will go ahead and create it on the cluster.

$ oc create -f repos-configmap.yaml
configmap/etc-yum-repos-d created

As a sanity check we can confirm that it was created.

$ oc get configmap -n openshift-machine-config-operator etc-yum-repos-d
NAME              DATA   AGE
etc-yum-repos-d   2      22s
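Since the build will pull packages from these repos over the network, it is also worth a quick spot check that the repo metadata is reachable from the environment (repodata/repomd.xml is the standard dnf metadata entry point); this is optional, not a required step:

$ curl -sI https://linux.mellanox.com/public/repo/doca/3.3.0/rhel9/x86_64/repodata/repomd.xml | head -1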

This ConfigMap will mount the repo files into the builder container when we go to create the OVS-DOCA RHCOS layer. If everything looks good we can move on to the next step in the workflow.

Generate MachineOSConfig Custom Resource

When using on-cluster layering we have to create a MachineOSConfig that performs the steps a Dockerfile would have done if we were building an off-cluster layer. For our on-cluster layer we need to ensure it does the following:

  • Installs the dependencies needed
  • Upgrades some of the existing packages to a version suitable for OVS-DOCA
  • Removes the current Red Hat based version of Open vSwitch
  • Replaces some packages with those from the DOCA repo
  • Installs the doca-all bundle which contains the OVS-DOCA rpms, including Open vSwitch
  • Note that since this is a SNO setup our metadata name and machineConfigPool are master. On a multi-node cluster this would most likely be worker or the name of a custom machineConfigPool.

Below is the custom resource file we will use.

$ cat <<EOF > on-cluster-rhcos-layer-mc.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineOSConfig
metadata:
  name: master
spec:
  machineConfigPool:
    name: master
  containerFile:
  - containerfileArch: NoArch
    content: |-
      FROM configs AS final
      RUN dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
          mkdir /var/opt && \
          dnf install -y https://mirror.stream.centos.org/9-stream/CRB/x86_64/os/Packages/libyaml-devel-0.2.5-7.el9.x86_64.rpm && \
          dnf install -y https://mirror.stream.centos.org/9-stream/CRB/x86_64/os/Packages/libpcap-devel-1.10.0-4.el9.x86_64.rpm && \
          dnf install -y https://mirror.stream.centos.org/9-stream/CRB/x86_64/os/Packages/libzip-devel-1.7.3-7.el9.x86_64.rpm && \
          dnf install -y libunwind jsoncpp openssl-devel kernel-devel kernel-headers && \
          dnf upgrade -y unbound-libs unbound bzip2-libs bzip2-devel && \
          rpm-ostree override remove openvswitch3.5 && \
          rpm-ostree override replace libibverbs rdma-core --experimental --from repo='doca' && \
          dnf install doca-all -y && \
          rm -r -f /etc/yum.repos.d/* && \
          dnf clean all -y && \
          bootc container lint
  imageBuilder:
    imageBuilderType: Job
  baseImagePullSecret:
    name: pull-and-push-secret
  renderedImagePushSecret:
    name: push-secret
  renderedImagePushSpec: image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/os-image:latest
EOF

If everything looks correct we can move on to the next section in the workflow.

Create MachineOSConfig Layer

At this point we are ready to create our MachineOSConfig which will do the following:

  • Build the OVS-DOCA RHCOS layer image based on our requirements in the MachineOSConfig
  • Push that OVS-DOCA RHCOS layer image to the local OpenShift registry
  • Apply that OVS-DOCA RHCOS layer image to the system
  • Reboot the node for the changes to take effect

To kick things off we need to create the MachineOSConfig on the cluster.

$ oc create -f on-cluster-rhcos-layer-mc.yaml
machineosconfig.machineconfiguration.openshift.io/master created

Next we can look at the state of the build process by checking the MachineOSBuild resource.

$ oc get machineosbuild
NAME                                      PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED   AGE
master-9c55a02933a10f5fc31c6bb5329e1f38   False      True       False       False         False    16s
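Rather than polling, we can also watch the resource transition between states with the standard watch flag:

$ oc get machineosbuild -w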

While the MachineOSBuild is building we should notice that two additional pods were created in the openshift-machine-config-operator namespace: machine-os-builder and build-master.

$ oc get pods -n openshift-machine-config-operator
NAME                                                            READY   STATUS      RESTARTS   AGE
build-master-9c55a02933a10f5fc31c6bb5329e1f38-qr262             0/1     Init:0/1    0          30s
kube-rbac-proxy-crio-nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com   1/1     Running     7          27h
machine-config-controller-847595d69d-q9jzb                      2/2     Running     9          27h
machine-config-daemon-php6s                                     2/2     Running     14         27h
machine-config-nodes-crd-cleanup-29633172-tv2ml                 0/1     Completed   0          27h
machine-config-operator-6d4cbf84b4-q4z6c                        2/2     Running     9          27h
machine-config-server-vvxkk                                     1/1     Running     4          27h
machine-os-builder-9d9d855dd-9xjjv                              1/1     Running     0          39s

The build-master pod is where the actual build happens, and we can watch its execution by tailing the logs of the image-build container inside it. Below is an example snippet (the full log can be incredibly long) from right at the end of a successful build.

$ oc logs -f -n openshift-machine-config-operator build-master-9c55a02933a10f5fc31c6bb5329e1f38-qr262 image-build
(...)
Copying blob sha256:25e5c12c08ced2f786717e0303aff37e3ce37f8e8171ae91fb298eee4e7af424
Copying blob sha256:7e6009a201a327b83674a7491fd077e59ecfb63b63c7a570359dbf84081c6aa0
Copying config sha256:a7f6fd648ae51526c4c113756eb1af4beac6c05c6c6f4988665f3258a89e42bf
Writing manifest to image destination

Once the image-build container has finished building the image we can go back and check the status of the MachineOSBuild to see that it succeeded.

$ oc get machineosbuild
NAME                                      PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED   AGE
master-9c55a02933a10f5fc31c6bb5329e1f38   False      False      True        False         False    7m16s

We will also notice that the master machineConfigPool is now in an updating state because the new RHCOS layer is being applied.

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c4363e454702b9a652e29fb24f28c7c7   False     True       False      1              0                   0                     0                      27h
worker   rendered-worker-91833d06ff7a4b4563d843249bc12228   True      False      False      0              0                   0                     0                      27h

Remember that as the OVS-DOCA RHCOS layer is being applied the node will reboot. Once the node comes back up the image should be applied, and we can see that the update is complete from the following.

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c4363e454702b9a652e29fb24f28c7c7   True      False      False      1              1                   1                     0                      27h
worker   rendered-worker-91833d06ff7a4b4563d843249bc12228   True      False      False      0              0                   0                     0                      27h
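If we were scripting this, an alternative to polling oc get mcp is to block until the pool reports the Updated condition; the timeout value here is arbitrary:

$ oc wait mcp/master --for=condition=Updated --timeout=30m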

If everything looks good we can move on to the next section of the workflow.

Validate MachineOSConfig Layer

Now that the RHCOS layer is applied to the system we can validate that the OVS-DOCA Open vSwitch is in use. First we need to open a debug pod on the node where the image was applied.

$ oc debug node/nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-26nvidiaengrdu2dcredhatcom-debug-hlcdm ...
To use host binaries, run `chroot /host`. Instead, if you need to access host namespaces, run `nsenter -a -t 1`.
Pod IP: 10.6.135.5
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host

Next we can check which version of Open vSwitch is installed. The output below shows that we do in fact have the doca-openvswitch package installed.

sh-5.1# rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-39.el9fdp.noarch
doca-openvswitch-3.3.0040-1.el9.x86_64
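Another quick check from the same shell is rpm-ostree status (output omitted here), which on a node with an on-cluster layer applied should show the booted deployment referencing the os-image push spec in our internal registry:

sh-5.1# rpm-ostree status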

Next we can use the ovs-vsctl command to check further whether the running Open vSwitch is in fact the DOCA version. In the output below we can see that datapath_types lists doca, and the dpdk_version references doca as well.

sh-5.1# ovs-vsctl list open_vswitch
_uuid               : bb0f2e33-856e-49f3-b063-a25084cb7894
bridges             : [7d465f8c-2332-4cd8-9115-f5f53161d838, e137b176-e5ee-4df8-8c79-60b364c2c368]
cur_cfg             : 703
datapath_types      : [doca, netdev, system]
datapaths           : {system=748f4bb7-442e-479a-9923-811ee91408cc}
db_version          : "8.5.1"
doca_initialized    : false
doca_version        : "3.3.0109"
dpdk_initialized    : false
dpdk_version        : "DPDK 25.11.0+doca2601"
external_ids        : {hostname=nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com, ovn-bridge-mappings="physnet:br-ex", ovn-bridge-remote-probe-interval="0", ovn-enable-lflow-cache="true", ovn-encap-ip="10.6.135.5", ovn-encap-type=geneve, ovn-is-interconn="true", ovn-memlimit-lflow-cache-kb="1048576", ovn-monitor-all="true", ovn-ofctrl-wait-before-clear="0", ovn-remote="unix:/var/run/ovn/ovnsb_db.sock", ovn-remote-probe-interval="180000", ovn-set-local-ip="true", rundir="/var/run/openvswitch", system-id="01916ed0-7268-43dd-8355-68fef87a1761"}
iface_types         : [bareudp, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, srv6, stt, system, tap, vxlan]
manager_options     : []
next_cfg            : 703
other_config        : {bundle-idle-timeout="0", ovn-chassis-idx-01916ed0-7268-43dd-8355-68fef87a1761="", vlan-limit="0"}
ovs_version         : "3.3.0040"
ssl                 : []
statistics          : {}
system_type         : rhel
system_version      : "9.6"

Notice above that doca_initialized and dpdk_initialized were false. Let's enable them with the following, and then restart Open vSwitch on the node.

sh-5.1# ovs-vsctl --no-wait set Open_vSwitch . other_config:doca-init=true
sh-5.1# systemctl restart openvswitch

And now if we look at the open_vswitch values again we can see they are both true.

sh-5.1# ovs-vsctl list open_vswitch
_uuid               : bb0f2e33-856e-49f3-b063-a25084cb7894
bridges             : [7b6aaed2-2e60-45bf-b98c-90e6806482dc, 7d465f8c-2332-4cd8-9115-f5f53161d838]
cur_cfg             : 898
datapath_types      : [doca, netdev, system]
datapaths           : {system=748f4bb7-442e-479a-9923-811ee91408cc}
db_version          : "8.5.1"
doca_initialized    : true
doca_version        : "3.3.0109"
dpdk_initialized    : true
dpdk_version        : "DPDK 25.11.0+doca2601"
external_ids        : {hostname=nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com, ovn-bridge-mappings="physnet:br-ex", ovn-bridge-remote-probe-interval="0", ovn-enable-lflow-cache="true", ovn-encap-ip="10.6.135.5", ovn-encap-type=geneve, ovn-is-interconn="true", ovn-memlimit-lflow-cache-kb="1048576", ovn-monitor-all="true", ovn-ofctrl-wait-before-clear="0", ovn-remote="unix:/var/run/ovn/ovnsb_db.sock", ovn-remote-probe-interval="180000", ovn-set-local-ip="true", rundir="/var/run/openvswitch", system-id="01916ed0-7268-43dd-8355-68fef87a1761"}
iface_types         : [bareudp, doca, docavdpa, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, srv6, stt, system, tap, vxlan]
manager_options     : []
next_cfg            : 898
other_config        : {bundle-idle-timeout="0", doca-init="true", ovn-chassis-idx-01916ed0-7268-43dd-8355-68fef87a1761="", vlan-limit="0"}
ovs_version         : "3.3.0040"
ssl                 : []
statistics          : {}
system_type         : rhel
system_version      : "9.6"
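As a further check once real traffic is flowing through the DOCA datapath, OVS can report which flows were actually offloaded to hardware. I did not capture that output for this post, but the standard way to look is:

sh-5.1# ovs-appctl dpctl/dump-flows type=offloaded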

One other thing I will note here: in this SNO environment I went ahead and upgraded from the starting 4.20.15 version to 4.21.11 with the OVS-DOCA RHCOS on-cluster image in place. After getting to 4.21.11 the OVS-DOCA Open vSwitch was still in place and running appropriately. OpenShift showed no issues with the package replacement or the upgrade process.

Hopefully this provided a good example of how to build and apply an on-cluster OVS-DOCA RHCOS layer for experimental purposes.