In a previous blog I wrote about using off-cluster layering to install NVIDIA's OVS-DOCA. In this blog I want to use on-cluster layering, which makes manageability even easier than the off-cluster approach. Before we begin, though, let's review what OVS-DOCA is all about.
What Is NVIDIA OVS-DOCA?
Open vSwitch (OVS) is a software-based network technology that enhances virtual machine (VM) communication within internal and external networks. Typically deployed in the hypervisor, OVS employs a software-based approach for packet switching, which can strain CPU resources, impacting system performance and network bandwidth utilization. Addressing this, NVIDIA's Accelerated Switching and Packet Processing (ASAP2) technology offloads OVS data-plane tasks to specialized hardware, like the embedded switch (eSwitch) within the NIC subsystem, while maintaining an unmodified OVS control-plane. This results in notably improved OVS performance without burdening the CPU.
NVIDIA's OVS-DOCA extends the traditional OVS-DPDK and OVS-Kernel data-path offload interfaces (DPIF), introducing OVS-DOCA as an additional DPIF implementation. OVS-DOCA, built upon NVIDIA's networking API, preserves the same interfaces as OVS-DPDK and OVS-Kernel while utilizing the DOCA Flow library through the additional OVS-DOCA DPIF. Unlike the other DPIFs (DPDK, Kernel), the OVS-DOCA DPIF exploits unique hardware offload mechanisms and application techniques, maximizing performance and features for NVIDIA NICs and DPUs. This mode is especially efficient due to its architecture and DOCA library integration, enhancing e-switch configuration and accelerating hardware offloads beyond what the other modes can achieve.
Disclaimer: The following workflow of replacing Open vSwitch with OVS-DOCA in OpenShift is not currently supported. While on-cluster layering itself is supported, the process of swapping out OVS for OVS-DOCA is purely experimental.
Workflow
The following on-cluster experiment of layering on NVIDIA OVS-DOCA was done on a single node OpenShift 4.20.15 environment on x86 architecture. We assume in this document that the following has already been configured, as it is outside the scope of this document:
- Hugepages have been configured by a machine configuration
- The Logical Volume Manager Storage Operator has been installed and an LVMCluster resource has been created
The workflow is broken down into five sections which cover environment configuration, creating the layer and then validating that it all works.
- Configure OpenShift Internal Registry
- Configure Configmap
- Generate MachineOSConfig Custom Resource
- Create MachineOSConfig Layer
- Validate MachineOSConfig Layer
Configure OpenShift Internal Registry
The first thing we need to do is configure the OpenShift Internal Registry, which is disabled by default. We will start by creating a persistent volume claim that the image registry will consume. We will need to generate the following custom resource file.
$ cat <<EOF > imageregistry-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-claim
  namespace: openshift-image-registry
  annotations:
    imageregistry.openshift.io: "true"
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 100Gi
  storageClassName: lvms-vg1
EOF
Once we have the file created we can create it on the cluster.
$ oc create -f imageregistry-pvc.yaml
persistentvolumeclaim/registry-claim created
The claim will show as Pending, which is normal at this point: the lvms-vg1 storage class typically uses WaitForFirstConsumer volume binding, so no volume is provisioned until the registry pod actually consumes the claim.
$ oc get pvc -n openshift-image-registry
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
registry-claim Pending lvms-vg1 <unset> 22s
Next we will enable the default route for the registry; this can take a few minutes to take effect.
$ oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
config.imageregistry.operator.openshift.io/cluster patched
We can proceed to allow the image registry operator to use the PVC we created above.
$ oc patch configs.imageregistry.operator.openshift.io/cluster --type=merge --patch '{"spec": {"storage":{"pvc":{"claim":"registry-claim"}}}}'
config.imageregistry.operator.openshift.io/cluster patched
Then we can also set the rollout strategy. Since this is a single node cluster, the Recreate strategy with a replica count of 1 will do.
$ oc patch config.imageregistry.operator.openshift.io/cluster --type=merge -p '{"spec":{"rolloutStrategy":"Recreate","replicas":1}}'
config.imageregistry.operator.openshift.io/cluster patched
Finally we will set the image registry management state to Managed, which signals the operator to start the registry.
$ oc patch configs.imageregistry.operator.openshift.io/cluster --type=merge --patch '{"spec":{"managementState":"Managed"}}'
config.imageregistry.operator.openshift.io/cluster patched
If everything went as planned we should now see the image registry pod running.
$ oc get pods -n openshift-image-registry
NAME READY STATUS RESTARTS AGE
cluster-image-registry-operator-6bcb795b48-jgzw9 1/1 Running 5 7h34m
image-registry-676f99fd9c-8gf4m 1/1 Running 0 58s
node-ca-bt8ns 1/1 Running 4 7h16m
We should also see that our persistent volume claim is now bound and also has an associated persistent volume from our lvms-vg1 storage class.
$ oc get pvc -n openshift-image-registry
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
registry-claim Bound pvc-c58510da-9eb9-4220-b101-4086f58b3311 100Gi RWO lvms-vg1 <unset> 6m
$ oc get pv -n openshift-image-registry
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE
pvc-c58510da-9eb9-4220-b101-4086f58b3311 100Gi RWO Delete Bound openshift-image-registry/registry-claim lvms-vg1 <unset> 2m34s
Now that our registry is running we need to make sure we can pull from and push to it. First let's set some environment variables for the internal registry address, the user, the namespace and the token.
$ export REGISTRY=image-registry.openshift-image-registry.svc:5000
$ export REGISTRY_USER=builder
$ export REGISTRY_NAMESPACE=openshift-machine-config-operator
$ export TOKEN=$(oc create token $REGISTRY_USER -n $REGISTRY_NAMESPACE --duration=$((900*24))h)
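The `--duration` flag on `oc create token` takes the value in hours here, and the arithmetic expansion just converts days to hours, so the command above requests a 900-day token. A quick sketch of the same arithmetic:

```shell
# 900 days expressed in hours, matching --duration=$((900*24))h above.
DAYS=900
DURATION="$(( DAYS * 24 ))h"
echo "$DURATION"
```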
Next let's create the push-secret using the variables we set in the openshift-machine-config-operator namespace.
$ oc create secret docker-registry push-secret -n openshift-machine-config-operator --docker-server=$REGISTRY --docker-username=$REGISTRY_USER --docker-password=$TOKEN
secret/push-secret created
Now we need to extract the push secret and the cluster's global pull secret.
$ oc extract secret/push-secret -n openshift-machine-config-operator --to=- > push-secret.json
# .dockerconfigjson
$ oc extract secret/pull-secret -n openshift-config --to=- > pull-secret.json
# .dockerconfigjson
We will now merge the push secret and the global pull secret into one combined secret.
$ jq -s '.[0] * .[1]' pull-secret.json push-secret.json > pull-and-push-secret.json
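To see what that jq expression is doing, here is a throwaway local sketch; the registry host registry.example.com and the base64 auth strings are made up for illustration, and only the image-registry service name matches the cluster:

```shell
# Two sample dockerconfigjson payloads (hypothetical contents).
cat > /tmp/pull-secret.json <<'EOF'
{"auths":{"registry.example.com":{"auth":"cHVsbC1zZWNyZXQ="}}}
EOF
cat > /tmp/push-secret.json <<'EOF'
{"auths":{"image-registry.openshift-image-registry.svc:5000":{"auth":"cHVzaC1zZWNyZXQ="}}}
EOF
# jq -s slurps both files into an array; .[0] * .[1] deep-merges the
# objects, so the two "auths" maps are combined rather than overwritten.
jq -s '.[0] * .[1]' /tmp/pull-secret.json /tmp/push-secret.json > /tmp/merged.json
jq '.auths | keys' /tmp/merged.json
```

Because `*` merges recursively, entries for the same registry in both files would be overwritten by the right-hand file, which is why the push secret is the second argument.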
This new merged secret also needs to be created in the openshift-machine-config-operator namespace.
$ oc create secret generic pull-and-push-secret -n openshift-machine-config-operator --from-file=.dockerconfigjson=pull-and-push-secret.json --type=kubernetes.io/dockerconfigjson
secret/pull-and-push-secret created
We can also validate the secrets have been created.
$ oc get secrets -n openshift-machine-config-operator |grep push
pull-and-push-secret kubernetes.io/dockerconfigjson 1 10s
push-secret kubernetes.io/dockerconfigjson 1 114s
At this point our registry is configured and running and we can move onto the next workflow.
Configure Configmap
With the Machine Config Operator doing the heavy lifting of building the OVS-DOCA RHCOS layer, we can use configuration maps to map values or files into the build. In this case we want to generate a ConfigMap that contains our DOCA repo definition. We will first generate the following custom resource file.
$ cat <<'EOF' > repos-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: etc-yum-repos-d
  namespace: openshift-machine-config-operator
data:
  doca.repo: |
    [doca]
    name=DOCA Online Repo
    baseurl=https://linux.mellanox.com/public/repo/doca/3.3.0/rhel9/x86_64/
    enabled=1
    gpgcheck=0
  centos.repo: |
    [baseos]
    name=CentOS Stream $releasever - BaseOS
    baseurl=https://mirror.stream.centos.org/9-stream/BaseOS/x86_64/os
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
    gpgcheck=0
    repo_gpgcheck=0
    metadata_expire=6h
    countme=1
    enabled=1
    [appstream]
    name=CentOS Stream $releasever - AppStream
    baseurl=https://mirror.stream.centos.org/9-stream/AppStream/x86_64/os
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
    gpgcheck=0
    repo_gpgcheck=0
    metadata_expire=6h
    countme=1
    enabled=1
    [CRB]
    name=CentOS Stream $releasever - CRB
    baseurl=https://mirror.stream.centos.org/9-stream/CRB/x86_64/os
    gpgcheck=1
    repo_gpgcheck=0
    metadata_expire=6h
    countme=1
    enabled=1
EOF
With the file created we will go ahead and create it on the cluster.
$ oc create -f repos-configmap.yaml
configmap/etc-yum-repos-d created
For sanity we can check that it was created.
$ oc get configmap -n openshift-machine-config-operator etc-yum-repos-d
NAME DATA AGE
etc-yum-repos-d 2 22s
What this ConfigMap does is mount the repo files into the builder container when we go to create the OVS-DOCA RHCOS layer. If everything looks good we can move onto the next step in the workflow.
Generate MachineOSConfig Custom Resource
When using on-cluster layering we have to create a MachineOSConfig that performs the steps a Containerfile would have done if we were building an off-cluster layer. For our on-cluster layer we need to ensure it does the following:
- Installs the dependencies needed
- Upgrades some of the existing packages to a version suitable for OVS-DOCA
- Removes the current Red Hat based version of Open vSwitch
- Replaces some packages with those from the DOCA repo
- Installs the doca-all bundle which contains the OVS-DOCA rpms, including Open vSwitch
- Note that since this is a SNO setup our metadata name and machineConfigPool is master. On a multi-node cluster this would most likely be worker or the name of a custom machineConfigPool.
Below is the custom resource file we will use.
$ cat <<'EOF' > on-cluster-rhcos-layer-mc.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineOSConfig
metadata:
  name: master
spec:
  machineConfigPool:
    name: master
  containerFile:
  - containerfileArch: NoArch
    content: |-
      FROM configs AS final
      RUN dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
          mkdir /var/opt && \
          dnf install -y https://mirror.stream.centos.org/9-stream/CRB/x86_64/os/Packages/libyaml-devel-0.2.5-7.el9.x86_64.rpm && \
          dnf install -y https://mirror.stream.centos.org/9-stream/CRB/x86_64/os/Packages/libpcap-devel-1.10.0-4.el9.x86_64.rpm && \
          dnf install -y https://mirror.stream.centos.org/9-stream/CRB/x86_64/os/Packages/libzip-devel-1.7.3-7.el9.x86_64.rpm && \
          dnf install -y libunwind jsoncpp openssl-devel kernel-devel kernel-headers && \
          dnf upgrade -y unbound-libs unbound bzip2-libs bzip2-devel && \
          rpm-ostree override remove openvswitch3.5 && \
          rpm-ostree override replace libibverbs rdma-core --experimental --from repo='doca' && \
          dnf install doca-all -y && \
          rm -r -f /etc/yum.repos.d/* && \
          dnf clean all -y && \
          bootc container lint
  imageBuilder:
    imageBuilderType: Job
  baseImagePullSecret:
    name: pull-and-push-secret
  renderedImagePushSecret:
    name: push-secret
  renderedImagePushSpec: image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/os-image:latest
EOF
If everything looks correct we can move onto the next section in the workflow.
Create MachineOSConfig Layer
At this point we are ready to create our MachineOSConfig which will do the following:
- Build the OVS-DOCA RHCOS layer image based on our requirements in the MachineOSConfig
- Push that OVS-DOCA RHCOS layer image to the local OpenShift registry
- Apply that OVS-DOCA RHCOS layer image to the system
- Reboot the node for the changes to take effect
To kick things off we need to create the MachineOSConfig on the cluster.
$ oc create -f on-cluster-rhcos-layer-mc.yaml
machineosconfig.machineconfiguration.openshift.io/master created
Next we can look at the state of the build process by checking the MachineOSBuild resource.
$ oc get machineOSbuild
NAME PREPARED BUILDING SUCCEEDED INTERRUPTED FAILED AGE
master-9c55a02933a10f5fc31c6bb5329e1f38 False True False False False 16s
While the MachineOSBuild is building we should notice that two additional pods were created in the openshift-machine-config-operator namespace: machine-os-builder and build-master.
$ oc get pods -n openshift-machine-config-operator
NAME READY STATUS RESTARTS AGE
build-master-9c55a02933a10f5fc31c6bb5329e1f38-qr262 0/1 Init:0/1 0 30s
kube-rbac-proxy-crio-nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com 1/1 Running 7 27h
machine-config-controller-847595d69d-q9jzb 2/2 Running 9 27h
machine-config-daemon-php6s 2/2 Running 14 27h
machine-config-nodes-crd-cleanup-29633172-tv2ml 0/1 Completed 0 27h
machine-config-operator-6d4cbf84b4-q4z6c 2/2 Running 9 27h
machine-config-server-vvxkk 1/1 Running 4 27h
machine-os-builder-9d9d855dd-9xjjv 1/1 Running 0 39s
The build-master pod is where the actual build happens, and we can watch the execution by tailing the logs of the image-build container inside it. Below is an example snippet (the full log can be incredibly long) from right at the end of a successful build.
$ oc logs -f -n openshift-machine-config-operator build-master-9c55a02933a10f5fc31c6bb5329e1f38-qr262 image-build
(...)
Copying blob sha256:25e5c12c08ced2f786717e0303aff37e3ce37f8e8171ae91fb298eee4e7af424
Copying blob sha256:7e6009a201a327b83674a7491fd077e59ecfb63b63c7a570359dbf84081c6aa0
Copying config sha256:a7f6fd648ae51526c4c113756eb1af4beac6c05c6c6f4988665f3258a89e42bf
Writing manifest to image destination
Once the image-build container has finished building the image we can go back and check the status of the MachineOSBuild to see that it succeeded.
$ oc get machineOSbuild
NAME PREPARED BUILDING SUCCEEDED INTERRUPTED FAILED AGE
master-9c55a02933a10f5fc31c6bb5329e1f38 False False True False False 7m16s
We will also notice that the master machineConfigPool is now in an updating state because the new RHCOS layer is being applied.
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-c4363e454702b9a652e29fb24f28c7c7 False True False 1 0 0 0 27h
worker rendered-worker-91833d06ff7a4b4563d843249bc12228 True False False 0 0 0 0 27h
Remember that as the OVS-DOCA RHCOS layer is applied the node will reboot. Once the node comes back the image should be applied, and we can confirm the update is complete with the following.
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-c4363e454702b9a652e29fb24f28c7c7 True False False 1 1 1 0 27h
worker rendered-worker-91833d06ff7a4b4563d843249bc12228 True False False 0 0 0 0 27h
If everything looks good we can move onto the next section of the workflow.
Validate MachineOSConfig Layer
Now that the RHCOS layer is applied to the system we can validate that the OVS-DOCA Open vSwitch is in use. First we need to open a debug pod on the node where the image was applied.
$ oc debug node/nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-26nvidiaengrdu2dcredhatcom-debug-hlcdm ...
To use host binaries, run `chroot /host`. Instead, if you need to access host namespaces, run `nsenter -a -t 1`.
Pod IP: 10.6.135.5
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
Next we can check which version of Open vSwitch is installed. The output below shows that we do in fact have the doca-openvswitch package installed.
sh-5.1# rpm -qa|grep openvswitch
openvswitch-selinux-extra-policy-1.0-39.el9fdp.noarch
doca-openvswitch-3.3.0040-1.el9.x86_64
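If a script needs to assert on that version rather than a human reading it, plain shell parameter expansion on the NVR string from the rpm output works. This is just an illustrative sketch using the value printed above:

```shell
# NVR string as printed by rpm -qa above.
nvr="doca-openvswitch-3.3.0040-1.el9.x86_64"
name="doca-openvswitch"
# Strip the package name prefix, then keep everything up to the release.
verrel="${nvr#"$name"-}"   # 3.3.0040-1.el9.x86_64
version="${verrel%%-*}"    # 3.3.0040
echo "$version"
```

On the node itself `rpm -q --qf '%{VERSION}\n' doca-openvswitch` returns the version field directly and avoids the string surgery.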
Next we can use the ovs-vsctl command to further confirm that the running Open vSwitch is in fact the DOCA version. In the output below we can see that datapath_types has doca listed, and dpdk_version references doca as well.
sh-5.1# ovs-vsctl list open_vswitch
_uuid : bb0f2e33-856e-49f3-b063-a25084cb7894
bridges : [7d465f8c-2332-4cd8-9115-f5f53161d838, e137b176-e5ee-4df8-8c79-60b364c2c368]
cur_cfg : 703
datapath_types : [doca, netdev, system]
datapaths : {system=748f4bb7-442e-479a-9923-811ee91408cc}
db_version : "8.5.1"
doca_initialized : false
doca_version : "3.3.0109"
dpdk_initialized : false
dpdk_version : "DPDK 25.11.0+doca2601"
external_ids : {hostname=nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com, ovn-bridge-mappings="physnet:br-ex", ovn-bridge-remote-probe-interval="0", ovn-enable-lflow-cache="true", ovn-encap-ip="10.6.135.5", ovn-encap-type=geneve, ovn-is-interconn="true", ovn-memlimit-lflow-cache-kb="1048576", ovn-monitor-all="true", ovn-ofctrl-wait-before-clear="0", ovn-remote="unix:/var/run/ovn/ovnsb_db.sock", ovn-remote-probe-interval="180000", ovn-set-local-ip="true", rundir="/var/run/openvswitch", system-id="01916ed0-7268-43dd-8355-68fef87a1761"}
iface_types : [bareudp, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, srv6, stt, system, tap, vxlan]
manager_options : []
next_cfg : 703
other_config : {bundle-idle-timeout="0", ovn-chassis-idx-01916ed0-7268-43dd-8355-68fef87a1761="", vlan-limit="0"}
ovs_version : "3.3.0040"
ssl : []
statistics : {}
system_type : rhel
system_version : "9.6"
Notice above that doca_initialized and dpdk_initialized were false. Let's enable DOCA initialization with the following setting and then restart openvswitch on the node.
sh-5.1# ovs-vsctl --no-wait set Open_vSwitch . other_config:doca-init=true
sh-5.1# systemctl restart openvswitch
And now if we look at the open_vswitch values again we can see they are now true.
sh-5.1# ovs-vsctl list open_vswitch
_uuid : bb0f2e33-856e-49f3-b063-a25084cb7894
bridges : [7b6aaed2-2e60-45bf-b98c-90e6806482dc, 7d465f8c-2332-4cd8-9115-f5f53161d838]
cur_cfg : 898
datapath_types : [doca, netdev, system]
datapaths : {system=748f4bb7-442e-479a-9923-811ee91408cc}
db_version : "8.5.1"
doca_initialized : true
doca_version : "3.3.0109"
dpdk_initialized : true
dpdk_version : "DPDK 25.11.0+doca2601"
external_ids : {hostname=nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com, ovn-bridge-mappings="physnet:br-ex", ovn-bridge-remote-probe-interval="0", ovn-enable-lflow-cache="true", ovn-encap-ip="10.6.135.5", ovn-encap-type=geneve, ovn-is-interconn="true", ovn-memlimit-lflow-cache-kb="1048576", ovn-monitor-all="true", ovn-ofctrl-wait-before-clear="0", ovn-remote="unix:/var/run/ovn/ovnsb_db.sock", ovn-remote-probe-interval="180000", ovn-set-local-ip="true", rundir="/var/run/openvswitch", system-id="01916ed0-7268-43dd-8355-68fef87a1761"}
iface_types : [bareudp, doca, docavdpa, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, srv6, stt, system, tap, vxlan]
manager_options : []
next_cfg : 898
other_config : {bundle-idle-timeout="0", doca-init="true", ovn-chassis-idx-01916ed0-7268-43dd-8355-68fef87a1761="", vlan-limit="0"}
ovs_version : "3.3.0040"
ssl : []
statistics : {}
system_type : rhel
system_version : "9.6"
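As an aside, if you want to script this validation instead of scanning the full listing, `ovs-vsctl get Open_vSwitch . doca_initialized` on the node returns the field directly. The same value can also be pulled out of a saved copy of the listing, for example:

```shell
# Saved snippet of the `ovs-vsctl list open_vswitch` output from above.
cat > /tmp/ovs-state.txt <<'EOF'
doca_initialized    : true
dpdk_initialized    : true
EOF
# Split on the colon, strip spaces; a health-check script could gate on this.
awk -F':' '/^doca_initialized/ {gsub(/ /,"",$2); print $2}' /tmp/ovs-state.txt
```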
One other thing I will note here is that in this SNO environment I went ahead and upgraded it from the starting 4.20.15 version to 4.21.11 with the OVS-DOCA RHCOS on-cluster image in place. After getting to 4.21.11 the OVS-DOCA Open vSwitch was still in place and running appropriately, and OpenShift showed no issues with the replacement or the upgrade process.
Hopefully this provided a good example of how to build and apply an on-cluster OVS-DOCA RHCOS layer for experimental purposes.
