Saturday, January 04, 2025

RDMA with NVIDIA on OpenShift


The rise of artificial intelligence (AI) has generated some really challenging problems with data movement. In traditional environments, if I needed to move data from one node to another, it would have to be handled by the central processing unit (CPU) of the host. While this is reasonable for small amounts of data, a better and more efficient method is needed for AI workloads and their large datasets.

To solve this challenge we can use RDMA (remote direct memory access), which enables direct memory access from the memory of one compute node to the memory of another without involving the CPUs of the hosts. This enables high-throughput, low-latency networking, which is especially useful in massive compute clusters with large datasets.

The rest of this blog will cover examples of using RDMA with NVIDIA's Network Operator and GPU Operator along with Red Hat OpenShift Container Platform. The three primary examples covered in this document will be: RDMA Shared Device, RDMA Host Device and RDMA in Legacy SRIOV.

Lab Environment

The following configurations and testing were done on an OpenShift environment that consisted of the following:

  • OpenShift 4.16.19 x86
  • Network Operator 24.10
  • All other operators used the default values for OCP 4.16.
  • 3 physical nodes: 1 SNO master, 2 workers
  • The workers were Dell R760xa servers, each with two NVIDIA BlueField-3 (BF3) cards.
  • One BF3 card was attached to the NVIDIA Spectrum SN5600 switch for RDMA over Ethernet
  • One BF3 card was attached to the NVIDIA Quantum QM9700 switch for RDMA over InfiniBand

Blacklist IRDMA Module

On some systems, including the Dell R760xa I used for testing, the irdma kernel module creates problems for the NVIDIA Network Operator when the DOCA drivers are unloaded and reloaded, so we need to blacklist it with a machine configuration that gets applied to all worker nodes.

Generate the following machine configuration file yaml specifying the module irdma to blacklist.

$ cat <<EOF > 99-machine-config-blacklist-irdma.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-blacklist-irdma
spec:
  kernelArguments:
    - "module_blacklist=irdma"
EOF

Then create the machine configuration on the cluster and wait for the worker nodes to reboot.

$ oc create -f 99-machine-config-blacklist-irdma.yaml machineconfig.machineconfiguration.openshift.io/99-worker-blacklist-irdma created
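If you want to track the rollout, you can watch the worker MachineConfigPool until the UPDATED column returns to True, and then confirm the kernel argument landed on a worker. This is just a convenience check; the node name below is from my lab and would differ in other environments.

$ oc get mcp worker -w
$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com -- chroot /host cat /proc/cmdline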

Validate in a debug pod on each node that the module has not loaded.

$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-32nvidiaengrdu2dcredhatcom-debug-btfj2 ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.11
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# lsmod|grep irdma
sh-5.1#

At this point, if everything looks good, we can move onto the next steps of the workflow.

Persistent Naming Rules

Sometimes there is a need to make sure the device names persist across reboots. On the R760xa systems, where the nodes had a large number of networking cards, I noticed the Mellanox devices were being renamed on reboot, so I decided to use a MachineConfig to set up persistent naming.

First, gather the MAC addresses of the interfaces that need to persist on each worker node and decide on the names those interfaces should keep. We will call the file 70-persistent-net.rules and stash the details in it.
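If the MAC addresses are not already documented, one way to collect them (a quick sketch using one of my lab node names) is to list the links from a debug pod on each worker and note the addresses of the Mellanox interfaces.

$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com -- chroot /host ip -br link show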

$ cat <<EOF > 70-persistent-net.rules
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:28",ATTR{type}=="1",NAME="ibs2f0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:29",ATTR{type}=="1",NAME="ens8f0np0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d0",ATTR{type}=="1",NAME="ibs2f0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d1",ATTR{type}=="1",NAME="ens8f0np0"
EOF

Now we need to convert that file into a base64 string without line breaks and set the output to the variable PERSIST.

$ PERSIST=`cat 70-persistent-net.rules| base64 -w 0`
$ echo $PERSIST
U1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIK
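As a quick sanity check, the variable can be decoded and compared against the original rules file before it is embedded in the machine configuration; any difference here means the encoding step went wrong.

$ echo $PERSIST | base64 -d | diff - 70-persistent-net.rules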

Now we can create a machine configuration and set the base64 encoding in our custom resource file.  Notice how I am using the PERSIST variable in my yaml creation to mitigate copy/paste type errors.

$ cat <<EOF > 99-machine-config-udev-network.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-machine-config-udev-network
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - contents:
            source: data:text/plain;base64,$PERSIST
          filesystem: root
          mode: 420
          path: /etc/udev/rules.d/70-persistent-net.rules
EOF

Once we have the machine configuration we can create it on the cluster.

$ oc create -f 99-machine-config-udev-network.yaml
machineconfig.machineconfiguration.openshift.io/99-machine-config-udev-network created

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-9adfe851c2c14d9598eea5ec3df6c187   True      False      False      1              1                   1                     0                      6h21m
worker   rendered-worker-4568f1b174066b4b1a4de794cf538fee   False     True       False      2              0                   0                     0                      6h21m

The worker nodes will reboot, and once the UPDATING field goes back to False we can validate the device names on the nodes by looking at the interfaces in a debug pod if we choose to do so.
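For example, a quick check from a debug pod might look like the following, where the interface names are the ones set in my rules file.

$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com -- chroot /host ip -br link | grep -E 'ibs2f0|ens8f0np0'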

If everything looks good we can move onto configuring the operators of the OpenShift cluster.

Install and Configure Required Operators

This next section will cover the installation and configurations of the required operators we need for the RDMA testing.

Install and Configure NFD Operator

The Node Feature Discovery (NFD) operator manages the detection of hardware features and configuration in an OpenShift Container Platform cluster by labeling the nodes with hardware-specific information. NFD labels the host with node-specific attributes, such as PCI cards, kernel, operating system version, and so on.

To get started we will generate an NFD Operator custom resource file that will create the namespace, operator group and subscription.

$ cat <<EOF > nfd-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
  - openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: "stable"
  installPlanApproval: Automatic
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

Next we can create the resources on the cluster.

$ oc create -f nfd-operator.yaml namespace/openshift-nfd created operatorgroup.operators.coreos.com/openshift-nfd created subscription.operators.coreos.com/nfd created

We can validate that the operator is installed and running by looking at the pods in the openshift-nfd namespace.

$ oc get pods -n openshift-nfd NAME READY STATUS RESTARTS AGE nfd-controller-manager-8698c88cdd-t8gbc 2/2 Running 0 2m

With the NFD controller running we can move onto generating the NodeFeatureDiscovery instance and adding it to the cluster.

The ClusterServiceVersion specification for NFD operator provides default values, including the NFD operand image that is part of the operator payload. We retrieve its value with the following command line and assign it to the variable NFD_OPERAND_IMAGE.

$ NFD_OPERAND_IMAGE=`echo $(oc get csv -n openshift-nfd -o json | jq -r '.items[0].metadata.annotations["alm-examples"]') | jq -r '.[] | select(.kind == "NodeFeatureDiscovery") | .spec.operand.image'`
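It is worth echoing the variable to confirm it resolved to an image reference before using it in the next step; an empty value would indicate the jq query did not match the installed CSV.

$ echo $NFD_OPERAND_IMAGE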

We can now create the NodeFeatureDiscovery instance. Note that we add entries to the default deviceClassWhitelist field in order to support more network adapters, such as the NVIDIA BlueField DPUs and the NVIDIA GPUs.

$ cat <<EOF > nfd-instance.yaml
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  instance: ''
  operand:
    image: '${NFD_OPERAND_IMAGE}'
    servicePort: 12000
  prunerOnDelete: false
  topologyUpdater: false
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
      sources:
        pci:
          deviceClassWhitelist:
            - "02"
            - "03"
            - "0200"
            - "0207"
            - "12"
          deviceLabelFields:
            - "vendor"
EOF

$ oc create -f nfd-instance.yaml
nodefeaturediscovery.nfd.openshift.io/nfd-instance created

Finally we can validate our instance is up and running by again looking at the pods under the openshift-nfd namespace.

$ oc get pods -n openshift-nfd NAME READY STATUS RESTARTS AGE nfd-controller-manager-7cb6d656-jcnqb 2/2 Running 0 4m nfd-gc-7576d64889-s28k9 1/1 Running 0 21s nfd-master-b7bcf5cfd-qnrmz 1/1 Running 0 21s nfd-worker-96pfh 1/1 Running 0 21s nfd-worker-b2gkg 1/1 Running 0 21s nfd-worker-bd9bk 1/1 Running 0 21s nfd-worker-cswf4 1/1 Running 0 21s nfd-worker-kp6gg 1/1 Running 0 21s

After a minute or so, we can verify that NFD has added labels to the node. The NFD labels are prefixed with feature.node.kubernetes.io, so we can easily filter them.

$ oc get node -o json | jq '.items[0].metadata.labels | with_entries(select(.key | startswith("feature.node.kubernetes.io")))' { "feature.node.kubernetes.io/cpu-cpuid.ADX": "true", "feature.node.kubernetes.io/cpu-cpuid.AESNI": "true", "feature.node.kubernetes.io/cpu-cpuid.AVX": "true", "feature.node.kubernetes.io/cpu-cpuid.AVX2": "true", "feature.node.kubernetes.io/cpu-cpuid.CETSS": "true", "feature.node.kubernetes.io/cpu-cpuid.CLZERO": "true", "feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8": "true", "feature.node.kubernetes.io/cpu-cpuid.CPBOOST": "true", "feature.node.kubernetes.io/cpu-cpuid.EFER_LMSLE_UNS": "true", "feature.node.kubernetes.io/cpu-cpuid.FMA3": "true", "feature.node.kubernetes.io/cpu-cpuid.FP256": "true", "feature.node.kubernetes.io/cpu-cpuid.FSRM": "true", "feature.node.kubernetes.io/cpu-cpuid.FXSR": "true", "feature.node.kubernetes.io/cpu-cpuid.FXSROPT": "true", "feature.node.kubernetes.io/cpu-cpuid.IBPB": "true", "feature.node.kubernetes.io/cpu-cpuid.IBRS": "true", "feature.node.kubernetes.io/cpu-cpuid.IBRS_PREFERRED": "true", "feature.node.kubernetes.io/cpu-cpuid.IBRS_PROVIDES_SMP": "true", "feature.node.kubernetes.io/cpu-cpuid.IBS": "true", "feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT": "true", "feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM": "true", "feature.node.kubernetes.io/cpu-cpuid.IBSFFV": "true", "feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT": "true", "feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT": "true", "feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM": "true", "feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT": "true", "feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK": "true", "feature.node.kubernetes.io/cpu-cpuid.IBS_FETCH_CTLX": "true", "feature.node.kubernetes.io/cpu-cpuid.IBS_OPFUSE": "true", "feature.node.kubernetes.io/cpu-cpuid.IBS_PREVENTHOST": "true", "feature.node.kubernetes.io/cpu-cpuid.INT_WBINVD": "true", "feature.node.kubernetes.io/cpu-cpuid.INVLPGB": "true", "feature.node.kubernetes.io/cpu-cpuid.LAHF": "true", "feature.node.kubernetes.io/cpu-cpuid.LBRVIRT": "true", "feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW": "true", "feature.node.kubernetes.io/cpu-cpuid.MCOMMIT": "true", "feature.node.kubernetes.io/cpu-cpuid.MOVBE": "true", "feature.node.kubernetes.io/cpu-cpuid.MOVU": "true", "feature.node.kubernetes.io/cpu-cpuid.MSRIRC": "true", "feature.node.kubernetes.io/cpu-cpuid.MSR_PAGEFLUSH": "true", "feature.node.kubernetes.io/cpu-cpuid.NRIPS": "true", "feature.node.kubernetes.io/cpu-cpuid.OSXSAVE": "true", "feature.node.kubernetes.io/cpu-cpuid.PPIN": "true", "feature.node.kubernetes.io/cpu-cpuid.PSFD": "true", "feature.node.kubernetes.io/cpu-cpuid.RDPRU": "true", "feature.node.kubernetes.io/cpu-cpuid.SEV": "true", "feature.node.kubernetes.io/cpu-cpuid.SEV_64BIT": "true", "feature.node.kubernetes.io/cpu-cpuid.SEV_ALTERNATIVE": "true", "feature.node.kubernetes.io/cpu-cpuid.SEV_DEBUGSWAP": "true", "feature.node.kubernetes.io/cpu-cpuid.SEV_ES": "true", "feature.node.kubernetes.io/cpu-cpuid.SEV_RESTRICTED": "true", "feature.node.kubernetes.io/cpu-cpuid.SEV_SNP": "true", "feature.node.kubernetes.io/cpu-cpuid.SHA": "true", "feature.node.kubernetes.io/cpu-cpuid.SME": "true", "feature.node.kubernetes.io/cpu-cpuid.SME_COHERENT": "true", "feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD": "true", "feature.node.kubernetes.io/cpu-cpuid.SSE4A": "true", "feature.node.kubernetes.io/cpu-cpuid.STIBP": "true", "feature.node.kubernetes.io/cpu-cpuid.STIBP_ALWAYSON": "true", "feature.node.kubernetes.io/cpu-cpuid.SUCCOR": "true", 
"feature.node.kubernetes.io/cpu-cpuid.SVM": "true", "feature.node.kubernetes.io/cpu-cpuid.SVMDA": "true", "feature.node.kubernetes.io/cpu-cpuid.SVMFBASID": "true", "feature.node.kubernetes.io/cpu-cpuid.SVML": "true", "feature.node.kubernetes.io/cpu-cpuid.SVMNP": "true", "feature.node.kubernetes.io/cpu-cpuid.SVMPF": "true", "feature.node.kubernetes.io/cpu-cpuid.SVMPFT": "true", "feature.node.kubernetes.io/cpu-cpuid.SYSCALL": "true", "feature.node.kubernetes.io/cpu-cpuid.SYSEE": "true", "feature.node.kubernetes.io/cpu-cpuid.TLB_FLUSH_NESTED": "true", "feature.node.kubernetes.io/cpu-cpuid.TOPEXT": "true", "feature.node.kubernetes.io/cpu-cpuid.TSCRATEMSR": "true", "feature.node.kubernetes.io/cpu-cpuid.VAES": "true", "feature.node.kubernetes.io/cpu-cpuid.VMCBCLEAN": "true", "feature.node.kubernetes.io/cpu-cpuid.VMPL": "true", "feature.node.kubernetes.io/cpu-cpuid.VMSA_REGPROT": "true", "feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ": "true", "feature.node.kubernetes.io/cpu-cpuid.VTE": "true", "feature.node.kubernetes.io/cpu-cpuid.WBNOINVD": "true", "feature.node.kubernetes.io/cpu-cpuid.X87": "true", "feature.node.kubernetes.io/cpu-cpuid.XGETBV1": "true", "feature.node.kubernetes.io/cpu-cpuid.XSAVE": "true", "feature.node.kubernetes.io/cpu-cpuid.XSAVEC": "true", "feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT": "true", "feature.node.kubernetes.io/cpu-cpuid.XSAVES": "true", "feature.node.kubernetes.io/cpu-hardware_multithreading": "false", "feature.node.kubernetes.io/cpu-model.family": "25", "feature.node.kubernetes.io/cpu-model.id": "1", "feature.node.kubernetes.io/cpu-model.vendor_id": "AMD", "feature.node.kubernetes.io/kernel-config.NO_HZ": "true", "feature.node.kubernetes.io/kernel-config.NO_HZ_FULL": "true", "feature.node.kubernetes.io/kernel-selinux.enabled": "true", "feature.node.kubernetes.io/kernel-version.full": "5.14.0-427.35.1.el9_4.x86_64", "feature.node.kubernetes.io/kernel-version.major": "5", "feature.node.kubernetes.io/kernel-version.minor": "14", "feature.node.kubernetes.io/kernel-version.revision": "0", "feature.node.kubernetes.io/memory-numa": "true", "feature.node.kubernetes.io/network-sriov.capable": "true", "feature.node.kubernetes.io/pci-102b.present": "true", "feature.node.kubernetes.io/pci-10de.present": "true", "feature.node.kubernetes.io/pci-10de.sriov.capable": "true", "feature.node.kubernetes.io/pci-15b3.present": "true", "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true", "feature.node.kubernetes.io/rdma.available": "true", "feature.node.kubernetes.io/rdma.capable": "true", "feature.node.kubernetes.io/storage-nonrotationaldisk": "true", "feature.node.kubernetes.io/system-os_release.ID": "rhcos", "feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION": "4.17", "feature.node.kubernetes.io/system-os_release.OSTREE_VERSION": "417.94.202409121747-0", "feature.node.kubernetes.io/system-os_release.RHEL_VERSION": "9.4", "feature.node.kubernetes.io/system-os_release.VERSION_ID": "4.17", "feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "4", "feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "17" }

Finally we can confirm that the Mellanox network devices (PCI vendor 15b3) were discovered.

$ oc describe node | grep -E 'Roles|pci' | grep pci-15b3 feature.node.kubernetes.io/pci-15b3.present=true feature.node.kubernetes.io/pci-15b3.sriov.capable=true feature.node.kubernetes.io/pci-15b3.present=true feature.node.kubernetes.io/pci-15b3.sriov.capable=true

If everything looks good we can move onto the next operator.

Install and Configure NMState Operator

There might be a need to configure network interfaces on the nodes that were not configured at initial cluster creation time, and the NMState operator is designed for those use cases. The first step is to create a custom resource file that contains the namespace, operator group and subscription.

$ cat <<EOF > nmstate-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  labels:
    kubernetes.io/metadata.name: openshift-nmstate
    name: openshift-nmstate
  name: openshift-nmstate
spec:
  finalizers:
  - kubernetes
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  annotations:
    olm.providedAPIs: NMState.v1.nmstate.io
  name: openshift-nmstate
  namespace: openshift-nmstate
spec:
  targetNamespaces:
  - openshift-nmstate
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/kubernetes-nmstate-operator.openshift-nmstate: ""
  name: kubernetes-nmstate-operator
  namespace: openshift-nmstate
spec:
  channel: stable
  installPlanApproval: Automatic
  name: kubernetes-nmstate-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

Then we can take the custom resource file and create it on the cluster.

$ oc create -f nmstate-operator.yaml namespace/openshift-nmstate created operatorgroup.operators.coreos.com/openshift-nmstate created subscription.operators.coreos.com/kubernetes-nmstate-operator created

Next we should validate the operator is up and running.

$ oc get pods -n openshift-nmstate NAME READY STATUS RESTARTS AGE nmstate-operator-d587966c9-qkl5m 1/1 Running 0 43s

An NMState instance is required, so we will create a custom resource file for that.

$ cat <<EOF > nmstate-instance.yaml
apiVersion: nmstate.io/v1
kind: NMState
metadata:
  name: nmstate
EOF

Then we will create the instance on the cluster.

$ oc create -f nmstate-instance.yaml nmstate.nmstate.io/nmstate created

Finally we will validate the instance is running.

$ oc get pods -n openshift-nmstate NAME READY STATUS RESTARTS AGE nmstate-cert-manager-6dc78dc6bf-ds7kj 1/1 Running 0 17s nmstate-console-plugin-5b7595c56c-tgzbw 1/1 Running 0 17s nmstate-handler-lxkd5 1/1 Running 0 17s nmstate-operator-d587966c9-qkl5m 1/1 Running 0 3m27s nmstate-webhook-54dbd47d9d-cvsf6 0/1 Running 0 17s

Next we can build a NodeNetworkConfigurationPolicy. The example below will configure a static IP address on the ens8f0np0 interface on nvd-srv-32.

$ cat <<EOF > nncp-static-ip.yaml
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: ens8f0np0-policy
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  desiredState:
    interfaces:
    - name: ens8f0np0
      description: Configuring ens8f0np0 on nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
      type: ethernet
      state: up
      ipv4:
        dhcp: false
        address:
        - ip: 10.6.145.32
          prefix-length: 24
        enabled: true
EOF

Once we have the custom resource file we can create it on the cluster.

$ oc create -f nncp-static-ip.yaml
nodenetworkconfigurationpolicy.nmstate.io/ens8f0np0-policy created

$ oc get nncp -A
NAME               STATUS      REASON
ens8f0np0-policy   Available   SuccessfullyConfigured

We can validate that the IP address is set by looking at the interface inside the node.

$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-32nvidiaengrdu2dcredhatcom-debug-8mx6q ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.11
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ip address show dev ens8f0np0
96: ens8f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 58:a2:e1:e1:42:78 brd ff:ff:ff:ff:ff:ff
    altname enp160s0f0np0
    inet 10.6.145.32/24 brd 10.6.145.255 scope global noprefixroute ens8f0np0
       valid_lft forever preferred_lft forever
    inet6 fe80::c397:5afa:d618:e752/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
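As an alternative to the debug pod, the NMState operator also exposes a NodeNetworkState resource per node that can be queried for the same information. The jsonpath below is just a sketch of that kind of check using my node and interface names.

$ oc get nns nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com -o jsonpath='{.status.currentState.interfaces[?(@.name=="ens8f0np0")].ipv4}'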

If everything looks good we can proceed to the next operator.

Install and Configure SRIOV Operator

Now we need to create the SRIOV Operator custom resource file to create the namespace, operator group and subscription.

$ cat << EOF > openshift-sriov-network-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-sriov-network-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: sriov-network-operators
  namespace: openshift-sriov-network-operator
spec:
  targetNamespaces:
  - openshift-sriov-network-operator
  upgradeStrategy: Default
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: sriov-network-operator-subscription
  namespace: openshift-sriov-network-operator
spec:
  channel: stable
  installPlanApproval: Automatic
  name: sriov-network-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

Now we can create the SRIOV resource on the cluster.

$ oc create -f openshift-sriov-network-operator.yaml namespace/openshift-sriov-network-operator created operatorgroup.operators.coreos.com/sriov-network-operators created subscription.operators.coreos.com/sriov-network-operator-subscription created

We can validate the operator is running by looking at the pod output.

$ oc get pods -n openshift-sriov-network-operator NAME READY STATUS RESTARTS AGE sriov-network-operator-7cb6c49868-89486 1/1 Running 0 22s

Next we will need to create the default SriovOperatorConfig configuration file.

$ cat <<EOF > sriov-operator-config.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  enableInjector: true
  enableOperatorWebhook: true
  logLevel: 2
EOF

Then create the resource on the cluster.

$ oc create -f sriov-operator-config.yaml sriovoperatorconfig.sriovnetwork.openshift.io/default created

For the default SriovOperatorConfig to work with the MLNX_OFED container, please run the following patch command.

$ oc patch sriovoperatorconfig default --type=merge -n openshift-sriov-network-operator --patch '{ "spec": { "configDaemonNodeSelector": { "network.nvidia.com/operator.mofed.wait": "false", "node-role.kubernetes.io/worker": "", "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true" } } }' sriovoperatorconfig.sriovnetwork.openshift.io/default patched

If everything looks good we can proceed to installing the next operator.

Install and Configure Network Operator

To get started we will generate an NVIDIA Network Operator custom resource file that will create the namespace, operator group and subscription.

$ cat <<EOF > network-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-network-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-network-operator
  namespace: nvidia-network-operator
spec:
  targetNamespaces:
  - nvidia-network-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nvidia-network-operator
  namespace: nvidia-network-operator
spec:
  channel: v24.10.0
  installPlanApproval: Automatic
  name: nvidia-network-operator
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF

Next we can create the resources on the cluster.

$ oc create -f network-operator.yaml namespace/nvidia-network-operator created operatorgroup.operators.coreos.com/nvidia-network-operator created subscription.operators.coreos.com/nvidia-network-operator created

We can then validate that the network operator has installed and is running by confirming the controller is running in the nvidia-network-operator namespace.

$ oc get pods -n nvidia-network-operator NAME READY STATUS RESTARTS AGE nvidia-network-operator-controller-manager-6f7d6956cd-fw5wg 1/1 Running 0 5m

With the operator up we can create the NicClusterPolicy custom resource file. Note that in this file I have hard coded the InfiniBand interface ibs2f0 and the Ethernet interface ens8f0np0 that I will be using as my shared RDMA devices. In my experience both cannot be defined at the same time in the policy, but both are shown here to illustrate that either Ethernet or InfiniBand interfaces can be used. These could be different devices depending on the system configuration.
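If you are unsure which interface names to put into the policy, one way to find them (a sketch using one of my lab node names) is to list the links from a debug pod before creating the policy and pick out the Mellanox devices.

$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com -- chroot /host ip -br link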

$ cat <<EOF > network-sharedrdma-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  nicFeatureDiscovery:
    image: nic-feature-discovery
    repository: ghcr.io/mellanox
    version: v0.0.1
  docaTelemetryService:
    image: doca_telemetry
    repository: nvcr.io/nvidia/doca
    version: 1.16.5-doca2.6.0-host
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_ib",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ibs2f0"]
            }
          },
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ens8f0np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: v1.5.1
  secondaryNetwork:
    ipoib:
      image: ipoib-cni
      repository: ghcr.io/mellanox
      version: v1.2.0
  nvIpam:
    enableWebhook: false
    image: nvidia-k8s-ipam
    repository: ghcr.io/mellanox
    version: v0.2.0
  ofedDriver:
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    forcePrecompiled: false
    terminationGracePeriodSeconds: 300
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
        podSelector: ''
      maxParallelUpgrades: 1
      safeLoad: false
      waitForCompletion:
        timeoutSeconds: 0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.10-0.7.0.0-0
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
EOF

Next we can create the NicClusterPolicy custom resource on the cluster.

$ oc create -f network-sharedrdma-nic-cluster-policy.yaml nicclusterpolicy.mellanox.com/nic-cluster-policy created

We can validate the NicClusterPolicy by running a few commands in the DOCA/MOFED container.

$ oc get pods -n nvidia-network-operator NAME READY STATUS RESTARTS AGE doca-telemetry-service-hwj65 1/1 Running 2 160m kube-ipoib-cni-ds-fsn8g 1/1 Running 2 160m mofed-rhcos4.16-9b5ddf4c6-ds-ct2h5 2/2 Running 4 160m nic-feature-discovery-ds-dtksz 1/1 Running 2 160m nv-ipam-controller-854585f594-c5jpp 1/1 Running 2 160m nv-ipam-controller-854585f594-xrnp5 1/1 Running 2 160m nv-ipam-node-xqttl 1/1 Running 2 160m nvidia-network-operator-controller-manager-5798b564cd-5cq99 1/1 Running 2 5d23h rdma-shared-dp-ds-p9vvg 1/1 Running 0 85m

And we can rsh into the mofed container to check a few things.

$ MOFED_POD=$(oc get pods -n nvidia-network-operator -o name | grep mofed)
$ oc rsh -n nvidia-network-operator -c mofed-container ${MOFED_POD}
sh-5.1# ofed_info -s
OFED-internal-24.10-0.7.0.0-0:
sh-5.1# ibdev2netdev -v
0000:0d:00.0 mlx5_0 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket fw 32.42.1000 port 1 (ACTIVE) ==> ibs2f0 (Up)
0000:a0:00.0 mlx5_1 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket fw 32.42.1000 port 1 (ACTIVE) ==> ens8f0np0 (Up)

Now we need to create an IPoIBNetwork custom resource file (for InfiniBand-based interfaces).

$ cat <<EOF > ipoib-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: example-ipoibnetwork
spec:
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.6.225/28",
      "exclude": [
        "192.168.6.229/30",
        "192.168.6.236/32"
      ]
    }
  master: ibs2f0
  networkNamespace: default
EOF

And then create the IPoIBNetwork resource on the cluster.

$ oc create -f ipoib-network.yaml
ipoibnetwork.mellanox.com/example-ipoibnetwork created

We will do the same thing for our Ethernet interface, but this will be a MacvlanNetwork custom resource file.

$ cat <<EOF > macvlan-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdmashared-net
spec:
  networkNamespace: default
  master: ens8f0np0
  mode: bridge
  mtu: 1500
  ipam: '{"type": "whereabouts", "range": "192.168.2.0/24", "gateway": "192.168.2.1"}'
EOF

Then create the resource on the cluster.

$ oc create -f macvlan-network.yaml macvlannetwork.mellanox.com/rdmashared-net created
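Both the IPoIBNetwork and the MacvlanNetwork should result in NetworkAttachmentDefinition objects in the default namespace; a quick optional check is to list them.

$ oc get network-attachment-definitions -n default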

If everything looks good we can proceed to the next operator.

Install and Configure GPU Operator

The next operator we need to configure is the NVIDIA GPU Operator. As with most operators, we will need to configure a namespace, operator group and subscription.

To get started we will generate an NVIDIA GPU Operator custom resource file that will create the namespace, operator group and subscription.

$ cat <<EOF > gpu-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nvidia-gpu-operator
  namespace: nvidia-gpu-operator
spec:
  channel: "v24.9"
  installPlanApproval: Automatic
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF

Next we can create the resources on the cluster.

$ oc create -f gpu-operator.yaml namespace/nvidia-gpu-operator created operatorgroup.operators.coreos.com/nvidia-gpu-operator created subscription.operators.coreos.com/nvidia-gpu-operator created

We can check that the operator pod is running by looking at the pods under the namespace.

$ oc get pods -n nvidia-gpu-operator NAME READY STATUS RESTARTS AGE gpu-operator-b4cb7d74-zxpwq 1/1 Running 0 32s

Now that we have the operator running we need to create a GPU cluster policy custom resource file like the one below.

$ cat <<EOF > gpu-cluster-policy.yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    serviceMonitor:
      enabled: true
    enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      nlsEnabled: true
      configMapName: ''
    certConfig:
      name: ''
    rdma:
      enabled: true
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    virtualTopology:
      config: ''
    enabled: true
    useNvidiaDriverCRD: false
    useOpenKernelModules: true
  devicePlugin:
    config:
      name: ''
      default: ''
    mps:
      root: /run/nvidia/mps
    enabled: true
  gdrcopy:
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
      - name: WITH_WORKLOAD
        value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: true
    image: nvidia-fs
    version: 2.20.5
    repository: nvcr.io/nvidia/cloud-native
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    installDir: /usr/local/nvidia
    enabled: true
EOF

With the GPU ClusterPolicy custom resource file generated, let's create it on the cluster.

$ oc create -f gpu-cluster-policy.yaml clusterpolicy.nvidia.com/gpu-cluster-policy created

After some time, all the pods are up and running.

$ oc get pods -n nvidia-gpu-operator NAME READY STATUS RESTARTS AGE gpu-feature-discovery-d5ngn 1/1 Running 0 3m20s gpu-feature-discovery-z42rx 1/1 Running 0 3m23s gpu-operator-6bb4d4b4c5-njh78 1/1 Running 0 4m35s nvidia-container-toolkit-daemonset-bkh8l 1/1 Running 0 3m20s nvidia-container-toolkit-daemonset-c4hzm 1/1 Running 0 3m23s nvidia-cuda-validator-4blvg 0/1 Completed 0 106s nvidia-cuda-validator-tw8sl 0/1 Completed 0 112s nvidia-dcgm-exporter-rrw4g 1/1 Running 0 3m20s nvidia-dcgm-exporter-xc78t 1/1 Running 0 3m23s nvidia-dcgm-nvxpf 1/1 Running 0 3m20s nvidia-dcgm-snj4j 1/1 Running 0 3m23s nvidia-device-plugin-daemonset-fk2xz 1/1 Running 0 3m23s nvidia-device-plugin-daemonset-wq87j 1/1 Running 0 3m20s nvidia-driver-daemonset-416.94.202410211619-0-ngrjg 4/4 Running 0 3m58s nvidia-driver-daemonset-416.94.202410211619-0-tm4x6 4/4 Running 0 3m58s nvidia-node-status-exporter-jlzxh 1/1 Running 0 3m57s nvidia-node-status-exporter-zjffs 1/1 Running 0 3m57s nvidia-operator-validator-l49hx 1/1 Running 0 3m20s nvidia-operator-validator-n44nn 1/1 Running 0 3m23s
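Besides watching the pods, the ClusterPolicy itself reports an overall state that can be polled; a small optional check along these lines should eventually print ready.

$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'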

Once we see the pods running above, we can remote shell into the NVIDIA driver daemonset pod and confirm two items. The first is that the nvidia modules are loaded, specifically nvidia_peermem. The second is that the nvidia-smi utility shows the details about the driver and the hardware.

$ oc rsh -n nvidia-gpu-operator $(oc -n nvidia-gpu-operator get pod -o name -l app.kubernetes.io/component=nvidia-driver) sh-4.4# lsmod|grep nvidia nvidia_fs 327680 0 nvidia_peermem 24576 0 nvidia_modeset 1507328 0 video 73728 1 nvidia_modeset nvidia_uvm 6889472 8 nvidia 8810496 43 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset ib_uverbs 217088 3 nvidia_peermem,rdma_ucm,mlx5_ib drm 741376 5 drm_kms_helper,drm_shmem_helper,nvidia,mgag200 sh-4.4# nvidia-smi Wed Nov 6 22:03:53 2024 +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 | |-----------------------------------------+------------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+========================+======================| | 0 NVIDIA A40 On | 00000000:61:00.0 Off | 0 | | 0% 37C P0 88W / 300W | 1MiB / 46068MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 1 NVIDIA A40 On | 00000000:E1:00.0 Off | 0 | | 0% 28C P8 29W / 300W | 1MiB / 46068MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ +-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| | No running processes found | +-----------------------------------------------------------------------------------------+

While we are in the driver pod we should also set the GPU clock to maximum using the following nvidia-smi command.  This is optional but why not have it at full speed.

$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-416.94.202410172137-0-ndhzc sh-4.4# nvidia-smi -i 0 -lgc $(nvidia-smi -i 0 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1) GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:61:00.0 All done. sh-4.4# nvidia-smi -i 1 -lgc $(nvidia-smi -i 1 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1) GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:E1:00.0 All done.

One last thing we can do is validate our resources are available from a node describe perspective.

$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A9 Capacity: cpu: 128 ephemeral-storage: 1561525616Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 263596712Ki nvidia.com/gpu: 2 pods: 250 rdma/rdma_shared_device_eth: 63 rdma/rdma_shared_device_ib: 63 Allocatable: cpu: 127500m ephemeral-storage: 1438028263499 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 262445736Ki nvidia.com/gpu: 2 pods: 250 rdma/rdma_shared_device_eth: 63 rdma/rdma_shared_device_ib: 63 -- Capacity: cpu: 128 ephemeral-storage: 1561525616Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 263596672Ki nvidia.com/gpu: 2 pods: 250 rdma/rdma_shared_device_eth: 63 rdma/rdma_shared_device_ib: 63 Allocatable: cpu: 127500m ephemeral-storage: 1438028263499 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 262445696Ki nvidia.com/gpu: 2 pods: 250 rdma/rdma_shared_device_eth: 63 rdma/rdma_shared_device_ib: 63

If everything looks good we can proceed to actual RDMA testing.

The Shared Device RDMA Testing

This section will cover running workload pods across the nodes in the environment. We will set up the required privileges, create the workload pods, validate connectivity between the two hosts on the InfiniBand fabric and then run a performance test.

Create Service Account

First let's generate a service account definition to use in the default namespace.

$ cat <<EOF > default-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rdma
  namespace: default
EOF

Next we can create it on our cluster.

$ oc create -f default-serviceaccount.yaml serviceaccount/rdma created

Finally, with the service account created, we can add privileges to it.

$ oc -n default adm policy add-scc-to-user privileged -z rdma clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "rdma"

If everything looks good we can move onto creating the workload pods.

Create Workload Pods for IB

With the service account set up, we now need to create workload pods that contain all the tooling for our testing. We can generate a custom pod resource file for each worker node as follows to meet that requirement.

$ cat <<EOF > rdma-ib-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-ib-32-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-ib-32-workload
    command:
    - sh
    - -c
    - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
EOF

$ cat <<EOF > rdma-ib-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-ib-33-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-ib-33-workload
    command:
    - sh
    - -c
    - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
EOF

Then we can create the pods on the cluster.

$ oc create -f rdma-ib-32-workload.yaml pod/rdma-ib-32-workload created $ oc create -f rdma-ib-33-workload.yaml pod/rdma-ib-33-workload created

Let's validate the pods are running.

$ oc get pods NAME READY STATUS RESTARTS AGE rdma-ib-32-workload 1/1 Running 0 10s rdma-ib-33-workload 1/1 Running 0 3s

With the pods up and running we can validate connectivity.

Validate IB Connectivity

This section will cover confirming that the InfiniBand connectivity is working between the systems. This section is optional but provides a lot of good InfiniBand troubleshooting tips. First we should rsh into each workload pod.

$ oc rsh -n default rdma-ib-32-workload sh-5.1#

The first command we can run is the ibhosts command, which shows the InfiniBand host nodes in the topology.

sh-5.1# ibhosts Ca : 0x58a2e10300e14446 ports 1 "nvd-srv-33 mlx5_0" Ca : 0x58a2e10300dfe416 ports 1 "nvd-srv-32 mlx5_0"

We can also run the ibnodes command which will show not only the nodes but also switches in the topology.

sh-5.1# ibnodes Ca : 0x58a2e10300e14446 ports 1 "nvd-srv-33 mlx5_0" Ca : 0x58a2e10300dfe416 ports 1 "nvd-srv-32 mlx5_0" Switch : 0xfc6a1c0300e7ecc0 ports 129 "MF0;qm9700-ib:MQM9700/U1" enhanced port 0 lid 1 lmc 0

We can look deeper into an interface's state by using the ibstatus command and passing an interface. If no interface is passed, all interfaces will be displayed.

sh-5.1# ibstatus mlx5_0 Infiniband device 'mlx5_0' port 1 status: default gid: fe80:0000:0000:0000:58a2:e103:00df:e416 base lid: 0x4 sm lid: 0x1 state: 4: ACTIVE phys state: 5: LinkUp rate: 400 Gb/sec (4X NDR) link_layer: InfiniBand

Now that we have familiarized ourselves with the environment, we can run ibstat and grep out only certain key elements of the output. These will be needed for the ibping test.

The first ibstat output is that of our first node which will act as the server side for the ibping command.

sh-5.1# ibstat | egrep "Port|Base|Link" Port 1: Physical state: LinkUp Base lid: 4 Port GUID: 0x58a2e10300e14446 Link layer: InfiniBand Port 1: Physical state: LinkUp Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet

The output above shows both an InfiniBand and an Ethernet interface. We are only interested in the InfiniBand interface in this use case. Make note of the lid number, as it is used in the ibping command on the client side.

We can run the same command on the client side and notice that, while some of the details are similar, the lid number is unique along with the port GUID.

sh-5.1# ibstat | egrep "Port|Base|Link" Port 1: Physical state: LinkUp Base lid: 5 Port GUID: 0x58a2e10300e14446 Link layer: InfiniBand Port 1: Physical state: LinkUp Base lid: 0 Port GUID: 0x0000000000000000 Link layer: Ethernet

Next we can run ibping with the server switch (-S) on the first workload pod.

sh-5.1# ibping -S -P 1 -d ibdebug: [114] ibping_serv: starting to serve... ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none) ibwarn: [114] mad_respond_via: dest Lid 5 ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000 ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none) ibwarn: [114] mad_respond_via: dest Lid 5 ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000 ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none) ibwarn: [114] mad_respond_via: dest Lid 5 ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000 ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none) ibwarn: [114] mad_respond_via: dest Lid 5 ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000 ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none) ibwarn: [114] mad_respond_via: dest Lid 5 ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000 ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)

And on the second workload pod we can run an ibping command to ping the server side we started on the other pod.

sh-5.1# ibping -P 1 4 Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms Pong from rdma-workload-client.(none) (Lid 4): time 0.011 ms Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms Pong from rdma-workload-client.(none) (Lid 4): time 0.014 ms Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms

Once we have completed confirming connectivity we can move onto the performance testing.

Now we want to run a bandwidth test across the two running pods. We will need to rsh into the first pod and run the ib_write_bw command. Then we will rsh into the second pod in a different terminal window and run ib_write_bw <ipaddress>.
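The default invocation shown below is fine for a quick check, but ib_write_bw also accepts standard perftest flags that can be handy here, for example selecting the RDMA device explicitly (-d), ignoring the CPU frequency warning seen in the output below (-F), and reporting in Gb/s (--report_gbits). The first line is the server side and the second is the client side; this is only a sketch using the pod IP from my lab.

$ ib_write_bw -d mlx5_0 -F --report_gbits
$ ib_write_bw -d mlx5_0 -F --report_gbits 192.168.6.225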

$ oc get pods -n default NAME READY STATUS RESTARTS AGE rdma-ib-32-workload 1/1 Running 0 8m12s rdma-ib-33-workload 1/1 Running 0 8m5s

First let's get the IP address of the first pod.

$ oc get pod rdma-ib-32-workload -o yaml | grep -E 'default/example-ipoibnetwork' -A3 "name": "default/example-ipoibnetwork", "interface": "net1", "ips": [ "192.168.6.225"

Now rsh into the first pod and run the ib_write_bw command and leave that terminal open.

$ oc rsh -n default rdma-ib-32-workload sh-5.1# ib_write_bw ************************************ * Waiting for client to connect... * ************************************

Then open another terminal and rsh to the second pod and run ib_write_bw 192.168.6.225.

$ oc rsh -n default rdma-ib-33-workload sh-5.1# ib_write_bw 192.168.6.225 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_0 Number of qps : 1 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON ibv_wr* API : ON TX depth : 128 CQ Moderation : 1 Mtu : 4096[B] Link type : IB Max inline data : 0[B] rdma_cm QPs : OFF Data ex. method : Ethernet --------------------------------------------------------------------------------------- local address: LID 0x05 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007fcbace2f000 remote address: LID 0x04 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007f360e3d8000 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps] Conflicting CPU frequency values detected: 2500.000000 != 3495.887000. CPU Frequency is not max. 65536 5000 44604.62 44576.86 0.713230 ---------------------------------------------------------------------------------------

If we go back to the first terminal on pod number one we should also see similar response results.

sh-5.1# ib_write_bw ************************************ * Waiting for client to connect... * ************************************ --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_0 Number of qps : 1 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON ibv_wr* API : ON CQ Moderation : 1 Mtu : 4096[B] Link type : IB Max inline data : 0[B] rdma_cm QPs : OFF Data ex. method : Ethernet --------------------------------------------------------------------------------------- local address: LID 0x04 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007f360e3d8000 remote address: LID 0x05 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007fcbace2f000 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps] 65536 5000 44604.62 44576.86 0.713230 ---------------------------------------------------------------------------------------

We can now clean up the pods since testing is over and move onto the next test.
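No explicit cleanup command was shown above; deleting the two workload pods is enough, for example:

$ oc delete pod rdma-ib-32-workload rdma-ib-33-workload -n default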

Create Workload Pods for ETH

Now we need to test RDMA over Ethernet. We can generate a custom pod resource file for each node as follows to meet that requirement.

$ cat <<EOF > rdma-eth-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-32-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-eth-32-workload
    command:
    - sh
    - -c
    - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF

$ cat <<EOF > rdma-eth-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-33-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-eth-33-workload
    command:
    - sh
    - -c
    - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF

Then we can create the pods on the cluster.

$ oc create -f rdma-eth-32-workload.yaml pod/rdma-eth-32-workload created $ oc create -f rdma-eth-33-workload.yaml pod/rdma-eth-33-workload created

Let's validate the pods are running.

$ oc get pods -n default NAME READY STATUS RESTARTS AGE rdma-eth-32-workload 1/1 Running 0 25s rdma-eth-33-workload 1/1 Running 0 22s

With the pods up and running we can move onto the actual test.

Now we want to run a bandwidth test across the two running pods. We will need to rsh into the first pod and run the ib_write_bw command. Then we will rsh into the second pod in a different terminal window and run ib_write_bw <ipaddress>.

$ oc get pods -n default NAME READY STATUS RESTARTS AGE rdma-eth-32-workload 1/1 Running 0 106s rdma-eth-33-workload 1/1 Running 0 103s

First let's get the IP address of the first pod.

$ oc get pod rdma-eth-32-workload -o yaml | grep -E 'default/rdmashared' -A3 "name": "default/rdmashared-net", "interface": "net1", "ips": [ "192.168.2.1"

Now rsh into the first pod and run the ib_write_bw command and leave that terminal open.

$ oc rsh -n default rdma-eth-32-workload sh-5.1# ib_write_bw ************************************ * Waiting for client to connect... * ************************************

Then open another terminal and rsh to the second pod and run ib_write_bw 192.168.2.1.

$ oc rsh -n default rdma-eth-33-workload sh-5.1# ib_write_bw 192.168.2.1 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_0 Number of qps : 1 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON ibv_wr* API : ON TX depth : 128 CQ Moderation : 1 Mtu : 4096[B] Link type : IB Max inline data : 0[B] rdma_cm QPs : OFF Data ex. method : Ethernet --------------------------------------------------------------------------------------- local address: LID 0x05 QPN 0x0ce2 PSN 0x5389f7 RKey 0x1fff00 VAddr 0x007f7368df3000 remote address: LID 0x04 QPN 0x0ce2 PSN 0x81fa7f RKey 0x1fff00 VAddr 0x007f7e8c890000 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps] Conflicting CPU frequency values detected: 2500.000000 != 3497.359000. CPU Frequency is not max. 65536 5000 44490.32 44467.35 0.711478 ---------------------------------------------------------------------------------------

If we go back to the first terminal on pod number one we should also see similar response results.

sh-5.1# ib_write_bw ************************************ * Waiting for client to connect... * ************************************ --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_0 Number of qps : 1 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON ibv_wr* API : ON CQ Moderation : 1 Mtu : 4096[B] Link type : IB Max inline data : 0[B] rdma_cm QPs : OFF Data ex. method : Ethernet --------------------------------------------------------------------------------------- local address: LID 0x04 QPN 0x0ce2 PSN 0x81fa7f RKey 0x1fff00 VAddr 0x007f7e8c890000 remote address: LID 0x05 QPN 0x0ce2 PSN 0x5389f7 RKey 0x1fff00 VAddr 0x007f7368df3000 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps] 65536 5000 44490.32 44467.35 0.711478 ---------------------------------------------------------------------------------------

We can now clean up the pods since testing is over and move onto the next test.
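As before, cleaning up is just a matter of deleting the two workload pods, for example:

$ oc delete pod rdma-eth-32-workload rdma-eth-33-workload -n default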

The Host Device RDMA Testing

This section will demonstrate how to configure host device RDMA with the NVIDIA Network Operator and then how to test the configuration from workload pods.

Configure Nic Cluster Policy for Host Device

The operator should be running from the previous steps. If a NicClusterPolicy already exists, we need to delete it first and then generate a new hostdev NicClusterPolicy custom resource file, as shown below.
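Deleting by the original file is the simplest way to remove the shared-device policy created earlier, and the mofed and device plugin pods can then be watched while they settle; this is a sketch based on the file name used above.

$ oc delete -f network-sharedrdma-nic-cluster-policy.yaml
$ oc get pods -n nvidia-network-operator -w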

$ cat <<EOF > network-hostdev-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.10-0.7.0.0-0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: v3.7.0
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "isRdma": true
            }
          }
        ]
      }
EOF

Next we can create the NicClusterPolicy custom resource on the cluster.

$ oc create -f network-hostdev-nic-cluster-policy.yaml nicclusterpolicy.mellanox.com/nic-cluster-policy created

We can validate the host device NicClusterPolicy by running a few commands in the DOCA/MOFED container.

$ oc get pods -n nvidia-network-operator NAME READY STATUS RESTARTS AGE mofed-rhcos4.16-696886fcb4-ds-9sgvd 2/2 Running 0 2m37s mofed-rhcos4.16-696886fcb4-ds-lkjd4 2/2 Running 0 2m37s nvidia-network-operator-controller-manager-68d547dbbd-qsdkf 1/1 Running 0 141m sriov-device-plugin-6v2nz 1/1 Running 0 2m14s sriov-device-plugin-hc4t8 1/1 Running 0 2m14s

We can also confirm that the resources show up in the oc describe node output.

$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A7 Capacity: cpu: 128 ephemeral-storage: 1561525616Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 263596708Ki nvidia.com/hostdev: 2 pods: 250 Allocatable: cpu: 127500m ephemeral-storage: 1438028263499 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 262445732Ki nvidia.com/hostdev: 2 pods: 250 -- Capacity: cpu: 128 ephemeral-storage: 1561525616Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 263596704Ki nvidia.com/hostdev: 2 pods: 250 Allocatable: cpu: 127500m ephemeral-storage: 1438028263499 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 262445728Ki nvidia.com/hostdev: 2 pods: 250

Now we need to create a HostDeviceNetwork custom resource file.

$ cat <<EOF > hostdev-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdev-net
spec:
  networkNamespace: "default"
  resourceName: "hostdev"
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.3.225/28",
      "exclude": [
        "192.168.3.229/30",
        "192.168.3.236/32"
      ]
    }
EOF

And then create the HostDeviceNetwork resource on the cluster.

$ oc create -f hostdev-network.yaml hostdevicenetwork.mellanox.com/hostdev-net created

Let's validate our resources are showing up properly.

$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A8 Capacity: cpu: 128 ephemeral-storage: 1561525616Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 263596708Ki nvidia.com/gpu: 2 nvidia.com/hostdev: 2 pods: 250 Allocatable: cpu: 127500m ephemeral-storage: 1438028263499 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 262445732Ki nvidia.com/gpu: 2 nvidia.com/hostdev: 2 pods: 250 -- Capacity: cpu: 128 ephemeral-storage: 1561525616Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 263596680Ki nvidia.com/gpu: 2 nvidia.com/hostdev: 2 pods: 250 Allocatable: cpu: 127500m ephemeral-storage: 1438028263499 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 262445704Ki nvidia.com/gpu: 2 nvidia.com/hostdev: 2 pods: 250

This completes the NicClusterPolicy and network configuration for the host device example.

Create Workload Pods and Perf Test Host Device

Now we need to create workload pods that contain all the tooling for our host device testing. We can generate a custom pod file for each node as follows.

$ cat << EOF > hostdev-32-workload.yaml apiVersion: v1 kind: Pod metadata: name: hostdev-32-workload namespace: default annotations: k8s.v1.cni.cncf.io/networks: hostdev-net spec: nodeSelector: kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com serviceAccountName: rdma containers: - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools name: hostdev-32-workload command: - sh - -c - sleep inf securityContext: privileged: true capabilities: add: [ "IPC_LOCK" ] resources: limits: nvidia.com/gpu: 1 nvidia.com/hostdev: 1 requests: nvidia.com/gpu: 1 nvidia.com/hostdev: 1 EOF $ cat <<EOF > hostdev-33-workload.yaml apiVersion: v1 kind: Pod metadata: name: hostdev-33-workload namespace: default annotations: k8s.v1.cni.cncf.io/networks: hostdev-net spec: nodeSelector: kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com serviceAccountName: rdma containers: - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools name: hostdev-33-workload command: - sh - -c - sleep inf securityContext: privileged: true capabilities: add: [ "IPC_LOCK" ] resources: limits: nvidia.com/gpu: 1 nvidia.com/hostdev: 1 requests: nvidia.com/gpu: 1 nvidia.com/hostdev: 1 EOF

Then we can create the pods on the cluster.

$ oc create -f hostdev-32-workload.yaml pod/hostdev-32-workload created $ oc create -f hostdev-33-workload.yaml pod/hostdev-33-workload created

Let's validate the pods are running.

$ oc get pods -n default NAME READY STATUS RESTARTS AGE hostdev-32-workload 1/1 Running 0 73s hostdev-33-workload 1/1 Running 0 12s

First let's get the IP address of the first pod.

$ oc get pod hostdev-32-workload -o yaml | grep -E 'default/hostdev-net' -A3 "name": "default/hostdev-net", "interface": "net1", "ips": [ "192.168.3.225"
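
If grepping the pod YAML feels fragile, the same information is available in the Multus network-status annotation; an equivalent sketch, assuming the standard k8s.v1.cni.cncf.io/network-status annotation key:

$ oc get pod hostdev-32-workload -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}'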

Now rsh into the first pod and run the ib_write_bw command and leave that terminal open.

$ oc rsh -n default hostdev-32-workload sh-5.1# ib_write_bw ************************************ * Waiting for client to connect... * ************************************

Then open another terminal, rsh to the second pod and run ib_write_bw 192.168.3.225.

$ oc rsh -n default hostdev-33-workload sh-5.1# ib_write_bw 192.168.3.225 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_0 Number of qps : 1 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON ibv_wr* API : ON TX depth : 128 CQ Moderation : 1 Mtu : 4096[B] Link type : IB Max inline data : 0[B] rdma_cm QPs : OFF Data ex. method : Ethernet --------------------------------------------------------------------------------------- local address: LID 0x05 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007fe688c97000 remote address: LID 0x04 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007f1f0249d000 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps] Conflicting CPU frequency values detected: 2500.000000 != 3498.323000. CPU Frequency is not max. 65536 5000 44351.41 44328.98 0.709264 ---------------------------------------------------------------------------------------

If we go back to the first terminal on the first pod, we should see matching results from the server side.

sh-5.1# ib_write_bw ************************************ * Waiting for client to connect... * ************************************ --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_0 Number of qps : 1 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON ibv_wr* API : ON CQ Moderation : 1 Mtu : 4096[B] Link type : IB Max inline data : 0[B] rdma_cm QPs : OFF Data ex. method : Ethernet --------------------------------------------------------------------------------------- local address: LID 0x04 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007f1f0249d000 remote address: LID 0x05 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007fe688c97000 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps] 65536 5000 44351.41 44328.98 0.709264 ---------------------------------------------------------------------------------------
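
The defaults above are fine for a smoke test, but ib_write_bw accepts additional flags that can make a run more thorough; here is a sketch using standard perftest options (verify them with ib_write_bw --help in your image). The -F flag suppresses the CPU frequency warning seen above, -a sweeps all message sizes, and --report_gbits reports bandwidth in Gb/s. Run the first command on the first pod as the server and the second on the second pod as the client; the options must match on both ends.

sh-5.1# ib_write_bw -d mlx5_0 -F -a --report_gbits
sh-5.1# ib_write_bw -d mlx5_0 -F -a --report_gbits 192.168.3.225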

We can now clean up the pods since testing is over and move on to the next test.
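
A minimal cleanup sketch:

$ oc delete pod hostdev-32-workload hostdev-33-workload -n default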

SRIOV Legacy Mode RDMA Testing

This deployment mode uses SR-IOV in legacy mode, where virtual functions (VFs) are created on the physical NIC and attached directly to pods.

Configure Nic Cluster Policy for SRIOV Legacy

First we need to create a NicClusterPolicy, which for SRIOV legacy mode is fairly generic. If a NicClusterPolicy already exists, remove it, then generate the following custom resource file.

$ cat <<EOF > network-sriovleg-nic-cluster-policy.yaml apiVersion: mellanox.com/v1alpha1 kind: NicClusterPolicy metadata: name: nic-cluster-policy spec: ofedDriver: image: doca-driver repository: nvcr.io/nvidia/mellanox version: 24.10-0.7.0.0-0 startupProbe: initialDelaySeconds: 10 periodSeconds: 20 livenessProbe: initialDelaySeconds: 30 periodSeconds: 30 readinessProbe: initialDelaySeconds: 10 periodSeconds: 30 env: - name: UNLOAD_STORAGE_MODULES value: "true" - name: RESTORE_DRIVER_ON_POD_TERMINATION value: "true" - name: CREATE_IFNAMES_UDEV value: "true" EOF

Now let's create the policy on the cluster.

$ oc create -f network-sriovleg-nic-cluster-policy.yaml nicclusterpolicy.mellanox.com/nic-cluster-policy created

Before we continue we can validate the pods are up.

$ oc get pods -n nvidia-network-operator NAME READY STATUS RESTARTS AGE mofed-rhcos4.16-696886fcb4-ds-4mb42 2/2 Running 0 40s mofed-rhcos4.16-696886fcb4-ds-8knwq 2/2 Running 0 40s nvidia-network-operator-controller-manager-68d547dbbd-qsdkf 1/1 Running 13 (4d ago) 4d21h

Now we need to create a SriovNetworkNodePolicy, which will generate the VFs for the device we want to operate in SRIOV legacy mode. Generate the custom resource file below.

$ cat <<EOF > sriov-network-node-policy.yaml apiVersion: sriovnetwork.openshift.io/v1 kind: SriovNetworkNodePolicy metadata: name: sriov-legacy-policy namespace: openshift-sriov-network-operator spec: deviceType: netdevice mtu: 1500 nicSelector: vendor: "15b3" pfNames: ["ens8f0np0#0-7"] nodeSelector: feature.node.kubernetes.io/pci-15b3.present: "true" numVfs: 8 priority: 90 isRdma: true resourceName: sriovlegacy EOF

Next we can create the custom resource on the cluster. As a note, make sure SR-IOV Global Enable is turned on in the system BIOS, as described in the Red Hat knowledge base article.

$ oc create -f sriov-network-node-policy.yaml sriovnetworknodepolicy.sriovnetwork.openshift.io/sriov-legacy-policy created

The nodes will go through a reboot process: each one is cordoned (scheduling disabled) and rebooted to apply the configuration.

$ oc get nodes NAME STATUS ROLES AGE VERSION edge-19.edge.lab.eng.rdu2.redhat.com Ready control-plane,master,worker 5d v1.29.8+632b078 nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com Ready worker 4d22h v1.29.8+632b078 nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com NotReady,SchedulingDisabled worker 4d22h v1.29.8+632b078
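
The SR-IOV operator also reports per-node progress in SriovNetworkNodeState objects; a sketch for watching the sync status while the nodes cycle (the syncStatus field name is from the upstream CRD, so double-check it in your cluster version):

$ oc get sriovnetworknodestates -n openshift-sriov-network-operator
$ oc get sriovnetworknodestate nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com -n openshift-sriov-network-operator -o jsonpath='{.status.syncStatus}'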

Once the nodes have rebooted, we can validate that the VF interfaces were created by opening a debug pod on each node.

$ oc debug node/nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com Starting pod/nvd-srv-33nvidiaengrdu2dcredhatcom-debug-cqfjz ... To use host binaries, run `chroot /host` Pod IP: 10.6.135.12 If you don't see a command prompt, try pressing enter. sh-5.1# chroot /host sh-5.1# ip link show | grep ens8 26: ens8f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 42: ens8f0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 43: ens8f0v1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 44: ens8f0v2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 45: ens8f0v3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 46: ens8f0v4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 47: ens8f0v5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 48: ens8f0v6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 49: ens8f0v7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000

We can repeat the same steps on the second node for completeness.
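
Another lightweight check from the same debug pod is to read the VF count straight from sysfs; a sketch assuming the physical function kept the ens8f0np0 name, which should report 8 given the numVfs setting above:

sh-5.1# cat /sys/class/net/ens8f0np0/device/sriov_numvfs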

We can also confirm via the node capabilities output.

$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A8 Capacity: cpu: 128 ephemeral-storage: 1561525616Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 263596692Ki nvidia.com/gpu: 2 nvidia.com/hostdev: 0 openshift.io/sriovlegacy: 8 -- Allocatable: cpu: 127500m ephemeral-storage: 1438028263499 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 262445716Ki nvidia.com/gpu: 2 nvidia.com/hostdev: 0 openshift.io/sriovlegacy: 8 -- Capacity: cpu: 128 ephemeral-storage: 1561525616Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 263596688Ki nvidia.com/gpu: 2 nvidia.com/hostdev: 0 openshift.io/sriovlegacy: 8 -- Allocatable: cpu: 127500m ephemeral-storage: 1438028263499 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 262445712Ki nvidia.com/gpu: 2 nvidia.com/hostdev: 0 openshift.io/sriovlegacy: 8

Now that the VFs for SRIOV legacy mode are in place, we can generate the SriovNetwork custom resource file.

$ cat <<EOF > sriov-network.yaml apiVersion: sriovnetwork.openshift.io/v1 kind: SriovNetwork metadata: name: sriov-network namespace: openshift-sriov-network-operator spec: vlan: 0 networkNamespace: "default" resourceName: "sriovlegacy" ipam: | { "type": "whereabouts", "range": "192.168.3.225/28", "exclude": [ "192.168.3.229/30", "192.168.3.236/32" ] } EOF

Then we can create the custom resource on the cluster.

$ oc create -f sriov-network.yaml sriovnetwork.sriovnetwork.openshift.io/sriov-network created
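
Creating the SriovNetwork should cause the operator to render a matching NetworkAttachmentDefinition in the target namespace; a quick way to confirm (a minimal sketch):

$ oc get network-attachment-definitions -n default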

This completes the NicClusterPolicy and network configuration for the SRIOV legacy example.

Create Workload and Perf Test SRIOV Legacy

Now we need to create workload pods that contain all the tooling for our SRIOV legacy testing. We can generate a custom pod file for each node as follows.

$ cat << EOF > sriovlegacy-32-workload.yaml apiVersion: v1 kind: Pod metadata: name: sriovlegacy-32-workload namespace: default annotations: k8s.v1.cni.cncf.io/networks: sriov-network spec: nodeSelector: kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com serviceAccountName: rdma containers: - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools name: sriovlegacy-32-workload command: - sh - -c - sleep inf securityContext: privileged: true capabilities: add: [ "IPC_LOCK" ] resources: limits: nvidia.com/gpu: 1 openshift.io/sriovlegacy: 1 requests: nvidia.com/gpu: 1 openshift.io/sriovlegacy: 1 EOF $ cat <<EOF > sriovlegacy-33-workload.yaml apiVersion: v1 kind: Pod metadata: name: sriovlegacy-33-workload namespace: default annotations: k8s.v1.cni.cncf.io/networks: sriov-network spec: nodeSelector: kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com serviceAccountName: rdma containers: - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools name: sriovlegacy-33-workload command: - sh - -c - sleep inf securityContext: privileged: true capabilities: add: [ "IPC_LOCK" ] resources: limits: nvidia.com/gpu: 1 openshift.io/sriovlegacy: 1 requests: nvidia.com/gpu: 1 openshift.io/sriovlegacy: 1 EOF

Then we can create the pods on the cluster.

$ oc create -f sriovlegacy-32-workload.yaml pod/sriovlegacy-32-workload created $ oc create -f sriovlegacy-33-workload.yaml pod/sriovlegacy-33-workload created

Let's validate the pods are running.

$ oc get pods -n default NAME READY STATUS RESTARTS AGE sriovlegacy-32-workload 1/1 Running 0 73s sriovlegacy-33-workload 1/1 Running 0 12s

First let's get the IP address of the first pod.

$ oc get pod sriovlegacy-32-workload -o yaml | grep -E 'default/sriov-network' -A3 "name": "default/sriov-network", "interface": "net1", "ips": [ "192.168.3.225"

Now rsh into the first pod, run the ib_write_bw command to start the server side, and leave that terminal open. Then open another terminal, rsh into the second pod, and run ib_write_bw 192.168.3.225 as the client.

$ oc rsh sriovlegacy-33-workload sh-5.1# ib_write_bw 192.168.3.225 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_0 Number of qps : 1 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON ibv_wr* API : ON TX depth : 128 CQ Moderation : 1 Mtu : 4096[B] Link type : IB Max inline data : 0[B] rdma_cm QPs : OFF Data ex. method : Ethernet --------------------------------------------------------------------------------------- local address: LID 0x05 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f397ace8000 remote address: LID 0x04 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f0eeefac000 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps] Conflicting CPU frequency values detected: 2500.000000 != 3491.228000. CPU Frequency is not max. 65536 5000 44414.44 44386.66 0.710187 ---------------------------------------------------------------------------------------

If we go back to the first terminal on the first pod, we should see matching results from the server side.

sh-5.1# ib_write_bw ************************************ * Waiting for client to connect... * ************************************ --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_0 Number of qps : 1 Transport type : IB Connection type : RC Using SRQ : OFF PCIe relax order: ON ibv_wr* API : ON CQ Moderation : 1 Mtu : 4096[B] Link type : IB Max inline data : 0[B] rdma_cm QPs : OFF Data ex. method : Ethernet --------------------------------------------------------------------------------------- local address: LID 0x04 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f0eeefac000 remote address: LID 0x05 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f397ace8000 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps] 65536 5000 44414.44 44386.66 0.710187 ---------------------------------------------------------------------------------------

We can now clean up the pods since testing is over.
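
As before, a minimal cleanup sketch:

$ oc delete pod sriovlegacy-32-workload sriovlegacy-33-workload -n default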

Hopefully this blog was detailed enough to provide an understanding of RDMA testing with NVIDIA and OpenShift.  It provided brief examples of how to configure the different RDMA methods: Shared Device, Host Device and SRIOV Legacy.