Lab Environment
The following configurations and testing were done on an OpenShift environment that consisted of the following:
- OpenShift 4.16.19 x86
- Network Operator 24.10
- All other operators used the default values for OCP 4.16.
- 3 physical nodes: 1 SNO master, 2 workers
- The workers were Dell R760xa servers, each with two NVIDIA BlueField-3 (BF3) cards.
- One BF3 card in each worker was attached to an NVIDIA Spectrum SN5600 switch for RDMA over Ethernet.
- The other BF3 card was attached to an NVIDIA Quantum QM9700 switch for RDMA over InfiniBand.
Blacklist IRDMA Module
On some systems, including the Dell R760xa servers I used for testing, the irdma kernel module causes problems for the NVIDIA Network Operator when the DOCA drivers are unloaded and reloaded, so we need to blacklist it with a machine configuration that gets applied to all worker nodes.
Generate the following MachineConfig YAML, specifying the irdma module to blacklist.
$ cat <<EOF > 99-machine-config-blacklist-irdma.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
labels:
machineconfiguration.openshift.io/role: worker
name: 99-worker-blacklist-irdma
spec:
kernelArguments:
- "module_blacklist=irdma"
EOF
Then create the machine configuration on the cluster and wait for the worker nodes to reboot.
$ oc create -f 99-machine-config-blacklist-irdma.yaml
machineconfig.machineconfiguration.openshift.io/99-worker-blacklist-irdma created
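The rollout happens one worker at a time and can take several minutes per node. A minimal way to block until the pool has settled (assuming the default worker pool) is to watch or wait on its Updated condition:
$ oc get mcp worker -w
$ oc wait mcp/worker --for condition=Updated --timeout=30m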
Validate in a debug pod on each node that the module has not loaded.
$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-32nvidiaengrdu2dcredhatcom-debug-btfj2 ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.11
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# lsmod|grep irdma
sh-5.1#
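Optionally, we can also confirm that the kernel argument itself made it onto the host command line; a quick sketch using the same node without an interactive session:
$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com -- chroot /host cat /proc/cmdline | grep module_blacklist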
At this point, if everything looks good, we can move onto the next steps of the workflow.
Persistent Naming Rules
Sometimes there is a need to make sure the device names persist across reboots. On the R760xa systems, and on nodes with a large number of network cards, I noticed the Mellanox devices were being renamed on reboots, so I decided to use a MachineConfig to set persistent names.
First, gather the MAC addresses of the relevant interfaces on the worker node(s) and decide on the names those interfaces should keep. We will call the file 70-persistent-net.rules and stash the details in it.
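If the MAC addresses are not already known, a minimal sketch for collecting them is to list the links on each worker from a debug pod (shown here for one of the lab workers):
$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com -- chroot /host ip -br link show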
$ cat <<EOF > 70-persistent-net.rules
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:28",ATTR{type}=="1",NAME="ibs2f0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:29",ATTR{type}=="1",NAME="ens8f0np0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d0",ATTR{type}=="1",NAME="ibs2f0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d1",ATTR{type}=="1",NAME="ens8f0np0"
EOF
Now we need to convert that file into a base64 string without line breaks and assign the output to the variable PERSIST.
$ PERSIST=`cat 70-persistent-net.rules| base64 -w 0`
$ echo $PERSIST
U1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIK
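If we want to double check the encoding before embedding it, the string can be decoded and compared against the original file:
$ echo $PERSIST | base64 -d | diff - 70-persistent-net.rules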
Now we can create a machine configuration and set the base64 encoding in our custom resource file. Notice how I am using the PERSIST variable in my YAML creation to mitigate copy/paste errors.
$ cat <<EOF > 99-machine-config-udev-network.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
labels:
machineconfiguration.openshift.io/role: worker
name: 99-machine-config-udev-network
spec:
config:
ignition:
version: 3.2.0
storage:
files:
- contents:
source: data:text/plain;base64,$PERSIST
filesystem: root
mode: 420
path: /etc/udev/rules.d/70-persistent-net.rules
EOF
Once we have the machine configuration we can create it on the cluster.
$ oc create -f 99-machine-config-udev-network.yaml
machineconfig.machineconfiguration.openshift.io/99-machine-config-udev-network created
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-9adfe851c2c14d9598eea5ec3df6c187 True False False 1 1 1 0 6h21m
worker rendered-worker-4568f1b174066b4b1a4de794cf538fee False True False 2 0 0 0 6h21m
The worker nodes will reboot, and once the UPDATING field goes back to False we can validate the device names on the nodes in a debug pod if we choose to do so.
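A minimal sketch of that check, using the same worker node as before, is to confirm the rules file landed and that the interfaces now carry the expected names:
$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com -- chroot /host cat /etc/udev/rules.d/70-persistent-net.rules
$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com -- chroot /host ip -br link show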
If everything looks good we can move onto configuring the operators of the OpenShift cluster.
Install and Configure Required Operators
Install and Configure NFD Operator
The Node Feature Discovery (NFD) operator manages the detection of hardware features and configuration in an OpenShift Container Platform cluster by labeling the nodes with hardware-specific information. NFD labels the host with node-specific attributes, such as PCI cards, kernel, operating system version, and so on.
To get started we will generate an NFD Operator custom resource file that will create the namespace, operator group and subscription.
$ cat <<EOF > nfd-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
name: openshift-nfd
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
name: openshift-nfd
namespace: openshift-nfd
spec:
targetNamespaces:
- openshift-nfd
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: nfd
namespace: openshift-nfd
spec:
channel: "stable"
installPlanApproval: Automatic
name: nfd
source: redhat-operators
sourceNamespace: openshift-marketplace
EOF
Next we can create the resources on the cluster.
$ oc create -f nfd-operator.yaml
namespace/openshift-nfd created
operatorgroup.operators.coreos.com/openshift-nfd created
subscription.operators.coreos.com/nfd created
We can validate that the operator is installed and running by looking at the
pods in the openshift-nfd
namespace.
$ oc get pods -n openshift-nfd
NAME READY STATUS RESTARTS AGE
nfd-controller-manager-8698c88cdd-t8gbc 2/2 Running 0 2m
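If the controller pod does not appear, checking the Subscription and ClusterServiceVersion is a quick way to see where the install stalled:
$ oc get subscription nfd -n openshift-nfd
$ oc get csv -n openshift-nfd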
With the NFD controller running we can move onto generating the
NodeFeatureDiscovery
instance and adding it to the cluster.
The ClusterServiceVersion
specification for NFD operator provides default
values, including the NFD operand image that is part of the operator payload.
We retrieve its value with the following command line and assign it to the variable NFD_OPERAND_IMAGE.
$ NFD_OPERAND_IMAGE=`echo $(oc get csv -n openshift-nfd -o json | jq -r '.items[0].metadata.annotations["alm-examples"]') | jq -r '.[] | select(.kind == "NodeFeatureDiscovery") | .spec.operand.image'`
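We can echo the variable to confirm the lookup worked before using it; the exact image path will vary with the installed CSV version:
$ echo $NFD_OPERAND_IMAGE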
We can now create the NodeFeatureDiscovery instance. Note that we add entries to the default deviceClassWhitelist field to support more network adapters, such as the NVIDIA BlueField DPUs and the NVIDIA GPUs.
$ cat <<EOF > nfd-instance.yaml
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
name: nfd-instance
namespace: openshift-nfd
spec:
instance: ''
operand:
image: '${NFD_OPERAND_IMAGE}'
servicePort: 12000
prunerOnDelete: false
topologyUpdater: false
workerConfig:
configData: |
core:
sleepInterval: 60s
sources:
pci:
deviceClassWhitelist:
- "02"
- "03"
- "0200"
- "0207"
- "12"
deviceLabelFields:
- "vendor"
EOF
$ oc create -f nfd-instance.yaml
nodefeaturediscovery.nfd.openshift.io/nfd-instance created
Finally we can validate our instance is up and running by again looking at the
pods under the openshift-nfd
namespace.
$ oc get pods -n openshift-nfd
NAME READY STATUS RESTARTS AGE
nfd-controller-manager-7cb6d656-jcnqb 2/2 Running 0 4m
nfd-gc-7576d64889-s28k9 1/1 Running 0 21s
nfd-master-b7bcf5cfd-qnrmz 1/1 Running 0 21s
nfd-worker-96pfh 1/1 Running 0 21s
nfd-worker-b2gkg 1/1 Running 0 21s
nfd-worker-bd9bk 1/1 Running 0 21s
nfd-worker-cswf4 1/1 Running 0 21s
nfd-worker-kp6gg 1/1 Running 0 21s
After a minute or so, we can verify that NFD has added labels to the node.
The NFD labels are prefixed with feature.node.kubernetes.io
, so we can easily
filter them.
$ oc get node -o json | jq '.items[0].metadata.labels | with_entries(select(.key | startswith("feature.node.kubernetes.io")))'
{
"feature.node.kubernetes.io/cpu-cpuid.ADX": "true",
"feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
"feature.node.kubernetes.io/cpu-cpuid.CETSS": "true",
"feature.node.kubernetes.io/cpu-cpuid.CLZERO": "true",
"feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8": "true",
"feature.node.kubernetes.io/cpu-cpuid.CPBOOST": "true",
"feature.node.kubernetes.io/cpu-cpuid.EFER_LMSLE_UNS": "true",
"feature.node.kubernetes.io/cpu-cpuid.FMA3": "true",
"feature.node.kubernetes.io/cpu-cpuid.FP256": "true",
"feature.node.kubernetes.io/cpu-cpuid.FSRM": "true",
"feature.node.kubernetes.io/cpu-cpuid.FXSR": "true",
"feature.node.kubernetes.io/cpu-cpuid.FXSROPT": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBPB": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBRS": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBRS_PREFERRED": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBRS_PROVIDES_SMP": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBS": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSFFV": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBS_FETCH_CTLX": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBS_OPFUSE": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBS_PREVENTHOST": "true",
"feature.node.kubernetes.io/cpu-cpuid.INT_WBINVD": "true",
"feature.node.kubernetes.io/cpu-cpuid.INVLPGB": "true",
"feature.node.kubernetes.io/cpu-cpuid.LAHF": "true",
"feature.node.kubernetes.io/cpu-cpuid.LBRVIRT": "true",
"feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW": "true",
"feature.node.kubernetes.io/cpu-cpuid.MCOMMIT": "true",
"feature.node.kubernetes.io/cpu-cpuid.MOVBE": "true",
"feature.node.kubernetes.io/cpu-cpuid.MOVU": "true",
"feature.node.kubernetes.io/cpu-cpuid.MSRIRC": "true",
"feature.node.kubernetes.io/cpu-cpuid.MSR_PAGEFLUSH": "true",
"feature.node.kubernetes.io/cpu-cpuid.NRIPS": "true",
"feature.node.kubernetes.io/cpu-cpuid.OSXSAVE": "true",
"feature.node.kubernetes.io/cpu-cpuid.PPIN": "true",
"feature.node.kubernetes.io/cpu-cpuid.PSFD": "true",
"feature.node.kubernetes.io/cpu-cpuid.RDPRU": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_64BIT": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_ALTERNATIVE": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_DEBUGSWAP": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_ES": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_RESTRICTED": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_SNP": "true",
"feature.node.kubernetes.io/cpu-cpuid.SHA": "true",
"feature.node.kubernetes.io/cpu-cpuid.SME": "true",
"feature.node.kubernetes.io/cpu-cpuid.SME_COHERENT": "true",
"feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD": "true",
"feature.node.kubernetes.io/cpu-cpuid.SSE4A": "true",
"feature.node.kubernetes.io/cpu-cpuid.STIBP": "true",
"feature.node.kubernetes.io/cpu-cpuid.STIBP_ALWAYSON": "true",
"feature.node.kubernetes.io/cpu-cpuid.SUCCOR": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVM": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVMDA": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVMFBASID": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVML": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVMNP": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVMPF": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVMPFT": "true",
"feature.node.kubernetes.io/cpu-cpuid.SYSCALL": "true",
"feature.node.kubernetes.io/cpu-cpuid.SYSEE": "true",
"feature.node.kubernetes.io/cpu-cpuid.TLB_FLUSH_NESTED": "true",
"feature.node.kubernetes.io/cpu-cpuid.TOPEXT": "true",
"feature.node.kubernetes.io/cpu-cpuid.TSCRATEMSR": "true",
"feature.node.kubernetes.io/cpu-cpuid.VAES": "true",
"feature.node.kubernetes.io/cpu-cpuid.VMCBCLEAN": "true",
"feature.node.kubernetes.io/cpu-cpuid.VMPL": "true",
"feature.node.kubernetes.io/cpu-cpuid.VMSA_REGPROT": "true",
"feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ": "true",
"feature.node.kubernetes.io/cpu-cpuid.VTE": "true",
"feature.node.kubernetes.io/cpu-cpuid.WBNOINVD": "true",
"feature.node.kubernetes.io/cpu-cpuid.X87": "true",
"feature.node.kubernetes.io/cpu-cpuid.XGETBV1": "true",
"feature.node.kubernetes.io/cpu-cpuid.XSAVE": "true",
"feature.node.kubernetes.io/cpu-cpuid.XSAVEC": "true",
"feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT": "true",
"feature.node.kubernetes.io/cpu-cpuid.XSAVES": "true",
"feature.node.kubernetes.io/cpu-hardware_multithreading": "false",
"feature.node.kubernetes.io/cpu-model.family": "25",
"feature.node.kubernetes.io/cpu-model.id": "1",
"feature.node.kubernetes.io/cpu-model.vendor_id": "AMD",
"feature.node.kubernetes.io/kernel-config.NO_HZ": "true",
"feature.node.kubernetes.io/kernel-config.NO_HZ_FULL": "true",
"feature.node.kubernetes.io/kernel-selinux.enabled": "true",
"feature.node.kubernetes.io/kernel-version.full": "5.14.0-427.35.1.el9_4.x86_64",
"feature.node.kubernetes.io/kernel-version.major": "5",
"feature.node.kubernetes.io/kernel-version.minor": "14",
"feature.node.kubernetes.io/kernel-version.revision": "0",
"feature.node.kubernetes.io/memory-numa": "true",
"feature.node.kubernetes.io/network-sriov.capable": "true",
"feature.node.kubernetes.io/pci-102b.present": "true",
"feature.node.kubernetes.io/pci-10de.present": "true",
"feature.node.kubernetes.io/pci-10de.sriov.capable": "true",
"feature.node.kubernetes.io/pci-15b3.present": "true",
"feature.node.kubernetes.io/pci-15b3.sriov.capable": "true",
"feature.node.kubernetes.io/rdma.available": "true",
"feature.node.kubernetes.io/rdma.capable": "true",
"feature.node.kubernetes.io/storage-nonrotationaldisk": "true",
"feature.node.kubernetes.io/system-os_release.ID": "rhcos",
"feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION": "4.17",
"feature.node.kubernetes.io/system-os_release.OSTREE_VERSION": "417.94.202409121747-0",
"feature.node.kubernetes.io/system-os_release.RHEL_VERSION": "9.4",
"feature.node.kubernetes.io/system-os_release.VERSION_ID": "4.17",
"feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "4",
"feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "17"
}
Finally we can confirm that a Mellanox network device (PCI vendor ID 15b3) was discovered on each worker.
$ oc describe node | grep -E 'Roles|pci' | grep pci-15b3
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
If everything looks good we can move onto the next operator.
Install and Configure NMState Operator
There might be a need to configure network interfaces on the nodes that were not configured at initial cluster creation time, and the NMState operator is designed for those use cases. The first step is to create a custom resource file that contains the namespace, operator group and subscription.
$ cat <<EOF > nmstate-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
labels:
kubernetes.io/metadata.name: openshift-nmstate
name: openshift-nmstate
name: openshift-nmstate
spec:
finalizers:
- kubernetes
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
annotations:
olm.providedAPIs: NMState.v1.nmstate.io
name: openshift-nmstate
namespace: openshift-nmstate
spec:
targetNamespaces:
- openshift-nmstate
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
labels:
operators.coreos.com/kubernetes-nmstate-operator.openshift-nmstate: ""
name: kubernetes-nmstate-operator
namespace: openshift-nmstate
spec:
channel: stable
installPlanApproval: Automatic
name: kubernetes-nmstate-operator
source: redhat-operators
sourceNamespace: openshift-marketplace
EOF
Then we can take the custom resource file and create it on the cluster.
$ oc create -f nmstate-operator.yaml
namespace/openshift-nmstate created
operatorgroup.operators.coreos.com/openshift-nmstate created
subscription.operators.coreos.com/kubernetes-nmstate-operator created
Next we should validate the operator is up and running.
$ oc get pods -n openshift-nmstate
NAME READY STATUS RESTARTS AGE
nmstate-operator-d587966c9-qkl5m 1/1 Running 0 43s
An NMState instance is required, so we will create a custom resource file for it.
$ cat <<EOF > nmstate-instance.yaml
apiVersion: nmstate.io/v1
kind: NMState
metadata:
name: nmstate
EOF
Then we will create the instance on the cluster.
$ oc create -f nmstate-instance.yaml
nmstate.nmstate.io/nmstate created
Finally we will validate the instance is running.
$ oc get pods -n openshift-nmstate
NAME READY STATUS RESTARTS AGE
nmstate-cert-manager-6dc78dc6bf-ds7kj 1/1 Running 0 17s
nmstate-console-plugin-5b7595c56c-tgzbw 1/1 Running 0 17s
nmstate-handler-lxkd5 1/1 Running 0 17s
nmstate-operator-d587966c9-qkl5m 1/1 Running 0 3m27s
nmstate-webhook-54dbd47d9d-cvsf6 0/1 Running 0 17s
Next we can build a NodeNetworkConfigurationPolicy. The example below will configure a static IP address on the ens8f0np0 interface on nvd-srv-32.
$ cat <<EOF > nncp-static-ip.yaml
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
name: ens8f0np0-policy
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
desiredState:
interfaces:
- name: ens8f0np0
description: Configuring ens8f0np0 on nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
type: ethernet
state: up
ipv4:
dhcp: false
address:
- ip: 10.6.145.32
prefix-length: 24
enabled: true
EOF
Once we have the custom resource file we can create it on the cluster.
$ oc create -f nncp-static-ip.yaml
nodenetworkconfigurationpolicy.nmstate.io/ens8f0np0-policy created
$ oc get nncp -A
NAME STATUS REASON
ens8f0np0-policy Available SuccessfullyConfigured
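The policy status rolls up the per-node enactments. If a policy ever reports Degraded or stays in progressing, the NodeNetworkConfigurationEnactment objects usually show which node failed and why; the enactment is typically named <node>.<policy>, so the exact name below is an assumption based on this lab:
$ oc get nnce
$ oc get nnce nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com.ens8f0np0-policy -o yaml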
We can validate that the IP address is set by looking at the interface inside the node.
$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-32nvidiaengrdu2dcredhatcom-debug-8mx6q ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.11
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ip address show dev ens8f0np0
96: ens8f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 58:a2:e1:e1:42:78 brd ff:ff:ff:ff:ff:ff
altname enp160s0f0np0
inet 10.6.145.32/24 brd 10.6.145.255 scope global noprefixroute ens8f0np0
valid_lft forever preferred_lft forever
inet6 fe80::c397:5afa:d618:e752/64 scope link noprefixroute
valid_lft forever preferred_lft forever
If everything looks good we can proceed to the next operator.
Install and Configure SRIOV Operator
Now we need to create the SRIOV Operator custom resource file to create the namespace, operator group and subscription.
$ cat << EOF > openshift-sriov-network-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
name: openshift-sriov-network-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
name: sriov-network-operators
namespace: openshift-sriov-network-operator
spec:
targetNamespaces:
- openshift-sriov-network-operator
upgradeStrategy: Default
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: sriov-network-operator-subscription
namespace: openshift-sriov-network-operator
spec:
channel: stable
installPlanApproval: Automatic
name: sriov-network-operator
source: redhat-operators
sourceNamespace: openshift-marketplace
EOF
Now we can create the SRIOV resource on the cluster.
$ oc create -f openshift-sriov-network-operator.yaml
namespace/openshift-sriov-network-operator created
operatorgroup.operators.coreos.com/sriov-network-operators created
subscription.operators.coreos.com/sriov-network-operator-subscription created
We can validate the operator is running by looking at the pod output.
$ oc get pods -n openshift-sriov-network-operator
NAME READY STATUS RESTARTS AGE
sriov-network-operator-7cb6c49868-89486 1/1 Running 0 22s
Next we will need to create the default SriovOperatorConfig configuration file.
$ cat <<EOF > sriov-operator-config.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
name: default
namespace: openshift-sriov-network-operator
spec:
enableInjector: true
enableOperatorWebhook: true
logLevel: 2
EOF
Then create the resource on the cluster.
$ oc create -f sriov-operator-config.yaml
sriovoperatorconfig.sriovnetwork.openshift.io/default created
For the default SriovOperatorConfig to work with the MLNX_OFED container, please run the following patch command.
$ oc patch sriovoperatorconfig default --type=merge -n openshift-sriov-network-operator --patch '{ "spec": { "configDaemonNodeSelector": { "network.nvidia.com/operator.mofed.wait": "false", "node-role.kubernetes.io/worker": "", "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true" } } }'
sriovoperatorconfig.sriovnetwork.openshift.io/default patched
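To confirm the patch took effect and that the config daemon only targets the intended workers, we can inspect the SriovOperatorConfig and then watch for the config daemon pods to appear on the worker nodes:
$ oc get sriovoperatorconfig default -n openshift-sriov-network-operator -o yaml | grep -A4 configDaemonNodeSelector
$ oc get pods -n openshift-sriov-network-operator -o wide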
If everything looks good we can proceed to installing the next operator.
Install and Configure Network Operator
To get started we will generate an NVIDIA Network Operator custom resource file that will create the namespace, operator group and subscription.
$ cat <<EOF > network-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
name: nvidia-network-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
name: nvidia-network-operator
namespace: nvidia-network-operator
spec:
targetNamespaces:
- nvidia-network-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: nvidia-network-operator
namespace: nvidia-network-operator
spec:
channel: v24.10.0
installPlanApproval: Automatic
name: nvidia-network-operator
source: certified-operators
sourceNamespace: openshift-marketplace
EOF
Next we can create the resources on the cluster.
$ oc create -f network-operator.yaml
namespace/nvidia-network-operator created
operatorgroup.operators.coreos.com/nvidia-network-operator created
subscription.operators.coreos.com/nvidia-network-operator created
We can then validate that the network operator has installed and is running by confirming the controller is running in the nvidia-network-operator
namespace.
$ oc get pods -n nvidia-network-operator
NAME READY STATUS RESTARTS AGE
nvidia-network-operator-controller-manager-6f7d6956cd-fw5wg 1/1 Running 0 5m
With the operator up we can create the NicClusterPolicy custom resource file. Note that in this file I have hard-coded the InfiniBand interface ibs2f0 and the Ethernet interface ens8f0np0 that I will be using as my shared RDMA devices. From what I have experienced, both cannot be defined in the policy at the same time; both are shown here only to illustrate that either an Ethernet or an InfiniBand interface can be used. These could be different devices depending on the system configuration.
$ cat <<EOF > network-sharedrdma-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
name: nic-cluster-policy
spec:
nicFeatureDiscovery:
image: nic-feature-discovery
repository: ghcr.io/mellanox
version: v0.0.1
docaTelemetryService:
image: doca_telemetry
repository: nvcr.io/nvidia/doca
version: 1.16.5-doca2.6.0-host
rdmaSharedDevicePlugin:
config: |
{
"configList": [
{
"resourceName": "rdma_shared_device_ib",
"rdmaHcaMax": 63,
"selectors": {
"ifNames": ["ibs2f0"]
}
},
{
"resourceName": "rdma_shared_device_eth",
"rdmaHcaMax": 63,
"selectors": {
"ifNames": ["ens8f0np0"]
}
}
]
}
image: k8s-rdma-shared-dev-plugin
repository: ghcr.io/mellanox
version: v1.5.1
secondaryNetwork:
ipoib:
image: ipoib-cni
repository: ghcr.io/mellanox
version: v1.2.0
nvIpam:
enableWebhook: false
image: nvidia-k8s-ipam
repository: ghcr.io/mellanox
version: v0.2.0
ofedDriver:
readinessProbe:
initialDelaySeconds: 10
periodSeconds: 30
forcePrecompiled: false
terminationGracePeriodSeconds: 300
livenessProbe:
initialDelaySeconds: 30
periodSeconds: 30
upgradePolicy:
autoUpgrade: true
drain:
deleteEmptyDir: true
enable: true
force: true
timeoutSeconds: 300
podSelector: ''
maxParallelUpgrades: 1
safeLoad: false
waitForCompletion:
timeoutSeconds: 0
startupProbe:
initialDelaySeconds: 10
periodSeconds: 20
image: doca-driver
repository: nvcr.io/nvidia/mellanox
version: 24.10-0.7.0.0-0
env:
- name: UNLOAD_STORAGE_MODULES
value: "true"
- name: RESTORE_DRIVER_ON_POD_TERMINATION
value: "true"
- name: CREATE_IFNAMES_UDEV
value: "true"
EOF
Next we can create the NicClusterPolicy
custom resource on the cluster.
$ oc create -f network-sharedrdma-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created
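The DOCA driver build and rollout can take a while. Before inspecting individual pods, we can poll the policy status itself; recent operator versions expose a state field on the custom resource (treat the exact field name as an assumption for your version):
$ oc get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}{"\n"}'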
We can validate the NicClusterPolicy
by running a few commands in the DOCA/MOFED container.
$ oc get pods -n nvidia-network-operator
NAME READY STATUS RESTARTS AGE
doca-telemetry-service-hwj65 1/1 Running 2 160m
kube-ipoib-cni-ds-fsn8g 1/1 Running 2 160m
mofed-rhcos4.16-9b5ddf4c6-ds-ct2h5 2/2 Running 4 160m
nic-feature-discovery-ds-dtksz 1/1 Running 2 160m
nv-ipam-controller-854585f594-c5jpp 1/1 Running 2 160m
nv-ipam-controller-854585f594-xrnp5 1/1 Running 2 160m
nv-ipam-node-xqttl 1/1 Running 2 160m
nvidia-network-operator-controller-manager-5798b564cd-5cq99 1/1 Running 2 5d23h
rdma-shared-dp-ds-p9vvg 1/1 Running 0 85m
And we can rsh
into the mofed
container to check a few things.
$ MOFED_POD=$(oc get pods -n nvidia-network-operator -o name | grep mofed)
$ oc rsh -n nvidia-network-operator -c mofed-container ${MOFED_POD}
sh-5.1# ofed_info -s
OFED-internal-24.10-0.7.0.0-0:
sh-5.1# ibdev2netdev -v
0000:0d:00.0 mlx5_0 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket fw 32.42.1000 port 1 (ACTIVE) ==> ibs2f0 (Up)
0000:a0:00.0 mlx5_1 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket fw 32.42.1000 port 1 (ACTIVE) ==> ens8f0np0 (Up)
Now we need to create an IPoIBNetwork custom resource file (for InfiniBand-based interfaces).
$ cat <<EOF > ipoib-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
name: example-ipoibnetwork
spec:
ipam: |
{
"type": "whereabouts",
"range": "192.168.6.225/28",
"exclude": [
"192.168.6.229/30",
"192.168.6.236/32"
]
}
master: ibs2f0
networkNamespace: default
EOF
And then create the IPoIBNetwork
resource on the cluster.
$ oc create -f ipoib-network.yaml
ipoibnetwork.mellanox.com/example-ipoibnetwork created
We will do the same thing for our ethernet interface but this will be a MacvlanNetwork
custom resource file.
$ cat <<EOF > macvlan-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
name: rdmashared-net
spec:
networkNamespace: default
master: ens8f0np0
mode: bridge
mtu: 1500
ipam: '{"type": "whereabouts", "range": "192.168.2.0/24", "gateway": "192.168.2.1"}'
EOF
Then create the resource on the cluster.
$ oc create -f macvlan-network.yaml
macvlannetwork.mellanox.com/rdmashared-net created
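Both the IPoIBNetwork and MacvlanNetwork custom resources render NetworkAttachmentDefinition objects in the target namespace, and those are what the pod annotations reference later. A quick check that both exist:
$ oc get network-attachment-definitions -n default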
If everything looks good we can proceed to the next operator.
Install and Configure GPU Operator
The next operator we need to configure is the NVIDIA GPU Operator. As with most operators, we will generate a custom resource file that creates the namespace, operator group and subscription.
$ cat <<EOF > gpu-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
name: nvidia-gpu-operator
namespace: nvidia-gpu-operator
spec:
targetNamespaces:
- nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: nvidia-gpu-operator
namespace: nvidia-gpu-operator
spec:
channel: "v24.9"
installPlanApproval: Automatic
name: gpu-operator-certified
source: certified-operators
sourceNamespace: openshift-marketplace
EOF
Next we can create the resources on the cluster.
$ oc create -f gpu-operator.yaml
namespace/nvidia-gpu-operator created
operatorgroup.operators.coreos.com/nvidia-gpu-operator created
subscription.operators.coreos.com/nvidia-gpu-operator created
We can check that the operator pod is running by looking at the pods under the namespace.
$ oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-operator-b4cb7d74-zxpwq 1/1 Running 0 32s
Now that we have the operator running we need to create a GPU cluster policy custom resource file like the one below.
$ cat <<EOF > gpu-cluster-policy.yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
name: gpu-cluster-policy
spec:
vgpuDeviceManager:
config:
default: default
enabled: true
migManager:
config:
default: all-disabled
name: default-mig-parted-config
enabled: true
operator:
defaultRuntime: crio
initContainer: {}
runtimeClass: nvidia
use_ocp_driver_toolkit: true
dcgm:
enabled: true
gfd:
enabled: true
dcgmExporter:
config:
name: ''
serviceMonitor:
enabled: true
enabled: true
cdi:
default: false
enabled: false
driver:
licensingConfig:
nlsEnabled: true
configMapName: ''
certConfig:
name: ''
rdma:
enabled: true
kernelModuleConfig:
name: ''
upgradePolicy:
autoUpgrade: true
drain:
deleteEmptyDir: false
enable: false
force: false
timeoutSeconds: 300
maxParallelUpgrades: 1
maxUnavailable: 25%
podDeletion:
deleteEmptyDir: false
force: false
timeoutSeconds: 300
waitForCompletion:
timeoutSeconds: 0
repoConfig:
configMapName: ''
virtualTopology:
config: ''
enabled: true
useNvidiaDriverCRD: false
useOpenKernelModules: true
devicePlugin:
config:
name: ''
default: ''
mps:
root: /run/nvidia/mps
enabled: true
gdrcopy:
enabled: true
kataManager:
config:
artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
mig:
strategy: single
sandboxDevicePlugin:
enabled: true
validator:
plugin:
env:
- name: WITH_WORKLOAD
value: 'false'
nodeStatusExporter:
enabled: true
daemonsets:
rollingUpdate:
maxUnavailable: '1'
updateStrategy: RollingUpdate
sandboxWorkloads:
defaultWorkload: container
enabled: false
gds:
enabled: true
image: nvidia-fs
version: 2.20.5
repository: nvcr.io/nvidia/cloud-native
vgpuManager:
enabled: false
vfioManager:
enabled: true
toolkit:
installDir: /usr/local/nvidia
enabled: true
EOF
With the GPU ClusterPolicy custom resource file generated, let's create it on the cluster.
$ oc create -f gpu-cluster-policy.yaml
clusterpolicy.nvidia.com/gpu-cluster-policy created
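The driver build and validation can take several minutes. Rather than repeatedly listing pods, we can also watch the ClusterPolicy state, which flips to ready once all components are deployed. A minimal sketch, assuming a reasonably recent oc client for the jsonpath wait:
$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'
$ oc wait clusterpolicy/gpu-cluster-policy --for=jsonpath='{.status.state}'=ready --timeout=30m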
After some time, all the pods are up and running.
$ oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-d5ngn 1/1 Running 0 3m20s
gpu-feature-discovery-z42rx 1/1 Running 0 3m23s
gpu-operator-6bb4d4b4c5-njh78 1/1 Running 0 4m35s
nvidia-container-toolkit-daemonset-bkh8l 1/1 Running 0 3m20s
nvidia-container-toolkit-daemonset-c4hzm 1/1 Running 0 3m23s
nvidia-cuda-validator-4blvg 0/1 Completed 0 106s
nvidia-cuda-validator-tw8sl 0/1 Completed 0 112s
nvidia-dcgm-exporter-rrw4g 1/1 Running 0 3m20s
nvidia-dcgm-exporter-xc78t 1/1 Running 0 3m23s
nvidia-dcgm-nvxpf 1/1 Running 0 3m20s
nvidia-dcgm-snj4j 1/1 Running 0 3m23s
nvidia-device-plugin-daemonset-fk2xz 1/1 Running 0 3m23s
nvidia-device-plugin-daemonset-wq87j 1/1 Running 0 3m20s
nvidia-driver-daemonset-416.94.202410211619-0-ngrjg 4/4 Running 0 3m58s
nvidia-driver-daemonset-416.94.202410211619-0-tm4x6 4/4 Running 0 3m58s
nvidia-node-status-exporter-jlzxh 1/1 Running 0 3m57s
nvidia-node-status-exporter-zjffs 1/1 Running 0 3m57s
nvidia-operator-validator-l49hx 1/1 Running 0 3m20s
nvidia-operator-validator-n44nn 1/1 Running 0 3m23s
Once we see the pods running above, we can remote shell into the NVIDIA driver daemonset pod and confirm two items. The first is that the nvidia modules are loaded, specifically that the nvidia_peermem module is present. The second is that the nvidia-smi utility shows the details about the driver and the hardware.
$ oc rsh -n nvidia-gpu-operator $(oc -n nvidia-gpu-operator get pod -o name -l app.kubernetes.io/component=nvidia-driver)
sh-4.4# lsmod|grep nvidia
nvidia_fs 327680 0
nvidia_peermem 24576 0
nvidia_modeset 1507328 0
video 73728 1 nvidia_modeset
nvidia_uvm 6889472 8
nvidia 8810496 43 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
ib_uverbs 217088 3 nvidia_peermem,rdma_ucm,mlx5_ib
drm 741376 5 drm_kms_helper,drm_shmem_helper,nvidia,mgag200
sh-4.4# nvidia-smi
Wed Nov 6 22:03:53 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A40 On | 00000000:61:00.0 Off | 0 |
| 0% 37C P0 88W / 300W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A40 On | 00000000:E1:00.0 Off | 0 |
| 0% 28C P8 29W / 300W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
While we are in the driver pod we should also set the GPU clock to maximum using the following nvidia-smi
command. This is optional but why not have it at full speed.
$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-416.94.202410172137-0-ndhzc
sh-4.4# nvidia-smi -i 0 -lgc $(nvidia-smi -i 0 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1)
GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:61:00.0
All done.
sh-4.4# nvidia-smi -i 1 -lgc $(nvidia-smi -i 1 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1)
GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:E1:00.0
All done.
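On nodes with more GPUs, the same thing can be done in a small loop rather than per index; a minimal sketch using only nvidia-smi queries:
sh-4.4# for i in $(nvidia-smi --query-gpu=index --format=csv,noheader); do nvidia-smi -i $i -lgc $(nvidia-smi -i $i --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1); done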
One last thing we can do is validate our resources are available from a node describe perspective.
$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A9
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596712Ki
nvidia.com/gpu: 2
pods: 250
rdma/rdma_shared_device_eth: 63
rdma/rdma_shared_device_ib: 63
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445736Ki
nvidia.com/gpu: 2
pods: 250
rdma/rdma_shared_device_eth: 63
rdma/rdma_shared_device_ib: 63
--
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596672Ki
nvidia.com/gpu: 2
pods: 250
rdma/rdma_shared_device_eth: 63
rdma/rdma_shared_device_ib: 63
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445696Ki
nvidia.com/gpu: 2
pods: 250
rdma/rdma_shared_device_eth: 63
rdma/rdma_shared_device_ib: 63
If everything looks good we can proceed to actual RDMA testing.
The Shared Device RDMA Testing
This section will cover running workload pods across the nodes in the environment. We will set up the required privileges, create the workload pods, validate connectivity between the two hosts on the InfiniBand fabric, and then run a performance test.
Create Service Account
First let's generate a service account custom resource file to use in the default namespace.
$ cat <<EOF > default-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: rdma
namespace: default
EOF
Next we can create it on our cluster.
$ oc create -f default-serviceaccount.yaml
serviceaccount/rdma created
Finally, with the service account created, we can add privileges to it.
$ oc -n default adm policy add-scc-to-user privileged -z rdma
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "rdma"
If everything looks good we can move onto creating the workload pods.
Create Workload Pods for IB
With the service account setup we now need to create a workload pod that contains all the tooling for our testing. We can generate a custom pod resource file for each worker node as follows to meet that requirement.
$ cat <<EOF > rdma-ib-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
name: rdma-ib-32-workload
namespace: default
annotations:
k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
serviceAccountName: rdma
containers:
- image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
name: rdma-ib-32-workload
command:
- sh
- -c
- sleep inf
securityContext:
privileged: true
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/gpu: 1
rdma/rdma_shared_device_ib: 1
requests:
nvidia.com/gpu: 1
rdma/rdma_shared_device_ib: 1
EOF
$ cat <<EOF > rdma-ib-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
name: rdma-ib-33-workload
namespace: default
annotations:
k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
serviceAccountName: rdma
containers:
- image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
name: rdma-ib-33-workload
command:
- sh
- -c
- sleep inf
securityContext:
privileged: true
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/gpu: 1
rdma/rdma_shared_device_ib: 1
requests:
nvidia.com/gpu: 1
rdma/rdma_shared_device_ib: 1
EOF
Then we can create the pods on the cluster.
$ oc create -f rdma-ib-32-workload.yaml
pod/rdma-ib-32-workload created
$ oc create -f rdma-ib-33-workload.yaml
pod/rdma-ib-33-workload created
Let's validate the pods are running.
$ oc get pods
NAME READY STATUS RESTARTS AGE
rdma-ib-32-workload 1/1 Running 0 10s
rdma-ib-33-workload 1/1 Running 0 3s
With the pods up and running we can validate connectivity.
Validate IB Connectivity
This section will cover confirming that the InfiniBand connectivity is working between the systems. It is optional, but it provides a lot of good InfiniBand troubleshooting tips. First we should rsh into each rdma-ib workload pod.
$ oc rsh -n default rdma-ib-32-workload
sh-5.1#
The first command we can run is the ibhosts command, which shows the InfiniBand host nodes in the topology.
sh-5.1# ibhosts
Ca : 0x58a2e10300e14446 ports 1 "nvd-srv-33 mlx5_0"
Ca : 0x58a2e10300dfe416 ports 1 "nvd-srv-32 mlx5_0"
We can also run the ibnodes
command which will show not only the nodes but also switches in the topology.
sh-5.1# ibnodes
Ca : 0x58a2e10300e14446 ports 1 "nvd-srv-33 mlx5_0"
Ca : 0x58a2e10300dfe416 ports 1 "nvd-srv-32 mlx5_0"
Switch : 0xfc6a1c0300e7ecc0 ports 129 "MF0;qm9700-ib:MQM9700/U1" enhanced port 0 lid 1 lmc 0
We can look deeper into an interface's state by using the ibstatus command and passing an interface name. If no interface is passed, all interfaces are displayed.
sh-5.1# ibstatus mlx5_0
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:58a2:e103:00df:e416
base lid: 0x4
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 400 Gb/sec (4X NDR)
link_layer: InfiniBand
Now that we have familiarized ourselves with the environment, we can run ibstat and grep out only certain key elements of the output. These will be needed for the ibping test.
The first ibstat output is from our first node, which will act as the server side for the ibping command.
sh-5.1# ibstat | egrep "Port|Base|Link"
Port 1:
Physical state: LinkUp
Base lid: 4
Port GUID: 0x58a2e10300e14446
Link layer: InfiniBand
Port 1:
Physical state: LinkUp
Base lid: 0
Port GUID: 0x0000000000000000
Link layer: Ethernet
The output above shows both an InfiniBand and an Ethernet interface. We are only interested in the InfiniBand interface in this use case. Make note of the Base lid number, as that is used in the ibping command on the client side.
We can run the same command on the client side and notice that while some of the details are similar, the lid number is unique along with the port GUID.
sh-5.1# ibstat | egrep "Port|Base|Link"
Port 1:
Physical state: LinkUp
Base lid: 5
Port GUID: 0x58a2e10300e14446
Link layer: InfiniBand
Port 1:
Physical state: LinkUp
Base lid: 0
Port GUID: 0x0000000000000000
Link layer: Ethernet
Next we can run ibping with the server switch (-S) on the first workload pod.
sh-5.1# ibping -S -P 1 -d
ibdebug: [114] ibping_serv: starting to serve...
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
ibwarn: [114] mad_respond_via: dest Lid 5
ibwarn: [114] mad_respond_via: qp 0x1 class 0x32 method 129 attr 0x0 mod 0x0 datasz 0 off 0 qkey 80010000
ibdebug: [114] ibping_serv: Pong: rdma-workload-client.(none)
And on the second workload pod we can run an ibping command to ping the server side we started on the other pod.
sh-5.1# ibping -P 1 4
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.011 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.014 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.013 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Pong from rdma-workload-client.(none) (Lid 4): time 0.012 ms
Once we have completed confirming connectivity we can move onto the performance testing.
Performance Test Across IB Link
Now we want to run a test across the two pods running. We will need to rsh into the first pod and run the ib_write_bw
command. Then we will rsh into the second pod in a different terminal window and run the ib_write_bw <ipaddress>
command.
$ oc get pods -n default
NAME READY STATUS RESTARTS AGE
rdma-ib-32-workload 1/1 Running 0 8m12s
rdma-ib-33-workload 1/1 Running 0 8m5s
First let's get the IP address of the first pod.
$ oc get pod rdma-ib-32-workload -o yaml | grep -E 'default/example-ipoibnetwork' -A3
"name": "default/example-ipoibnetwork",
"interface": "net1",
"ips": [
"192.168.6.225"
Now rsh
into the first pod and run the ib_write_bw
command and leave that terminal open.
$ oc rsh -n default rdma-ib-32-workload
sh-5.1# ib_write_bw
************************************
* Waiting for client to connect... *
************************************
Then open another terminal and rsh
to the second pod and run ib_write_bw 192.168.6.225
.
$ oc rsh -n default rdma-ib-33-workload
sh-5.1# ib_write_bw 192.168.6.225
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x05 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007fcbace2f000
remote address: LID 0x04 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007f360e3d8000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 2500.000000 != 3495.887000. CPU Frequency is not max.
65536 5000 44604.62 44576.86 0.713230
---------------------------------------------------------------------------------------
If we go back to the first terminal on pod number one we should also see similar response results.
sh-5.1# ib_write_bw
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x04 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007f360e3d8000
remote address: LID 0x05 QPN 0x0cb9 PSN 0xf5fbfc RKey 0x200000 VAddr 0x007fcbace2f000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 5000 44604.62 44576.86 0.713230
---------------------------------------------------------------------------------------
We can now clean up the pods since testing is over and move onto the next test.
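A minimal cleanup, assuming nothing else is using these pods, is simply to delete them:
$ oc delete pod rdma-ib-32-workload rdma-ib-33-workload -n default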
Create Workload Pods for ETH
Now we need to test RDMA over the Ethernet interfaces. We can generate a custom pod resource file for both nodes as follows to meet that requirement.
$ cat <<EOF > rdma-eth-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
name: rdma-eth-32-workload
namespace: default
annotations:
k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
serviceAccountName: rdma
containers:
- image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
name: rdma-eth-32-workload
command:
- sh
- -c
- sleep inf
securityContext:
privileged: true
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/gpu: 1
rdma/rdma_shared_device_eth: 1
requests:
nvidia.com/gpu: 1
rdma/rdma_shared_device_eth: 1
EOF
$ cat <<EOF > rdma-eth-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
name: rdma-eth-33-workload
namespace: default
annotations:
k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
serviceAccountName: rdma
containers:
- image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
name: rdma-eth-33-workload
command:
- sh
- -c
- sleep inf
securityContext:
privileged: true
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/gpu: 1
rdma/rdma_shared_device_eth: 1
requests:
nvidia.com/gpu: 1
rdma/rdma_shared_device_eth: 1
EOF
Then we can create the pods on the cluster.
$ oc create -f rdma-eth-32-workload.yaml
pod/rdma-eth-32-workload created
$ oc create -f rdma-eth-33-workload.yaml
pod/rdma-eth-33-workload created
Let's validate the pods are running.
$ oc get pods -n default
NAME READY STATUS RESTARTS AGE
rdma-eth-32-workload 1/1 Running 0 25s
rdma-eth-33-workload 1/1 Running 0 22s
With the pods up and running we can move onto the actual test.
Performance Test Across ETH Link
Now we want to run a test across the two pods running. We will need to rsh into the first pod and run the ib_write_bw
command. Then we will rsh into the second pod in a different terminal window and run the ib_write_bw <ipaddress>
command.
$ oc get pods -n default
NAME READY STATUS RESTARTS AGE
rdma-eth-32-workload 1/1 Running 0 106s
rdma-eth-33-workload 1/1 Running 0 103s
First let's get the IP address of the first pod.
$ oc get pod rdma-eth-32-workload -o yaml | grep -E 'default/rdmashared' -A3
"name": "default/rdmashared-net",
"interface": "net1",
"ips": [
"192.168.2.1"
Now rsh
into the first pod and run the ib_write_bw
command and leave that terminal open.
$ oc rsh -n default rdma-eth-32-workload
sh-5.1# ib_write_bw
************************************
* Waiting for client to connect... *
************************************
Then open another terminal and rsh to the second pod and run ib_write_bw 192.168.2.1.
$ oc rsh -n default rdma-eth-33-workload
sh-5.1# ib_write_bw 192.168.2.1
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x05 QPN 0x0ce2 PSN 0x5389f7 RKey 0x1fff00 VAddr 0x007f7368df3000
remote address: LID 0x04 QPN 0x0ce2 PSN 0x81fa7f RKey 0x1fff00 VAddr 0x007f7e8c890000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 2500.000000 != 3497.359000. CPU Frequency is not max.
65536 5000 44490.32 44467.35 0.711478
---------------------------------------------------------------------------------------
If we go back to the first terminal on pod number one we should also see similar response results.
sh-5.1# ib_write_bw
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x04 QPN 0x0ce2 PSN 0x81fa7f RKey 0x1fff00 VAddr 0x007f7e8c890000
remote address: LID 0x05 QPN 0x0ce2 PSN 0x5389f7 RKey 0x1fff00 VAddr 0x007f7368df3000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 5000 44490.32 44467.35 0.711478
---------------------------------------------------------------------------------------
We can now clean up the pods since testing is over and move onto the next test.
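As before, cleanup is just a matter of deleting the two workload pods:
$ oc delete pod rdma-eth-32-workload rdma-eth-33-workload -n default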
The Host Device RDMA Testing
This section will demonstrate how to configure host device RDMA for the NVIDIA Network Operator and then how to test the per-pod configuration.
Configure Nic Cluster Policy for Host Device
The operator should already be running from the previous steps. If a NicClusterPolicy exists, we need to delete the existing one and generate a new host device NicClusterPolicy custom resource file.
$ cat <<EOF > network-hostdev-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
name: nic-cluster-policy
spec:
ofedDriver:
image: doca-driver
repository: nvcr.io/nvidia/mellanox
version: 24.10-0.7.0.0-0
startupProbe:
initialDelaySeconds: 10
periodSeconds: 20
livenessProbe:
initialDelaySeconds: 30
periodSeconds: 30
readinessProbe:
initialDelaySeconds: 10
periodSeconds: 30
env:
- name: UNLOAD_STORAGE_MODULES
value: "true"
- name: RESTORE_DRIVER_ON_POD_TERMINATION
value: "true"
- name: CREATE_IFNAMES_UDEV
value: "true"
sriovDevicePlugin:
image: sriov-network-device-plugin
repository: ghcr.io/k8snetworkplumbingwg
version: v3.7.0
config: |
{
"resourceList": [
{
"resourcePrefix": "nvidia.com",
"resourceName": "hostdev",
"selectors": {
"vendors": ["15b3"],
"isRdma": true
}
}
]
}
EOF
Next we can create the NicClusterPolicy
custom resource on the cluster.
$ oc create -f network-hostdev-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created
We can validate the host device NicClusterPolicy
by running a few commands in the DOCA/MOFED container.
$ oc get pods -n nvidia-network-operator
NAME READY STATUS RESTARTS AGE
mofed-rhcos4.16-696886fcb4-ds-9sgvd 2/2 Running 0 2m37s
mofed-rhcos4.16-696886fcb4-ds-lkjd4 2/2 Running 0 2m37s
nvidia-network-operator-controller-manager-68d547dbbd-qsdkf 1/1 Running 0 141m
sriov-device-plugin-6v2nz 1/1 Running 0 2m14s
sriov-device-plugin-hc4t8 1/1 Running 0 2m14s
We can also confirm that the resources show up in the oc describe node output.
$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A7
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596708Ki
nvidia.com/hostdev: 2
pods: 250
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445732Ki
nvidia.com/hostdev: 2
pods: 250
--
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596704Ki
nvidia.com/hostdev: 2
pods: 250
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445728Ki
nvidia.com/hostdev: 2
pods: 250
Now we need to create a HostDeviceNetwork
custom resource file.
$ cat <<EOF > hostdev-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
name: hostdev-net
spec:
networkNamespace: "default"
resourceName: "hostdev"
ipam: |
{
"type": "whereabouts",
"range": "192.168.3.225/28",
"exclude": [
"192.168.3.229/30",
"192.168.3.236/32"
]
}
EOF
And then create the HostDeviceNetwork
resource on the cluster.
$ oc create -f hostdev-network.yaml
hostdevicenetwork.mellanox.com/hostdev-net created
Let's validate our resources are showing up properly.
$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A8
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596708Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 2
pods: 250
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445732Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 2
pods: 250
--
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596680Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 2
pods: 250
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445704Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 2
pods: 250
This concludes the NicClusterPolicy configuration for the host device section.
Create Workload Pods and Perf Test Host Device
Now we need to create a workload pod that contains all the tooling for our host device testing. We can generate a custom pod file for each node as follows to meet that requirement.
$ cat << EOF > hostdev-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
name: hostdev-32-workload
namespace: default
annotations:
k8s.v1.cni.cncf.io/networks: hostdev-net
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
serviceAccountName: rdma
containers:
- image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
name: hostdev-32-workload
command:
- sh
- -c
- sleep inf
securityContext:
privileged: true
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/gpu: 1
nvidia.com/hostdev: 1
requests:
nvidia.com/gpu: 1
nvidia.com/hostdev: 1
EOF
$ cat <<EOF > hostdev-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
name: hostdev-33-workload
namespace: default
annotations:
k8s.v1.cni.cncf.io/networks: hostdev-net
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
serviceAccountName: rdma
containers:
- image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
name: hostdev-33-workload
command:
- sh
- -c
- sleep inf
securityContext:
privileged: true
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/gpu: 1
nvidia.com/hostdev: 1
requests:
nvidia.com/gpu: 1
nvidia.com/hostdev: 1
EOF
Then we can create the pods on the cluster.
$ oc create -f hostdev-32-workload.yaml
pod/hostdev-32-workload created
$ oc create -f hostdev-33-workload.yaml
pod/hostdev-33-workload created
Let's validate the pods are running.
$ oc get pods -n default
NAME READY STATUS RESTARTS AGE
hostdev-32-workload 1/1 Running 0 73s
hostdev-33-workload 1/1 Running 0 12s
First let's get the IP address of the first pod.
$ oc get pod hostdev-32-workload -o yaml | grep -E 'default/hostdev-net' -A3
"name": "default/hostdev-net",
"interface": "net1",
"ips": [
"192.168.3.225"
Now rsh
into the first pod and run the ib_write_bw
command and leave that terminal open.
$ oc rsh -n default hostdev-32-workload
sh-5.1# ib_write_bw
************************************
* Waiting for client to connect... *
************************************
Then open another terminal, rsh into the second pod, and run ib_write_bw 192.168.3.225.
$ oc rsh -n default hostdev-33-workload
sh-5.1# ib_write_bw 192.168.3.225
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x05 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007fe688c97000
remote address: LID 0x04 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007f1f0249d000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 2500.000000 != 3498.323000. CPU Frequency is not max.
65536 5000 44351.41 44328.98 0.709264
---------------------------------------------------------------------------------------
If we go back to the first terminal on pod one, we should see similar results on the server side.
sh-5.1# ib_write_bw
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x04 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007f1f0249d000
remote address: LID 0x05 QPN 0x0046 PSN 0x84468b RKey 0x1fffbd VAddr 0x007fe688c97000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 5000 44351.41 44328.98 0.709264
---------------------------------------------------------------------------------------
We can now clean up the pods since this test is over and move on to the next test.
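For example, the workload pods can be removed using the manifests we generated earlier:
$ oc delete -f hostdev-32-workload.yaml
$ oc delete -f hostdev-33-workload.yaml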
SRIOV Legacy Mode RDMA Testing
This deployment mode supports SR-IOV in legacy mode.
Configure Nic Cluster Policy for SRIOV Legacy
First we need to create a NicClusterPolicy
which for SRIOV legacy mode is fairly generic. Generate the custom resource file below. If a NicClusterPolicy already exists on the cluster, remove it first.
$ cat <<EOF > network-sriovleg-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
name: nic-cluster-policy
spec:
ofedDriver:
image: doca-driver
repository: nvcr.io/nvidia/mellanox
version: 24.10-0.7.0.0-0
startupProbe:
initialDelaySeconds: 10
periodSeconds: 20
livenessProbe:
initialDelaySeconds: 30
periodSeconds: 30
readinessProbe:
initialDelaySeconds: 10
periodSeconds: 30
env:
- name: UNLOAD_STORAGE_MODULES
value: "true"
- name: RESTORE_DRIVER_ON_POD_TERMINATION
value: "true"
- name: CREATE_IFNAMES_UDEV
value: "true"
EOF
Now let's create the policy on the cluster.
$ oc create -f network-sriovleg-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created
Before we continue we can validate the pods are up.
$ oc get pods -n nvidia-network-operator
NAME READY STATUS RESTARTS AGE
mofed-rhcos4.16-696886fcb4-ds-4mb42 2/2 Running 0 40s
mofed-rhcos4.16-696886fcb4-ds-8knwq 2/2 Running 0 40s
nvidia-network-operator-controller-manager-68d547dbbd-qsdkf 1/1 Running 13 (4d ago) 4d21h
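We can also confirm the policy itself has reconciled. In the operator versions I tested, the NicClusterPolicy exposes an overall state in its status (field names may differ slightly between releases):
$ oc get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}{"\n"}'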
Now we need to create a SriovNetworkNodePolicy
which will generate the VFs for the device we want to operate in SRIOV legacy mode. Generate the custom resource file below. Note the pfNames value ens8f0np0#0-7, which partitions the physical function and exposes VFs 0 through 7 as the sriovlegacy resource.
$ cat <<EOF > sriov-network-node-policy.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
name: sriov-legacy-policy
namespace: openshift-sriov-network-operator
spec:
deviceType: netdevice
mtu: 1500
nicSelector:
vendor: "15b3"
pfNames: ["ens8f0np0#0-7"]
nodeSelector:
feature.node.kubernetes.io/pci-15b3.present: "true"
numVfs: 8
priority: 90
isRdma: true
resourceName: sriovlegacy
EOF
Next we can create the custom resource on the cluster. As a note, make sure SR-IOV Global Enable is enabled, as described in the Red Hat knowledge base article.
$ oc create -f sriov-network-node-policy.yaml
sriovnetworknodepolicy.sriovnetwork.openshift.io/sriov-legacy-policy created
The nodes will go through a reboot process: each one has scheduling disabled and is rebooted so the configuration takes effect.
$ oc get nodes
NAME STATUS ROLES AGE VERSION
edge-19.edge.lab.eng.rdu2.redhat.com Ready control-plane,master,worker 5d v1.29.8+632b078
nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com Ready worker 4d22h v1.29.8+632b078
nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com NotReady,SchedulingDisabled worker 4d22h v1.29.8+632b078
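While waiting, the SR-IOV Network Operator reports per-node progress through SriovNetworkNodeState objects, which is a convenient way to watch the rollout. A quick sketch (syncStatus should move from InProgress to Succeeded):
$ oc get sriovnetworknodestates -n openshift-sriov-network-operator \
    -o custom-columns=NODE:.metadata.name,SYNC:.status.syncStatus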
Once the nodes have rebooted we can validate that the VF interfaces were created by opening a debug pod on each node.
$ oc debug node/nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-33nvidiaengrdu2dcredhatcom-debug-cqfjz ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.12
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ip link show | grep ens8
26: ens8f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
42: ens8f0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
43: ens8f0v1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
44: ens8f0v2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
45: ens8f0v3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
46: ens8f0v4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
47: ens8f0v5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
48: ens8f0v6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
49: ens8f0v7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
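Still inside the debug shell, the VFs can also be viewed from the parent PF, which shows each VF index and its MAC assignment (assuming the PF name ens8f0np0 from the output above):
sh-5.1# ip link show ens8f0np0
Each of the eight VFs should appear as a vf N line under the PF.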
We can repeat the same steps on the second node for completeness.
We can also confirm via the node capabilities output.
$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A8
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596692Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 0
openshift.io/sriovlegacy: 8
--
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445716Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 0
openshift.io/sriovlegacy: 8
--
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596688Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 0
openshift.io/sriovlegacy: 8
--
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445712Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 0
openshift.io/sriovlegacy: 8
Now that the VFs for SRIOV legacy mode are in place we can generate the SriovNetwork
custom resource file.
$ cat <<EOF > sriov-network.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
name: sriov-network
namespace: openshift-sriov-network-operator
spec:
vlan: 0
networkNamespace: "default"
resourceName: "sriovlegacy"
ipam: |
{
"type": "whereabouts",
"range": "192.168.3.225/28",
"exclude": [
"192.168.3.229/30",
"192.168.3.236/32"
]
}
EOF
Then we can create the custom resource on the cluster.
$ oc create -f sriov-network.yaml
sriovnetwork.sriovnetwork.openshift.io/sriov-network created
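Behind the scenes the SR-IOV operator renders a NetworkAttachmentDefinition into the target namespace, so a quick way to confirm the network is ready for pods to attach to is:
$ oc get network-attachment-definitions -n default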
This completes the SRIOV legacy network configuration.
Create Workload and Perf Test SRIOV Legacy
Now we need to create a workload pod that contains all the tooling for our SRIOV legacy testing. We can generate a custom pod file for each node as follows to meet that requirement.
$ cat << EOF > sriovlegacy-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
name: sriovlegacy-32-workload
namespace: default
annotations:
k8s.v1.cni.cncf.io/networks: sriov-network
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
serviceAccountName: rdma
containers:
- image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
name: sriovlegacy-32-workload
command:
- sh
- -c
- sleep inf
securityContext:
privileged: true
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/gpu: 1
openshift.io/sriovlegacy: 1
requests:
nvidia.com/gpu: 1
openshift.io/sriovlegacy: 1
EOF
$ cat <<EOF > sriovlegacy-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
name: sriovlegacy-33-workload
namespace: default
annotations:
k8s.v1.cni.cncf.io/networks: sriov-network
spec:
nodeSelector:
kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
serviceAccountName: rdma
containers:
- image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
name: sriovlegacy-33-workload
command:
- sh
- -c
- sleep inf
securityContext:
privileged: true
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/gpu: 1
openshift.io/sriovlegacy: 1
requests:
nvidia.com/gpu: 1
openshift.io/sriovlegacy: 1
EOF
Then we can create the pods on the cluster.
$ oc create -f sriovlegacy-32-workload.yaml
pod/sriovlegacy-32-workload created
$ oc create -f sriovlegacy-33-workload.yaml
pod/sriovlegacy-33-workload created
Let's validate the pods are running.
$ oc get pods -n default
NAME READY STATUS RESTARTS AGE
sriovlegacy-32-workload 1/1 Running 0 73s
sriovlegacy-33-workload 1/1 Running 0 12s
First let's get the IP address of the first pod.
$ oc get pod sriovlegacy-32-workload -o yaml | grep -E 'default/sriov-network' -A3
"name": "default/sriov-network",
"interface": "net1",
"ips": [
"192.168.3.225"
Now rsh into the first pod, start ib_write_bw in server mode as in the previous section, and leave that terminal open. Then open another terminal, rsh into the second pod, and run ib_write_bw against the first pod's IP address (192.168.3.225).
$ oc rsh sriovlegacy-33-workload
sh-5.1# ib_write_bw 192.168.3.225
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x05 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f397ace8000
remote address: LID 0x04 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f0eeefac000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 2500.000000 != 3491.228000. CPU Frequency is not max.
65536 5000 44414.44 44386.66 0.710187
---------------------------------------------------------------------------------------
If we go back to the first terminal on pod one, we should see similar results on the server side.
sh-5.1# ib_write_bw
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x04 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f0eeefac000
remote address: LID 0x05 QPN 0x0046 PSN 0xa09639 RKey 0x1fffbd VAddr 0x007f397ace8000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 5000 44414.44 44386.66 0.710187
---------------------------------------------------------------------------------------
We can now clean up the pods since testing is over.
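Again, the workload pods can be removed using the manifests we generated:
$ oc delete -f sriovlegacy-32-workload.yaml
$ oc delete -f sriovlegacy-33-workload.yaml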
Hopefully this blog was detailed enough to provide an understanding of RDMA testing with NVIDIA and OpenShift. It provided brief examples of how to configure the different RDMA methods: Shared Device, Host Device, and SRIOV Legacy.