Wednesday, April 02, 2025

Change the IP Address of an OpenShift Control Plane Node


My OpenShift 4.16.25 nodes were using DHCP for their IP addresses. However, the DHCP scope changed and one of my nodes, which had been using 10.6.135.250, was no longer able to get that address. Instead the node received 10.6.135.245. Anyone who has worked with OpenShift knows that an IP address change will impact etcd. In the following I want to walk through the steps to recover from this situation without reinstalling OpenShift. However, I also want to caution that this is for academic purposes; if this happens in a real production environment, don't try to be a hero and instead open a case with Red Hat support.

Original IP Address Configuration

This is a snapshot of my original configuration. The node involved in the address change ends up being nvd-srv-31-vm-1.

$ oc get nodes -o wide
NAME              STATUS   ROLES                         AGE   VERSION            INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                 KERNEL-VERSION                 CONTAINER-RUNTIME
nvd-srv-31-vm-1   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.250   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-2   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.243   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-3   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.244   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9

Issues Arise

On a Friday, because it always happens on a Friday, one of my colleagues said that node nvd-srv-31-vm-1 had become unhealthy. When I took a look I could see a bunch of pods were not able to deploy, and I could not even launch a debug pod for the node itself. The day before, I had a conversation with someone on our networking team who was unhappy that the DHCP scope included 10.6.135.250. I mentioned that my host had that address and that we could not change it at the moment since it belonged to an active OpenShift cluster. However, 24 hours later something happened with the networking, as I could no longer even ping the node at 10.6.135.250. I decided to reboot the node because that would help me understand the scope of the problem.

$ ping 10.6.135.250
PING 10.6.135.250 (10.6.135.250) 56(84) bytes of data.
^C
--- 10.6.135.250 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3070ms

Since this node was a virtual machine, I rebooted it gracefully with the virsh command.
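As a rough sketch of what that looked like on the hypervisor (the domain name below is an assumption; the libvirt domain name may not match the node name, so check virsh list first):

$ virsh list --all
$ virsh shutdown nvd-srv-31-vm-1
$ virsh start nvd-srv-31-vm-1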

The Recovery Process

Once the node came back up I could see it had obtained a new DHCP address, which meant the one it previously had, 10.6.135.250, was no longer available. Most of the containers were able to launch on the node without issue.

$ oc get nodes -o wide
NAME              STATUS   ROLES                         AGE   VERSION            INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                 KERNEL-VERSION                 CONTAINER-RUNTIME
nvd-srv-31-vm-1   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.245   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-2   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.243   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-3   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.244   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9

However, I knew etcd would have a problem with the IP address change because etcd has the IP addresses hard coded in its configuration to form the quorum of the etcd cluster. With that, I first wanted to check whether the etcd container was crashing on node nvd-srv-31-vm-1. I will switch into the openshift-etcd project so that I can skip passing the namespace on the commands that follow.

$ oc project openshift-etcd
Now using project "openshift-etcd" on server "https://api.doca2.nvidia.eng.rdu2.dc.redhat.com:6443"

$ oc get pods -l k8s-app=etcd
NAME                   READY   STATUS                  RESTARTS       AGE
etcd-nvd-srv-31-vm-1   0/4     Init:CrashLoopBackOff   12 (17s ago)   48d
etcd-nvd-srv-31-vm-2   4/4     Running                 8              48d
etcd-nvd-srv-31-vm-3   4/4     Running                 8              48d

Sure enough the container was crashing, so let's rsh into one of the running etcd containers, for example the one on nvd-srv-31-vm-2. Inside we can use the etcdctl command to list the members.

$ oc rsh etcd-nvd-srv-31-vm-2
sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
| e33638d3b94e9016 | started | nvd-srv-31-vm-1 | https://10.6.135.250:2380 | https://10.6.135.250:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+

We can see that the nvd-srv-31-vm-1 member still has the old IP address of 10.6.135.250. Let's go ahead and remove it using the etcdctl command and then display the remaining members.

sh-5.1# etcdctl member remove e33638d3b94e9016
Member e33638d3b94e9016 removed from cluster f0be7a9595f9ce77

sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
sh-5.1# exit

Now that the old etcd member for nvd-srv-31-vm-1 has been removed, we need to temporarily patch the etcd cluster into an unsupported state.

$ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}'
etcd.operator.openshift.io/cluster patched

With the etcd cluster patched we need to find all secrets related to nvd-srv-31-vm-1. There should only be three at the time of this writing.

$ oc get secret | grep nvd-srv-31-vm-1
etcd-peer-nvd-srv-31-vm-1              kubernetes.io/tls   2   48d
etcd-serving-metrics-nvd-srv-31-vm-1   kubernetes.io/tls   2   48d
etcd-serving-nvd-srv-31-vm-1           kubernetes.io/tls   2   48d

We can remove each of those secrets; they will be regenerated automatically once deleted.

$ oc delete secret etcd-peer-nvd-srv-31-vm-1
secret "etcd-peer-nvd-srv-31-vm-1" deleted

$ oc delete secret etcd-serving-metrics-nvd-srv-31-vm-1
secret "etcd-serving-metrics-nvd-srv-31-vm-1" deleted

$ oc delete secret etcd-serving-nvd-srv-31-vm-1
secret "etcd-serving-nvd-srv-31-vm-1" deleted
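If you prefer a one-liner, something like the following should also work, assuming nothing else in the openshift-etcd namespace contains the node name:

$ oc get secret -o name | grep nvd-srv-31-vm-1 | xargs oc delete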

With the secrets removed we can list the secrets for nvd-srv-31-vm-1 again and see that they have been recreated.

$ oc get secret | grep nvd-srv-31-vm-1
NAME                                   TYPE                DATA   AGE
etcd-peer-nvd-srv-31-vm-1              kubernetes.io/tls   2      20s
etcd-serving-metrics-nvd-srv-31-vm-1   kubernetes.io/tls   2      11s
etcd-serving-nvd-srv-31-vm-1           kubernetes.io/tls   2      1s

Now let's double check the etcdctl member list again just to confirm we still only have two members.

$ oc rsh etcd-nvd-srv-31-vm-2
sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
sh-5.1# exit

Next we will need to approve a certificate signing request for the nvd-srv-31-vm-1 node; remember, we removed its original secrets.

$ oc get csr
NAME        AGE   SIGNERNAME                      REQUESTOR                     REQUESTEDDURATION   CONDITION
csr-sjjxv   12m   kubernetes.io/kubelet-serving   system:node:nvd-srv-31-vm-1   <none>              Pending

$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/csr-sjjxv approved

We can validate the certificate was approved.

$ oc get csr
NAME        AGE   SIGNERNAME                      REQUESTOR                     REQUESTEDDURATION   CONDITION
csr-sjjxv   13m   kubernetes.io/kubelet-serving   system:node:nvd-srv-31-vm-1   <none>              Approved,Issued

Next we will go back into one of the running etcd containers. I will rsh into etcd-nvd-srv-31-vm-2 again, check endpoint health, and list the member table once more.

$ oc rsh etcd-nvd-srv-31-vm-2
sh-5.1# etcdctl endpoint health --cluster
https://10.6.135.243:2379 is healthy: successfully committed proposal: took = 5.356332ms
https://10.6.135.244:2379 is healthy: successfully committed proposal: took = 7.730393ms

sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+

At this point I want to add the nvd-srv-31-vm-1 member back, but with its new IP address, 10.6.135.245.

sh-5.1# etcdctl member add nvd-srv-31-vm-1 --peer-urls="https://10.6.135.245:2380"
Member a4b9266380f688f4 added to cluster f0be7a9595f9ce77

ETCD_NAME="nvd-srv-31-vm-1"
ETCD_INITIAL_CLUSTER="nvd-srv-31-vm-2=https://10.6.135.243:2380,nvd-srv-31-vm-3=https://10.6.135.244:2380,nvd-srv-31-vm-1=https://10.6.135.245:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.6.135.245:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

We can then use etcdctl again to list all of the members and confirm our node is now listed with the correct IP address.

sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
| a4b9266380f688f4 | started | nvd-srv-31-vm-1 | https://10.6.135.245:2380 | https://10.6.135.245:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+

Finally we can remove the unsupported override patch.

$ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": null }}'
etcd.operator.openshift.io/cluster patched

And lastly we can verify the etcd containers are running on the node properly.

$ oc get pods | grep nvd-srv-31-vm-1 | grep etcd
etcd-guard-nvd-srv-31-vm-1   1/1   Running   0   85m
etcd-nvd-srv-31-vm-1         4/4   Running   0   56m
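As an extra sanity check (not strictly part of the recovery), you can also confirm the etcd cluster operator reports healthy and that all three endpoints respond; depending on which container rsh drops you into, you may need to run etcdctl from a shell inside the pod instead:

$ oc get clusteroperator etcd
$ oc rsh etcd-nvd-srv-31-vm-2 etcdctl endpoint health --cluster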

Hopefully this provides a good level of detail for when you need to change the IP address on an OpenShift control plane node. Keep in mind this process shouldn't be used without engaging Red Hat support.

Tuesday, April 01, 2025

NVIDIA GPU Direct Storage on OpenShift


Welcome to the NVIDIA GPU Direct Storage on OpenShift workflow.  The goal of this workflow is to understand and configure NVIDIA GPU Direct Storage for NVMe devices in the worker nodes of an OpenShift cluster.

What Is NVIDIA GPU Direct Storage?

GPU Direct Storage enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU. Using this direct path can relieve system bandwidth bottlenecks and decrease the latency and utilization load on the CPU.

Assumptions

This document assumes that we have already deployed an OpenShift cluster and have installed the necessary operators required for GPU Direct Storage: the Node Feature Discovery Operator (which should also be configured), the base installation of the NVIDIA Network Operator (no NicClusterPolicy yet) and the NVIDIA GPU Operator (no ClusterPolicy yet).
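A quick way to sanity check those assumptions is to list the installed operators and their pods; the namespace names below are the typical defaults and may differ in your environment:

$ oc get csv -A | grep -Ei 'nfd|network-operator|gpu-operator'
$ oc get pods -n openshift-nfd
$ oc get pods -n nvidia-network-operator
$ oc get pods -n nvidia-gpu-operator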

Considerations

If any of the NVMe devices in the system participate in the operating system or other services (machine configs for LVM or other customized access), the NVMe kernel modules will not unload properly, even with the workaround described in this document. Any use of GDS requires that the NVMe drives not be in use during the deployment of the Network Operator so that the operator can unload the in-tree drivers and load NVIDIA's out-of-tree drivers in their place.
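Before going further it is worth confirming, on each GPU worker, that nothing is mounted on or otherwise claiming the NVMe devices. A quick check via a debug pod looks something like the following (the device name nvme0n1 is an assumption; adjust for your hardware):

$ oc debug node/nvd-srv-23.nvidia.eng.rdu2.dc.redhat.com
sh-5.1# chroot /host
sh-5.1# lsblk -f /dev/nvme0n1
sh-5.1# lsmod | grep nvme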

NVIDIA Network Operator Configuration

We assume the Network Operator has already been installed on the cluster but that the NicClusterPolicy still needs to be created. The following NicClusterPolicy example provides the configuration needed to ensure RDMA is properly enabled for NVMe. The key option in this policy is the ENABLE_NFSRDMA variable, which is set to true. Note that this policy also optionally defines an rdmaSharedDevicePlugin and sets ENTRYPOINT_DEBUG to true for more verbose logging.

$ cat <<EOF > network-sharedrdma-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  nicFeatureDiscovery:
    image: nic-feature-discovery
    repository: ghcr.io/mellanox
    version: v0.0.1
  docaTelemetryService:
    image: doca_telemetry
    repository: nvcr.io/nvidia/doca
    version: 1.16.5-doca2.6.0-host
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ens1f0np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: 'sha256:9f468fdc4449e65e4772575f83aa85840a00f97165f9a00ba34695c91d610fbd'
  secondaryNetwork:
    ipoib:
      image: ipoib-cni
      repository: ghcr.io/mellanox
      version: v1.2.0
    nvIpam:
      enableWebhook: false
      image: nvidia-k8s-ipam
      repository: ghcr.io/mellanox
      version: v0.2.0
  ofedDriver:
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    forcePrecompiled: false
    terminationGracePeriodSeconds: 300
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
        podSelector: ''
      maxParallelUpgrades: 1
      safeLoad: false
      waitForCompletion:
        timeoutSeconds: 0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.01-0.6.0.0-0
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
    - name: ENABLE_NFSRDMA
      value: "true"
    - name: ENTRYPOINT_DEBUG
      value: 'true'
EOF

Before creating the NicClusterPolicy on the cluster we need to prepare a script that lets us work around an issue with GPU Direct Storage in the NVIDIA Network Operator. This script, when run right after creating the NicClusterPolicy, determines which nodes have mofed pods running on them and, based on that node list, will ssh as the core user into each node and unload the following modules: nvme, nvme_tcp, nvme_fabrics, nvme_core. By unloading the modules while the mofed container is busy building the DOCA drivers, we avoid a failure that otherwise occurs when the mofed container tries to load the compiled drivers. This issue is being investigated by NVIDIA.

$ cat <<'EOF' > nvme-fixer.sh
#!/bin/bash
### Set array of modules to be unloaded
declare -a modarr=("nvme" "nvme_tcp" "nvme_fabrics" "nvme_core")

### Determine which hosts have mofed container running on them
declare -a hostarr=(`oc get pods -n nvidia-network-operator -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName --no-headers|grep mofed|awk {'print $2'}`)

### Iterate through modules on each host and unload them
for host in "${hostarr[@]}"
do
  echo "Unloading nvme dependencies on $host..."
  for module in "${modarr[@]}"
  do
    echo "Unloading module $module..."
    ssh core@$host sudo rmmod $module
  done
done
EOF

Change the execute bit on the script.

$ chmod +x nvme-fixer.sh

Now we are ready to create the NicClusterPolicy on the cluster, followed immediately by running the nvme-fixer.sh script. Any rmmod "not currently loaded" errors can safely be ignored, as the module simply was not loaded to begin with. In the example below, two worker nodes had mofed pods running on them, so the script unloaded the nvme modules on both.

$ oc create -f network-sharedrdma-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created

$ ./nvme-fixer.sh
Unloading nvme dependencies on nvd-srv-22.nvidia.eng.rdu2.dc.redhat.com...
Unloading module nvme...
Unloading module nvme_tcp...
rmmod: ERROR: Module nvme_tcp is not currently loaded
Unloading module nvme_fabrics...
rmmod: ERROR: Module nvme_fabrics is not currently loaded
Unloading module nvme_core...
Unloading nvme dependencies on nvd-srv-23.nvidia.eng.rdu2.dc.redhat.com...
Unloading module nvme...
Unloading module nvme_tcp...
Unloading module nvme_fabrics...
Unloading module nvme_core...
$

Now we wait for the mofed pods to finish compiling and installing the GPU Direct Storage modules. We will know it is complete when the pods are in a Running state like the output further below.
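If you want to keep an eye on the compile, a simple watch like the following (purely a convenience, not a required step) will show the mofed pods cycling until they settle into Running:

$ oc get pods -n nvidia-network-operator -w | grep mofed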

$ oc get pods -n nvidia-network-operator
NAME                                                          READY   STATUS    RESTARTS       AGE
kube-ipoib-cni-ds-5f8wk                                       1/1     Running   0              38s
kube-ipoib-cni-ds-956nv                                       1/1     Running   0              38s
kube-ipoib-cni-ds-jpbph                                       1/1     Running   0              38s
kube-ipoib-cni-ds-jwtw2                                       1/1     Running   0              38s
kube-ipoib-cni-ds-v4sb8                                       1/1     Running   0              38s
mofed-rhcos4.17-69fb4cd685-ds-j77vl                           2/2     Running   0              37s
mofed-rhcos4.17-69fb4cd685-ds-lw7t9                           2/2     Running   0              37s
nic-feature-discovery-ds-527wc                                1/1     Running   0              36s
nic-feature-discovery-ds-fnn9v                                1/1     Running   0              36s
nic-feature-discovery-ds-l9lkf                                1/1     Running   0              36s
nic-feature-discovery-ds-qn4m9                                1/1     Running   0              36s
nic-feature-discovery-ds-w7vw4                                1/1     Running   0              36s
nv-ipam-controller-67556c846b-c4sfq                           1/1     Running   0              36s
nv-ipam-controller-67556c846b-wvm59                           1/1     Running   0              36s
nv-ipam-node-22rw9                                            1/1     Running   0              36s
nv-ipam-node-6w4x4                                            1/1     Running   0              36s
nv-ipam-node-f2p96                                            1/1     Running   0              36s
nv-ipam-node-jssjh                                            1/1     Running   0              36s
nv-ipam-node-z2mws                                            1/1     Running   0              36s
nvidia-network-operator-controller-manager-57c7cfddc8-6nw6j   1/1     Running   16 (10h ago)   14d

We can validate things look correct from a module perspective by logging into one of the nodes, either via SSH or a debug pod, and listing the nvme modules. The results should look like the output below. Note I also ran lsblk to show that my NVMe device is visible.

$ ssh core@nvd-srv-23.nvidia.eng.rdu2.dc.redhat.com Red Hat Enterprise Linux CoreOS 417.94.202502051822-0 Part of OpenShift 4.17, RHCOS is a Kubernetes-native operating system managed by the Machine Config Operator (`clusteroperator/machine-config`). WARNING: Direct SSH access to machines is not recommended; instead, make configuration changes via `machineconfig` objects: https://docs.openshift.com/container-platform/4.17/architecture/architecture-rhcos.html Last login: Fri Mar 21 17:48:41 2025 from 10.22.81.26 [systemd] Failed Units: 1 NetworkManager-wait-online.service [core@nvd-srv-23 ~]$ sudo bash [root@nvd-srv-23 core]# lsmod|grep nvme nvme_rdma 57344 0 nvme_fabrics 45056 1 nvme_rdma nvme 73728 0 nvme_core 204800 3 nvme,nvme_rdma,nvme_fabrics rdma_cm 155648 3 rpcrdma,nvme_rdma,rdma_ucm ib_core 557056 10 rdma_cm,ib_ipoib,rpcrdma,nvme_rdma,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm mlx_compat 20480 17 rdma_cm,ib_ipoib,mlxdevm,rpcrdma,nvme,nvme_rdma,mlxfw,iw_cm,nvme_core,nvme_fabrics,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core nvme_common 24576 0 t10_pi 24576 2 sd_mod,nvme_core [root@nvd-srv-23 core]# lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS sda 8:0 0 1.5T 0 disk ├─sda1 8:1 0 1M 0 part ├─sda2 8:2 0 127M 0 part ├─sda3 8:3 0 384M 0 part /boot └─sda4 8:4 0 1.5T 0 part /var /sysroot/ostree/deploy/rhcos/var /usr /etc / /sysroot sdb 8:16 0 1.5T 0 disk sdc 8:32 0 1.5T 0 disk sdd 8:48 0 1.5T 0 disk nvme0n1 259:1 0 894.2G 0 disk

This completes the NVIDIA Network Operator portion of the configuration for GPU Direct Storage.

NVIDIA GPU Operator Configuration

Now that the NicClusterPolicy is defined and the proper NVMe modules have been loaded, we can move on to configuring our GPU ClusterPolicy. The example below is a policy that will enable GPU Direct Storage on the worker nodes that have a supported NVIDIA GPU.

$ cat <<EOF > gpu-cluster-policy.yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    serviceMonitor:
      enabled: true
    enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      nlsEnabled: true
      configMapName: ''
    certConfig:
      name: ''
    rdma:
      enabled: true
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    virtualTopology:
      config: ''
    enabled: true
    useNvidiaDriverCRD: false
    useOpenKernelModules: true
  devicePlugin:
    config:
      name: ''
      default: ''
    mps:
      root: /run/nvidia/mps
    enabled: true
  gdrcopy:
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
      - name: WITH_WORKLOAD
        value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: true
    image: 'nvcr.io/nvidia/cloud-native/nvidia-fs:2.20.5'
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    installDir: /usr/local/nvidia
    enabled: true
EOF

Now let's create the policy on the cluster.

$ oc create -f gpu-cluster-policy.yaml
clusterpolicy.nvidia.com/gpu-cluster-policy created

Once the policy is created, let's validate the pods are running before we move on to the next step.

$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-499wh                           1/1     Running     0             18h
gpu-feature-discovery-m68bn                           1/1     Running     0             18h
gpu-operator-c9ccd586d-htl5q                          1/1     Running     0             19h
nvidia-container-toolkit-daemonset-8m4r5              1/1     Running     0             18h
nvidia-container-toolkit-daemonset-ld7qz              1/1     Running     0             18h
nvidia-cuda-validator-fddq7                           0/1     Completed   0             18h
nvidia-cuda-validator-mdk6b                           0/1     Completed   0             18h
nvidia-dcgm-565tj                                     1/1     Running     0             18h
nvidia-dcgm-exporter-jtgt6                            1/1     Running     1 (18h ago)   18h
nvidia-dcgm-exporter-znpgh                            1/1     Running     1 (18h ago)   18h
nvidia-dcgm-xpxbx                                     1/1     Running     0             18h
nvidia-device-plugin-daemonset-2vn52                  1/1     Running     0             18h
nvidia-device-plugin-daemonset-kjzjz                  1/1     Running     0             18h
nvidia-driver-daemonset-417.94.202502051822-0-pj7hk   5/5     Running     2 (18h ago)   18h
nvidia-driver-daemonset-417.94.202502051822-0-qp8xb   5/5     Running     5 (18h ago)   18h
nvidia-node-status-exporter-48cx7                     1/1     Running     0             18h
nvidia-node-status-exporter-dpmsr                     1/1     Running     0             18h
nvidia-operator-validator-fmcz4                       1/1     Running     0             18h
nvidia-operator-validator-g2fbt                       1/1     Running     0             18h

With the NVIDIA GPU Operator pods running, we can rsh into one of the driver daemonset pods and confirm GDS is enabled by running the lsmod command (note the nvidia_fs module) and by cat-ing out the /proc/driver/nvidia-fs/stats file.

$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-417.94.202502051822-0-pj7hk sh-4.4# lsmod|grep nvidia nvidia_fs 327680 0 nvidia_peermem 24576 0 nvidia_modeset 1507328 0 video 73728 1 nvidia_modeset nvidia_uvm 6889472 8 nvidia 8810496 43 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset ib_uverbs 217088 19 nvidia_peermem,rdma_ucm,mlx5_ib drm 741376 5 drm_kms_helper,drm_shmem_helper,nvidia,mgag200 $ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-417.94.202502051822-0-pj7hk sh-4.4# cat /proc/driver/nvidia-fs/stats GDS Version: 1.10.0.4 NVFS statistics(ver: 4.0) NVFS Driver(version: 2.20.5) Mellanox PeerDirect Supported: True IO stats: Disabled, peer IO stats: Disabled Logging level: info Active Shadow-Buffer (MiB): 0 Active Process: 0 Reads : err=0 io_state_err=0 Sparse Reads : n=0 io=0 holes=0 pages=0 Writes : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0 Mmap : n=0 ok=0 err=0 munmap=0 Bar1-map : n=0 ok=0 err=0 free=0 callbacks=0 active=0 delay-frees=0 Error : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0 Ops : Read=0 Write=0 BatchIO=0

If everything looks good we can move on to an additional step to confirm GDS is ready for workload consumption.

GDS CUDA Workload Container

Once the GPU Direct Storage drivers are loaded we can use one more tool to confirm GDS capability. This involves building a container that contains the CUDA packages and then running it on a node. The following pod YAML defines this configuration.

$ cat <<EOF > gds-check-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gds-check-workload
  namespace: default
spec:
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-tools:0.0.3
    name: gds-check-workload
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    volumeMounts:
    - name: udev
      mountPath: /run/udev
    - name: kernel-config
      mountPath: /sys/kernel/config
    - name: dev
      mountPath: /run/dev
    - name: sys
      mountPath: /sys
    - name: results
      mountPath: /results
    - name: lib
      mountPath: /lib/modules
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
  volumes:
  - name: udev
    hostPath:
      path: /run/udev
  - name: kernel-config
    hostPath:
      path: /sys/kernel/config
  - name: dev
    hostPath:
      path: /run/dev
  - name: sys
    hostPath:
      path: /sys
  - name: results
    hostPath:
      path: /results
  - name: lib
    hostPath:
      path: /lib/modules
EOF

Now let's generate a service account resource to use in the default namespace.

$ cat <<EOF > default-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rdma
  namespace: default
EOF

Next we can create it on our cluster.

$ oc create -f default-serviceaccount.yaml
serviceaccount/rdma created

Finally, with the service account created, we can add privileges to it.

$ oc -n default adm policy add-scc-to-user privileged -z rdma
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "rdma"

With the service account defined and our pod yaml ready we can create it on the cluster.

$ oc create -f gds-check-workload.yaml
pod/gds-check-workload created

$ oc get pods
NAME                 READY   STATUS    RESTARTS   AGE
gds-check-workload   1/1     Running   0          3s

Once the pod is up and running we can rsh into the pod and run the gdscheck tool to confirm capabilities and configuration of GPU Direct Storage.

$ oc rsh gds-check-workload sh-5.1# /usr/local/cuda/gds/tools/gdscheck -p GDS release version: 1.13.1.3 nvidia_fs version: 2.20 libcufile version: 2.12 Platform: x86_64 ============ ENVIRONMENT: ============ ===================== DRIVER CONFIGURATION: ===================== NVMe P2PDMA : Unsupported NVMe : Supported NVMeOF : Supported SCSI : Unsupported ScaleFlux CSD : Unsupported NVMesh : Unsupported DDN EXAScaler : Unsupported IBM Spectrum Scale : Unsupported NFS : Supported BeeGFS : Unsupported WekaFS : Unsupported Userspace RDMA : Unsupported --Mellanox PeerDirect : Enabled --rdma library : Not Loaded (libcufile_rdma.so) --rdma devices : Not configured --rdma_device_status : Up: 0 Down: 0 ===================== CUFILE CONFIGURATION: ===================== properties.use_pci_p2pdma : false properties.use_compat_mode : true properties.force_compat_mode : false properties.gds_rdma_write_support : true properties.use_poll_mode : false properties.poll_mode_max_size_kb : 4 properties.max_batch_io_size : 128 properties.max_batch_io_timeout_msecs : 5 properties.max_direct_io_size_kb : 16384 properties.max_device_cache_size_kb : 131072 properties.max_device_pinned_mem_size_kb : 33554432 properties.posix_pool_slab_size_kb : 4 1024 16384 properties.posix_pool_slab_count : 128 64 64 properties.rdma_peer_affinity_policy : RoundRobin properties.rdma_dynamic_routing : 0 fs.generic.posix_unaligned_writes : false fs.lustre.posix_gds_min_kb: 0 fs.beegfs.posix_gds_min_kb: 0 fs.weka.rdma_write_support: false fs.gpfs.gds_write_support: false fs.gpfs.gds_async_support: true profile.nvtx : false profile.cufile_stats : 0 miscellaneous.api_check_aggressive : false execution.max_io_threads : 4 execution.max_io_queue_depth : 128 execution.parallel_io : true execution.min_io_threshold_size_kb : 8192 execution.max_request_parallelism : 4 properties.force_odirect_mode : false properties.prefer_iouring : false ========= GPU INFO: ========= GPU index 0 NVIDIA A40 bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled ============== PLATFORM INFO: ============== Found ACS enabled for switch 0000:e0:01.0 IOMMU: Pass-through or enabled WARN: GDS is not guaranteed to work functionally or in a performant way with iommu=on/pt Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed) Cuda Driver Version Installed: 12040 Platform: PowerEdge R760xa, Arch: x86_64(Linux 5.14.0-427.50.1.el9_4.x86_64) Platform verification succeeded

Hopefully this provides enough detail to enable GPU Direct Storage on OpenShift. 

Wednesday, January 15, 2025

RDMA: Shared, Hostdevice, Legacy SRIOV

 
In a previous blog we discussed how to configure RDMA on OpenShift using three distinct methods: RDMA shared device, host device and legacy SR-IOV.  However, one of the biggest questions coming out of that blog was: how do I know which one to choose?  To answer this question comprehensively we should first step back and discuss RDMA and the three methods in detail.

What is RDMA?

Remote direct memory access (RDMA) is a technology, originally developed in the 1990s, that allows computers to directly access each other's memory without involving the host's central processing unit (CPU) or operating system (OS).  RDMA is an extension of direct memory access (DMA), which allows direct access to a host's memory without the use of the CPU.  RDMA itself is geared toward high bandwidth and low latency applications, making it a valuable component in the AI space.

NVIDIA offers GPUDirect RDMA, a technology that provides a direct data path between GPU memory on two or more hosts by leveraging NVIDIA networking devices.  This configuration significantly decreases latency and offloads the CPUs of the hosts.  When leveraging this technology from NVIDIA, the consumer has multiple ways to configure it, depending both on the underlying technology and on the consumer's use cases.

The three configuration methods for GPUDirect RDMA are as follows:

  • RDMA Shared Device
  • RDMA SR-IOV Legacy Device
  • RDMA Host Device
Let's take a look at each of these options and discuss why one might be used over another depending on a consumer's use case.

RDMA Shared Device


When using the NVIDIA network operator in OpenShift, there is a configuration method in the NicClusterPolicy called RDMA shared device.  This method allows an RDMA device to be shared among multiple pods on the OpenShift worker node where the device is exposed.  The user defined networks of those pods use VXLAN or VETH networking devices inside OpenShift.  Usually those devices are defined in the NicClusterPolicy by specifying the physical device name, as in the code snippet below:

  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_ib",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ibs2f0"]
            }
          },
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ens8f0np0"]
            }
          }
        ]
      }

The example above shows both an RDMA shared device for an ethernet interface and one for an infiniband interface.  We also define the number of pods that can consume each interface via the rdmaHcaMax parameter.  In the NicClusterPolicy we can define as many interfaces as we have in the worker nodes.  Further, we can set the number of pods that consume each device to different values, which makes this method very flexible.

In an RDMA shared device configuration, keep in mind that the pods sharing the device will be competing for the bandwidth and latency of the same device, as with any shared resource.  Thus an RDMA shared device is better suited for developer or application environments where performance and latency are not critical but the ability to have RDMA functionality across nodes is important.
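For context, a pod consumes one of those shared allocations by requesting the resource name the plugin advertises (rdma/<resourceName> by default).  A minimal sketch under those assumptions follows; the image is a placeholder, and in practice the pod would usually also attach to a secondary network (macvlan, ipoib, etc.) over the same interface:

apiVersion: v1
kind: Pod
metadata:
  name: rdma-shared-workload
spec:
  containers:
  - name: app
    image: registry.example.com/rdma-test:latest
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    resources:
      requests:
        rdma/rdma_shared_device_eth: 1
      limits:
        rdma/rdma_shared_device_eth: 1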

RDMA SR-IOV Legacy Device


The Single Root I/O Virtualization (SR-IOV) specification is a standard for a type of PCI device assignment that, like an RDMA shared device, can share a single device with multiple pods.  However, the way the device is shared is very different, because SR-IOV can segment a compliant network device at the hardware layer.  The network device is recognized on the node as a physical function (PF) and, when segmented, creates multiple virtual functions (VFs).  Each VF can be used like any other network device.  The SR-IOV network device driver for the device determines how the VF is exposed in the container:
  • netdevice driver: A regular kernel network device in the netns of the container
  • vfio-pci driver: A character device mounted in the container
Unlike a shared device, though, an SR-IOV device can only be shared with as many pods as the physical device supports VFs.  However, since each VF is like having direct access to the device, the performance is ideal for workloads that are latency and bandwidth sensitive.

The configuration of the SR-IOV devices doesn't take place in the NVIDIA network operator NicClusterPolicy (though we still need that policy for the driver) but rather in the SriovNetworkNodePolicy for the worker node.  The example below shows how we define a vendor and pfName for the nicSelector along with numVfs, which defines the number of VFs to create (usually a value up to the number the device supports).

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-legacy-policy
  namespace:  openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nicSelector:
    vendor: "15b3"
    pfNames: ["ens8f0np0#0-7"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriovlegacy

Once the configuration is in place, workloads that require high bandwidth and low latency are great candidates for RDMA SR-IOV, especially where multiple pods need that level of performance from a single network device.
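To actually hand a VF to a pod, the policy above is typically paired with an SriovNetwork, which generates a network attachment that the workload references by annotation.  A rough sketch under those assumptions (the network name, target namespace, IPAM config and image are illustrative) might look like:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-legacy-network
  namespace: openshift-sriov-network-operator
spec:
  resourceName: sriovlegacy
  networkNamespace: default
  ipam: |
    { "type": "host-local", "subnet": "192.168.100.0/24" }
---
apiVersion: v1
kind: Pod
metadata:
  name: sriov-rdma-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-legacy-network
spec:
  containers:
  - name: app
    image: registry.example.com/rdma-test:latest
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]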

RDMA Host Device


Host device is in some ways a lot like SR-IOV in that it creates an additional network on a pod, allowing direct physical ethernet access on the worker node.  The plugin moves the network device from the host's network namespace to the pod's network namespace.  However, unlike SR-IOV, once the device is passed into a pod it is not available to the host or any other pod until the pod using it is removed from the system, which makes this method far more restrictive.

The configuration of this type of RDMA is again handled through the NVIDIA network operator NicClusterPolicy.  The irony here is that even though this is not an SR-IOV configuration, the DOCA driver uses the SR-IOV network device plugin to do the device passing.  Below is an example of how to configure this type of RDMA, where we set a resourceName and use the NVIDIA vendor selector so that any device with RDMA capability is exposed as a host device.  If there are multiple cards in the system, the configuration will expose all of them, assuming they match the vendor ID and have RDMA capabilities.


  sriovDevicePlugin:
      image: sriov-network-device-plugin
      repository: ghcr.io/k8snetworkplumbingwg
      version: v3.7.0
      config: |
        {
          "resourceList": [
              {
                  "resourcePrefix": "nvidia.com",
                  "resourceName": "hostdev",
                  "selectors": {
                      "vendors": ["15b3"],
                      "isRdma": true
                  }
              }
          ]
        }

The RDMA host device is normally leveraged where the other two options above are not feasible.  For example, the use case requires performance but other requirements don't allow for the use of VFs.  Maybe the cards themselves do not support SR-IOV, there are not enough PCI Express base address registers (BARs), or the system board does not support SR-IOV.  There are also rare cases where the SR-IOV netdevice driver does not support all the capabilities of the network device compared to the PF driver and the workload requires those features.
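For completeness, a workload claims one of those host devices by requesting the resource advertised by the snippet above (resourcePrefix/resourceName, i.e. nvidia.com/hostdev).  A minimal sketch, with a placeholder image and the usual secondary network attachment omitted, might look like:

apiVersion: v1
kind: Pod
metadata:
  name: hostdev-rdma-workload
spec:
  containers:
  - name: app
    image: registry.example.com/rdma-test:latest
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    resources:
      requests:
        nvidia.com/hostdev: 1
      limits:
        nvidia.com/hostdev: 1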

This blog covered what RDMA is and how one can configure three different methods of RDMA with the NVIDIA network operator.  We also compared the methods and discussed why one might be chosen over another.  Hopefully this gives those looking to adopt this technology enough detail to pursue the right solution for their use case.

Monday, January 13, 2025

Mellanox Firmware Updates via OpenShift

 

Anyone who has worked with Mellanox/NVIDIA networking devices knows there is sometimes a need to upgrade the firmware, either to provide new functionality or to address a bug in the current firmware.  This might be trivial on a legacy package based system where it's easy enough to install the NVIDIA Firmware Tools (MFT) packages once and be done.  However, for image based operating systems like Red Hat CoreOS, which underpins the OpenShift Container Platform, this can become cumbersome.

One of the challenges with image based systems is that standard tooling like dnf is not available, and while rpm-ostree install is an option, it's really not meant to be used like a packaging system.  When I initially needed to update firmware I was instructed to install the MFT tools rpm inside the DOCA/MOFED container.  While this method works, the drawbacks are:
  • The container is ephemeral so that if the DOCA/MOFED container restarts and/or gets updated I have to install the MFT tools all over again.
  • I need to stage the packages in the DOCA/MOFED container and the required kernel-devel dependencies.
Given these challenges I decided I wanted to build an image that I could run on OpenShift to provide the tooling whenever I needed it, simply by spinning up a pod.  We will cover that process through the rest of this blog.

Before we begin let's first explain what the MFT package of firmware management tools is used for:

  • Generate a standard or customized NVIDIA firmware image
  • Query a device for firmware information
  • Burn a firmware image
  • Make configuration changes to the firmware settings

The following is a list of the available tools in MFT, together with a brief description of what each tool does.

  • mst - Starts/stops the register access driver and lists the available mst devices
  • mlxburn - Generates a standard or customized NVIDIA firmware image for burning (.bin or .mlx) to the Flash/EEPROM attached to an NVIDIA HCA or switch device
  • flint - Burns/queries a firmware binary image or an expansion ROM image on the Flash device of an NVIDIA network adapter/gateway/switch device
  • debug utilities - A set of debug utilities (e.g., itrace, fwtrace, mlxtrace, mlxdump, mstdump, mlxmcg, wqdump, mcra, mlxi2c, i2c, mget_temp, and pckt_drop)
  • mlxup - Enables discovery of available NVIDIA adapters and indicates whether a firmware update is required for each adapter
  • mlnx-tools - Mellanox userland tools and scripts

Sources: Mlnx-tools Repo MFT Tools Mlxup

Prerequisites

Before we can build the container we need to setup the directory structure, gather a few packages and create the dockerfile and entrypoint script. First let's create the directory structure. I am using root in this example but it could be a regular user.

$ mkdir -p /root/mft/rpms
$ cd /root/mft

Next we need to download the following rpms from Red Hat Package Downloads and place them into the rpms directory. The first is the kernel-devel package for the kernel of the OpenShift node this container will run on. To obtain the kernel version we can run the following oc command on our cluster.

$ oc debug node/nvd-srv-29.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-29nvidiaengrdu2dcredhatcom-debug-rhlgs ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.8
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# uname -r
5.14.0-427.47.1.el9_4.x86_64
sh-5.1#

Now that we have our kernel version, we can download the following two packages into our /root/mft/rpms directory (see the example after the list).

  • kernel-devel-5.14.0-427.47.1.el9_4.x86_64.rpm
  • usbutils-017-1.el9.x86_64.rpm
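If the build host is a subscribed RHEL 9 system, one way (an assumption, not the only way) to pull those packages locally is with the dnf download plugin from dnf-plugins-core, provided the matching versions are still available in your enabled repositories; otherwise, grab the rpms directly from the Red Hat customer portal:

$ cd /root/mft/rpms
$ dnf download kernel-devel-5.14.0-427.47.1.el9_4.x86_64 usbutils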

Next we need to create the dockerfile.mft which will build the container.

$ cat <<EOF > dockerfile.mft
# Start from UBI9 image
FROM registry.access.redhat.com/ubi9/ubi:latest

# Set work directory
WORKDIR /root/mft

# Copy in packages not available in UBI repo
COPY ./rpms/*.rpm /root/rpms/
RUN dnf install /root/rpms/usbutils*.rpm -y

# DNF install packages either from repo or locally
RUN dnf install wget procps-ng pciutils yum jq iputils ethtool net-tools kmod systemd-udev rpm-build gcc make -y

# Cleanup
WORKDIR /root
RUN dnf clean all

# Run container entrypoint
COPY entrypoint.sh /root/entrypoint.sh
ENTRYPOINT ["/bin/bash", "/root/entrypoint.sh"]
EOF

The container file references an entrypoint.sh script, so we need to create that under /root/mft/ as well.

$ cat <<'EOF' > entrypoint.sh
#!/bin/bash
# Set working dir
cd /root

# Set tool versions
MLNXTOOLVER=23.07-1.el9
MFTTOOLVER=4.30.0-139
MLXUPVER=4.30.0

# Set architecture
ARCH=`uname -m`

# Pull mlnx-tools from EPEL
wget https://dl.fedoraproject.org/pub/epel/9/Everything/$ARCH/Packages/m/mlnx-tools-$MLNXTOOLVER.noarch.rpm

# Arm architecture fixup for mft-tools
if [ "$ARCH" == "aarch64" ]; then export ARCH="arm64"; fi

# Pull mft-tools
wget https://www.mellanox.com/downloads/MFT/mft-$MFTTOOLVER-$ARCH-rpm.tgz

# Install mlnx-tools into container
dnf install mlnx-tools-$MLNXTOOLVER.noarch.rpm

# Install kernel-devel package supplied in container
rpm -ivh /root/rpms/kernel-devel-*.rpm --nodeps
mkdir /lib/modules/$(uname -r)/
ln -s /usr/src/kernels/$(uname -r) /lib/modules/$(uname -r)/build

# Install mft-tools into container
tar -xzf mft-$MFTTOOLVER-$ARCH-rpm.tgz
cd /root/mft-$MFTTOOLVER-$ARCH-rpm
#./install.sh --without-kernel
./install.sh

# Change back to root workdir
cd /root

# x86 fixup for mlxup binary
if [ "$ARCH" == "x86_64" ]; then export ARCH="x64"; fi

# Pull and place mlxup binary
wget https://www.mellanox.com/downloads/firmware/mlxup/$MLXUPVER/SFX/linux_$ARCH/mlxup
mv mlxup /usr/local/bin
chmod +x /usr/local/bin/mlxup

sleep infinity & wait
EOF

Now we should have all the prerequisites and we can move onto building the container.

Building The Container

To build the container, run the podman build command on a Red Hat Enterprise Linux 9.x system using the dockerfile.mft we created above.

$ podman build . -f dockerfile.mft -t quay.io/redhat_emp1/ecosys-nvidia/mfttools:1.0.0 STEP 1/9: FROM registry.access.redhat.com/ubi9/ubi:latest STEP 2/9: WORKDIR /root/mft --> 6e6c9f1636c7 STEP 3/9: COPY ./rpms/*.rpm /root/rpms/ --> 30a022291bd9 STEP 4/9: RUN dnf install /root/rpms/usbutils*.rpm -y Updating Subscription Management repositories. subscription-manager is operating in container mode. Red Hat Enterprise Linux 9 for x86_64 - BaseOS 9.2 MB/s | 41 MB 00:04 Red Hat Enterprise Linux 9 for x86_64 - AppStre 9.4 MB/s | 48 MB 00:05 Red Hat Universal Base Image 9 (RPMs) - BaseOS 2.2 MB/s | 525 kB 00:00 Red Hat Universal Base Image 9 (RPMs) - AppStre 5.2 MB/s | 2.3 MB 00:00 Red Hat Universal Base Image 9 (RPMs) - CodeRea 1.7 MB/s | 281 kB 00:00 Dependencies resolved. ================================================================================ Package Arch Version Repository Size ================================================================================ Installing: usbutils x86_64 017-1.el9 @commandline 120 k Installing dependencies: hwdata noarch 0.348-9.15.el9 rhel-9-for-x86_64-baseos-rpms 1.6 M libusbx x86_64 1.0.26-1.el9 rhel-9-for-x86_64-baseos-rpms 78 k Transaction Summary ================================================================================ Install 3 Packages Total size: 1.8 M Total download size: 1.7 M Installed size: 9.8 M Downloading Packages: (1/2): libusbx-1.0.26-1.el9.x86_64.rpm 327 kB/s | 78 kB 00:00 (2/2): hwdata-0.348-9.15.el9.noarch.rpm 3.3 MB/s | 1.6 MB 00:00 -------------------------------------------------------------------------------- Total 3.4 MB/s | 1.7 MB 00:00 Running transaction check Transaction check succeeded. Running transaction test Transaction test succeeded. Running transaction Preparing : 1/1 Installing : hwdata-0.348-9.15.el9.noarch 1/3 Installing : libusbx-1.0.26-1.el9.x86_64 2/3 Installing : usbutils-017-1.el9.x86_64 3/3 Running scriptlet: usbutils-017-1.el9.x86_64 3/3 Verifying : libusbx-1.0.26-1.el9.x86_64 1/3 Verifying : hwdata-0.348-9.15.el9.noarch 2/3 Verifying : usbutils-017-1.el9.x86_64 3/3 Installed products updated. Installed: hwdata-0.348-9.15.el9.noarch libusbx-1.0.26-1.el9.x86_64 usbutils-017-1.el9.x86_64 Complete! --> 7c16c8d84152 STEP 5/9: RUN dnf install wget procps-ng pciutils yum jq iputils ethtool net-tools kmod systemd-udev rpm-build gcc make -y Updating Subscription Management repositories. subscription-manager is operating in container mode. Last metadata expiration check: 0:00:08 ago on Thu Jan 9 18:32:20 2025. Package yum-4.14.0-17.el9.noarch is already installed. Dependencies resolved. ====================================================================================================== Package Arch Version Repository Size ====================================================================================================== Installing: ethtool x86_64 2:6.2-1.el9 rhel-9-for-x86_64-baseos-rpms 234 k gcc x86_64 11.5.0-2.el9 rhel-9-for-x86_64-appstream-rpms 32 M iputils x86_64 20210202-10.el9_5 rhel-9-for-x86_64-baseos-rpms 179 k (...) unzip-6.0-57.el9.x86_64 wget-1.21.1-8.el9_4.x86_64 xz-5.2.5-8.el9_0.x86_64 zip-3.0-35.el9.x86_64 zstd-1.5.1-2.el9.x86_64 Complete! --> 862d0e2c9c6f STEP 6/9: WORKDIR /root --> 5b3ec62db585 STEP 7/9: RUN dnf clean all Updating Subscription Management repositories. subscription-manager is operating in container mode. 
43 files removed --> c14c44f59e9e STEP 8/9: COPY entrypoint.sh /root/entrypoint.sh --> d2d5192c3c57 STEP 9/9: ENTRYPOINT ["/bin/bash", "/root/entrypoint.sh"] COMMIT quay.io/redhat_emp1/ecosys-nvidia/mfttools:1.0.0 --> 1873a4483236 Successfully tagged quay.io/redhat_emp1/ecosys-nvidia/mfttools:1.0.0 1873a448323610f369a8565182a2914675f16d735ffe07f92258df89cd439224

Once the image has been built, push it up to a registry that the OpenShift cluster can access.

$ podman push quay.io/redhat_emp1/ecosys-nvidia/mfttools:1.0.0
Getting image source signatures
Copying blob e5df12622381 done   |
Copying blob 97c1462e7c7b done   |
Copying blob facf1e7dd3e0 skipped: already exists
Copying blob 2dca7d5c2bb7 done   |
Copying blob 6f64cedd7423 done   |
Copying blob ec465ce79861 skipped: already exists
Copying blob 121c270794cd done   |
Copying config 1873a44832 done   |
Writing manifest to image destination

Running The Container

The container will need to run privileged so we can access the hardware devices. To do this we will create a Namespace and ServiceAccount for it to run in.

$ cat <<EOF > mfttool-project.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: mfttool
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mfttool
  namespace: mfttool
EOF

Once the resource file is generated create it on the cluster.

$ oc create -f mfttool-project.yaml
namespace/mfttool created
serviceaccount/mfttool created

Now that the project has been created assign the appropriate privileges to the service account.

$ oc -n mfttool adm policy add-scc-to-user privileged -z mfttool
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "mfttool"

Next we will create a pod YAML for each of our bare metal nodes; the pods will run in the mfttool namespace and provide the MFT tooling.

$ cat <<EOF > mfttool-pod-nvd-srv-29.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mfttool-pod-nvd-srv-29
  namespace: mfttool
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-29.nvidia.eng.rdu2.dc.redhat.com
  hostNetwork: true
  serviceAccountName: mfttool
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/mfttools:1.0.0
    name: mfttool-pod-nvd-srv-29
    securityContext:
      privileged: true
EOF

Once the pod file has been generated, create the resource on the cluster.

$ oc create -f mfttool-pod-nvd-srv-29.yaml
pod/mfttool-pod-nvd-srv-29 created

Validate that the pod is up and running.

$ oc get pods -n mfttool
NAME                     READY   STATUS    RESTARTS   AGE
mfttool-pod-nvd-srv-29   1/1     Running   0          28s

Next we can rsh into the pod.

$ oc rsh -n mfttool mfttool-pod-nvd-srv-29
sh-5.1#

Once inside the pod we can run an mst start and then an mst status to see the devices.

$ oc rsh -n mfttool mfttool-pod-nvd-srv-29
sh-5.1# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
[warn] mst_pciconf is already loaded, skipping
Create devices
Unloading MST PCI module (unused) - Success

sh-5.1# mst status
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded

MST devices:
------------
/dev/mst/mt4129_pciconf0 - PCI configuration cycles access.
                           domain:bus:dev.fn=0000:0d:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                           Chip revision is: 00
/dev/mst/mt4129_pciconf1 - PCI configuration cycles access.
                           domain:bus:dev.fn=0000:37:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                           Chip revision is: 00
sh-5.1#

One of the things we can do with this container is query the devices and their settings with mlxconfig. We can also change those settings, for example when we need to change a port from ethernet mode to infiniband mode.

mlxconfig -d /dev/mst/mt4129_pciconf0 query Device #1: ---------- Device type: ConnectX7 Name: MCX715105AS-WEAT_Ax Description: NVIDIA ConnectX-7 HHHL Adapter Card; 400GbE (default mode) / NDR IB; Single-port QSFP112; Port Split Capable; PCIe 5.0 x16 with x16 PCIe extension option; Crypto Disabled; Secure Boot Enabled Device: /dev/mst/mt4129_pciconf0 Configurations: Next Boot MODULE_SPLIT_M0 Array[0..15] MEMIC_BAR_SIZE 0 MEMIC_SIZE_LIMIT _256KB(1) (...) ADVANCED_PCI_SETTINGS False(0) SAFE_MODE_THRESHOLD 10 SAFE_MODE_ENABLE True(1)
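As a hedged example of that port-mode change (the LINK_TYPE_P1 values shown, 1 for InfiniBand and 2 for Ethernet, are the conventional ones; confirm against your adapter's documentation before applying), switching the first port to InfiniBand and then resetting the device would look something like:

sh-5.1# mlxconfig -d /dev/mst/mt4129_pciconf0 set LINK_TYPE_P1=1
sh-5.1# mlxfwreset -d /dev/mst/mt4129_pciconf0 reset -y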

Another tool in the container is flint, which allows us to see the firmware version, product version and PSID of the device. This is useful when preparing for a firmware update.

flint -d /dev/mst/mt4129_pciconf0 query Image type: FS4 FW Version: 28.42.1000 FW Release Date: 8.8.2024 Product Version: 28.42.1000 Rom Info: type=UEFI version=14.35.15 cpu=AMD64,AARCH64 type=PXE version=3.7.500 cpu=AMD64 Description: UID GuidsNumber Base GUID: e09d730300126474 16 Base MAC: e09d73126474 16 Image VSD: N/A Device VSD: N/A PSID: MT_0000001244 Security Attributes: secure-fw

Another tool in the container is mlxup, which provides an automated way to update the firmware. When we run mlxup it queries all devices on the system and reports back the current firmware along with any available firmware for each device. We can then decide to update the cards or skip them for now. In the example below I have two single port CX-7 cards in the node my pod is running on, and I will upgrade their firmware.

$ mlxup Querying Mellanox devices firmware ... Device #1: ---------- Device Type: ConnectX7 Part Number: MCX715105AS-WEAT_Ax Description: NVIDIA ConnectX-7 HHHL Adapter Card; 400GbE (default mode) / NDR IB; Single-port QSFP112; Port Split Capable; PCIe 5.0 x16 with x16 PCIe extension option; Crypto Disabled; Secure Boot Enabled PSID: MT_0000001244 PCI Device Name: /dev/mst/mt4129_pciconf1 Base MAC: e09d73125fc4 Versions: Current Available FW 28.42.1000 28.43.1014 PXE 3.7.0500 N/A UEFI 14.35.0015 N/A Status: Update required Device #2: ---------- Device Type: ConnectX7 Part Number: MCX715105AS-WEAT_Ax Description: NVIDIA ConnectX-7 HHHL Adapter Card; 400GbE (default mode) / NDR IB; Single-port QSFP112; Port Split Capable; PCIe 5.0 x16 with x16 PCIe extension option; Crypto Disabled; Secure Boot Enabled PSID: MT_0000001244 PCI Device Name: /dev/mst/mt4129_pciconf0 Base MAC: e09d73126474 Versions: Current Available FW 28.42.1000 28.43.1014 PXE 3.7.0500 N/A UEFI 14.35.0015 N/A Status: Update required --------- Found 2 device(s) requiring firmware update... Perform FW update? [y/N]: y Device #1: Updating FW ... FSMST_INITIALIZE - OK Writing Boot image component - OK Done Device #2: Updating FW ... FSMST_INITIALIZE - OK Writing Boot image component - OK Done Restart needed for updates to take effect. Log File: /tmp/mlxup_workdir/mlxup-20250109_190606_17886.log

Notice the firmware upgrade completed, but we need to reset the cards for the changes to take effect. We can use the mlxfwreset command to do this and then validate with the flint command that the card is running the new firmware.

sh-5.1# mlxfwreset -d /dev/mst/mt4129_pciconf0 reset -y The reset level for device, /dev/mst/mt4129_pciconf0 is: 3: Driver restart and PCI reset Continue with reset?[y/N] y -I- Sending Reset Command To Fw -Done -I- Stopping Driver -Done -I- Resetting PCI -Done -I- Starting Driver -Done -I- Restarting MST -Done -I- FW was loaded successfully. sh-5.1# flint -d /dev/mst/mt4129_pciconf0 query Image type: FS4 FW Version: 28.43.1014 FW Release Date: 7.11.2024 Product Version: 28.43.1014 Rom Info: type=UEFI version=14.36.16 cpu=AMD64,AARCH64 type=PXE version=3.7.500 cpu=AMD64 Description: UID GuidsNumber Base GUID: e09d730300126474 16 Base MAC: e09d73126474 16 Image VSD: N/A Device VSD: N/A PSID: MT_0000001244 Security Attributes: secure-fw

We can repeat the same steps on the second card in the system to complete the firmware update.

sh-5.1# mlxfwreset -d /dev/mst/mt4129_pciconf1 reset -y The reset level for device, /dev/mst/mt4129_pciconf1 is: 3: Driver restart and PCI reset Continue with reset?[y/N] y -I- Sending Reset Command To Fw -Done -I- Stopping Driver -Done -I- Resetting PCI -Done -I- Starting Driver -Done -I- Restarting MST -Done -I- FW was loaded successfully. sh-5.1# flint -d /dev/mst/mt4129_pciconf1 query Image type: FS4 FW Version: 28.43.1014 FW Release Date: 7.11.2024 Product Version: 28.43.1014 Rom Info: type=UEFI version=14.36.16 cpu=AMD64,AARCH64 type=PXE version=3.7.500 cpu=AMD64 Description: UID GuidsNumber Base GUID: e09d730300125fc4 16 Base MAC: e09d73125fc4 16 Image VSD: N/A Device VSD: N/A PSID: MT_0000001244 Security Attributes: secure-fw

Once the firmware update has been completed and validated, we can remove the container, which completes the firmware update example.
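Cleanup is just a matter of deleting the pod (and, if you are done with the tooling entirely, the namespace and service account) we created earlier:

$ oc delete -f mfttool-pod-nvd-srv-29.yaml
$ oc delete -f mfttool-project.yaml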

Hopefully this gives an idea of what is required to use this container method, which aims to simplify upgrading Mellanox/NVIDIA firmware on an image based operating system like Red Hat CoreOS in OpenShift Container Platform.