Thursday, January 15, 2026

OpenShift On-Cluster Image Mode & Lustre Client


Image mode for OpenShift allows you to easily extend the functionality of your base RHCOS image by layering additional images onto the base image. This layering does not modify the base RHCOS image. Instead, it creates a custom layered image that includes all RHCOS functionality and adds additional functionality to specific nodes in the cluster.

There are two methods for deploying a custom layered image onto your nodes:

  • On-cluster image mode where we create a MachineOSConfig object that includes the Containerfile and other parameters. The build is performed on the cluster and the resulting custom layered image is automatically pushed to your repository and applied to the machine config pool that you specified in the MachineOSConfig object. The entire process is performed completely within your cluster.

  • Out-of-cluster image mode where we create a Containerfile that references an OpenShift Container Platform image and the RPM that we want to apply, build the layered image in your own environment, and push the image to a repository. Then, in the cluster, we create a MachineConfig object for the targeted node pool that points to the new image. The Machine Config Operator overrides the base RHCOS image, as specified by the osImageURL value in the associated machine config, and boots the new image.

While I have written about out-of-cluster image mode before, this example will focus on on-cluster image mode and specifically cover a case where I need to incorporate the Lustre client kernel drivers and packages into my OpenShift environment.

To get started we deployed a Single Node OpenShift environment running 4.20.8. Note that the process is no different on a multinode cluster; there will simply be more nodes to apply the updated image to.

Next we need to generate the secrets that will be used in the build process. First let's set some environment variables for our internal registry, the user, the namespace and the token creation.

$ export REGISTRY=image-registry.openshift-image-registry.svc:5000
$ export REGISTRY_USER=builder
$ export REGISTRY_NAMESPACE=openshift-machine-config-operator
$ export TOKEN=$(oc create token $REGISTRY_USER -n $REGISTRY_NAMESPACE --duration=$((900*24))h)

Next let's use the variables we set to create the push-secret in the openshift-machine-config-operator namespace.

$ oc create secret docker-registry push-secret -n openshift-machine-config-operator --docker-server=$REGISTRY --docker-username=$REGISTRY_USER --docker-password=$TOKEN
secret/push-secret created

Now we need to extract the push-secret and the cluster's global pull-secret.

$ oc extract secret/push-secret -n openshift-machine-config-operator --to=- > push-secret.json
# .dockerconfigjson
$ oc extract secret/pull-secret -n openshift-config --to=- > pull-secret.json
# .dockerconfigjson

We will now merge the push-secret and the global pull-secret into one combined file.

$ jq -s '.[0] * .[1]' pull-secret.json push-secret.json > pull-and-push-secret.json
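As a quick sanity check we can list the registries present in the merged file and confirm both the cluster pull-secret entries and our internal registry are there (the exact hostnames will vary by cluster):

$ jq '.auths | keys' pull-and-push-secret.json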

This new merged secret also needs to be created in the openshift-machine-config-operator namespace.

$ oc create secret generic pull-and-push-secret -n openshift-machine-config-operator --from-file=.dockerconfigjson=pull-and-push-secret.json --type=kubernetes.io/dockerconfigjson
secret/pull-and-push-secret created
$ oc get secrets -n openshift-machine-config-operator |grep push
pull-and-push-secret   kubernetes.io/dockerconfigjson   1      10s
push-secret            kubernetes.io/dockerconfigjson   1      114s

Now we need to create a MachineOSConfig custom resource file that defines the additional components we need to add to RHCOS. The example below does the following:

  • This MachineOSConfig will be built and applied to nodes in the master machine config pool. If we had workers in the worker pool we could change this and apply it there instead, and it can target custom pools as well.
  • This MachineOSConfig will install EPEL, libyaml-devel and the four Lustre client packages. Dnf will pull in any additional dependencies.
  • This MachineOSConfig has a renderedImagePushSpec that pushes to and pulls from the internal registry of the OCP cluster. This could point to whichever registry you want to store the image in and then pull it from.
  • We also reference the secrets that we created earlier.

$ cat <<EOF > on-cluster-rhcos-layer-mc.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineOSConfig
metadata:
  name: master
spec:
  machineConfigPool:
    name: master
  containerFile:
  - containerfileArch: NoArch
    content: |-
      FROM configs AS final
      RUN dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
          dnf install -y https://mirror.stream.centos.org/9-stream/CRB/x86_64/os/Packages/libyaml-devel-0.2.5-7.el9.x86_64.rpm && \
          dnf install -y https://downloads.whamcloud.com/public/lustre/lustre-2.15.7/el9.6/client/RPMS/x86_64/lustre-iokit-2.15.7-1.el9.x86_64.rpm \
            https://downloads.whamcloud.com/public/lustre/lustre-2.15.7/el9.6/client/RPMS/x86_64/lustre-client-2.15.7-1.el9.x86_64.rpm \
            https://downloads.whamcloud.com/public/lustre/lustre-2.15.7/el9.6/client/RPMS/x86_64/lustre-client-dkms-2.15.7-1.el9.noarch.rpm \
            https://downloads.whamcloud.com/public/lustre/lustre-2.15.7/el9.6/client/RPMS/x86_64/kmod-lustre-client-2.15.7-1.el9.x86_64.rpm && \
          dnf clean all && \
          ostree container commit
  imageBuilder:
    imageBuilderType: Job
  baseImagePullSecret:
    name: pull-and-push-secret
  renderedImagePushSecret:
    name: push-secret
  renderedImagePushSpec: image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/os-image:latest
EOF

Once the MachineOSConfig custom resource file is generated we can create it on our cluster.

$ oc create -f on-cluster-rhcos-layer-mc.yaml
machineosconfig.machineconfiguration.openshift.io/master created
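We can also quickly confirm the new object exists on the cluster:

$ oc get machineosconfig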

Once the MachineOSConfig has been created we can monitor the resulting build via the following command.

$ oc get machineosbuild
NAME                                      PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED   AGE
master-afc1942c842a324aa66271cbf5fcb0d8   False      True       False       False         False    16s

We can also observe that a build pod was created for the master pool.

$ oc get pods -n openshift-machine-config-operator
NAME                                                  READY   STATUS      RESTARTS         AGE
build-master-afc1942c842a324aa66271cbf5fcb0d8-fprgj   0/1     Init:0/1    0                29s
kube-rbac-proxy-crio-sno2.schmaustech.com             1/1     Running     9                46h
machine-config-controller-78b85fcd9c-h9gmn            2/2     Running     10               45h
machine-config-daemon-tlt8g                           2/2     Running     18 (6h37m ago)   45h
machine-config-nodes-crd-cleanup-29470933-l8jz2       0/1     Completed   0                46h
machine-config-nodes-crd-cleanup-29470952-xmlvn       0/1     Completed   0                45h
machine-config-operator-658ff78994-bpzpj              2/2     Running     10               46h
machine-config-server-mpwff                           1/1     Running     5                45h
machine-os-builder-65d7b4b97-m97lw                    1/1     Running     0                41s

If we want to see more details on what is happening in the build pod we can tail the logs of the image-build container inside the pod. I am only showing the command to obtain the logs here because the output is quite long and verbose. Furthermore, the build process takes a while to run.

$ oc logs -f -n openshift-machine-config-operator build-master-afc1942c842a324aa66271cbf5fcb0d8-fprgj -c image-build

When the build finishes, the image is pushed to the registry defined in the MachineOSConfig. Near the end, the logs will contain a reference like this.

+ buildah push --storage-driver vfs --authfile=/tmp/final-image-push-creds/config.json --digestfile=/tmp/done/digestfile --cert-dir /var/run/secrets/kubernetes.io/serviceaccount image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/os-image:master-afc1942c842a324aa66271cbf5fcb0d8
Getting image source signatures
Copying blob sha256:3a1265f127cd4df9ca7e05bf29ad06af47b49cff0215defce94c32eceee772bc
Copying blob sha256:d87a18a2396ee3eb656b6237ac1fa64072dd750dde5aef660aff53e52c156f56
(...)
Copying blob sha256:1d82edb13736f9bbad861d8f95cae0abfe5d572225f9d33d326e602ecc5db5fb
Copying blob sha256:eb199ffe5f75bd36c537582e9cf5fa5638d55b8145f7dcd3cfc6b28699b2568d
Copying config sha256:3d835eb02f08fe48d26d9b97ebcf0e190c401df2619d45cce1a94b0845d7f4e2
Writing manifest to image destination
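Optionally, we can confirm the image landed in the internal registry; since our renderedImagePushSpec targets the internal registry, the pushed image shows up as an imagestream (the name here comes from the push spec above):

$ oc get imagestream os-image -n openshift-machine-config-operator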

At that point if we look at the machineosbuild output we will see the build has moved to succeeded.

$ oc get machineosbuild
NAME                                      PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED   AGE
master-afc1942c842a324aa66271cbf5fcb0d8   False      False      True        False         False    24m

And we will see that the machine config pool is now in an updating state. At this point the image that was built is being applied to the system and the node will reboot.

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-2d9e26ad2557d1859aadc76634a4f1a5   False     True       False      1              0                   0                     0                      46h
worker   rendered-worker-36ba71179b413c7b7abc3e477e7367d5   True      False      False      0              0                   0                     0                      46h
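While the rollout proceeds, a simple watch on the pool is an easy way to follow progress:

$ watch oc get mcp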

Once the node, or nodes if multinode, reboots we should be able to open a debug pod and validate that our kernel modules and client packages were installed properly. First let's open a debug prompt.

$ oc debug node/sno2.schmaustech.com
Starting pod/sno2schmaustechcom-debug-xcvdh ...
To use host binaries, run `chroot /host`. Instead, if you need to access host namespaces, run `nsenter -a -t 1`.
Pod IP: 192.168.0.204
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1#

Next let's confirm the Lustre rpm packages are present.

sh-5.1# rpm -qa|grep lustre
kmod-lustre-client-2.15.7-1.el9.x86_64
lustre-client-dkms-2.15.7-1.el9.noarch
lustre-client-2.15.7-1.el9.x86_64
lustre-iokit-2.15.7-1.el9.x86_64

The packages are there; now let's see if the Lustre kernel module is loaded. It might not be, because my understanding is that it requires a process to request it first. If it's not there we can manually load it.

sh-5.1# lsmod|grep lustre
sh-5.1# modprobe lustre
sh-5.1# lsmod|grep lustre
lustre    1155072  0
lmv        233472  1 lustre
mdc        315392  1 lustre
lov        385024  2 mdc,lustre
ptlrpc    1662976  7 fld,osc,fid,lov,mdc,lmv,lustre
obdclass  3571712  8 fld,osc,fid,ptlrpc,lov,mdc,lmv,lustre
lnet       884736  6 osc,obdclass,ptlrpc,ksocklnd,lmv,lustre
libcfs     262144 11 fld,lnet,osc,fid,obdclass,ptlrpc,ksocklnd,lov,mdc,lmv,lustre
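If we want the lustre module to load automatically at boot rather than on demand, one option is a small MachineConfig that drops a modules-load.d entry onto the node. This is only a sketch I have not tested in this environment (the object name is my own), and applying it will trigger another reboot of the pool.

$ cat <<EOF > 99-master-lustre-modules.yaml
# Sketch (untested here): load the lustre module at boot on master nodes
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-lustre-modules
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - path: /etc/modules-load.d/lustre.conf
        mode: 420
        overwrite: true
        contents:
          source: data:,lustre%0A
EOF
$ oc create -f 99-master-lustre-modules.yaml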

We can see our image has been updated and contains the necessary packages. Hopefully this provides a useful example of how to add third-party drivers and packages to an OpenShift environment. More details on image mode can be found here.

Tuesday, December 02, 2025

NVIDIA Unified Fabric Manager (UFM) on RHEL9

The UFM platform empowers research and industrial data center operators to efficiently provision, monitor, manage, and preventively troubleshoot and maintain their high-performance InfiniBand networking fabric. The UFM platform is made up of multiple solution levels and a comprehensive feature set to meet the broadest range of modern, scale-out data center requirements. Using UFM, you can realize higher utilization of fabric resources and gain a competitive advantage, while reducing opex.

As indicated, UFM is made up of multiple solution levels, which include UFM Telemetry, UFM Enterprise and UFM Cyber-AI. This blog will focus on UFM Enterprise and its relationship to the InfiniBand fabric. More information about UFM can be found here.

The rest of this blog will describe the process of getting UFM up and running on a host and then taking a test drive of the UFM web interface.  The blog is broken down into the following workflow sections:

  • Environment
  • Configure Repos
  • Set Firewall Rules
  • Disable SELinux
  • Install Software Dependencies
  • Install UFM Software
  • Configure UFM
  • Start UFM Services
  • UFM Overview Web UI Video

Environment

The test environment consists of an R760xa server running Red Hat Enterprise Linux 9.7. The system also has an InfiniBand interface to communicate with the InfiniBand fabric.

# cat /etc/redhat-release
Red Hat Enterprise Linux release 9.7 (Plow)
# uname -a
Linux nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com 5.14.0-611.8.1.el9_7.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Nov 13 05:30:00 EST 2025 x86_64 x86_64 x86_64 GNU/Linux

Configure Repos

We need to configure a few repositories on the UFM host: CodeReady Builder, EPEL, NVIDIA DOCA and Docker. First we will enable the CodeReady Builder repository (assuming the RHEL host is registered and has an entitlement).

# subscription-manager repos --enable codeready-builder-for-rhel-9-$(arch)-rpms
Repository 'codeready-builder-for-rhel-9-x86_64-rpms' is enabled for this system.

Next we can enable the EPEL repository.

# dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm -y
Updating Subscription Management repositories.
Red Hat CodeReady Linux Builder for RHEL 9 x86_64 (RPMs)         47 MB/s |  15 MB     00:00
epel-release-latest-9.noarch.rpm                                1.1 MB/s |  19 kB     00:00
Dependencies resolved.
================================================================================
 Package              Architecture     Version          Repository         Size
================================================================================
Installing:
 epel-release         noarch           9-10.el9         @commandline       19 k

Transaction Summary
================================================================================
Install  1 Package

Total size: 19 k
Installed size: 26 k
Downloading Packages:
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                        1/1
  Installing       : epel-release-9-10.el9.noarch                           1/1
  Running scriptlet: epel-release-9-10.el9.noarch                           1/1
Many EPEL packages require the CodeReady Builder (CRB) repository.
It is recommended that you run /usr/bin/crb enable to enable the CRB repository.
  Verifying        : epel-release-9-10.el9.noarch                           1/1
Installed products updated.

Installed:
  epel-release-9-10.el9.noarch

Complete!

Now we need to add the NVIDIA DOCA repository.

# cat <<EOF > /etc/yum.repos.d/doca.repo
[doca]
name=DOCA Online Repo
baseurl=https://linux.mellanox.com/public/repo/doca/3.1.0/rhel9.6/x86_64/
enabled=1
gpgcheck=0
EOF

Finally we will enable the Docker repository; Docker is required because UFM plugins run as containers.

# dnf config-manager --add-repo https://download.docker.com/linux/rhel/docker-ce.repo
Updating Subscription Management repositories.
Adding repo from: https://download.docker.com/linux/rhel/docker-ce.repo

With all the repositories added our repolist should look like the following.

# yum repolist
Updating Subscription Management repositories.
repo id                                    repo name
codeready-builder-for-rhel-9-x86_64-rpms   Red Hat CodeReady Linux Builder for RHEL 9 x86_64 (RPMs)
doca                                       DOCA Online Repo
docker-ce-stable                           Docker CE Stable - x86_64
epel                                       Extra Packages for Enterprise Linux 9 - x86_64
epel-cisco-openh264                        Extra Packages for Enterprise Linux 9 openh264 (From Cisco) - x86_64
rhel-9-for-x86_64-appstream-rpms           Red Hat Enterprise Linux 9 for x86_64 - AppStream (RPMs)
rhel-9-for-x86_64-baseos-rpms              Red Hat Enterprise Linux 9 for x86_64 - BaseOS (RPMs)

Set Firewall Rules

There are some firewall rules we need to add in order to access the UFM web interface. Below we permanently open up the http and https ports.

# firewall-cmd --get-active-zones
public
  interfaces: eno12399 enp55s0np0
# firewall-cmd --zone=public --add-service=http
success
# firewall-cmd --zone=public --add-service=https
success
# firewall-cmd --permanent --zone=public --add-service=http
success
# firewall-cmd --permanent --zone=public --add-service=https
success
# firewall-cmd --reload
success
# firewall-cmd --zone=public --list-services
cockpit dhcpv6-client http https ssh

Disable SELinux

UFM requires SELinux to be disabled per NVIDIA's official documentation, so we will set it to disabled using the following sed command.

# sed -i "s/SELINUX=.*/SELINUX=disabled/" /etc/selinux/config

We will need to reboot the node for the change to take effect.

After the reboot, validate that SELinux is disabled; otherwise UFM will complain.

# sestatus
SELinux status:                 disabled

Install Software Dependencies

There are a variety of software packages that need to be installed as dependencies before UFM can be installed. We will capture those here for installation.

# dnf install -y wget bc mod_ldap sshpass lftp zip rsync telnet qperf dos2unix httpd php net-snmp net-snmp-libs net-snmp-utils mod_ssl libnsl libxslt sqlite mod_session cairo apr-util-openssl net-tools docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Next, start, enable and check the status of Docker.

# systemctl start docker
# systemctl enable docker
Created symlink /etc/systemd/system/multi-user.target.wants/docker.service → /usr/lib/systemd/system/docker.service.
# systemctl status docker
● docker.service - Docker Application Container Engine
     Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; preset: disabled)
     Active: active (running) since Thu 2025-11-20 17:04:39 EST; 11s ago
TriggeredBy: ● docker.socket
       Docs: https://docs.docker.com
   Main PID: 3625 (dockerd)
      Tasks: 21
     Memory: 107.7M (peak: 110.5M)
        CPU: 103ms
     CGroup: /system.slice/docker.service
             └─3625 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

Nov 20 17:04:38 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:38.602574940-05:00" level=info msg="Deleting nftables IPv6 rules" error="exit status 1"
Nov 20 17:04:38 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:38.615498854-05:00" level=info msg="Firewalld: created docker-forwarding policy"
Nov 20 17:04:39 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:39.151067145-05:00" level=info msg="Loading containers: done."
Nov 20 17:04:39 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:39.156910506-05:00" level=info msg="Docker daemon" commit=e9ff10b containerd-snapshotter=true storage-driver=overla>
Nov 20 17:04:39 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:39.157296220-05:00" level=info msg="Initializing buildkit"
Nov 20 17:04:39 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:39.161871789-05:00" level=warning msg="git source cannot be enabled: failed to find git binary: exec: \"git\": exec>
Nov 20 17:04:39 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:39.163207822-05:00" level=info msg="Completed buildkit initialization"
Nov 20 17:04:39 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:39.165944984-05:00" level=info msg="Daemon has completed initialization"
Nov 20 17:04:39 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:39.165978374-05:00" level=info msg="API listen on /run/docker.sock"
Nov 20 17:04:39 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com systemd[1]: Started Docker Application Container Engine.

Install the DOCA packages required by UFM.

# dnf install doca-ufm doca-kernel
Updating Subscription Management repositories.
Last metadata expiration check: 0:17:44 ago on Thu 20 Nov 2025 04:54:18 PM EST.
Dependencies resolved.
================================================================================
 Package                 Arch    Version                                         Repository                       Size
================================================================================
Installing:
 doca-kernel             x86_64  3.1.0-091000                                    doca                            7.3 k
 doca-ufm                x86_64  3.1.0-091000                                    doca                            6.9 k
Upgrading:
 rdma-core               x86_64  2507mlnx58-1.2507097                            doca                             46 k
Installing dependencies:
 ibutils2                x86_64  2.1.1-0.22300.MLNX20250720.g13bb9fedb.2507097   doca                            3.9 M
 infiniband-diags        x86_64  2507mlnx58-1.2507097                            doca                            314 k
 kernel-core             x86_64  5.14.0-570.62.1.el9_6                           rhel-9-for-x86_64-baseos-rpms    18 M
 kernel-modules-core     x86_64  5.14.0-570.62.1.el9_6                           rhel-9-for-x86_64-baseos-rpms    31 M
 kmod-iser               x86_64  25.07-OFED.25.07.0.9.7.1.rhel9u6                doca                             43 k
 kmod-isert              x86_64  25.07-OFED.25.07.0.9.7.1.rhel9u6                doca                             46 k
 kmod-kernel-mft-mlnx    x86_64  4.33.0-1.rhel9u6                                doca                             41 k
 kmod-knem               x86_64  1.1.4.90mlnx3-OFED.25.07.0.9.7.1.rhel9u6        doca                             37 k
 kmod-mlnx-ofa_kernel    x86_64  25.07-OFED.25.07.0.9.7.1.rhel9u6                doca                            1.9 M
 kmod-srp                x86_64  25.07-OFED.25.07.0.9.7.1.rhel9u6                doca                             62 k
 kmod-xpmem              x86_64  2.7.4-1.2507097.rhel9u6.rhel9u6                 doca                            492 k
 libibumad               x86_64  2507mlnx58-1.2507097                            doca                             27 k
 lsof                    x86_64  4.94.0-3.el9                                    rhel-9-for-x86_64-baseos-rpms   241 k
 mlnx-ofa_kernel         x86_64  25.07-OFED.25.07.0.9.7.1.rhel9u6                doca                             38 k
 mlnx-ofa_kernel-devel   x86_64  25.07-OFED.25.07.0.9.7.1.rhel9u6                doca                            2.3 M
 mlnx-ofa_kernel-source  x86_64  25.07-OFED.25.07.0.9.7.1.rhel9u6                doca                            2.8 M
 mlnx-tools              x86_64  25.07-0.2507097                                 doca                             78 k
 ofed-scripts            x86_64  25.07-OFED.25.07.0.9.7                          doca                             65 k
 xpmem                   x86_64  2.7.4-1.2507097.rhel9u6                         doca                             20 k

Transaction Summary
================================================================================
Install  21 Packages
Upgrade   1 Package

Total download size: 61 M
Is this ok [y/N]: y
Downloading Packages:
(1/22): doca-kernel-3.1.0-091000.x86_64.rpm                      21 kB/s | 7.3 kB     00:00
(2/22): doca-ufm-3.1.0-091000.x86_64.rpm                         18 kB/s | 6.9 kB     00:00
(3/22): kmod-iser-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64.rpm    92 kB/s |  43 kB     00:00
(4/22): infiniband-diags-2507mlnx58-1.2507097.x86_64.rpm        496 kB/s | 314 kB     00:00
(5/22): ibutils2-2.1.1-0.22300.MLNX20250720.g13bb9fedb.2507097.x86_64.rpm   3.9 MB/s | 3.9 MB     00:00
(6/22): kmod-isert-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64.rpm   98 kB/s |  46 kB     00:00
(7/22): kmod-kernel-mft-mlnx-4.33.0-1.rhel9u6.x86_64.rpm        102 kB/s |  41 kB     00:00
(8/22): kmod-knem-1.1.4.90mlnx3-OFED.25.07.0.9.7.1.rhel9u6.x86_64.rpm    94 kB/s |  37 kB     00:00
(9/22): kmod-srp-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64.rpm    131 kB/s |  62 kB     00:00
(10/22): kmod-xpmem-2.7.4-1.2507097.rhel9u6.rhel9u6.x86_64.rpm  706 kB/s | 492 kB     00:00
(11/22): kmod-mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64.rpm   2.2 MB/s | 1.9 MB     00:00
(12/22): libibumad-2507mlnx58-1.2507097.x86_64.rpm               67 kB/s |  27 kB     00:00
(13/22): mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64.rpm    96 kB/s |  38 kB     00:00
(14/22): mlnx-tools-25.07-0.2507097.x86_64.rpm                  164 kB/s |  78 kB     00:00
(15/22): mlnx-ofa_kernel-devel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64.rpm   2.6 MB/s | 2.3 MB     00:00
(16/22): mlnx-ofa_kernel-source-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64.rpm   3.1 MB/s | 2.8 MB     00:00
(17/22): lsof-4.94.0-3.el9.x86_64.rpm                           1.1 MB/s | 241 kB     00:00
(18/22): ofed-scripts-25.07-OFED.25.07.0.9.7.x86_64.rpm         139 kB/s |  65 kB     00:00
(19/22): xpmem-2.7.4-1.2507097.rhel9u6.x86_64.rpm                50 kB/s |  20 kB     00:00
(20/22): kernel-core-5.14.0-570.62.1.el9_6.x86_64.rpm            81 MB/s |  18 MB     00:00
(21/22): kernel-modules-core-5.14.0-570.62.1.el9_6.x86_64.rpm    75 MB/s |  31 MB     00:00
(22/22): rdma-core-2507mlnx58-1.2507097.x86_64.rpm               97 kB/s |  46 kB     00:00
--------------------------------------------------------------------------------
Total                                                            16 MB/s |  61 MB     00:03
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                          1/1
  Installing       : kernel-modules-core-5.14.0-570.62.1.el9_6.x86_64                         1/23
  Installing       : kernel-core-5.14.0-570.62.1.el9_6.x86_64                                 2/23
  Running scriptlet: kernel-core-5.14.0-570.62.1.el9_6.x86_64                                 2/23
  Installing       : kmod-mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64             3/23
  Running scriptlet: kmod-mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64             3/23
  Installing       : libibumad-2507mlnx58-1.2507097.x86_64                                    4/23
  Running scriptlet: libibumad-2507mlnx58-1.2507097.x86_64                                    4/23
  Installing       : ofed-scripts-25.07-OFED.25.07.0.9.7.x86_64                               5/23
  Running scriptlet: ofed-scripts-25.07-OFED.25.07.0.9.7.x86_64                               5/23
  Installing       : mlnx-tools-25.07-0.2507097.x86_64                                        6/23
  Installing       : ibutils2-2.1.1-0.22300.MLNX20250720.g13bb9fedb.2507097.x86_64            7/23
  Installing       : infiniband-diags-2507mlnx58-1.2507097.x86_64                             8/23
  Running scriptlet: infiniband-diags-2507mlnx58-1.2507097.x86_64                             8/23
  Installing       : kmod-iser-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                        9/23
  Running scriptlet: kmod-iser-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                        9/23
  Installing       : kmod-isert-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                      10/23
  Running scriptlet: kmod-isert-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                      10/23
  Installing       : kmod-srp-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                        11/23
  Running scriptlet: kmod-srp-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                        11/23
  Installing       : kmod-xpmem-2.7.4-1.2507097.rhel9u6.rhel9u6.x86_64                       12/23
  Running scriptlet: kmod-xpmem-2.7.4-1.2507097.rhel9u6.rhel9u6.x86_64                       12/23
  Upgrading        : rdma-core-2507mlnx58-1.2507097.x86_64                                   13/23
  Running scriptlet: rdma-core-2507mlnx58-1.2507097.x86_64                                   13/23
  Installing       : lsof-4.94.0-3.el9.x86_64                                                14/23
  Installing       : mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                 15/23
  Running scriptlet: mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                 15/23
Configured /etc/security/limits.conf
  Installing       : xpmem-2.7.4-1.2507097.rhel9u6.x86_64                                    16/23
  Installing       : mlnx-ofa_kernel-source-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64          17/23
  Installing       : mlnx-ofa_kernel-devel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64           18/23
  Running scriptlet: mlnx-ofa_kernel-devel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64           18/23
  Installing       : kmod-knem-1.1.4.90mlnx3-OFED.25.07.0.9.7.1.rhel9u6.x86_64               19/23
  Running scriptlet: kmod-knem-1.1.4.90mlnx3-OFED.25.07.0.9.7.1.rhel9u6.x86_64               19/23
  Installing       : kmod-kernel-mft-mlnx-4.33.0-1.rhel9u6.x86_64                            20/23
  Running scriptlet: kmod-kernel-mft-mlnx-4.33.0-1.rhel9u6.x86_64                            20/23
  Installing       : doca-kernel-3.1.0-091000.x86_64                                         21/23
  Installing       : doca-ufm-3.1.0-091000.x86_64                                            22/23
  Running scriptlet: rdma-core-57.0-2.el9.x86_64                                             23/23
  Cleanup          : rdma-core-57.0-2.el9.x86_64                                             23/23
  Running scriptlet: rdma-core-57.0-2.el9.x86_64                                             23/23
  Running scriptlet: kernel-modules-core-5.14.0-570.62.1.el9_6.x86_64                        23/23
  Running scriptlet: kernel-core-5.14.0-570.62.1.el9_6.x86_64                                23/23
  Running scriptlet: mlnx-ofa_kernel-devel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64           23/23
  Running scriptlet: rdma-core-57.0-2.el9.x86_64                                             23/23
Failed to start jobs: Failed to enqueue some jobs, see logs for details: No such file or directory
  Verifying        : doca-kernel-3.1.0-091000.x86_64                                          1/23
  Verifying        : doca-ufm-3.1.0-091000.x86_64                                             2/23
  Verifying        : ibutils2-2.1.1-0.22300.MLNX20250720.g13bb9fedb.2507097.x86_64            3/23
  Verifying        : infiniband-diags-2507mlnx58-1.2507097.x86_64                             4/23
  Verifying        : kmod-iser-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                        5/23
  Verifying        : kmod-isert-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                       6/23
  Verifying        : kmod-kernel-mft-mlnx-4.33.0-1.rhel9u6.x86_64                             7/23
  Verifying        : kmod-knem-1.1.4.90mlnx3-OFED.25.07.0.9.7.1.rhel9u6.x86_64                8/23
  Verifying        : kmod-mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64             9/23
  Verifying        : kmod-srp-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                        10/23
  Verifying        : kmod-xpmem-2.7.4-1.2507097.rhel9u6.rhel9u6.x86_64                       11/23
  Verifying        : libibumad-2507mlnx58-1.2507097.x86_64                                   12/23
  Verifying        : mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                 13/23
  Verifying        : mlnx-ofa_kernel-devel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64           14/23
  Verifying        : mlnx-ofa_kernel-source-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64          15/23
  Verifying        : mlnx-tools-25.07-0.2507097.x86_64                                       16/23
  Verifying        : ofed-scripts-25.07-OFED.25.07.0.9.7.x86_64                              17/23
  Verifying        : xpmem-2.7.4-1.2507097.rhel9u6.x86_64                                    18/23
  Verifying        : lsof-4.94.0-3.el9.x86_64                                                19/23
  Verifying        : kernel-core-5.14.0-570.62.1.el9_6.x86_64                                20/23
  Verifying        : kernel-modules-core-5.14.0-570.62.1.el9_6.x86_64                        21/23
  Verifying        : rdma-core-2507mlnx58-1.2507097.x86_64                                   22/23
  Verifying        : rdma-core-57.0-2.el9.x86_64                                             23/23
Installed products updated.

Upgraded:
  rdma-core-2507mlnx58-1.2507097.x86_64
Installed:
  doca-kernel-3.1.0-091000.x86_64
  doca-ufm-3.1.0-091000.x86_64
  ibutils2-2.1.1-0.22300.MLNX20250720.g13bb9fedb.2507097.x86_64
  infiniband-diags-2507mlnx58-1.2507097.x86_64
  kernel-core-5.14.0-570.62.1.el9_6.x86_64
  kernel-modules-core-5.14.0-570.62.1.el9_6.x86_64
  kmod-iser-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64
  kmod-isert-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64
  kmod-kernel-mft-mlnx-4.33.0-1.rhel9u6.x86_64
  kmod-knem-1.1.4.90mlnx3-OFED.25.07.0.9.7.1.rhel9u6.x86_64
  kmod-mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64
  kmod-srp-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64
  kmod-xpmem-2.7.4-1.2507097.rhel9u6.rhel9u6.x86_64
  libibumad-2507mlnx58-1.2507097.x86_64
  lsof-4.94.0-3.el9.x86_64
  mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64
  mlnx-ofa_kernel-devel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64
  mlnx-ofa_kernel-source-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64
  mlnx-tools-25.07-0.2507097.x86_64
  ofed-scripts-25.07-OFED.25.07.0.9.7.x86_64
  xpmem-2.7.4-1.2507097.rhel9u6.x86_64

Complete!

Install UFM Software

Download the UFM software from the NVIDIA Licensing Portal. Pro-tip: In the browser use the Inspect->Network tool to grab the download URL and then use wget on the actual host to save time.

Once the UFM software is on the host, gunzip and untar the contents into the /tmp directory, then change into the extracted directory and run the install.sh script.
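A minimal sketch of the extraction, assuming the downloaded tarball is named ufm-6.23.1-6.el9.x86_64.tgz (an assumption based on the version shown in the transcript below):

# tar -xzf ufm-6.23.1-6.el9.x86_64.tgz -C /tmp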

# cd /tmp
# ls ufm*
check_ports.sh  check_prereq.sh  common_defines  functions  handle_ufmapp_user.sh  install_common  install.sh  ufm_backup.sh  ufm-repo  uninstall.sh  upgrade.sh
# cd ufm*
/tmp/ufm-6.23.1-6.el9.x86_64
# ./install.sh
Do you want to install UFM Enterprise [y|n]? y

UFM IB PREREQUISITE TEST
Installed distribution                            [OK]
Server architecture                               [OK]
NVIDIA Host Infiniband Networking Driver version  [OK]
Other SM                                          [OK]
Timezone configuration                            [OK]
IPtables service                                  [OK]
Required RPM(s)                                   [OK]
Sudoers directory existence                       [OK]
Sudoers directory inclusion                       [OK]
Conflicting RPM(s)                                [OK]
IB interface                                      [OK]
Localhost resolving                               [OK]
Hostname resolving                                [OK]
SELinux disabled                                  [OK]
Available disk space                              [OK]
Write permissions on /tmp for other               [OK]
Virtual IP Port                                   [OK]
Ufmapp user definitions                           [OK]
Checking that all required ports are available
Checking tcp ports
Checking state of port 3307
Port 3307 is free
Checking state of port 2222
Port 2222 is free
Checking state of port 8088
Port 8088 is free
Checking state of port 8080
Port 8080 is free
Checking state of port 8081
Port 8081 is free
Checking state of port 8082
Port 8082 is free
Checking state of port 8083
Port 8083 is free
Checking state of port 8089
Port 8089 is free
Checking udp ports
Checking state of port 6306
Port 6306 is free
Checking state of port 8005
Port 8005 is free
Checking tcp ports allowed for httpd
Checking state of port 443
Port 443 is free
Checking state of port 80
Port 80 is free
nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com: All prerequisite tests passed. See /tmp/ufm_prereq.log for more details
Installing UFM...
[*] Restoring HA flags...
Default plugins bundle doesn't exist, skipping stage.
Make sure the bundle tarball is in the /tmp directory.
Or run it manually: /opt/ufm/scripts/manage_ufm_plugins deploy-bundle -f plugins_bundle_path
[*] UFM installation log : /tmp/ufm_install_10515.log
[*] UFM Installation finished successfully.
[*] To enable UFM on startup run: systemctl enable ufm-enterprise.service
[*] To Start UFM Please run: systemctl start ufm-enterprise.service

Do not start the service yet as we have a few configuration tasks to complete.

Configure UFM

Before we can start UFM we need to make a few changes to the initial configuration. First we need to set the InfiniBand interface to use, so let's find the interface.

# find /sys/class/net -mindepth 1 -maxdepth 1 -lname '*virtual*' -prune -o -printf '%f\n'
ibp13s0
eno12409
eno12399
enp55s0np0

We can tell from the output above that our InfiniBand interface is ibp13s0, as the others are Ethernet. We will use this to set the InfiniBand interface in the UFM configuration file.
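If in doubt, the kernel link type also confirms this: an InfiniBand interface reports type 32 (ARPHRD_INFINIBAND) while Ethernet interfaces report 1.

# cat /sys/class/net/ibp13s0/type
32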

# sed -i "s/fabric_interface =.*/fabric_interface = ibp13s0/" /opt/ufm/conf/gv.cfg

We also need to set the management interface in the configuration to our primary Ethernet interface on the host, which is eno12399.

# sed -i "s/mgmt_interface =.*/mgmt_interface = eno12399/" /opt/ufm/conf/gv.cfg # sed -i "s/ufma_interfaces =.*/ufma_interfaces = eno12399/" /opt/ufm/conf/gv.cfg

Next let's enable telemetry history in the configuration.

# sed -i "s/history_enabled =.*/history_enabled = true/" /opt/ufm/conf/gv.cfg

Now we need to make sure a couple of users are added to the docker group on the system in order for the plugin web interface upload mechanism to work appropriately. We will be adding the ufmapp and nginx users.

# usermod -aG docker ufmapp
# usermod -aG docker nginx
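We can confirm the group membership took effect with the id command:

# id ufmapp
# id nginx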

UFM has the concept of plugins to add on other features or enhancements. Some plugins, though not all of them, come in a plugin bundle which can be obtained from the NVIDIA Licensing Portal. We have gone ahead and downloaded the latest bundle to our UFM system. First we need to untar the bundle and unzip the contents.

# tar -xf ufm_plugins_bundle_20251113-0836.tar
# gzip -d ufm-plugin-clusterminder_1.1.14-1293.amd64.tgz ufm-plugin-utm_1.23.1-38321085.x86_64.tgz ufm-plugin-tfs_1.1.2-0.tgz ufm-plugin-gnmi_telemetry_1.3.8-5.tgz ufm-plugin-ndt_1.1.1-25.gz ufm-plugin-kpi_1.0.10-0.tgz ufm-plugin-pmc_1.19.35.tgz ufm-plugin-cablevalidation_1.7.1-4_x86_64.tgz ufm-plugin-ib-link-resiliency_1.1.5-7.x86_64.tgz

Next we can pre-load the plugins into Docker. Here I am loading all of the plugins, but one might load only those needed for their environment. I should also note that plugins can be loaded via the UFM web interface once the services are up and running.

# docker load -i ufm-plugin-clusterminder_1.1.14-1293.amd64.tar
Loaded image: mellanox/ufm-plugin-clusterminder:1.1.14-1293
# docker load -i ufm-plugin-utm_1.23.1-38321085.x86_64.tar
Loaded image: harbor.mellanox.com/collectx/gitlab/utm/x86_64/ufm-plugin-utm:1.23.1-38321085
# docker load -i ufm-plugin-tfs_1.1.2-0.tar
Loaded image: mellanox/ufm-plugin-tfs:1.1.2-0
# docker load -i ufm-plugin-ib-link-resiliency_1.1.5-7.x86_64.tar
Loaded image: mellanox/ufm-plugin-ib-link-resiliency:1.1.5-7
# docker load -i ufm-plugin-gnmi_telemetry_1.3.8-5.tar
Loaded image: mellanox/ufm-plugin-gnmi_telemetry:1.3.8-5
# docker load -i ufm-plugin-ndt_1.1.1-25
Loaded image: mellanox/ufm-plugin-ndt:1.1.1-25
# docker load -i ufm-plugin-kpi_1.0.10-0.tar
Loaded image: mellanox/ufm-plugin-kpi:1.0.10-0
# docker load -i ufm-plugin-pmc_1.19.35.tar
Loaded image: harbor.mellanox.com/collectx/gitlab/x86_64/ufm-plugin-pmc:1.19.35
# docker load -i ufm-plugin-cablevalidation_1.7.1-4_x86_64.tar
Loaded image: mellanox/ufm-plugin-cablevalidation:1.7.1-4
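Once loaded, the plugin images should all be visible to Docker:

# docker images | grep ufm-plugin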

This completes all the pre-configuration activities.

Start UFM Services

Now we can finally start the UFM services with the following command.

# systemctl start ufm-enterprise.service

Optionally we can set the services to start when the host comes up from a reboot.

# systemctl enable ufm-enterprise.service

Finally let's check the status of the services.

# systemctl status ufm-enterprise.service
● ufm-enterprise.service - UFM Enterprise
     Loaded: loaded (/usr/lib/systemd/system/ufm-enterprise.service; disabled; preset: disabled)
     Active: active (exited) since Fri 2025-11-21 16:09:12 EST; 8s ago
    Process: 14655 ExecStart=/etc/init.d/ufmd start (code=exited, status=0/SUCCESS)
   Main PID: 14655 (code=exited, status=0/SUCCESS)
      Tasks: 588 (limit: 1643822)
     Memory: 548.3M (peak: 571.2M)
        CPU: 7.555s
     CGroup: /system.slice/ufm-enterprise.service
             ├─15131 /opt/ufm/opensm/sbin/opensm --config /opt/ufm/files/conf/opensm/opensm.conf
             ├─15138 osm_crashd
             ├─15625 /opt/ufm/sharp2/bin/sharp_am -O /opt/ufm/files/conf/sharp/sharp_am.cfg
             ├─15884 /opt/ufm/telemetry/venv3/bin/python3 /opt/ufm/telemetry/venv3/bin/supervisord --config=/opt/ufm/files/conf/telemetry/supervisord.conf
             ├─16122 /opt/ufm/telemetry/venv3/bin/python3 /opt/ufm/telemetry/venv3/bin/supervisord --config=/opt/ufm/files/conf/secondary_telemetry/supervisord.conf
             ├─16147 /opt/ufm/telemetry/bin/launch_ibdiagnet --config /opt/ufm/files/conf/telemetry/launch_ibdiagnet_config.ini
             ├─16148 /opt/ufm/telemetry/bin/watcher --config /opt/ufm/files/conf/telemetry/launch_ibdiagnet_config.ini
             ├─16149 /opt/ufm/telemetry/bin/launch_ibdiagnet --config /opt/ufm/files/conf/telemetry/launch_ibdiagnet_config.ini
             ├─16150 /opt/ufm/telemetry/bin/watcher --config /opt/ufm/files/conf/telemetry/launch_ibdiagnet_config.ini
             ├─16151 timeout 10010 /opt/ufm/telemetry/bin/ibdiagnet --long_run_timeout 1000 --long_run_iteration 10000 -o /opt/ufm/files/log -i mlx5_0 --mads_timeout 50 --config_file /opt/ufm/conf/opensm/ibdiag>
             ├─16152 /opt/ufm/telemetry/bin/ibdiagnet --long_run_timeout 1000 --long_run_iteration 10000 -o /opt/ufm/files/log -i mlx5_0 --mads_timeout 50 --config_file /opt/ufm/conf/opensm/ibdiag.conf --key_up>
             ├─16199 /opt/ufm/telemetry/bin/launch_ibdiagnet --config /opt/ufm/files/conf/secondary_telemetry/launch_ibdiagnet_config.ini
             ├─16200 /opt/ufm/telemetry/bin/watcher --config /opt/ufm/files/conf/secondary_telemetry/launch_ibdiagnet_config.ini
             ├─16201 /opt/ufm/telemetry/bin/launch_ibdiagnet --config /opt/ufm/files/conf/secondary_telemetry/launch_ibdiagnet_config.ini
             ├─16202 /opt/ufm/telemetry/bin/watcher --config /opt/ufm/files/conf/secondary_telemetry/launch_ibdiagnet_config.ini
             ├─16206 timeout 12010 /opt/ufm/telemetry/bin/ibdiagnet --long_run_timeout 300000 --long_run_iteration 40 -o /opt/ufm/files/log/secondary_telemetry -i mlx5_0 --pm_pause 0 --config_file /opt/ufm/conf>
             ├─16207 /opt/ufm/telemetry/bin/ibdiagnet --long_run_timeout 300000 --long_run_iteration 40 -o /opt/ufm/files/log/secondary_telemetry -i mlx5_0 --pm_pause 0 --config_file /opt/ufm/conf/opensm/ibdiag>
             ├─16497 /opt/ufm/venv_ufm/bin/python3 -W ignore::DeprecationWarning -O /opt/ufm/gvvm/authentication_server/auth_server_main.pyc
             ├─16780 "/opt/ufm/venv_ufm/bin/python3 -O /opt/ufm/unhealthyports/upcore/unhealthy_ports_main.pyc"
             ├─16864 /opt/ufm/venv_ufm/bin/python3 /opt/ufm/ufmtelemetrysampling/sampling.pyc
             └─17088 /opt/ufm/venv_ufm/bin/python3 /opt/ufm/ufmhealth/UfmHealthRunner.pyc --config_file /opt/ufm/files/conf/UFMHealthConfiguration.xml --second_config_file /opt/ufm/files/conf/UFMInfraHealthConf>

Nov 21 16:09:08 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com su[16712]: pam_unix(su:session): session closed for user ufmapp
Nov 21 16:09:08 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com ufmd[14743]: Starting UFM main module: [ OK ]
Nov 21 16:09:11 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com ufmd[14743]: Starting UnhealthyPorts: [ OK ]
Nov 21 16:09:11 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com ufmd[14743]: Starting Telemetry Sampling: [ OK ]
Nov 21 16:09:11 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com sudo[16898]: root : PWD=/opt/ufm/gvvm/infra ; USER=root ; COMMAND=/sbin/apachectl graceful
Nov 21 16:09:11 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com sudo[16898]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Nov 21 16:09:11 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com sudo[16898]: pam_unix(sudo:session): session closed for user root
Nov 21 16:09:12 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com crontab[17107]: (root) LIST (root)
Nov 21 16:09:12 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com crontab[17100]: (root) REPLACE (root)
Nov 21 16:09:12 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com systemd[1]: Finished UFM Enterprise.

If the service does not start, make sure there is no other Subnet Manager running on the fabric. The following error will show in the service status if that is the case.

# systemctl start ufm-enterprise.service
Job for ufm-enterprise.service failed because the control process exited with error code.
See "systemctl status ufm-enterprise.service" and "journalctl -xeu ufm-enterprise.service" for details.
[root@nvd-srv-26 conf]# systemctl status ufm-enterprise.service
× ufm-enterprise.service - UFM Enterprise
     Loaded: loaded (/usr/lib/systemd/system/ufm-enterprise.service; disabled; preset: disabled)
     Active: failed (Result: exit-code) since Fri 2025-11-21 10:34:22 EST; 5s ago
    Process: 14049 ExecStart=/etc/init.d/ufmd start (code=exited, status=1/FAILURE)
   Main PID: 14049 (code=exited, status=1/FAILURE)
        CPU: 380ms

Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com ufmd[14137]: <13>Nov 21 10:34:22 ufm: Validation of UFM configuration files failed!
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com crontab[14218]: (root) LIST (root)
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com crontab[14221]: (root) REPLACE (root)
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com ufm[14238]: Other SM is in the fabric: lid:1, guid:0xfc6a1c0300e7ecc0, priority:15, state:SMINFO_MASTER
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com ufmd[14137]: Other SM is in the fabric: lid:1, guid:0xfc6a1c0300e7ecc0, priority:15, state:SMINFO_MASTER
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com ufm[14241]: Other SM is master in the fabric. Please stop all other SM and start UFM.
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com ufmd[14137]: <13>Nov 21 10:34:22 ufm: Other SM is master in the fabric. Please stop all other SM and start UFM.
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com systemd[1]: ufm-enterprise.service: Main process exited, code=exited, status=1/FAILURE
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com systemd[1]: ufm-enterprise.service: Failed with result 'exit-code'.
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com systemd[1]: Failed to start UFM Enterprise.
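To track down the competing subnet manager, the sminfo utility from the infiniband-diags package we installed earlier can query the fabric; it reports the LID, GUID and priority of the current master SM. This is a suggestion rather than a step from my setup:

# sminfo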

We can also look at the status of the license on our system, which in this case is just an evaluation license.

# ufmlicense
|---------------------------------------------------------------------------------------------------------------------|
| Customer ID | SN         | swName         | Type       | MAC Address | Exp. Date  | Limit | Functionality | Status   |
|---------------------------------------------------------------------------------------------------------------------|
| 986799359   | 1234567899 | UFM Enterprise | Evaluation | NA          | 2025-12-21 | 1024  | Advanced      | Valid    |
|---------------------------------------------------------------------------------------------------------------------|

If all went well we should be able to log in to the UFM Web UI. The default credentials are admin with a password of 123456.
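Before opening a browser, we can also confirm the REST interface is answering. The ufm_version endpoint below comes from NVIDIA's UFM REST API documentation; adjust the credentials if you have already changed them:

# curl -k -u admin:123456 https://localhost/ufmRest/app/ufm_version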

UFM Overview Web UI Video

The following video gives a brief overview of UFM.  Keep in mind this was my first exposure to UFM.