Tuesday, December 02, 2025

NVIDIA Unified Fabric Manager (UFM) on RHEL9

The UFM platform empowers research and industrial data center operators to efficiently provision, monitor, manage, and preventively troubleshoot and maintain their high-performance InfiniBand networking fabric. The UFM platform is made up of multiple solution levels and a comprehensive feature set to meet the broadest range of modern, scale-out data center requirements. Using UFM, you can realize higher utilization of fabric resources and gain a competitive advantage, while reducing opex.

As indicated, UFM is made up of multiple solution levels, which include UFM Telemetry, UFM Enterprise, and UFM Cyber-AI. This blog will focus on UFM Enterprise and its relationship to the InfiniBand fabric. More information about UFM can be found here.

The rest of this blog will describe the process of getting UFM up and running on a host and then taking a test drive of the UFM web interface.  The blog is broken down into the following workflow sections:

  • Environment
  • Configure Repos
  • Set Firewall Rules
  • Disable SELinux
  • Install Software Dependencies
  • Install UFM Software
  • Configure UFM
  • Start UFM Services
  • UFM Overview Web UI Video

Environment

The test environment consists of an R760xa server running Red Hat Enterprise Linux 9.7. The system also has an InfiniBand interface that connects it to the InfiniBand fabric.

# cat /etc/redhat-release
Red Hat Enterprise Linux release 9.7 (Plow)

# uname -a
Linux nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com 5.14.0-611.8.1.el9_7.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Nov 13 05:30:00 EST 2025 x86_64 x86_64 x86_64 GNU/Linux

Configure Repos

We need to configure a few repositories on the UFM host: CodeReady Builder, EPEL, NVIDIA DOCA, and Docker. First, we will enable the CodeReady Builder repository (this assumes the RHEL host is already registered and entitled).

# subscription-manager repos --enable codeready-builder-for-rhel-9-$(arch)-rpms
Repository 'codeready-builder-for-rhel-9-x86_64-rpms' is enabled for this system.

Next we can enable the EPEL repository.

# dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm -y
Updating Subscription Management repositories.
Red Hat CodeReady Linux Builder for RHEL 9 x86_64 (RPMs)        47 MB/s |  15 MB     00:00
epel-release-latest-9.noarch.rpm                               1.1 MB/s |  19 kB     00:00
Dependencies resolved.
===================================================================================
 Package           Architecture       Version          Repository            Size
===================================================================================
Installing:
 epel-release      noarch             9-10.el9         @commandline          19 k

Transaction Summary
===================================================================================
Install  1 Package

Total size: 19 k
Installed size: 26 k
Downloading Packages:
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                           1/1
  Installing       : epel-release-9-10.el9.noarch                              1/1
  Running scriptlet: epel-release-9-10.el9.noarch                              1/1
Many EPEL packages require the CodeReady Builder (CRB) repository.
It is recommended that you run /usr/bin/crb enable to enable the CRB repository.

  Verifying        : epel-release-9-10.el9.noarch                              1/1
Installed products updated.

Installed:
  epel-release-9-10.el9.noarch

Complete!

Now we need to add the NVIDIA DOCA repository.

# cat <<EOF > /etc/yum.repos.d/doca.repo
[doca]
name=DOCA Online Repo
baseurl=https://linux.mellanox.com/public/repo/doca/3.1.0/rhel9.6/x86_64/
enabled=1
gpgcheck=0
EOF
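
Before moving on, it can be worth confirming that the new repo definition is usable. This is an optional sanity check using standard dnf commands, not a required step:

dnf makecache       # refresh metadata so a typo in the baseurl surfaces immediately
dnf info doca-ufm   # confirm the UFM meta-package is now visible from the doca repo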

Finally, we will add the Docker repository. Docker is required because UFM plugins run as containers.

# dnf config-manager --add-repo https://download.docker.com/linux/rhel/docker-ce.repo
Updating Subscription Management repositories.
Adding repo from: https://download.docker.com/linux/rhel/docker-ce.repo

With all the repositories added, our repolist should look like the following.

# yum repolist
Updating Subscription Management repositories.
repo id                                    repo name
codeready-builder-for-rhel-9-x86_64-rpms   Red Hat CodeReady Linux Builder for RHEL 9 x86_64 (RPMs)
doca                                       DOCA Online Repo
docker-ce-stable                           Docker CE Stable - x86_64
epel                                       Extra Packages for Enterprise Linux 9 - x86_64
epel-cisco-openh264                        Extra Packages for Enterprise Linux 9 openh264 (From Cisco) - x86_64
rhel-9-for-x86_64-appstream-rpms           Red Hat Enterprise Linux 9 for x86_64 - AppStream (RPMs)
rhel-9-for-x86_64-baseos-rpms              Red Hat Enterprise Linux 9 for x86_64 - BaseOS (RPMs)

Set Firewall Rules

There are some firewall rules we need to add in order to access the UFM web interface. Specifically, we need to permanently open the http and https service ports in the public zone.

# firewall-cmd --get-active-zones
public
  interfaces: eno12399 enp55s0np0
# firewall-cmd --zone=public --add-service=http
success
# firewall-cmd --zone=public --add-service=https
success
# firewall-cmd --permanent --zone=public --add-service=http
success
# firewall-cmd --permanent --zone=public --add-service=https
success
# firewall-cmd --reload
success
# firewall-cmd --zone=public --list-services
cockpit dhcpv6-client http https ssh
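
If the web UI turns out not to be reachable later, one quick optional check (not something the UFM documentation requires) is that the interface you will reach the UI through really sits in the zone we just modified:

firewall-cmd --get-zone-of-interface=eno12399   # eno12399 is the management NIC in this environment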

Disable SELinux

Per NVIDIA's official documentation, UFM requires SELinux to be disabled, so we set it to disabled with the following sed command.

# sed -i "s/SELINUX=.*/SELINUX=disabled/" /etc/selinux/config

We will need to reboot the node for the change to take effect.

After the reboot, validate that SELinux is disabled; otherwise the UFM installer will complain.

# sestatus
SELinux status:                 disabled

Install Software Dependencies

A variety of software packages need to be installed as dependencies before UFM itself can be installed. The command below pulls them all in, including the Docker CE packages.

# dnf install -y wget bc mod_ldap sshpass lftp zip rsync telnet qperf dos2unix httpd php net-snmp net-snmp-libs net-snmp-utils mod_ssl libnsl libxslt sqlite mod_session cairo apr-util-openssl net-tools docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Start, enable, and check the status of Docker.

# systemctl start docker
# systemctl enable docker
Created symlink /etc/systemd/system/multi-user.target.wants/docker.service → /usr/lib/systemd/system/docker.service.
# systemctl status docker
● docker.service - Docker Application Container Engine
     Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; preset: disabled)
     Active: active (running) since Thu 2025-11-20 17:04:39 EST; 11s ago
TriggeredBy: ● docker.socket
       Docs: https://docs.docker.com
   Main PID: 3625 (dockerd)
      Tasks: 21
     Memory: 107.7M (peak: 110.5M)
        CPU: 103ms
     CGroup: /system.slice/docker.service
             └─3625 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

Nov 20 17:04:38 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:38.602574940-05:00" level=info msg="Deleting nftables IPv6 rules" error="exit status 1"
Nov 20 17:04:38 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:38.615498854-05:00" level=info msg="Firewalld: created docker-forwarding policy"
Nov 20 17:04:39 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:39.151067145-05:00" level=info msg="Loading containers: done."
Nov 20 17:04:39 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:39.156910506-05:00" level=info msg="Docker daemon" commit=e9ff10b containerd-snapshotter=true storage-driver=overla>
Nov 20 17:04:39 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:39.157296220-05:00" level=info msg="Initializing buildkit"
Nov 20 17:04:39 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:39.161871789-05:00" level=warning msg="git source cannot be enabled: failed to find git binary: exec: \"git\": exec>
Nov 20 17:04:39 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:39.163207822-05:00" level=info msg="Completed buildkit initialization"
Nov 20 17:04:39 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:39.165944984-05:00" level=info msg="Daemon has completed initialization"
Nov 20 17:04:39 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com dockerd[3625]: time="2025-11-20T17:04:39.165978374-05:00" level=info msg="API listen on /run/docker.sock"
Nov 20 17:04:39 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com systemd[1]: Started Docker Application Container Engine.

Install the DOCA driver packages required by UFM.

# dnf install doca-ufm doca-kernel
Updating Subscription Management repositories.
Last metadata expiration check: 0:17:44 ago on Thu 20 Nov 2025 04:54:18 PM EST.
Dependencies resolved.
===================================================================================================================
 Package                  Architecture  Version                                        Repository             Size
===================================================================================================================
Installing:
 doca-kernel              x86_64  3.1.0-091000                                   doca                          7.3 k
 doca-ufm                 x86_64  3.1.0-091000                                   doca                          6.9 k
Upgrading:
 rdma-core                x86_64  2507mlnx58-1.2507097                           doca                           46 k
Installing dependencies:
 ibutils2                 x86_64  2.1.1-0.22300.MLNX20250720.g13bb9fedb.2507097  doca                          3.9 M
 infiniband-diags         x86_64  2507mlnx58-1.2507097                           doca                          314 k
 kernel-core              x86_64  5.14.0-570.62.1.el9_6                          rhel-9-for-x86_64-baseos-rpms  18 M
 kernel-modules-core      x86_64  5.14.0-570.62.1.el9_6                          rhel-9-for-x86_64-baseos-rpms  31 M
 kmod-iser                x86_64  25.07-OFED.25.07.0.9.7.1.rhel9u6               doca                           43 k
 kmod-isert               x86_64  25.07-OFED.25.07.0.9.7.1.rhel9u6               doca                           46 k
 kmod-kernel-mft-mlnx     x86_64  4.33.0-1.rhel9u6                               doca                           41 k
 kmod-knem                x86_64  1.1.4.90mlnx3-OFED.25.07.0.9.7.1.rhel9u6       doca                           37 k
 kmod-mlnx-ofa_kernel     x86_64  25.07-OFED.25.07.0.9.7.1.rhel9u6               doca                          1.9 M
 kmod-srp                 x86_64  25.07-OFED.25.07.0.9.7.1.rhel9u6               doca                           62 k
 kmod-xpmem               x86_64  2.7.4-1.2507097.rhel9u6.rhel9u6                doca                          492 k
 libibumad                x86_64  2507mlnx58-1.2507097                           doca                           27 k
 lsof                     x86_64  4.94.0-3.el9                                   rhel-9-for-x86_64-baseos-rpms 241 k
 mlnx-ofa_kernel          x86_64  25.07-OFED.25.07.0.9.7.1.rhel9u6               doca                           38 k
 mlnx-ofa_kernel-devel    x86_64  25.07-OFED.25.07.0.9.7.1.rhel9u6               doca                          2.3 M
 mlnx-ofa_kernel-source   x86_64  25.07-OFED.25.07.0.9.7.1.rhel9u6               doca                          2.8 M
 mlnx-tools               x86_64  25.07-0.2507097                                doca                           78 k
 ofed-scripts             x86_64  25.07-OFED.25.07.0.9.7                         doca                           65 k
 xpmem                    x86_64  2.7.4-1.2507097.rhel9u6                        doca                           20 k

Transaction Summary
===================================================================================================================
Install  21 Packages
Upgrade   1 Package

Total download size: 61 M
Is this ok [y/N]: y
Downloading Packages:
(1/22): doca-kernel-3.1.0-091000.x86_64.rpm                                        21 kB/s | 7.3 kB     00:00
(2/22): doca-ufm-3.1.0-091000.x86_64.rpm                                           18 kB/s | 6.9 kB     00:00
(3/22): kmod-iser-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64.rpm                      92 kB/s |  43 kB     00:00
(4/22): infiniband-diags-2507mlnx58-1.2507097.x86_64.rpm                          496 kB/s | 314 kB     00:00
(5/22): ibutils2-2.1.1-0.22300.MLNX20250720.g13bb9fedb.2507097.x86_64.rpm         3.9 MB/s | 3.9 MB     00:00
(6/22): kmod-isert-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64.rpm                     98 kB/s |  46 kB     00:00
(7/22): kmod-kernel-mft-mlnx-4.33.0-1.rhel9u6.x86_64.rpm                          102 kB/s |  41 kB     00:00
(8/22): kmod-knem-1.1.4.90mlnx3-OFED.25.07.0.9.7.1.rhel9u6.x86_64.rpm              94 kB/s |  37 kB     00:00
(9/22): kmod-srp-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64.rpm                      131 kB/s |  62 kB     00:00
(10/22): kmod-xpmem-2.7.4-1.2507097.rhel9u6.rhel9u6.x86_64.rpm                    706 kB/s | 492 kB     00:00
(11/22): kmod-mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64.rpm         2.2 MB/s | 1.9 MB     00:00
(12/22): libibumad-2507mlnx58-1.2507097.x86_64.rpm                                 67 kB/s |  27 kB     00:00
(13/22): mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64.rpm               96 kB/s |  38 kB     00:00
(14/22): mlnx-tools-25.07-0.2507097.x86_64.rpm                                    164 kB/s |  78 kB     00:00
(15/22): mlnx-ofa_kernel-devel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64.rpm        2.6 MB/s | 2.3 MB     00:00
(16/22): mlnx-ofa_kernel-source-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64.rpm       3.1 MB/s | 2.8 MB     00:00
(17/22): lsof-4.94.0-3.el9.x86_64.rpm                                             1.1 MB/s | 241 kB     00:00
(18/22): ofed-scripts-25.07-OFED.25.07.0.9.7.x86_64.rpm                           139 kB/s |  65 kB     00:00
(19/22): xpmem-2.7.4-1.2507097.rhel9u6.x86_64.rpm                                  50 kB/s |  20 kB     00:00
(20/22): kernel-core-5.14.0-570.62.1.el9_6.x86_64.rpm                              81 MB/s |  18 MB     00:00
(21/22): kernel-modules-core-5.14.0-570.62.1.el9_6.x86_64.rpm                      75 MB/s |  31 MB     00:00
(22/22): rdma-core-2507mlnx58-1.2507097.x86_64.rpm                                 97 kB/s |  46 kB     00:00
-------------------------------------------------------------------------------------------------------------------
Total                                                                              16 MB/s |  61 MB     00:03
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                            1/1
  Installing       : kernel-modules-core-5.14.0-570.62.1.el9_6.x86_64                                          1/23
  Installing       : kernel-core-5.14.0-570.62.1.el9_6.x86_64                                                  2/23
  Running scriptlet: kernel-core-5.14.0-570.62.1.el9_6.x86_64                                                  2/23
  Installing       : kmod-mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                              3/23
  Running scriptlet: kmod-mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                              3/23
  Installing       : libibumad-2507mlnx58-1.2507097.x86_64                                                     4/23
  Running scriptlet: libibumad-2507mlnx58-1.2507097.x86_64                                                     4/23
  Installing       : ofed-scripts-25.07-OFED.25.07.0.9.7.x86_64                                                5/23
  Running scriptlet: ofed-scripts-25.07-OFED.25.07.0.9.7.x86_64                                                5/23
  Installing       : mlnx-tools-25.07-0.2507097.x86_64                                                         6/23
  Installing       : ibutils2-2.1.1-0.22300.MLNX20250720.g13bb9fedb.2507097.x86_64                             7/23
  Installing       : infiniband-diags-2507mlnx58-1.2507097.x86_64                                              8/23
  Running scriptlet: infiniband-diags-2507mlnx58-1.2507097.x86_64                                              8/23
  Installing       : kmod-iser-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                                         9/23
  Running scriptlet: kmod-iser-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                                         9/23
  Installing       : kmod-isert-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                                       10/23
  Running scriptlet: kmod-isert-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                                       10/23
  Installing       : kmod-srp-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                                         11/23
  Running scriptlet: kmod-srp-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                                         11/23
  Installing       : kmod-xpmem-2.7.4-1.2507097.rhel9u6.rhel9u6.x86_64                                        12/23
  Running scriptlet: kmod-xpmem-2.7.4-1.2507097.rhel9u6.rhel9u6.x86_64                                        12/23
  Upgrading        : rdma-core-2507mlnx58-1.2507097.x86_64                                                    13/23
  Running scriptlet: rdma-core-2507mlnx58-1.2507097.x86_64                                                    13/23
  Installing       : lsof-4.94.0-3.el9.x86_64                                                                 14/23
  Installing       : mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                                  15/23
  Running scriptlet: mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                                  15/23
Configured /etc/security/limits.conf
  Installing       : xpmem-2.7.4-1.2507097.rhel9u6.x86_64                                                     16/23
  Installing       : mlnx-ofa_kernel-source-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                           17/23
  Installing       : mlnx-ofa_kernel-devel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                            18/23
  Running scriptlet: mlnx-ofa_kernel-devel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                            18/23
  Installing       : kmod-knem-1.1.4.90mlnx3-OFED.25.07.0.9.7.1.rhel9u6.x86_64                                19/23
  Running scriptlet: kmod-knem-1.1.4.90mlnx3-OFED.25.07.0.9.7.1.rhel9u6.x86_64                                19/23
  Installing       : kmod-kernel-mft-mlnx-4.33.0-1.rhel9u6.x86_64                                             20/23
  Running scriptlet: kmod-kernel-mft-mlnx-4.33.0-1.rhel9u6.x86_64                                             20/23
  Installing       : doca-kernel-3.1.0-091000.x86_64                                                          21/23
  Installing       : doca-ufm-3.1.0-091000.x86_64                                                             22/23
  Running scriptlet: rdma-core-57.0-2.el9.x86_64                                                              23/23
  Cleanup          : rdma-core-57.0-2.el9.x86_64                                                              23/23
  Running scriptlet: rdma-core-57.0-2.el9.x86_64                                                              23/23
  Running scriptlet: kernel-modules-core-5.14.0-570.62.1.el9_6.x86_64                                         23/23
  Running scriptlet: kernel-core-5.14.0-570.62.1.el9_6.x86_64                                                 23/23
  Running scriptlet: mlnx-ofa_kernel-devel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                            23/23
  Running scriptlet: rdma-core-57.0-2.el9.x86_64                                                              23/23
Failed to start jobs: Failed to enqueue some jobs, see logs for details: No such file or directory
  Verifying        : doca-kernel-3.1.0-091000.x86_64                                                           1/23
  Verifying        : doca-ufm-3.1.0-091000.x86_64                                                              2/23
  Verifying        : ibutils2-2.1.1-0.22300.MLNX20250720.g13bb9fedb.2507097.x86_64                             3/23
  Verifying        : infiniband-diags-2507mlnx58-1.2507097.x86_64                                              4/23
  Verifying        : kmod-iser-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                                         5/23
  Verifying        : kmod-isert-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                                        6/23
  Verifying        : kmod-kernel-mft-mlnx-4.33.0-1.rhel9u6.x86_64                                              7/23
  Verifying        : kmod-knem-1.1.4.90mlnx3-OFED.25.07.0.9.7.1.rhel9u6.x86_64                                 8/23
  Verifying        : kmod-mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                              9/23
  Verifying        : kmod-srp-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                                         10/23
  Verifying        : kmod-xpmem-2.7.4-1.2507097.rhel9u6.rhel9u6.x86_64                                        11/23
  Verifying        : libibumad-2507mlnx58-1.2507097.x86_64                                                    12/23
  Verifying        : mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                                  13/23
  Verifying        : mlnx-ofa_kernel-devel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                            14/23
  Verifying        : mlnx-ofa_kernel-source-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64                           15/23
  Verifying        : mlnx-tools-25.07-0.2507097.x86_64                                                        16/23
  Verifying        : ofed-scripts-25.07-OFED.25.07.0.9.7.x86_64                                               17/23
  Verifying        : xpmem-2.7.4-1.2507097.rhel9u6.x86_64                                                     18/23
  Verifying        : lsof-4.94.0-3.el9.x86_64                                                                 19/23
  Verifying        : kernel-core-5.14.0-570.62.1.el9_6.x86_64                                                 20/23
  Verifying        : kernel-modules-core-5.14.0-570.62.1.el9_6.x86_64                                         21/23
  Verifying        : rdma-core-2507mlnx58-1.2507097.x86_64                                                    22/23
  Verifying        : rdma-core-57.0-2.el9.x86_64                                                              23/23
Installed products updated.

Upgraded:
  rdma-core-2507mlnx58-1.2507097.x86_64

Installed:
  doca-kernel-3.1.0-091000.x86_64
  doca-ufm-3.1.0-091000.x86_64
  ibutils2-2.1.1-0.22300.MLNX20250720.g13bb9fedb.2507097.x86_64
  infiniband-diags-2507mlnx58-1.2507097.x86_64
  kernel-core-5.14.0-570.62.1.el9_6.x86_64
  kernel-modules-core-5.14.0-570.62.1.el9_6.x86_64
  kmod-iser-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64
  kmod-isert-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64
  kmod-kernel-mft-mlnx-4.33.0-1.rhel9u6.x86_64
  kmod-knem-1.1.4.90mlnx3-OFED.25.07.0.9.7.1.rhel9u6.x86_64
  kmod-mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64
  kmod-srp-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64
  kmod-xpmem-2.7.4-1.2507097.rhel9u6.rhel9u6.x86_64
  libibumad-2507mlnx58-1.2507097.x86_64
  lsof-4.94.0-3.el9.x86_64
  mlnx-ofa_kernel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64
  mlnx-ofa_kernel-devel-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64
  mlnx-ofa_kernel-source-25.07-OFED.25.07.0.9.7.1.rhel9u6.x86_64
  mlnx-tools-25.07-0.2507097.x86_64
  ofed-scripts-25.07-OFED.25.07.0.9.7.x86_64
  xpmem-2.7.4-1.2507097.rhel9u6.x86_64

Complete!

Install UFM Software

Download the UFM software from the NVIDIA Licensing Portal. Pro tip: in the browser, use the Inspect -> Network tool to grab the download URL, then use wget on the host itself to save time.
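
For example, the download might look something like the following. The URL and file name here are purely placeholders; substitute the values captured from the browser's network inspector.

# Hypothetical example only -- replace <download-url> and the archive name with your actual values
wget -O /tmp/ufm-6.23.1-6.el9.x86_64.tgz "<download-url>"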

Once the UFM software is on the host, extract the gzipped tarball into the /tmp directory, change into the resulting directory, and run the install.sh script.
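
A minimal sketch of that extraction step, assuming the archive landed in /tmp under the hypothetical name used above:

tar -xzf /tmp/ufm-6.23.1-6.el9.x86_64.tgz -C /tmp   # extract the UFM installer tree into /tmp
cd /tmp/ufm-6.23.1-6.el9.x86_64                     # change into the extracted directory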

# cd /tmp
# ls ufm*
check_ports.sh  check_prereq.sh  common_defines  functions  handle_ufmapp_user.sh  install_common  install.sh  ufm_backup.sh  ufm-repo  uninstall.sh  upgrade.sh
# cd ufm*
# /tmp/ufm-6.23.1-6.el9.x86_64
# ./install.sh
Do you want to install UFM Enterprise [y|n]? y

UFM IB PREREQUISITE TEST
Installed distribution                               [OK]
Server architecture                                  [OK]
NVIDIA Host Infiniband Networking Driver version     [OK]
Other SM                                             [OK]
Timezone configuration                               [OK]
IPtables service                                     [OK]
Required RPM(s)                                      [OK]
Sudoers directory existence                          [OK]
Sudoers directory inclusion                          [OK]
Conflicting RPM(s)                                   [OK]
IB interface                                         [OK]
Localhost resolving                                  [OK]
Hostname resolving                                   [OK]
SELinux disabled                                     [OK]
Available disk space                                 [OK]
Write permissions on /tmp for other                  [OK]
Virtual IP Port                                      [OK]
Ufmapp user definitions                              [OK]
Checking that all required ports are available
Checking tcp ports
Checking state of port 3307
Port 3307 is free
Checking state of port 2222
Port 2222 is free
Checking state of port 8088
Port 8088 is free
Checking state of port 8080
Port 8080 is free
Checking state of port 8081
Port 8081 is free
Checking state of port 8082
Port 8082 is free
Checking state of port 8083
Port 8083 is free
Checking state of port 8089
Port 8089 is free
Checking udp ports
Checking state of port 6306
Port 6306 is free
Checking state of port 8005
Port 8005 is free
Checking tcp ports allowed for httpd
Checking state of port 443
Port 443 is free
Checking state of port 80
Port 80 is free
nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com: All prerequisite tests passed. See /tmp/ufm_prereq.log for more details
Installing UFM...
[*] Restoring HA flags...
Default plugins bundle doesn't exist, skipping stage.
Make sure the bundle tarball is in the /tmp directory.
Or run it manually: /opt/ufm/scripts/manage_ufm_plugins deploy-bundle -f plugins_bundle_path
[*] UFM installation log : /tmp/ufm_install_10515.log
[*] UFM Installation finished successfully.
[*] To enable UFM on startup run: systemctl enable ufm-enterprise.service
[*] To Start UFM Please run: systemctl start ufm-enterprise.service

Do not start the service yet as we have a few configuration tasks to complete.

Configure UFM

Before we can start UFM, we need to make a few changes to the initial configuration. First, we need to identify which InfiniBand interface to use.

# find /sys/class/net -mindepth 1 -maxdepth 1 -lname '*virtual*' -prune -o -printf '%f\n'
ibp13s0
eno12409
eno12399
enp55s0np0
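
If the interface naming alone is not conclusive, the link layer of a candidate interface can be checked directly. This is just a generic iproute2 check, not part of the UFM procedure:

ip link show ibp13s0 | grep -o 'link/[a-z]*'   # prints link/infiniband for IB ports, link/ether for Ethernet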

From the output above we can tell that our InfiniBand interface is ibp13s0, as the others are Ethernet. We will use it to set the fabric interface in the UFM configuration file.

# sed -i "s/fabric_interface =.*/fabric_interface = ibp13s0/" /opt/ufm/conf/gv.cfg

We also need to set the management interface in the configuration to the host's primary Ethernet interface, which in this environment is eno12399.

# sed -i "s/mgmt_interface =.*/mgmt_interface = eno12399/" /opt/ufm/conf/gv.cfg # sed -i "s/ufma_interfaces =.*/ufma_interfaces = eno12399/" /opt/ufm/conf/gv.cfg

Next let's enable telemetry history in the configuration.

# sed -i "s/history_enabled =.*/history_enabled = true/" /opt/ufm/conf/gv.cfg

Now we need to make sure a couple of users are added to the docker group so that the plugin upload mechanism in the web interface works properly. We will add the ufmapp and nginx users.

# usermod -aG docker ufmapp
# usermod -aG docker nginx
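
A quick verification that both accounts picked up the group (the membership only applies to new sessions, which is fine here since the UFM services have not been started yet):

id -nG ufmapp   # docker should appear in the group list
id -nG nginx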

UFM uses plugins to add extra features and enhancements. Some plugins (though not all) ship in a plugin bundle that can be obtained from the NVIDIA Licensing Portal. We have already downloaded the latest bundle to our UFM system. First, we untar the bundle and then decompress its contents.

# tar -xf ufm_plugins_bundle_20251113-0836.tar
# gzip -d ufm-plugin-clusterminder_1.1.14-1293.amd64.tgz ufm-plugin-utm_1.23.1-38321085.x86_64.tgz ufm-plugin-tfs_1.1.2-0.tgz ufm-plugin-gnmi_telemetry_1.3.8-5.tgz ufm-plugin-ndt_1.1.1-25.gz ufm-plugin-kpi_1.0.10-0.tgz ufm-plugin-pmc_1.19.35.tgz ufm-plugin-cablevalidation_1.7.1-4_x86_64.tgz ufm-plugin-ib-link-resiliency_1.1.5-7.x86_64.tgz

Next we can pre-load the plugin images into Docker. Here I am loading all of the plugins, but you may want to load only the ones needed for your environment. Note that plugins can also be loaded via the UFM web interface once the services are up and running.

# docker load -i ufm-plugin-clusterminder_1.1.14-1293.amd64.tar
Loaded image: mellanox/ufm-plugin-clusterminder:1.1.14-1293
# docker load -i ufm-plugin-utm_1.23.1-38321085.x86_64.tar
Loaded image: harbor.mellanox.com/collectx/gitlab/utm/x86_64/ufm-plugin-utm:1.23.1-38321085
# docker load -i ufm-plugin-tfs_1.1.2-0.tar
Loaded image: mellanox/ufm-plugin-tfs:1.1.2-0
# docker load -i ufm-plugin-ib-link-resiliency_1.1.5-7.x86_64.tar
Loaded image: mellanox/ufm-plugin-ib-link-resiliency:1.1.5-7
# docker load -i ufm-plugin-gnmi_telemetry_1.3.8-5.tar
Loaded image: mellanox/ufm-plugin-gnmi_telemetry:1.3.8-5
# docker load -i ufm-plugin-ndt_1.1.1-25
Loaded image: mellanox/ufm-plugin-ndt:1.1.1-25
# docker load -i ufm-plugin-kpi_1.0.10-0.tar
Loaded image: mellanox/ufm-plugin-kpi:1.0.10-0
# docker load -i ufm-plugin-pmc_1.19.35.tar
Loaded image: harbor.mellanox.com/collectx/gitlab/x86_64/ufm-plugin-pmc:1.19.35
# docker load -i ufm-plugin-cablevalidation_1.7.1-4_x86_64.tar
Loaded image: mellanox/ufm-plugin-cablevalidation:1.7.1-4
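
If you are loading everything anyway, the individual docker load commands above can be collapsed into a loop. This is just a convenience sketch and assumes all of the decompressed plugin archives match the ufm-plugin-* pattern, as they do in the listing above:

# Load every decompressed plugin image in one pass
for image in ufm-plugin-*; do
  docker load -i "$image"
done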

This completes all the pre-configuration activities.

Start UFM Services

Now we can finally start the UFM services with the following.

# systemctl start ufm-enterprise.service

Optionally, we can enable the service so it starts automatically when the host comes back up from a reboot.

# systemctl enable ufm-enterprise.service

Finally let's check the status of the services.

# systemctl status ufm-enterprise.service
● ufm-enterprise.service - UFM Enterprise
     Loaded: loaded (/usr/lib/systemd/system/ufm-enterprise.service; disabled; preset: disabled)
     Active: active (exited) since Fri 2025-11-21 16:09:12 EST; 8s ago
    Process: 14655 ExecStart=/etc/init.d/ufmd start (code=exited, status=0/SUCCESS)
   Main PID: 14655 (code=exited, status=0/SUCCESS)
      Tasks: 588 (limit: 1643822)
     Memory: 548.3M (peak: 571.2M)
        CPU: 7.555s
     CGroup: /system.slice/ufm-enterprise.service
             ├─15131 /opt/ufm/opensm/sbin/opensm --config /opt/ufm/files/conf/opensm/opensm.conf
             ├─15138 osm_crashd
             ├─15625 /opt/ufm/sharp2/bin/sharp_am -O /opt/ufm/files/conf/sharp/sharp_am.cfg
             ├─15884 /opt/ufm/telemetry/venv3/bin/python3 /opt/ufm/telemetry/venv3/bin/supervisord --config=/opt/ufm/files/conf/telemetry/supervisord.conf
             ├─16122 /opt/ufm/telemetry/venv3/bin/python3 /opt/ufm/telemetry/venv3/bin/supervisord --config=/opt/ufm/files/conf/secondary_telemetry/supervisord.conf
             ├─16147 /opt/ufm/telemetry/bin/launch_ibdiagnet --config /opt/ufm/files/conf/telemetry/launch_ibdiagnet_config.ini
             ├─16148 /opt/ufm/telemetry/bin/watcher --config /opt/ufm/files/conf/telemetry/launch_ibdiagnet_config.ini
             ├─16149 /opt/ufm/telemetry/bin/launch_ibdiagnet --config /opt/ufm/files/conf/telemetry/launch_ibdiagnet_config.ini
             ├─16150 /opt/ufm/telemetry/bin/watcher --config /opt/ufm/files/conf/telemetry/launch_ibdiagnet_config.ini
             ├─16151 timeout 10010 /opt/ufm/telemetry/bin/ibdiagnet --long_run_timeout 1000 --long_run_iteration 10000 -o /opt/ufm/files/log -i mlx5_0 --mads_timeout 50 --config_file /opt/ufm/conf/opensm/ibdiag>
             ├─16152 /opt/ufm/telemetry/bin/ibdiagnet --long_run_timeout 1000 --long_run_iteration 10000 -o /opt/ufm/files/log -i mlx5_0 --mads_timeout 50 --config_file /opt/ufm/conf/opensm/ibdiag.conf --key_up>
             ├─16199 /opt/ufm/telemetry/bin/launch_ibdiagnet --config /opt/ufm/files/conf/secondary_telemetry/launch_ibdiagnet_config.ini
             ├─16200 /opt/ufm/telemetry/bin/watcher --config /opt/ufm/files/conf/secondary_telemetry/launch_ibdiagnet_config.ini
             ├─16201 /opt/ufm/telemetry/bin/launch_ibdiagnet --config /opt/ufm/files/conf/secondary_telemetry/launch_ibdiagnet_config.ini
             ├─16202 /opt/ufm/telemetry/bin/watcher --config /opt/ufm/files/conf/secondary_telemetry/launch_ibdiagnet_config.ini
             ├─16206 timeout 12010 /opt/ufm/telemetry/bin/ibdiagnet --long_run_timeout 300000 --long_run_iteration 40 -o /opt/ufm/files/log/secondary_telemetry -i mlx5_0 --pm_pause 0 --config_file /opt/ufm/conf>
             ├─16207 /opt/ufm/telemetry/bin/ibdiagnet --long_run_timeout 300000 --long_run_iteration 40 -o /opt/ufm/files/log/secondary_telemetry -i mlx5_0 --pm_pause 0 --config_file /opt/ufm/conf/opensm/ibdiag>
             ├─16497 /opt/ufm/venv_ufm/bin/python3 -W ignore::DeprecationWarning -O /opt/ufm/gvvm/authentication_server/auth_server_main.pyc
             ├─16780 "/opt/ufm/venv_ufm/bin/python3 -O /opt/ufm/unhealthyports/upcore/unhealthy_ports_main.pyc"
             ├─16864 /opt/ufm/venv_ufm/bin/python3 /opt/ufm/ufmtelemetrysampling/sampling.pyc
             └─17088 /opt/ufm/venv_ufm/bin/python3 /opt/ufm/ufmhealth/UfmHealthRunner.pyc --config_file /opt/ufm/files/conf/UFMHealthConfiguration.xml --second_config_file /opt/ufm/files/conf/UFMInfraHealthConf>

Nov 21 16:09:08 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com su[16712]: pam_unix(su:session): session closed for user ufmapp
Nov 21 16:09:08 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com ufmd[14743]: Starting UFM main module:          [ OK ]
Nov 21 16:09:11 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com ufmd[14743]: Starting UnhealthyPorts:           [ OK ]
Nov 21 16:09:11 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com ufmd[14743]: Starting Telemetry Sampling:       [ OK ]
Nov 21 16:09:11 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com sudo[16898]: root : PWD=/opt/ufm/gvvm/infra ; USER=root ; COMMAND=/sbin/apachectl graceful
Nov 21 16:09:11 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com sudo[16898]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Nov 21 16:09:11 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com sudo[16898]: pam_unix(sudo:session): session closed for user root
Nov 21 16:09:12 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com crontab[17107]: (root) LIST (root)
Nov 21 16:09:12 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com crontab[17100]: (root) REPLACE (root)
Nov 21 16:09:12 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com systemd[1]: Finished UFM Enterprise.

If the service does not start, make sure there is no other subnet manager (SM) acting as master on the fabric. If there is, the service status will show an error like the following.

# systemctl start ufm-enterprise.service
Job for ufm-enterprise.service failed because the control process exited with error code.
See "systemctl status ufm-enterprise.service" and "journalctl -xeu ufm-enterprise.service" for details.
[root@nvd-srv-26 conf]# systemctl status ufm-enterprise.service
× ufm-enterprise.service - UFM Enterprise
     Loaded: loaded (/usr/lib/systemd/system/ufm-enterprise.service; disabled; preset: disabled)
     Active: failed (Result: exit-code) since Fri 2025-11-21 10:34:22 EST; 5s ago
    Process: 14049 ExecStart=/etc/init.d/ufmd start (code=exited, status=1/FAILURE)
   Main PID: 14049 (code=exited, status=1/FAILURE)
        CPU: 380ms

Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com ufmd[14137]: <13>Nov 21 10:34:22 ufm: Validation of UFM configuration files failed!
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com crontab[14218]: (root) LIST (root)
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com crontab[14221]: (root) REPLACE (root)
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com ufm[14238]: Other SM is in the fabric: lid:1, guid:0xfc6a1c0300e7ecc0, priority:15, state:SMINFO_MASTER
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com ufmd[14137]: Other SM is in the fabric: lid:1, guid:0xfc6a1c0300e7ecc0, priority:15, state:SMINFO_MASTER
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com ufm[14241]: Other SM is master in the fabric. Please stop all other SM and start UFM.
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com ufmd[14137]: <13>Nov 21 10:34:22 ufm: Other SM is master in the fabric. Please stop all other SM and start UFM.
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com systemd[1]: ufm-enterprise.service: Main process exited, code=exited, status=1/FAILURE
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com systemd[1]: ufm-enterprise.service: Failed with result 'exit-code'.
Nov 21 10:34:22 nvd-srv-26.nvidia.eng.rdu2.dc.redhat.com systemd[1]: Failed to start UFM Enterprise.
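
A quick way to see which SM currently holds the master role is the sminfo utility from the infiniband-diags package (pulled in earlier as a DOCA dependency); its output should match the GUID reported in the error above.

sminfo   # query the fabric through the local HCA for the current master subnet manager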

We can also check the status of the license on the system, which in this case is an evaluation license.

# ufmlicense
|---------------------------------------------------------------------------------------------------------------------|
| Customer ID | SN         | swName         | Type       | MAC Address | Exp. Date  | Limit | Functionality | Status   |
|---------------------------------------------------------------------------------------------------------------------|
| 986799359   | 1234567899 | UFM Enterprise | Evaluation | NA          | 2025-12-21 | 1024  | Advanced      | Valid    |
|---------------------------------------------------------------------------------------------------------------------|

If all went well, we should be able to log in to the UFM web UI. The default credentials are the admin user with password 123456.
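
Before opening a browser, a simple reachability check from another machine can save a trip. The /ufm path is where the UFM web UI is normally served, but treat that path as an assumption and adjust it if your release notes say otherwise; <ufm-host> is a placeholder for the server's management address.

curl -k -I https://<ufm-host>/ufm   # -k because the default certificate is self-signed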

UFM Overview Web UI Video

The following video gives a brief overview of UFM.  Keep in mind this was my first exposure to UFM.