Tuesday, February 24, 2026

OpenShift Network Card Rail Mapping

The goal of this writeup is to provide a simple mechanism for mapping which GPUs are associated with which NICs on the same PCIe switch inside a physical system. The mapping can then be used to generate an OpenShift MachineConfig that identifies one network card per GPU on the same PCIe root complex, persistently names that device rail<N> (for example eth_rail0), and marks any other NICs behind the same switch as secondary. This is aimed primarily at NVIDIA's Spectrum-X stack but could be used on any platform where GPU-to-NIC locality matters for OpenShift configuration.

Why?

For optimal cluster performance and minimal latency, it’s essential to align each GPU with its nearest high-speed network card, ideally on the same NUMA node and PCIe root complex. This ensures that data traveling to and from each GPU takes the shortest, most efficient path, which is especially critical for GPUDirect RDMA and high-throughput AI/HPC workloads.

While there are tools that can provide pieces of this view, each command has to be run manually, and then it's up to the user to fit the results together. Ideally, one solution would provide all the details in a concise manner.

Hwloc

The Portable Hardware Locality (hwloc) software package provides a portable abstraction of the hierarchical topology of modern architectures, including NUMA memory nodes (DRAM, HBM, non-volatile memory, CXL, etc.), processor packages, shared caches, cores and simultaneous multithreading. It also gathers various system attributes such as cache and memory information as well as the locality of I/O devices such as network interfaces, InfiniBand HCAs or GPUs.  A sample image that it can generate is shown below.

Hwloc primarily aims at helping applications with gathering information about increasingly complex parallel computing platforms so as to exploit them accordingly and efficiently. For instance, two tasks that tightly cooperate should probably be placed onto cores sharing a cache. However, two independent memory-intensive tasks should better be spread out onto different processor packages so as to maximize their memory throughput.

However, Hwloc does not ship in OpenShift today. Furthermore, it does not generate udev rules or MachineConfigs, and it seems heavy-handed for the task at hand.
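The raw information is available from sysfs without any extra packages. As a rough illustration (not the script's actual implementation), the sketch below lists PCI devices that match a vendor:device ID by reading the vendor and device files under /sys/bus/pci/devices. The function name find_pci_devices and the sysfs_root parameter are assumptions made for this example.

```python
from pathlib import Path


def find_pci_devices(device_id, sysfs_root="/sys/bus/pci/devices"):
    """Return sorted PCI addresses whose vendor:device ID matches device_id.

    device_id uses the lspci-style 'vvvv:dddd' form, e.g. '10de:2335'.
    sysfs_root is a parameter only so the function can be tested against
    a fake directory tree; it is an assumption of this sketch.
    """
    vendor, device = device_id.lower().split(":")
    matches = []
    for dev in Path(sysfs_root).iterdir():
        try:
            dev_vendor = (dev / "vendor").read_text().strip().lower()
            dev_device = (dev / "device").read_text().strip().lower()
        except OSError:
            continue  # entry without readable ID files; skip it
        # sysfs stores the IDs as hex strings with a 0x prefix, e.g. "0x10de"
        if dev_vendor == f"0x{vendor}" and dev_device == f"0x{device}":
            matches.append(dev.name)
    return sorted(matches)
```

On a real node, find_pci_devices("10de:2335") would return the addresses of the H200 GPUs in the domain-qualified form that sysfs uses, such as 0000:1b:00.0.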

Rail Mappings

The gpu-nic-rail-mapping script aims to provide a simple example that identifies the GPU-to-NIC relationship and then generates the MachineConfig for OpenShift so that exactly one NIC per GPU is marked as a rail. Below is an example run on a Dell 9680 (H200) system with the following devices in it:

  • 8 x H200 GPUs - Device ID 10de:2335
  • 14 x BF3 Cards - Device ID 15b3:a2dc
sh-5.1# ./gpu-nic-rail-mapping -g 10de:2335 -n 15b3:a2dc -u 70-persistent-net.rules -r worker
GPU BusAddr  NIC BusAddr  PCIe Switch      NIC Slot  NIC Port  UDEV Eth   UDEV IB
====================================================================================================
1b:00.0      18:00.0      15:01.0/16:00.0  40        1         eth_rail0  roce_rail0
1b:00.0      1a:00.0      15:01.0/16:00.0  42        1         eth_sec0   roce_sec0
3c:00.0      3a:00.0      37:01.0/38:00.0  41        1         eth_rail1  roce_rail1
4b:00.0      4d:00.0      48:01.0/49:00.0  38        1         eth_rail2  roce_rail2
5c:00.0      5d:00.0      59:01.0/5a:00.0  37        1         eth_rail3  roce_rail3
5c:00.0      5f:00.0      59:01.0/5a:00.0  39        1         eth_sec1   roce_sec1
5c:00.0      5f:00.1      59:01.0/5a:00.0  39        2         eth_sec2   roce_sec2
9a:00.0      9b:00.0      97:01.0/98:00.0  32        1         eth_rail4  roce_rail4
bb:00.0      ba:00.0      b7:01.0/b8:00.0  31        1         eth_rail5  roce_rail5
bb:00.0      bc:00.0      b7:01.0/b8:00.0  33        1         eth_sec3   roce_sec3
bb:00.0      bc:00.1      b7:01.0/b8:00.0  33        2         eth_sec4   roce_sec4
cd:00.0      ca:00.0      c7:01.0/c8:00.0  36        1         eth_rail6  roce_rail6
cd:00.0      cc:00.0      c7:01.0/c8:00.0  34        1         eth_sec5   roce_sec5
dc:00.0      db:00.0      d7:01.0/d8:00.0  35        1         eth_rail7  roce_rail7
Generated 99-machine-config-udev-network.yaml file for OpenShift
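Judging from the output above, the first NIC listed for each GPU becomes a rail and every further NIC behind the same switch becomes a secondary, with both counters running across the whole system. The sketch below reproduces that naming pattern under exactly that assumption; the real script may choose the rail NIC differently, and the function name assign_names is hypothetical.

```python
def assign_names(pairs):
    """Assign rail/secondary names to (gpu_addr, nic_addr) pairs.

    pairs must already be in the order in which they should be numbered.
    Assumed rule (inferred from the sample output, not from the script):
    the first NIC seen for a GPU is its rail, any later NIC is secondary.
    """
    seen_gpus = set()
    rail = sec = 0
    named = []
    for gpu, nic in pairs:
        if gpu not in seen_gpus:
            seen_gpus.add(gpu)
            suffix = f"rail{rail}"
            rail += 1
        else:
            suffix = f"sec{sec}"
            sec += 1
        named.append((gpu, nic, f"eth_{suffix}", f"roce_{suffix}"))
    return named


if __name__ == "__main__":
    # First rows of the H200 run shown above
    rows = [("1b:00.0", "18:00.0"), ("1b:00.0", "1a:00.0"), ("3c:00.0", "3a:00.0")]
    for row in assign_names(rows):
        print(*row)
```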

Here is the 70-persistent-net.rules file that was generated.

sh-5.1# cat 70-persistent-net.rules
ACTION=="add", KERNELS=="0000:18:00.0", SUBSYSTEM=="net", NAME="eth_rail0"
ACTION=="add", KERNELS=="0000:18:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail0"
ACTION=="add", KERNELS=="0000:1a:00.0", SUBSYSTEM=="net", NAME="eth_sec0"
ACTION=="add", KERNELS=="0000:1a:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec0"
ACTION=="add", KERNELS=="0000:3a:00.0", SUBSYSTEM=="net", NAME="eth_rail1"
ACTION=="add", KERNELS=="0000:3a:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail1"
ACTION=="add", KERNELS=="0000:4d:00.0", SUBSYSTEM=="net", NAME="eth_rail2"
ACTION=="add", KERNELS=="0000:4d:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail2"
ACTION=="add", KERNELS=="0000:5d:00.0", SUBSYSTEM=="net", NAME="eth_rail3"
ACTION=="add", KERNELS=="0000:5d:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail3"
ACTION=="add", KERNELS=="0000:5f:00.0", SUBSYSTEM=="net", NAME="eth_sec1"
ACTION=="add", KERNELS=="0000:5f:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec1"
ACTION=="add", KERNELS=="0000:5f:00.1", SUBSYSTEM=="net", NAME="eth_sec2"
ACTION=="add", KERNELS=="0000:5f:00.1", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec2"
ACTION=="add", KERNELS=="0000:9b:00.0", SUBSYSTEM=="net", NAME="eth_rail4"
ACTION=="add", KERNELS=="0000:9b:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail4"
ACTION=="add", KERNELS=="0000:ba:00.0", SUBSYSTEM=="net", NAME="eth_rail5"
ACTION=="add", KERNELS=="0000:ba:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail5"
ACTION=="add", KERNELS=="0000:bc:00.0", SUBSYSTEM=="net", NAME="eth_sec3"
ACTION=="add", KERNELS=="0000:bc:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec3"
ACTION=="add", KERNELS=="0000:bc:00.1", SUBSYSTEM=="net", NAME="eth_sec4"
ACTION=="add", KERNELS=="0000:bc:00.1", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec4"
ACTION=="add", KERNELS=="0000:ca:00.0", SUBSYSTEM=="net", NAME="eth_rail6"
ACTION=="add", KERNELS=="0000:ca:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail6"
ACTION=="add", KERNELS=="0000:cc:00.0", SUBSYSTEM=="net", NAME="eth_sec5"
ACTION=="add", KERNELS=="0000:cc:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_sec5"
ACTION=="add", KERNELS=="0000:db:00.0", SUBSYSTEM=="net", NAME="eth_rail7"
ACTION=="add", KERNELS=="0000:db:00.0", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_FIXED roce_rail7"
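Each NIC port in the file above gets two rules: one that renames the network interface and one that renames the matching RDMA device through rdma_rename. A minimal sketch of how such a file could be rendered from the mapping follows; the function name udev_rules and the tuple layout are assumptions for illustration, while the rule text mirrors the generated file.

```python
def udev_rules(nic_names):
    """Render udev rules for an iterable of (pci_address, eth_name, rdma_name).

    pci_address is the domain-qualified form used by sysfs, e.g. '0000:18:00.0'.
    The rule format is copied from the generated 70-persistent-net.rules above.
    """
    lines = []
    for addr, eth, rdma in nic_names:
        # Rename the Ethernet netdev that sits on this PCI function
        lines.append(
            f'ACTION=="add", KERNELS=="{addr}", SUBSYSTEM=="net", NAME="{eth}"'
        )
        # Rename the RDMA device on the same PCI function
        lines.append(
            f'ACTION=="add", KERNELS=="{addr}", SUBSYSTEM=="infiniband", '
            f'PROGRAM="rdma_rename %k NAME_FIXED {rdma}"'
        )
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    print(udev_rules([("0000:18:00.0", "eth_rail0", "roce_rail0")]), end="")
```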

And finally the OpenShift MachineConfig 99-machine-config-udev-network.yaml for the udev rule naming.

sh-5.1# cat 99-machine-config-udev-network.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-machine-config-udev-network
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,QUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6MTg6MDAuMCIsIFNVQlNZU1RFTT09Im5ldCIsIE5BTUU9ImV0aF9yYWlsMCIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6MTg6MDAuMCIsIFNVQlNZU1RFTT09ImluZmluaWJhbmQiLCBQUk9HUkFNPSJyZG1hX3JlbmFtZSAlayBOQU1FX0ZJWEVEIHJvY2VfcmFpbDAiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjFhOjAwLjAiLCBTVUJTWVNURU09PSJuZXQiLCBOQU1FPSJldGhfc2VjMCIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6MWE6MDAuMCIsIFNVQlNZU1RFTT09ImluZmluaWJhbmQiLCBQUk9HUkFNPSJyZG1hX3JlbmFtZSAlayBOQU1FX0ZJWEVEIHJvY2Vfc2VjMCIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6M2E6MDAuMCIsIFNVQlNZU1RFTT09Im5ldCIsIE5BTUU9ImV0aF9yYWlsMSIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6M2E6MDAuMCIsIFNVQlNZU1RFTT09ImluZmluaWJhbmQiLCBQUk9HUkFNPSJyZG1hX3JlbmFtZSAlayBOQU1FX0ZJWEVEIHJvY2VfcmFpbDEiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjRkOjAwLjAiLCBTVUJTWVNURU09PSJuZXQiLCBOQU1FPSJldGhfcmFpbDIiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjRkOjAwLjAiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3JhaWwyIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDo1ZDowMC4wIiwgU1VCU1lTVEVNPT0ibmV0IiwgTkFNRT0iZXRoX3JhaWwzIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDo1ZDowMC4wIiwgU1VCU1lTVEVNPT0iaW5maW5pYmFuZCIsIFBST0dSQU09InJkbWFfcmVuYW1lICVrIE5BTUVfRklYRUQgcm9jZV9yYWlsMyIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6NWY6MDAuMCIsIFNVQlNZU1RFTT09Im5ldCIsIE5BTUU9ImV0aF9zZWMxIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDo1ZjowMC4wIiwgU1VCU1lTVEVNPT0iaW5maW5pYmFuZCIsIFBST0dSQU09InJkbWFfcmVuYW1lICVrIE5BTUVfRklYRUQgcm9jZV9zZWMxIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDo1ZjowMC4xIiwgU1VCU1lTVEVNPT0ibmV0IiwgTkFNRT0iZXRoX3NlYzIiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjVmOjAwLjEiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3NlYzIiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjliOjAwLjAiLCBTVUJTWVNURU09PSJuZXQiLCBOQU1FPSJldGhfcmFpbDQiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOjliOjAwLjAiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3JhaWw0IgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDpiYTowMC4wIiwgU1VCU1lTVEVNPT0ibmV0IiwgTkFNRT0iZXRoX3JhaWw1IgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDpiYTowMC4wIiwgU1VCU1lTVEVNPT0iaW5maW5pYmFuZCIsIFBST0dSQU09InJkbWFfcmVuYW1lICVrIE5BTUVfRklYRUQgcm9jZV9yYWlsNSIKQUNUSU9OPT0iYWRkIiwgS0VSTkVMUz09IjAwMDA6YmM6MDAuMCIsIFNVQlNZU1RFTT09Im5ldCIsIE5BTUU9ImV0aF9zZWMzIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDpiYzowMC4wIiwgU1VCU1lTVEVNPT0iaW5maW5pYmFuZCIsIFBST0dSQU09InJkbWFfcmVuYW1lICVrIE5BTUVfRklYRUQgcm9jZV9zZWMzIgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDpiYzowMC4xIiwgU1VCU1lTVEVNPT0ibmV0IiwgTkFNRT0iZXRoX3NlYzQiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmJjOjAwLjEiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3NlYzQiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmNhOjAwLjAiLCBTVUJTWVNURU09PSJuZXQiLCBOQU1FPSJldGhfcmFpbDYiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmNhOjAwLjAiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3JhaWw2IgpBQ1RJT049PSJhZGQiLCBLRVJORUxTPT0iMDAwMDpjYzowMC4wIiwgU1VCU1lTVEVNPT0ibmV0IiwgTkFNRT0iZXRoX3NlYzUiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmNjOjAwLjAiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3NlYzUiCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmRiOjAwLjAiLCBTVUJTWVNURU09PSJuZXQiLCBOQU1FPSJldGhfcmFpbDciCkFDVElPTj09ImFkZCIsIEtFUk5FTFM9PSIwMDAwOmRiOjAwLjAiLCBTVUJTWVNURU09PSJpbmZpbmliYW5kIiwgUFJPR1JBTT0icmRtYV9yZW5hbWUgJWsgTkFNRV9GSVhFRCByb2NlX3JhaWw3Igo=
        filesystem: root
        mode: 420
        path: /etc/udev/rules.d/70-persistent-net.rules
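The source field in the MachineConfig is simply the udev rules file, base64-encoded and wrapped in a data URL, and mode 420 is the decimal form of octal 0644. The sketch below shows one way such a document could be assembled. It mirrors the fields of the output above, including filesystem: root, but the function name machine_config and its parameters are assumptions, and it emits JSON rather than YAML to stay within the standard library (JSON is also valid YAML).

```python
import base64
import json


def machine_config(rules_text, role="worker",
                   name="99-machine-config-udev-network",
                   path="/etc/udev/rules.d/70-persistent-net.rules"):
    """Wrap a udev rules file in a MachineConfig-shaped dictionary.

    Field names and values are copied from the generated YAML shown above;
    this is an illustrative sketch, not the script's own code.
    """
    encoded = base64.b64encode(rules_text.encode("utf-8")).decode("ascii")
    file_entry = {
        "contents": {"source": "data:text/plain;charset=utf-8;base64," + encoded},
        "filesystem": "root",
        "mode": 420,  # decimal for octal 0644
        "path": path,
    }
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            "labels": {"machineconfiguration.openshift.io/role": role},
            "name": name,
        },
        "spec": {
            "config": {
                "ignition": {"version": "3.2.0"},
                "storage": {"files": [file_entry]},
            }
        },
    }


if __name__ == "__main__":
    rules = 'ACTION=="add", KERNELS=="0000:18:00.0", SUBSYSTEM=="net", NAME="eth_rail0"\n'
    print(json.dumps(machine_config(rules), indent=2))
```

Decoding the base64 payload of a generated file is a quick way to confirm that the rules on the node will match the table the script printed.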

The above MachineConfig can now be applied to the worker nodes of an OpenShift cluster of homogeneous nodes to persistently name the rail devices mapped to the corresponding GPUs.

In this next example we tried the script on a Supermicro AMD Instinct system which had the following devices in it:

  • 8 x MI325X - Device ID 1002:74a5
  • 7 x AMD Pensando Systems POLLARA-1Q400 100/200/400G 1-port Card - Device ID 1dd8:1002
  • 1 x NVIDIA ConnectX-7 - Device ID 15b3:1021

This system was interesting because it had multiple network card types associated with GPUs, which allowed us to test the script's behavior in that scenario. One caveat on this system was that dmidecode and lspci both failed to show the physical slot number for the Pollara cards (hence the NA entries in the NIC Slot column), while the CX7 card showed its physical slot just fine.

# ./gpu-nic-rail-mapping -g 1002:74a5 -n 1dd8:1002,15b3:1021 -u 70-persistent-net.rules -r worker
GPU BusAddr  NIC BusAddr  PCIe Switch      NIC Slot  NIC Port  UDEV Eth   UDEV IB
====================================================================================================
05:00.0      09:00.0      00:01.1/01:00.0  NA        1         eth_rail0  roce_rail0
15:00.0      19:00.0      10:01.1/11:00.0  NA        1         eth_rail1  roce_rail1
65:00.0      69:00.0      60:01.1/61:00.0  NA        1         eth_rail2  roce_rail2
75:00.0      79:00.0      70:01.1/71:00.0  NA        1         eth_rail3  roce_rail3
85:00.0      89:00.0      80:01.1/81:00.0  NA        1         eth_rail4  roce_rail4
95:00.0      99:00.0      90:01.1/91:00.0  NA        1         eth_rail5  roce_rail5
e5:00.0      e6:00.0      e0:01.1/e1:00.0  1         1         eth_rail6  roce_rail6
f5:00.0      f9:00.0      f0:01.1/f1:00.0  NA        1         eth_rail7  roce_rail7
Generated 99-machine-config-udev-network.yaml file for OpenShift
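The PCIe Switch column in both runs appears to list the upstream bridges that a GPU and its NIC have in common. One way to derive that, sketched below, is to resolve each device's sysfs path and keep the leading PCI hops that both paths share. The helper names and the intermediate hops in the example paths (0000:02:00.0 and 0000:02:01.0) are hypothetical; only the endpoints and the 00:01.1/01:00.0 pair come from the table above.

```python
def pci_chain(sysfs_path):
    """Return the PCI addresses along a resolved sysfs device path.

    Components such as 'pci0000:00' (the host bridge) are skipped because
    they are not in the dddd:bb:dd.f address form.
    """
    return [p for p in sysfs_path.split("/") if p.count(":") == 2 and "." in p]


def shared_upstream(gpu_path, nic_path):
    """Return the upstream bridges common to both devices, root side first."""
    gpu_chain = pci_chain(gpu_path)[:-1]  # drop the GPU itself
    nic_chain = pci_chain(nic_path)[:-1]  # drop the NIC itself
    shared = []
    for gpu_hop, nic_hop in zip(gpu_chain, nic_chain):
        if gpu_hop != nic_hop:
            break
        shared.append(gpu_hop)
    return shared


def short_form(chain):
    """Format a chain like the table does, without the PCI domain prefix."""
    return "/".join(addr.split(":", 1)[1] for addr in chain)


if __name__ == "__main__":
    # Hypothetical resolved paths; on a real node they would come from
    # os.path.realpath("/sys/bus/pci/devices/<address>").
    gpu = "/sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:05:00.0"
    nic = "/sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:01.0/0000:09:00.0"
    print(short_form(shared_upstream(gpu, nic)))
```

A GPU and NIC that share only a root port, or nothing at all, would yield a shorter or empty chain, which is one possible signal that they are not on the same switch.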

A 70-persistent-net.rules file and a 99-machine-config-udev-network.yaml MachineConfig were generated here as well, but they look very much like those in the H200 example.

The overall idea here was to automate an otherwise tedious task: identifying and mapping the GPUs and network devices that share a PCIe root complex. Hopefully this provided a simple example of how to accomplish that task. For those interested in seeing the script, the repository is here.