Friday, July 11, 2025

Real Time at The Edge

 

Edge computing is all the rage now, given that small devices can often provide the performance required to process workloads at a given edge location. However, before migrating applications and workloads from their existing proprietary systems to a more general Linux environment, consumers need to feel confident their workloads will perform just as they did on the legacy systems. After all, some of these systems protect their operators in life-or-death situations.

Workloads at the edge are often mission critical and perform an elegant, orchestrated dance within the confines of their resources. This might mean some processes share the same processor core and are intertwined so that each process gets its guaranteed share of clock cycles while never putting pressure on the other processes on that core, even if one of them runs afoul due to an environmental or code-based problem.

One tool that edge developers can use is called rt-app. If rt-app doesn't sound familiar, it is a testing tool that starts multiple periodic threads in order to simulate a real-time periodic workload. Not only can the sleep-and-run pattern be emulated, but also the dependencies between tasks, such as accessing the same critical resources, creating sequential wake ups, or synchronizing the wake up of threads. The use case is described in a JSON-like file which is processed by rt-app.
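
To make that more concrete, below is a minimal sketch of what such a JSON use case might look like: a single SCHED_FIFO thread pinned to core 5 that runs for 2 ms out of every 10 ms period. This is purely illustrative; the json files in the repository described below are the authoritative examples.

{
    "tasks" : {
        "thread0" : {
            "policy" : "SCHED_FIFO",
            "priority" : 10,
            "cpus" : [5],
            "run" : 2000,
            "timer" : { "ref" : "tick", "period" : 10000 }
        }
    },
    "global" : {
        "duration" : 10,
        "default_policy" : "SCHED_OTHER",
        "calibration" : "CPU0",
        "logdir" : "./",
        "log_basename" : "rt-app"
    }
}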

The rest of this blog will cover an example of testing a group of real-time tasks that run on the same core. The example will show how we can schedule them without them overlapping, along with an example where a broken task interferes with the other tasks on the core. Before I proceed, I want to recognize that this work builds upon the efforts Daniel Bristot de Oliveira of Red Hat built out in the following repository. Daniel was an amazing person to work with and took great strides in explaining things to me that I did not understand. Unfortunately, Daniel passed away a short time after we did this work together over a year ago. I have greatly missed him as a colleague, mentor and friend.

Contents of Repository

The repository for the work described is located here and consists of the following:
  • Dockerfile - To build the container to run the tests
  • entrypoint.sh - The script that runs within the container to kickoff the rt-app workload test
  • run.sh - The script that takes Daniel's work here and collapses it into one script and launches rt-app via containers.
  • basic.json - This is used to compute the CAL (Function Call Interrupt) on a core
  • single.json - Example json
  • template.json - Example json

Build the Container

We can build the container using the files in the repository. This container build process has been tested on both x86_64 and aarch64.

# podman build -f Dockerfile --build-arg ARCH=`uname -i` -t quay.io/bschmaus/rt-app-container:latest
[1/2] STEP 1/7: FROM registry.access.redhat.com/ubi9/ubi-minimal:9.3 AS builder
[1/2] STEP 2/7: RUN echo "builder:x:1001:" >> /etc/group && echo "builder:x:1001:1001:Builder:/home/build:/bin/bash" >> /etc/passwd && install -o builder -g builder -m 0700 -d /home/build
--> Using cache 3a05dd8b2a4da05ef3af9f0ed71ad3033f7f9ecd36c1554a9fc12237f39a41a6
--> 3a05dd8b2a4d
(...)
[2/2] STEP 11/11: ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
[2/2] COMMIT quay.io/bschmaus/rt-app-container:latest
--> c7764c58580b
Successfully tagged quay.io/bschmaus/rt-app-container:latest
c7764c58580b549c18f1a2cf59194e8657620d12289573861640f608b9f0a1fe

Test Framework

We will be doing our testing on a Red Hat Enterprise Linux 9.3 system with a low-latency tuned profile applied.

# uname -a
Linux edge-24.edge.lab.eng.rdu2.redhat.com 5.14.0-362.8.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Oct 3 11:12:36 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/redhat-release
Red Hat Enterprise Linux release 9.3 (Plow)

The first step we need to perform is to install the tuned and tuned-profiles-realtime packages. I should note here that for aarch64 I needed to manually download tuned-profiles-realtime from the Red Hat Portal because, even though the rpm package is noarch, it is only available in the x86_64 repos.

# dnf install tuned tuned-profiles-realtime Updating Subscription Management repositories. Last metadata expiration check: 0:55:11 ago on Tue 23 Apr 2024 01:02:02 PM EDT. Package tuned-2.21.0-1.el9_3.noarch is already installed. Dependencies resolved. ============================================================================================================================================================================================================================================== Package Architecture Version Repository Size ============================================================================================================================================================================================================================================== Installing: tuned-profiles-realtime noarch 2.21.0-1.el9_3 beaker-NFV 15 k Installing dependencies: tuna noarch 0.18-12.el9 beaker-BaseOS 166 k Transaction Summary ============================================================================================================================================================================================================================================== Install 2 Packages Total download size: 182 k Installed size: 590 k Is this ok [y/N]: y Downloading Packages: (1/2): tuned-profiles-realtime-2.21.0-1.el9_3.noarch.rpm 1.7 MB/s | 15 kB 00:00 (2/2): tuna-0.18-12.el9.noarch.rpm 14 MB/s | 166 kB 00:00 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Total 14 MB/s | 182 kB 00:00 Running transaction check Transaction check succeeded. Running transaction test Transaction test succeeded. Running transaction Preparing : 1/1 Installing : tuna-0.18-12.el9.noarch 1/2 Installing : tuned-profiles-realtime-2.21.0-1.el9_3.noarch 2/2 Running scriptlet: tuned-profiles-realtime-2.21.0-1.el9_3.noarch 2/2 Verifying : tuna-0.18-12.el9.noarch 1/2 Verifying : tuned-profiles-realtime-2.21.0-1.el9_3.noarch 2/2 Installed products updated. Installed: tuna-0.18-12.el9.noarch tuned-profiles-realtime-2.21.0-1.el9_3.noarch Complete!

With the tuned profiles installed, let's determine which cores we would like to isolate.

# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 63628 MB
node 0 free: 60714 MB
node distances:
node   0
  0:  10

Since everything is in one NUMA node here, we are just going to isolate cores 4-7 for our testing. To prepare for that we need to edit /etc/tuned/realtime-variables.conf and set the isolated cores. Since the default setting in the file is isolated_cores=${f:calc_isolated_cores:1}, we can use a simple sed to make our change.

# sed -i s/isolated_cores=\${f:calc_isolated_cores:1}/isolated_cores=4-7/g /etc/tuned/realtime-variables.conf
# cat /etc/tuned/realtime-variables.conf|grep ^isolated_cores
isolated_cores=4-7

Now let's set the tuned profile and reboot for the changes to take effect.

# tuned-adm profile realtime
# reboot
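
Once the system is back up, we can optionally confirm the isolation took effect before moving on. A couple of quick checks (given the isolated_cores=4-7 setting above, the first should report 4-7; the second only applies if the profile placed isolcpus on the kernel command line):

# cat /sys/devices/system/cpu/isolated
# grep -o "isolcpus=[^ ]*" /proc/cmdline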

To capture a kernel trace, which we can view with KernelShark, we will need to install trace-cmd.

# dnf install -y trace-cmd Updating Subscription Management repositories. Last metadata expiration check: 1:43:35 ago on Tue 23 Apr 2024 01:02:02 PM EDT. Dependencies resolved. ============================================================================================================================================================================================================================================== Package Architecture Version Repository Size ============================================================================================================================================================================================================================================== Installing: trace-cmd x86_64 2.9.2-10.el9 beaker-BaseOS 233 k Installing dependencies: libtracecmd x86_64 0-10.el9 beaker-BaseOS 100 k libtracefs x86_64 1.3.1-1.el9 beaker-BaseOS 75 k Transaction Summary ============================================================================================================================================================================================================================================== Install 3 Packages Total download size: 408 k Installed size: 893 k Is this ok [y/N]: y Downloading Packages: (1/3): libtracecmd-0-10.el9.x86_64.rpm 6.4 MB/s | 100 kB 00:00 (2/3): libtracefs-1.3.1-1.el9.x86_64.rpm 4.2 MB/s | 75 kB 00:00 (3/3): trace-cmd-2.9.2-10.el9.x86_64.rpm 11 MB/s | 233 kB 00:00 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Total 19 MB/s | 408 kB 00:00 Running transaction check Transaction check succeeded. Running transaction test Transaction test succeeded. Running transaction Preparing : 1/1 Installing : libtracefs-1.3.1-1.el9.x86_64 1/3 Installing : libtracecmd-0-10.el9.x86_64 2/3 Installing : trace-cmd-2.9.2-10.el9.x86_64 3/3 Running scriptlet: trace-cmd-2.9.2-10.el9.x86_64 3/3 Verifying : libtracecmd-0-10.el9.x86_64 1/3 Verifying : libtracefs-1.3.1-1.el9.x86_64 2/3 Verifying : trace-cmd-2.9.2-10.el9.x86_64 3/3 Installed products updated. Installed: libtracecmd-0-10.el9.x86_64 libtracefs-1.3.1-1.el9.x86_64 trace-cmd-2.9.2-10.el9.x86_64 Complete!

Running a Test

After we have built our container and have installed and configured our environment, we can run a test. The run.sh script can perform three different tests, which are selected by the TYPE variable inside the script. Those tests are: single, three and broken. In our test below we set TYPE to three and CPUS to core 5.
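
For reference, the two knobs near the top of run.sh might look something like the following sketch (the variable names come from the description above; the exact layout of the script in the repository may differ):

TYPE=three   # valid values: single, three, broken
CPUS=5       # isolated core that the rt-app containers are pinned to

With those values set, running the test looks like the following: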

# ./run.sh
Enable DEADLINE hrtick...
Allow real-time tasks may use up to 100% of CPU times...
sysctl: setting key "kernel.sched_rt_runtime_us": Device or resource busy
Set preemptive scheduling to full...
Creating log and json directories...
Set variable values for run...
Measure the CAL for core 5...
Build up test json files...
Create and run the pods...
34e09802149a25585d54f7ed2117202b69afa5619c08312e19e190d174a2842b
UN-container
5763bf6938629ef3cb2985a927a7978a64617fe0e937543b2b752b588507f773
DEUX-container
70671f8e1b42a7548f6515399ac8e3a8ad4087285e39e2ffcc049ef5db847df3
TROIS-container
Gather the trace-cmd recording...
CPU0 data recorded at offset=0xaba000
    294912 bytes in size
CPU1 data recorded at offset=0xb02000
    520192 bytes in size
CPU2 data recorded at offset=0xb81000
    360448 bytes in size
CPU3 data recorded at offset=0xbd9000
    303104 bytes in size
CPU4 data recorded at offset=0xc23000
    0 bytes in size
CPU5 data recorded at offset=0xc23000
    39940096 bytes in size
CPU6 data recorded at offset=0x323a000
    0 bytes in size
CPU7 data recorded at offset=0x323a000
    0 bytes in size
Cleanup the pods...
5763bf6938629ef3cb2985a927a7978a64617fe0e937543b2b752b588507f773
34e09802149a25585d54f7ed2117202b69afa5619c08312e19e190d174a2842b
70671f8e1b42a7548f6515399ac8e3a8ad4087285e39e2ffcc049ef5db847df3

Once the test has run, take the trace.dat output, open it in KernelShark, and make sure that the iterations and cycles do not overrun one another.
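
For example, viewing the capture on a workstation could look like the following (a sketch; KernelShark provides the GUI view, while trace-cmd report gives a quick text summary of the same trace.dat):

# kernelshark trace.dat
# trace-cmd report -i trace.dat | head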

NVIDIA GPU Direct Storage on OpenShift

GPU Direct Storage enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU. Using this direct path can relieve system bandwidth bottlenecks and decrease the latency and utilization load on the CPU. GPU Direct Storage can be used with NVMe or even NFS on a Netapp filer, the latter of which this blog will cover.

Workflow

This blog is laid out in the following sections, all of which build on top of one another to reach the goal of successful GPU Direct Storage over NFS.

  • Assumptions
  • Considerations
  • Architecture
  • SRIOV Operator Configuration
  • Netapp VServer Setup
  • Netapp Trident CSI Operator Configuration
  • NVIDIA Network Operator Configuration
  • NVIDIA GPU Operator Configuration
  • GDS Cuda Workload Container

Assumptions

This document assumes that we have already deployed an OpenShift cluster and have installed the necessary operators required for GPU Direct Storage. Those operators are Node Feature Discovery (which should also be configured), the base installation of the NVIDIA Network Operator (no NicClusterPolicy yet), the NVIDIA GPU Operator (no ClusterPolicy yet), the SRIOV Operator (no SRIOV policies or instances) and the Trident CSI Operator (no orchestrators or backends configured yet).
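
As a quick sanity check that those operators are present before proceeding, something like the following can be run (a sketch; the exact CSV names vary by catalog and version):

$ oc get csv -A | grep -Ei 'nfd|sriov|gpu|network-operator|trident'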

Considerations

If any of the nvme devices in the system participate in either the operating system or other services (machine configs for LVMs or other customized access), the nvme kernel modules will not be able to unload properly, even with the workaround defined in this documentation. Any use of GDS requires that the nvme drives are not in use during the deployment of the Network Operator so that the operator can unload the in-tree drivers and load NVIDIA's out-of-tree drivers in their place.
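
A quick way to confirm the nvme drives and modules are idle on a worker before proceeding is shown below (a sketch run from a node debug shell; substitute an actual worker node name and nvme device):

$ oc debug node/<worker-node> -- chroot /host sh -c 'lsblk -o NAME,TYPE,MOUNTPOINT /dev/nvme0n1; lsmod | grep ^nvme'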

Architecture

Below is a diagram of how the environment was architected from a networking perspective.

SRIOV Operator Configuration

For GPU Direct Storage over NFS to make sense from a performance perspective, we will need to use SRIOV. So we first need to configure the SRIOV Operator, assuming it is already installed. The first step is to generate a basic SriovOperatorConfig custom resource file.

$ cat <<EOF > sriov-operator-config.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  enableInjector: true
  enableOperatorWebhook: true
  logLevel: 2
EOF

Next we create the SriovOperatorConfig on the cluster.

$ oc create -f sriov-operator-config.yaml
sriovoperatorconfig.sriovnetwork.openshift.io/default created

Now one key step here is to patch the SriovOperatorConfig so that it is aware of the NVIDIA Network Operator.

$ oc patch sriovoperatorconfig default --type=merge -n openshift-sriov-network-operator --patch '{ "spec": { "configDaemonNodeSelector": { "network.nvidia.com/operator.mofed.wait": "false", "node-role.kubernetes.io/worker": "", "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true" } } }'
sriovoperatorconfig.sriovnetwork.openshift.io/default patched

Now we can move on to generating a SriovNetworkNodePolicy, which defines the interface on which we want to create VFs. In the case of multiple interfaces we would create multiple SriovNetworkNodePolicy files. The example below demonstrates how to configure an interface with an MTU of 9000 and generate 8 VFs.

$ cat <<EOF > sriov-network-node-policy.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-legacy-policy
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 9000
  nicSelector:
    vendor: "15b3"
    pfNames: ["enp55s0np0#0-7"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriovlegacy
EOF

With the SriovNetworkNodePolicy generated we can create it on the cluster which will cause the worker nodes where it is applied to reboot.

$ oc create -f sriov-network-node-policy.yaml
sriovnetworknodepolicy.sriovnetwork.openshift.io/sriov-legacy-policy created

Once the node has rebooted we can optionally open a debug pod on the worker nodes and verify with ip link that the VF interfaces were created, as shown below. If we are ready to move forward we can next generate the SriovNetwork for the resource we created in the SriovNetworkNodePolicy. Again, if we have multiple SriovNetworkNodePolicy files we will also have multiple SriovNetwork files. These define the network space for the VF interfaces. I should note that these networks need to have access to the Netapp data LIF as well in order for RDMA to function. In my example below I excluded the IP addresses in the range 192.168.10.100-110 because my Netapp data LIF will have an IP address in that space.
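
As an aside, the optional ip link check mentioned above might look like the following from a node debug pod (a sketch; the node name is one of the workers in this environment and the PF name comes from the SriovNetworkNodePolicy above; the eight VFs should appear as vf 0 through vf 7 under the PF):

$ oc debug node/nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com -- chroot /host ip link show enp55s0np0

The SriovNetwork definition itself follows.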

$ cat <<EOF > sriov-network.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-network
  namespace: openshift-sriov-network-operator
spec:
  vlan: 0
  networkNamespace: "default"
  resourceName: "sriovlegacy"
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.10.0/24",
      "exclude": [
        "192.168.10.100/30",
        "192.168.10.110/32"
      ]
    }
EOF

Now we can create the SriovNetwork custom resource on the cluster.

$ oc create -f sriov-network.yaml
sriovnetwork.sriovnetwork.openshift.io/sriov-network created

At this point we have configured everything we need for SRIOV and can move on to the next section of the documentation.

Netapp VServer Setup

This section is really just to cover a few items of importance from the Netapp vserver perspective. It does not aim to be a comprehensive guide on how to set up a Netapp MetroCluster or the vservers within it. In our example environment we had a vserver created, and that vserver has two logical interfaces: management and data. With the management interface we can access the vserver and look at a few things. Depending on the environment this may or may not be accessible to the OpenShift administrator; in my case the storage team gave me access. To get on the vserver we can ssh to the vserver IP address, or FQDN if it exists in DNS.

$ ssh trident@10.6.136.110
(trident@10.6.136.110) Password:
Last login time: 5/7/2025 19:31:11

Once we are logged in, I want to confirm that NFSv4 is enabled along with RDMA by using vserver nfs show.

ntap-rdu3-nv01-nvidia::> vserver nfs show

Vserver: ntap-rdu3-nv01-nvidia
General Access: true
v3: enabled
v4.0: enabled
4.1: enabled
UDP: enabled
TCP: enabled
RDMA: enabled
Default Windows User: -
Default Windows Group: -

The above output looks good for my needs when doing GPU Direct Storage. Another item we can check is the export-policies with vserver export-policy show.

ntap-rdu3-nv01-nvidia::> vserver export-policy show
Vserver                 Policy Name
---------------         -------------------
ntap-rdu3-nv01-nvidia   default
ntap-rdu3-nv01-nvidia   trident-8d6b2406-551a-416b-bcce-22626ed60242
2 entries were displayed.

And finally, I wanted to confirm that my data interfaces connected to the NVIDIA high speed switch were indeed operating with jumbo frames. I can see that with the network port show command. Because this is a MetroCluster pair setup, we can see the interfaces on both nodes are set appropriately.

ntap-rdu3-nv01-nvidia::> network port show Node: ntap-rdu3-nv01-a Speed(Mbps) Health Port Broadcast Domain Link MTU Admin/Oper Status --------- ------------ ---------------- ---- ---- ----------- -------- e0M Management up 1500 auto/1000 healthy e1b - down 1500 auto/- - e2a nvidia up 9000 auto/200000 healthy e2b - up 1500 auto/100000 healthy e2b-710 nfs up 1500 -/- healthy e6a - down 1500 auto/- - e6b - down 1500 auto/- - e7b - down 1500 auto/- - e8a - down 1500 auto/- - e8b - down 1500 auto/- - Node: ntap-rdu3-nv01-b Speed(Mbps) Health Port Broadcast Domain Link MTU Admin/Oper Status --------- ------------ ---------------- ---- ---- ----------- -------- e0M Management up 1500 auto/1000 healthy e1b - down 1500 auto/- - e2a nvidia up 9000 auto/200000 healthy e2b - up 1500 auto/100000 healthy e2b-710 nfs up 1500 -/- healthy e6a - down 1500 auto/- - e6b - down 1500 auto/- - e7b - down 1500 auto/- - e8a - down 1500 auto/- - e8b - down 1500 auto/- - 20 entries were displayed.

At this point we can exit out of the vserver and move on to configuring the Netapp Trident CSI operator.
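
Before doing so, jumbo-frame connectivity from a worker to the data LIF can optionally be sanity-checked with a do-not-fragment ping sized for a 9000 MTU (a sketch; 192.168.10.101 is the data LIF used later in the TridentBackendConfig, and this assumes the worker has an address on the storage VLAN; otherwise the same check can be run from a pod attached to the sriov-network):

$ oc debug node/<worker-node> -- chroot /host ping -M do -s 8972 -c 3 192.168.10.101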

Netapp Trident CSI Operator Configuration

Trident is an open-source and fully supported storage orchestrator for containers and Kubernetes distributions, including Red Hat OpenShift. Trident works with the entire NetApp storage portfolio, including the NetApp ONTAP and Element storage systems, and it also supports NFS and iSCSI connections. Trident accelerates the DevOps workflow by allowing end users to provision and manage storage from their NetApp storage systems without requiring intervention from a storage administrator.

We have made the assumption that the Trident Operator and the default Trident Orchestrator have already been deployed. Our next step will be to configure the secret for the Netapp vserver with the credentials so that Trident knows which username and password to use to connect.

$ cat <<EOF > netapp-phy-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: netapp-phy-secret
  namespace: trident
type: Opaque
stringData:
  username: vserver-user
  password: vserver-password
EOF
Once we have our custom resource file generated we can create it on the cluster.
$ oc create -f netapp-phy-secret.yaml
secret/netapp-phy-secret created
Next we need to configure the TridentBackendConfig so that Trident knows how to communicate with the Netapp from both a management and data perspective.  Note the credentials we created are referenced here.
$ cat <<EOF > netapp-phy-tridentbackendconfig.yaml
apiVersion: trident.netapp.io/v1
kind: TridentBackendConfig
metadata:
  name: netapp-phy-nfs-backend
  namespace: trident
spec:
  version: 1
  storageDriverName: ontap-nas-flexgroup
  managementLIF: 10.6.136.110
  dataLIF: 192.168.10.101
  backendName: phy-nfs-backend
  svm: ntap-rdu3-nv01-nvidia
  autoExportPolicy: true
  credentials:
    name: netapp-phy-secret
EOF
With the custom resource file generated we can create it on the cluster.
$ oc create -f netapp-phy-tridentbackendconfig.yaml
tridentbackendconfig.trident.netapp.io/netapp-phy-nfs-backend created
We can validate the backend is there with the following check.
$ oc get tridentbackend -n trident
NAME        BACKEND           BACKEND UUID
tbe-n59xq   phy-nfs-backend   8d6b2406-551a-416b-bcce-22626ed60242
We can also describe the backend as well.
$ oc describe tridentbackend tbe-n59xq -n trident Name: tbe-n59xq Namespace: trident Labels: <none> Annotations: <none> API Version: trident.netapp.io/v1 Backend Name: phy-nfs-backend Backend UUID: 8d6b2406-551a-416b-bcce-22626ed60242 Config: ontap_config: Aggregate: Auto Export CID Rs: 0.0.0.0/0 ::/0 Auto Export Policy: true Backend Name: phy-nfs-backend Backend Pools: eyJzdm1VVUlEIjoiNjE2OTg1YTYtMjlkZi0xMWYwLWI4YzctZDAzOWVhYzA0MDUzIn0= Chap Initiator Secret: Chap Target Initiator Secret: Chap Target Username: Chap Username: Client Certificate: Client Private Key: Clone Split Delay: 10 Credentials: Name: netapp-phy-secret Data LIF: 192.168.10.101 Debug: false Debug Trace Flags: <nil> Defaults: LUKS Encryption: false Adaptive Qos Policy: Encryption: Export Policy: <automatic> File System Type: ext4 Format Options: Mirroring: false Name Template: Qos Policy: Security Style: unix Size: 1G Skip Recovery Queue: false Snapshot Dir: false Snapshot Policy: none Snapshot Reserve: Space Allocation: true Space Reserve: none Split On Clone: false Tiering Policy: Unix Permissions: ---rwxrwxrwx Deny New Volume Pools: false Disable Delete: false Empty Flexvol Deferred Delete Period: Flags: Disaggregated: false Personality: Unified San Optimized: false Flexgroup Aggregate List: Igroup Name: Labels: <nil> Limit Aggregate Usage: Limit Volume Pool Size: Limit Volume Size: Luns Per Flexvol: Management LIF: 10.6.136.110 Nas Type: nfs Nfs Mount Options: Password: secret:netapp-phy-secret Qtree Prune Flexvols Period: Qtree Quota Resize Period: Qtrees Per Flexvol: Region: Replication Policy: Replication Schedule: San Type: iscsi Smb Share: Storage: <nil> Storage Driver Name: ontap-nas-flexgroup Storage Prefix: Supported Topologies: <nil> Svm: ntap-rdu3-nv01-nvidia Trusted CA Certificate: Usage Heartbeat: Use CHAP: false Use REST: <nil> User State: Username: secret:netapp-phy-secret Version: 1 Zone: Config Ref: 9e1ff3f2-8a2d-4efa-859c-712b920d269b Kind: TridentBackend Metadata: Creation Timestamp: 2025-05-07T19:31:56Z Finalizers: trident.netapp.io Generate Name: tbe- Generation: 1 Resource Version: 38713504 UID: 6536970f-b10e-4e04-8a37-8da56deaf69e Online: true State: online User State: normal Version: 1 Events: <none>
We can also use the tridentctl command to validate the backend and confirm it is online.
$ ./trident-installer/tridentctl get backend -n trident
+-----------------+---------------------+--------------------------------------+--------+------------+---------+
|      NAME       |   STORAGE DRIVER    |                 UUID                 | STATE  | USER-STATE | VOLUMES |
+-----------------+---------------------+--------------------------------------+--------+------------+---------+
| phy-nfs-backend | ontap-nas-flexgroup | 8d6b2406-551a-416b-bcce-22626ed60242 | online | normal     |       0 |
+-----------------+---------------------+--------------------------------------+--------+------------+---------+
With the Trident backend configured we can move on to generating a storageclass resource file. Note that while this looks just like a standard Trident NFS storageclass, the proto=rdma mount option is what makes it special.
$ cat <<EOF > netapp-phy-rdma-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: netapp-phy-nfs
provisioner: csi.trident.netapp.io
parameters:
  backendType: "ontap-nas-flexgroup"
mountOptions:
- vers=4.1
- proto=rdma
- max_connect=16
- rsize=262144
- wsize=262144
- write=eager
EOF
Once we have generated the custom resource file we can create it on the cluster.
$ oc create -f netapp-phy-rdma-storageclass.yaml
storageclass.storage.k8s.io/netapp-phy-nfs created
We can validate the storageclass by looking at the storage classes available.
$ oc get sc
NAME             PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
netapp-phy-nfs   csi.trident.netapp.io   Delete          Immediate           false                  4s
Now with the storageclass configured we can generate a persistent volume claim resource file.
$ cat <<EOF > netapp-phy-pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-netapp-phy-test
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 850Gi
  storageClassName: netapp-phy-nfs
EOF
We can take the persistent volume claim resource and create it on the cluster.
$ oc create -f netapp-phy-pvc.yaml
persistentvolumeclaim/pvc-netapp-phy-test created
We can validate the persistent volume claim by checking its status.
$ oc get pvc
NAME                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS     VOLUMEATTRIBUTESCLASS   AGE
pvc-netapp-phy-test   Bound    pvc-ae477c5c-cf10-4bc0-bb71-39d214a237f0   850Gi      RWO            netapp-phy-nfs   <unset>                 45s
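
Optionally, we can also confirm that the RDMA mount options from the storageclass were carried onto the backing persistent volume (a sketch using the claim created above):

$ oc get pv $(oc get pvc pvc-netapp-phy-test -o jsonpath='{.spec.volumeName}') -o jsonpath='{.spec.mountOptions}'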

At this point we have completed the setup of the Trident storage side in preparation for GPU Direct Storage.

NVIDIA Network Operator Configuration

We assume the Network Operator has already been installed on the cluster but the NicClusterPolicy still needs to be created. The following NicClusterPolicy example provides the configuration needed to ensure RDMA is properly loaded for NFS. The key option in this policy is the ENABLE_NFSRDMA variable and having it set to true. I want to note that this policy also optionally has an rdmaSharedDevice and ENTRYPOINT_DEBUG set to true for more verbose logging.

$ cat <<EOF > network-sriovleg-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.01-0.6.0.0-0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
    - name: ENABLE_NFSRDMA
      value: "true"
    - name: ENTRYPOINT_DEBUG
      value: 'true'
EOF

Before creating the NicClusterPolicy on the cluster we need to prepare a script which allows us to work around an issue with GPU Direct Storage in the NVIDIA Network Operator. This script, when run right after creating the NicClusterPolicy, determines which nodes have mofed pods running on them and, based on that node list, will ssh as the core user into each node and unload the following modules: nvme, nvme_tcp, nvme_fabrics, nvme_core. By unloading the modules while the mofed container is busy building the doca drivers, we eliminate an issue where the mofed container fails to load the compiled doca drivers when it goes to install them. One might ask what NVMe has to do with NFS; unfortunately GPU Direct Storage enablement covers both, so we have to work around this issue.

$ cat <<EOF > nvme-fixer.sh
#!/bin/bash
### Set array of modules to be unloaded
declare -a modarr=("nvme" "nvme_tcp" "nvme_fabrics" "nvme_core")
### Determine which hosts have mofed container running on them
declare -a hostarr=(`oc get pods -n nvidia-network-operator -o custom-columns=POD:.metadata.name,NODE:.spec..nodeName --no-headers|grep mofed|awk {'print $2'}`)
### Iterate through modules on each host and unload them
for host in "${hostarr[@]}"
do
  echo "Unloading nvme dependencies on $host..."
  for module in "${modarr[@]}"
  do
    echo "Unloading module $module..."
    ssh core@$host sudo rmmod $module
  done
done
EOF

Change the execute bit on the file.

$ chmod +x nvme-fixer.sh

Now we are ready to create the NicClusterPolicy on the cluster and follow it up by running the nvme-fixer.sh script. If there are any rmmod errors those can safely be ignored, as the module was not loaded to start with. In the example below we had two worker nodes that had mofed pods running on them, so the script went ahead and unloaded the nvme modules.

$ oc create -f network-sriovleg-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created
$ ./nvme-fixer.sh
Unloading nvme dependencies on nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com...
Unloading module nvme...
Unloading module nvme_tcp...
rmmod: ERROR: Module nvme_tcp is not currently loaded
Unloading module nvme_fabrics...
rmmod: ERROR: Module nvme_fabrics is not currently loaded
Unloading module nvme_core...
Unloading nvme dependencies on nvd-srv-29.nvidia.eng.rdu2.dc.redhat.com...
Unloading module nvme...
Unloading module nvme_tcp...
Unloading module nvme_fabrics...
Unloading module nvme_core...

Now we wait for the mofed pod to finish compiling and installing the GPU Direct Storage modules. We will know it is complete when the pods are in a running state like below:

$ oc get pods -n nvidia-network-operator
NAME                                                          READY   STATUS    RESTARTS        AGE
mofed-rhcos4.16-56c9d799bf-ds-bvhmj                           2/2     Running   0               20h
mofed-rhcos4.16-56c9d799bf-ds-jdzxj                           2/2     Running   0               20h
nvidia-network-operator-controller-manager-85b78c49f6-9lchx   1/1     Running   4 (3h26m ago)   3d14h
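
As an optional check, we can also confirm from a node debug shell that the RDMA storage modules shipped by the DOCA driver are now loaded (a sketch; the exact module list can vary by driver version):

$ oc debug node/<worker-node> -- chroot /host sh -c 'lsmod | grep -E "rpcrdma|nvme_rdma|mlx5_core"'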

This completes the NVIDIA Network Operator portion of the configuration for GPU Direct Storage.

NVIDIA GPU Operator Configuration

Now that the NicClusterPolicy is defined and the proper nvme modules have been loaded, we can move on to configuring our GPU ClusterPolicy. The example below is a policy that will enable GPU Direct Storage on the worker nodes that have a suitable NVIDIA GPU.

$ cat <<EOF > gpu-cluster-policy.yaml apiVersion: nvidia.com/v1 kind: ClusterPolicy metadata: name: gpu-cluster-policy spec: vgpuDeviceManager: config: default: default enabled: true migManager: config: default: all-disabled name: default-mig-parted-config enabled: true operator: defaultRuntime: crio initContainer: {} runtimeClass: nvidia use_ocp_driver_toolkit: true dcgm: enabled: true gfd: enabled: true dcgmExporter: config: name: '' enabled: true serviceMonitor: enabled: true cdi: default: false enabled: false driver: licensingConfig: configMapName: '' nlsEnabled: true enabled: true kernelModuleType: open certConfig: name: '' useNvidiaDriverCRD: false kernelModuleConfig: name: '' upgradePolicy: autoUpgrade: true drain: deleteEmptyDir: false enable: false force: false timeoutSeconds: 300 maxParallelUpgrades: 1 maxUnavailable: 25% podDeletion: deleteEmptyDir: false force: false timeoutSeconds: 300 waitForCompletion: timeoutSeconds: 0 repoConfig: configMapName: '' virtualTopology: config: '' devicePlugin: config: default: '' name: '' enabled: true mps: root: /run/nvidia/mps gdrcopy: enabled: true kataManager: config: artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses mig: strategy: single sandboxDevicePlugin: enabled: true validator: plugin: env: - name: WITH_WORKLOAD value: 'false' nodeStatusExporter: enabled: true daemonsets: rollingUpdate: maxUnavailable: '1' updateStrategy: RollingUpdate sandboxWorkloads: defaultWorkload: container enabled: false gds: enabled: true image: nvidia-fs repository: nvcr.io/nvidia/cloud-native version: 2.25.7 vgpuManager: enabled: false vfioManager: enabled: true toolkit: enabled: true installDir: /usr/local/nvidia EOF

Now let's create the policy on the cluster.

$ oc create -f gpu-cluster-policy.yaml
clusterpolicy.nvidia.com/gpu-cluster-policy created

Once the policy is created let's validate the pods are running before we move onto the next step.

$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS      RESTARTS       AGE
gpu-feature-discovery-nttht                           1/1     Running     0              20h
gpu-feature-discovery-r4ktv                           1/1     Running     0              20h
gpu-operator-7d7f694bfb-957mv                         1/1     Running     0              20h
nvidia-container-toolkit-daemonset-h96t6              1/1     Running     0              20h
nvidia-container-toolkit-daemonset-hqtrl              1/1     Running     0              20h
nvidia-cuda-validator-66ml7                           0/1     Completed   0              20h
nvidia-dcgm-exporter-hbk4r                            1/1     Running     0              20h
nvidia-dcgm-exporter-pgh4q                            1/1     Running     0              20h
nvidia-dcgm-nttds                                     1/1     Running     0              20h
nvidia-dcgm-zb4fl                                     1/1     Running     0              20h
nvidia-device-plugin-daemonset-d99md                  1/1     Running     0              20h
nvidia-device-plugin-daemonset-w7tc4                  1/1     Running     0              20h
nvidia-driver-daemonset-416.94.202504151456-0-8bdl5   4/4     Running     26 (20h ago)   2d2h
nvidia-driver-daemonset-416.94.202504151456-0-j8gps   4/4     Running     20 (20h ago)   2d2h
nvidia-node-status-exporter-b22hk                     1/1     Running     4              2d2h
nvidia-node-status-exporter-lwqhb                     1/1     Running     3              2d2h
nvidia-operator-validator-cvqn5                       1/1     Running     0              20h
nvidia-operator-validator-zxrpb                       1/1     Running     0              20h

With the NVIDIA GPU Operator pods running, we can rsh into one of the driver daemonset pods and confirm GDS is enabled by running the lsmod command and by catting the /proc/driver/nvidia-fs/stats file.

$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-416.94.202504151456-0-8bdl5
sh-4.4# lsmod|grep nvidia
nvidia_fs             327680  0
nvidia_modeset       1720320  0
video                  73728  1 nvidia_modeset
nvidia_uvm           4087808  12
nvidia              11665408  36 nvidia_uvm,nvidia_fs,gdrdrv,nvidia_modeset
drm                   741376  5 drm_kms_helper,drm_shmem_helper,nvidia,mgag200
sh-4.4# cat /proc/driver/nvidia-fs/stats
GDS Version: 1.10.0.4
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.20.5)
Mellanox PeerDirect Supported: False
IO stats: Disabled, peer IO stats: Disabled
Logging level: info
Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads        : err=0 io_state_err=0
Sparse Reads : n=0 io=0 holes=0 pages=0
Writes       : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap         : n=0 ok=0 err=0 munmap=0
Bar1-map     : n=0 ok=0 err=0 free=0 callbacks=0 active=0 delay-frees=0
Error        : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops          : Read=0 Write=0 BatchIO=0

If everything looks good we can move onto an additional step to confirm GDS is ready for workload consumption.

GDS Cuda Workload Container

Once the GPU Direct Storage drivers are loaded we can use one additional tool to check and confirm GDS capability. This involves building a container that contains the CUDA packages and then running it on a node.

Now let's generate a service account resource to use in the default namespace.

$ cat <<EOF > nvidiatools-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nvidiatools
  namespace: default
EOF

Next we can create it on our cluster.

$ oc create -f nvidiatools-serviceaccount.yaml
serviceaccount/nvidiatools created

Finally, with the service account created, we can add privileges to it.

$ oc -n default adm policy add-scc-to-user privileged -z nvidiatools
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "nvidiatools"

With the service account defined, we can build out the pod yaml for our workload and create it on the cluster. The following pod yaml defines this configuration.

$ cat <<EOF > nvidiatools-30-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidiatools-30-workload
  namespace: default
  annotations:
    # JSON list is the canonical form; adjust if your NAD lives in another namespace
    k8s.v1.cni.cncf.io/networks: '[{ "name": "sriov-network" }]'
spec:
  serviceAccountName: nvidiatools
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com
  volumes:
  - name: rdma-pv-storage
    persistentVolumeClaim:
      claimName: pvc-netapp-phy-test
  - name: nordma-pv-storage
    persistentVolumeClaim:
      claimName: pvc-netapp-phy-nordma-test
  containers:
  - name: nvidiatools-30-workload
    image: quay.io/redhat_emp1/ecosys-nvidia/nvidia-tools:0.0.3
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
      capabilities:
        add: ["IPC_LOCK"]
    resources:
      limits:
        nvidia.com/gpu: 1
        openshift.io/sriovlegacy: 1
      requests:
        nvidia.com/gpu: 1
        openshift.io/sriovlegacy: 1
    volumeMounts:
    - name: rdma-pv-storage
      mountPath: /nfsfast
    - name: nordma-pv-storage
      mountPath: /nfsslow
EOF
$ oc create -f nvidiatools-30-workload.yaml
nvidiatools-30-workload created
$ oc get pods
NAME                      READY   STATUS    RESTARTS   AGE
nvidiatools-30-workload   1/1     Running   0          3s

Once the pod is up and running we can rsh into the pod and run the gdscheck tool to confirm capabilities and configuration of GPU Direct Storage.

$ oc rsh nvidiatools-30-workload sh-5.1# /usr/local/cuda/gds/tools/gdscheck -p GDS release version: 1.13.1.3 nvidia_fs version: 2.20 libcufile version: 2.12 Platform: x86_64 ============ ENVIRONMENT: ============ ===================== DRIVER CONFIGURATION: ===================== NVMe P2PDMA : Unsupported NVMe : Supported NVMeOF : Supported SCSI : Unsupported ScaleFlux CSD : Unsupported NVMesh : Unsupported DDN EXAScaler : Unsupported IBM Spectrum Scale : Unsupported NFS : Supported BeeGFS : Unsupported WekaFS : Unsupported Userspace RDMA : Unsupported --Mellanox PeerDirect : Disabled --rdma library : Not Loaded (libcufile_rdma.so) --rdma devices : Not configured --rdma_device_status : Up: 0 Down: 0 ===================== CUFILE CONFIGURATION: ===================== properties.use_pci_p2pdma : false properties.use_compat_mode : true properties.force_compat_mode : false properties.gds_rdma_write_support : true properties.use_poll_mode : false properties.poll_mode_max_size_kb : 4 properties.max_batch_io_size : 128 properties.max_batch_io_timeout_msecs : 5 properties.max_direct_io_size_kb : 16384 properties.max_device_cache_size_kb : 131072 properties.max_device_pinned_mem_size_kb : 33554432 properties.posix_pool_slab_size_kb : 4 1024 16384 properties.posix_pool_slab_count : 128 64 64 properties.rdma_peer_affinity_policy : RoundRobin properties.rdma_dynamic_routing : 0 fs.generic.posix_unaligned_writes : false fs.lustre.posix_gds_min_kb: 0 fs.beegfs.posix_gds_min_kb: 0 fs.weka.rdma_write_support: false fs.gpfs.gds_write_support: false fs.gpfs.gds_async_support: true profile.nvtx : false profile.cufile_stats : 0 miscellaneous.api_check_aggressive : false execution.max_io_threads : 4 execution.max_io_queue_depth : 128 execution.parallel_io : true execution.min_io_threshold_size_kb : 8192 execution.max_request_parallelism : 4 properties.force_odirect_mode : false properties.prefer_iouring : false ========= GPU INFO: ========= GPU index 0 NVIDIA L40S bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled ============== PLATFORM INFO: ============== IOMMU: disabled Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed) Cuda Driver Version Installed: 12080 Platform: PowerEdge R760xa, Arch: x86_64(Linux 5.14.0-427.65.1.el9_4.x86_64) Platform verification succeeded

Now let's confirm our GPU Direct Storage NFS mount is mounted. Notice in the output that the proto is rdma.

sh-5.1# mount|grep nfs
192.168.10.101:/trident_pvc_ae477c5c_cf10_4bc0_bb71_39d214a237f0 on /mnt type nfs4 (rw,relatime,vers=4.1,rsize=262144,wsize=262144,namlen=255,hard,proto=rdma,max_connect=16,port=20049,timeo=600,retrans=2,sec=sys,clientaddr=192.168.10.30,local_lock=none,write=eager,addr=192.168.10.101)

Next we can use gdsio to run some benchmarks across the GPU Direct NFS mount. Before we run the benchmarks, let's familiarize ourselves with all the gdsio switches and what they mean.

sh-5.1# /usr/local/cuda-12.8/gds/tools/gdsio -h gdsio version :1.12 Usage [using config file]: gdsio rw-sample.gdsio Usage [using cmd line options]:/usr/local/cuda-12.8/gds/tools/gdsio -f <file name> -D <directory name> -d <gpu_index (refer nvidia-smi)> -n <numa node> -m <memory type(0 - (cudaMalloc), 1 - (cuMem), 2 - (cudaMallocHost), 3 - (malloc) 4 - (mmap))> -w <number of threads for a job> -s <file size(K|M|G)> -o <start offset(K|M|G)> -i <io_size(K|M|G)> <min_size:max_size:step_size> -p <enable nvlinks> -b <skip bufregister> -V <verify IO> -x <xfer_type> [0(GPU_DIRECT), 1(CPU_ONLY), 2(CPU_GPU), 3(CPU_ASYNC_GPU), 4(CPU_CACHED_GPU), 5(GPU_DIRECT_ASYNC), 6(GPU_BATCH), 7(GPU_BATCH_STREAM)] -B <batch size> -I <(read) 0|(write)1| (randread) 2| (randwrite) 3> -T <duration in seconds> -k <random_seed> (number e.g. 3456) to be used with random read/write> -U <use unaligned(4K) random offsets> -R <fill io buffer with random data> -F <refill io buffer with random data during each write> -a <alignment size in case of random IO> -M <mixed_rd_wr_percentage in case of regular batch mode> -P <rdma url> -J <per job statistics> xfer_type: 0 - Storage->GPU (GDS) 1 - Storage->CPU 2 - Storage->CPU->GPU 3 - Storage->CPU->GPU_ASYNC 4 - Storage->PAGE_CACHE->CPU->GPU 5 - Storage->GPU_ASYNC 6 - Storage->GPU_BATCH 7 - Storage->GPU_BATCH_STREAM Note: read test (-I 0) with verify option (-V) should be used with files written (-I 1) with -V option read test (-I 2) with verify option (-V) should be used with files written (-I 3) with -V option, using same random seed (-k), same number of threads(-w), offset(-o), and data size(-s) write test (-I 1/3) with verify option (-V) will perform writes followed by read

Before we begin running tests, I want to note that they are being run on a standard Dell R760xa, and from the nvidia-smi topo output we can see we are dealing with a non-optimal NODE topology, where the connection traverses PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node. Ideally, for performant numbers, we would want to run this on an H100 or B200 system where the GPU and NIC are connected to the same PCIe switch, yielding a PHB, PXB or PIX connection.

sh-5.1# nvidia-smi topo -mp GPU0 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE 0,2,4,6,8,10 0 N/A NIC0 NODE X NODE NODE NODE NODE NODE NODE NODE NODE NODE NIC1 NODE NODE X PIX PIX PIX PIX PIX PIX PIX PIX NIC2 NODE NODE PIX X PIX PIX PIX PIX PIX PIX PIX NIC3 NODE NODE PIX PIX X PIX PIX PIX PIX PIX PIX NIC4 NODE NODE PIX PIX PIX X PIX PIX PIX PIX PIX NIC5 NODE NODE PIX PIX PIX PIX X PIX PIX PIX PIX NIC6 NODE NODE PIX PIX PIX PIX PIX X PIX PIX PIX NIC7 NODE NODE PIX PIX PIX PIX PIX PIX X PIX PIX NIC8 NODE NODE PIX PIX PIX PIX PIX PIX PIX X PIX NIC9 NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX X Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NIC Legend: NIC0: mlx5_0 NIC1: mlx5_1 NIC2: mlx5_2 NIC3: mlx5_3 NIC4: mlx5_4 NIC5: mlx5_5 NIC6: mlx5_6 NIC7: mlx5_7 NIC8: mlx5_8 NIC9: mlx5_9

Now let's run a few gdsio tests across our RDMA NFS mount. Please note these runs were not performance tuned in any way; this is merely a demonstration of the feature's functionality.

In this first example, gdsio is used to generate a random write load of small IOs (4k) to one of the NFS mount points.

sh-5.1# /usr/local/cuda-12.8/gds/tools/gdsio -D /nfsfast -d 0 -w 32 -s 500M -i 4K -x 0 -I 3 -T 120
IoType: RANDWRITE XferType: GPUD Threads: 32 DataSetSize: 43222136/16384000(KiB) IOSize: 4(KiB) Throughput: 0.344940 GiB/sec, Avg_Latency: 352.314946 usecs ops: 10805534 total_time 119.498576 secs

Next we will repeat the same test but for random reads.

sh-5.1# /usr/local/cuda-12.8/gds/tools/gdsio -D /nfsfast -d 0 -w 32 -s 500M -i 4K -x 0 -I 2 -T 120
IoType: RANDREAD XferType: GPUD Threads: 32 DataSetSize: 71313540/16384000(KiB) IOSize: 4(KiB) Throughput: 0.569229 GiB/sec, Avg_Latency: 214.448246 usecs ops: 17828385 total_time 119.477201 secs

Small and random IOs are all about IOPS and latency. For our next test we will measure throughput, using larger file sizes and much larger IO sizes.

sh-5.1# /usr/local/cuda-12.8/gds/tools/gdsio -D /nfsfast -d 0 -w 32 -s 1G -i 1M -x 0 -I 1 -T 120
IoType: WRITE XferType: GPUD Threads: 32 DataSetSize: 320301056/33554432(KiB) IOSize: 1024(KiB) Throughput: 2.547637 GiB/sec, Avg_Latency: 12487.658159 usecs ops: 312794 total_time 119.900455 secs
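
For completeness, the matching large-block sequential read test uses -I 0; a sketch of the command is below (no results are shown here, and as noted above none of these runs were tuned):

sh-5.1# /usr/local/cuda-12.8/gds/tools/gdsio -D /nfsfast -d 0 -w 32 -s 1G -i 1M -x 0 -I 0 -T 120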

This concludes the workflow of configuring and testing GPU Direct Storage on OpenShift over an RDMA NFS mount.