Monday, January 06, 2025

RDMA+CUDA with NVIDIA on OpenShift

In a previous blog I described how to configure an OpenShift cluster with RDMA when using the NVIDIA Network Operator and NVIDIA GPU Operator. However, in that blog we only did simple RDMA testing across the network interfaces, with no involvement of the GPU. In this blog I will extend the testing so that it involves the GPU and the CUDA libraries. Keep in mind, though, that this testing only validates that the configuration is set up correctly and should not replace real-world workload testing of an application.

In this example we are using the same versions of OpenShift and the operators as in the previous blog so I will not go into those details here.  What we will capture below is how to configure the container appropriately to do the RDMA+CUDA testing.

The first thing we need to do is create a ServiceAccount in the default namespace. We can do so by generating the custom resource file below and creating it on the cluster.

$ cat <<EOF > default-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rdma
  namespace: default
EOF

$ oc create -f default-serviceaccount.yaml
serviceaccount/rdma created

Now that the rdma account is created let's give it privileged access.

$ oc -n default adm policy add-scc-to-user privileged -z rdma
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "rdma"
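
One way to double-check the grant is to ask the API server, impersonating the service account, whether it may use the privileged SCC. This is only a sketch: the helper name `can_use_privileged` is hypothetical, and it assumes your own user is allowed to impersonate service accounts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original walkthrough): ask the API
# server whether a service account may "use" the privileged SCC.
# Prints the answer from "oc auth can-i" (yes or no).
#   $1 = namespace, $2 = service account name
can_use_privileged() {
  local namespace="$1" account="$2"
  oc auth can-i use securitycontextconstraints/privileged \
    --as="system:serviceaccount:${namespace}:${account}" \
    -n "${namespace}"
}

# Usage on a live cluster:
#   can_use_privileged default rdma
```

If the answer is "no", the pods created later would most likely be rejected at admission because they request `privileged: true`.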

Next we will generate two pod custom resource files to run our workload pod image on the two bare-metal A100 nodes in our environment.

$ cat <<EOF > rdma-eth-a100-01-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-a100-01-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: a100-1.private.openshiftvcn.schmaustech.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-eth-a100-01-workload
    command:
    - sh
    - -c
    - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF

$ cat <<EOF > rdma-eth-a100-02-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-a100-02-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: a100-2.private.openshiftvcn.schmaustech.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-eth-a100-02-workload
    command:
    - sh
    - -c
    - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF
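
The two pod files differ only in the pod name and the node hostname, so they could also be rendered from one template. A minimal sketch, assuming a hypothetical helper named `render_workload_pod`; the manifest body is the same as the files above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a workload pod manifest for one node.
#   $1 = pod name (also used as the container name)
#   $2 = node hostname for the nodeSelector
render_workload_pod() {
  local pod_name="$1" node_name="$2"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${pod_name}
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: ${node_name}
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: ${pod_name}
    command:
    - sh
    - -c
    - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF
}

# Usage: write one manifest per node.
#   render_workload_pod rdma-eth-a100-01-workload \
#     a100-1.private.openshiftvcn.schmaustech.com > rdma-eth-a100-01-workload.yaml
#   render_workload_pod rdma-eth-a100-02-workload \
#     a100-2.private.openshiftvcn.schmaustech.com > rdma-eth-a100-02-workload.yaml
```

Keeping a single template makes it harder for the two pods to drift apart, which matters here because both ends of the test should run with identical resources.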

With the pod files generated we can create them on the cluster.

$ oc create -f rdma-eth-a100-01-workload.yaml
pod/rdma-eth-a100-01-workload created

$ oc create -f rdma-eth-a100-02-workload.yaml
pod/rdma-eth-a100-02-workload created

Validate that the pods are running.

$ oc get pods
NAME                        READY   STATUS    RESTARTS   AGE
rdma-eth-a100-01-workload   1/1     Running   0          1m
rdma-eth-a100-02-workload   1/1     Running   0          1m
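
Instead of polling `oc get pods` by hand, the readiness check can be scripted with `oc wait`. This is a sketch only: the helper name `wait_for_workload_pods` and the 120 second timeout are assumptions, not something from the original setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until each named pod in the default namespace
# reports the Ready condition, or fail on the first pod that times out.
wait_for_workload_pods() {
  local pod
  for pod in "$@"; do
    oc -n default wait --for=condition=Ready "pod/${pod}" --timeout=120s || return 1
  done
}

# Usage on a live cluster:
#   wait_for_workload_pods rdma-eth-a100-01-workload rdma-eth-a100-02-workload
```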

Next we can rsh into each of them in separate terminal windows.

$ oc rsh rdma-eth-a100-01-workload
sh-5.1# cd /root
sh-5.1#

$ oc rsh rdma-eth-a100-02-workload
sh-5.1# cd /root
sh-5.1#

Building RDMA Validation Tests

The next steps must be run in both pods; they build perftest binaries with CUDA support.

First we need to download the CUDA repo package. Since our image is Fedora 35 based, we will pull down the Fedora 35 package with wget. Note that you might have to install wget first.

sh-5.1# wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm
--2024-11-20 16:06:08--  https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.199.20.126
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.199.20.126|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3795608809 (3.5G) [application/x-rpm]
Saving to: 'cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm'

cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm 100%[=========================================================================================================================================>] 3.53G 28.0MB/s in 2m 10s

2024-11-20 16:08:18 (27.9 MB/s) - 'cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm' saved [3795608809/3795608809]

Once the package is downloaded, install it with the rpm command.

sh-5.1# rpm -i cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm
warning: cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm: Header V4 RSA/SHA512 Signature, key ID d42d0685: NOKEY

Then clear the cached repository data with dnf clean all.

sh-5.1# dnf clean all
42 files removed

And finally install the CUDA toolkit.

sh-5.1# dnf -y install cuda
Fedora 35 - x86_64 - Updates             32 MB/s |  34 MB     00:01
Fedora Modular 35 - x86_64 - Updates    7.2 MB/s | 3.9 MB     00:00
Dependencies resolved.
==============================================================================================================================================================================================================================================
 Package                  Architecture   Version             Repository                  Size
==============================================================================================================================================================================================================================================
Installing:
 cuda                     x86_64         11.7.0-1            cuda-fedora35-11-7-local    2.7 k
Upgrading:
 systemd-libs             x86_64         249.13-6.fc35       updates                     599 k
Installing dependencies:
 NetworkManager-libnm     x86_64         1:1.32.12-2.fc35    updates                     1.7 M
 acl                      x86_64         2.3.1-2.fc35        fedora                       71 k
(...)
  tracker-3.2.1-1.fc35.x86_64
  tracker-miners-3.2.2-1.fc35.x86_64
  ttmkfdir-3.0.9-64.fc35.x86_64
  tzdata-java-2022g-1.fc35.noarch
  uchardet-0.0.6-14.fc35.x86_64
  upower-0.99.13-1.fc35.x86_64
  vulkan-loader-1.3.204.0-1.fc35.x86_64
  which-2.21-27.fc35.x86_64
  xcb-util-0.4.0-18.fc35.x86_64
  xcb-util-image-0.4.0-18.fc35.x86_64
  xcb-util-keysyms-0.4.0-16.fc35.x86_64
  xcb-util-renderutil-0.3.9-19.fc35.x86_64
  xcb-util-wm-0.4.1-21.fc35.x86_64
  xkbcomp-1.4.5-2.fc35.x86_64
  xkeyboard-config-2.33-2.fc35.noarch
  xml-common-0.6.3-57.fc35.noarch
  xorg-x11-drv-libinput-1.2.0-1.fc35.noarch
  xorg-x11-fonts-Type1-7.5-32.fc35.noarch
  xorg-x11-proto-devel-2021.5-1.fc35.noarch
  xorg-x11-server-Xorg-1.20.14-9.fc35.x86_64
  xorg-x11-server-common-1.20.14-9.fc35.x86_64
  xz-5.2.5-7.fc35.x86_64

Failed:
  nvidia-driver-cuda-3:515.43.04-1.fc35.x86_64
  nvidia-persistenced-3:515.43.04-1.fc35.x86_64

Error: Transaction failed

The CUDA toolkit installation reports a failed transaction for the nvidia-driver-cuda and nvidia-persistenced packages, but this is okay. The necessary files were installed to provide what we need for building perftest.
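
Since the install ends with an error, it can be reassuring to confirm that the files the later build steps rely on are really present. A minimal sketch, assuming a hypothetical helper named `check_cuda_build_files`; the two paths are the ones used later for CUDA_H_PATH and LIBRARY_PATH.

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm the CUDA header and library directory that
# the perftest build steps reference exist under the given CUDA root.
#   $1 = CUDA install root (defaults to /usr/local/cuda)
# Returns 0 if both are present, 1 otherwise.
check_cuda_build_files() {
  local cuda_root="${1:-/usr/local/cuda}"
  local missing=0
  if [ ! -f "${cuda_root}/include/cuda.h" ]; then
    echo "missing: ${cuda_root}/include/cuda.h"
    missing=1
  fi
  if [ ! -d "${cuda_root}/lib64" ]; then
    echo "missing: ${cuda_root}/lib64"
    missing=1
  fi
  return "${missing}"
}

# Usage inside the pod:
#   check_cuda_build_files && echo "CUDA build files found"
```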

Set the LD_LIBRARY_PATH and LIBRARY_PATH variables below.

sh-5.1# export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
sh-5.1# export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH

Next remove the existing /root/perftest directory in the pod and clone the perftest repository.

sh-5.1# rm -r -f perftest
sh-5.1# git clone https://github.com/linux-rdma/perftest.git
Cloning into 'perftest'...
remote: Enumerating objects: 6077, done.
remote: Counting objects: 100% (2157/2157), done.
remote: Compressing objects: 100% (398/398), done.
remote: Total 6077 (delta 1876), reused 1920 (delta 1747), pack-reused 3920 (from 1)
Receiving objects: 100% (6077/6077), 1.89 MiB | 43.11 MiB/s, done.
Resolving deltas: 100% (4826/4826), done.

Finally change into the perftest directory and build the binaries.

sh-5.1# cd perftest/
sh-5.1# ./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, 'config'.
libtoolize: copying file 'config/ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'.
libtoolize: copying file 'm4/libtool.m4'
libtoolize: copying file 'm4/ltoptions.m4'
libtoolize: copying file 'm4/ltsugar.m4'
libtoolize: copying file 'm4/ltversion.m4'
libtoolize: copying file 'm4/lt~obsolete.m4'
libtoolize: 'AC_PROG_RANLIB' is rendered obsolete by 'LT_INIT'
configure.ac:55: installing 'config/compile'
configure.ac:59: installing 'config/config.guess'
configure.ac:59: installing 'config/config.sub'
configure.ac:36: installing 'config/install-sh'
configure.ac:36: installing 'config/missing'
Makefile.am: installing 'config/depcomp'
configure: loading site script /usr/share/config.site
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether gcc understands -c and -o together... yes
checking whether make supports the include directive... yes (GNU style)
checking dependency style of gcc... gcc3
checking for g++... g++
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking dependency style of g++... gcc3
checking dependency style of gcc... gcc3
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
checking how to print strings... printf
checking for a sed that does not truncate output... /usr/bin/sed
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for fgrep... /usr/bin/grep -F
checking for ld used by gcc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 1572864
checking how to convert x86_64-pc-linux-gnu file names to x86_64-pc-linux-gnu format... func_convert_file_noop
checking how to convert x86_64-pc-linux-gnu file names to toolchain format... func_convert_file_noop
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for ar... ar
checking for archiver @FILE support... @
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from gcc object... ok
checking for sysroot... no
checking for a working dd... /usr/bin/dd
checking how to truncate binary pipes... /usr/bin/dd bs=4096 count=1
checking for mt... no
checking if : is a manifest tool... no
checking how to run the C preprocessor... gcc -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc supports -fno-rtti -fno-exceptions... no
checking for gcc option to produce PIC... -fPIC -DPIC
checking if gcc PIC flag -fPIC -DPIC works... yes
checking if gcc static flag -static works... no
checking if gcc supports -c -o file.o... yes
checking if gcc supports -c -o file.o... (cached) yes
checking whether the gcc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking how to run the C++ preprocessor... g++ -E
checking for ld used by g++... /usr/bin/ld -m elf_x86_64
checking if the linker (/usr/bin/ld -m elf_x86_64) is GNU ld... yes
checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking for g++ option to produce PIC... -fPIC -DPIC
checking if g++ PIC flag -fPIC -DPIC works... yes
checking if g++ static flag -static works... no
checking if g++ supports -c -o file.o... yes
checking if g++ supports -c -o file.o... (cached) yes
checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking dynamic linker characteristics... (cached) GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking for ranlib... (cached) ranlib
checking for ANSI C header files... (cached) yes
checking infiniband/verbs.h usability... yes
checking infiniband/verbs.h presence... yes
checking for infiniband/verbs.h... yes
checking for ibv_get_device_list in -libverbs... yes
checking for rdma_create_event_channel in -lrdmacm... yes
checking for umad_init in -libumad... yes
checking for log in -lm... yes
checking for ibv_reg_dmabuf_mr in -libverbs... yes
checking pci/pci.h usability... yes
checking pci/pci.h presence... yes
checking for pci/pci.h... yes
checking for pci_init in -lpci... yes
checking for cuMemGetHandleForAddressRange in -lcuda... yes
checking for efadv_create_qp_ex in -lefa... yes
checking for mlx5dv_create_qp in -lmlx5... yes
checking for hnsdv_query_device in -lhns... no
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: creating config.h
config.status: executing depfiles commands
config.status: executing libtool commands
config.status: executing man commands
make all-am
make[1]: Entering directory '/root/perftest'
ln -s .././man/perftest.1 man/ib_read_bw.1
ln -s .././man/perftest.1 man/ib_write_bw.1
ln -s .././man/perftest.1 man/ib_send_bw.1
ln -s .././man/perftest.1 man/ib_atomic_bw.1
ln -s .././man/perftest.1 man/ib_write_lat.1
ln -s .././man/perftest.1 man/ib_read_lat.1
ln -s .././man/perftest.1 man/ib_send_lat.1
ln -s .././man/perftest.1 man/ib_atomic_lat.1
ln -s .././man/perftest.1 man/raw_ethernet_bw.1
ln -s .././man/perftest.1 man/raw_ethernet_lat.1
  CC       src/send_bw.o
ln -s .././man/perftest.1 man/raw_ethernet_burst_lat.1
ln -s .././man/perftest.1 man/raw_ethernet_fs_rate.1
  CC       src/multicast_resources.o
  CC       src/get_clock.o
  CC       src/perftest_communication.o
  CC       src/perftest_parameters.o
  CC       src/perftest_resources.o
  CC       src/perftest_counters.o
  CC       src/host_memory.o
  CC       src/mmap_memory.o
  CC       src/cuda_memory.o
  CC       src/raw_ethernet_resources.o
  CC       src/send_lat.o
  CC       src/write_lat.o
  CC       src/write_bw.o
  CC       src/read_lat.o
  CC       src/read_bw.o
  CC       src/atomic_lat.o
  CC       src/atomic_bw.o
  CC       src/raw_ethernet_send_bw.o
  CC       src/raw_ethernet_send_lat.o
  CC       src/raw_ethernet_send_burst_lat.o
  CC       src/raw_ethernet_fs_rate.o
  AR       libperftest.a
  CCLD     ib_send_bw
  CCLD     ib_write_lat
  CCLD     ib_send_lat
  CCLD     ib_write_bw
  CCLD     ib_read_lat
  CCLD     ib_read_bw
  CCLD     ib_atomic_lat
  CCLD     ib_atomic_bw
  CCLD     raw_ethernet_bw
  CCLD     raw_ethernet_lat
  CCLD     raw_ethernet_burst_lat
  CCLD     raw_ethernet_fs_rate
make[1]: Leaving directory '/root/perftest'

With the binaries built we can move on to running our validation tests.
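
Before running anything, a quick sanity check is to see whether the freshly built binary advertises the CUDA option at all. The sketch below assumes that perftest only lists `--use_cuda` in its help text when it was compiled with CUDA support; the helper name `perftest_has_cuda` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given perftest binary mentions the
# --use_cuda option in its help output (assumed to appear only in builds
# that were configured with CUDA support).
#   $1 = path to a perftest binary
perftest_has_cuda() {
  local binary="$1"
  "${binary}" --help 2>&1 | grep -q -- '--use_cuda'
}

# Usage inside the pod:
#   perftest_has_cuda /root/perftest/ib_write_bw && echo "CUDA-capable build"
```

The configure output above is another indicator: the line checking for cuMemGetHandleForAddressRange in -lcuda ended in "yes" and src/cuda_memory.o was compiled.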

Running RDMA Validation Tests

We should already have our workload pods running on the cluster in the default namespace.

$ oc get pods
NAME                        READY   STATUS    RESTARTS   AGE
rdma-eth-a100-01-workload   1/1     Running   0          15m
rdma-eth-a100-02-workload   1/1     Running   0          15m

Next we need two rsh connections, one into each pod.

$ oc rsh rdma-eth-a100-01-workload
sh-5.1#

$ oc rsh rdma-eth-a100-02-workload
sh-5.1#

Then in the rsh connection into rdma-eth-a100-01-workload we will run the following ib_write_bw command. This pod acts as the server and waits for the client to connect.

sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 172.16.0.1
 WARNING: BW peak won't be measured in this run.

************************************
* Waiting for client to connect... *
************************************

Then in the second rsh connection into rdma-eth-a100-02-workload we will run the following ib_write_bw command, which adds the server address 172.16.0.1 as the final argument. Note that this test runs without CUDA and will take a few minutes.

sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 172.16.0.2 172.16.0.1
 WARNING: BW peak won't be measured in this run.
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 16           Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm      TOS            : 41
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x00bd PSN 0x6e902d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00be PSN 0xdf3b13 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00bf PSN 0x14ba61 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c0 PSN 0xd9209c GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c1 PSN 0xc07f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c2 PSN 0xf06575 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c3 PSN 0x481230 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c4 PSN 0xc1a69 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c5 PSN 0x7c6e59 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c6 PSN 0xf16f67 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c7 PSN 0xe82e7f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c8 PSN 0xf0a6a6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c9 PSN 0x41069a GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00ca PSN 0xe2153f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00cb PSN 0xed2a91 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00cc PSN 0x2f3581 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 remote address: LID 0000 QPN 0x00c7 PSN 0xc8665d GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00c8 PSN 0xce8d83 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00c9 PSN 0x4b7411 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00ca PSN 0x3a508c GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00cb PSN 0xb3c9af GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00cc PSN 0xac6ee5 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00cd PSN 0x12b6e0 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00ce PSN 0x8a5959 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00cf PSN 0x8da89 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00d0 PSN 0x2e9fd7 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00d1 PSN 0xba6e2f GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00d2 PSN 0xede496 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00d3 PSN 0xfa05ca GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00d4 PSN 0x2bdcaf GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00d5 PSN 0xc5b541 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00d6 PSN 0x3c6271 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      5296615          0.00               92.56              0.176539
---------------------------------------------------------------------------------------
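
The ib_write_bw command line in this section is long, so here is a sketch that assembles it flag by flag with a short note on each option. The helper name `write_bw_cmd` is hypothetical and the function only prints the command; the flag values are the ones used in this blog and the descriptions reflect my reading of the perftest options.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the ib_write_bw command used in this blog.
#   $1 = local source IP
#   $2 = server IP (optional; leave empty on the server side)
#   $3 = CUDA device index (optional; adds --use_cuda)
write_bw_cmd() {
  local source_ip="$1" server_ip="${2:-}" cuda_dev="${3:-}"
  local cmd="/root/perftest/ib_write_bw"
  cmd="${cmd} -R"               # establish the queue pairs through rdma_cm
  cmd="${cmd} -T 41"            # type-of-service value for the traffic
  cmd="${cmd} -s 65536"         # message size in bytes
  cmd="${cmd} -F"               # do not warn about CPU frequency scaling
  cmd="${cmd} -x 3"             # GID index to use on the device
  cmd="${cmd} -m 4096"          # MTU in bytes
  cmd="${cmd} --report_gbits"   # report bandwidth in Gb/sec
  cmd="${cmd} -q 16"            # number of queue pairs
  cmd="${cmd} -D 60"            # run for a fixed duration in seconds
  cmd="${cmd} -d mlx5_1"        # RDMA device to use
  cmd="${cmd} -p 10000"         # TCP port for the initial connection
  cmd="${cmd} --source_ip ${source_ip}"
  if [ -n "${cuda_dev}" ]; then
    cmd="${cmd} --use_cuda=${cuda_dev}"   # place the buffers in GPU memory
  fi
  if [ -n "${server_ip}" ]; then
    cmd="${cmd} ${server_ip}"             # client side: server to connect to
  fi
  echo "${cmd}"
}

# Usage:
#   write_bw_cmd 172.16.0.1                 # server, host memory
#   write_bw_cmd 172.16.0.2 172.16.0.1      # client, host memory
#   write_bw_cmd 172.16.0.2 172.16.0.1 0    # client, GPU 0 memory
```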

Now we are going to repeat the test, but this time include the GPU by adding the --use_cuda switch to the command. In the first rsh connection, run the following.

sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 172.16.0.1 --use_cuda=0
 WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0

************************************
* Waiting for client to connect... *
************************************

Then in the second rsh connection run the following.

sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_1 -p 10000 --source_ip 172.16.0.2 --use_cuda=0 172.16.0.1
 WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 0F:00

Picking device No. 0
[pid = 4488, dev = 0] device name = [NVIDIA A100-SXM4-80GB]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 2097152 bytes GPU buffer
allocated GPU buffer address at 00007fbebf200000 pointer=0x7fbebf200000
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 16           Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm      TOS            : 41
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x00ce PSN 0x282aa6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00cf PSN 0x3ab698 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d0 PSN 0x9dd002 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d1 PSN 0x11fc29 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d2 PSN 0x72e988 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d3 PSN 0xb5f44a GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d4 PSN 0x1540e1 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d5 PSN 0x8801c6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d6 PSN 0xd77ef2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d7 PSN 0xacf68c GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d8 PSN 0x47f740 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d9 PSN 0x286d3 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00da PSN 0xc1e7c3 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00db PSN 0xd8c9b4 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00dc PSN 0xf51e62 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00dd PSN 0x4bcb7e GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 remote address: LID 0000 QPN 0x00d8 PSN 0x533e76 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00d9 PSN 0x1f0628 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00da PSN 0xf1052 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00db PSN 0xd23e39 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00dc PSN 0x696a58 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00dd PSN 0x25acda GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00de PSN 0x383631 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00df PSN 0x9054d6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00e0 PSN 0xc33cc2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00e1 PSN 0x55a81c GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00e2 PSN 0x62f190 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00e3 PSN 0x22fae3 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00e4 PSN 0x99b293 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00e5 PSN 0xb10444 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00e6 PSN 0x636db2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00e7 PSN 0x45708e GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      3994687          0.00               69.79              0.133122
---------------------------------------------------------------------------------------
deallocating GPU buffer 00007fbebf200000
destroying current CUDA Ctx
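
When the test is scripted rather than run by hand, the number of interest is the "BW average" column in the final results row. A minimal sketch of pulling it out with awk, assuming the output was saved to a file; the helper name `bw_average` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read ib_write_bw output on stdin and print the
# "BW average[Gb/sec]" value from the first results row.
bw_average() {
  awk '
    # The column header row starts with "#bytes"; remember that we saw it.
    /^ *#bytes/ { header_seen = 1; next }
    # The results row has exactly five fields; the fourth is BW average.
    header_seen && NF == 5 { print $4; exit }
  '
}

# Usage, assuming the client output was captured in a file:
#   bw_average < write_bw_cuda.log
```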

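Once the tests complete we have confirmed that RDMA is working both from host memory and from GPU memory. In this environment the run without CUDA averaged 92.56 Gb/sec and the run with --use_cuda averaged 69.79 Gb/sec. With that validated we can now add our real-world workload.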
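
To put the two results side by side, the arithmetic below compares the GPU-memory run against the host-memory run and cross-checks the reported message rate from the bandwidth and message size. It is only a sketch with hypothetical helper names; the input numbers are the ones from the outputs above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: GPU-memory bandwidth as a percentage of the
# host-memory bandwidth.
#   $1 = host-memory BW average (Gb/sec), $2 = GPU-memory BW average (Gb/sec)
bw_ratio_percent() {
  awk -v host="$1" -v gpu="$2" 'BEGIN { printf "%.1f\n", 100 * gpu / host }'
}

# Hypothetical helper: expected message rate in Mpps.
# messages per second = bits per second / bits per message
#   $1 = BW average (Gb/sec), $2 = message size in bytes
msg_rate_mpps() {
  awk -v gbps="$1" -v bytes="$2" \
    'BEGIN { printf "%.4f\n", gbps * 1e9 / (bytes * 8) / 1e6 }'
}

# Usage with the numbers from this blog:
#   bw_ratio_percent 92.56 69.79
#   msg_rate_mpps 92.56 65536
#   msg_rate_mpps 69.79 65536
```

The message rates computed this way land within rounding of the MsgRate column that perftest printed, which is a quick consistency check on the captured output.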

Hopefully this blog was useful in showing RDMA with CUDA testing on an OpenShift environment.