SCHMAUSTECH: Monday, January 6, 2025

In a previous blog I described how to configure an OpenShift cluster for RDMA using the NVIDIA Network Operator and NVIDIA GPU Operator. However, that blog only covered simple RDMA testing across the network interfaces, with no involvement of the GPU. In this blog I will extend the testing so that it involves the GPU and the CUDA libraries. Keep in mind that this testing only validates that the configuration is set up correctly; it should not replace real-world workload testing of an application.

In this example we are using the same versions of OpenShift and the operators as in the previous blog so I will not go into those details here. What we will capture below is how to configure the container appropriately to do the RDMA+CUDA testing.

The first thing we need to do is create a ServiceAccount in the default namespace. We can do so by generating the custom resource file below and creating it on the cluster.

$ cat <<EOF > default-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rdma
  namespace: default
EOF

$ oc create -f default-serviceaccount.yaml
serviceaccount/rdma created

Now that the rdma account is created let's give it privileged access.

$ oc -n default adm policy add-scc-to-user privileged -z rdma
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "rdma"

Next we will generate two pod custom resource files, one for each of the two baremetal A100 nodes in our environment, to run our workload pod image.

$ cat <<EOF > rdma-eth-a100-01-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-a100-01-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector: 
    kubernetes.io/hostname: a100-1.private.openshiftvcn.schmaustech.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-eth-a100-01-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF

$ cat <<EOF > rdma-eth-a100-02-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-a100-02-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector: 
    kubernetes.io/hostname: a100-2.private.openshiftvcn.schmaustech.com
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-eth-a100-02-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF
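The two manifests above differ only in the pod index and the node hostname. As a minimal sketch, assuming the same naming pattern holds for every node, both files could be rendered from one template. The function name render_workload_pod is hypothetical and not part of any tool used in this blog.

```shell
# Render the workload pod manifest for one node (hypothetical helper).
# $1 = two-digit pod index (for example 01), $2 = node hostname
render_workload_pod() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-a100-$1-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: $2
  serviceAccountName: rdma
  containers:
  - image: quay.io/redhat_emp1/ecosys-nvidia/gpu-operator:tools
    name: rdma-eth-a100-$1-workload
    command:
      - sh
      - -c
      - sleep inf
    securityContext:
      privileged: true
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF
}

# Write one manifest per node, using the same file names as above.
for n in 1 2; do
  render_workload_pod "0$n" "a100-$n.private.openshiftvcn.schmaustech.com" \
    > "rdma-eth-a100-0$n-workload.yaml"
done
```

Keeping the template in one place means a change to the image or the resource requests only has to be made once.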

With the pod files generated we can create them on the cluster.

$ oc create -f rdma-eth-a100-01-workload.yaml
pod/rdma-eth-a100-01-workload created

$ oc create -f rdma-eth-a100-02-workload.yaml
pod/rdma-eth-a100-02-workload created

Validate that the pods are running.

$ oc get pods
NAME                        READY   STATUS    RESTARTS   AGE
rdma-eth-a100-01-workload   1/1     Running   0          1m
rdma-eth-a100-02-workload   1/1     Running   0          1m
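If this check needs to be scripted rather than read by eye, the STATUS column can be tested directly. The sketch below assumes the default column layout shown above (STATUS is the third column). The helper name all_pods_running is hypothetical, and the sample input is copied from the output above so the snippet runs without a cluster.

```shell
# Succeeds only if every pod row read from stdin has STATUS "Running".
all_pods_running() {
  awk 'NR > 1 { seen = 1; if ($3 != "Running") bad = 1 }
       END { if (seen && !bad) exit 0; exit 1 }'
}

# Sample input from the output above; on a cluster, pipe in `oc get pods` instead.
all_pods_running <<'EOF' && echo "all workload pods are Running"
NAME                        READY   STATUS    RESTARTS   AGE
rdma-eth-a100-01-workload   1/1     Running   0          1m
rdma-eth-a100-02-workload   1/1     Running   0          1m
EOF
```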

Next we can rsh into each of them in separate terminal windows.

$ oc rsh rdma-eth-a100-01-workload
sh-5.1# cd /root
sh-5.1#

$ oc rsh rdma-eth-a100-02-workload
sh-5.1# cd /root
sh-5.1#

Building RDMA Validation Tests

The next steps must be run in both pods. They rebuild perftest so that its binaries are CUDA capable.

First we need to download the CUDA repo package. Since our image is Fedora 35 based, we will pull down a Fedora 35 based package with wget. Note that wget might need to be installed first.

sh-5.1# wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm
--2024-11-20 16:06:08--  https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.199.20.126
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.199.20.126|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3795608809 (3.5G) [application/x-rpm]
Saving to: 'cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm'

cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm 100%[=========================================================================================================================================>]   3.53G  28.0MB/s    in 2m 10s  

2024-11-20 16:08:18 (27.9 MB/s) - 'cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm' saved [3795608809/3795608809]

Once the package is downloaded, install it with the rpm command.

sh-5.1# rpm -i cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm
warning: cuda-repo-fedora35-11-7-local-11.7.0_515.43.04-1.x86_64.rpm: Header V4 RSA/SHA512 Signature, key ID d42d0685: NOKEY

Then clear the dnf caches with dnf clean all.

sh-5.1# dnf clean all
42 files removed

And finally install the CUDA toolkit.

sh-5.1# dnf -y install cuda
Fedora 35 - x86_64 - Updates                                                                                                                                                                                   32 MB/s |  34 MB     00:01    
Fedora Modular 35 - x86_64 - Updates                                                                                                                                                                          7.2 MB/s | 3.9 MB     00:00    
Dependencies resolved.
==============================================================================================================================================================================================================================================
 Package                                                            Architecture                              Version                                                       Repository                                                   Size
==============================================================================================================================================================================================================================================
Installing:
 cuda                                                               x86_64                                    11.7.0-1                                                      cuda-fedora35-11-7-local                                    2.7 k
Upgrading:
 systemd-libs                                                       x86_64                                    249.13-6.fc35                                                 updates                                                     599 k
Installing dependencies:
 NetworkManager-libnm                                               x86_64                                    1:1.32.12-2.fc35                                              updates                                                     1.7 M
 acl                                                                x86_64                                    2.3.1-2.fc35                                                  fedora                                                       71 k
(...)
  tracker-3.2.1-1.fc35.x86_64                              tracker-miners-3.2.2-1.fc35.x86_64                            ttmkfdir-3.0.9-64.fc35.x86_64                           tzdata-java-2022g-1.fc35.noarch                           
  uchardet-0.0.6-14.fc35.x86_64                            upower-0.99.13-1.fc35.x86_64                                  vulkan-loader-1.3.204.0-1.fc35.x86_64                   which-2.21-27.fc35.x86_64                                 
  xcb-util-0.4.0-18.fc35.x86_64                            xcb-util-image-0.4.0-18.fc35.x86_64                           xcb-util-keysyms-0.4.0-16.fc35.x86_64                   xcb-util-renderutil-0.3.9-19.fc35.x86_64                  
  xcb-util-wm-0.4.1-21.fc35.x86_64                         xkbcomp-1.4.5-2.fc35.x86_64                                   xkeyboard-config-2.33-2.fc35.noarch                     xml-common-0.6.3-57.fc35.noarch                           
  xorg-x11-drv-libinput-1.2.0-1.fc35.x86_64                xorg-x11-fonts-Type1-7.5-32.fc35.noarch                       xorg-x11-proto-devel-2021.5-1.fc35.noarch               xorg-x11-server-Xorg-1.20.14-9.fc35.x86_64                
  xorg-x11-server-common-1.20.14-9.fc35.x86_64             xz-5.2.5-7.fc35.x86_64                                       
Failed:
  nvidia-driver-cuda-3:515.43.04-1.fc35.x86_64                                                                          nvidia-persistenced-3:515.43.04-1.fc35.x86_64                                                                         

Error: Transaction failed

The CUDA toolkit installation reports a failed transaction, but this is okay: only the nvidia-driver-cuda and nvidia-persistenced packages failed, and the files we need for building perftest were installed.
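Because the transaction ends with an error, it can be worth confirming that the pieces the build relies on are really present. The following is a minimal sketch that checks only the two paths this blog uses later (the cuda.h header and the lib64 directory under /usr/local/cuda). The helper name check_cuda_toolkit is hypothetical, and the list of paths is my assumption of what matters, not an exhaustive check.

```shell
# Report whether the files the perftest build needs exist under a CUDA prefix.
# $1 = CUDA install prefix (defaults to /usr/local/cuda, the path used in this blog)
check_cuda_toolkit() {
  prefix="${1:-/usr/local/cuda}"
  rc=0
  for path in "$prefix/include/cuda.h" "$prefix/lib64"; do
    if [ -e "$path" ]; then
      echo "found:   $path"
    else
      echo "missing: $path"
      rc=1
    fi
  done
  return "$rc"
}

check_cuda_toolkit || echo "CUDA toolkit files are incomplete"
```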

Set the LD_LIBRARY_PATH and LIBRARY_PATH variables below.

sh-5.1# export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
sh-5.1# export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
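One detail of the exports above: if the variable was empty beforehand, the result ends in a trailing colon, and an empty entry in these search paths is generally interpreted as the current directory. A small sketch that avoids the trailing colon is shown here; the helper name prepend_path is hypothetical.

```shell
# Print $1 prepended to the colon-separated list $2, without a trailing colon.
prepend_path() {
  # ${2:+:$2} expands to ":$2" only when $2 is set and non-empty.
  printf '%s%s\n' "$1" "${2:+:$2}"
}

export LD_LIBRARY_PATH="$(prepend_path /usr/local/cuda/lib64 "${LD_LIBRARY_PATH:-}")"
export LIBRARY_PATH="$(prepend_path /usr/local/cuda/lib64 "${LIBRARY_PATH:-}")"
echo "$LD_LIBRARY_PATH"
```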

Next remove the existing /root/perftest directory in the pod and clone the perftest repository.

sh-5.1# rm -r -f perftest
sh-5.1# git clone https://github.com/linux-rdma/perftest.git
Cloning into 'perftest'...
remote: Enumerating objects: 6077, done.
remote: Counting objects: 100% (2157/2157), done.
remote: Compressing objects: 100% (398/398), done.
remote: Total 6077 (delta 1876), reused 1920 (delta 1747), pack-reused 3920 (from 1)
Receiving objects: 100% (6077/6077), 1.89 MiB | 43.11 MiB/s, done.
Resolving deltas: 100% (4826/4826), done.

Finally change into the perftest directory and build the binaries.

sh-5.1# cd perftest/
sh-5.1# ./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, 'config'.
libtoolize: copying file 'config/ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'.
libtoolize: copying file 'm4/libtool.m4'
libtoolize: copying file 'm4/ltoptions.m4'
libtoolize: copying file 'm4/ltsugar.m4'
libtoolize: copying file 'm4/ltversion.m4'
libtoolize: copying file 'm4/lt~obsolete.m4'
libtoolize: 'AC_PROG_RANLIB' is rendered obsolete by 'LT_INIT'
configure.ac:55: installing 'config/compile'
configure.ac:59: installing 'config/config.guess'
configure.ac:59: installing 'config/config.sub'
configure.ac:36: installing 'config/install-sh'
configure.ac:36: installing 'config/missing'
Makefile.am: installing 'config/depcomp'
configure: loading site script /usr/share/config.site
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether gcc understands -c and -o together... yes
checking whether make supports the include directive... yes (GNU style)
checking dependency style of gcc... gcc3
checking for g++... g++
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking dependency style of g++... gcc3
checking dependency style of gcc... gcc3
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
checking how to print strings... printf
checking for a sed that does not truncate output... /usr/bin/sed
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for fgrep... /usr/bin/grep -F
checking for ld used by gcc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 1572864
checking how to convert x86_64-pc-linux-gnu file names to x86_64-pc-linux-gnu format... func_convert_file_noop
checking how to convert x86_64-pc-linux-gnu file names to toolchain format... func_convert_file_noop
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for ar... ar
checking for archiver @FILE support... @
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from gcc object... ok
checking for sysroot... no
checking for a working dd... /usr/bin/dd
checking how to truncate binary pipes... /usr/bin/dd bs=4096 count=1
checking for mt... no
checking if : is a manifest tool... no
checking how to run the C preprocessor... gcc -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc supports -fno-rtti -fno-exceptions... no
checking for gcc option to produce PIC... -fPIC -DPIC
checking if gcc PIC flag -fPIC -DPIC works... yes
checking if gcc static flag -static works... no
checking if gcc supports -c -o file.o... yes
checking if gcc supports -c -o file.o... (cached) yes
checking whether the gcc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking how to run the C++ preprocessor... g++ -E
checking for ld used by g++... /usr/bin/ld -m elf_x86_64
checking if the linker (/usr/bin/ld -m elf_x86_64) is GNU ld... yes
checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking for g++ option to produce PIC... -fPIC -DPIC
checking if g++ PIC flag -fPIC -DPIC works... yes
checking if g++ static flag -static works... no
checking if g++ supports -c -o file.o... yes
checking if g++ supports -c -o file.o... (cached) yes
checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking dynamic linker characteristics... (cached) GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking for ranlib... (cached) ranlib
checking for ANSI C header files... (cached) yes
checking infiniband/verbs.h usability... yes
checking infiniband/verbs.h presence... yes
checking for infiniband/verbs.h... yes
checking for ibv_get_device_list in -libverbs... yes
checking for rdma_create_event_channel in -lrdmacm... yes
checking for umad_init in -libumad... yes
checking for log in -lm... yes
checking for ibv_reg_dmabuf_mr in -libverbs... yes
checking pci/pci.h usability... yes
checking pci/pci.h presence... yes
checking for pci/pci.h... yes
checking for pci_init in -lpci... yes
checking for cuMemGetHandleForAddressRange in -lcuda... yes
checking for efadv_create_qp_ex in -lefa... yes
checking for mlx5dv_create_qp in -lmlx5... yes
checking for hnsdv_query_device in -lhns... no
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: creating config.h
config.status: executing depfiles commands
config.status: executing libtool commands
config.status: executing man commands
make  all-am
make[1]: Entering directory '/root/perftest'
ln -s .././man/perftest.1 man/ib_read_bw.1
ln -s .././man/perftest.1 man/ib_write_bw.1
ln -s .././man/perftest.1 man/ib_send_bw.1
ln -s .././man/perftest.1 man/ib_atomic_bw.1
ln -s .././man/perftest.1 man/ib_write_lat.1
ln -s .././man/perftest.1 man/ib_read_lat.1
ln -s .././man/perftest.1 man/ib_send_lat.1
ln -s .././man/perftest.1 man/ib_atomic_lat.1
ln -s .././man/perftest.1 man/raw_ethernet_bw.1
ln -s .././man/perftest.1 man/raw_ethernet_lat.1
  CC       src/send_bw.o
ln -s .././man/perftest.1 man/raw_ethernet_burst_lat.1
ln -s .././man/perftest.1 man/raw_ethernet_fs_rate.1
  CC       src/multicast_resources.o
  CC       src/get_clock.o
  CC       src/perftest_communication.o
  CC       src/perftest_parameters.o
  CC       src/perftest_resources.o
  CC       src/perftest_counters.o
  CC       src/host_memory.o
  CC       src/mmap_memory.o
  CC       src/cuda_memory.o
  CC       src/raw_ethernet_resources.o
  CC       src/send_lat.o
  CC       src/write_lat.o
  CC       src/write_bw.o
  CC       src/read_lat.o
  CC       src/read_bw.o
  CC       src/atomic_lat.o
  CC       src/atomic_bw.o
  CC       src/raw_ethernet_send_bw.o
  CC       src/raw_ethernet_send_lat.o
  CC       src/raw_ethernet_send_burst_lat.o
  CC       src/raw_ethernet_fs_rate.o
  AR       libperftest.a
  CCLD     ib_send_bw
  CCLD     ib_write_lat
  CCLD     ib_send_lat
  CCLD     ib_write_bw
  CCLD     ib_read_lat
  CCLD     ib_read_bw
  CCLD     ib_atomic_lat
  CCLD     ib_atomic_bw
  CCLD     raw_ethernet_bw
  CCLD     raw_ethernet_lat
  CCLD     raw_ethernet_burst_lat
  CCLD     raw_ethernet_fs_rate
make[1]: Leaving directory '/root/perftest'
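Two lines in the output above indicate that CUDA support was picked up: the configure check for cuMemGetHandleForAddressRange in -lcuda answered yes, and make compiled src/cuda_memory.o. As a rough sketch, assuming the build only compiles that object when CUDA is detected, a saved build log can be checked for it. The helper name build_has_cuda and the build.log file name are hypothetical.

```shell
# Succeeds if a saved build log shows the CUDA memory object being compiled.
# $1 = path to a log captured with, for example:  make -j 2>&1 | tee build.log
build_has_cuda() {
  grep -q 'src/cuda_memory\.o' "$1"
}

# Demo with two lines copied from the make output above.
log="$(mktemp)"
printf '  CC       src/mmap_memory.o\n  CC       src/cuda_memory.o\n' > "$log"
if build_has_cuda "$log"; then
  echo "perftest was built with CUDA support"
else
  echo "no CUDA objects found in the build log"
fi
rm -f "$log"
```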

With the binaries built we can move on to running our validation tests.

Running RDMA Validation Tests

We should already have our workload pods running on the cluster in the default namespace.

$ oc get pods
NAME                        READY   STATUS    RESTARTS   AGE
rdma-eth-a100-01-workload   1/1     Running   0          15m
rdma-eth-a100-02-workload   1/1     Running   0          15m

Next we will need to open two rsh connections, one into each pod.

$ oc rsh rdma-eth-a100-01-workload
sh-5.1# 

$ oc rsh rdma-eth-a100-02-workload
sh-5.1#

Then in the rsh connection into rdma-eth-a100-01-workload we will run the following ib_write_bw command.

sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60  -d mlx5_1 -p 10000 --source_ip 172.16.0.1
 WARNING: BW peak won't be measured in this run.

************************************
* Waiting for client to connect... *
************************************
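The ib_write_bw command line carries a lot of flags, so here is a sketch that assembles the same command from commented pieces and prints it instead of running it. The flag descriptions are my reading of the perftest help text and of the test summary perftest prints, so treat them as a guide and check ib_write_bw --help on your build. The function name ib_write_bw_cmd is hypothetical. The server side passes only its own address; the client side appends the server address as the last argument.

```shell
# Print the ib_write_bw command line used in this blog (hypothetical helper).
# $1 = local IP for --source_ip, $2 = server IP (omit on the server side)
ib_write_bw_cmd() {
  cmd="/root/perftest/ib_write_bw"
  cmd="$cmd -R"               # connect the queue pairs through rdma_cm
  cmd="$cmd -T 41"            # type-of-service value for the connections
  cmd="$cmd -s 65536"         # message size in bytes
  cmd="$cmd -F"               # ignore CPU frequency scaling warnings
  cmd="$cmd -x 3"             # GID index to use on the device
  cmd="$cmd -m 4096"          # MTU in bytes
  cmd="$cmd --report_gbits"   # report bandwidth in Gb/sec
  cmd="$cmd -q 16"            # number of queue pairs
  cmd="$cmd -D 60"            # run for a duration (seconds), not an iteration count
  cmd="$cmd -d mlx5_1"        # RDMA device to test
  cmd="$cmd -p 10000"         # port to listen on or connect to
  cmd="$cmd --source_ip $1"   # local address to use
  if [ -n "${2:-}" ]; then
    cmd="$cmd $2"             # server address, client side only
  fi
  echo "$cmd"
}

ib_write_bw_cmd 172.16.0.1              # server side
ib_write_bw_cmd 172.16.0.2 172.16.0.1   # client side
```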

Then in the second rsh connection, into rdma-eth-a100-02-workload, we will run the following ib_write_bw command. Note that this test runs without CUDA and will take a few minutes.

sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60  -d mlx5_1 -p 10000 --source_ip 172.16.0.2 172.16.0.1
 WARNING: BW peak won't be measured in this run.
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF        Device         : mlx5_1
 Number of qps   : 16        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 PCIe relax order: ON        Lock-free      : OFF
 ibv_wr* API     : ON        Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm     TOS    : 41
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x00bd PSN 0x6e902d
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00be PSN 0xdf3b13
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00bf PSN 0x14ba61
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c0 PSN 0xd9209c
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c1 PSN 0xc07f
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c2 PSN 0xf06575
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c3 PSN 0x481230
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c4 PSN 0xc1a69
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c5 PSN 0x7c6e59
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c6 PSN 0xf16f67
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c7 PSN 0xe82e7f
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c8 PSN 0xf0a6a6
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00c9 PSN 0x41069a
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00ca PSN 0xe2153f
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00cb PSN 0xed2a91
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00cc PSN 0x2f3581
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 remote address: LID 0000 QPN 0x00c7 PSN 0xc8665d
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00c8 PSN 0xce8d83
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00c9 PSN 0x4b7411
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00ca PSN 0x3a508c
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00cb PSN 0xb3c9af
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00cc PSN 0xac6ee5
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00cd PSN 0x12b6e0
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00ce PSN 0x8a5959
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00cf PSN 0x8da89
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00d0 PSN 0x2e9fd7
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00d1 PSN 0xba6e2f
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00d2 PSN 0xede496
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00d3 PSN 0xfa05ca
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00d4 PSN 0x2bdcaf
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00d5 PSN 0xc5b541
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00d6 PSN 0x3c6271
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      5296615          0.00               92.56               0.176539
---------------------------------------------------------------------------------------
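The number to look at is the BW average column in the last table. For scripted runs, a minimal sketch that pulls that value out of saved perftest output could look like this. It assumes the five-column summary layout shown above, and the helper name bw_average is hypothetical.

```shell
# Print the "BW average[Gb/sec]" value from perftest output read on stdin.
bw_average() {
  # Skip to the summary header, then take column 4 of the first data row.
  awk '/#bytes/ { header = 1; next }
       header && NF >= 5 { print $4; exit }'
}

# Sample input copied from the result above.
bw_average <<'EOF'
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      5296615          0.00               92.56               0.176539
---------------------------------------------------------------------------------------
EOF
```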

Now we will repeat the test, this time including the GPU by adding the --use_cuda switch to the command. In the first rsh connection, run the following.

sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60  -d mlx5_1 -p 10000 --source_ip 172.16.0.1 --use_cuda=0
 WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0

************************************
* Waiting for client to connect... *
************************************

Then in the second rsh connection run the following.

sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60  -d mlx5_1 -p 10000 --source_ip 172.16.0.2 --use_cuda=0 172.16.0.1
 WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 0F:00

Picking device No. 0
[pid = 4488, dev = 0] device name = [NVIDIA A100-SXM4-80GB]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 2097152 bytes GPU buffer
allocated GPU buffer address at 00007fbebf200000 pointer=0x7fbebf200000
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF        Device         : mlx5_1
 Number of qps   : 16        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 PCIe relax order: ON        Lock-free      : OFF
 ibv_wr* API     : ON        Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm     TOS    : 41
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x00ce PSN 0x282aa6
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00cf PSN 0x3ab698
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d0 PSN 0x9dd002
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d1 PSN 0x11fc29
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d2 PSN 0x72e988
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d3 PSN 0xb5f44a
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d4 PSN 0x1540e1
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d5 PSN 0x8801c6
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d6 PSN 0xd77ef2
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d7 PSN 0xacf68c
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d8 PSN 0x47f740
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00d9 PSN 0x286d3
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00da PSN 0xc1e7c3
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00db PSN 0xd8c9b4
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00dc PSN 0xf51e62
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 local address: LID 0000 QPN 0x00dd PSN 0x4bcb7e
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:100
 remote address: LID 0000 QPN 0x00d8 PSN 0x533e76
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00d9 PSN 0x1f0628
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00da PSN 0xf1052
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00db PSN 0xd23e39
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00dc PSN 0x696a58
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00dd PSN 0x25acda
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00de PSN 0x383631
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00df PSN 0x9054d6
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00e0 PSN 0xc33cc2
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00e1 PSN 0x55a81c
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00e2 PSN 0x62f190
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00e3 PSN 0x22fae3
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00e4 PSN 0x99b293
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00e5 PSN 0xb10444
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00e6 PSN 0x636db2
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
 remote address: LID 0000 QPN 0x00e7 PSN 0x45708e
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:03:200
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      3994687          0.00               69.79               0.133122
---------------------------------------------------------------------------------------
deallocating GPU buffer 00007fbebf200000
destroying current CUDA Ctx
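To put the two runs side by side, the sketch below does two bits of arithmetic on the numbers reported above: it cross-checks each BW average against the reported message rate (message size times messages per second times 8 bits), and it expresses the CUDA result as a percentage of the non-CUDA result. The helper names are hypothetical, and the inputs are copied from the two result tables in this blog.

```shell
# Convert a message size (bytes) and message rate (Mpps) to Gb/sec.
gbps_from_msgrate() {
  awk -v bytes="$1" -v mpps="$2" 'BEGIN { printf "%.2f\n", bytes * mpps * 8 / 1000 }'
}

# Express one bandwidth figure as a percentage of another.
percent_of() {
  awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f\n", part / whole * 100 }'
}

gbps_from_msgrate 65536 0.176539   # run without --use_cuda
gbps_from_msgrate 65536 0.133122   # run with --use_cuda=0
percent_of 69.79 92.56             # CUDA run relative to the non-CUDA run
```

The first two values should agree with the BW average columns reported by perftest. The ratio is only a description of this particular pair of runs; what gap to expect depends on the hardware and topology, which this blog does not investigate.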

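Once both tests complete, we have confirmed that RDMA works with and without the GPU (in this run, 92.56 Gb/sec average without CUDA and 69.79 Gb/sec average with --use_cuda), and we can move on to our real-world workload.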

Hopefully this blog was useful in showing RDMA with CUDA testing on an OpenShift environment.

RDMA+CUDA with NVIDIA on OpenShift