SCHMAUSTECH: Kubernetes

Friday, July 11, 2025

NVIDIA GPU Direct Storage on OpenShift

GPU Direct Storage enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU. Using this direct path can relieve system bandwidth bottlenecks and decrease the latency and utilization load on the CPU. GPU Direct Storage can be used with NVMe or even NFS on a Netapp filer, the latter of which this blog covers.

Workflow

This blog is laid out in the following sections, each building on the previous one, to reach the goal of working GPU Direct Storage over NFS.

Assumptions
Considerations
Architecture
SRIOV Operator Configuration
Netapp VServer Setup
Netapp Trident CSI Operator Configuration
NVIDIA Network Operator Configuration
NVIDIA GPU Operator Configuration
GDS Cuda Workload Container

Assumptions

This document assumes that we have already deployed an OpenShift cluster and installed the operators required for GPU Direct Storage: Node Feature Discovery (which should also be configured), the NVIDIA Network Operator (base installation, no NicClusterPolicy yet), the NVIDIA GPU Operator (no ClusterPolicy yet), the SRIOV Operator (no SRIOV policies or instances) and the Trident CSI Operator (no orchestrators or backends configured yet).

Considerations

If any of the NVMe devices in the system are used by the operating system or by other services (machine configs for LVM or other customized access), the nvme kernel modules will not unload properly, even with the workaround described in this document. Any use of GDS requires that the NVMe drives are not in use during the deployment of the Network Operator, so that the Network Operator can unload the in-tree drivers and load NVIDIA's out-of-tree drivers in their place.
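One rough way to spot NVMe devices that are in use before starting is to look for mounted NVMe block devices on each worker. The sketch below is only an assumption about how one might check; the helper name nvme_mounts is hypothetical, and it only catches direct mounts, not LVM or other holders.

```shell
# Filter `lsblk -rno NAME,MOUNTPOINT` output (read from stdin) down to
# NVMe devices or partitions that currently have a mountpoint.
# nvme_mounts is a hypothetical helper name, not part of any tool.
nvme_mounts() {
    awk '$1 ~ /^nvme/ && NF > 1'
}

# Usage from a host that can ssh to the worker as the core user:
#   ssh core@<worker-node> lsblk -rno NAME,MOUNTPOINT | nvme_mounts
```

Any line printed is a hint that the nvme modules will probably refuse to unload on that node.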

Architecture

Below is a diagram of how the environment was architected from a networking perspective.

SRIOV Operator Configuration

For GPU Direct Storage over NFS to make sense from a performance perspective we need to use SRIOV. Assuming the SRIOV Operator is installed, the first step is to generate a basic SriovOperatorConfig custom resource file.

$ cat <<EOF > sriov-operator-config.yaml 
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  enableInjector: true
  enableOperatorWebhook: true
  logLevel: 2
EOF

Next we create the SriovOperatorConfig on the cluster.

$ oc create -f sriov-operator-config.yaml 
sriovoperatorconfig.sriovnetwork.openshift.io/default created

Now one key step here is to patch the SriovOperatorConfig so that it is aware of the NVIDIA Network Operator.

$ oc patch sriovoperatorconfig default   --type=merge -n openshift-sriov-network-operator   --patch '{ "spec": { "configDaemonNodeSelector": { "network.nvidia.com/operator.mofed.wait": "false", "node-role.kubernetes.io/worker": "", "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true" } } }'
sriovoperatorconfig.sriovnetwork.openshift.io/default patched

Now we can move onto generating a SriovNetworkNodePolicy which will define the interface that we want to have VFs. In the case of multiple interfaces we would want to create multiple SriovNetworkNodePolicy files. The example below demonstrates how to configure an interface with an MTU of 9000 and generate 8 VFs.

$ cat <<EOF > sriov-network-node-policy.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-legacy-policy
  namespace:  openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 9000
  nicSelector:
    vendor: "15b3"
    pfNames: ["enp55s0np0#0-7"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriovlegacy
EOF

With the SriovNetworkNodePolicy generated we can create it on the cluster which will cause the worker nodes where it is applied to reboot.

$ oc create -f sriov-network-node-policy.yaml 
sriovnetworknodepolicy.sriovnetwork.openshift.io/sriov-legacy-policy created

Once the node has rebooted we can optionally open a debug pod on the worker nodes and use ip link to confirm the VF interfaces were created. If we are ready to move forward we can next generate the SriovNetwork for the resource we created in the SriovNetworkNodePolicy. Again, if we have multiple SriovNetworkNodePolicy files we will also have multiple SriovNetwork files. These define the network space for the VF interfaces. I should note that these networks need to have access to the Netapp data LIF as well in order for RDMA to function. In my example below I excluded IP addresses in the 192.168.10.100-110 range because my Netapp data LIF will have IP addresses in that space.

$ cat <<EOF > sriov-network.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-network
  namespace:  openshift-sriov-network-operator
spec:
  vlan: 0
  networkNamespace: "default"
  resourceName: "sriovlegacy"
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.10.0/24",
      "exclude": [
       "192.168.10.100/30",
       "192.168.10.110/32"
      ]
    }
EOF
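Before creating the resource, it can help to expand the exclude entries and see exactly which addresses they cover. The following is a small sketch using plain shell arithmetic; the function names cidr_range and int_to_ip are hypothetical helpers, and 64-bit shell arithmetic is assumed.

```shell
# Convert a 32-bit integer to dotted-quad notation.
int_to_ip() {
    printf '%d.%d.%d.%d' \
        $(( ($1 >> 24) & 255 )) $(( ($1 >> 16) & 255 )) \
        $(( ($1 >> 8) & 255 ))  $(( $1 & 255 ))
}

# Print the first and last IPv4 address covered by a CIDR block.
# Runs in a subshell so its variables do not leak into the caller.
cidr_range() (
    ip=${1%/*}
    prefix=${1#*/}
    IFS=. read -r a b c d <<EOF
$ip
EOF
    addr=$(( ($a << 24) | ($b << 16) | ($c << 8) | $d ))
    mask=$(( (4294967295 << (32 - $prefix)) & 4294967295 ))
    first=$(( $addr & $mask ))
    last=$(( $first | (4294967295 - $mask) ))
    printf '%s - %s\n' "$(int_to_ip "$first")" "$(int_to_ip "$last")"
)

# Expand the two exclude entries from the SriovNetwork above.
for block in 192.168.10.100/30 192.168.10.110/32; do
    cidr_range "$block"
done
```

For these two entries the output is 192.168.10.100 - 192.168.10.103 and 192.168.10.110 - 192.168.10.110, so .104 through .109 stay allocatable by whereabouts. If the whole .100-.110 span should be reserved, further exclude blocks would be needed.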

Now we can create the SriovNetwork custom resource on the cluster.

$ oc create -f sriov-network.yaml
sriovnetwork.sriovnetwork.openshift.io/sriov-network created
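The optional ip link check mentioned earlier can be scripted. The sketch below counts VF entries in the ip link output for the physical function; count_vfs is a hypothetical helper name, and the node and interface names are the ones used in this post.

```shell
# Count VF entries in `ip link show <pf>` output read from stdin.
# VF lines look like "    vf 0     link/ether ...".
count_vfs() {
    grep -cE '^[[:space:]]+vf [0-9]+[[:space:]]'
}

# Usage against a worker node via a debug pod:
#   oc debug node/nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com -- \
#       chroot /host ip link show enp55s0np0 | count_vfs
```

With the policy above, the expected count is 8, matching numVfs.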

At this point we have configured everything we need for SRIOV and can move onto the next section of the documentation.

Netapp VServer Setup

This section covers a few items of importance from the Netapp vserver perspective. It does not aim to be a comprehensive guide on how to set up a Netapp MetroCluster or the vservers within it. In our example environment we had a vserver created, and that vserver has two logical interfaces: management and data. With the management interface we can access the vserver and look at a few things. Depending on the environment this may or may not be accessible to the OpenShift administrator. In my case the storage team gave me access. To get on the vserver we can ssh to the vserver IP address or FQDN if it exists in DNS.

$ ssh trident@10.6.136.110
(trident@10.6.136.110) Password:

Last login time: 5/7/2025 19:31:11

Once we are logged in I want to confirm that NFS 4 is enabled along with RDMA by using vserver nfs show.

ntap-rdu3-nv01-nvidia::> vserver nfs show

Vserver: ntap-rdu3-nv01-nvidia

        General Access:  true
                    v3:  enabled
                  v4.0:  enabled
                  v4.1:  enabled
                   UDP:  enabled
                   TCP:  enabled
                  RDMA:  enabled
  Default Windows User:  -
 Default Windows Group:  -

The above output looks good for my needs when doing GPU Direct Storage. Another item we can check is the export-policies with vserver export-policy show.

ntap-rdu3-nv01-nvidia::> vserver export-policy show
Vserver          Policy Name
---------------  -------------------
ntap-rdu3-nv01-nvidia  
                 default
ntap-rdu3-nv01-nvidia  
                 trident-8d6b2406-551a-416b-bcce-22626ed60242
2 entries were displayed.

And finally I wanted to confirm that my data interfaces connected to the NVIDIA high speed switch were indeed operating with jumbo frames. I can see that with the network port show command. Because this is a MetroCluster pair setup we can see the interfaces on both nodes are set appropriately.

ntap-rdu3-nv01-nvidia::> network port show  

Node: ntap-rdu3-nv01-a
                                                  Speed(Mbps) Health
Port      Broadcast Domain Link MTU  Admin/Oper  Status
--------- ------------ ---------------- ---- ---- ----------- --------
e0M       Management       up   1500  auto/1000  healthy
e1b       -                down 1500  auto/-     -
e2a       nvidia           up   9000  auto/200000 
                                                 healthy
e2b       -                up   1500  auto/100000 
                                                 healthy
e2b-710   nfs              up   1500     -/-     healthy
e6a       -                down 1500  auto/-     -
e6b       -                down 1500  auto/-     -
e7b       -                down 1500  auto/-     -
e8a       -                down 1500  auto/-     -
e8b       -                down 1500  auto/-     -

Node: ntap-rdu3-nv01-b
                                                  Speed(Mbps) Health
Port      Broadcast Domain Link MTU  Admin/Oper  Status
--------- ------------ ---------------- ---- ---- ----------- --------
e0M       Management       up   1500  auto/1000  healthy
e1b       -                down 1500  auto/-     -
e2a       nvidia           up   9000  auto/200000 
                                                 healthy
e2b       -                up   1500  auto/100000 
                                                 healthy
e2b-710   nfs              up   1500     -/-     healthy
e6a       -                down 1500  auto/-     -
e6b       -                down 1500  auto/-     -
e7b       -                down 1500  auto/-     -
e8a       -                down 1500  auto/-     -
e8b       -                down 1500  auto/-     -
20 entries were displayed.
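The port output confirms the MTU on the filer side. An end-to-end check from a Linux host on the same network is a don't-fragment ping sized to fill one jumbo frame. This is only a sketch: the helper names are hypothetical, it assumes the iputils ping with its -M do option, and it assumes an IPv4 header without options (20 bytes) plus the 8 byte ICMP header.

```shell
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
max_icmp_payload() {
    echo $(( $1 - 28 ))
}

# Send three don't-fragment pings sized for the given MTU (default 9000).
check_jumbo() {
    ping -M do -c 3 -s "$(max_icmp_payload "${2:-9000}")" "$1"
}

# Usage from a host or pod attached to the storage network, using the
# data LIF address from this post:
#   check_jumbo 192.168.10.101
```

For an MTU of 9000 the payload works out to 8972 bytes. If any hop on the path has a smaller MTU the ping fails rather than fragmenting.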

At this point we can exit out of the vserver and move onto configuring the Netapp Trident CSI operator.

Netapp Trident CSI Operator Configuration

Trident is an open-source and fully supported storage orchestrator for containers and Kubernetes distributions, including Red Hat OpenShift. Trident works with the entire NetApp storage portfolio, including the NetApp ONTAP and Element storage systems, and it also supports NFS and iSCSI connections. Trident accelerates the DevOps workflow by allowing end users to provision and manage storage from their NetApp storage systems without requiring intervention from a storage administrator.

We have made the assumption that the Trident Operator and the default Trident Orchestrator have already been deployed. Our next step is to configure the secret with the Netapp vserver credentials so that Trident knows which username and password to connect with.

$ cat <<EOF > netapp-phy-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: netapp-phy-secret
  namespace: trident
type: Opaque
stringData:
  username: vserver-user
  password: vserver-password
EOF

Once we have our custom resource file generated we can create it on the cluster.

$ oc create -f netapp-phy-secret.yaml
secret/netapp-phy-secret created

Next we need to configure the TridentBackendConfig so that Trident knows how to communicate with the Netapp from both a management and data perspective. Note the credentials we created are referenced here.

$ cat <<EOF > netapp-phy-tridentbackendconfig.yaml
apiVersion: trident.netapp.io/v1
kind: TridentBackendConfig
metadata:
  name: netapp-phy-nfs-backend
  namespace: trident
spec:
  version: 1
  storageDriverName: ontap-nas-flexgroup
  managementLIF: 10.6.136.110
  dataLIF: 192.168.10.101
  backendName: phy-nfs-backend
  svm: ntap-rdu3-nv01-nvidia
  autoExportPolicy: true
  credentials:
    name: netapp-phy-secret
EOF

With the custom resource file generated we can create it on the cluster.

$ oc create -f netapp-phy-tridentbackendconfig.yaml
tridentbackendconfig.trident.netapp.io/netapp-phy-nfs-backend created

We can validate the backend is there with the following check.

$ oc get tridentbackend -n trident
NAME        BACKEND           BACKEND UUID
tbe-n59xq   phy-nfs-backend   8d6b2406-551a-416b-bcce-22626ed60242

We can also describe the backend.

$ oc describe tridentbackend tbe-n59xq -n trident
Name:          tbe-n59xq
Namespace:     trident
Labels:        <none>
Annotations:   <none>
API Version:   trident.netapp.io/v1
Backend Name:  phy-nfs-backend
Backend UUID:  8d6b2406-551a-416b-bcce-22626ed60242
Config:
  ontap_config:
    Aggregate:  
    Auto Export CID Rs:
      0.0.0.0/0
      ::/0
    Auto Export Policy:  true
    Backend Name:        phy-nfs-backend
    Backend Pools:
      eyJzdm1VVUlEIjoiNjE2OTg1YTYtMjlkZi0xMWYwLWI4YzctZDAzOWVhYzA0MDUzIn0=
    Chap Initiator Secret:         
    Chap Target Initiator Secret:  
    Chap Target Username:          
    Chap Username:                 
    Client Certificate:            
    Client Private Key:            
    Clone Split Delay:             10
    Credentials:
      Name:             netapp-phy-secret
    Data LIF:           192.168.10.101
    Debug:              false
    Debug Trace Flags:  <nil>
    Defaults:
      LUKS Encryption:                     false
      Adaptive Qos Policy:                 
      Encryption:                          
      Export Policy:                       <automatic>
      File System Type:                    ext4
      Format Options:                      
      Mirroring:                           false
      Name Template:                       
      Qos Policy:                          
      Security Style:                      unix
      Size:                                1G
      Skip Recovery Queue:                 false
      Snapshot Dir:                        false
      Snapshot Policy:                     none
      Snapshot Reserve:                    
      Space Allocation:                    true
      Space Reserve:                       none
      Split On Clone:                      false
      Tiering Policy:                      
      Unix Permissions:                    ---rwxrwxrwx
    Deny New Volume Pools:                 false
    Disable Delete:                        false
    Empty Flexvol Deferred Delete Period:  
    Flags:
      Disaggregated:  false
      Personality:    Unified
      San Optimized:  false
    Flexgroup Aggregate List:
    Igroup Name:                  
    Labels:                       <nil>
    Limit Aggregate Usage:        
    Limit Volume Pool Size:       
    Limit Volume Size:            
    Luns Per Flexvol:             
    Management LIF:               10.6.136.110
    Nas Type:                     nfs
    Nfs Mount Options:            
    Password:                     secret:netapp-phy-secret
    Qtree Prune Flexvols Period:  
    Qtree Quota Resize Period:    
    Qtrees Per Flexvol:           
    Region:                       
    Replication Policy:           
    Replication Schedule:         
    San Type:                     iscsi
    Smb Share:                    
    Storage:                      <nil>
    Storage Driver Name:          ontap-nas-flexgroup
    Storage Prefix:
    Supported Topologies:    <nil>
    Svm:                     ntap-rdu3-nv01-nvidia
    Trusted CA Certificate:  
    Usage Heartbeat:         
    Use CHAP:                false
    Use REST:                <nil>
    User State:              
    Username:                secret:netapp-phy-secret
    Version:                 1
    Zone:                    
Config Ref:                  9e1ff3f2-8a2d-4efa-859c-712b920d269b
Kind:                        TridentBackend
Metadata:
  Creation Timestamp:  2025-05-07T19:31:56Z
  Finalizers:
    trident.netapp.io
  Generate Name:     tbe-
  Generation:        1
  Resource Version:  38713504
  UID:               6536970f-b10e-4e04-8a37-8da56deaf69e
Online:              true
State:               online
User State:          normal
Version:             1
Events:              <none>

We can also use the tridentctl command to validate the backend and confirm it's online.

$ ./trident-installer/tridentctl get backend -n trident
+-----------------+---------------------+--------------------------------------+--------+------------+---------+
|      NAME       |   STORAGE DRIVER    |                 UUID                 | STATE  | USER-STATE | VOLUMES |
+-----------------+---------------------+--------------------------------------+--------+------------+---------+
| phy-nfs-backend | ontap-nas-flexgroup | 8d6b2406-551a-416b-bcce-22626ed60242 | online | normal     |       0 |
+-----------------+---------------------+--------------------------------------+--------+------------+---------+
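If this check needs to be scripted, the STATE column can be pulled out of the table output. The sketch below parses the table format shown above; backend_state is a hypothetical helper name, and it assumes the column order stays as printed here.

```shell
# Print the STATE column for the named backend from the table that
# `tridentctl get backend` writes (read from stdin).
backend_state() {
    awk -F'|' -v name="$1" '
        { gsub(/ /, "", $2); gsub(/ /, "", $5) }   # trim NAME and STATE cells
        $2 == name { print $5 }
    '
}

# Usage:
#   ./trident-installer/tridentctl get backend -n trident | backend_state phy-nfs-backend
```

For the table above this prints online.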

With the Trident backend configured we can move onto generating a storageclass resource file. Note that while this looks just like a standard Trident NFS storageclass, the proto=rdma mount option makes it special.

$ cat <<EOF > netapp-phy-rdma-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: netapp-phy-nfs
provisioner: csi.trident.netapp.io
parameters:
  backendType: "ontap-nas-flexgroup"
mountOptions:
  - vers=4.1
  - proto=rdma
  - max_connect=16
  - rsize=262144
  - wsize=262144
  - write=eager
EOF

Once we have generated the custom resource file we can create it on the cluster.

$ oc create -f netapp-phy-rdma-storageclass.yaml
storageclass.storage.k8s.io/netapp-phy-nfs created

We can validate the storageclass by looking at the storage classes available.

$ oc get sc
NAME             PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
netapp-phy-nfs   csi.trident.netapp.io   Delete          Immediate              false                  4s

Now with the storageclass configured we can generate a persistent volume claim resource file.

$ cat <<EOF > netapp-phy-pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-netapp-phy-test
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 850Gi
  storageClassName: netapp-phy-nfs
EOF

We can take the persistent volume claim resource and create it on the cluster.

$ oc create -f netapp-phy-pvc.yaml
persistentvolumeclaim/pvc-netapp-phy-test created

We can validate the persistent volume claim by looking at the pvc.

$ oc get pvc
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS            VOLUMEATTRIBUTESCLASS   AGE
pvc-netapp-phy-test          Bound    pvc-ae477c5c-cf10-4bc0-bb71-39d214a237f0   850Gi      RWO            netapp-phy-nfs          <unset>                 45s
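The workload pod later in this post also mounts a second claim, pvc-netapp-phy-nordma-test, on /nfsslow, and the post does not show how it was created. A minimal sketch of what it might look like follows, assuming a second storageclass that uses plain TCP instead of RDMA. The storageclass name, the file names and the size are assumptions, not taken from the original environment.

```shell
# Hypothetical non-RDMA storageclass for comparison against the RDMA one.
cat <<'EOF' > netapp-phy-nordma-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: netapp-phy-nfs-nordma   # assumed name, not from the original post
provisioner: csi.trident.netapp.io
parameters:
  backendType: "ontap-nas-flexgroup"
mountOptions:
  - vers=4.1
  - proto=tcp                   # plain TCP instead of RDMA
EOF

# Claim with the name the workload pod references.
cat <<'EOF' > netapp-phy-nordma-pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-netapp-phy-nordma-test
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 850Gi            # assumed; sized to match the RDMA claim
  storageClassName: netapp-phy-nfs-nordma
EOF
```

Both files would then be created with oc create -f in the same way as the RDMA storageclass and claim.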

At this point we have completed the setup of the Trident storage side in preparation for GPU Direct Storage.

NVIDIA Network Operator Configuration

We assume the Network Operator has already been installed on the cluster but the NicClusterPolicy still needs to be created. The following NicClusterPolicy example provides the configuration needed to ensure RDMA is properly loaded for NFS. The key option in this policy is the ENABLE_NFSRDMA variable, which must be set to true. The policy also sets ENTRYPOINT_DEBUG to true for more verbose logging. An rdmaSharedDevice can optionally be configured as well, though it is not part of the example shown here.

$ cat <<EOF > network-sriovleg-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.01-0.6.0.0-0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
    - name: ENABLE_NFSRDMA
      value: "true"
    - name: ENTRYPOINT_DEBUG
      value: 'true'
EOF

Before creating the NicClusterPolicy on the cluster we need to prepare a script that works around an issue with GPU Direct Storage in the NVIDIA Network Operator. When run right after creating the NicClusterPolicy, this script determines which nodes have mofed pods running on them and then, for each node in that list, connects over ssh as the core user and unloads the following modules: nvme, nvme_tcp, nvme_fabrics, nvme_core. Unloading the modules while the mofed container is busy building the doca drivers avoids a failure to load when the mofed container goes to install the compiled doca drivers. One might ask what NVMe has to do with NFS. Unfortunately GPU Direct Storage enablement covers both, so we have to work around this issue.

$ cat <<'EOF' > nvme-fixer.sh
#!/bin/bash

### Set array of modules to be unloaded
declare -a modarr=("nvme" "nvme_tcp" "nvme_fabrics" "nvme_core")

### Determine which hosts have mofed container running on them
declare -a hostarr=($(oc get pods -n nvidia-network-operator -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName --no-headers | grep mofed | awk '{print $2}'))

### Iterate through modules on each host and unload them
for host in "${hostarr[@]}"
do
    echo "Unloading nvme dependencies on $host..."
    for module in "${modarr[@]}"
    do
       echo "Unloading module $module..."
       ssh core@"$host" sudo rmmod "$module"
    done
done
EOF

Set the execute bit on the file.

$ chmod +x nvme-fixer.sh

Now we are ready to create the NicClusterPolicy on the cluster and follow it up by running the nvme-fixer.sh script. Any rmmod "is not currently loaded" errors can safely be ignored, as they mean the module was not loaded to start with. In the example below we had two worker nodes with mofed pods running on them, so the script went ahead and unloaded the nvme modules.

$ oc create -f network-sriovleg-nic-cluster-policy.yaml 
nicclusterpolicy.mellanox.com/nic-cluster-policy created

$ ./nvme-fixer.sh 
Unloading nvme dependencies on nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com...
Unloading module nvme...
Unloading module nvme_tcp...
rmmod: ERROR: Module nvme_tcp is not currently loaded
Unloading module nvme_fabrics...
rmmod: ERROR: Module nvme_fabrics is not currently loaded
Unloading module nvme_core...
Unloading nvme dependencies on nvd-srv-29.nvidia.eng.rdu2.dc.redhat.com...
Unloading module nvme...
Unloading module nvme_tcp...
Unloading module nvme_fabrics...
Unloading module nvme_core...

Now we wait for the mofed pod to finish compiling and installing the GPU Direct Storage modules. We will know it's complete when the pods are in a Running state like below:

$ oc get pods -n nvidia-network-operator
NAME                                                          READY   STATUS    RESTARTS        AGE
mofed-rhcos4.16-56c9d799bf-ds-bvhmj                           2/2     Running   0               20h
mofed-rhcos4.16-56c9d799bf-ds-jdzxj                           2/2     Running   0               20h
nvidia-network-operator-controller-manager-85b78c49f6-9lchx   1/1     Running   4 (3h26m ago)   3d14h
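Since the driver build can take a while, the wait can be scripted by polling the pod list. The following is a sketch only: mofed_ready is a hypothetical helper name, and it assumes the mofed pod names keep the mofed prefix seen in the output above.

```shell
# Succeed only if at least one mofed pod is listed and every mofed pod is
# Running with all of its containers ready (READY column "n/n").
# Reads `oc get pods --no-headers` output from stdin.
mofed_ready() {
    awk '
        $1 ~ /^mofed/ {
            n++
            split($2, r, "/")
            if ($3 != "Running" || r[1] != r[2]) bad = 1
        }
        END { exit (n > 0 && !bad) ? 0 : 1 }
    '
}

# Usage: poll every 30 seconds until the mofed pods are ready.
#   until oc get pods -n nvidia-network-operator --no-headers | mofed_ready; do
#       sleep 30
#   done
```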

This completes the NVIDIA Network Operator portion of the configuration for GPU Direct Storage.

NVIDIA GPU Operator Configuration

Now that the NicClusterPolicy is defined and the proper nvme modules have been loaded we can move onto configuring our GPU ClusterPolicy. The below example is a policy that will enable GPU Direct Storage on the worker nodes that have a proper NVIDIA GPU.

$ cat <<EOF > gpu-cluster-policy.yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      configMapName: ''
      nlsEnabled: true
    enabled: true
    kernelModuleType: open
    certConfig:
      name: ''
    useNvidiaDriverCRD: false
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
    mps:
      root: /run/nvidia/mps
  gdrcopy:
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: true
    image: nvidia-fs
    repository: nvcr.io/nvidia/cloud-native
    version: 2.25.7
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
EOF

Now let's create the policy on the cluster.

$ oc create -f gpu-cluster-policy.yaml 
clusterpolicy.nvidia.com/gpu-cluster-policy created

Once the policy is created let's validate the pods are running before we move onto the next step.

$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS      RESTARTS       AGE
gpu-feature-discovery-nttht                           1/1     Running     0              20h
gpu-feature-discovery-r4ktv                           1/1     Running     0              20h
gpu-operator-7d7f694bfb-957mv                         1/1     Running     0              20h
nvidia-container-toolkit-daemonset-h96t6              1/1     Running     0              20h
nvidia-container-toolkit-daemonset-hqtrl              1/1     Running     0              20h
nvidia-cuda-validator-66ml7                           0/1     Completed   0              20h
nvidia-dcgm-exporter-hbk4r                            1/1     Running     0              20h
nvidia-dcgm-exporter-pgh4q                            1/1     Running     0              20h
nvidia-dcgm-nttds                                     1/1     Running     0              20h
nvidia-dcgm-zb4fl                                     1/1     Running     0              20h
nvidia-device-plugin-daemonset-d99md                  1/1     Running     0              20h
nvidia-device-plugin-daemonset-w7tc4                  1/1     Running     0              20h
nvidia-driver-daemonset-416.94.202504151456-0-8bdl5   4/4     Running     26 (20h ago)   2d2h
nvidia-driver-daemonset-416.94.202504151456-0-j8gps   4/4     Running     20 (20h ago)   2d2h
nvidia-node-status-exporter-b22hk                     1/1     Running     4              2d2h
nvidia-node-status-exporter-lwqhb                     1/1     Running     3              2d2h
nvidia-operator-validator-cvqn5                       1/1     Running     0              20h
nvidia-operator-validator-zxrpb                       1/1     Running     0              20h

With the NVIDIA GPU Operator pods running we can rsh into the driver daemonset pods and confirm GDS is enabled by running lsmod and displaying the /proc/driver/nvidia-fs/stats file.

$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-416.94.202504151456-0-8bdl5
sh-4.4# lsmod|grep nvidia
nvidia_fs             327680  0
nvidia_modeset       1720320  0
video                  73728  1 nvidia_modeset
nvidia_uvm           4087808  12
nvidia              11665408  36 nvidia_uvm,nvidia_fs,gdrdrv,nvidia_modeset
drm                   741376  5 drm_kms_helper,drm_shmem_helper,nvidia,mgag200

sh-4.4# cat /proc/driver/nvidia-fs/stats
GDS Version: 1.10.0.4 
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.20.5)
Mellanox PeerDirect Supported: False
IO stats: Disabled, peer IO stats: Disabled
Logging level: info

Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads                : err=0 io_state_err=0
Sparse Reads                : n=0 io=0 holes=0 pages=0 
Writes                : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap                : n=0 ok=0 err=0 munmap=0
Bar1-map            : n=0 ok=0 err=0 free=0 callbacks=0 active=0 delay-frees=0
Error                : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops                : Read=0 Write=0 BatchIO=0

If everything looks good we can move onto an additional step to confirm GDS is ready for workload consumption.

GDS Cuda Workload Container

Once the GPU Direct Storage drivers are loaded we can use one more tool to check and confirm GDS capability. This involves building a container that contains the CUDA packages and then running it on a node.

Now let's generate a service account resource file to use in the default namespace.

$ cat <<EOF > nvidiatools-serviceaccount.yaml 
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nvidiatools
  namespace: default
EOF

Next we can create it on our cluster.

$ oc create -f nvidiatools-serviceaccount.yaml 
serviceaccount/nvidiatools created

Finally, with the service account created we can add privileges to it.

$ oc -n default adm policy add-scc-to-user privileged -z nvidiatools
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "nvidiatools"

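With the service account defined we can generate the pod yaml and then create it on the cluster. The following pod yaml defines the workload: it attaches the sriov-network, requests one GPU and one sriovlegacy VF, and mounts the RDMA claim on /nfsfast and a non-RDMA claim on /nfsslow.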

$ cat <<EOF > nvidiatools-30-workload.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: nvidiatools-30-workload
  namespace: default
  annotations:
    # JSON list is the canonical form; adjust if your NAD lives in another namespace
    k8s.v1.cni.cncf.io/networks: '[{ "name": "sriov-network" }]'
spec:
  serviceAccountName: nvidiatools
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com
  volumes:
    - name: rdma-pv-storage
      persistentVolumeClaim:
        claimName: pvc-netapp-phy-test
    - name: nordma-pv-storage
      persistentVolumeClaim:
        claimName: pvc-netapp-phy-nordma-test
  containers:
    - name: nvidiatools-30-workload
      image: quay.io/redhat_emp1/ecosys-nvidia/nvidia-tools:0.0.3
      imagePullPolicy: IfNotPresent
      securityContext:
        privileged: true
        capabilities:
          add: ["IPC_LOCK"]
      resources:
        limits:
          nvidia.com/gpu: 1
          openshift.io/sriovlegacy: 1
        requests:
          nvidia.com/gpu: 1
          openshift.io/sriovlegacy: 1
      volumeMounts:
        - name: rdma-pv-storage
          mountPath: /nfsfast
        - name: nordma-pv-storage
          mountPath: /nfsslow
EOF

$ oc create -f nvidiatools-30-workload.yaml 
nvidiatools-30-workload created

$ oc get pods
NAME                 READY   STATUS    RESTARTS   AGE
nvidiatools-30-workload   1/1     Running   0          3s

Once the pod is up and running we can rsh into the pod and run the gdscheck tool to confirm capabilities and configuration of GPU Direct Storage.

$ oc rsh nvidiatools-30-workload

sh-5.1# /usr/local/cuda/gds/tools/gdscheck -p
 GDS release version: 1.13.1.3
 nvidia_fs version:  2.20 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe P2PDMA        : Unsupported
 NVMe               : Supported
 NVMeOF             : Supported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Supported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_pci_p2pdma : false
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 64 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 fs.gpfs.gds_async_support: true
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA L40S bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed:  12080
 Platform: PowerEdge R760xa, Arch: x86_64(Linux 5.14.0-427.65.1.el9_4.x86_64)
 Platform verification succeeded
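The gdscheck output is long, and the line that matters for this setup is the NFS entry in the DRIVER CONFIGURATION block. As a minimal sketch, assuming the output has been captured as text (for example from `oc rsh`), the following Python pulls that block into a dictionary. The function name `parse_driver_config` is hypothetical and the sample is an abbreviated copy of the output above.

```python
def parse_driver_config(gdscheck_output: str) -> dict[str, str]:
    """Return {driver name: status} from the DRIVER CONFIGURATION block."""
    drivers: dict[str, str] = {}
    in_block = False
    for raw in gdscheck_output.splitlines():
        line = raw.strip()
        if line == "DRIVER CONFIGURATION:":
            in_block = True
            continue
        if not in_block or ":" not in line:
            continue  # outside the block, or a "=====" ruler line
        key, _, value = line.partition(":")
        if not value.strip():
            break  # reached the next section header, e.g. "CUFILE CONFIGURATION:"
        drivers[key.strip()] = value.strip()
    return drivers


# Abbreviated copy of the gdscheck output shown above.
sample = """\
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Supported
 NFS                : Supported
 Userspace RDMA     : Unsupported
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_poll_mode : false
"""

drivers = parse_driver_config(sample)
print("NFS over GDS:", drivers["NFS"])
```

A check like this could gate a benchmark job so that it only starts when `drivers["NFS"]` reports `Supported`.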

Now let's confirm that our GPU Direct NFS volume is mounted. Notice that the proto option in the output is rdma.

sh-5.1# mount|grep nfs
192.168.10.101:/trident_pvc_ae477c5c_cf10_4bc0_bb71_39d214a237f0 on /mnt type nfs4 (rw,relatime,vers=4.1,rsize=262144,wsize=262144,namlen=255,hard,proto=rdma,max_connect=16,port=20049,timeo=600,retrans=2,sec=sys,clientaddr=192.168.10.30,local_lock=none,write=eager,addr=192.168.10.101)
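Rather than reading the option string by eye, the same check can be scripted. This is a small sketch, assuming a single line of `mount` output is available as a string. The helper name `mount_options` is hypothetical, and the sample line is a shortened version of the one above with some options left out for readability.

```python
def mount_options(mount_line: str) -> dict[str, str]:
    """Split the parenthesised option list of a `mount` output line into a dict."""
    start = mount_line.rindex("(") + 1
    end = mount_line.rindex(")")
    options: dict[str, str] = {}
    for item in mount_line[start:end].split(","):
        key, _, value = item.partition("=")
        options[key] = value  # flags without a value (rw, hard) map to ""
    return options


# Shortened version of the mount line above (several options omitted).
line = (
    "192.168.10.101:/trident_pvc_ae477c5c_cf10_4bc0_bb71_39d214a237f0 "
    "on /mnt type nfs4 "
    "(rw,relatime,vers=4.1,hard,proto=rdma,max_connect=16,port=20049,sec=sys)"
)

opts = mount_options(line)
print("transport:", opts["proto"], "port:", opts["port"])
```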

Next we can use gdsio to run some benchmarks across the GPU Direct NFS mount. Before we run the benchmarks, let's familiarize ourselves with all of the gdsio switches and what they mean.

sh-5.1# /usr/local/cuda-12.8/gds/tools/gdsio -h
gdsio version :1.12
Usage [using config file]: gdsio rw-sample.gdsio
Usage [using cmd line options]:/usr/local/cuda-12.8/gds/tools/gdsio 
         -f <file name>
         -D <directory name>
         -d <gpu_index (refer nvidia-smi)>
         -n <numa node>
         -m <memory type(0 - (cudaMalloc), 1 - (cuMem), 2 - (cudaMallocHost), 3 - (malloc) 4 - (mmap))>
         -w <number of threads for a job>
         -s <file size(K|M|G)>
         -o <start offset(K|M|G)>
         -i <io_size(K|M|G)> <min_size:max_size:step_size>
         -p <enable nvlinks> 
         -b <skip bufregister> 
         -V <verify IO>
         -x <xfer_type> [0(GPU_DIRECT), 1(CPU_ONLY), 2(CPU_GPU), 3(CPU_ASYNC_GPU), 4(CPU_CACHED_GPU), 5(GPU_DIRECT_ASYNC), 6(GPU_BATCH), 7(GPU_BATCH_STREAM)]
         -B <batch size>
         -I <(read) 0|(write)1| (randread) 2| (randwrite) 3>
         -T <duration in seconds>
         -k <random_seed> (number e.g. 3456) to be used with random read/write> 
         -U <use unaligned(4K) random offsets>
         -R <fill io buffer with random data>
         -F <refill io buffer with random data during each write>
         -a <alignment size in case of random IO>
         -M <mixed_rd_wr_percentage in case of regular batch mode>
         -P <rdma url>
         -J <per job statistics>

xfer_type:
0 - Storage->GPU (GDS)
1 - Storage->CPU
2 - Storage->CPU->GPU
3 - Storage->CPU->GPU_ASYNC
4 - Storage->PAGE_CACHE->CPU->GPU
5 - Storage->GPU_ASYNC
6 - Storage->GPU_BATCH
7 - Storage->GPU_BATCH_STREAM

Note:
read test (-I 0) with verify option (-V) should be used with files written (-I 1) with -V option
read test (-I 2) with verify option (-V) should be used with files written (-I 3) with -V option, using same random seed (-k),
same number of threads(-w), offset(-o), and data size(-s)
write test (-I 1/3) with verify option (-V) will perform writes followed by read
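To keep the numeric codes for -x and -I straight, it can help to build the command line from named values. The sketch below only uses switches listed in the help output above. The function name `gdsio_command`, the lookup tables, and the default values are illustrative assumptions, not part of gdsio itself.

```python
import shlex

# Path taken from the help output above.
GDSIO = "/usr/local/cuda-12.8/gds/tools/gdsio"

# Subsets of the -x and -I codes from the help text.
XFER_TYPES = {"GPU_DIRECT": 0, "CPU_ONLY": 1, "CPU_GPU": 2}
IO_TYPES = {"read": 0, "write": 1, "randread": 2, "randwrite": 3}


def gdsio_command(
    directory: str,
    io_type: str,
    io_size: str,
    file_size: str,
    threads: int = 32,       # illustrative default
    gpu_index: int = 0,      # illustrative default
    xfer_type: str = "GPU_DIRECT",
    duration: int = 120,     # seconds, illustrative default
) -> str:
    """Assemble a gdsio command line from named parameters."""
    argv = [
        GDSIO,
        "-D", directory,
        "-d", str(gpu_index),
        "-w", str(threads),
        "-s", file_size,
        "-i", io_size,
        "-x", str(XFER_TYPES[xfer_type]),
        "-I", str(IO_TYPES[io_type]),
        "-T", str(duration),
    ]
    return shlex.join(argv)


cmd = gdsio_command("/nfsfast", "randwrite", io_size="4K", file_size="500M")
print(cmd)
```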

Before we begin running tests, note that they are being run on a standard Dell R760xa. From the nvidia-smi topo output we can see a non-optimal NODE topology between the GPU and the NICs, meaning the connection traverses PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node. Ideally, for performant numbers we would run this on an H100 or B200 system where the GPU and NIC are connected to the same PCIe switch, yielding a PHB, PXB or PIX connection.

sh-5.1# nvidia-smi topo -mp
        GPU0    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0     X     NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    0,2,4,6,8,10    0        N/A
NIC0    NODE     X     NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE                
NIC1    NODE    NODE     X     PIX    PIX    PIX    PIX    PIX    PIX    PIX    PIX                
NIC2    NODE    NODE    PIX     X     PIX    PIX    PIX    PIX    PIX    PIX    PIX                
NIC3    NODE    NODE    PIX    PIX     X     PIX    PIX    PIX    PIX    PIX    PIX                
NIC4    NODE    NODE    PIX    PIX    PIX     X     PIX    PIX    PIX    PIX    PIX                
NIC5    NODE    NODE    PIX    PIX    PIX    PIX     X     PIX    PIX    PIX    PIX                
NIC6    NODE    NODE    PIX    PIX    PIX    PIX    PIX     X     PIX    PIX    PIX                
NIC7    NODE    NODE    PIX    PIX    PIX    PIX    PIX    PIX     X     PIX    PIX                
NIC8    NODE    NODE    PIX    PIX    PIX    PIX    PIX    PIX    PIX     X     PIX                
NIC9    NODE    NODE    PIX    PIX    PIX    PIX    PIX    PIX    PIX    PIX     X                 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
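On hosts with many NICs the matrix is easier to reason about once the GPU row is extracted. The following is a rough sketch, assuming the matrix text from `nvidia-smi topo -mp` has been captured as a string. The function name `gpu_nic_links` is hypothetical, and the sample is a two-NIC excerpt of the output above.

```python
import re

# Column headers that name a device, e.g. GPU0 or NIC3.
DEVICE = re.compile(r"^(GPU|NIC)\d+$")

# Link types that stay below the NUMA-node level, per the legend above.
CLOSE_LINKS = {"PIX", "PXB", "PHB"}


def gpu_nic_links(matrix: str, gpu: str = "GPU0") -> dict[str, str]:
    """Return {NIC name: link type} for the given GPU row of the topo matrix."""
    lines = [line for line in matrix.splitlines() if line.strip()]
    # Keep only device columns; this drops "CPU Affinity", "NUMA Affinity", etc.
    columns = [tok for tok in lines[0].split() if DEVICE.match(tok)]
    for line in lines[1:]:
        tokens = line.split()
        if tokens[0] == gpu:
            links = tokens[1 : 1 + len(columns)]
            return {
                col: link
                for col, link in zip(columns, links)
                if col.startswith("NIC")
            }
    raise ValueError(f"{gpu} not found in topology matrix")


# Two-NIC excerpt of the matrix shown above.
sample = """\
        GPU0    NIC0    NIC1    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0     X     NODE    NODE    0,2,4,6,8,10    0        N/A
NIC0    NODE     X     NODE
NIC1    NODE    NODE     X
"""

links = gpu_nic_links(sample)
has_close_nic = any(link in CLOSE_LINKS for link in links.values())
print(links, "close NIC available:", has_close_nic)
```

For the host used here every GPU-to-NIC link is NODE, so `has_close_nic` is false, which matches the non-optimal topology described earlier.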

Now let's run a few gdsio tests across our RDMA NFS mount. Please note these runs were not performance tuned in any way. This is merely a demonstration of the feature's functionality.

In this first example, gdsio is used to generate a random write load of small IOs (4K) to one of the NFS mount points.

sh-5.1# /usr/local/cuda-12.8/gds/tools/gdsio -D /nfsfast -d 0 -w 32 -s 500M -i 4K -x 0 -I 3 -T 120
IoType: RANDWRITE XferType: GPUD Threads: 32 DataSetSize: 43222136/16384000(KiB) IOSize: 4(KiB) Throughput: 0.344940 GiB/sec, Avg_Latency: 352.314946 usecs ops: 10805534 total_time 119.498576 secs

Next we will repeat the same test but for random reads.

sh-5.1# /usr/local/cuda-12.8/gds/tools/gdsio -D /nfsfast -d 0 -w 32 -s 500M -i 4K -x 0 -I 2 -T 120
IoType: RANDREAD XferType: GPUD Threads: 32 DataSetSize: 71313540/16384000(KiB) IOSize: 4(KiB) Throughput: 0.569229 GiB/sec, Avg_Latency: 214.448246 usecs ops: 17828385 total_time 119.477201 secs

Small and random IOs are all about IOPS and latency. For our next test we will measure throughput, using larger file sizes and much larger IO sizes.

sh-5.1# /usr/local/cuda-12.8/gds/tools/gdsio -D /nfsfast -d 0 -w 32 -s 1G -i 1M -x 0 -I 1 -T 120
IoType: WRITE XferType: GPUD Threads: 32 DataSetSize: 320301056/33554432(KiB) IOSize: 1024(KiB) Throughput: 2.547637 GiB/sec, Avg_Latency: 12487.658159 usecs ops: 312794 total_time 119.900455 secs

This concludes the workflow of configuring and testing GPU Direct Storage on OpenShift over an RDMA NFS mount.

Wednesday, February 22, 2023

OpenShift to MicroShift Resource Mapping

Recently I was approached with the task of understanding the resource differences between OpenShift and MicroShift. This is especially important if one is interested in applying governance policies and rules across a fleet of systems that might be a mix of both OpenShift and MicroShift. Knowing which resource definitions are common or disparate between the two Kubernetes experiences helps an administrator decide whether the same policy can be used across both or whether a policy needs to be crafted specifically for one or the other. Given that this information might be important for administrators, I decided to map it out.

The table below lists each resource definition, whether it is defined in OpenShift 4.12, MicroShift 4.12, or both, and the corresponding API version.

Resource	OpenShift	MicroShift	API Version
alertmanagerconfigs	Yes	No	monitoring.coreos.com/v1alpha1, monitoring.coreos.com/v1beta1
alertmanagers	Yes	No	monitoring.coreos.com/v1
apirequestcounts	Yes	No	apiserver.openshift.io/v1
apiservers	Yes	No	config.openshift.io/v1
apiservices	Yes	Yes	apiregistration.k8s.io/v1
appliedclusterresourcequotas	Yes	No	quota.openshift.io/v1
authentications	Yes	No	config.openshift.io/v1, operator.openshift.io/v1
baremetalhosts	Yes	No	metal3.io/v1alpha1
bmceventsubscriptions	Yes	No	metal3.io/v1alpha1
brokertemplateinstances	Yes	No	template.openshift.io/v1
buildconfigs	Yes	No	build.openshift.io/v1
builds	Yes	No	build.openshift.io/v1, config.openshift.io/v1
catalogsources	Yes	No	operators.coreos.com/v1alpha1
certificatesigningrequests	Yes	Yes	certificates.k8s.io/v1
cloudcredentials	Yes	No	operator.openshift.io/v1
clusterautoscalers	Yes	No	autoscaling.openshift.io/v1
clustercsidrivers	Yes	No	operator.openshift.io/v1
clusteroperators	Yes	No	config.openshift.io/v1
clusterresourcequotas	Yes	No	quota.openshift.io/v1
clusterrolebindings	Yes	Yes	authorization.openshift.io/v1, rbac.authorization.k8s.io/v1
clusterroles	Yes	Yes	authorization.openshift.io/v1, rbac.authorization.k8s.io/v1
clusterserviceversions	Yes	No	operators.coreos.com/v1alpha1
clusterversions	Yes	No	config.openshift.io/v1
componentstatuses	Yes	Yes	v1
configmaps	Yes	Yes	v1
configs	Yes	No	imageregistry.operator.openshift.io/v1, operator.openshift.io/v1, samples.operator.openshift.io/v1
consoleclidownloads	Yes	No	console.openshift.io/v1
consoleexternalloglinks	Yes	No	console.openshift.io/v1
consolelinks	Yes	No	console.openshift.io/v1
consolenotifications	Yes	No	console.openshift.io/v1
consoleplugins	Yes	No	console.openshift.io/v1, console.openshift.io/v1alpha1
consolequickstarts	Yes	No	console.openshift.io/v1
consoles	Yes	No	config.openshift.io/v1, operator.openshift.io/v1
consoleyamlsamples	Yes	No	console.openshift.io/v1
containerruntimeconfigs	Yes	No	machineconfiguration.openshift.io/v1
controllerconfigs	Yes	No	machineconfiguration.openshift.io/v1
controllerrevisions	Yes	Yes	apps/v1
controlplanemachinesets	Yes	No	machine.openshift.io/v1
credentialsrequests	Yes	No	cloudcredential.openshift.io/v1
cronjobs	Yes	Yes	batch/v1
csidrivers	Yes	Yes	storage.k8s.io/v1
csinodes	Yes	Yes	storage.k8s.io/v1
csisnapshotcontrollers	Yes	No	operator.openshift.io/v1
csistoragecapacities	Yes	Yes	storage.k8s.io/v1, storage.k8s.io/v1beta1
customresourcedefinitions	Yes	Yes	apiextensions.k8s.io/v1
daemonsets	Yes	Yes	apps/v1
deploymentconfigs	Yes	No	apps.openshift.io/v1
deployments	Yes	Yes	apps/v1
dnses	Yes	No	config.openshift.io/v1, operator.openshift.io/v1
dnsrecords	Yes	No	ingress.operator.openshift.io/v1
egressfirewalls	Yes	No	k8s.ovn.org/v1
egressips	Yes	No	k8s.ovn.org/v1
egressqoses	Yes	No	k8s.ovn.org/v1
egressrouters	Yes	No	network.operator.openshift.io/v1
endpoints	Yes	Yes	v1
endpointslices	Yes	Yes	discovery.k8s.io/v1
etcds	Yes	No	operator.openshift.io/v1
events	Yes	Yes	v1, events.k8s.io/v1
featuregates	Yes	No	config.openshift.io/v1
firmwareschemas	Yes	No	metal3.io/v1alpha1
flowschemas	Yes	Yes	flowcontrol.apiserver.k8s.io/v1beta1, flowcontrol.apiserver.k8s.io/v1beta2
groups	Yes	No	user.openshift.io/v1
hardwaredata	Yes	No	metal3.io/v1alpha1
helmchartrepositories	Yes	No	helm.openshift.io/v1beta1
horizontalpodautoscalers	Yes	Yes	autoscaling/v1, autoscaling/v2, autoscaling/v2beta2
hostfirmwaresettings	Yes	No	metal3.io/v1alpha1
identities	Yes	No	user.openshift.io/v1
imagecontentpolicies	Yes	No	config.openshift.io/v1
imagecontentsourcepolicies	Yes	No	operator.openshift.io/v1alpha1
imagepruners	Yes	No	imageregistry.operator.openshift.io/v1
images	Yes	No	config.openshift.io/v1, image.openshift.io/v1
imagesignatures	Yes	No	image.openshift.io/v1
imagestreams	Yes	No	image.openshift.io/v1
imagestreamtags	Yes	No	image.openshift.io/v1
imagetags	Yes	No	image.openshift.io/v1
infrastructures	Yes	No	config.openshift.io/v1
ingressclasses	Yes	Yes	networking.k8s.io/v1
ingresscontrollers	Yes	No	operator.openshift.io/v1
ingresses	Yes	Yes	config.openshift.io/v1, networking.k8s.io/v1
insightsoperators	Yes	No	operator.openshift.io/v1
installplans	Yes	No	operators.coreos.com/v1alpha1
ippools	Yes	No	whereabouts.cni.cncf.io/v1alpha1
jobs	Yes	Yes	batch/v1
kubeapiservers	Yes	No	operator.openshift.io/v1
kubecontrollermanagers	Yes	No	operator.openshift.io/v1
kubeletconfigs	Yes	No	machineconfiguration.openshift.io/v1
kubeschedulers	Yes	No	operator.openshift.io/v1
kubestorageversionmigrators	Yes	No	operator.openshift.io/v1
leases	Yes	Yes	coordination.k8s.io/v1
limitranges	Yes	Yes	v1
machineautoscalers	Yes	No	autoscaling.openshift.io/v1beta1
machineconfigpools	Yes	No	machineconfiguration.openshift.io/v1
machineconfigs	Yes	No	machineconfiguration.openshift.io/v1
machinehealthchecks	Yes	No	machine.openshift.io/v1beta1
machines	Yes	No	machine.openshift.io/v1beta1
machinesets	Yes	No	machine.openshift.io/v1beta1
mutatingwebhookconfigurations	Yes	Yes	admissionregistration.k8s.io/v1
namespaces	Yes	Yes	v1
network-attachment-definitions	Yes	No	k8s.cni.cncf.io/v1
networkpolicies	Yes	Yes	networking.k8s.io/v1
networks	Yes	No	config.openshift.io/v1, operator.openshift.io/v1
nodes	Yes	Yes	v1, config.openshift.io/v1, metrics.k8s.io/v1beta1
oauthaccesstokens	Yes	No	oauth.openshift.io/v1
oauthauthorizetokens	Yes	No	oauth.openshift.io/v1
oauthclientauthorizations	Yes	No	oauth.openshift.io/v1
oauthclients	Yes	No	oauth.openshift.io/v1
oauths	Yes	No	config.openshift.io/v1
olmconfigs	Yes	No	operators.coreos.com/v1
openshiftapiservers	Yes	No	operator.openshift.io/v1
openshiftcontrollermanagers	Yes	No	operator.openshift.io/v1
operatorconditions	Yes	No	operators.coreos.com/v1, operators.coreos.com/v2
operatorgroups	Yes	No	operators.coreos.com/v1, operators.coreos.com/v1alpha2
operatorhubs	Yes	No	config.openshift.io/v1
operatorpkis	Yes	No	network.operator.openshift.io/v1
operators	Yes	No	operators.coreos.com/v1
overlappingrangeipreservations	Yes	No	whereabouts.cni.cncf.io/v1alpha1
packagemanifests	Yes	No	packages.operators.coreos.com/v1
performanceprofiles	Yes	No	performance.openshift.io/v1, performance.openshift.io/v1alpha1, performance.openshift.io/v2
persistentvolumeclaims	Yes	Yes	v1
persistentvolumes	Yes	Yes	v1
poddisruptionbudgets	Yes	Yes	policy/v1
podmonitors	Yes	No	monitoring.coreos.com/v1
podnetworkconnectivitychecks	Yes	No	controlplane.operator.openshift.io/v1alpha1
pods	Yes	Yes	v1, metrics.k8s.io/v1beta1
podtemplates	Yes	Yes	v1
preprovisioningimages	Yes	No	metal3.io/v1alpha1
priorityclasses	Yes	Yes	scheduling.k8s.io/v1
prioritylevelconfigurations	Yes	Yes	flowcontrol.apiserver.k8s.io/v1beta1, flowcontrol.apiserver.k8s.io/v1beta2
probes	Yes	No	monitoring.coreos.com/v1
profiles	Yes	No	tuned.openshift.io/v1
projecthelmchartrepositories	Yes	No	helm.openshift.io/v1beta1
projectrequests	Yes	No	project.openshift.io/v1
projects	Yes	No	config.openshift.io/v1, project.openshift.io/v1
prometheuses	Yes	No	monitoring.coreos.com/v1
prometheusrules	Yes	No	monitoring.coreos.com/v1
provisionings	Yes	No	metal3.io/v1alpha1
proxies	Yes	No	config.openshift.io/v1
rangeallocations	Yes	Yes	security.internal.openshift.io/v1, security.openshift.io/v1
replicasets	Yes	Yes	apps/v1
replicationcontrollers	Yes	Yes	v1
resourceaccessreviews	Yes	No	authorization.openshift.io/v1
resourcequotas	Yes	Yes	v1
rolebindingrestrictions	Yes	No	authorization.openshift.io/v1
rolebindings	Yes	Yes	authorization.openshift.io/v1, rbac.authorization.k8s.io/v1
roles	Yes	Yes	authorization.openshift.io/v1, rbac.authorization.k8s.io/v1
routes	Yes	Yes	route.openshift.io/v1
runtimeclasses	Yes	Yes	node.k8s.io/v1
schedulers	Yes	No	config.openshift.io/v1
secrets	Yes	Yes	v1
securitycontextconstraints	Yes	Yes	security.openshift.io/v1
selfsubjectaccessreviews	Yes	Yes	authorization.k8s.io/v1
selfsubjectrulesreviews	Yes	Yes	authorization.k8s.io/v1
serviceaccounts	Yes	Yes	v1
servicecas	Yes	No	operator.openshift.io/v1
servicemonitors	Yes	No	monitoring.coreos.com/v1
services	Yes	Yes	v1
statefulsets	Yes	Yes	apps/v1
storageclasses	Yes	Yes	storage.k8s.io/v1
storages	Yes	No	operator.openshift.io/v1
storagestates	Yes	No	migration.k8s.io/v1alpha1
storageversionmigrations	Yes	No	migration.k8s.io/v1alpha1
subjectaccessreviews	Yes	Yes	authorization.k8s.io/v1, authorization.openshift.io/v1
subscriptions	Yes	No	operators.coreos.com/v1alpha1
templateinstances	Yes	No	template.openshift.io/v1
templates	Yes	No	template.openshift.io/v1
thanosrulers	Yes	No	monitoring.coreos.com/v1
tokenreviews	Yes	Yes	authentication.k8s.io/v1, oauth.openshift.io/v1
tuneds	Yes	No	tuned.openshift.io/v1
useridentitymappings	Yes	No	user.openshift.io/v1
useroauthaccesstokens	Yes	No	oauth.openshift.io/v1
users	Yes	No	user.openshift.io/v1
validatingwebhookconfigurations	Yes	Yes	admissionregistration.k8s.io/v1
volumeattachments	Yes	Yes	storage.k8s.io/v1
volumesnapshotclasses	Yes	No	snapshot.storage.k8s.io/v1
volumesnapshotcontents	Yes	No	snapshot.storage.k8s.io/v1
volumesnapshots	Yes	No	snapshot.storage.k8s.io/v1
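For policy work, the useful question is usually which resources exist on both platforms. Here is a minimal sketch, assuming the table above has been saved as tab-separated text with the same four columns. The function name `shared_resources` is hypothetical, and the sample rows are copied from the table.

```python
import csv
import io


def shared_resources(table_tsv: str) -> list[str]:
    """Return resources marked Yes for both OpenShift and MicroShift."""
    reader = csv.reader(io.StringIO(table_tsv), delimiter="\t")
    next(reader)  # skip the header row
    # Columns by position: resource, OpenShift, MicroShift, API version.
    return [
        row[0]
        for row in reader
        if len(row) >= 3 and row[1] == "Yes" and row[2] == "Yes"
    ]


# A few rows copied from the table above.
sample = (
    "Resource\tOpenShift\tMicroShift\tAPI Version\n"
    "alertmanagers\tYes\tNo\tmonitoring.coreos.com/v1\n"
    "configmaps\tYes\tYes\tv1\n"
    "machineconfigs\tYes\tNo\tmachineconfiguration.openshift.io/v1\n"
    "routes\tYes\tYes\troute.openshift.io/v1\n"
)

common = shared_resources(sample)
print(common)
```

A policy that only touches resources in the returned list should be applicable to both OpenShift and MicroShift, while anything outside it needs an OpenShift-specific policy.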