Saturday, June 07, 2025

Exploring the NVIDIA Maintenance Operator


The NVIDIA Maintenance Operator provides a Kubernetes API (Custom Resource Definition) that allows node maintenance operations in a Kubernetes cluster to be performed in a coordinated manner. It handles the common steps needed to prepare a node for maintenance, such as cordoning and draining it. Users/consumers can request maintenance on a node by creating a NodeMaintenance custom resource. The operator then reconciles NodeMaintenance custom resources with the following flow:
  • Scheduling of NodeMaintenance resources to be processed by the operator, taking into account constraints such as the maximum number of allowed parallel operations.
  • Preparing the node for maintenance, such as cordoning and draining it.
  • Marking the NodeMaintenance as Ready (via a condition; see the example command after this list).
  • Cleaning up on deletion of the NodeMaintenance, such as uncordoning the node.
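
As a consumer of the API, one way to block until the operator reports the node as ready for maintenance is to wait on that condition. This is just a sketch and assumes a NodeMaintenance named <name> in the default namespace (the one we create later in this post is called aerial-maintenance-operation):

$ oc wait nodemaintenances.maintenance.nvidia.com/<name> -n default --for=condition=Ready --timeout=300s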

Assumptions

This workflow assumes we already have an OpenShift cluster installed along with some NVIDIA-related resources. In our example single node OpenShift cluster below, we have a GPU resource and also have the NVIDIA Network Operator installed and configured.
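
If you want to confirm the node actually advertises a GPU resource before going further, a quick check like the following can be used (the node name here is the one from our example cluster):

$ oc describe node nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com | grep nvidia.com/gpu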

Installing and Configuring NVIDIA Maintenance Operator

Before we can demonstrate the NVIDIA Maintenance Operator we first have to install and configure it. Let's generate the following custom resource file to create the Namespace, OperatorGroup and Subscription for the NVIDIA Maintenance Operator.

$ cat <<EOF > node-maintenance-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-maintenance-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-maintenance-operator
  namespace: nvidia-maintenance-operator
spec:
  targetNamespaces:
  - nvidia-maintenance-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nvidia-maintenance-operator
  namespace: nvidia-maintenance-operator
spec:
  channel: v0.2
  installPlanApproval: Automatic
  name: nvidia-maintenance-operator
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: nvidia-maintenance-operator.v0.2.2
EOF
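
If you are unsure which channel or starting CSV to use, the catalog can be queried before creating the Subscription. This is just a sketch and assumes the certified-operators catalog source is available in your cluster:

$ oc get packagemanifests -n openshift-marketplace | grep nvidia-maintenance-operator
$ oc get packagemanifest nvidia-maintenance-operator -n openshift-marketplace -o jsonpath='{.status.channels[*].name}'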

Now let's create the resources on the cluster using the custom resource file we generated.

$ oc create -f node-maintenance-operator.yaml
namespace/nvidia-maintenance-operator created
operatorgroup.operators.coreos.com/nvidia-maintenance-operator created
subscription.operators.coreos.com/nvidia-maintenance-operator created

Next we can validate that the operator controller manager pod is up and running.

$ oc get pods -n nvidia-maintenance-operator
NAME                                                       READY   STATUS    RESTARTS   AGE
maintenance-operator-controller-manager-d8db7f84b-hsmfd    1/1     Running   0          8m
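
We can also confirm that OLM finished the install by checking the ClusterServiceVersion in the namespace. Once the Subscription resolves, the CSV we referenced in startingCSV (nvidia-maintenance-operator.v0.2.2) should eventually report a Succeeded phase:

$ oc get csv -n nvidia-maintenance-operator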

Now we need to create the MaintenanceOperatorConfig custom resource file. In this file we can specify the log level, the number of parallel operations (i.e. how many nodes can be taken offline at once) and the maximum time a node is kept in maintenance (a value in seconds that should provide enough time for the maintenance work to happen before the operator removes the node maintenance). In our example we are only going to allow one maintenance operation at a time, and that operation has 300 seconds to finish before the node is returned to a schedulable state.

$ cat <<EOF > maintenance-operator-config.yaml
apiVersion: maintenance.nvidia.com/v1alpha1
kind: MaintenanceOperatorConfig
metadata:
  name: default
  namespace: nvidia-maintenance-operator
spec:
  logLevel: info
  maxParallelOperations: 1
  maxNodeMaintenanceTimeSeconds: 300
EOF

Now let's create the MaintenanceOperatorConfig on the cluster.

$ oc create -f maintenance-operator-config.yaml
maintenanceoperatorconfig.maintenance.nvidia.com/default created

The MaintenanceOperatorConfig does not spin up any additional pods but we can check that it is there by running the following command.

$ oc get MaintenanceOperatorConfig -n nvidia-maintenance-operator -o yaml
apiVersion: v1
items:
- apiVersion: maintenance.nvidia.com/v1alpha1
  kind: MaintenanceOperatorConfig
  metadata:
    creationTimestamp: "2025-05-29T17:44:58Z"
    generation: 1
    name: default
    namespace: nvidia-maintenance-operator
    resourceVersion: "3912981"
    uid: 43a67142-c8ab-4d21-a788-417339f4a338
  spec:
    logLevel: info
    maxNodeMaintenanceTimeSeconds: 300
    maxParallelOperations: 1
kind: List
metadata:
  resourceVersion: ""

Validating NVIDIA Maintenance Operator

Now, before we configure a NodeMaintenance resource, I want to point out that we currently have an Aerial application running on our cluster that consumes a GPU. This will be important when we configure our NodeMaintenance resource.

$ oc get pod -l app=aerial-gnb -n aerial
NAME                          READY   STATUS    RESTARTS   AGE
aerial-gnb-6947fc77b7-wjrsv   1/1     Running   0          76m
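
To confirm this pod really does request a GPU (and would therefore match the eviction filters we define below), we can peek at its container resource limits. This is just a quick sketch using a standard jsonpath query:

$ oc get pod -l app=aerial-gnb -n aerial -o jsonpath='{.items[*].spec.containers[*].resources.limits}'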

Below is an example NodeMaintenance resource file. In it I have specified the node it applies to along with how long we should wait for our application pod to complete. In this example we will wait 60 seconds for the Aerial application to complete. If the application has not completed by then, the NodeMaintenance resource is configured to force the drain of the node, which will terminate the pod running on our node. Notice we also have pod eviction filters based on the resources a pod could potentially be consuming; in our case the GPU is the important one.

$ cat <<EOF > node-maintenance.yaml
apiVersion: maintenance.nvidia.com/v1alpha1
kind: NodeMaintenance
metadata:
  name: aerial-maintenance-operation
  namespace: default
spec:
  requestorID: schmaustech
  nodeName: nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com
  cordon: true
  waitForPodCompletion:
    podSelector: "app=aerial-gnb"
    timeoutSeconds: 60
  drainSpec:
    force: true
    podSelector: ""
    timeoutSeconds: 90
    deleteEmptyDir: true
    podEvictionFilters:
    - byResourceNameRegex: nvidia.com/gpu*
    - byResourceNameRegex: nvidia.com/rdma*
EOF

Now let's create our NodeMaintenance custom resource on the cluster. Note that when this gets created the node will be cordoned and marked SchedulingDisabled, and the timers will begin.

$ oc create -f node-maintenance.yaml
nodemaintenance.maintenance.nvidia.com/aerial-maintenance-operation created

We can see the NodeMaintenance resource is waiting for pods to complete.

$ oc get nodemaintenances.maintenance.nvidia.com -A
NAMESPACE   NAME                           NODE                                       REQUESTOR     READY   PHASE                  FAILED
default     aerial-maintenance-operation   nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com   schmaustech   False   WaitForPodCompletion
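
If you want to follow the phase transitions as they happen rather than polling, the resource can simply be watched:

$ oc get nodemaintenances.maintenance.nvidia.com -A -w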

We can see the node has been marked SchedulingDisabled.

$ oc get nodes
NAME                                       STATUS                     ROLES                         AGE   VERSION
nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com   Ready,SchedulingDisabled   control-plane,master,worker   13d   v1.31.8
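
SchedulingDisabled here is simply the cordon: the operator sets spec.unschedulable on the node, which we can verify directly (this should print true while the node is cordoned):

$ oc get node nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com -o jsonpath='{.spec.unschedulable}'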

Once the pod completion timeout expires (60 seconds in our spec), we can see the node is being drained.

$ oc get nodemaintenances.maintenance.nvidia.com -A
NAMESPACE   NAME                           NODE                                       REQUESTOR     READY   PHASE      FAILED
default     aerial-maintenance-operation   nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com   schmaustech   False   Draining

And our Aerial pod is terminating; notice a new one is Pending. The new pod is generated because the deployment is configured to always have one replica running. However, we are on a single node OpenShift deployment and the node is marked SchedulingDisabled, so the new pod will sit in a Pending state until the maintenance is over.

$ oc get pods -n aerial
NAME                          READY   STATUS        RESTARTS   AGE
aerial-gnb-6947fc77b7-48vxz   1/1     Terminating   0          80m
aerial-gnb-6947fc77b7-88dv8   0/1     Pending       0          9s
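
If we describe the Pending pod we would expect to see a FailedScheduling event from the scheduler explaining that the only node is currently unschedulable (output omitted here):

$ oc describe pod aerial-gnb-6947fc77b7-88dv8 -n aerial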

Finally, we can see the NodeMaintenance resource reports the node as Ready, since all the pods that consumed GPUs have been terminated (or, if this were a multi-node cluster, moved to another node).

$ oc get nodemaintenances.maintenance.nvidia.com -A
NAMESPACE   NAME                           NODE                                       REQUESTOR     READY   PHASE   FAILED
default     aerial-maintenance-operation   nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com   schmaustech   True    Ready

We can also see we only have a Pending Aerial deployment pod.

$ oc get pods -n aerial
NAME                          READY   STATUS    RESTARTS   AGE
aerial-gnb-6947fc77b7-88dv8   0/1     Pending   0          49s

Then, after 5 minutes (our 300-second maxNodeMaintenanceTimeSeconds) and assuming our maintenance went well, we will see the NodeMaintenance resource has been removed by the operator automatically.

$ oc get nodemaintenances.maintenance.nvidia.com -A
No resources found

The node is no longer marked as SchedulingDisabled.

$ oc get nodes
NAME                                       STATUS   ROLES                         AGE   VERSION
nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com   Ready    control-plane,master,worker   13d   v1.31.8

And our Aerial pod that was Pending is now in a Running state.

$ oc get pods -n aerial
NAME                          READY   STATUS    RESTARTS   AGE
aerial-gnb-6947fc77b7-88dv8   1/1     Running   0          7m

Hopefully this provides a simple overview of what the NVIDIA Maintenance Operator can do on OpenShift. For more information about the maintenance operator, check out the GitHub repo here.