Saturday, June 07, 2025

Exploring the NVIDIA Maintenance Operator


The NVIDIA Maintenance Operator provides a Kubernetes API (Custom Resource Definition) that allows node maintenance operations in a Kubernetes cluster to be performed in a coordinated manner. It handles the common steps needed to prepare a node for maintenance, such as cordoning and draining it. Users/consumers can request maintenance on a node by creating a NodeMaintenance custom resource. The operator then reconciles NodeMaintenance custom resources with the following flow:
  • Scheduling of NodeMaintenance resources to be processed by the operator, taking into account constraints such as the maximum number of allowed parallel operations.
  • Preparing the node for maintenance, such as cordoning and draining it.
  • Marking the NodeMaintenance as Ready (via a condition; see the example command after this list).
  • Cleaning up on deletion of the NodeMaintenance, such as uncordoning the node.
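
As a consumer of the API, one way to block until the operator reports the node as ready for maintenance is to wait on that condition. This is just a sketch and assumes a NodeMaintenance named <name> in the default namespace (the one we create later in this post is called aerial-maintenance-operation):

$ oc wait nodemaintenances.maintenance.nvidia.com/<name> -n default --for=condition=Ready --timeout=300s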

Assumptions

This workflow assumes we already have an OpenShift cluster installed along with some NVIDIA-related resources. In our example single node OpenShift cluster below, we have a GPU resource and also have the NVIDIA Network Operator installed and configured.
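
If you want to confirm the node actually advertises a GPU resource before going further, a quick check like the following can be used (the node name here is the one from our example cluster):

$ oc describe node nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com | grep nvidia.com/gpu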

Installing and Configuring NVIDIA Maintenance Operator

Before we can demonstrate the NVIDIA Maintenance Operator we first have to install and configure it. Let's generate the following custom resource file to create the Namespace, OperatorGroup and Subscription for the NVIDIA Maintenance Operator.

$ cat <<EOF > node-maintenance-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-maintenance-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-maintenance-operator
  namespace: nvidia-maintenance-operator
spec:
  targetNamespaces:
  - nvidia-maintenance-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nvidia-maintenance-operator
  namespace: nvidia-maintenance-operator
spec:
  channel: v0.2
  installPlanApproval: Automatic
  name: nvidia-maintenance-operator
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: nvidia-maintenance-operator.v0.2.2
EOF
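
If you are unsure which channel or starting CSV to use, the catalog can be queried before creating the Subscription. This is just a sketch and assumes the certified-operators catalog source is available in your cluster:

$ oc get packagemanifests -n openshift-marketplace | grep nvidia-maintenance-operator
$ oc get packagemanifest nvidia-maintenance-operator -n openshift-marketplace -o jsonpath='{.status.channels[*].name}'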

Now let's create the resources on the cluster using the custom resource file we generated.

$ oc create -f node-maintenance-operator.yaml
namespace/nvidia-maintenance-operator created
operatorgroup.operators.coreos.com/nvidia-maintenance-operator created
subscription.operators.coreos.com/nvidia-maintenance-operator created

Next we can validate that the operator controller manager pod is up and running.

$ oc get pods -n nvidia-maintenance-operator
NAME                                                       READY   STATUS    RESTARTS   AGE
maintenance-operator-controller-manager-d8db7f84b-hsmfd    1/1     Running   0          8m
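
We can also confirm that OLM finished the install by checking the ClusterServiceVersion in the namespace. Once the Subscription resolves, the CSV we referenced in startingCSV (nvidia-maintenance-operator.v0.2.2) should eventually report a Succeeded phase:

$ oc get csv -n nvidia-maintenance-operator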

Now we need to create the MaintenanceOperatorConfig custom resource file. In this file we can specify the log level, the number of parallel operations (i.e. how many nodes can be taken offline at once) and the maximum time a node is kept in maintenance (a value in seconds that should provide enough time for the maintenance work to happen before the operator removes the node maintenance). In our example we are only going to allow one maintenance operation at a time, and that operation has 300 seconds to finish before the node is returned to a schedulable state.

$ cat <<EOF > maintenance-operator-config.yaml
apiVersion: maintenance.nvidia.com/v1alpha1
kind: MaintenanceOperatorConfig
metadata:
  name: default
  namespace: nvidia-maintenance-operator
spec:
  logLevel: info
  maxParallelOperations: 1
  maxNodeMaintenanceTimeSeconds: 300
EOF

Now let's create the MaintenanceOperatorConfig on the cluster.

$ oc create -f maintenance-operator-config.yaml
maintenanceoperatorconfig.maintenance.nvidia.com/default created

The MaintenanceOperatorConfig does not spin up any additional pods but we can check that it is there by running the following command.

$ oc get MaintenanceOperatorConfig -n nvidia-maintenance-operator -o yaml
apiVersion: v1
items:
- apiVersion: maintenance.nvidia.com/v1alpha1
  kind: MaintenanceOperatorConfig
  metadata:
    creationTimestamp: "2025-05-29T17:44:58Z"
    generation: 1
    name: default
    namespace: nvidia-maintenance-operator
    resourceVersion: "3912981"
    uid: 43a67142-c8ab-4d21-a788-417339f4a338
  spec:
    logLevel: info
    maxNodeMaintenanceTimeSeconds: 300
    maxParallelOperations: 1
kind: List
metadata:
  resourceVersion: ""

Validating NVIDIA Maintenance Operator

Now, before we configure a NodeMaintenance resource, I want to point out that we currently have an Aerial application running on our cluster that consumes a GPU. This will be important when we configure our NodeMaintenance resource.

$ oc get pod -l app=aerial-gnb -n aerial
NAME                          READY   STATUS    RESTARTS   AGE
aerial-gnb-6947fc77b7-wjrsv   1/1     Running   0          76m
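
To confirm this pod really does request a GPU (and would therefore match the eviction filters we define below), we can peek at its container resource limits. This is just a quick sketch using a standard jsonpath query:

$ oc get pod -l app=aerial-gnb -n aerial -o jsonpath='{.items[*].spec.containers[*].resources.limits}'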

Below is an example NodeMaintenance resource file. In it I have specified the node it applies to along with how long we should wait for our application pod to complete. In this example we will wait 60 seconds for the Aerial application to complete. If the application has not completed by then, the NodeMaintenance resource is configured to force the drain of the node, which will terminate the pod running on our node. Notice we also have pod eviction filters based on the resources a pod could potentially be consuming; in our case the GPU is the important one.

$ cat <<EOF > node-maintenance.yaml
apiVersion: maintenance.nvidia.com/v1alpha1
kind: NodeMaintenance
metadata:
  name: aerial-maintenance-operation
  namespace: default
spec:
  requestorID: schmaustech
  nodeName: nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com
  cordon: true
  waitForPodCompletion:
    podSelector: "app=aerial-gnb"
    timeoutSeconds: 60
  drainSpec:
    force: true
    podSelector: ""
    timeoutSeconds: 90
    deleteEmptyDir: true
    podEvictionFilters:
    - byResourceNameRegex: nvidia.com/gpu*
    - byResourceNameRegex: nvidia.com/rdma*
EOF

Now let's create our NodeMaintenance custom resource on the cluster. Note that when this gets created the node will be cordoned and marked SchedulingDisabled, and the timers will begin.

$ oc create -f node-maintenance.yaml
nodemaintenance.maintenance.nvidia.com/aerial-maintenance-operation created

We can see the NodeMaintenance resource is waiting for pods to complete.

$ oc get nodemaintenances.maintenance.nvidia.com -A
NAMESPACE   NAME                           NODE                                       REQUESTOR     READY   PHASE                  FAILED
default     aerial-maintenance-operation   nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com   schmaustech   False   WaitForPodCompletion
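
If you want to follow the phase transitions as they happen rather than polling, the resource can simply be watched:

$ oc get nodemaintenances.maintenance.nvidia.com -A -w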

We can see the node has been marked SchedulingDisabled.

$ oc get nodes
NAME                                       STATUS                     ROLES                         AGE   VERSION
nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com   Ready,SchedulingDisabled   control-plane,master,worker   13d   v1.31.8
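
SchedulingDisabled here is simply the cordon: the operator sets spec.unschedulable on the node, which we can verify directly (this should print true while the node is cordoned):

$ oc get node nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com -o jsonpath='{.spec.unschedulable}'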

Once the pod completion timeout expires (60 seconds in our spec), we can see the node is being drained.

$ oc get nodemaintenances.maintenance.nvidia.com -A
NAMESPACE   NAME                           NODE                                       REQUESTOR     READY   PHASE      FAILED
default     aerial-maintenance-operation   nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com   schmaustech   False   Draining

And our Aerial pod is terminating; notice a new one is Pending. The new pod is generated because the deployment is configured to always have one replica running. However, we are on a single node OpenShift deployment and the node is marked SchedulingDisabled, so the new pod will sit in a Pending state until the maintenance is over.

$ oc get pods -n aerial
NAME                          READY   STATUS        RESTARTS   AGE
aerial-gnb-6947fc77b7-48vxz   1/1     Terminating   0          80m
aerial-gnb-6947fc77b7-88dv8   0/1     Pending       0          9s
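
If we describe the Pending pod we would expect to see a FailedScheduling event from the scheduler explaining that the only node is currently unschedulable (output omitted here):

$ oc describe pod aerial-gnb-6947fc77b7-88dv8 -n aerial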

Finally, we can see the NodeMaintenance resource reports the node as Ready, since all the pods that consumed GPUs have been terminated (or, if this were a multi-node cluster, moved to another node).

$ oc get nodemaintenances.maintenance.nvidia.com -A
NAMESPACE   NAME                           NODE                                       REQUESTOR     READY   PHASE   FAILED
default     aerial-maintenance-operation   nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com   schmaustech   True    Ready

We can also see we only have a Pending Aerial deployment pod.

$ oc get pods -n aerial
NAME                          READY   STATUS    RESTARTS   AGE
aerial-gnb-6947fc77b7-88dv8   0/1     Pending   0          49s

Then, after 5 minutes (our 300-second maxNodeMaintenanceTimeSeconds) and assuming our maintenance went well, we will see the NodeMaintenance resource has been removed by the operator automatically.

$ oc get nodemaintenances.maintenance.nvidia.com -A
No resources found

The node is no longer marked as SchedulingDisabled.

$ oc get nodes
NAME                                       STATUS   ROLES                         AGE   VERSION
nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com   Ready    control-plane,master,worker   13d   v1.31.8

And our Aerial pod that was Pending is now in a Running state.

$ oc get pods -n aerial
NAME                          READY   STATUS    RESTARTS   AGE
aerial-gnb-6947fc77b7-88dv8   1/1     Running   0          7m

Hopefully this provides a simple overview of what the NVIDIA Maintenance Operator can do on OpenShift. For more information about the maintenance operator, check out the GitHub repo here.