- Scheduling of NodeMaintenance to be processed by the operator, taking into account constraints such as the maximal allowed parallel operations.
- Node preparation for maintenance, such as cordoning and draining the node
- Marking the NodeMaintenance as Ready (via a condition; see the sketch after this list)
- Cleanup on deletion of the NodeMaintenance, such as uncordoning the node
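Once a NodeMaintenance resource exists (we will create one later in this post), its Ready condition can be checked directly. Here is a minimal sketch, assuming the condition is surfaced under status.conditions with type Ready and using a hypothetical resource name:
$ oc get nodemaintenances.maintenance.nvidia.com my-maintenance -n default -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'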
Assumptions
This workflow assumes we already have an OpenShift cluster installed along with some NVIDIA-related resources. In our example single-node OpenShift cluster below, we have a GPU resource and also have the NVIDIA Network Operator installed and configured.
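As a quick sanity check before we begin, we can confirm the node advertises a GPU resource. This is just an illustrative grep against the node description (using the node name that appears later in this post); the nvidia.com/gpu resource name assumes the NVIDIA GPU Operator is exposing the device:
$ oc describe node nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com | grep nvidia.com/gpu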
Installing and Configuring NVIDIA Maintenance Operator
Before we can demonstrate the NVIDIA Maintenance Operator, we first have to install and configure it. Let's generate the following manifest to create the Namespace, OperatorGroup, and Subscription for the NVIDIA Maintenance Operator.
$ cat <<EOF > node-maintenance-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-maintenance-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-maintenance-operator
  namespace: nvidia-maintenance-operator
spec:
  targetNamespaces:
  - nvidia-maintenance-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nvidia-maintenance-operator
  namespace: nvidia-maintenance-operator
spec:
  channel: v0.2
  installPlanApproval: Automatic
  name: nvidia-maintenance-operator
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: nvidia-maintenance-operator.v0.2.2
EOF
Now let's create the resources on the cluster using the file we generated.
$ oc create -f node-maintenance-operator.yaml
namespace/nvidia-maintenance-operator created
operatorgroup.operators.coreos.com/nvidia-maintenance-operator created
subscription.operators.coreos.com/nvidia-maintenance-operator created
Next we can validate that the operator's controller manager is up and running.
$ oc get pods -n nvidia-maintenance-operator
NAME READY STATUS RESTARTS AGE
maintenance-operator-controller-manager-d8db7f84b-hsmfd 1/1 Running 0 8m
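If the controller manager pod does not come up, it can also be worth checking that the ClusterServiceVersion installed successfully. This is a generic OLM check rather than anything specific to this operator, and the CSV name may vary with the channel:
$ oc get csv -n nvidia-maintenance-operator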
Now we need to configure the MaintenanceOperatorConfig custom resource. In this resource we can specify the log level, the number of parallel operations (i.e., how many nodes can be taken offline at once), and how long a node is kept in maintenance (a value in seconds that gives the maintenance work enough time to happen before the operator removes the node maintenance). In our example we allow only one maintenance operation at a time, and that operation has 300 seconds to finish before the node is returned to a schedulable state.
$ cat <<EOF > maintenance-operator-config.yaml
apiVersion: maintenance.nvidia.com/v1alpha1
kind: MaintenanceOperatorConfig
metadata:
  name: default
  namespace: nvidia-maintenance-operator
spec:
  logLevel: info
  maxParallelOperations: 1
  maxNodeMaintenanceTimeSeconds: 300
EOF
Now let's create the MaintenanceOperatorConfig on the cluster.
$ oc create -f maintenance-operator-config.yaml
maintenanceoperatorconfig.maintenance.nvidia.com/default created
The MaintenanceOperatorConfig does not spin up any additional pods, but we can check that it is there by running the following command.
$ oc get MaintenanceOperatorConfig -n nvidia-maintenance-operator -o yaml
apiVersion: v1
items:
- apiVersion: maintenance.nvidia.com/v1alpha1
  kind: MaintenanceOperatorConfig
  metadata:
    creationTimestamp: "2025-05-29T17:44:58Z"
    generation: 1
    name: default
    namespace: nvidia-maintenance-operator
    resourceVersion: "3912981"
    uid: 43a67142-c8ab-4d21-a788-417339f4a338
  spec:
    logLevel: info
    maxNodeMaintenanceTimeSeconds: 300
    maxParallelOperations: 1
kind: List
metadata:
  resourceVersion: ""
Validating NVIDIA Maintenance Operator
Now before we configure a NodeMaintenance resource, I want to point out that we currently have an Aerial application running on our cluster that consumes a GPU. This will be important when we configure our NodeMaintenance resource.
$ oc get pod -l app=aerial-gnb -n aerial
NAME READY STATUS RESTARTS AGE
aerial-gnb-6947fc77b7-wjrsv 1/1 Running 0 76m
Below is an example NodeMaintenance resource file. In it I have specified the node it applies to, along with how long we should wait for our application pod to complete. In this example we will wait 60 seconds for the Aerial application to complete. If the application does not complete in that time, the NodeMaintenance resource is instructed to force the drain of the node, which will terminate the pod running on it. Notice we also have eviction filters based on resources a pod could potentially consume; again, in our case the GPU is the important part here.
$ cat <<EOF > node-maintenance.yaml
apiVersion: maintenance.nvidia.com/v1alpha1
kind: NodeMaintenance
metadata:
  name: aerial-maintenance-operation
  namespace: default
spec:
  requestorID: schmaustech
  nodeName: nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com
  cordon: true
  waitForPodCompletion:
    podSelector: "app=aerial-gnb"
    timeoutSeconds: 60
  drainSpec:
    force: true
    podSelector: ""
    timeoutSeconds: 90
    deleteEmptyDir: true
    podEvictionFilters:
    - byResourceNameRegex: nvidia.com/gpu*
    - byResourceNameRegex: nvidia.com/rdma*
EOF
Now let's create our NodeMaintenance custom resource on the cluster. Note that when it is created the node will be cordoned (marked SchedulingDisabled) and the timers will begin.
$ oc create -f node-maintenance.yaml
nodemaintenance.maintenance.nvidia.com/aerial-maintenance-operation created
We can see the NodeMaintenance resource is waiting for pods to complete.
$ oc get nodemaintenances.maintenance.nvidia.com -A
NAMESPACE NAME NODE REQUESTOR READY PHASE FAILED
default aerial-maintenance-operation nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com schmaustech False WaitForPodCompletion
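If we want more detail on what the operator is doing at this point, we can describe the resource to see its conditions and events. The exact status fields depend on the operator version, so treat this as a general inspection step rather than a documented contract:
$ oc describe nodemaintenances.maintenance.nvidia.com aerial-maintenance-operation -n default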
We can see the node has been marked SchedulingDisabled.
$ oc get nodes
NAME STATUS ROLES AGE VERSION
nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com Ready,SchedulingDisabled control-plane,master,worker 13d v1.31.8
After 90 seconds we can now see the node is being drained.
$ oc get nodemaintenances.maintenance.nvidia.com -A
NAMESPACE NAME NODE REQUESTOR READY PHASE FAILED
default aerial-maintenance-operation nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com schmaustech False Draining
Our Aerial pod is terminating, and notice a new one is Pending. The new pod is created because this Deployment is configured to always have one replica running; however, since we are on a single-node OpenShift deployment and the node is SchedulingDisabled, it will sit in a Pending state until maintenance is over.
$ oc get pods -n aerial
NAME READY STATUS RESTARTS AGE
aerial-gnb-6947fc77b7-48vxz 1/1 Terminating 0 80m
aerial-gnb-6947fc77b7-88dv8 0/1 Pending 0 9s
Finally we can see the NodeMaintenance resource reports the node is Ready, since all of the pods that consumed GPUs have been terminated (and/or moved to another node, if this were a multi-node cluster).
$ oc get nodemaintenances.maintenance.nvidia.com -A
NAMESPACE NAME NODE REQUESTOR READY PHASE FAILED
default aerial-maintenance-operation nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com schmaustech True Ready
We can also see we only have a Pending Aerial deployment pod.
$ oc get pods -n aerial
NAME READY STATUS RESTARTS AGE
aerial-gnb-6947fc77b7-88dv8 0/1 Pending 0 49s
Then, after 5 minutes and assuming our maintenance went well, we will see the NodeMaintenance resource is removed by the operator automatically.
$ oc get nodemaintenances.maintenance.nvidia.com -A
No resources found
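As noted at the start of this post, deleting a NodeMaintenance also triggers cleanup such as uncordoning the node. If we had wanted to end the maintenance window early rather than wait for the timeout, we could have deleted the resource ourselves with something like:
$ oc delete nodemaintenances.maintenance.nvidia.com aerial-maintenance-operation -n default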
The node is no longer marked as SchedulingDisabled.
$ oc get nodes
NAME STATUS ROLES AGE VERSION
nvd-srv-36.nvidia.eng.rdu2.dc.redhat.com Ready control-plane,master,worker 13d v1.31.8
And our Aerial pod that was Pending is now in a Running state.
$ oc get pods -n aerial
NAME READY STATUS RESTARTS AGE
aerial-gnb-6947fc77b7-88dv8 1/1 Running 0 7m
Hopefully this simple walkthrough provides an idea of what the NVIDIA Maintenance Operator can do on OpenShift. For more information about the maintenance operator, check out the GitHub repo here.