Wednesday, April 02, 2025

Change IP Address of an OpenShift Control Node


My OpenShift 4.16.25 nodes were using DHCP for their IP addresses. However, the DHCP scope changed and one of my nodes, which had been using 10.6.135.250, was no longer able to get that address. Instead the node received 10.6.135.245. Anyone who has worked with OpenShift knows that an IP address change will impact etcd. In the following I want to walk through the steps to recover from this situation without reinstalling OpenShift. However, I also want to caution that this is for academic purposes; if this happens in a real production environment, be a hero and open a case with Red Hat support.

Original IP Address Configuration

This is a snapshot of my original configuration. The node involved in the address change ends up being nvd-srv-31-vm-1.

$ oc get nodes -o wide
NAME              STATUS   ROLES                         AGE   VERSION            INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                 KERNEL-VERSION                 CONTAINER-RUNTIME
nvd-srv-31-vm-1   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.250   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-2   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.243   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-3   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.244   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9

Issues Arise

On a Friday, because it always happens on a Friday, one of my colleagues told me that node nvd-srv-31-vm-1 had become unhealthy. When I took a look I could see that a bunch of pods were unable to deploy, and I could not launch a debug pod for the node itself. The day before, I had a conversation with someone on our networking team who was mad that the DHCP scope included 10.6.135.250. I mentioned that my host had that address and that we could not change it at the time since it was part of an active OpenShift cluster. However, 24 hours later something happened with the networking, as I could not even ping the node at 10.6.135.250. I decided to reboot it because that would help me understand the scope of the problem.

$ ping 10.6.135.250
PING 10.6.135.250 (10.6.135.250) 56(84) bytes of data.
^C
--- 10.6.135.250 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3070ms

Since this node was a virtual machine, I rebooted it gracefully with the virsh command.
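For reference, here is a minimal sketch of that graceful reboot, assuming the libvirt domain name matches the node name nvd-srv-31-vm-1 (the actual domain name on your hypervisor may differ):

$ virsh shutdown nvd-srv-31-vm-1    # ask the guest to shut down cleanly
$ virsh list --all                  # wait until the domain shows as "shut off"
$ virsh start nvd-srv-31-vm-1       # boot the guest back up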

The Recovery Process

Once the node came back up I could see it had obtained a new DHCP address, which meant the address it previously had, 10.6.135.250, was no longer available. Most of the containers were able to launch on the node without issue.

$ oc get nodes -o wide
NAME              STATUS   ROLES                         AGE   VERSION            INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                 KERNEL-VERSION                 CONTAINER-RUNTIME
nvd-srv-31-vm-1   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.245   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-2   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.243   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-3   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.244   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9

However, I knew etcd would have a problem with the IP address change because etcd hard codes the member IP addresses in its configuration to form the quorum of the etcd cluster. With that in mind, I first wanted to check whether the etcd container was crashing on node nvd-srv-31-vm-1. I will switch into the openshift-etcd project first so that I can skip passing the namespace on the commands that follow.

$ oc project openshift-etcd
Now using project "openshift-etcd" on server "https://api.doca2.nvidia.eng.rdu2.dc.redhat.com:6443"

$ oc get pods -l k8s-app=etcd
NAME                   READY   STATUS                  RESTARTS       AGE
etcd-nvd-srv-31-vm-1   0/4     Init:CrashLoopBackOff   12 (17s ago)   48d
etcd-nvd-srv-31-vm-2   4/4     Running                 8              48d
etcd-nvd-srv-31-vm-3   4/4     Running                 8              48d

Sure enough the container was crashing, so let's rsh into a running etcd pod such as etcd-nvd-srv-31-vm-2. Inside we can use the etcdctl command to list the members.

$ oc rsh etcd-nvd-srv-31-vm-2
sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
| e33638d3b94e9016 | started | nvd-srv-31-vm-1 | https://10.6.135.250:2380 | https://10.6.135.250:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+

We can see that the nvd-srv-31-vm-1 member still has the old IP address of 10.6.135.250. Let's go ahead and remove it using the etcdctl command and then display the remaining members.
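As an aside, if you would rather not copy the member ID by hand, a small sketch like the following can pull it out based on the stale peer address. It assumes grep and cut are available in the etcd container and relies on the default simple output of etcdctl member list, where the ID is the first comma-separated field:

sh-5.1# etcdctl member list | grep 10.6.135.250 | cut -d',' -f1   # prints e33638d3b94e9016 in this case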

sh-5.1# etcdctl member remove e33638d3b94e9016
Member e33638d3b94e9016 removed from cluster f0be7a9595f9ce77

sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
sh-5.1# exit

Now that the old etcd member for nvd-srv-31-vm-1 has been removed, we need to temporarily patch the etcd cluster into an unsupported state.

$ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}'
etcd.operator.openshift.io/cluster patched
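If you want to confirm the override actually landed before moving on, a quick read-back of the spec works. This is just a sketch and the formatting of the printed value may vary by client version:

$ oc get etcd/cluster -o jsonpath='{.spec.unsupportedConfigOverrides}{"\n"}'   # should show the override we just set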

With the etcd cluster patched we need to find all secrets related to nvd-srv-31-vm-1. There should only be three at the time of this writing.

$ oc get secret | grep nvd-srv-31-vm-1
etcd-peer-nvd-srv-31-vm-1              kubernetes.io/tls   2      48d
etcd-serving-metrics-nvd-srv-31-vm-1   kubernetes.io/tls   2      48d
etcd-serving-nvd-srv-31-vm-1           kubernetes.io/tls   2      48d

We can delete each of those secrets since they will be regenerated once we do.

$ oc delete secret etcd-peer-nvd-srv-31-vm-1
secret "etcd-peer-nvd-srv-31-vm-1" deleted

$ oc delete secret etcd-serving-metrics-nvd-srv-31-vm-1
secret "etcd-serving-metrics-nvd-srv-31-vm-1" deleted

$ oc delete secret etcd-serving-nvd-srv-31-vm-1
secret "etcd-serving-nvd-srv-31-vm-1" deleted
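As a convenience, the three deletes could also be collapsed into a single line. This sketch assumes the grep on the node name only matches the three etcd secrets shown above:

$ oc get secret -o name | grep nvd-srv-31-vm-1 | xargs oc delete   # removes the etcd-peer, etcd-serving-metrics and etcd-serving secrets for the node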

With the secrets removed, we can list the secrets for nvd-srv-31-vm-1 again and see that they have been recreated.

$ oc get secret | grep nvd-srv-31-vm-1
NAME                                   TYPE                DATA   AGE
etcd-peer-nvd-srv-31-vm-1              kubernetes.io/tls   2      20s
etcd-serving-metrics-nvd-srv-31-vm-1   kubernetes.io/tls   2      11s
etcd-serving-nvd-srv-31-vm-1           kubernetes.io/tls   2      1s

Now let's double check the etcdctl member list again just to confirm we still only have two members.

$ oc rsh etcd-nvd-srv-31-vm-2
sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
sh-5.1# exit

Next we will need to approve a certificate signing request (CSR) for the nvd-srv-31-vm-1 node. Remember that we removed its original secrets.

$ oc get csr
NAME        AGE   SIGNERNAME                      REQUESTOR                     REQUESTEDDURATION   CONDITION
csr-sjjxv   12m   kubernetes.io/kubelet-serving   system:node:nvd-srv-31-vm-1   <none>              Pending

$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/csr-sjjxv approved

We can validate the certificate was approved.

$ oc get csr
NAME        AGE   SIGNERNAME                      REQUESTOR                     REQUESTEDDURATION   CONDITION
csr-sjjxv   13m   kubernetes.io/kubelet-serving   system:node:nvd-srv-31-vm-1   <none>              Approved,Issued
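It is also worth confirming at this point that the node object itself reports the new address. A quick sketch of that check, where the jsonpath filter simply pulls the InternalIP entry out of the node status:

$ oc get node nvd-srv-31-vm-1 -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}'   # should print 10.6.135.245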

Next we will go back into one of the running etcd containers. I will rsh into etcd-nvd-srv-31-vm-2 again. Here I will check the endpoint health and list the member table once more.

$ oc rsh etcd-nvd-srv-31-vm-2
sh-5.1# etcdctl endpoint health --cluster
https://10.6.135.243:2379 is healthy: successfully committed proposal: took = 5.356332ms
https://10.6.135.244:2379 is healthy: successfully committed proposal: took = 7.730393ms

sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+

At this point I want to add the nvd-srv-31-vm-1 member back, but with the appropriate new IP address of 10.6.135.245.

sh-5.1# etcdctl member add nvd-srv-31-vm-1 --peer-urls="https://10.6.135.245:2380"
Member a4b9266380f688f4 added to cluster f0be7a9595f9ce77

ETCD_NAME="nvd-srv-31-vm-1"
ETCD_INITIAL_CLUSTER="nvd-srv-31-vm-2=https://10.6.135.243:2380,nvd-srv-31-vm-3=https://10.6.135.244:2380,nvd-srv-31-vm-1=https://10.6.135.245:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.6.135.245:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

The ETCD_* values printed above should not need to be applied by hand in OpenShift, since the cluster-etcd-operator manages the etcd static pod configuration. We can then use etcdctl again to list all the members and confirm our node is now listed with the correct IP address.

sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
| a4b9266380f688f4 | started | nvd-srv-31-vm-1 | https://10.6.135.245:2380 | https://10.6.135.245:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+

Finally we can remove the unsupported override patch.

$ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": null }}'
etcd.operator.openshift.io/cluster patched

And lastly we can verify the etcd containers are running on the node properly.

$ oc get pods |grep nvd-srv-31-vm-1 |grep etcd
etcd-guard-nvd-srv-31-vm-1   1/1   Running   0     85m
etcd-nvd-srv-31-vm-1         4/4   Running   0     56m
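For a final sanity check, I also like to confirm the etcd cluster operator is happy and that all three endpoints now report healthy. A sketch of those checks, run with the same oc context used above:

$ oc get clusteroperator etcd                                      # should show AVAILABLE True and DEGRADED False
$ oc rsh etcd-nvd-srv-31-vm-2 etcdctl endpoint health --cluster    # all three endpoints should now report healthy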

Hopefully this provides a good level of detail for when you need to change the IP address on an OpenShift control-plane node. Keep in mind this process shouldn't be used without engaging Red Hat support.