Wednesday, April 02, 2025

Change IP Address of an OpenShift Control Node


My OpenShift 4.16.25 nodes were using DHCP for their IP addresses. However, the DHCP scope changed and one of my nodes, which had been using 10.6.135.250, was no longer able to get that address. Instead the node received 10.6.135.245. Anyone who has worked with OpenShift knows that an IP address change will impact etcd. In the following I want to walk through the steps to recover from this situation without reinstalling OpenShift. However, I also want to caution that this is for academic purposes; if this happens in a real production environment, be a hero and open a case with Red Hat support.

Original IP Address Configuration

This is a snapshot of my original configuration. The node involved in the address change ends up being nvd-srv-31-vm-1.

$ oc get nodes -o wide
NAME              STATUS   ROLES                         AGE   VERSION            INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                 KERNEL-VERSION                 CONTAINER-RUNTIME
nvd-srv-31-vm-1   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.250   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-2   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.243   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-3   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.244   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9

Issues Arise

On a Friday, because it always happens on a Friday, one of my colleagues told me that node nvd-srv-31-vm-1 had become unhealthy. When I took a look I could see that a bunch of pods were unable to deploy, and I could not launch a debug pod for the node itself. The day before, I had a conversation with someone on our networking team who was mad that the DHCP scope included 10.6.135.250. I mentioned that my host had that address and that we could not change it at the time since it was part of an active OpenShift cluster. However, 24 hours later something happened with the networking, as I could not even ping the node at 10.6.135.250. I decided to reboot it because that would help me understand the scope of the problem.

$ ping 10.6.135.250
PING 10.6.135.250 (10.6.135.250) 56(84) bytes of data.
^C
--- 10.6.135.250 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3070ms

Since this node was a virtual machine, I rebooted it gracefully with the virsh command.
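For reference, here is a minimal sketch of that graceful reboot, assuming the libvirt domain name matches the node name nvd-srv-31-vm-1 (the actual domain name on your hypervisor may differ):

$ virsh shutdown nvd-srv-31-vm-1    # ask the guest to shut down cleanly
$ virsh list --all                  # wait until the domain shows as "shut off"
$ virsh start nvd-srv-31-vm-1       # boot the guest back up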

The Recovery Process

Once the node came back up I could see it had obtained a new DHCP address, which meant the address it previously had, 10.6.135.250, was no longer available. Most of the containers were able to launch on the node without issue.

$ oc get nodes -o wide
NAME              STATUS   ROLES                         AGE   VERSION            INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                 KERNEL-VERSION                 CONTAINER-RUNTIME
nvd-srv-31-vm-1   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.245   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-2   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.243   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-3   Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.244   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9

However, I knew etcd would have a problem with the IP address change because etcd hard codes the member IP addresses in its configuration to form the quorum of the etcd cluster. With that in mind, I first wanted to check whether the etcd container was crashing on node nvd-srv-31-vm-1. I will switch into the openshift-etcd project first so that I can skip passing the namespace on the commands that follow.

$ oc project openshift-etcd
Now using project "openshift-etcd" on server "https://api.doca2.nvidia.eng.rdu2.dc.redhat.com:6443"

$ oc get pods -l k8s-app=etcd
NAME                   READY   STATUS                  RESTARTS       AGE
etcd-nvd-srv-31-vm-1   0/4     Init:CrashLoopBackOff   12 (17s ago)   48d
etcd-nvd-srv-31-vm-2   4/4     Running                 8              48d
etcd-nvd-srv-31-vm-3   4/4     Running                 8              48d

Sure enough the container was crashing, so let's rsh into a running etcd pod such as etcd-nvd-srv-31-vm-2. Inside we can use the etcdctl command to list the members.

$ oc rsh etcd-nvd-srv-31-vm-2
sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
| e33638d3b94e9016 | started | nvd-srv-31-vm-1 | https://10.6.135.250:2380 | https://10.6.135.250:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+

We can see that the nvd-srv-31-vm-1 member still has the old IP address of 10.6.135.250. Let's go ahead and remove it using the etcdctl command and then display the remaining members.
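As an aside, if you would rather not copy the member ID by hand, a small sketch like the following can pull it out based on the stale peer address. It assumes grep and cut are available in the etcd container and relies on the default simple output of etcdctl member list, where the ID is the first comma-separated field:

sh-5.1# etcdctl member list | grep 10.6.135.250 | cut -d',' -f1   # prints e33638d3b94e9016 in this case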

sh-5.1# etcdctl member remove e33638d3b94e9016
Member e33638d3b94e9016 removed from cluster f0be7a9595f9ce77

sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
sh-5.1# exit

Now that the old etcd member for nvd-srv-31-vm-1 has been removed, we need to temporarily patch the etcd cluster into an unsupported state.

$ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}'
etcd.operator.openshift.io/cluster patched
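If you want to confirm the override actually landed before moving on, a quick read-back of the spec works. This is just a sketch and the formatting of the printed value may vary by client version:

$ oc get etcd/cluster -o jsonpath='{.spec.unsupportedConfigOverrides}{"\n"}'   # should show the override we just set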

With the etcd cluster patched we need to find all secrets related to nvd-srv-31-vm-1. There should only be three at the time of this writing.

$ oc get secret | grep nvd-srv-31-vm-1
etcd-peer-nvd-srv-31-vm-1              kubernetes.io/tls   2      48d
etcd-serving-metrics-nvd-srv-31-vm-1   kubernetes.io/tls   2      48d
etcd-serving-nvd-srv-31-vm-1           kubernetes.io/tls   2      48d

We can delete each of those secrets since they will be regenerated once we do.

$ oc delete secret etcd-peer-nvd-srv-31-vm-1
secret "etcd-peer-nvd-srv-31-vm-1" deleted

$ oc delete secret etcd-serving-metrics-nvd-srv-31-vm-1
secret "etcd-serving-metrics-nvd-srv-31-vm-1" deleted

$ oc delete secret etcd-serving-nvd-srv-31-vm-1
secret "etcd-serving-nvd-srv-31-vm-1" deleted
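As a convenience, the three deletes could also be collapsed into a single line. This sketch assumes the grep on the node name only matches the three etcd secrets shown above:

$ oc get secret -o name | grep nvd-srv-31-vm-1 | xargs oc delete   # removes the etcd-peer, etcd-serving-metrics and etcd-serving secrets for the node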

With the secrets removed, we can list the secrets for nvd-srv-31-vm-1 again and see that they have been recreated.

$ oc get secret | grep nvd-srv-31-vm-1
NAME                                   TYPE                DATA   AGE
etcd-peer-nvd-srv-31-vm-1              kubernetes.io/tls   2      20s
etcd-serving-metrics-nvd-srv-31-vm-1   kubernetes.io/tls   2      11s
etcd-serving-nvd-srv-31-vm-1           kubernetes.io/tls   2      1s

Now let's double check the etcdctl member list again just to confirm we still only have two members.

$ oc rsh etcd-nvd-srv-31-vm-2
sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
sh-5.1# exit

Next we will need to approve a certificate signing request (CSR) for the nvd-srv-31-vm-1 node. Remember that we removed its original secrets.

$ oc get csr
NAME        AGE   SIGNERNAME                      REQUESTOR                     REQUESTEDDURATION   CONDITION
csr-sjjxv   12m   kubernetes.io/kubelet-serving   system:node:nvd-srv-31-vm-1   <none>              Pending

$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/csr-sjjxv approved

We can validate the certificate was approved.

$ oc get csr
NAME        AGE   SIGNERNAME                      REQUESTOR                     REQUESTEDDURATION   CONDITION
csr-sjjxv   13m   kubernetes.io/kubelet-serving   system:node:nvd-srv-31-vm-1   <none>              Approved,Issued
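It is also worth confirming at this point that the node object itself reports the new address. A quick sketch of that check, where the jsonpath filter simply pulls the InternalIP entry out of the node status:

$ oc get node nvd-srv-31-vm-1 -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}'   # should print 10.6.135.245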

Next we will go back into one of the running etcd containers. I will rsh into etcd-nvd-srv-31-vm-2 again. Here I will check the endpoint health and list the member table once more.

$ oc rsh etcd-nvd-srv-31-vm-2
sh-5.1# etcdctl endpoint health --cluster
https://10.6.135.243:2379 is healthy: successfully committed proposal: took = 5.356332ms
https://10.6.135.244:2379 is healthy: successfully committed proposal: took = 7.730393ms

sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+

At this point I want to add the nvd-srv-31-vm-1 member back, but with the appropriate new IP address of 10.6.135.245.

sh-5.1# etcdctl member add nvd-srv-31-vm-1 --peer-urls="https://10.6.135.245:2380"
Member a4b9266380f688f4 added to cluster f0be7a9595f9ce77

ETCD_NAME="nvd-srv-31-vm-1"
ETCD_INITIAL_CLUSTER="nvd-srv-31-vm-2=https://10.6.135.243:2380,nvd-srv-31-vm-3=https://10.6.135.244:2380,nvd-srv-31-vm-1=https://10.6.135.245:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.6.135.245:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

The ETCD_* values printed above should not need to be applied by hand in OpenShift, since the cluster-etcd-operator manages the etcd static pod configuration. We can then use etcdctl again to list all the members and confirm our node is now listed with the correct IP address.

sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
| a4b9266380f688f4 | started | nvd-srv-31-vm-1 | https://10.6.135.245:2380 | https://10.6.135.245:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+

Finally we can remove the unsupported override patch.

$ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": null }}'
etcd.operator.openshift.io/cluster patched

And lastly we can verify the etcd containers are running on the node properly.

$ oc get pods |grep nvd-srv-31-vm-1 |grep etcd
etcd-guard-nvd-srv-31-vm-1   1/1   Running   0     85m
etcd-nvd-srv-31-vm-1         4/4   Running   0     56m
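For a final sanity check, I also like to confirm the etcd cluster operator is happy and that all three endpoints now report healthy. A sketch of those checks, run with the same oc context used above:

$ oc get clusteroperator etcd                                      # should show AVAILABLE True and DEGRADED False
$ oc rsh etcd-nvd-srv-31-vm-2 etcdctl endpoint health --cluster    # all three endpoints should now report healthy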

Hopefully this provides a good level of detail for when you need to change the IP address on an OpenShift control-plane node. Keep in mind this process shouldn't be used without engaging Red Hat support.