Original IP Address Configuration
This is a snapshot of my original configuration. The node involved in the address change ends up being nvd-srv-31-vm-1.
$ oc get nodes -o wide
NAME                                       STATUS   ROLES                         AGE   VERSION            INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                KERNEL-VERSION                 CONTAINER-RUNTIME
nvd-srv-31-vm-1                            Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.250   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-2                            Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.243   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-3                            Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.244   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
Issues Arise
On a Friday, because it always happens on a Friday, one of my colleagues told me that node nvd-srv-31-vm-1 had become unhealthy. When I took a look I could see that a bunch of pods were unable to deploy, and I could not launch a debug pod for the node itself. The day before, I had a conversation with someone on our networking team who was unhappy that the DHCP scope included 10.6.135.250. I mentioned that my host was using that address and that we could not change the IP address at the moment since it belonged to an active OpenShift cluster. However, 24 hours later something happened with the networking and I could not even ping the node at 10.6.135.250. I decided to reboot it because that would help me understand the scope of the problem.
$ ping 10.6.135.250
PING 10.6.135.250 (10.6.135.250) 56(84) bytes of data.
^C
--- 10.6.135.250 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3070ms
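For reference, the debug pod attempt mentioned above was the standard oc debug invocation, which never produced a running pod on the node (command shown here without output):
$ oc debug node/nvd-srv-31-vm-1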
Since this node was a virtual machine, I rebooted it gracefully with the virsh command.
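A graceful reboot through libvirt looks roughly like the following sketch; the domain name is an assumption (I am assuming it matches the node name), so confirm it with virsh list first:
$ virsh list --all   # confirm the actual domain name; assumed here to match the node name
$ virsh reboot nvd-srv-31-vm-1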
The Recovery Process
Once the node came back up, I could see it had obtained a new DHCP address, which meant its previous address, 10.6.135.250, was no longer available to it. Most of the containers were able to launch on the node without issue.
$ oc get nodes -o wide
NAME                                       STATUS   ROLES                         AGE   VERSION            INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                KERNEL-VERSION                 CONTAINER-RUNTIME
nvd-srv-31-vm-1                            Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.245   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-2                            Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.243   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-3                            Ready    control-plane,master,worker   48d   v1.29.10+67d3387   10.6.135.244   <none>        Red Hat Enterprise Linux CoreOS 416.94.202411261619-0   5.14.0-427.47.1.el9_4.x86_64   cri-o://1.29.10-3.rhaos4.16.git319967e.el9
However, I knew etcd would have a problem with the IP address change because etcd hard codes the member IP addresses in its configuration to form the quorum of the etcd cluster. With that in mind, I first wanted to check whether the etcd container was crashing on node nvd-srv-31-vm-1. I will switch into the openshift-etcd project so that I can skip passing the namespace for the rest of the commands.
$ oc project openshift-etcd
Now using project "openshift-etcd" on server "https://api.doca2.nvidia.eng.rdu2.dc.redhat.com:6443"
$ oc get pods -l k8s-app=etcd
NAME                   READY   STATUS                  RESTARTS       AGE
etcd-nvd-srv-31-vm-1   0/4     Init:CrashLoopBackOff   12 (17s ago)   48d
etcd-nvd-srv-31-vm-2   4/4     Running                 8              48d
etcd-nvd-srv-31-vm-3   4/4     Running                 8              48d
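If you want to see exactly which container is failing and why, describing the pod is a useful optional check (not strictly required for the procedure below):
$ oc describe pod etcd-nvd-srv-31-vm-1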
Sure enough, the container was crashing, so let's rsh into a running etcd pod such as etcd-nvd-srv-31-vm-2. Inside we can use the etcdctl command to list the members.
$ oc rsh etcd-nvd-srv-31-vm-2 
sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
| e33638d3b94e9016 | started | nvd-srv-31-vm-1 | https://10.6.135.250:2380 | https://10.6.135.250:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
We can see that the nvd-srv-31-vm-1 member still has the old IP address of 10.6.135.250. Let's remove this member with etcdctl and then display the remaining members.
sh-5.1# etcdctl member remove e33638d3b94e9016
Member e33638d3b94e9016 removed from cluster f0be7a9595f9ce77
sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
sh-5.1# exit
Now that the old etcd member for nvd-srv-31-vm-1 is removed, we need to temporarily patch the etcd cluster into an unsupported state.
$ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}'
etcd.operator.openshift.io/cluster patched
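If you want to confirm the override was applied before continuing, an optional quick check is to read the field back:
$ oc get etcd/cluster -o jsonpath='{.spec.unsupportedConfigOverrides}'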
With the etcd cluster patched, we need to find all the secrets related to nvd-srv-31-vm-1. There should be only three at the time of this writing.
$ oc get secret | grep nvd-srv-31-vm-1
etcd-peer-nvd-srv-31-vm-1              kubernetes.io/tls   2      48d
etcd-serving-metrics-nvd-srv-31-vm-1   kubernetes.io/tls   2      48d
etcd-serving-nvd-srv-31-vm-1           kubernetes.io/tls   2      48d
We can remove each of those secrets; they will be regenerated automatically once deleted.
$ oc delete secret etcd-peer-nvd-srv-31-vm-1
secret "etcd-peer-nvd-srv-31-vm-1" deleted
$ oc delete secret etcd-serving-metrics-nvd-srv-31-vm-1
secret "etcd-serving-metrics-nvd-srv-31-vm-1" deleted
$ oc delete secret etcd-serving-nvd-srv-31-vm-1
secret "etcd-serving-nvd-srv-31-vm-1" deleted
With the secrets removed, we can list the secrets for nvd-srv-31-vm-1 again and see they have been recreated.
$ oc get secret | grep nvd-srv-31-vm-1
NAME                                   TYPE                DATA   AGE
etcd-peer-nvd-srv-31-vm-1              kubernetes.io/tls   2      20s
etcd-serving-metrics-nvd-srv-31-vm-1   kubernetes.io/tls   2      11s
etcd-serving-nvd-srv-31-vm-1           kubernetes.io/tls   2      1s
Now let's double-check the etcdctl member list to confirm we still have only two members.
$ oc rsh etcd-nvd-srv-31-vm-2 
sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
sh-5.1# exit
Next we will need to approve a pending certificate signing request (CSR) for the nvd-srv-31-vm-1 node; remember that we removed its original secrets.
$ oc get csr
NAME        AGE    SIGNERNAME                            REQUESTOR                                                REQUESTEDDURATION   CONDITION
csr-sjjxv   12m    kubernetes.io/kubelet-serving         system:node:nvd-srv-31-vm-1                              <none>              Pending
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/csr-sjjxv approved
We can validate that the CSR was approved.
$ oc get csr
NAME        AGE     SIGNERNAME                            REQUESTOR                                                REQUESTEDDURATION   CONDITION
csr-sjjxv   13m     kubernetes.io/kubelet-serving         system:node:nvd-srv-31-vm-1                              <none>              Approved,Issued
Next we will go back into one of the running etcd containers. I will rsh into etcd-nvd-srv-31-vm-2 again, and there I will check the endpoint health and list the member table.
$ oc rsh etcd-nvd-srv-31-vm-2 
sh-5.1# etcdctl endpoint health --cluster
https://10.6.135.243:2379 is healthy: successfully committed proposal: took = 5.356332ms
https://10.6.135.244:2379 is healthy: successfully committed proposal: took = 7.730393ms
sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
At this point I want to add the nvd-srv-31-vm-1 member back, but with its new IP address of 10.6.135.245.
sh-5.1# etcdctl member add nvd-srv-31-vm-1 --peer-urls="https://10.6.135.245:2380"
Member a4b9266380f688f4 added to cluster f0be7a9595f9ce77
ETCD_NAME="nvd-srv-31-vm-1"
ETCD_INITIAL_CLUSTER="nvd-srv-31-vm-2=https://10.6.135.243:2380,nvd-srv-31-vm-3=https://10.6.135.244:2380,nvd-srv-31-vm-1=https://10.6.135.245:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.6.135.245:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
We can then use etcdctl again to list all the members and confirm our node is now listed with the correct IP address.
sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME       |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
|  aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 |      false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 |      false |
| a4b9266380f688f4 | started | nvd-srv-31-vm-1 | https://10.6.135.245:2380 | https://10.6.135.245:2379 |      false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
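Once the new member finishes syncing and its pod starts, the same health check used earlier should eventually report all three endpoints as healthy (a verification step worth repeating):
sh-5.1# etcdctl endpoint health --cluster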
Finally we can remove the unsupported override patch.
$ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": null }}'
etcd.operator.openshift.io/cluster patched
Lastly, we can verify the etcd containers are running properly on the node.
$ oc get pods | grep nvd-srv-31-vm-1 | grep etcd
etcd-guard-nvd-srv-31-vm-1           1/1     Running     0          85m
etcd-nvd-srv-31-vm-1                 4/4     Running     0          56m
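As a final sanity check, it is also worth confirming that the etcd cluster operator reports Available and not Degraded (an extra step beyond what is shown above):
$ oc get co etcd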
Hopefully this provides a good level of detail for anyone needing to change the IP address on an OpenShift control plane node. Keep in mind this process should not be used without engaging Red Hat support.
