Original IP Address Configuration
This is a snapshot of my original configuration. The node involved in the address change ends up being nvd-srv-31-vm-1.
$ oc get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
nvd-srv-31-vm-1 Ready control-plane,master,worker 48d v1.29.10+67d3387 10.6.135.250 <none> Red Hat Enterprise Linux CoreOS 416.94.202411261619-0 5.14.0-427.47.1.el9_4.x86_64 cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-2 Ready control-plane,master,worker 48d v1.29.10+67d3387 10.6.135.243 <none> Red Hat Enterprise Linux CoreOS 416.94.202411261619-0 5.14.0-427.47.1.el9_4.x86_64 cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-3 Ready control-plane,master,worker 48d v1.29.10+67d3387 10.6.135.244 <none> Red Hat Enterprise Linux CoreOS 416.94.202411261619-0 5.14.0-427.47.1.el9_4.x86_64 cri-o://1.29.10-3.rhaos4.16.git319967e.el9
Issues Arise
On a Friday, because it always happens on a Friday, one of my colleagues reported that node nvd-srv-31-vm-1 had become unhealthy. When I took a look, I could see that a number of pods were unable to deploy, and I could not launch a debug pod for the node itself. The day before, I had spoken with someone on our networking team who was unhappy that the DHCP scope included 10.6.135.250. I mentioned that my host held that address and that we could not change the IP address at the time since it was part of an active OpenShift cluster. However, 24 hours later something happened on the network side, and I could no longer even ping the node at 10.6.135.250. I decided to reboot it, because that would help me understand the scope of the problem.
$ ping 10.6.135.250
PING 10.6.135.250 (10.6.135.250) 56(84) bytes of data.
^C
--- 10.6.135.250 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3070ms
Since this node was a virtual machine, I rebooted it gracefully with the virsh command.
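The reboot step can be sketched as a small helper. This is a minimal sketch only; the domain name passed in is an assumption, so substitute whatever `virsh list --all` reports for your guest:

```shell
# Minimal sketch of a graceful libvirt reboot. The domain name passed in
# is an assumption -- use the name reported by `virsh list --all`.
graceful_reboot() {
  domain="$1"
  virsh shutdown "$domain"          # send an ACPI shutdown to the guest
  until virsh domstate "$domain" | grep -q 'shut off'; do
    sleep 5                         # wait for the guest to power off
  done
  virsh start "$domain"             # boot it back up
}
# Usage: graceful_reboot nvd-srv-31-vm-1
```

Waiting for the `shut off` state before starting again avoids interrupting the guest's shutdown sequence.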
The Recovery Process
Once the node came back up, I could see it had obtained a new DHCP address, which meant its previous address, 10.6.135.250, was no longer available. Most of the containers were able to launch on the node without issue.
$ oc get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
nvd-srv-31-vm-1 Ready control-plane,master,worker 48d v1.29.10+67d3387 10.6.135.245 <none> Red Hat Enterprise Linux CoreOS 416.94.202411261619-0 5.14.0-427.47.1.el9_4.x86_64 cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-2 Ready control-plane,master,worker 48d v1.29.10+67d3387 10.6.135.243 <none> Red Hat Enterprise Linux CoreOS 416.94.202411261619-0 5.14.0-427.47.1.el9_4.x86_64 cri-o://1.29.10-3.rhaos4.16.git319967e.el9
nvd-srv-31-vm-3 Ready control-plane,master,worker 48d v1.29.10+67d3387 10.6.135.244 <none> Red Hat Enterprise Linux CoreOS 416.94.202411261619-0 5.14.0-427.47.1.el9_4.x86_64 cri-o://1.29.10-3.rhaos4.16.git319967e.el9
However, I knew etcd would have a problem with the IP address change, because etcd hard-codes the member IP addresses in its configuration to form the quorum of the etcd cluster. With that in mind, I first wanted to check whether the etcd container was crashing on node nvd-srv-31-vm-1. I will switch into the openshift-etcd project first, so I can skip passing the namespace to all subsequent commands.
$ oc project openshift-etcd
Now using project "openshift-etcd" on server "https://api.doca2.nvidia.eng.rdu2.dc.redhat.com:6443"
$ oc get pods -l k8s-app=etcd
NAME READY STATUS RESTARTS AGE
etcd-nvd-srv-31-vm-1 0/4 Init:CrashLoopBackOff 12 (17s ago) 48d
etcd-nvd-srv-31-vm-2 4/4 Running 8 48d
etcd-nvd-srv-31-vm-3 4/4 Running 8 48d
Sure enough, the container was crashing, so let's rsh into a running etcd container, such as the one on nvd-srv-31-vm-2. Inside, we can use the etcdctl command to list the members.
$ oc rsh etcd-nvd-srv-31-vm-2
sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
| aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 | false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 | false |
| e33638d3b94e9016 | started | nvd-srv-31-vm-1 | https://10.6.135.250:2380 | https://10.6.135.250:2379 | false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
We can see that the nvd-srv-31-vm-1 member still has the old IP address of 10.6.135.250. Let's go ahead and remove it using the etcdctl command and then display the remaining members.
sh-5.1# etcdctl member remove e33638d3b94e9016
Member e33638d3b94e9016 removed from cluster f0be7a9595f9ce77
sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
| aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 | false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 | false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
sh-5.1# exit
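As an aside, the stale member ID can also be pulled out programmatically rather than read off the table. A minimal sketch, assuming the default `-w table` output format, demonstrated here against a captured sample of that output (on a live cluster you would pipe `etcdctl member list -w table` straight into the awk step):

```shell
# OLD_IP is an assumption -- set it to the address being retired.
OLD_IP="10.6.135.250"
# Captured sample in the same format as `etcdctl member list -w table`.
member_table='| aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 | false |
| e33638d3b94e9016 | started | nvd-srv-31-vm-1 | https://10.6.135.250:2380 | https://10.6.135.250:2379 | false |'
# Split on the table's pipe delimiters: field 2 is the member ID and
# field 5 is the peer URL; print the ID whose peer URL holds the old IP.
stale_id=$(printf '%s\n' "$member_table" | awk -F'|' -v ip="$OLD_IP" \
  '$5 ~ ip { gsub(/ /, "", $2); print $2 }')
echo "$stale_id"   # the ID to pass to: etcdctl member remove <id>
```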
Now that the old etcd member for nvd-srv-31-vm-1 has been removed, we need to temporarily patch the etcd cluster into an unsupported state.
$ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}'
etcd.operator.openshift.io/cluster patched
With the etcd cluster patched, we need to find all secrets related to nvd-srv-31-vm-1. At the time of this writing there should be only three.
$ oc get secret | grep nvd-srv-31-vm-1
etcd-peer-nvd-srv-31-vm-1 kubernetes.io/tls 2 48d
etcd-serving-metrics-nvd-srv-31-vm-1 kubernetes.io/tls 2 48d
etcd-serving-nvd-srv-31-vm-1 kubernetes.io/tls 2 48d
We can safely delete each of those secrets; they will be regenerated automatically.
$ oc delete secret etcd-peer-nvd-srv-31-vm-1
secret "etcd-peer-nvd-srv-31-vm-1" deleted
$ oc delete secret etcd-serving-metrics-nvd-srv-31-vm-1
secret "etcd-serving-metrics-nvd-srv-31-vm-1" deleted
$ oc delete secret etcd-serving-nvd-srv-31-vm-1
secret "etcd-serving-nvd-srv-31-vm-1" deleted
With the secrets removed, we can list the secrets for nvd-srv-31-vm-1 again and see that they have been recreated.
$ oc get secret | grep nvd-srv-31-vm-1
NAME TYPE DATA AGE
etcd-peer-nvd-srv-31-vm-1 kubernetes.io/tls 2 20s
etcd-serving-metrics-nvd-srv-31-vm-1 kubernetes.io/tls 2 11s
etcd-serving-nvd-srv-31-vm-1 kubernetes.io/tls 2 1s
Now let's double-check the etcdctl member list to confirm we still have only two members.
$ oc rsh etcd-nvd-srv-31-vm-2
sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
| aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 | false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 | false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
sh-5.1# exit
Next we will need to approve a certificate signing request (CSR) for the nvd-srv-31-vm-1 node. Remember, we removed its original secrets.
$ oc get csr
NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
csr-sjjxv 12m kubernetes.io/kubelet-serving system:node:nvd-srv-31-vm-1 <none> Pending
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/csr-sjjxv approved
We can validate the certificate was approved.
$ oc get csr
NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
csr-sjjxv 13m kubernetes.io/kubelet-serving system:node:nvd-srv-31-vm-1 <none> Approved,Issued
Next we will go back into one of the running etcd containers. I will rsh into etcd-nvd-srv-31-vm-2 again and check the endpoint health and the member list.
$ oc rsh etcd-nvd-srv-31-vm-2
sh-5.1# etcdctl endpoint health --cluster
https://10.6.135.243:2379 is healthy: successfully committed proposal: took = 5.356332ms
https://10.6.135.244:2379 is healthy: successfully committed proposal: took = 7.730393ms
sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
| aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 | false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 | false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
At this point I want to add the nvd-srv-31-vm-1 member back, but with its new IP address of 10.6.135.245.
sh-5.1# etcdctl member add nvd-srv-31-vm-1 --peer-urls="https://10.6.135.245:2380"
Member a4b9266380f688f4 added to cluster f0be7a9595f9ce77
ETCD_NAME="nvd-srv-31-vm-1"
ETCD_INITIAL_CLUSTER="nvd-srv-31-vm-2=https://10.6.135.243:2380,nvd-srv-31-vm-3=https://10.6.135.244:2380,nvd-srv-31-vm-1=https://10.6.135.245:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.6.135.245:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
We can then use etcdctl again to list all the members and confirm our node is now listed with the correct IP address.
sh-5.1# etcdctl member list -w table
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
| aad12dcf43e0b21 | started | nvd-srv-31-vm-2 | https://10.6.135.243:2380 | https://10.6.135.243:2379 | false |
| 3da9efb85d7b0420 | started | nvd-srv-31-vm-3 | https://10.6.135.244:2380 | https://10.6.135.244:2379 | false |
| a4b9266380f688f4 | started | nvd-srv-31-vm-1 | https://10.6.135.245:2380 | https://10.6.135.245:2379 | false |
+------------------+---------+-----------------+---------------------------+---------------------------+------------+
Finally, we can remove the unsupported configuration override patch.
$ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": null }}'
etcd.operator.openshift.io/cluster patched
And lastly, we can verify that the etcd containers are running properly on the node.
$ oc get pods |grep nvd-srv-31-vm-1 |grep etcd
etcd-guard-nvd-srv-31-vm-1 1/1 Running 0 85m
etcd-nvd-srv-31-vm-1 4/4 Running 0 56m
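A final sanity check can be wrapped up as a reusable helper. A sketch only, assuming an active oc login and the same etcd pod name used throughout this post:

```shell
# Sketch of a post-recovery sanity check. Assumes `oc` is logged in and
# that etcd-nvd-srv-31-vm-2 is still a running etcd pod.
check_etcd_health() {
  # The etcd cluster operator should report Available and not Degraded.
  oc get clusteroperator etcd
  # All three endpoints should now report healthy.
  oc rsh -n openshift-etcd etcd-nvd-srv-31-vm-2 \
    etcdctl endpoint health --cluster
}
# Usage: check_etcd_health
```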
Hopefully this provides a good level of detail for changing the IP address on an OpenShift control plane node. Keep in mind that this process should not be used without engaging Red Hat support.