If you have been reading some of my recent articles on Rook, you have seen how to install a Ceph cluster with Rook on Kubernetes. This article builds on that installation and discusses how to replace a failed OSD in the Ceph cluster.
First, let's review our currently running Ceph cluster by looking at the pods in the rook-ceph-system and rook-ceph namespaces and, from inside the toolbox, the Ceph status:
# kubectl get pods --all-namespaces -o wide
NAMESPACE          NAME                                      READY   STATUS      RESTARTS   AGE    IP            NODE          NOMINATED NODE   READINESS GATES
kube-system        coredns-86c58d9df4-22fps                  1/1     Running     4          3d2h   10.244.3.55   kube-node3
kube-system        coredns-86c58d9df4-jp2zb                  1/1     Running     6          3d2h   10.244.2.66   kube-node2
kube-system        etcd-kube-master                          1/1     Running     3          3d5h   10.0.0.81     kube-master
kube-system        kube-apiserver-kube-master                1/1     Running     3          3d5h   10.0.0.81     kube-master
kube-system        kube-controller-manager-kube-master       1/1     Running     5          3d5h   10.0.0.81     kube-master
kube-system        kube-flannel-ds-amd64-5m9x5               1/1     Running     6          3d5h   10.0.0.83     kube-node2
kube-system        kube-flannel-ds-amd64-7xgf4               1/1     Running     3          3d5h   10.0.0.81     kube-master
kube-system        kube-flannel-ds-amd64-dhdzm               1/1     Running     5          3d2h   10.0.0.84     kube-node3
kube-system        kube-flannel-ds-amd64-m6fx5               1/1     Running     3          3d5h   10.0.0.82     kube-node1
kube-system        kube-proxy-bnbzn                          1/1     Running     3          3d5h   10.0.0.82     kube-node1
kube-system        kube-proxy-gjxlg                          1/1     Running     4          3d2h   10.0.0.84     kube-node3
kube-system        kube-proxy-kkxdb                          1/1     Running     3          3d5h   10.0.0.81     kube-master
kube-system        kube-proxy-knzsl                          1/1     Running     6          3d5h   10.0.0.83     kube-node2
kube-system        kube-scheduler-kube-master                1/1     Running     4          3d5h   10.0.0.81     kube-master
rook-ceph-system   rook-ceph-agent-748v8                     1/1     Running     0          103m   10.0.0.83     kube-node2
rook-ceph-system   rook-ceph-agent-9vznf                     1/1     Running     0          103m   10.0.0.82     kube-node1
rook-ceph-system   rook-ceph-agent-hfdv6                     1/1     Running     0          103m   10.0.0.81     kube-master
rook-ceph-system   rook-ceph-agent-lfh7m                     1/1     Running     0          103m   10.0.0.84     kube-node3
rook-ceph-system   rook-ceph-operator-76cf7f88f-qmvn5        1/1     Running     0          103m   10.244.1.65   kube-node1
rook-ceph-system   rook-discover-25h5z                       1/1     Running     0          103m   10.244.1.66   kube-node1
rook-ceph-system   rook-discover-dcm7k                       1/1     Running     0          103m   10.244.0.41   kube-master
rook-ceph-system   rook-discover-t4qs7                       1/1     Running     0          103m   10.244.3.61   kube-node3
rook-ceph-system   rook-discover-w2nv5                       1/1     Running     0          103m   10.244.2.72   kube-node2
rook-ceph          rook-ceph-mgr-a-8649f78d9b-k6gwl          1/1     Running     0          100m   10.244.3.62   kube-node3
rook-ceph          rook-ceph-mon-a-576d9d49cc-q9pm6          1/1     Running     0          101m   10.244.0.42   kube-master
rook-ceph          rook-ceph-mon-b-85f7b6cb6b-pnrhs          1/1     Running     0          101m   10.244.1.67   kube-node1
rook-ceph          rook-ceph-mon-c-668f7f658d-hjf2v          1/1     Running     0          101m   10.244.2.74   kube-node2
rook-ceph          rook-ceph-osd-0-6f76d5cc4c-t75gg          1/1     Running     0          100m   10.244.2.76   kube-node2
rook-ceph          rook-ceph-osd-1-5759cd47c4-szvfg          1/1     Running     0          100m   10.244.3.64   kube-node3
rook-ceph          rook-ceph-osd-2-6d69b78fbf-7s4bm          1/1     Running     0          100m   10.244.0.44   kube-master
rook-ceph          rook-ceph-osd-3-7b457fc56d-22gw6          1/1     Running     0          100m   10.244.1.69   kube-node1
rook-ceph          rook-ceph-osd-prepare-kube-master-72kfz   0/2     Completed   0          100m   10.244.0.43   kube-master
rook-ceph          rook-ceph-osd-prepare-kube-node1-jp68h    0/2     Completed   0          100m   10.244.1.68   kube-node1
rook-ceph          rook-ceph-osd-prepare-kube-node2-j89pc    0/2     Completed   0          100m   10.244.2.75   kube-node2
rook-ceph          rook-ceph-osd-prepare-kube-node3-drh4t    0/2     Completed   0          100m   10.244.3.63   kube-node3
rook-ceph          rook-ceph-tools-76c7d559b6-qvh2r          1/1     Running     0          6s     10.0.0.82     kube-node1

# kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash

# ceph status
  cluster:
    id:     edc7cac7-21a3-45ae-80a9-5d470afb7576
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum c,a,b
    mgr: a(active)
    osd: 4 osds: 4 up, 4 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   17 GiB used, 123 GiB / 140 GiB avail
    pgs:

# ceph osd tree
ID CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF
-1       0.13715 root default
-5       0.03429     host kube-master
 2   hdd 0.03429         osd.2            up  1.00000 1.00000
-4       0.03429     host kube-node1
 3   hdd 0.03429         osd.3            up  1.00000 1.00000
-2       0.03429     host kube-node2
 0   hdd 0.03429         osd.0            up  1.00000 1.00000
-3       0.03429     host kube-node3
 1   hdd 0.03429         osd.1            up  1.00000 1.00000
At this point the Ceph cluster is clean and in a healthy state. However, I am going to introduce some chaos that will cause osd.1 to go down. Since this is a virtual lab, I am simply going to kill the OSD process and clear out the osd.1 data to mimic a failed drive.
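For reference, a rough sketch of how one might simulate that failure directly on kube-node3 follows; these are not the exact steps from my lab, and the PID placeholder and the /dev/sdb device name are hypothetical stand-ins for whatever backs osd.1 on your node:

# pgrep -af ceph-osd        # locate the ceph-osd process for osd.1 in the host process table
# kill -9 <pid>             # down the OSD by killing its process (placeholder PID)
# wipefs --all /dev/sdb     # destroy the OSD's on-disk data (hypothetical device name)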
Now when we look at the cluster state from the toolbox we can see osd.1 is down:
# kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
# ceph status
  cluster:
    id:     edc7cac7-21a3-45ae-80a9-5d470afb7576
    health: HEALTH_WARN
            1 osds down
            1 host (1 osds) down

  services:
    mon: 3 daemons, quorum c,a,b
    mgr: a(active)
    osd: 4 osds: 3 up, 4 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   17 GiB used, 123 GiB / 140 GiB avail
    pgs:

[root@kube-node1 /]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF
-1       0.13715 root default
-5       0.03429     host kube-master
 2   hdd 0.03429         osd.2            up  1.00000 1.00000
-4       0.03429     host kube-node1
 3   hdd 0.03429         osd.3            up  1.00000 1.00000
-2       0.03429     host kube-node2
 0   hdd 0.03429         osd.0            up  1.00000 1.00000
-3       0.03429     host kube-node3
 1   hdd 0.03429         osd.1          down  1.00000 1.00000
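In this tiny lab it is obvious which node and disk back the failed OSD, but in a larger cluster it may not be. As an optional aside (not part of the original walkthrough), the toolbox can tell you before you remove anything:

# ceph osd find 1        # shows the CRUSH location and host of osd.1
# ceph osd metadata 1    # shows the hostname, backing device and other daemon details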
Given that I removed the contents of the OSD, let's go ahead and replace the failed drive. The first step is to go into the toolbox and run the usual commands to remove a Ceph OSD from the cluster:
# kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
# ceph osd out osd.1
marked out osd.1.
# ceph osd crush remove osd.1
removed item id 1 name 'osd.1' from crush map
# ceph auth del osd.1
updated
# ceph osd rm osd.1
removed osd.1
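On Luminous and newer releases the last three commands can also be collapsed into a single purge; this is a hedged alternative rather than what was run above:

# ceph osd out osd.1
# ceph osd purge 1 --yes-i-really-mean-it    # combines crush remove, auth del and osd rm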
Let's exit the toolbox, go back to the master node command line, and delete the Ceph OSD 1 deployment:
# kubectl delete deployment -n rook-ceph rook-ceph-osd-1
deployment.extensions "rook-ceph-osd-1" deleted
Now would be the time to replace the physically failed disk. In my case the disk is still good; I just simulated the failure by downing the OSD process and removing the data.
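One thing worth checking on the node before going further: Rook's osd-prepare job will only claim a disk it considers empty, so a reused or repurposed disk with leftover partitions or filesystem signatures needs to be wiped first. A quick sanity check might look like the following, where /dev/sdb is a hypothetical device name standing in for whatever the replacement disk enumerates as:

# lsblk -f                  # the replacement disk should show no partitions and no filesystem signature
# wipefs --all /dev/sdb     # clear any leftover signatures if the disk was used before (hypothetical device)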
To get the new disk back into the cluster we only need to restart the rook-ceph-operator pod, which we can do in Kubernetes with the following scale deployment commands:
# kubectl scale deployment rook-ceph-operator --replicas=0 -n rook-ceph-system
deployment.extensions/rook-ceph-operator scaled
# kubectl get pods --all-namespaces -o wide|grep operator
# kubectl scale deployment rook-ceph-operator --replicas=1 -n rook-ceph-system
deployment.extensions/rook-ceph-operator scaled
# kubectl get pods --all-namespaces -o wide|grep operator
rook-ceph-system   rook-ceph-operator-76cf7f88f-g9pxr   0/1   ContainerCreating   0   2s   kube-node2
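Rather than polling with grep, you can also wait for the rollout to settle. This is just an optional convenience using standard kubectl subcommands, and it assumes the operator pod carries the app=rook-ceph-operator label from the stock operator.yaml:

# kubectl -n rook-ceph-system rollout status deployment/rook-ceph-operator
# kubectl -n rook-ceph-system get pods -l app=rook-ceph-operator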
When the rook-ceph-operator is restarted, it re-runs the rook-ceph-osd-prepare job on each node. Each job scans the node it runs on for any disks that should be incorporated into the cluster, based on the cluster.yaml settings used when the Ceph cluster was originally deployed with Rook. In this case it will see the new disk on kube-node3 and incorporate it as osd.1.
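If a new disk ever fails to get picked up, the osd-prepare logs for that node show what the provisioning scan decided. A sketch of that check, using the kube-node3 prepare pod name from the listing below and the --all-containers flag to dump both containers in the pod:

# kubectl -n rook-ceph logs rook-ceph-osd-prepare-kube-node3-cpf4s --all-containers=true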
We can confirm this by checking that a new pod for osd.1 was spawned and also by logging into the toolbox and running the familiar Ceph commands:
# kubectl get pods -n rook-ceph -o wide
NAME                                      READY   STATUS      RESTARTS   AGE     IP            NODE          NOMINATED NODE   READINESS GATES
rook-ceph-mgr-a-8649f78d9b-k6gwl          1/1     Running     0          110m    10.244.3.62   kube-node3    <none>           <none>
rook-ceph-mon-a-576d9d49cc-q9pm6          1/1     Running     0          110m    10.244.0.42   kube-master   <none>           <none>
rook-ceph-mon-b-85f7b6cb6b-pnrhs          1/1     Running     0          110m    10.244.1.67   kube-node1    <none>           <none>
rook-ceph-mon-c-668f7f658d-hjf2v          1/1     Running     0          110m    10.244.2.74   kube-node2    <none>           <none>
rook-ceph-osd-0-6f76d5cc4c-t75gg          1/1     Running     0          109m    10.244.2.76   kube-node2    <none>           <none>
rook-ceph-osd-1-69f5d5ffd-kndd7           1/1     Running     0          67s     10.244.3.68   kube-node3    <none>           <none>
rook-ceph-osd-2-6d69b78fbf-7s4bm          1/1     Running     0          109m    10.244.0.44   kube-master   <none>           <none>
rook-ceph-osd-3-7b457fc56d-22gw6          1/1     Running     0          109m    10.244.1.69   kube-node1    <none>           <none>
rook-ceph-osd-prepare-kube-master-n2t7g   0/2     Completed   0          79s     10.244.0.47   kube-master   <none>           <none>
rook-ceph-osd-prepare-kube-node1-ttznt    0/2     Completed   0          77s     10.244.1.72   kube-node1    <none>           <none>
rook-ceph-osd-prepare-kube-node2-9kxcl    0/2     Completed   0          75s     10.244.2.79   kube-node2    <none>           <none>
rook-ceph-osd-prepare-kube-node3-cpf4s    0/2     Completed   0          73s     10.244.3.66   kube-node3    <none>           <none>
rook-ceph-tools-76c7d559b6-qvh2r          1/1     Running     0          9m28s   10.0.0.82     kube-node1    <none>           <none>
# ceph status
  cluster:
    id:     edc7cac7-21a3-45ae-80a9-5d470afb7576
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum c,a,b
    mgr: a(active)
    osd: 4 osds: 4 up, 4 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   17 GiB used, 123 GiB / 140 GiB avail
    pgs:
# ceph osd tree
ID CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF
-1       0.13715 root default
-5       0.03429     host kube-master
 2   hdd 0.03429         osd.2            up  1.00000 1.00000
-4       0.03429     host kube-node1
 3   hdd 0.03429         osd.3            up  1.00000 1.00000
-2       0.03429     host kube-node2
 0   hdd 0.03429         osd.0            up  1.00000 1.00000
-3       0.03429     host kube-node3
 1       0.03429         osd.1            up  1.00000 1.00000
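One small follow-up: in the tree above, the rebuilt osd.1 came back with an empty CLASS column. If that happens in your cluster and your CRUSH rules rely on device classes, the class can be assigned by hand from the toolbox:

# ceph osd crush set-device-class hdd osd.1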
As you can see, replacing a failed OSD with Rook is about as uneventful as replacing one in a traditionally deployed Ceph cluster. Hopefully this demonstration proved that point.
Further Reading:
Rook: https://github.com/rook/rook