Wednesday, January 30, 2019

Replace Failed OSD in Rook Deployed Ceph


If you have been reading some of my recent articles on Rook, you have seen how to install a Ceph cluster on Kubernetes with Rook. This article builds on that installation and discusses how to replace a failed OSD in the Ceph cluster.

First, let's review our currently running Ceph cluster by looking at the pods in the rook-ceph-system and rook-ceph namespaces and, from inside the toolbox, the Ceph status:

# kubectl get pods --all-namespaces -o wide
NAMESPACE          NAME                                      READY   STATUS      RESTARTS   AGE    IP            NODE          NOMINATED NODE   READINESS GATES
kube-system        coredns-86c58d9df4-22fps                  1/1     Running     4          3d2h   10.244.3.55   kube-node3               
kube-system        coredns-86c58d9df4-jp2zb                  1/1     Running     6          3d2h   10.244.2.66   kube-node2               
kube-system        etcd-kube-master                          1/1     Running     3          3d5h   10.0.0.81     kube-master              
kube-system        kube-apiserver-kube-master                1/1     Running     3          3d5h   10.0.0.81     kube-master              
kube-system        kube-controller-manager-kube-master       1/1     Running     5          3d5h   10.0.0.81     kube-master              
kube-system        kube-flannel-ds-amd64-5m9x5               1/1     Running     6          3d5h   10.0.0.83     kube-node2               
kube-system        kube-flannel-ds-amd64-7xgf4               1/1     Running     3          3d5h   10.0.0.81     kube-master              
kube-system        kube-flannel-ds-amd64-dhdzm               1/1     Running     5          3d2h   10.0.0.84     kube-node3               
kube-system        kube-flannel-ds-amd64-m6fx5               1/1     Running     3          3d5h   10.0.0.82     kube-node1               
kube-system        kube-proxy-bnbzn                          1/1     Running     3          3d5h   10.0.0.82     kube-node1               
kube-system        kube-proxy-gjxlg                          1/1     Running     4          3d2h   10.0.0.84     kube-node3               
kube-system        kube-proxy-kkxdb                          1/1     Running     3          3d5h   10.0.0.81     kube-master              
kube-system        kube-proxy-knzsl                          1/1     Running     6          3d5h   10.0.0.83     kube-node2               
kube-system        kube-scheduler-kube-master                1/1     Running     4          3d5h   10.0.0.81     kube-master              
rook-ceph-system   rook-ceph-agent-748v8                     1/1     Running     0          103m   10.0.0.83     kube-node2               
rook-ceph-system   rook-ceph-agent-9vznf                     1/1     Running     0          103m   10.0.0.82     kube-node1               
rook-ceph-system   rook-ceph-agent-hfdv6                     1/1     Running     0          103m   10.0.0.81     kube-master              
rook-ceph-system   rook-ceph-agent-lfh7m                     1/1     Running     0          103m   10.0.0.84     kube-node3               
rook-ceph-system   rook-ceph-operator-76cf7f88f-qmvn5        1/1     Running     0          103m   10.244.1.65   kube-node1               
rook-ceph-system   rook-discover-25h5z                       1/1     Running     0          103m   10.244.1.66   kube-node1               
rook-ceph-system   rook-discover-dcm7k                       1/1     Running     0          103m   10.244.0.41   kube-master              
rook-ceph-system   rook-discover-t4qs7                       1/1     Running     0          103m   10.244.3.61   kube-node3               
rook-ceph-system   rook-discover-w2nv5                       1/1     Running     0          103m   10.244.2.72   kube-node2               
rook-ceph          rook-ceph-mgr-a-8649f78d9b-k6gwl          1/1     Running     0          100m   10.244.3.62   kube-node3               
rook-ceph          rook-ceph-mon-a-576d9d49cc-q9pm6          1/1     Running     0          101m   10.244.0.42   kube-master              
rook-ceph          rook-ceph-mon-b-85f7b6cb6b-pnrhs          1/1     Running     0          101m   10.244.1.67   kube-node1               
rook-ceph          rook-ceph-mon-c-668f7f658d-hjf2v          1/1     Running     0          101m   10.244.2.74   kube-node2               
rook-ceph          rook-ceph-osd-0-6f76d5cc4c-t75gg          1/1     Running     0          100m   10.244.2.76   kube-node2               
rook-ceph          rook-ceph-osd-1-5759cd47c4-szvfg          1/1     Running     0          100m   10.244.3.64   kube-node3               
rook-ceph          rook-ceph-osd-2-6d69b78fbf-7s4bm          1/1     Running     0          100m   10.244.0.44   kube-master              
rook-ceph          rook-ceph-osd-3-7b457fc56d-22gw6          1/1     Running     0          100m   10.244.1.69   kube-node1               
rook-ceph          rook-ceph-osd-prepare-kube-master-72kfz   0/2     Completed   0          100m   10.244.0.43   kube-master              
rook-ceph          rook-ceph-osd-prepare-kube-node1-jp68h    0/2     Completed   0          100m   10.244.1.68   kube-node1               
rook-ceph          rook-ceph-osd-prepare-kube-node2-j89pc    0/2     Completed   0          100m   10.244.2.75   kube-node2               
rook-ceph          rook-ceph-osd-prepare-kube-node3-drh4t    0/2     Completed   0          100m   10.244.3.63   kube-node3               
rook-ceph          rook-ceph-tools-76c7d559b6-qvh2r          1/1     Running     0          6s     10.0.0.82     kube-node1               

# kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash

# ceph status
  cluster:
    id:     edc7cac7-21a3-45ae-80a9-5d470afb7576
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum c,a,b
    mgr: a(active)
    osd: 4 osds: 4 up, 4 in
 
  data:
    pools:   0 pools, 0 pgs
    objects: 0  objects, 0 B
    usage:   17 GiB used, 123 GiB / 140 GiB avail
    pgs:     
 
# ceph osd tree  
ID CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF 
-1       0.13715 root default                                 
-5       0.03429     host kube-master                         
 2   hdd 0.03429         osd.2            up  1.00000 1.00000 
-4       0.03429     host kube-node1                          
 3   hdd 0.03429         osd.3            up  1.00000 1.00000 
-2       0.03429     host kube-node2                          
 0   hdd 0.03429         osd.0            up  1.00000 1.00000 
-3       0.03429     host kube-node3                          
 1   hdd 0.03429         osd.1            up  1.00000 1.00000 

At this point the Ceph cluster is clean and in a healthy state. However, I am going to introduce some chaos that will cause osd.1 to go down. Since this is a virtual lab, I am simply going to kill the OSD process and clear out osd.1's data to mimic a failed drive.
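
For the record, here is a rough sketch of how I introduced that failure: kill the ceph-osd daemon on the node hosting osd.1 and wipe its data directory. The commands and paths below are illustrative assumptions (they presume Rook's default dataDirHostPath of /var/lib/rook), so adjust them for your own cluster.yaml:

# ssh kube-node3
# pkill -9 ceph-osd
# rm -rf /var/lib/rook/osd1/*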

Now when we look at the cluster state from the toolbox we can see osd.1 is down:

# kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash

# ceph status
  cluster:
    id:     edc7cac7-21a3-45ae-80a9-5d470afb7576
    health: HEALTH_WARN
            1 osds down
            1 host (1 osds) down
 
  services:
    mon: 3 daemons, quorum c,a,b
    mgr: a(active)
    osd: 4 osds: 3 up, 4 in
 
  data:
    pools:   0 pools, 0 pgs
    objects: 0  objects, 0 B
    usage:   17 GiB used, 123 GiB / 140 GiB avail
    pgs:     
 
[root@kube-node1 /]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF 
-1       0.13715 root default                                 
-5       0.03429     host kube-master                         
 2   hdd 0.03429         osd.2            up  1.00000 1.00000 
-4       0.03429     host kube-node1                          
 3   hdd 0.03429         osd.3            up  1.00000 1.00000 
-2       0.03429     host kube-node2                          
 0   hdd 0.03429         osd.0            up  1.00000 1.00000 
-3       0.03429     host kube-node3                          
 1   hdd 0.03429         osd.1          down  1.00000 1.00000 

Given that I removed the contents of the OSD, let's go ahead and replace the failed drive. The first step is to go into the toolbox and run the usual commands to remove a Ceph OSD from the cluster:

# kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash

# ceph osd out osd.1
marked out osd.1. 

# ceph osd crush remove osd.1
removed item id 1 name 'osd.1' from crush map

# ceph auth del osd.1
updated

# ceph osd rm osd.1
removed osd.1
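
As a side note, on Luminous and newer Ceph releases the last three commands can be collapsed into a single purge, which removes the OSD from the CRUSH map, deletes its auth key and removes the OSD entry in one go:

# ceph osd purge osd.1 --yes-i-really-mean-it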

Let's exit out of the toolbox, go back to the master node command line, and delete the osd.1 deployment:

# kubectl delete deployment -n rook-ceph rook-ceph-osd-1
deployment.extensions "rook-ceph-osd-1" deleted

Now would be the time to physically replace the failed disk. In my case the disk is still good; I just simulated the failure by downing the OSD process and removing the data.
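
One caveat worth mentioning: Rook will only provision an OSD on a disk it considers empty, so if the replacement disk carries leftover partitions or filesystem signatures from a previous life you will want to wipe it on the node first. A minimal sketch, where /dev/sdX is a placeholder for the actual replacement device:

# wipefs --all /dev/sdX
# sgdisk --zap-all /dev/sdX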

To get the new disk back into the cluster we only need to restart the rook-ceph-operator pod, which we can do in Kubernetes by scaling its deployment down and back up:

# kubectl scale deployment rook-ceph-operator --replicas=0 -n rook-ceph-system
deployment.extensions/rook-ceph-operator scaled

# kubectl get pods --all-namespaces -o wide|grep operator

# kubectl scale deployment rook-ceph-operator --replicas=1 -n rook-ceph-system
deployment.extensions/rook-ceph-operator scaled

# kubectl get pods --all-namespaces -o wide|grep operator
rook-ceph-system   rook-ceph-operator-76cf7f88f-g9pxr        0/1     ContainerCreating   0          2s              kube-node2               
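
Equivalently, since the operator runs under a Deployment, you could simply delete the operator pod and let Kubernetes recreate it. Something like the following should work, assuming the pod carries the app=rook-ceph-operator label:

# kubectl -n rook-ceph-system delete pod -l app=rook-ceph-operator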

When the rook-ceph-operator is restarted it re-runs the rook-ceph-osd-prepare job on each node, which scans that node for any disks that should be incorporated into the cluster based on the cluster.yaml settings used when the Ceph cluster was originally deployed with Rook. In this case it will see the new disk on kube-node3 and incorporate it as osd.1.
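
If you want to watch this happen, you can follow the new osd-prepare jobs as they run and tail the logs of the prepare pod on the node with the replacement disk. The pod name below is taken from the listing that follows, and the --all-containers flag assumes a reasonably recent kubectl:

# kubectl -n rook-ceph get pods -w | grep osd-prepare
# kubectl -n rook-ceph logs rook-ceph-osd-prepare-kube-node3-cpf4s --all-containers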

We can confirm this by checking that a new pod for osd.1 was spawned, and also by logging into the toolbox and running the familiar Ceph commands:

# kubectl get pods -n rook-ceph -o wide
NAME                                      READY   STATUS      RESTARTS   AGE     IP            NODE          NOMINATED NODE   READINESS GATES
rook-ceph-mgr-a-8649f78d9b-k6gwl          1/1     Running     0          110m    10.244.3.62   kube-node3    <none>           <none>
rook-ceph-mon-a-576d9d49cc-q9pm6          1/1     Running     0          110m    10.244.0.42   kube-master   <none>           <none>
rook-ceph-mon-b-85f7b6cb6b-pnrhs          1/1     Running     0          110m    10.244.1.67   kube-node1    <none>           <none>
rook-ceph-mon-c-668f7f658d-hjf2v          1/1     Running     0          110m    10.244.2.74   kube-node2    <none>           <none>
rook-ceph-osd-0-6f76d5cc4c-t75gg          1/1     Running     0          109m    10.244.2.76   kube-node2    <none>           <none>
rook-ceph-osd-1-69f5d5ffd-kndd7           1/1     Running     0          67s     10.244.3.68   kube-node3    <none>           <none>
rook-ceph-osd-2-6d69b78fbf-7s4bm          1/1     Running     0          109m    10.244.0.44   kube-master   <none>           <none>
rook-ceph-osd-3-7b457fc56d-22gw6          1/1     Running     0          109m    10.244.1.69   kube-node1    <none>           <none>
rook-ceph-osd-prepare-kube-master-n2t7g   0/2     Completed   0          79s     10.244.0.47   kube-master   <none>           <none>
rook-ceph-osd-prepare-kube-node1-ttznt    0/2     Completed   0          77s     10.244.1.72   kube-node1    <none>           <none>
rook-ceph-osd-prepare-kube-node2-9kxcl    0/2     Completed   0          75s     10.244.2.79   kube-node2    <none>           <none>
rook-ceph-osd-prepare-kube-node3-cpf4s    0/2     Completed   0          73s     10.244.3.66   kube-node3    <none>           <none>
rook-ceph-tools-76c7d559b6-qvh2r          1/1     Running     0          9m28s   10.0.0.82     kube-node1    <none>           <none>
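
Back inside the toolbox, re-attach with the same exec command used earlier:

# kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash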

# ceph status
  cluster:
    id:     edc7cac7-21a3-45ae-80a9-5d470afb7576
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum c,a,b
    mgr: a(active)
    osd: 4 osds: 4 up, 4 in
 
  data:
    pools:   0 pools, 0 pgs
    objects: 0  objects, 0 B
    usage:   17 GiB used, 123 GiB / 140 GiB avail
    pgs:     

# ceph osd tree 
ID CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF 
-1       0.13715 root default                                 
-5       0.03429     host kube-master                         
 2   hdd 0.03429         osd.2            up  1.00000 1.00000 
-4       0.03429     host kube-node1                          
 3   hdd 0.03429         osd.3            up  1.00000 1.00000 
-2       0.03429     host kube-node2                          
 0   hdd 0.03429         osd.0            up  1.00000 1.00000 
-3       0.03429     host kube-node3                          
 1       0.03429         osd.1            up  1.00000 1.00000 

As you can see, replacing a failed OSD with Rook is about as uneventful as replacing a failed OSD in a conventionally deployed Ceph cluster. Hopefully this demonstration proved that point.

Further Reading:

Rook: https://github.com/rook/rook