Tuesday, April 02, 2019

Deploy Rook/Ceph Cluster on Dedicated Networks


Recently a colleague of mine was trying to get Rook to deploy a Ceph cluster that used dedicated public and private networks to segment the Ceph replication traffic and the client access traffic to the OSDs of the cluster.  In a regular Ceph deployment this is rather trivial, but in the context of Kubernetes it becomes a little more complex given that Rook is the one deploying the cluster containers.  The following is the procedure I applied to ensure my OSDs were listening on the appropriate networks.
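
For reference, in a standalone (non-Kubernetes) Ceph deployment the same split is just a couple of directives in ceph.conf; the subnets shown here are the ones used throughout this post:

[global]
public network = 192.168.200.0/24
cluster network = 192.168.100.0/24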

Before we get into the steps on how to achieve this configuration, let's quickly take a look at the setup I used.  First, I have a three-node Kubernetes configuration (one master with scheduling allowed and two workers):

# kubectl get nodes
NAME          STATUS   ROLES    AGE     VERSION
kube-master   Ready    master   2d22h   v1.14.0
kube-node1    Ready    worker   2d22h   v1.14.0
kube-node2    Ready    worker   2d22h   v1.14.0
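
(For completeness: "scheduling allowed" on the master just means the NoSchedule taint was removed, along the lines of the command below; the exact taint key depends on your Kubernetes version, so treat this as a sketch.)

# kubectl taint nodes kube-master node-role.kubernetes.io/master:NoSchedule-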

On each of the nodes I have three network interfaces: eth0 on 10.0.0.0/24 (Kubernetes public), eth1 on 192.168.100.0/24 (Ceph private/cluster) and eth2 on 192.168.200.0/24 (Ceph public):

# ip a|grep eth[0-2]
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    inet 10.0.0.81/24 brd 10.0.0.255 scope global noprefixroute eth0
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    inet 192.168.100.81/24 brd 192.168.100.255 scope global noprefixroute eth1
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    inet 192.168.200.81/24 brd 192.168.200.255 scope global noprefixroute eth2

Before we begin, let's look at the current vanilla pods and namespaces on the Kubernetes cluster:

# kubectl get pods --all-namespaces -o wide
NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE   IP           NODE          NOMINATED NODE   READINESS GATES
kube-system   coredns-fb8b8dccf-h6wfn               1/1     Running   0          3d    10.244.1.2   kube-node2    <none>           <none>
kube-system   coredns-fb8b8dccf-mv7p5               1/1     Running   0          3d    10.244.0.7   kube-master   <none>           <none>
kube-system   etcd-kube-master                      1/1     Running   0          3d    10.0.0.81    kube-master   <none>           <none>
kube-system   kube-apiserver-kube-master            1/1     Running   0          3d    10.0.0.81    kube-master   <none>           <none>
kube-system   kube-controller-manager-kube-master   1/1     Running   1          3d    10.0.0.81    kube-master   <none>           <none>
kube-system   kube-flannel-ds-amd64-szhg9           1/1     Running   0          3d    10.0.0.83    kube-node2    <none>           <none>
kube-system   kube-flannel-ds-amd64-t4fxs           1/1     Running   0          3d    10.0.0.82    kube-node1    <none>           <none>
kube-system   kube-flannel-ds-amd64-wbsdp           1/1     Running   0          3d    10.0.0.81    kube-master   <none>           <none>
kube-system   kube-proxy-sn7j7                      1/1     Running   0          3d    10.0.0.83    kube-node2    <none>           <none>
kube-system   kube-proxy-wtzm5                      1/1     Running   0          3d    10.0.0.81    kube-master   <none>           <none>
kube-system   kube-proxy-xlwd9                      1/1     Running   0          3d    10.0.0.82    kube-node1    <none>           <none>
kube-system   kube-scheduler-kube-master            1/1     Running   1          3d    10.0.0.81    kube-master   <none>           <none>

# kubectl get ns
NAME              STATUS   AGE
default           Active   3d
kube-node-lease   Active   3d
kube-public       Active   3d
kube-system       Active   3d

Before we can deploy the cluster we need to create a ConfigMap for the rook-ceph namespace.  This namespace is normally created when the cluster is deployed, but we want specific configuration items to be incorporated into the cluster at deployment time, so we will create the rook-ceph namespace ourselves first and apply our ConfigMap to it.

First, create a ConfigMap file that looks like the following.  Notice I am referencing my Ceph cluster networks and clearing public addr and cluster addr so that the network settings, rather than a fixed per-daemon address, determine where the daemons bind.  I will save this file with an arbitrary name like config-override.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [global]
    public network =  192.168.200.0/24
    cluster network = 192.168.100.0/24
    public addr = ""
    cluster addr = ""

Next I will create the rook-ceph namespace:

# kubectl create namespace rook-ceph
namespace/rook-ceph created

# kubectl get ns
NAME              STATUS   AGE
default           Active   3d1h
kube-node-lease   Active   3d1h
kube-public       Active   3d1h
kube-system       Active   3d1h
rook-ceph         Active   5s

Now we can apply the ConfigMap we created to the newly created namespace and validate that it's there:

# kubectl create -f config-override.yaml 
configmap/rook-config-override created
# kubectl get configmap -n rook-ceph
NAME                   DATA   AGE
rook-config-override   1      66s
# kubectl describe configmap -n rook-ceph
Name:         rook-config-override
Namespace:    rook-ceph
Labels:       <none>
Annotations:  <none>

Data
====
config:
----
[global]
public network =  192.168.200.0/24
cluster network = 192.168.100.0/24
public addr = ""
cluster addr = ""

Events:  <none>


Before we actually start the deployment we need to update one more thing in our Rook cluster.yaml.  Inside the cluster.yaml file we need to change hostNetwork from the default of false to true:

# sed -i 's/hostNetwork: false/hostNetwork: true/g' cluster.yaml
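
After the change, the relevant line in cluster.yaml should read roughly as below; depending on the Rook version it may sit directly under spec or be nested in a network: block, so treat this as a sketch rather than the exact file layout:

    # expose the Ceph daemons on the host's network interfaces so the
    # public/cluster network settings from rook-config-override take effect
    hostNetwork: true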

Now we can begin the process of deploying the Rook/Ceph cluster, which includes launching the operator, the cluster and the toolbox.  I will place sleep statements in between each command to ensure the pods are up before I run the next command.  Also note there will be an error when creating the cluster about the rook-ceph namespace already existing; this is expected since we created that namespace ourselves above:

# kubectl create -f operator.yaml
namespace/rook-ceph-system created
customresourcedefinition.apiextensions.k8s.io/cephclusters.ceph.rook.io created
customresourcedefinition.apiextensions.k8s.io/cephfilesystems.ceph.rook.io created
customresourcedefinition.apiextensions.k8s.io/cephobjectstores.ceph.rook.io created
customresourcedefinition.apiextensions.k8s.io/cephobjectstoreusers.ceph.rook.io created
customresourcedefinition.apiextensions.k8s.io/cephblockpools.ceph.rook.io created
customresourcedefinition.apiextensions.k8s.io/volumes.rook.io created
clusterrole.rbac.authorization.k8s.io/rook-ceph-cluster-mgmt created
role.rbac.authorization.k8s.io/rook-ceph-system created
clusterrole.rbac.authorization.k8s.io/rook-ceph-global created
clusterrole.rbac.authorization.k8s.io/rook-ceph-mgr-cluster created
serviceaccount/rook-ceph-system created
rolebinding.rbac.authorization.k8s.io/rook-ceph-system created
clusterrolebinding.rbac.authorization.k8s.io/rook-ceph-global created
deployment.apps/rook-ceph-operator created

# sleep 60

# kubectl create -f cluster.yaml 
serviceaccount/rook-ceph-osd created
serviceaccount/rook-ceph-mgr created
role.rbac.authorization.k8s.io/rook-ceph-osd created
role.rbac.authorization.k8s.io/rook-ceph-mgr-system created
role.rbac.authorization.k8s.io/rook-ceph-mgr created
rolebinding.rbac.authorization.k8s.io/rook-ceph-cluster-mgmt created
rolebinding.rbac.authorization.k8s.io/rook-ceph-osd created
rolebinding.rbac.authorization.k8s.io/rook-ceph-mgr created
rolebinding.rbac.authorization.k8s.io/rook-ceph-mgr-system created
rolebinding.rbac.authorization.k8s.io/rook-ceph-mgr-cluster created
cephcluster.ceph.rook.io/rook-ceph created
Error from server (AlreadyExists): error when creating "cluster.yaml": namespaces "rook-ceph" already exists

# sleep 60

# kubectl create -f toolbox.yaml 
pod/rook-ceph-tools created

Let's validate that the Rook/Ceph operator, cluster and toolbox are up and running:

# kubectl get pods --all-namespaces -o wide
NAMESPACE          NAME                                      READY   STATUS      RESTARTS   AGE     IP           NODE          NOMINATED NODE   READINESS GATES
kube-system        coredns-fb8b8dccf-h6wfn                   1/1     Running     0          3d1h    10.244.1.2   kube-node2    <none>           <none>
kube-system        coredns-fb8b8dccf-mv7p5                   1/1     Running     0          3d1h    10.244.0.7   kube-master   <none>           <none>
kube-system        etcd-kube-master                          1/1     Running     0          3d1h    10.0.0.81    kube-master   <none>           <none>
kube-system        kube-apiserver-kube-master                1/1     Running     0          3d1h    10.0.0.81    kube-master   <none>           <none>
kube-system        kube-controller-manager-kube-master       1/1     Running     1          3d1h    10.0.0.81    kube-master   <none>           <none>
kube-system        kube-flannel-ds-amd64-szhg9               1/1     Running     0          3d1h    10.0.0.83    kube-node2    <none>           <none>
kube-system        kube-flannel-ds-amd64-t4fxs               1/1     Running     0          3d1h    10.0.0.82    kube-node1    <none>           <none>
kube-system        kube-flannel-ds-amd64-wbsdp               1/1     Running     0          3d1h    10.0.0.81    kube-master   <none>           <none>
kube-system        kube-proxy-sn7j7                          1/1     Running     0          3d1h    10.0.0.83    kube-node2    <none>           <none>
kube-system        kube-proxy-wtzm5                          1/1     Running     0          3d1h    10.0.0.81    kube-master   <none>           <none>
kube-system        kube-proxy-xlwd9                          1/1     Running     0          3d1h    10.0.0.82    kube-node1    <none>           <none>
kube-system        kube-scheduler-kube-master                1/1     Running     1          3d1h    10.0.0.81    kube-master   <none>           <none>
rook-ceph-system   rook-ceph-agent-55fqp                     1/1     Running     0          17m     10.0.0.83    kube-node2    <none>           <none>
rook-ceph-system   rook-ceph-agent-5v9v5                     1/1     Running     0          17m     10.0.0.81    kube-master   <none>           <none>
rook-ceph-system   rook-ceph-agent-spx29                     1/1     Running     0          17m     10.0.0.82    kube-node1    <none>           <none>
rook-ceph-system   rook-ceph-operator-57547fc866-ltp8z       1/1     Running     0          18m     10.244.2.4   kube-node1    <none>           <none>
rook-ceph-system   rook-discover-brxmt                       1/1     Running     0          17m     10.244.2.5   kube-node1    <none>           <none>
rook-ceph-system   rook-discover-hl748                       1/1     Running     0          17m     10.244.1.8   kube-node2    <none>           <none>
rook-ceph-system   rook-discover-qj5kd                       1/1     Running     0          17m     10.244.0.9   kube-master   <none>           <none>
rook-ceph          rook-ceph-mgr-a-5dbb44d7f8-vzs46          1/1     Running     0          16m     10.0.0.82    kube-node1    <none>           <none>
rook-ceph          rook-ceph-mon-a-5fb9568cb4-gvqln          1/1     Running     0          16m     10.0.0.81    kube-master   <none>           <none>
rook-ceph          rook-ceph-mon-b-b65c555bf-vz7ps           1/1     Running     0          16m     10.0.0.82    kube-node1    <none>           <none>
rook-ceph          rook-ceph-mon-c-69cf744c4d-8g4l6          1/1     Running     0          16m     10.0.0.83    kube-node2    <none>           <none>
rook-ceph          rook-ceph-osd-0-77499f547-d2vjx           1/1     Running     0          15m     10.0.0.81    kube-master   <none>           <none>
rook-ceph          rook-ceph-osd-1-698f76d786-lqn4w          1/1     Running     0          15m     10.0.0.82    kube-node1    <none>           <none>
rook-ceph          rook-ceph-osd-2-558c59d577-wfdlr          1/1     Running     0          15m     10.0.0.83    kube-node2    <none>           <none>
rook-ceph          rook-ceph-osd-prepare-kube-master-p55sw   0/2     Completed   0          15m     10.0.0.81    kube-master   <none>           <none>
rook-ceph          rook-ceph-osd-prepare-kube-node1-q7scn    0/2     Completed   0          15m     10.0.0.82    kube-node1    <none>           <none>
rook-ceph          rook-ceph-osd-prepare-kube-node2-8rm4d    0/2     Completed   0          15m     10.0.0.83    kube-node2    <none>           <none>
rook-ceph          rook-ceph-tools                           1/1     Running     0          3m24s   10.244.1.9   kube-node2    <none>           <none>
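
Notice that the mon and OSD pods report the node IPs (10.0.0.x) rather than pod-network addresses, a first hint that host networking is in effect.  To confirm it directly you can query the pod spec; the pod name below is simply the OSD pod from my listing, and the command should print true:

# kubectl -n rook-ceph get pod rook-ceph-osd-0-77499f547-d2vjx -o jsonpath='{.spec.hostNetwork}'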

# kubectl -n rook-ceph exec -it rook-ceph-tools -- /bin/bash
bash: warning: setlocale: LC_CTYPE: cannot change locale (en_US.UTF-8): No such file or directory
bash: warning: setlocale: LC_COLLATE: cannot change locale (en_US.UTF-8): No such file or directory
bash: warning: setlocale: LC_MESSAGES: cannot change locale (en_US.UTF-8): No such file or directory
bash: warning: setlocale: LC_NUMERIC: cannot change locale (en_US.UTF-8): No such file or directory
bash: warning: setlocale: LC_TIME: cannot change locale (en_US.UTF-8): No such file or directory
[root@rook-ceph-tools /]# ceph status
  cluster:
    id:     b58f2a5c-2fc7-43e7-b410-2d541e78a90e
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c
    mgr: a(active)
    osd: 3 osds: 3 up, 3 in
 
  data:
    pools:   0 pools, 0 pgs
    objects: 0  objects, 0 B
    usage:   57 GiB used, 49 GiB / 105 GiB avail
    pgs:     
 
[root@rook-ceph-tools /]# exit
exit

At this point we have a fully operational cluster, but is it really using the dedicated networks for OSD public and private traffic?  Let's explore that a bit further by first running the netstat command on any node in the cluster that has an OSD pod running.  Since my cluster is small I will show all three nodes below:

[root@kube-master]# netstat -tulpn | grep LISTEN | grep osd
tcp        0      0 192.168.100.81:6800     0.0.0.0:*               LISTEN      29719/ceph-osd      
tcp        0      0 192.168.200.81:6800     0.0.0.0:*               LISTEN      29719/ceph-osd      
tcp        0      0 192.168.200.81:6801     0.0.0.0:*               LISTEN      29719/ceph-osd      
tcp        0      0 192.168.100.81:6801     0.0.0.0:*               LISTEN      29719/ceph-osd

[root@kube-node1]# netstat -tulpn | grep LISTEN | grep osd
tcp        0      0 192.168.100.82:6800     0.0.0.0:*               LISTEN      18770/ceph-osd      
tcp        0      0 192.168.100.82:6801     0.0.0.0:*               LISTEN      18770/ceph-osd      
tcp        0      0 192.168.200.82:6801     0.0.0.0:*               LISTEN      18770/ceph-osd      
tcp        0      0 192.168.200.82:6802     0.0.0.0:*               LISTEN      18770/ceph-osd

[root@kube-node2]# netstat -tulpn | grep LISTEN | grep osd
tcp        0      0 192.168.100.83:6800     0.0.0.0:*               LISTEN      22659/ceph-osd      
tcp        0      0 192.168.200.83:6800     0.0.0.0:*               LISTEN      22659/ceph-osd      
tcp        0      0 192.168.200.83:6801     0.0.0.0:*               LISTEN      22659/ceph-osd      
tcp        0      0 192.168.100.83:6801     0.0.0.0:*               LISTEN      22659/ceph-osd

From the above we can see the OSD processes listening on the corresponding public and private networks we configured in the ConfigMap.  However, let's further confirm by going back into the toolbox and doing a ceph osd dump:

# kubectl -n rook-ceph exec -it rook-ceph-tools -- /bin/bash
bash: warning: setlocale: LC_CTYPE: cannot change locale (en_US.UTF-8): No such file or directory
bash: warning: setlocale: LC_COLLATE: cannot change locale (en_US.UTF-8): No such file or directory
bash: warning: setlocale: LC_MESSAGES: cannot change locale (en_US.UTF-8): No such file or directory
bash: warning: setlocale: LC_NUMERIC: cannot change locale (en_US.UTF-8): No such file or directory
bash: warning: setlocale: LC_TIME: cannot change locale (en_US.UTF-8): No such file or directory

[root@rook-ceph-tools]# ceph osd dump
epoch 14
fsid 05a8b767-e3e8-42aa-b792-69f479c807f7
created 2019-04-02 13:24:24.549423
modified 2019-04-02 13:25:28.441850
flags sortbitwise,recovery_deletes,purged_snapdirs
crush_version 7
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client firefly
require_osd_release mimic
max_osd 3
osd.0 up   in  weight 1 up_from 11 up_thru 0 down_at 0 last_clean_interval [0,0) 192.168.200.81:6800/29719 192.168.100.81:6800/29719 192.168.100.81:6801/29719 192.168.200.81:6801/29719 exists,up 2feb0edf-6652-4148-8264-6ba52d04ff80
osd.1 up   in  weight 1 up_from 14 up_thru 0 down_at 0 last_clean_interval [0,0) 192.168.200.82:6801/18770 192.168.100.82:6800/18770 192.168.100.82:6801/18770 192.168.200.82:6802/18770 exists,up f8df61b4-4ac8-4705-9f97-eb09a1cc0d6c
osd.2 up   in  weight 1 up_from 14 up_thru 0 down_at 0 last_clean_interval [0,0) 192.168.200.83:6800/22659 192.168.100.83:6800/22659 192.168.100.83:6801/22659 192.168.200.83:6801/22659 exists,up db555c80-9d81-4662-aed9-4bce1c0d5d78
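
Each OSD advertises addresses on both the 192.168.200.0/24 public network and the 192.168.100.0/24 cluster network, and nothing on the 10.0.0.0/24 Kubernetes network.  As a quick, purely illustrative check you can pull the unique Ceph addresses out of the dump with a throwaway one-liner like this:

[root@rook-ceph-tools]# ceph osd dump | grep -oE '192\.168\.(100|200)\.[0-9]+' | sort -u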

As you can see, it can be fairly straightforward to configure Rook to deploy a Ceph cluster using segmented networks, ensuring the replication traffic runs on a dedicated network and does not interfere with public client performance.  Hopefully this quick demonstration showed that.