A record of a serious Ceph cluster failure (repost)
Problem: a disk has failed, and the PG states in the cluster status look wrong
[root@ceph-1 ~]# ceph -s
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_WARN
64 pgs degraded
64 pgs stuck degraded
64 pgs stuck unclean
64 pgs stuck undersized
64 pgs undersized
recovery 269/819 objects degraded (32.845%)
monmap e1: 1 mons at {ceph-1=192.168.101.11:6789/0}
election epoch 6, quorum 0 ceph-1
osdmap e38: 3 osds: 2 up, 2 in; 64 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v14328: 72 pgs, 2 pools, 420 bytes data, 275 objects
217 MB used, 40720 MB / 40937 MB avail
269/819 objects degraded (32.845%)
64 active+undersized+degraded
8 active+clean
[root@ceph-1 ~]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.05846 root default
-2 0.01949 host ceph-1
0 0.01949 osd.0 up 1.00000 1.00000
-3 0.01949 host ceph-2
1 0.01949 osd.1 up 1.00000 1.00000
-4 0.01949 host ceph-3
2 0.01949 osd.2 down 0 1.00000
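When scripting around output like the above, the failed OSD can be picked out by its status column. A minimal sketch, assuming the Jewel-era column layout shown here; `down_osds` is a hypothetical helper name, and newer releases print extra columns, so the field positions would need adjusting:

```sh
# down_osds: read `ceph osd tree` text on stdin and print the names of
# OSDs whose UP/DOWN column says "down".
# Assumed layout (as in the output above): ID WEIGHT NAME STATUS REWEIGHT ...
down_osds() {
  awk '$3 ~ /^osd\.[0-9]+$/ && $4 == "down" { print $3 }'
}
```

It would typically be used as `ceph osd tree | down_osds`.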
Mark osd.2 as out
root@ceph-1:~# ceph osd out osd.2
osd.2 is already out.
Remove it from the cluster
root@ceph-1:~# ceph osd rm osd.2
removed osd.2
Remove it from the CRUSH map
root@ceph-1:~# ceph osd crush rm osd.2
removed item id 2 name 'osd.2' from crush map
Delete the authentication key for osd.2
root@ceph02:~# ceph auth del osd.2
updated
umount fails
[root@ceph-3 ~]# umount /dev/vdb1
umount: /var/lib/ceph/osd/ceph-2: target is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))
Kill the process owned by the ceph user that is holding the mount
[root@ceph-3 ~]# fuser -mv /var/lib/ceph/osd/ceph-2
USER PID ACCESS COMMAND
/var/lib/ceph/osd/ceph-2:
root kernel mount /var/lib/ceph/osd/ceph-2
ceph 1517 F.... ceph-osd
[root@ceph-3 ~]# kill -9 1517
[root@ceph-3 ~]# fuser -mv /var/lib/ceph/osd/ceph-2
USER PID ACCESS COMMAND
/var/lib/ceph/osd/ceph-2:
root kernel mount /var/lib/ceph/osd/ceph-2
[root@ceph-3 ~]# umount /var/lib/ceph/osd/ceph-2
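The busy mount above is consistent with the ceph-osd daemon still running when the OSD was removed. The upstream manual-removal procedure stops the daemon first, which avoids the need for kill -9. The sketch below only prints the commands in that order, so they can be reviewed before anything is run; `osd_removal_plan` is a hypothetical helper name, and the systemd unit name and mount path are assumed to match this Jewel-on-systemd setup:

```sh
# osd_removal_plan: print (do not run) the commands for retiring one OSD.
# Argument: numeric OSD id, e.g. 2 for osd.2.
# The systemctl and umount lines are meant for the host that carries the
# OSD; the ceph lines can be run from any node with an admin keyring.
osd_removal_plan() {
  id=$1
  case $id in
    ''|*[!0-9]*)
      echo "usage: osd_removal_plan <numeric-osd-id>" >&2
      return 1
      ;;
  esac
  echo "ceph osd out osd.$id"
  echo "systemctl stop ceph-osd@$id"
  echo "umount /var/lib/ceph/osd/ceph-$id"
  echo "ceph osd crush remove osd.$id"
  echo "ceph auth del osd.$id"
  echo "ceph osd rm osd.$id"
}
```

For example, `osd_removal_plan 2` lists the six steps for osd.2 without touching the cluster.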
Prepare the disk again
[root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf osd prepare ceph-3:/dev/vdb1
Activate the OSD disks or partitions on all nodes
[root@ceph-deploy my-cluster]# ceph-deploy osd activate ceph-1:/dev/vdb1 ceph-2:/dev/vdb1 ceph-3:/dev/vdb1
It fails...
[ceph-3][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: /usr/sbin/ceph-disk -v activate --mark-init systemd --mount /dev/vdb1
Out of frustration, shut the machine down and reboot it
[root@ceph-3 ~]# init 0
Connection to 192.168.101.13 closed by remote host.
Connection to 192.168.101.13 closed.
After the reboot the OSD is fine, but the PG problem still does not seem to be resolved
[root@ceph-1 ~]# ceph -s
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_WARN
64 pgs degraded
64 pgs stuck degraded
64 pgs stuck unclean
64 pgs stuck undersized
64 pgs undersized
recovery 269/819 objects degraded (32.845%)
monmap e1: 1 mons at {ceph-1=192.168.101.11:6789/0}
election epoch 6, quorum 0 ceph-1
osdmap e53: 3 osds: 3 up, 3 in
flags sortbitwise,require_jewel_osds
pgmap v14368: 72 pgs, 2 pools, 420 bytes data, 275 objects
5446 MB used, 55960 MB / 61406 MB avail
269/819 objects degraded (32.845%)
64 active+undersized+degraded
8 active+clean
[root@ceph-1 ~]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.03897 root default
-2 0.01949 host ceph-1
0 0.01949 osd.0 up 1.00000 1.00000
-3 0.01949 host ceph-2
1 0.01949 osd.1 up 1.00000 1.00000
-4 0 host ceph-3
2 0 osd.2 up 1.00000 1.00000
Added one disk each to ceph-1 and ceph-2, then created OSDs on them
[root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf osd create ceph-1:/dev/vdd ceph-2:/dev/vdd
Checking the cluster status shows that the PG count now seems too small
[root@ceph-1 ~]# ceph -s
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_WARN
14 pgs degraded
14 pgs stuck degraded
64 pgs stuck unclean
14 pgs stuck undersized
14 pgs undersized
recovery 188/819 objects degraded (22.955%)
recovery 200/819 objects misplaced (24.420%)
too few PGs per OSD (28 < min 30)
monmap e1: 1 mons at {ceph-1=192.168.101.11:6789/0}
election epoch 6, quorum 0 ceph-1
osdmap e63: 5 osds: 5 up, 5 in; 50 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v14408: 72 pgs, 2 pools, 420 bytes data, 275 objects
5663 MB used, 104 GB / 109 GB avail
188/819 objects degraded (22.955%)
200/819 objects misplaced (24.420%)
26 active+remapped
24 active
14 active+undersized+degraded
8 active+clean
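The "too few PGs per OSD" warning above asks for a larger pg_num. A commonly cited rule of thumb from the Ceph documentation is roughly 100 PGs per OSD divided by the replica count, rounded up to a power of two, counted across all pools. The sketch below implements only that arithmetic; `suggest_pg_num` is a hypothetical helper name and the default target of 100 is an assumption to tune. On releases of this era pg_num can be raised but not lowered, so it is worth computing before setting.

```sh
# suggest_pg_num: rule-of-thumb total PG count for a cluster.
#   $1 = number of OSDs
#   $2 = replica count (pool size)
#   $3 = target PGs per OSD (optional, defaults to 100)
# Computes osds * target / size (integer division) and rounds the
# result up to the next power of two.
suggest_pg_num() {
  osds=$1
  size=$2
  target=${3:-100}
  raw=$(( $osds * $target / $size ))
  pg=1
  while [ "$pg" -lt "$raw" ]; do
    pg=$(( $pg * 2 ))
  done
  echo "$pg"
}
```

For the five OSDs in this cluster, `suggest_pg_num 5 3` prints 256.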
Increase pg_num and pgp_num
[root@ceph-1 ~]# ceph osd pool set rbd pg_num 128
[root@ceph-1 ~]# ceph osd pool set rbd pgp_num 128
The status then turned to error...
[root@ceph-1 ~]# ceph -s
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_ERR
118 pgs are stuck inactive for more than 300 seconds
118 pgs peering
118 pgs stuck inactive
128 pgs stuck unclean
recovery 16/657 objects misplaced (2.435%)
monmap e2: 2 mons at {ceph-1=192.168.101.11:6789/0,ceph-3=192.168.101.13:6789/0}
election epoch 8, quorum 0,1 ceph-1,ceph-3
osdmap e74: 5 osds: 5 up, 5 in; 55 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v14459: 136 pgs, 2 pools, 356 bytes data, 221 objects
5665 MB used, 104 GB / 109 GB avail
16/657 objects misplaced (2.435%)
73 peering
45 remapped+peering
10 active+remapped
8 active+clean
[root@ceph-1 ~]# less /etc/ceph/ceph.co
So I rebooted the three OSD machines again, and after the reboot another OSD was down
[root@ceph-1 ~]# ceph -s
2018-07-25 15:18:17.207665 7fb4ec2ee700 0 -- :/1038496581 >> 192.168.101.12:6789/0 pipe(0x7fb4e8063fa0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb4e805c610).fault
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_WARN
16 pgs degraded
59 pgs stuck unclean
16 pgs undersized
recovery 134/819 objects degraded (16.361%)
recovery 88/819 objects misplaced (10.745%)
1/5 in osds are down
monmap e2: 2 mons at {ceph-1=192.168.101.11:6789/0,ceph-3=192.168.101.13:6789/0}
election epoch 12, quorum 0,1 ceph-1,ceph-3
osdmap e95: 5 osds: 4 up, 5 in; 43 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v14529: 136 pgs, 2 pools, 420 bytes data, 275 objects
5668 MB used, 104 GB / 109 GB avail
134/819 objects degraded (16.361%)
88/819 objects misplaced (10.745%)
77 active+clean
39 active+remapped
16 active+undersized+degraded
4 active
[root@ceph-1 ~]# ceph osd tree
2018-07-25 15:22:25.573039 7fe5ff87c700 0 -- :/3787750993 >> 192.168.101.12:6789/0 pipe(0x7fe604063fd0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe60405c640).fault
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.10725 root default
-2 0.04388 host ceph-1
0 0.01949 osd.0 up 1.00000 1.00000
3 0.02440 osd.3 up 1.00000 1.00000
-3 0.04388 host ceph-2
1 0.01949 osd.1 down 0 1.00000
4 0.02440 osd.4 up 1.00000 1.00000
-4 0.01949 host ceph-3
2 0.01949 osd.2 up 1.00000 1.00000
After running out, rm, crush rm, and auth del on the bad disk, the cluster was healthy
[root@ceph-1 ~]# ceph -s
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_OK
monmap e2: 2 mons at {ceph-1=192.168.101.11:6789/0,ceph-3=192.168.101.13:6789/0}
election epoch 12, quorum 0,1 ceph-1,ceph-3
osdmap e102: 4 osds: 4 up, 4 in
flags sortbitwise,require_jewel_osds
pgmap v14597: 136 pgs, 2 pools, 356 bytes data, 270 objects
5559 MB used, 86551 MB / 92110 MB avail
136 active+clean
Replaced the bad disk and added the new one back into the Ceph cluster (expanding capacity works the same way)
[root@ceph-deploy my-cluster]# ceph-deploy disk list ceph-2
[root@ceph-deploy my-cluster]# ceph-deploy disk zap ceph-2:vdb
[root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf osd create ceph-2:vdb:/dev/vdc1
Right now the status shows error
[root@ceph-1 ~]# ceph -s
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_ERR
13 pgs are stuck inactive for more than 300 seconds
50 pgs degraded
2 pgs peering
1 pgs recovering
17 pgs recovery_wait
13 pgs stuck inactive
23 pgs stuck unclean
recovery 67/798 objects degraded (8.396%)
monmap e2: 2 mons at {ceph-1=192.168.101.11:6789/0,ceph-3=192.168.101.13:6789/0}
election epoch 12, quorum 0,1 ceph-1,ceph-3
osdmap e110: 5 osds: 5 up, 5 in
flags sortbitwise,require_jewel_osds
pgmap v14633: 136 pgs, 2 pools, 356 bytes data, 268 objects
5669 MB used, 104 GB / 109 GB avail
67/798 objects degraded (8.396%)
79 active+clean
32 activating+degraded
17 active+recovery_wait+degraded
5 activating
2 peering
1 active+recovering+degraded
client io 0 B/s wr, 0 op/s rd, 5 op/s wr
A little later everything was back to normal
[root@ceph-1 ~]# ceph -s
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_OK
monmap e2: 2 mons at {ceph-1=192.168.101.11:6789/0,ceph-3=192.168.101.13:6789/0}
election epoch 12, quorum 0,1 ceph-1,ceph-3
osdmap e110: 5 osds: 5 up, 5 in
flags sortbitwise,require_jewel_osds
pgmap v14666: 136 pgs, 2 pools, 356 bytes data, 267 objects
5669 MB used, 104 GB / 109 GB avail
136 active+clean
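The HEALTH_ERR seen right after adding the OSD cleared by itself once the PGs finished peering and recovering, so it can be worth waiting before intervening. A minimal polling sketch; `wait_for_health_ok` is a hypothetical helper, and it takes the health command as arguments so that it can be tried without a cluster:

```sh
# wait_for_health_ok: run a health command repeatedly until its output
# starts with HEALTH_OK or the attempt budget is used up.
#   $1 = maximum attempts
#   $2 = seconds to sleep between attempts
#   remaining arguments = the command to run, e.g. ceph health
wait_for_health_ok() {
  tries=$1
  pause=$2
  shift 2
  i=0
  status=
  while [ "$i" -lt "$tries" ]; do
    status=$("$@" 2>/dev/null)
    case $status in
      HEALTH_OK*) return 0 ;;
    esac
    i=$(( $i + 1 ))
    sleep "$pause"
  done
  echo "still not healthy: $status" >&2
  return 1
}
```

On a cluster node this might be called as `wait_for_health_ok 60 10 ceph health`, which checks every 10 seconds for up to 60 attempts.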
Problem: adding a mon fails
[root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf mon create ceph-2
[ceph-2][ERROR ] admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
[ceph-2][WARNIN] neither `public_addr` nor `public_network` keys are defined for monitors
[root@ceph-2 ~]# less /var/log/ceph/ceph-mon.ceph-2.log
2018-07-25 15:52:02.566212 7efeec7d9780 -1 no public_addr or public_network specified, and mon.ceph-2 not present in monmap or ceph.conf
Cause: public_network is not configured in ceph.conf
[global]
fsid = 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
mon_initial_members = ceph-1,ceph-2,ceph-3
mon_host = 192.168.101.11,192.168.101.12,192.168.101.13
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd pool default size = 2
Edit the ceph.conf file
[root@ceph-deploy my-cluster]# vi ceph.conf
[global]
fsid = 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
mon_initial_members = ceph-1,ceph-2,ceph-3
mon_host = 192.168.101.11,192.168.101.12,192.168.101.13
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd pool default size = 2
public_network = 192.168.122.0/24
cluster_network = 192.168.101.0/24
Push the new configuration file to every node
[root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf config push ceph-1 ceph-2 ceph-3
Add ceph-2 as a mon
[root@ceph-deploy my-cluster]# ceph-deploy mon add ceph-2
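After the add succeeded, I noticed that the IP of ceph-2 in the monitor map differs from the others. According to the config file, ceph-1 and ceph-3 should be on the 122 subnet as well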
[root@ceph-1 ~]# ceph -s
cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
health HEALTH_OK
monmap e3: 3 mons at {ceph-1=192.168.101.11:6789/0,ceph-2=192.168.122.12:6789/0,ceph-3=192.168.101.13:6789/0}
election epoch 14, quorum 0,1,2 ceph-1,ceph-3,ceph-2
osdmap e110: 5 osds: 5 up, 5 in
flags sortbitwise,require_jewel_osds
pgmap v14666: 136 pgs, 2 pools, 356 bytes data, 267 objects
5669 MB used, 104 GB / 109 GB avail
136 active+clean
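The mismatch above is what one would expect when public_network does not contain the addresses the existing monitors already listen on: the new monitor picks an address from public_network, while the existing ones keep the addresses recorded in the monmap. A small sketch for checking this before pushing a config; the helper names are hypothetical, only IPv4 is handled, and 64-bit shell arithmetic is assumed:

```sh
# ip_to_int: convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# ip_in_cidr: succeed if address $1 lies inside network $2 (a.b.c.d/len).
ip_in_cidr() {
  net=${2%/*}
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & $mask )) -eq $(( $(ip_to_int "$net") & $mask )) ]
}

# mons_outside_network: print every address from a comma-separated
# mon_host list ($2) that is not inside the given public_network ($1).
mons_outside_network() {
  cidr=$1
  for addr in $(echo "$2" | tr ',' ' '); do
    ip_in_cidr "$addr" "$cidr" || echo "$addr"
  done
}
```

With the values from this ceph.conf, `mons_outside_network 192.168.122.0/24 192.168.101.11,192.168.101.12,192.168.101.13` prints all three addresses, which flags the conflict before any monitor is added.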
So I changed the mon node IPs in ceph.conf to the 122 subnet
[root@ceph-deploy my-cluster]# vi ceph.conf
[global]
fsid = 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
mon_initial_members = ceph-1,ceph-2,ceph-3
mon_host = 192.168.122.11,192.168.122.12,192.168.122.13
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd pool default size = 2
public_network = 192.168.122.0/24
cluster_network = 192.168.101.0/24
Push the configuration once more
[root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf config push ceph-1 ceph-2 ceph-3
Remove two of the mons
[root@ceph-deploy my-cluster]# ceph-deploy mon destroy ceph-1 ceph-3
After that the whole cluster was broken
[root@ceph-1 ~]# ceph -s
2018-07-25 16:35:21.723736 7f47dedfb700 0 -- 192.168.122.11:0/4277586904 >> 192.168.122.13:6789/0 pipe(0x7f47c8000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c8001f90).fault with nothing to send, going to standby
2018-07-25 16:35:27.723930 7f47dedfb700 0 -- 192.168.122.11:0/4277586904 >> 192.168.122.11:6789/0 pipe(0x7f47c8005330 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c8002410).fault with nothing to send, going to standby
2018-07-25 16:35:33.725130 7f47deffd700 0 -- 192.168.122.11:0/4277586904 >> 192.168.122.13:6789/0 pipe(0x7f47c8005330 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c80046e0).fault with nothing to send, going to standby
[root@ceph-1 ~]# ceph osd tree
2018-07-25 16:35:21.723736 7f47dedfb700 0 -- 192.168.122.11:0/4277586904 >> 192.168.122.13:6789/0 pipe(0x7f47c8000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c8001f90).fault with nothing to send, going to standby
2018-07-25 16:35:27.723930 7f47dedfb700 0 -- 192.168.122.11:0/4277586904 >> 192.168.122.11:6789/0 pipe(0x7f47c8005330 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c8002410).fault with nothing to send, going to standby
2018-07-25 16:35:33.725130 7f47deffd700 0 -- 192.168.122.11:0/4277586904 >> 192.168.122.13:6789/0 pipe(0x7f47c8005330 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c80046e0).fault with nothing to send, going to standby
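Whatever the exact trigger was here (the rewritten mon_host addresses may well have played a part), one general constraint is worth keeping in mind before removing monitors: they only serve requests while a strict majority of the monitors in the monmap are up and in quorum. The arithmetic, as a sketch with hypothetical helper names:

```sh
# mon_quorum_needed: number of monitors that must be up to form a
# quorum (a strict majority of the $1 monitors in the monmap).
mon_quorum_needed() {
  echo $(( $1 / 2 + 1 ))
}

# mon_failures_tolerated: how many of the $1 monitors can be lost
# while the rest still form a quorum.
mon_failures_tolerated() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

A three-monitor cluster needs two monitors up and tolerates losing one; a two-monitor cluster, like the ceph-1 plus ceph-3 setup earlier in this log, needs both and tolerates none.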
It does not seem possible to add them back either
[root@ceph-deploy my-cluster]# ceph-deploy mon add ceph-1 ceph-3
[ceph-1][WARNIN] 2018-07-25 16:37:52.760218 7f06739b9700 0 -- 192.168.122.11:0/2929495808 >> 192.168.122.11:6789/0 pipe(0x7f0668000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f0668005c20).fault with nothing to send, going to standby
[ceph-1][WARNIN] 2018-07-25 16:37:55.760830 7f06738b8700 0 -- 192.168.122.11:0/2929495808 >> 192.168.122.13:6789/0 pipe(0x7f066800d5e0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f066800e8a0).fault with nothing to send, going to standby
[ceph-1][WARNIN] 2018-07-25 16:37:58.760748 7f06739b9700 0 -- 192.168.122.11:0/2929495808 >> 192.168.122.11:6789/0 pipe(0x7f0668000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f066800be40).fault with nothing to send, going to standby
Not afraid of making things worse, I removed the last mon as well
[root@ceph-deploy my-cluster]# ceph-deploy mon destroy ceph-2
[root@ceph-deploy my-cluster]# ceph-deploy new ceph-1 ceph-2 ceph-3
[root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf mon create-initial
[ceph-1][ERROR ] "ceph auth get-or-create for keytype admin returned 22
[ceph-1][DEBUG ] Error EINVAL: unknown cap type 'mgr'
[ceph-1][ERROR ] Failed to return 'admin' key from host ceph-1
[ceph-2][ERROR ] "ceph auth get-or-create for keytype admin returned 22
[ceph-2][DEBUG ] Error EINVAL: unknown cap type 'mgr'
[ceph-2][ERROR ] Failed to return 'admin' key from host ceph-2
[ceph-3][ERROR ] "ceph auth get-or-create for keytype admin returned 22
[ceph-3][DEBUG ] Error EINVAL: unknown cap type 'mgr'
[ceph-3][ERROR ] Failed to return 'admin' key from host ceph-3
[ceph_deploy.gatherkeys][ERROR ] Failed to connect to host:ceph-1, ceph-2, ceph-3
[ceph_deploy.gatherkeys][INFO ] Destroy temp directory /tmp/tmpnPWk4d
[ceph_deploy][ERROR ] RuntimeError: Failed to connect any mon
[root@ceph-deploy my-cluster]# ceph-deploy mon add ceph-1
[ceph-1][INFO ] monitor: mon.ceph-1 is running
[root@ceph-deploy my-cluster]# ceph-deploy mon add ceph-2
[ceph-2][INFO ] monitor: mon.ceph-2 is running
[root@ceph-deploy my-cluster]# ceph-deploy mon add ceph-3
[ceph-3][INFO ] monitor: mon.ceph-3 is running
[root@ceph-1 ceph-ceph-1]# ceph -s
2018-07-25 20:42:07.965513 7f1482a91700 0 librados: client.admin authentication error (1) Operation not permitted
Error connecting to cluster: PermissionError
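Normally, running ceph -s starts a client that connects to the Ceph cluster, and by default that client logs in as the client.admin user with its key. A plain ceph -s is therefore equivalent to ceph -s --name client.admin --keyring /etc/ceph/ceph.client.admin.keyring. Note that every Ceph command run from the command line starts a client, talks to the cluster, and then shuts the client down. Here is a very common error that is easy to hit when first working with Ceph: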
[root@blog ~]# ceph -s
2017-08-03 02:22:27.352516 7fbd157b7700 0 librados: client.admin authentication error (1) Operation not permitted
Error connecting to cluster: PermissionError
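The error is easy to understand: the operation is not permitted, meaning authentication failed. Since we are using the default client.admin user and its key, the key does not match what the Ceph cluster has on record. In other words, /etc/ceph/ceph.client.admin.keyring was most likely left over from a previous cluster, or it holds the wrong key. In that case, running ceph auth list (or, as here, ceph auth get) as the mon. user shows the correct key: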
[root@ceph-1 ceph]# ceph auth get client.admin --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring
Error ENOENT: failed to find client.admin in keyring
[root@ceph-1 ceph]#
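To see which key a local keyring file actually carries for a given user, the file can be parsed directly and the result compared with what the monitors report. A minimal sketch; `keyring_key` is a hypothetical helper and assumes the usual keyring layout of a `[entity]` header followed by a `key = ...` line:

```sh
# keyring_key: print the key stored for one entity in a keyring file.
#   $1 = path to the keyring file
#   $2 = entity name, e.g. client.admin
# Prints nothing if the entity is not present in the file.
keyring_key() {
  awk -v section="[$2]" '
    /^\[/ { in_section = ($0 == section) }
    in_section && $1 == "key" && $2 == "=" { print $3; exit }
  ' "$1"
}
```

For example, the output of `keyring_key /etc/ceph/ceph.client.admin.keyring client.admin` could be compared with the key returned by `ceph auth get-key client.admin --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring`, once that user exists on the cluster side.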
Take a look at the cluster as the mon. user
[root@ceph-1 ceph]# ceph -s --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring
cluster 053670e9-9b12-4297-aa04-41c430091f90
health HEALTH_ERR
64 pgs are stuck inactive for more than 300 seconds
64 pgs stuck inactive
64 pgs stuck unclean
no osds
monmap e1: 3 mons at {ceph-1=192.168.101.11:6789/0,ceph-2=192.168.101.12:6789/0,ceph-3=192.168.101.13:6789/0}
election epoch 8, quorum 0,1,2 ceph-1,ceph-2,ceph-3
osdmap e1: 0 osds: 0 up, 0 in
flags sortbitwise,require_jewel_osds
pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
0 kB used, 0 kB / 0 kB avail
64 creating
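Note that the cluster ID in this output (053670e9-...) is not the original 72f44b06-...: ceph-deploy new writes a fresh ceph.conf with a newly generated fsid, so the monitors created afterwards form a new, empty cluster, which is consistent with "no osds" and with the old keys no longer being accepted. A quick way to notice this is to compare the fsid in a config file with the one a running cluster reports; `conf_fsid` and `same_cluster` are hypothetical helper names:

```sh
# conf_fsid: print the fsid value found in a ceph.conf-style file ($1).
conf_fsid() {
  awk -F= '$1 ~ /^[ \t]*fsid[ \t]*$/ {
    gsub(/[ \t]/, "", $2)   # strip blanks around the value
    print $2
    exit
  }' "$1"
}

# same_cluster: succeed if the fsid in config file $1 equals fsid $2.
same_cluster() {
  [ -n "$2" ] && [ "$(conf_fsid "$1")" = "$2" ]
}
```

On a node with working credentials this could be used as `same_cluster /etc/ceph/ceph.conf "$(ceph fsid)"`; a failure would mean the config file and the running monitors describe different clusters.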
Fetch the key for client.admin
[root@ceph-1 ceph]# ceph auth get client.admin --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring
Error ENOENT: failed to find client.admin in keyring
Add the client.admin user
[root@ceph-1 ceph]# ceph auth add client.admin --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring
Fetch the key for client.admin again
[root@ceph-1 ceph]# ceph auth get client.admin --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring
exported keyring for client.admin
[client.admin]
key = AQAIf1hbmuPXBxAA5Q3g/Jz8gerf+S6znEHLBQ==
Update the local client.admin key
[root@ceph-1 ceph]# vi ceph.client.admin.keyring
[client.admin]
# key = AQAnPVBbJJWsMhAAKEqaHkWdwEWndOvqDjtjXA==
key = AQAIf1hbmuPXBxAA5Q3g/Jz8gerf+S6znEHLBQ==
caps mds = "allow *"
caps mon = "allow *"
caps osd = "allow *"
Check the cluster status
[root@ceph-1 ceph]# ceph -s
2018-07-25 21:50:40.512039 7f0ca92d0700 0 librados: client.admin authentication error (13) Permission denied
Grant caps to the client.admin user
[root@ceph-1 ceph]# ceph auth add client.admin mon 'allow r' osd 'allow rw'
2018-07-25 21:57:45.263271 7f68398ea700 0 librados: client.admin authentication error (13) Permission denied
The ceph.client.admin.keyring newly generated by the earlier mon create-initial had never been given read permission
[root@ceph-1 ceph]# chmod +r /etc/ceph/ceph.client.admin.keyring
[root@ceph-1 ceph]# ceph -s
2018-07-25 22:06:17.167512 7f449b116700 0 librados: client.admin authentication error (13) Permission denied
Grant caps to the client.admin user again
[root@ceph-1 ceph]# ceph auth add client.admin mon 'allow r' osd 'allow rw' --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring
Error EINVAL: entity client.admin exists but caps do not match
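After a lot of effort I finally found a method via Google. Once the client.admin caps were restored, it turned out that all the OSDs in the cluster were gone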
[root@ceph-1 ~]# cd /var/lib/ceph/mon
[root@ceph-1 mon]# ls
ceph-ceph-1
[root@ceph-1 mon]# cd ceph-ceph-1/
[root@ceph-1 ceph-ceph-1]# ls
done keyring store.db systemd
[root@ceph-1 ceph-ceph-1]# ceph -n mon. --keyring keyring auth caps client.admin mds 'allow *' osd 'allow *' mon 'allow *'
updated caps for client.admin
[root@ceph-1 ceph-ceph-1]# ceph -s
cluster 053670e9-9b12-4297-aa04-41c430091f90
health HEALTH_ERR
64 pgs are stuck inactive for more than 300 seconds
64 pgs stuck inactive
64 pgs stuck unclean
no osds
monmap e1: 3 mons at {ceph-1=192.168.101.11:6789/0,ceph-2=192.168.101.12:6789/0,ceph-3=192.168.101.13:6789/0}
election epoch 16, quorum 0,1,2 ceph-1,ceph-2,ceph-3
osdmap e1: 0 osds: 0 up, 0 in
flags sortbitwise,require_jewel_osds
pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
0 kB used, 0 kB / 0 kB avail
64 creating
[root@ceph-1 ceph-ceph-1]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0 root default
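Running lsblk on each node shows that all the mount points have already been unmounted automatically. I took the opportunity to adjust the disk sizes and make them all 20G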
[root@ceph-1 ceph-ceph-1]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sr0 11:0 1 1024M 0 rom
vda 252:0 0 100G 0 disk
├─vda1 252:1 0 1G 0 part /boot
└─vda2 252:2 0 99G 0 part
├─centos-root 253:0 0 50G 0 lvm /
├─centos-swap 253:1 0 2G 0 lvm [SWAP]
└─centos-home 253:2 0 47G 0 lvm /home
vdb 252:16 0 20G 0 disk
└─vdb1 252:17 0 20G 0 part
vdc 252:32 0 20G 0 disk
└─vdc1 252:33 0 5G 0 part
vdd 252:48 0 30G 0 disk
├─vdd1 252:49 0 25G 0 part
└─vdd2 252:50 0 5G 0 part
Reformat the disks
[root@ceph-deploy my-cluster]# ceph-deploy disk zap ceph-1:vdb ceph-2:vdb ceph-3:vdb
[root@ceph-deploy my-cluster]# ceph-deploy osd prepare ceph-1:vdb:vdc ceph-2:vdb:vdc ceph-3:vdb:vdc
Activate the OSD; the failure appears to be caused by OSD authentication
[root@ceph-deploy my-cluster]# ceph-deploy osd activate ceph-1:vdb1:vdc
[ceph-1][WARNIN] ceph_disk.main.Error: Error: ceph osd create failed: Command '/usr/bin/ceph' returned non-zero exit status 1: 2018-07-26 10:34:36.851527 7f678c625700 0 librados: client.bootstrap-osd authentication error (1) Operation not permitted
[ceph-1][WARNIN] Error connecting to cluster: PermissionError
[ceph-1][WARNIN]
[ceph-1][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: /usr/sbin/ceph-disk -v activate --mark-init systemd --mount /dev/vdb1
I'll stop here for now and leave this cluster as it is until I properly understand cephx
For a reinstall, see below
ceph-deploy purgedata {ceph-node} [{ceph-node}] ## wipe the data
ceph-deploy forgetkeys ## delete the previously generated keys
ceph-deploy purge {ceph-node} [{ceph-node}] ## uninstall the Ceph packages
If you execute purge, you must re-install Ceph.
ceph-deploy new {initial-monitor-node(s)}
ceph-deploy install {ceph-node} [{ceph-node} ...]
ceph-deploy mon create-initial
ceph-deploy disk list {node-name [node-name]...}
ceph-deploy disk zap osdserver1:sda
ceph-deploy osd prepare ceph-osd1:/dev/sda ceph-osd1:/dev/sdb
ceph-deploy osd activate ceph-osd1:/dev/sda1 ceph-osd1:/dev/sdb1
ceph-deploy admin {admin-node} {ceph-node}
chmod +r /etc/ceph/ceph.client.admin.keyring