In a Ceph cluster, the cluster health went into a WARN state after a pool was created. The details are shown below.

Check the cluster information

List the pools

[root@serverc ~]# ceph osd pool ls

  images    # only one pool exists
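The pool's replica size and its current pg_num/pgp_num feed into the calculation further down; they can be read with a standard command (output omitted here):

[root@serverc ~]# ceph osd pool ls detail    # shows replicated size, pg_num and pgp_num for each pool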

[root@serverc ~]# ceph osd tree

  ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
  -1 0.13129 root default
  -5 0.04376 host serverc
  2 hdd 0.01459 osd.2 up 1.00000 1.00000    # all 9 OSDs are up and in
  3 hdd 0.01459 osd.3 up 1.00000 1.00000
  7 hdd 0.01459 osd.7 up 1.00000 1.00000
  -3 0.04376 host serverd
  0 hdd 0.01459 osd.0 up 1.00000 1.00000
  5 hdd 0.01459 osd.5 up 1.00000 1.00000
  6 hdd 0.01459 osd.6 up 1.00000 1.00000
  -7 0.04376 host servere
  1 hdd 0.01459 osd.1 up 1.00000 1.00000
  4 hdd 0.01459 osd.4 up 1.00000 1.00000
  8 hdd 0.01459 osd.8 up 1.00000 1.00000
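The number of PGs hosted by each OSD can also be read directly with ceph osd df, which prints a PGS column per OSD; this is a standard command and a quicker check than a full pg dump:

[root@serverc ~]# ceph osd df    # the PGS column shows how many placement groups each OSD holds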

Reproduce the error

[root@serverc ~]# ceph osd pool create images 64 64

[root@serverc ~]# ceph osd pool application enable images rbd

[root@serverc ~]# ceph -s

  cluster:
    id: 04b66834-1126-4870-9f32-d9121f1baccd
    health: HEALTH_WARN
            too few PGs per OSD (21 < min 30)
  services:
    mon: 3 daemons, quorum serverc,serverd,servere
    mgr: servere(active), standbys: serverd, serverc
    osd: 9 osds: 9 up, 9 in
  data:
    pools: 1 pools, 64 pgs
    objects: 8 objects, 12418 kB
    usage: 1005 MB used, 133 GB / 134 GB avail
    pgs: 64 active+clean
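When several warnings stack up, as they will later in this walkthrough, the full text of each one can be listed with ceph health detail (standard command, output omitted):

[root@serverc ~]# ceph health detail    # one line per active health warning, with its detail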

[root@serverc ~]# ceph pg dump

  dumped all
  version 1334
  stamp 2019-03-29 22:21:41.795511
  last_osdmap_epoch 0
  last_pg_scan 0
  full_ratio 0
  nearfull_ratio 0
  PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
  1.3f 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.871318 0'0 33:41 [7,1,0] 7 [7,1,0] 7 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
  1.3e 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.867341 0'0 33:41 [4,5,7] 4 [4,5,7] 4 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
  1.3d 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.871213 0'0 33:41 [0,3,1] 0 [0,3,1] 0 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
  1.3c 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.859216 0'0 33:41 [5,7,1] 5 [5,7,1] 5 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
  1.3b 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.870865 0'0 33:41 [0,8,7] 0 [0,8,7] 0 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
  1.3a 2 0 0 0 0 19 17 17 active+clean 2019-03-29 22:17:34.858977 33'17 33:117 [4,6,7] 4 [4,6,7] 4 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
  1.39 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.871027 0'0 33:41 [0,3,4] 0 [0,3,4] 0 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
  1.38 1 0 0 0 0 16 1 1 active+clean 2019-03-29 22:17:34.861985 30'1 33:48 [4,2,5] 4 [4,2,5] 4 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
  1.37 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.861667 0'0 33:41 [6,7,1] 6 [6,7,1] 6 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
  1.36 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.860382 0'0 33:41 [6,3,1] 6 [6,3,1] 6 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
  1.35 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.860407 0'0 33:41 [8,6,2] 8 [8,6,2] 8 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
  1.34 0 0 0 0 0 0 2 2 active+clean 2019-03-29 22:17:34.861874 32'2 33:44 [4,3,0] 4 [4,3,0] 4 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
  1.33 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.860929 0'0 33:41 [4,6,2] 4 [4,6,2] 4 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
  1.32 0 0 0 0 0 0 0 0 active+clean 2019-03-29 22:17:34.860589 0'0 33:41 [4,2,6] 4 [4,2,6] 4 0'0 2019-03-29 21:55:07.534833 0'0 2019-03-29 21:55:07.534833
  …………
  1 8 0 0 0 0 12716137 78 78
  sum 8 0 0 0 0 12716137 78 78
  OSD_STAT USED AVAIL TOTAL HB_PEERS PG_SUM PRIMARY_PG_SUM
  8 119M 15229M 15348M [0,1,2,3,4,5,6,7] 22 6
  7 119M 15229M 15348M [0,1,2,3,4,5,6,8] 22 9
  6 119M 15229M 15348M [0,1,2,3,4,5,7,8] 23 5
  5 107M 15241M 15348M [0,1,2,3,4,6,7,8] 18 7
  4 107M 15241M 15348M [0,1,2,3,5,6,7,8] 18 9
  3 107M 15241M 15348M [0,1,2,4,5,6,7,8] 23 6
  2 107M 15241M 15348M [0,1,3,4,5,6,7,8] 19 6
  1 107M 15241M 15348M [0,2,3,4,5,6,7,8] 24 8
  0 107M 15241M 15348M [1,2,3,4,5,6,7,8] 23 8
  sum 1005M 133G 134G

The warning says that the number of PGs on each OSD is below the minimum of 30. This is because the pool was created with pg_num and pgp_num set to 64; with a 3-replica configuration spread over 9 OSDs, each OSD ends up holding roughly 64 × 3 / 9 ≈ 21 PGs, which is below the minimum of 30 and triggers the warning above. The pg dump output confirms that the PG count on every OSD is below 30.
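The same arithmetic can be run in the other direction to choose a pg_num that clears the warning. The target of roughly 100 PGs per OSD used below is the commonly quoted rule of thumb, not a value from this cluster, so treat this as a sketch:

  PGs per OSD = pg_num × replicas / OSDs
              = 64  × 3 / 9 ≈ 21    (below the minimum of 30, hence the warning)
              = 128 × 3 / 9 ≈ 42    (above 30, so pg_num = 128 is enough to clear it)

  classic sizing formula: pg_num ≈ OSDs × ~100 / replicas = 9 × 100 / 3 = 300 → nearest power of two: 256

The classic formula would suggest 256 here; this walkthrough uses 128, which already satisfies the minimum.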

If data is written to and operated on while the cluster is in this state, the cluster can appear to hang and stop responding to I/O, and it can lead to large numbers of OSDs going down.

Solution

Change the pool's pg_num

[root@serverc ~]# ceph osd pool set images pg_num 128

  set pool 1 pg_num to 128

[root@serverc ~]# ceph -s

  cluster:
    id: 04b66834-1126-4870-9f32-d9121f1baccd
    health: HEALTH_WARN
            Reduced data availability: 21 pgs peering
            Degraded data redundancy: 21 pgs unclean
            1 pools have pg_num > pgp_num
            too few PGs per OSD (21 < min 30)

  services:
    mon: 3 daemons, quorum serverc,serverd,servere
    mgr: servere(active), standbys: serverd, serverc
    osd: 9 osds: 9 up, 9 in

  data:
    pools: 1 pools, 128 pgs
    objects: 8 objects, 12418 kB
    usage: 1005 MB used, 133 GB / 134 GB avail
    pgs: 50.000% pgs unknown
         16.406% pgs not active
         64 unknown
         43 active+clean
         21 peering

The too few PGs per OSD warning is still present, and a new warning, 1 pools have pg_num > pgp_num, has appeared: the new placement groups do not take full effect for data placement until pgp_num is raised to match pg_num.
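The 30 in the message comes from the monitor option mon_pg_warn_min_per_osd, which defaults to 30 in this Ceph release. If you want to confirm the value on your own cluster, it can be read from the mon's admin socket (run on the node hosting that mon; serverc in this environment):

[root@serverc ~]# ceph daemon mon.serverc config show | grep mon_pg_warn_min_per_osd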

Continue by changing pgp_num

[root@serverc ~]# ceph osd pool set images pgp_num 128

  set pool 1 pgp_num to 128
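To confirm that both values now read 128 (the pool name images is the one used in this environment), the settings can be read back:

[root@serverc ~]# ceph osd pool get images pg_num
[root@serverc ~]# ceph osd pool get images pgp_num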

Check the status (several successive runs of ceph -s while the new PGs peer and the data rebalances)

[root@serverc ~]# ceph -s

  cluster:
    id: 04b66834-1126-4870-9f32-d9121f1baccd
    health: HEALTH_WARN
            Reduced data availability: 7 pgs peering
            Degraded data redundancy: 24 pgs unclean, 2 pgs degraded
  services:
    mon: 3 daemons, quorum serverc,serverd,servere
    mgr: servere(active), standbys: serverd, serverc
    osd: 9 osds: 9 up, 9 in
  data:
    pools: 1 pools, 128 pgs
    objects: 8 objects, 12418 kB
    usage: 1005 MB used, 133 GB / 134 GB avail
    pgs: 24.219% pgs not active    # PG states: the data is rebalancing (for what each state means, see part 3 of https://www.cnblogs.com/zyxnhr/p/10616497.html)
         97 active+clean
         20 activating
         9 peering
         2 activating+degraded

[root@serverc ~]# ceph -s

  cluster:
    id: 04b66834-1126-4870-9f32-d9121f1baccd
    health: HEALTH_WARN
            Reduced data availability: 7 pgs peering
            Degraded data redundancy: 3/24 objects degraded (12.500%), 33 pgs unclean, 4 pgs degraded
  services:
    mon: 3 daemons, quorum serverc,serverd,servere
    mgr: servere(active), standbys: serverd, serverc
    osd: 9 osds: 9 up, 9 in
  data:
    pools: 1 pools, 128 pgs
    objects: 8 objects, 12418 kB
    usage: 1005 MB used, 133 GB / 134 GB avail
    pgs: 35.938% pgs not active
         3/24 objects degraded (12.500%)
         79 active+clean
         34 activating
         9 peering
         3 activating+degraded
         2 active+clean+snaptrim
         1 active+recovery_wait+degraded
  io:
    recovery: 1 B/s, 0 objects/s

[root@serverc ~]# ceph -s

  cluster:
    id: 04b66834-1126-4870-9f32-d9121f1baccd
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum serverc,serverd,servere
    mgr: servere(active), standbys: serverd, serverc
    osd: 9 osds: 9 up, 9 in
  data:
    pools: 1 pools, 128 pgs
    objects: 8 objects, 12418 kB
    usage: 1050 MB used, 133 GB / 134 GB avail
    pgs: 128 active+clean
  io:
    recovery: 1023 kB/s, 0 keys/s, 0 objects/s

[root@serverc ~]# ceph -s

  cluster:
    id: 04b66834-1126-4870-9f32-d9121f1baccd
    health: HEALTH_OK    # rebalancing is finished and the cluster is healthy again
  services:
    mon: 3 daemons, quorum serverc,serverd,servere
    mgr: servere(active), standbys: serverd, serverc
    osd: 9 osds: 9 up, 9 in
  data:
    pools: 1 pools, 128 pgs
    objects: 8 objects, 12418 kB
    usage: 1016 MB used, 133 GB / 134 GB avail
    pgs: 128 active+clean
  io:
    recovery: 778 kB/s, 0 keys/s, 0 objects/s

Note: this is a lab environment and the pool holds no data, so changing the PG count has little impact. In production, however, raising pg_num on an existing pool is much more disruptive: the change forces the whole cluster to rebalance and migrate data, and the more data there is, the longer client I/O is affected. For a detailed explanation of the PG state values, see https://www.cnblogs.com/zyxnhr/p/10543814.html. In production, plan a PG change so that it does not disturb the business, for example by deciding in advance when recovery should run and when pg_num/pgp_num should be changed.
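One common way to soften the impact of such a rebalance on client I/O is to throttle backfill and recovery while the data moves. The options below are standard OSD settings; the values are only illustrative and should be tuned for, and later reverted on, your own cluster:

[root@serverc ~]# ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'    # fewer concurrent backfill/recovery ops per OSD
[root@serverc ~]# ceph tell osd.* injectargs '--osd_recovery_sleep 0.1'    # brief pause between recovery ops to leave room for client I/O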

References:

https://my.oschina.net/xiaozhublog/blog/664560
