This article was translated by QiYu of the Ceph China Community.

Original English post: Using Ceph with MySQL

You are welcome to join CCTG.

Over the last year, the Ceph world drew me in. Partly because of my taste for distributed systems, but also because I think Ceph represents a great opportunity for MySQL specifically and databases in general. The shift from local storage to distributed storage is similar to the shift from bare-disk host configurations to LVM-managed disk configurations.


Most of the work I’ve done with Ceph was in collaboration with folks from Red Hat (mainly Brent Compton and Kyle Bader). This work resulted in a number of talks presented at the Percona Live conference in April and the Red Hat Summit San Francisco at the end of June. I could write a lot about using Ceph with databases, and I hope this post is the first in a long series on Ceph. Before I start with use cases, setup configurations and performance benchmarks, I think I should quickly review the architecture and principles behind Ceph.


Introduction to Ceph

Inktank created Ceph a few years ago as a spin-off of the hosting company DreamHost. Red Hat acquired Inktank in 2014 and now offers it as a storage solution. OpenStack uses Ceph as its dominant storage backend. This blog, however, focuses on a more general review and isn’t restricted to a virtual environment.


A simplistic way of describing Ceph is to say it is an object store, just like S3 or Swift. This is a true statement but only up to a certain point. There are minimally two types of nodes with Ceph, monitors and object storage daemons (OSDs). The monitor nodes are responsible for maintaining a map of the cluster or, if you prefer, the Ceph cluster metadata. Without access to the information provided by the monitor nodes, the cluster is useless. Redundancy and quorum at the monitor level are important.


Any non-trivial Ceph setup has at least three monitors. The monitors are fairly lightweight processes and can be co-hosted on OSD nodes (the other node type needed in a minimal setup). The OSD nodes store the data on disk, and a single physical server can host many OSD nodes – though it would make little sense for it to host more than one monitor node. The OSD nodes are listed in the cluster metadata (the “crushmap”) in a hierarchy that can span data centers, racks, servers, etc. It is also possible to organize the OSDs by disk types to store some objects on SSD disks and other objects on rotating disks.


With the information provided by the monitors’ crushmap, any client can access data based on a predetermined hash algorithm. There’s no need for a relaying proxy. This becomes a big scalability factor since these proxies can be performance bottlenecks. Architecture-wise, it is somewhat similar to the NDB API, where – given a cluster map provided by the NDB management node – clients can directly access the data on data nodes.

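To make the proxy-less access path concrete, here is a minimal sketch using the Python librados binding (the rados module from python-rados). The conffile path and the pool name 'rbd' are placeholders for your own setup; armed only with the monitor addresses from ceph.conf, the client retrieves the cluster map and then reads and writes objects by talking to the OSDs directly.

```python
import rados

# The client only needs ceph.conf (monitor addresses) and a keyring.
# Once it has the cluster map from the monitors, it computes object
# placement itself -- there is no proxy in the data path.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

print("fsid:", cluster.get_fsid())
print("stats:", cluster.get_cluster_stats())

ioctx = cluster.open_ioctx('rbd')   # placeholder pool name
ioctx.write_full('greeting', b'written straight to the OSDs')
print(ioctx.read('greeting'))

ioctx.close()
cluster.shutdown()
```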

Ceph stores data in a logical container called a pool. With the pool definition comes a number of placement groups. The placement groups are shards of data across the pool. For example, on a four-node Ceph cluster, if a pool is defined with 256 placement groups (pg), then each OSD will have 64 pgs for that pool. You can view the pgs as a level of indirection to smooth out the data distribution across the nodes. At the pool level, you define the replication factor (“size” in Ceph terminology).

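As a hedged sketch of how a pool and its parameters can be set up programmatically, the snippet below sends the equivalents of the `ceph osd pool create` and `ceph osd pool set ... size` CLI commands as JSON mon commands through the Python rados binding. The pool name 'mysql' and the numbers are illustrative choices, not prescriptions from the post.

```python
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Create a pool with 256 placement groups; on a 4-OSD cluster that is
# roughly 256 / 4 = 64 primary pgs per OSD, as in the example above.
cluster.mon_command(
    json.dumps({"prefix": "osd pool create",
                "pool": "mysql", "pg_num": 256}), b'')

# Set the replication factor ("size" in Ceph terminology) for the pool.
cluster.mon_command(
    json.dumps({"prefix": "osd pool set",
                "pool": "mysql", "var": "size", "val": "3"}), b'')

cluster.shutdown()
```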

The recommended values are a replication factor of three for spinners and two for SSD/Flash. I often use a size of one for ephemeral test VM images. A replication factor greater than one associates each pg with one or more pgs on the other OSD nodes. As the data is modified, it is replicated synchronously to the other associated pgs so that the data it contains is still available in case an OSD node crashes.


So far, I have just discussed the basics of an object store. But the ability to update objects atomically in place makes Ceph different and better (in my opinion) than other object stores. The underlying object access protocol, rados, updates an arbitrary number of bytes in an object at an arbitrary offset, exactly as if it were a regular file. That update capability allows for much fancier usage of the object store – for things like the support of block devices, rbd devices, and even a network file system, cephfs.

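Here is a small sketch of that in-place update capability through the Python rados binding; the pool name 'mysql' and the object name are invented for the example.

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mysql')            # hypothetical pool

ioctx.write_full('datafile.0', b'\0' * 4096)   # create a small object
# Overwrite 11 bytes at offset 1024, much like pwrite() on a regular
# file -- the rest of the object is left untouched.
ioctx.write('datafile.0', b'hello world', 1024)
print(ioctx.read('datafile.0', 11, 1024))      # b'hello world'

ioctx.close()
cluster.shutdown()
```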

When using MySQL on Ceph, the rbd disk block device feature is extremely interesting. A Ceph rbd disk is basically the concatenation of a series of objects (4MB objects by default) that are presented as a block device by the Linux kernel rbd module. Functionally it is pretty similar to an iSCSI device as it can be mounted on any host that has access to the storage network and it is dependent upon the performance of the network.

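As an illustration, the Python rbd binding can create such an image (the pool and image names below are placeholders). The default 4MB object size corresponds to an object order of 22, i.e. 2^22 bytes; mapping the image as a /dev/rbd* block device is a separate step done with the kernel rbd module (`rbd map`).

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mysql')                  # hypothetical pool

# A 10 GiB image, stored behind the scenes as a series of 4MB objects.
rbd.RBD().create(ioctx, 'mysql-data', 10 * 1024**3)

with rbd.Image(ioctx, 'mysql-data') as image:
    print(image.size())                              # 10737418240

ioctx.close()
cluster.shutdown()
```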

The benefits of using Ceph

Agility

In a world striving for virtualization and containers, Ceph makes it easy to move database resources between hosts.


IO scalability

On a single host, you have access only to the IO capabilities of that host. With Ceph, you basically put in parallel all the IO capabilities of all the hosts. If each host can do 1000 iops, a four-node cluster could reach up to 4000 iops.


High availability

Ceph replicates data at the storage level, and provides resiliency to storage node crash. A kind of DRBD on steroids…


Backups

Ceph rbd block devices support snapshots, which are quick to make and have no performance impacts. Snapshots are an ideal way of performing MySQL backups.

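A minimal sketch of taking such a snapshot with the Python rbd binding follows; the pool, image and snapshot names are invented. For a consistent MySQL backup you would typically flush/lock the instance, or rely on InnoDB crash recovery, around the snapshot.

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mysql')                    # hypothetical pool

with rbd.Image(ioctx, 'mysql-data') as image:          # hypothetical image
    image.create_snap('backup-20160701')               # near-instantaneous
    print([snap['name'] for snap in image.list_snaps()])

ioctx.close()
cluster.shutdown()
```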

Thin provisioning

You can clone and mount Ceph snapshots as block devices. This is a useful feature to provision new database servers for replication, either with asynchronous replication or with Galera replication.

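A sketch of that provisioning workflow with the Python rbd binding (all names are placeholders): the snapshot must be protected before it can be cloned, and the clone needs the layering feature. The clone is thin, so the new server starts from data it shares with the parent snapshot.

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mysql')                         # hypothetical pool

with rbd.Image(ioctx, 'mysql-data') as parent:              # hypothetical image
    if not parent.is_protected_snap('backup-20160701'):
        parent.protect_snap('backup-20160701')

# Thin clone: initially shares all of its data with the parent snapshot.
rbd.RBD().clone(ioctx, 'mysql-data', 'backup-20160701',
                ioctx, 'mysql-replica-01',
                features=rbd.RBD_FEATURE_LAYERING)

ioctx.close()
cluster.shutdown()
```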

The caveats of using Ceph

Of course, nothing is free. Ceph use comes with some caveats.


Ceph reaction to a missing OSD

If an OSD goes down, the Ceph cluster starts copying the data that now has fewer copies than specified. Although good for high availability, the copying process significantly impacts performance. This implies that you cannot run a Ceph cluster with nearly full storage; you must have enough disk space to handle the loss of one node.

The “no out” OSD attribute mitigates this, and prevents Ceph from reacting automatically to a failure (but you are then on your own). When using the “no out” attribute, you must monitor and detect that you are running in degraded mode and take action. This resembles a failed disk in a RAID set. You can choose this behavior as default with the mon_osd_auto_mark_auto_out_in setting.

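For reference, a hedged sketch of setting that flag from Python: the JSON below mirrors the `ceph osd set noout` CLI command and assumes the python-rados binding plus admin credentials; the exact mon-command field names may differ between releases.

```python
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Equivalent of `ceph osd set noout`: a down OSD is not marked "out",
# so the cluster does not start re-replicating its data on its own.
ret, out, errs = cluster.mon_command(
    json.dumps({"prefix": "osd set", "key": "noout"}), b'')
print(ret, errs)

cluster.shutdown()
```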

Scrubbing

Every day and every week (deep), Ceph runs scrubbing operations that, although they are throttled, can still impact performance. You can modify the interval and the hours that control the scrub action. Once per day and once per week are likely fine. But you need to set osd_scrub_begin_hour and osd_scrub_end_hour to restrict the scrubbing to off hours. Also, scrubbing throttles itself to not put too much load on the nodes. The osd_scrub_load_threshold variable sets the threshold.

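These scrub settings are ordinary OSD options, normally placed under [osd] in ceph.conf. The sketch below instead pushes them through a mon command, which assumes a release with the centralized `ceph config set` store (Luminous or later); the hour values are examples of an off-peak window, not recommendations.

```python
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Restrict scrubbing to a 22:00-05:00 window and keep the load threshold
# at its usual value (example numbers -- adjust to your own quiet hours).
for name, value in [("osd_scrub_begin_hour", "22"),
                    ("osd_scrub_end_hour", "5"),
                    ("osd_scrub_load_threshold", "0.5")]:
    cluster.mon_command(
        json.dumps({"prefix": "config set", "who": "osd",
                    "name": name, "value": value}), b'')

cluster.shutdown()
```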

Tuning

Ceph has many parameters so that tuning Ceph can be complex and confusing. Since distributed systems push hardware, properly tuning Ceph might require things like distributing interrupt load among cores and thread core pinning, handling of Numa zones – especially if you use high-speed NVMe devices.


Conclusion

Hopefully, this post provided a good introduction to Ceph. I’ve discussed the architecture, the benefits and the caveats of Ceph. In future posts, I’ll present use cases with MySQL. These cases include performing Percona XtraDB Cluster SST operations using Ceph snapshots, provisioning async slaves and building HA setups. I also hope to provide guidelines on how to build and configure an efficient Ceph cluster.


Finally, a note for the ones who think cost and complexity put building a Ceph cluster out of reach. The picture below shows my home cluster (which I use quite heavily). The cluster comprises four ARM-based nodes (Odroid-XU4), each with a two TB portable USB-3 hard disk, a 16 GB EMMC flash disk and a gigabit Ethernet port.

I won’t claim record breaking performance (although it’s decent), but cost-wise it is pretty hard to beat (at around $600)!

