Introducing shard translator

by Krutika Dhananjay on December 23, 2015

GlusterFS-3.7.0 saw the release of the sharding feature, among several others. The feature was tagged as “experimental” as it was still in the initial stages of development back then. Here is an introduction to the feature:

Why shard translator?

GlusterFS’ answer to very large files (those which can grow beyond a single brick) had never been clear. There is a stripe translator which allows you to do that, but it comes at the cost of flexibility: you can add servers only in multiples of stripe-count x replica-count, and mixing striped and unstriped files is not possible in an “elegant” way. This also happens to be a big limiting factor for the big data/Hadoop use case where super large files are the norm (and where you want to split a file even if it could fit within a single server). The proposed solution is to replace the current stripe translator with a new “shard” translator.

What?

Unlike stripe, shard is not a cluster translator. It is placed on top of DHT. Initially all files are created as normal files, up to a certain configurable size. The first block (default 4MB) is stored like a normal file under its parent directory. Further blocks, however, are stored in a separate namespace as files named by the GFID and block index (like /.shard/GFID1.1, /.shard/GFID1.2 … /.shard/GFID1.N). File IO happening at a particular offset writes to the appropriate “piece file”, creating it if necessary. The aggregated file size and block count are stored in the xattrs of the original file.
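To make the naming scheme concrete, here is a minimal Python sketch of how a file offset could map to a piece file and an offset within it. This is purely illustrative (the real translator is C code inside GlusterFS); the placeholder GFID and the default 4MB block size are taken from the description above.

# Illustrative sketch of the shard naming scheme described above;
# not the actual shard translator implementation.
SHARD_BLOCK_SIZE = 4 * 1024 * 1024   # default block size: 4MB, configurable per volume

def shard_location(gfid, offset, block_size=SHARD_BLOCK_SIZE):
    """Return (path, offset within that file) for IO at 'offset'."""
    block_index = offset // block_size
    if block_index == 0:
        # Block 0 is the original file under its parent directory.
        return ("<parent-dir>/<filename>", offset)
    # Blocks 1..N live under the hidden /.shard directory, named GFID.index.
    return ("/.shard/%s.%d" % (gfid, block_index), offset % block_size)

# A write at offset 20MB with the default 4MB block size lands in shard 5:
print(shard_location("GFID1", 20 * 1024 * 1024))
# ('/.shard/GFID1.5', 0)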

Usage:

Here I have a 2×2 distributed-replicated volume.

# gluster volume info
Volume Name: dis-rep
Type: Distributed-Replicate
Volume ID: 96001645-a020-467b-8153-2589e3a0dee3
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: server1:/bricks/1
Brick2: server2:/bricks/2
Brick3: server3:/bricks/3
Brick4: server4:/bricks/4
Options Reconfigured:
performance.readdir-ahead: on

To enable sharding on it, this is what I do:

# gluster volume set dis-rep features.shard on
volume set: success

Now, to configure the shard block size to 16MB, this is what I do:

# gluster volume set dis-rep features.shard-block-size 16MB
volume set: success

How files are sharded:

Now I write 84MB of data into a file named ‘testfile’.

# dd if=/dev/urandom of=/mnt/glusterfs/testfile bs=1M count=84
84+0 records in
84+0 records out
88080384 bytes (88 MB) copied, 13.2243 s, 6.7 MB/s

Let’s check the backend to see how the file was sharded to pieces and how these pieces got distributed across the bricks:

# ls /bricks/* -lh
/bricks/1:
total 0

/bricks/2:
total 0

/bricks/3:
total 17M
-rw-r--r--. 2 root root 16M Dec 24 12:36 testfile

/bricks/4:
total 17M
-rw-r--r--. 2 root root 16M Dec 24 12:36 testfile

So the base file hashed to the second replica set (brick3 and brick4, which form a replica pair) and is 16M in size. Where did the remaining 68MB worth of data go? To find out, let’s check the contents of the hidden directory .shard on all bricks:

# ls /bricks/*/.shard -lh
/bricks/1/.shard:
total 37M
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.1
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.3
-rw-r--r--. 2 root root 4.0M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.5

/bricks/2/.shard:
total 37M
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.1
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.3
-rw-r--r--. 2 root root 4.0M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.5

/bricks/3/.shard:
total 33M
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.2
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.4

/bricks/4/.shard:
total 33M
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.2
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.4

So, the file was split into 6 pieces in all: 5 of them residing in the hidden directory “/.shard”, distributed across the replica sets based on the shard file name hash and disk space availability, and the first block residing in its native parent directory. Notice how shards 1 through 4, like the first block, are all 16M in size, while the last shard (shard 5) is 4M in size.

Now let’s do some math to see how ‘testfile’ was “sharded”:

The total size of the write was 84MB, and the configured block size in this case is 16MB. So 84MB divided by 16MB gives 5 full blocks, with a remainder of 4MB.

So the file was broken into 6 pieces in all, with the last piece holding 4MB of data and the other five 16MB each.
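The same arithmetic, expressed as a couple of lines of Python purely for illustration:

# Illustrative arithmetic for the example above.
MB = 1024 * 1024
file_size = 84 * MB        # size of the dd write
block_size = 16 * MB       # features.shard-block-size

full_blocks, remainder = divmod(file_size, block_size)
total_pieces = full_blocks + (1 if remainder else 0)
print(full_blocks, remainder // MB, total_pieces)
# 5 4 6  -> five 16MB pieces plus one trailing 4MB piece, six in all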

Now when we view the file from the mount point, it appears as one single file:

# ls -lh /mnt/glusterfs/
total 85M
-rw-r--r--. 1 root root 84M Dec 24 12:36 testfile

Notice how the file is shown to be 84MB in size on the mount point. Similarly, when the file is read by an application, the different pieces or ‘shards’ are stitched together and presented to the application as if no chunking had been done at all.
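Conceptually, a read works the same way in reverse: the requested byte range is translated into the set of pieces it spans, each piece is read at the right internal offset, and the results are concatenated. Here is a hedged sketch of that mapping (again illustrative, not the real translator code), using the 16MB block size configured above:

# Split a read request into per-shard sub-reads, mirroring the
# "stitching" described above (illustrative only).
def split_read(offset, length, block_size=16 * 1024 * 1024):
    """Yield (block_index, offset_within_piece, sub_length) tuples."""
    end = offset + length
    while offset < end:
        block_index = offset // block_size
        in_piece = offset % block_size
        sub_len = min(block_size - in_piece, end - offset)
        yield (block_index, in_piece, sub_len)
        offset += sub_len

# A 20MB read starting at offset 10MB of 'testfile' touches the base file
# (block 0) and shard 1:
MB = 1024 * 1024
print(list(split_read(10 * MB, 20 * MB)))
# [(0, 10485760, 6291456), (1, 0, 14680064)]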

Advantages of sharding:

The advantages of sharding a file over striping it across a finite set of bricks are:

  • Data blocks are distributed by DHT in a “normal way”.
  • Adding servers can happen in any number (even one at a time) and DHT’s rebalance will spread out the “piece files” evenly.
  • Sharding provides better utilization of disk space. It is no longer necessary to have at least one brick of size X in order to accommodate a file of size X, where X is really large. Consider this example: a distribute volume is made up of 3 bricks of sizes 10GB, 20GB and 30GB. With this configuration, it is impossible to store a file greater than 30GB in size on this volume. Sharding eliminates this limitation: a file of up to 60GB in size can be stored on this volume with sharding.
  • Self-heal of a large file is now distributed over smaller shards across more servers, leading to better heal performance and lower CPU usage, which is particularly a pain point for large-file workloads.
  • The piece-file naming scheme is based on the GFID, so it is immune to renames and hard links.
  • When geo-replicating a large file to a remote volume, only the shards that changed can be synced to the slave, considerably reducing the sync time.
  • When sharding is used in conjunction with tiering, only the shards that change are promoted/demoted. This reduces the amount of data that needs to be migrated between the hot and cold tiers.
  • When sharding is used in conjunction with bit-rot detection feature of GlusterFS, the checksum is computed on smaller shards as opposed to one large file.

Sharding in its current form is not compatible with directory quota. This is something we are going to focus on in the coming days: making it compatible with other Gluster features, including directory quota and user/group quota (a feature currently in the design phase).

Thanks,
Krutika
