The Hadoop Distributed File System (HDFS) has been great at providing a cloud-scale file system. It is robust (when administered correctly :-)) and highly scalable. However, one of its main drawbacks is that every piece of data is replicated in three places. This is acceptable because disk storage is cheap and getting cheaper by the day, and it is not a problem for a relatively small or medium-sized cluster: the price difference (in absolute terms) between using fifteen disks and using ten disks is modest. At a cost of $1 per GByte, the difference between fifteen 1 TB disks and ten 1 TB disks is only $5K. But when the total size of your cluster is 10 PBytes, the cost savings from storing the data in two places instead of three is a huge ten million dollars!

HDFS stores disk blocks in triplicate because it runs on commodity hardware, where there is a non-negligible probability of disk failure. In practice, a replication factor of 3, combined with the fact that HDFS aggressively detects failures and immediately re-replicates lost block replicas, has been sufficient to never lose any data. The challenge now is to achieve an effective replication factor of 3 while keeping the real physical replication factor close to 2. What better way to do it than by using Erasure Codes?
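
To make the idea concrete, here is a minimal sketch of the simplest possible erasure code, XOR parity, in Java. This is illustrative only and not the HDFS-RAID implementation: the parity block is the byte-wise XOR of the data blocks in a stripe, so any one missing block can be rebuilt by XOR-ing the parity with the surviving blocks.

// Minimal XOR-parity sketch. Illustrative only; HDFS-RAID operates on
// real HDFS blocks, but the recovery math is the same idea.
public class XorParity {

    // The parity block is the byte-wise XOR of all data blocks in the stripe.
    static byte[] computeParity(byte[][] stripe) {
        byte[] parity = new byte[stripe[0].length];
        for (byte[] block : stripe) {
            for (int i = 0; i < parity.length; i++) {
                parity[i] ^= block[i];
            }
        }
        return parity;
    }

    // Rebuild a single lost block by XOR-ing the parity with every survivor.
    static byte[] recoverBlock(byte[][] survivors, byte[] parity) {
        byte[] recovered = parity.clone();
        for (byte[] block : survivors) {
            for (int i = 0; i < recovered.length; i++) {
                recovered[i] ^= block[i];
            }
        }
        return recovered;
    }

    public static void main(String[] args) {
        byte[][] stripe = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };
        byte[] parity = computeParity(stripe);
        // Pretend stripe[1] is lost; recover it from the survivors plus parity.
        byte[] lost = recoverBlock(new byte[][] { stripe[0], stripe[2] }, parity);
        System.out.println(java.util.Arrays.toString(lost)); // prints [4, 5, 6]
    }
}

A single XOR parity block tolerates one failure per stripe; generating more parity blocks per stripe tolerates more.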

I heard about this idea, called DiskReduce, from the folks at CMU. The CMU PDL Lab has been a powerhouse of research in file systems, and it is no surprise that they proposed an elegant way of implementing erasure codes in HDFS. I borrowed heavily from their idea in my implementation of Erasure Codes in HDFS, described in HDFS-503. One of the main motivations of my design is to keep HDFS Erasure Coding as a software layer above HDFS rather than intertwining it with the HDFS code. The HDFS code is complex by itself, and it is really nice not to have to make it more complex and heavyweight.

The Distributed Raid File System consists of two main software components. The first is the RaidNode, a daemon that creates parity files from specified HDFS files. The second, "raidfs", is layered over the HDFS client and intercepts all calls that an application makes to it. If the HDFS client encounters corrupted data while reading a file, the raidfs client detects it, uses the relevant parity blocks to recover the corrupted data (if possible), and returns the data to the application. The application is completely oblivious to the fact that parity data was used to satisfy its read request. The Distributed Raid File System can be configured so that a set of data blocks of a file is combined to form one or more parity blocks. This allows one to reduce the replication factor of an HDFS file from 3 to 2 while keeping the failure probability roughly the same as before.
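
The interception works roughly as in the sketch below. This is not the actual HDFS-RAID code; the names (RaidInputStream, ParityRecovery, recoverFromParity) are hypothetical, and the point is only the control flow: try the plain HDFS read first, and fall back to parity-based reconstruction when the read fails.

import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of the raidfs client read path.
class RaidInputStream {
    private final InputStream hdfsIn;       // stream from the plain HDFS client
    private final ParityRecovery recovery;  // locates the matching parity blocks
    private long pos = 0;                   // current offset within the file

    RaidInputStream(InputStream hdfsIn, ParityRecovery recovery) {
        this.hdfsIn = hdfsIn;
        this.recovery = recovery;
    }

    int read(byte[] buf, int off, int len) throws IOException {
        try {
            int n = hdfsIn.read(buf, off, len);  // normal path: read from HDFS
            if (n > 0) pos += n;
            return n;
        } catch (IOException corrupted) {
            // Corrupt or missing block: rebuild the requested bytes from the
            // parity file and the surviving blocks of the stripe.
            int n = recovery.recoverFromParity(pos, buf, off, len);
            pos += n;
            return n;
        }
    }
}

interface ParityRecovery {
    int recoverFromParity(long fileOffset, byte[] buf, int off, int len) throws IOException;
}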

I have seen that using a stripe size of 10 blocks decreases the physical replication factor of a file to 2.2 while keeping the effective replication factor at 3. This typically results in saving 25% to 30% of storage space in an HDFS cluster.
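
The 2.2 figure follows from the stripe geometry, assuming the parity block is itself stored with replication 2: a stripe of 10 data blocks at replication 2 plus 1 parity block at replication 2 gives

    (10 × 2 + 1 × 2) / 10 = 22 / 10 = 2.2 physical copies per logical block.

Compared with plain triplication, that saves (3.0 − 2.2) / 3.0 ≈ 27% of raw storage, squarely within the 25% to 30% range quoted above.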

One shortcoming of this implementation is that it needs a parity file for every file in HDFS, which potentially increases the number of files the NameNode must track. To alleviate this problem, I will enhance this implementation (in the future) to use the Hadoop Archive feature to pack all the parity files together into larger containers, so that the NameNode does not have to support additional files when HDFS Erasure Coding is switched on. This works reasonably well because it is a very, very rare case that the parity files are ever used to satisfy a read request.
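
For reference, the standard Hadoop Archive tool packs many small files into one .har container, so a future version could archive parity files with something like the following (the paths here are purely illustrative):

    hadoop archive -archiveName parity.har -p /raid parity-files /raid-archives

The NameNode then tracks a handful of container files instead of one parity file per data file, and the rare parity read pays only a small extra lookup cost.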

ref: http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
