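In short, the experiment went like this:

1. I put a 600 MB file into HDFS. It was split into 9 blocks with 3 replicas each, so 27 block replicas in total were spread across 4 datanodes.

2. I shut down two datanodes, which left most of the blocks on only one datanode. Because the 9 blocks were well spread out, the file could still be retrieved correctly (correctness is checked with checksums).

3. The namenode quickly re-replicated the blocks that were down to a single replica. The target is 3 replicas, but with only 2 live datanodes each block ended up with 2 ("Target Replicas is 3 but found 2").

4. I shut down one more datanode. The blocks had been distributed evenly, and because every block had been brought back to 2 replicas beforehand, the file was still healthy with only one datanode left.

5. I deleted one blk from the only remaining datanode, and the namenode reported the file as corrupt. (I had been hoping it would enter safe mode, but -safemode get kept returning OFF.)

6. I then started another datanode. In less than 30 seconds the missing block was copied back from the newly started datanode, so it had 2 replicas again.

Fault tolerance is very reliable. With at least three racks the data would be very robust. My trust in HADOOP: level up!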

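First, a look at some basic characteristics of HDFS.

HDFS design assumptions and goals

Hardware failure is the norm, so redundancy is required.

Streaming data access: data is read in bulk rather than through random reads and writes. Hadoop is good at data analysis, not transaction processing.

Large data sets.

Simple consistency model: to reduce system complexity, files follow a write-once, read-many design. Once a file has been written and closed, it can no longer be modified.

Computation is scheduled on nodes according to the "data locality" principle.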

HDFS architecture

NameNode

DataNode

Transaction log (edit log)

Image file (fsimage)

SecondaryNameNode

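Namenode

Manages the file system namespace.

Records the location and replica information of every file's data blocks across the Datanodes.

Coordinates client access to files.

Records changes to the namespace and to the properties of the namespace itself.

The Namenode uses the transaction log to record changes to HDFS metadata, and uses the image file to store the file system namespace, including file mappings, file attributes, and so on.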
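The relationship between the transaction log and the image file can be illustrated with a toy model. This is only a sketch: the class ToyNameNode and its methods are hypothetical names made up for illustration and are not Hadoop APIs. The idea is that the current namespace equals the last image plus the logged changes replayed on top, and a checkpoint folds the log into a new image.

```python
class ToyNameNode:
    """Hypothetical toy model: namespace = last image + replayed edit log."""

    def __init__(self):
        self.image = {}      # last checkpointed namespace: path -> attributes
        self.edit_log = []   # changes recorded since that checkpoint

    def apply(self, op, path, attrs=None):
        # Every metadata change is appended to the log, not written to the image.
        self.edit_log.append((op, path, attrs))

    def namespace(self):
        # Rebuild the current view by replaying the log over the image.
        ns = dict(self.image)
        for op, path, attrs in self.edit_log:
            if op == "create":
                ns[path] = attrs
            elif op == "delete":
                ns.pop(path, None)
        return ns

    def checkpoint(self):
        # Fold the log into a new image and start an empty log.
        self.image = self.namespace()
        self.edit_log = []


nn = ToyNameNode()
nn.apply("create", "/a", {"replication": 3})
nn.apply("create", "/b", {"replication": 3})
nn.apply("delete", "/a")
print(nn.namespace())
nn.checkpoint()
print(len(nn.edit_log))
```

Keeping the log append-only makes each metadata change cheap, while the periodic merge keeps the log from growing without bound.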

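Datanode

Responsible for storage management on the physical node it runs on.

Write once, read many times (no modification).

Files are made up of data blocks. The typical block size is 64 MB.

Data blocks are spread across the nodes as widely as possible.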

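Read path

A client wants to access a file in HDFS.

It first obtains from the namenode the list of locations of the data blocks that make up the file.

From that list it knows which datanodes store the blocks.

It contacts those datanodes to fetch the data.

The namenode does not take part in the actual data transfer.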
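The read path above can be sketched as a toy model. Everything here (the NAMENODE and DATANODES dictionaries, get_block_locations, read_file) is a hypothetical stand-in rather than the real HDFS client API. It only shows the division of labour: metadata comes from the namenode, bytes come from the datanodes.

```python
# NameNode metadata: path -> ordered list of (block id, datanodes holding it)
NAMENODE = {
    "/demo/file": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])],
}

# DataNode storage: datanode -> {block id -> block contents}
DATANODES = {
    "dn1": {"blk_1": b"hello "},
    "dn2": {"blk_1": b"hello ", "blk_2": b"hdfs"},
    "dn3": {"blk_2": b"hdfs"},
}


def get_block_locations(path):
    """Step 1: ask the namenode where the file's blocks live."""
    return NAMENODE[path]


def read_file(path, dead=frozenset()):
    """Steps 2 and 3: fetch each block from any live datanode that holds it."""
    data = b""
    for block_id, holders in get_block_locations(path):
        live = [dn for dn in holders if dn not in dead]
        if not live:
            raise IOError("no live replica for " + block_id)
        data += DATANODES[live[0]][block_id]
    return data


print(read_file("/demo/file"))
print(read_file("/demo/file", dead={"dn2"}))
```

Both calls return the same bytes, because every block still has at least one live holder when dn2 is down. That is the same effect the experiment below observes on a real cluster.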

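HDFS reliability

Redundant replica policy

Rack policy

Heartbeat mechanism

Safe mode

Block checksums are used to verify file integrity

Trash

Metadata protection

Snapshot mechanism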
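The checksum item in the list above can be illustrated with a small sketch. It assumes CRC32 over fixed-size chunks of 512 bytes, which to my knowledge matches the HDFS default (io.bytes.per.checksum) in this generation of Hadoop. The helper names chunk_checksums and first_corrupt_chunk are hypothetical.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # assumed chunk size, see the note above


def chunk_checksums(data, chunk=BYTES_PER_CHECKSUM):
    """Compute one CRC32 per fixed-size chunk of the data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def first_corrupt_chunk(data, expected, chunk=BYTES_PER_CHECKSUM):
    """Return the index of the first chunk whose CRC32 differs, or None."""
    for i, (actual, want) in enumerate(zip(chunk_checksums(data, chunk), expected)):
        if actual != want:
            return i
    return None


original = bytes(range(256)) * 8          # 2048 bytes, i.e. 4 chunks
stored = chunk_checksums(original)        # saved when the block is written

damaged = bytearray(original)
damaged[600] ^= 0xFF                      # flip one byte inside chunk 1
print(first_corrupt_chunk(original, stored))
print(first_corrupt_chunk(bytes(damaged), stored))
```

Checking per chunk rather than per file means a reader can detect a bad replica early and switch to another one without rereading everything.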

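I tested the redundant replica policy, the heartbeat mechanism, safe mode, and the trash separately. The experiment below is about the redundant replica policy.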



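Environment:

Namenode/Master/jobtracker: h1/192.168.221.130

SecondaryNameNode: h1s/192.168.221.131

Four Datanodes: h1s (131) and h2~h4 (IP range: 142~144)

To avoid a file so small that it consists of a single block (block/blk), we prepare a somewhat larger file (600 MB) so that it is spread over several datanodes, then stop some of them and see whether anything goes wrong.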


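First, put the file into HDFS (for convenience, I recommend appending hadoop/bin to your $PATH variable):

hadoop fs -put ~/Documents/IMMAUSWX201304

When it finishes, we want to look at the file's blocks. You can do that in the web UI, or run the fsck command on the namenode:

bin/hadoop fsck /user/hadoop_admin/in/bigfile  -files -blocks -locations > ~/hadoopfiles/log1.txt

The output below shows that this 600 MB file was split into 9 blocks of at most 64 MB each and spread across all of my current datanodes (4 in total). The distribution looks fairly even: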



/user/hadoop_admin/in/bigfile/USWX201304 597639882 bytes, 9 block(s):  OK

0. blk_-4541681964616523124_1011 len=67108864 repl=3 [192.168.221.131:50010, 192.168.221.142:50010, 192.168.221.144:50010]


1. blk_4347039731705448097_1011 len=67108864 repl=3 [192.168.221.143:50010, 192.168.221.131:50010, 192.168.221.144:50010]


2. blk_-4962604929782655181_1011 len=67108864 repl=3 [192.168.221.142:50010, 192.168.221.143:50010, 192.168.221.144:50010]


3. blk_2055128947154747381_1011 len=67108864 repl=3 [192.168.221.143:50010, 192.168.221.142:50010, 192.168.221.144:50010]


4. blk_-2280734543774885595_1011 len=67108864 repl=3 [192.168.221.131:50010, 192.168.221.142:50010, 192.168.221.144:50010]


5. blk_6802612391555920071_1011 len=67108864 repl=3 [192.168.221.143:50010, 192.168.221.142:50010, 192.168.221.144:50010]


6. blk_1890624110923458654_1011 len=67108864 repl=3 [192.168.221.143:50010, 192.168.221.142:50010, 192.168.221.144:50010]


7. blk_226084029380457017_1011 len=67108864 repl=3 [192.168.221.143:50010, 192.168.221.131:50010, 192.168.221.144:50010]


8. blk_-1230960090596945446_1011 len=60768970 repl=3 [192.168.221.142:50010, 192.168.221.143:50010, 192.168.221.144:50010]



Status: HEALTHY

Total size:    597639882 B

Total dirs:    0

Total files:   1

Total blocks (validated):      9 (avg. block size 66404431 B)

Minimally replicated blocks:   9 (100.0 %)

Over-replicated blocks:        0 (0.0 %)

Under-replicated blocks:       0 (0.0 %)

Mis-replicated blocks:         0 (0.0 %)

Default replication factor:    3

Average block replication:     3.0

Corrupt blocks:                0

Missing replicas:              0 (0.0 %)

Number of data-nodes:          4

Number of racks:               1
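The figures in this report can be cross-checked with a little arithmetic. The sketch below uses only numbers that appear in the fsck output above (file size, 64 MB block size, replication factor 3). The variable names are mine.

```python
FILE_SIZE = 597639882             # bytes, from "Total size"
BLOCK_SIZE = 64 * 1024 * 1024     # 67108864, the len= of a full block
REPLICATION = 3                   # "Default replication factor"

# Ceiling division: a partially filled last block still counts as a block.
num_blocks = (FILE_SIZE + BLOCK_SIZE - 1) // BLOCK_SIZE
last_block = FILE_SIZE - (num_blocks - 1) * BLOCK_SIZE
avg_block = FILE_SIZE // num_blocks
total_replicas = num_blocks * REPLICATION

print(num_blocks)        # 9
print(last_block)        # 60768970, the len= of block 8
print(avg_block)         # 66404431, the "avg. block size"
print(total_replicas)    # 27
```

So the cluster stores 27 block replicas for this one file, which is why losing a couple of datanodes still leaves every block reachable.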



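All four DNs (h1s, h2, h3, h4) hold replicas. I went to h2 (142) and h3 (143) and stopped the datanode on each, then ran a get from h4. To my surprise the file came back, and at first glance its size was correct. In the listing above, the replicas on 142 and 143 (highlighted in yellow and green in the original post) are now DEAD, yet every blk still has a live source, so the data returned by GET is still complete. From this point of view Hadoop really is robust: load balancing is done well, the data looks resilient, and fault tolerance holds up.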
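Whether every block survives the loss of 142 and 143 can be checked directly against the placement in the first fsck listing. The sketch below copies that placement by hand (datanodes are identified by the last octet of their IP). The helper names survivors and replicas_per_node are hypothetical.

```python
from collections import Counter

# Block index -> datanodes holding a replica, from the first fsck listing.
PLACEMENT = {
    0: [131, 142, 144],
    1: [143, 131, 144],
    2: [142, 143, 144],
    3: [143, 142, 144],
    4: [131, 142, 144],
    5: [143, 142, 144],
    6: [143, 142, 144],
    7: [143, 131, 144],
    8: [142, 143, 144],
}


def survivors(placement, dead):
    """For each block, list the datanodes that still hold it."""
    return {blk: [dn for dn in nodes if dn not in dead]
            for blk, nodes in placement.items()}


def replicas_per_node(placement):
    """Count how many block replicas each datanode stores."""
    return Counter(dn for nodes in placement.values() for dn in nodes)


after = survivors(PLACEMENT, dead={142, 143})
lost = [blk for blk, nodes in after.items() if not nodes]
single = [blk for blk, nodes in after.items() if len(nodes) == 1]

print("replicas per datanode:", dict(replicas_per_node(PLACEMENT)))
print("blocks with no live replica:", lost)
print("blocks down to one replica:", single)
```

No block loses all of its replicas, and five of the nine blocks are left with a single copy. That matches the observation that the get still succeeds.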






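I checked again. I had originally wanted to test safe mode, but when I refreshed a little later, the blks that had only 1 live replica had all been re-replicated so that each one had 2 again!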



hadoop_admin@h1:~/hadoop-0.20.2$ hadoop fsck /user/hadoop_admin/in/bigfile  -files -blocks -locations


/user/hadoop_admin/in/bigfile/USWX201304 597639882 bytes, 9 block(s):  

Under replicated blk_-4541681964616523124_1011. Target Replicas is 3 but found 2 replica(s).


Under replicated blk_4347039731705448097_1011. Target Replicas is 3 but found 2 replica(s).


Under replicated blk_-4962604929782655181_1011. Target Replicas is 3 but found 2 replica(s).


Under replicated blk_2055128947154747381_1011. Target Replicas is 3 but found 2 replica(s).


Under replicated blk_-2280734543774885595_1011. Target Replicas is 3 but found 2 replica(s).


Under replicated blk_6802612391555920071_1011. Target Replicas is 3 but found 2 replica(s).


Under replicated blk_1890624110923458654_1011. Target Replicas is 3 but found 2 replica(s).


Under replicated blk_226084029380457017_1011. Target Replicas is 3 but found 2 replica(s).


Under replicated blk_-1230960090596945446_1011. Target Replicas is 3 but found 2 replica(s).


0. blk_-4541681964616523124_1011 len=67108864 repl=2 [192.168.221.131:50010, 192.168.221.144:50010]


1. blk_4347039731705448097_1011 len=67108864 repl=2 [192.168.221.144:50010, 192.168.221.131:50010]


2. blk_-4962604929782655181_1011 len=67108864 repl=2 [192.168.221.144:50010, 192.168.221.131:50010]


3. blk_2055128947154747381_1011 len=67108864 repl=2 [192.168.221.144:50010, 192.168.221.131:50010]


4. blk_-2280734543774885595_1011 len=67108864 repl=2 [192.168.221.131:50010, 192.168.221.144:50010]


5. blk_6802612391555920071_1011 len=67108864 repl=2 [192.168.221.144:50010, 192.168.221.131:50010]


6. blk_1890624110923458654_1011 len=67108864 repl=2 [192.168.221.144:50010, 192.168.221.131:50010]


7. blk_226084029380457017_1011 len=67108864 repl=2 [192.168.221.144:50010, 192.168.221.131:50010]


8. blk_-1230960090596945446_1011 len=60768970 repl=2 [192.168.221.144:50010, 192.168.221.131:50010]
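The listing above stops at 2 replicas even though the target is 3, because a datanode holds at most one replica of a given block. A minimal sketch of that rule follows. The function replicas_after_recovery is a hypothetical illustration and not the namenode's actual scheduling code.

```python
def replicas_after_recovery(target, live_datanodes, surviving_replicas):
    """Replica count a block can reach once re-replication has finished."""
    if surviving_replicas == 0:
        return 0  # nothing left to copy from: the block is missing
    # One replica per datanode at most, so live nodes cap the count.
    return min(target, live_datanodes)


print(replicas_after_recovery(3, 4, 3))  # all four datanodes up
print(replicas_after_recovery(3, 2, 1))  # 142 and 143 stopped
print(replicas_after_recovery(3, 1, 1))  # only h4 left
print(replicas_after_recovery(3, 1, 0))  # the blk file removed from h4
```

The four calls correspond to the four stages of this experiment: 3 replicas, then 2, then 1, then a missing block.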



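I decided to shut down one more datanode, but I waited for quite a while without the namenode noticing that it was dead. This is due to the heartbeat mechanism: each datanode sends a heartbeat to the namenode every 3 seconds to show that it is alive, and only when the namenode has received no heartbeat for a long time (5 to 10 minutes, depending on settings) does it consider the NODE dead. It then re-replicates the affected BLOCKs so that there are enough replicas for fault tolerance. Printing the status again shows that, with only one live datanode, every blk now has exactly one replica: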



hadoop_admin@h1:~$ hadoop fsck /user/hadoop_admin/in/bigfile -files -blocks -locations


/user/hadoop_admin/in/bigfile/USWX201304 597639882 bytes, 9 block(s):  Under replicated blk_-4541681964616523124_1011. Target Replicas is 3 but found 1 replica(s).


Under replicated blk_4347039731705448097_1011. Target Replicas is 3 but found 1 replica(s).


Under replicated blk_-4962604929782655181_1011. Target Replicas is 3 but found 1 replica(s).


Under replicated blk_2055128947154747381_1011. Target Replicas is 3 but found 1 replica(s).


Under replicated blk_-2280734543774885595_1011. Target Replicas is 3 but found 1 replica(s).


Under replicated blk_6802612391555920071_1011. Target Replicas is 3 but found 1 replica(s).


Under replicated blk_1890624110923458654_1011. Target Replicas is 3 but found 1 replica(s).


Under replicated blk_226084029380457017_1011. Target Replicas is 3 but found 1 replica(s).


Under replicated blk_-1230960090596945446_1011. Target Replicas is 3 but found 1 replica(s).
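The long wait before the namenode declared the datanode dead can be estimated. As far as I know, Hadoop of this generation computes the expiry as twice the recheck interval plus ten heartbeat intervals. The sketch below assumes the defaults of 3 seconds for dfs.heartbeat.interval and 5 minutes for heartbeat.recheck.interval, so treat the result as an estimate for an unmodified configuration.

```python
HEARTBEAT_INTERVAL_S = 3             # dfs.heartbeat.interval (assumed default)
RECHECK_INTERVAL_MS = 5 * 60 * 1000  # heartbeat.recheck.interval (assumed default)


def dead_node_timeout_s(heartbeat_s, recheck_ms):
    """Seconds without a heartbeat before a datanode is considered dead."""
    return 2 * recheck_ms / 1000 + 10 * heartbeat_s


timeout = dead_node_timeout_s(HEARTBEAT_INTERVAL_S, RECHECK_INTERVAL_MS)
print(timeout, "seconds, i.e.", timeout / 60, "minutes")
```

With these values the timeout is about ten and a half minutes, which fits the long wait described above. Lowering the recheck interval shortens it, at the cost of more false alarms on a flaky network.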




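Now I move one of the BLKs away from the only remaining Datanode to make the file corrupt. I want to see whether it will be restored after another DATANODE is started.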

hadoop_admin@h4:/hadoop_run/data/current$ mv blk_4347039731705448097_1011* ~/Documents/


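Then, to avoid waiting 8 minutes for the DN to send its block report, I manually changed dfs.blockreport.intervalMsec on h4 to 30000 and restarted the datanode (stop, then start). (Again, you should add hadoop/bin to your PATH so that you can run hadoop commands without the full path.) As a result, fsck detected that the block was damaged: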


hadoop_admin@h1:~$ hadoop fsck /user/hadoop_admin/in/bigfile -files -blocks -locations




/user/hadoop_admin/in/bigfile/USWX201304 597639882 bytes, 9 block(s):  Under replicated blk_-4541681964616523124_1011. Target Replicas is 3 but found 1 replica(s).



/user/hadoop_admin/in/bigfile/USWX201304: CORRUPT block blk_4347039731705448097

Under replicated blk_-4962604929782655181_1011. Target Replicas is 3 but found 1 replica(s).


Under replicated blk_2055128947154747381_1011. Target Replicas is 3 but found 1 replica(s).


Under replicated blk_-2280734543774885595_1011. Target Replicas is 3 but found 1 replica(s).


Under replicated blk_6802612391555920071_1011. Target Replicas is 3 but found 1 replica(s).


Under replicated blk_1890624110923458654_1011. Target Replicas is 3 but found 1 replica(s).


Under replicated blk_226084029380457017_1011. Target Replicas is 3 but found 1 replica(s).


Under replicated blk_-1230960090596945446_1011. Target Replicas is 3 but found 1 replica(s).


MISSING 1 blocks of total size 67108864 B

0. blk_-4541681964616523124_1011 len=67108864 repl=1 [192.168.221.144:50010]

1. blk_4347039731705448097_1011 len=67108864 MISSING!

2. blk_-4962604929782655181_1011 len=67108864 repl=1 [192.168.221.144:50010]

3. blk_2055128947154747381_1011 len=67108864 repl=1 [192.168.221.144:50010]

4. blk_-2280734543774885595_1011 len=67108864 repl=1 [192.168.221.144:50010]

5. blk_6802612391555920071_1011 len=67108864 repl=1 [192.168.221.144:50010]

6. blk_1890624110923458654_1011 len=67108864 repl=1 [192.168.221.144:50010]

7. blk_226084029380457017_1011 len=67108864 repl=1 [192.168.221.144:50010]

8. blk_-1230960090596945446_1011 len=60768970 repl=1 [192.168.221.144:50010]



Status: CORRUPT

Total size:    597639882 B

Total dirs:    0

Total files:   1

Total blocks (validated):      9 (avg. block size 66404431 B)

   ********************************

   CORRUPT FILES:        1

   MISSING BLOCKS:       1

   MISSING SIZE:         67108864 B

   CORRUPT BLOCKS:       1

   ********************************

Minimally replicated blocks:   8 (88.888885 %)

Over-replicated blocks:        0 (0.0 %)

Under-replicated blocks:       8 (88.888885 %)

Mis-replicated blocks:         0 (0.0 %)

Default replication factor:    3

Average block replication:     0.8888889

Corrupt blocks:                1

Missing replicas:              16 (200.0 %)

Number of data-nodes:          1

Number of racks:               1





The filesystem under path '/user/hadoop_admin/in/bigfile' is CORRUPT
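The summary figures in this CORRUPT report look odd at first (an average replication below 1, and 200 % missing replicas), but they follow from the per-block counts. The sketch below reproduces them. Dividing the missing count by the replicas that still exist is my reading of how the percentage is derived, chosen because it matches the printed value.

```python
TARGET = 3
# Live replicas found for blocks 0..8; block 1 is the MISSING one.
replicas_found = [1, 0, 1, 1, 1, 1, 1, 1, 1]

total_blocks = len(replicas_found)
total_replicas = sum(replicas_found)
present = [r for r in replicas_found if r > 0]

avg_replication = total_replicas / total_blocks
# Only blocks that still exist are counted as under-replicated.
missing_replicas = sum(TARGET - r for r in present)
missing_pct = 100.0 * missing_replicas / total_replicas
minimally_replicated_pct = 100.0 * len(present) / total_blocks

print(round(avg_replication, 7))     # compare "Average block replication"
print(missing_replicas, missing_pct) # compare "Missing replicas"
print(minimally_replicated_pct)      # compare "Minimally replicated blocks"
```

The missing block itself does not contribute to "Missing replicas". It is reported separately under MISSING BLOCKS.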



Now I start the DATANODE on h1s (131). Very quickly, within 30 seconds, the file was revived at full HP by Hadoop, and every blk has two replicas again:

hadoop_admin@h1:~$ hadoop fsck /user/hadoop_admin/in/bigfile -files -blocks -locations


/user/hadoop_admin/in/bigfile/USWX201304 597639882 bytes, 9 block(s):  Under replicated blk_-4541681964616523124_1011. Target Replicas is 3 but found 2 replica(s).


Under replicated blk_4347039731705448097_1011. Target Replicas is 3 but found 2 replica(s).


Under replicated blk_-4962604929782655181_1011. Target Replicas is 3 but found 2 replica(s).


Under replicated blk_2055128947154747381_1011. Target Replicas is 3 but found 2 replica(s).


Under replicated blk_-2280734543774885595_1011. Target Replicas is 3 but found 2 replica(s).


Under replicated blk_6802612391555920071_1011. Target Replicas is 3 but found 2 replica(s).


Under replicated blk_1890624110923458654_1011. Target Replicas is 3 but found 2 replica(s).


Under replicated blk_226084029380457017_1011. Target Replicas is 3 but found 2 replica(s).


Under replicated blk_-1230960090596945446_1011. Target Replicas is 3 but found 2 replica(s).


0. blk_-4541681964616523124_1011 len=67108864 repl=2 [192.168.221.144:50010, 192.168.221.131:50010]


1. blk_4347039731705448097_1011 len=67108864 repl=2 [192.168.221.131:50010, 192.168.221.144:50010]


2. blk_-4962604929782655181_1011 len=67108864 repl=2 [192.168.221.144:50010, 192.168.221.131:50010]


3. blk_2055128947154747381_1011 len=67108864 repl=2 [192.168.221.144:50010, 192.168.221.131:50010]


4. blk_-2280734543774885595_1011 len=67108864 repl=2 [192.168.221.144:50010, 192.168.221.131:50010]


5. blk_6802612391555920071_1011 len=67108864 repl=2 [192.168.221.144:50010, 192.168.221.131:50010]


6. blk_1890624110923458654_1011 len=67108864 repl=2 [192.168.221.144:50010, 192.168.221.131:50010]


7. blk_226084029380457017_1011 len=67108864 repl=2 [192.168.221.144:50010, 192.168.221.131:50010]


8. blk_-1230960090596945446_1011 len=60768970 repl=2 [192.168.221.144:50010, 192.168.221.131:50010]



The missing block was successfully copied from 131 back to 144 (h4).



Conclusion: HADOOP's fault tolerance is rock solid. I am fully convinced now!



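One more observation, not pasted here: the h4 datanode had quite a few badLinkBlock files left over from earlier reformats. When I put the same file again, Hadoop deleted all of those stale leftover block files, which shows that it is able to clean up invalid bad blocks.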

Hadoop disaster recovery test. Category: A1_HADOOP. 2015-03-02 09:38
