Archiving for File Count Reduction

Note: Archiving should be considered an advanced command due to the caveats involved.

Overview

Due to the design of HDFS, the number of files in the filesystem directly affects memory consumption in the namenode. While this is normally not a problem for small clusters, memory usage may hit the limit of the memory available on a single machine when there are more than 50 to 100 million files. In such situations, it is advantageous to have as few files as possible.

The use of Hadoop Archives is one approach to reducing the number of files in partitions. Hive has built-in support to convert files in existing partitions to a Hadoop Archive (HAR), so that a partition that once consisted of hundreds of files can occupy just ~3 files (depending on settings). The trade-off is that queries may be slower due to the additional overhead of reading from the HAR.

Note that archiving does NOT compress the files – HAR is analogous to the Unix tar command.

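Put differently, archiving only packs files together and does not compress them (this is my own reading of the note above). With tar itself, compression happens only when a flag such as -z or -j is added:

tar -zcvf /tmp/etc.tar.gz  /etc  <== pack, then compress with gzip

tar -jcvf /tmp/etc.tar.bz2 /etc  <== pack, then compress with bzip2

tar -zxvf /tmp/etc.tar.gz  <== extract a gzip-compressed archive

tar -jxvf /tmp/etc.tar.bz2 <== extract a bzip2-compressed archive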

Settings

There are 3 settings that should be configured before archiving is used. (Example values are shown.)

hive> set hive.archive.enabled=true;

hive> set hive.archive.har.parentdir.settable=true;

hive> set har.partfile.size=1099511627776;

hive.archive.enabled controls whether archiving operations are enabled.

hive.archive.har.parentdir.settable informs Hive whether the parent directory can be set while creating the archive. In recent versions of Hadoop, the -p option can specify the root directory of the archive. For example, if /dir1/dir2/file is archived with /dir1 as the parent directory, then the resulting archive file will contain the directory structure dir2/file. In older versions of Hadoop (prior to 2011), this option was not available, so Hive must be configured to accommodate this limitation.
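
The effect of the parent directory can be illustrated with a small Python sketch. It is only a model of the path arithmetic described above, not Hadoop code; the function name path_inside_archive is made up for this illustration.

```python
import posixpath


def path_inside_archive(file_path: str, parent_dir: str) -> str:
    """Model how a file's path is stored relative to the archive's parent directory.

    Hypothetical helper for illustration only; it does not call Hadoop.
    """
    return posixpath.relpath(file_path, parent_dir)


# The example from the text: /dir1 is the parent directory.
print(path_inside_archive("/dir1/dir2/file", "/dir1"))
```

With /dir1 as the parent, the stored path keeps dir2/ but drops the /dir1 prefix, which matches the dir2/file structure mentioned in the paragraph above.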

har.partfile.size controls the size of the files that make up the archive. The archive will contain size_of_partition/har.partfile.size files, rounded up. Higher values mean fewer files, but will result in longer archiving times due to the reduced number of mappers.
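
As a rough sketch of the rounding rule above, the number of part files can be estimated in Python. The function name and the sample partition sizes are assumptions made for this example, and the count covers only the data part files described here; a HAR also carries its own index files on top of them.

```python
def har_part_file_count(partition_bytes: int, part_file_bytes: int) -> int:
    """Estimate size_of_partition / har.partfile.size, rounded up.

    Hypothetical helper for illustration; uses integer arithmetic so that
    very large byte counts are not affected by floating-point rounding.
    """
    if part_file_bytes <= 0:
        raise ValueError("har.partfile.size must be positive")
    if partition_bytes < 0:
        raise ValueError("partition size cannot be negative")
    return (partition_bytes + part_file_bytes - 1) // part_file_bytes


ONE_TIB = 1099511627776  # the example value of har.partfile.size shown earlier

# A 300 GiB partition fits into a single part file.
print(har_part_file_count(300 * 2**30, ONE_TIB))

# A 2.5 TiB partition needs three part files.
print(har_part_file_count(5 * 2**39, ONE_TIB))
```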

Usage

Archive

Once the settings are configured, a partition can be archived with the archive command:

ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12')

Unarchive

The partition can be restored to its original files with the unarchive command:

ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12')

Cautions and Limitations

In some older versions of Hadoop, HAR had a few bugs that could cause data loss or other errors. Be sure that these patches are integrated into your version of Hadoop:

https://issues.apache.org/jira/browse/HADOOP-6591 (fixed in Hadoop 0.21.0)

https://issues.apache.org/jira/browse/MAPREDUCE-1548 (fixed in Hadoop 0.22.0)

https://issues.apache.org/jira/browse/MAPREDUCE-2143 (fixed in Hadoop 0.22.0)

https://issues.apache.org/jira/browse/MAPREDUCE-1752 (fixed in Hadoop 0.23.0)

The HarFileSystem class still has a bug that has yet to be fixed:

https://issues.apache.org/jira/browse/MAPREDUCE-1877 (moved to https://issues.apache.org/jira/browse/HADOOP-10906 in 2014)

Hive comes with the HiveHarFileSystem class that addresses some of these issues, and is by default the value for fs.har.impl. Keep this in mind if you're rolling your own version of HarFileSystem:

The default HiveHarFileSystem.getFileBlockLocations() has no locality. That means it may introduce higher network loads or reduced performance.

Archived partitions cannot be overwritten with INSERT OVERWRITE. The partition must be unarchived first.

If two processes attempt to archive the same partition at the same time, the outcome is unpredictable, because concurrency support has not been implemented yet.

Under the Hood

Internally, when a partition is archived, a HAR is created using the files from the partition's original location (such as /warehouse/table/ds=1). The parent directory of the partition is specified to be the same as the original location and the resulting archive is named 'data.har'. The archive is moved under the original directory (such as /warehouse/table/ds=1/data.har), and the partition's location is changed to point to the archive.
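
The path handling described above can be sketched in Python as follows. This is only a model of where the archive ends up; the function name is hypothetical, and the exact form of the updated partition location that Hive stores is not shown here.

```python
import posixpath

ARCHIVE_NAME = "data.har"  # archive name given in the text


def archive_path_for(partition_location: str) -> str:
    """Return where the archive sits after it is moved under the original directory.

    Hypothetical helper for illustration only; it does not touch HDFS.
    """
    return posixpath.join(partition_location, ARCHIVE_NAME)


# The example location from the text.
print(archive_path_for("/warehouse/table/ds=1"))
```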
