splittability A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be.
splittability
- Created by Confluence Administrator, last modified by Lefty Leverenz on Sep 19, 2017
Compressed Data Storage
Keeping data compressed in Hive tables has, in some cases, been known to give better performance than uncompressed storage; both in terms of disk usage and query performance.
You can import text files compressed with Gzip or Bzip2 directly into a table stored as TextFile. The compression will be detected automatically and the file will be decompressed on-the-fly during query execution. For example:
CREATE TABLE raw (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' ; LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw; |
The table 'raw' is stored as a TextFile, which is the default storage. However, in this case Hadoop will not be able to split your file into chunks/blocks and run multiple maps in parallel. This can cause underutilization of your cluster's 'mapping' power.
【 A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be.】
The recommended practice is to insert data into another table, which is stored as a SequenceFile. A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be. For example:
CREATE TABLE raw (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' ; CREATE TABLE raw_sequence (line STRING) STORED AS SEQUENCEFILE; LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw; SET hive.exec.compress.output= true ; SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below) INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw; |
The value for io.seqfile.compression.type determines how the compression is performed. Record compresses each value individually while BLOCK buffers up 1MB (default) before doing compression.
LZO Compression
See LZO Compression for information about using LZO with Hive.
splittability A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be.的更多相关文章
- 初次启动hive,解决 ls: cannot access /home/hadoop/spark-2.2.0-bin-hadoop2.6/lib/spark-assembly-*.jar: No such file or directory问题
>>提君博客原创 http://www.cnblogs.com/tijun/ << 刚刚安装好hive,进行第一次启动 提君博客原创 [hadoop@ltt1 bin]$ ...
- hadoop输入分片计算(Map Task个数的确定)
作业从JobClient端的submitJobInternal()方法提交作业的同时,调用InputFormat接口的getSplits()方法来创建split.默认是使用InputFormat的子类 ...
- Hadoop 使用Combiner提高Map/Reduce程序效率
众所周知,Hadoop框架使用Mapper将数据处理成一个<key,value>键值对,再网络节点间对其进行整理(shuffle),然后使用Reducer处理数据并进行最终输出. 在上述过 ...
- C#、JAVA操作Hadoop(HDFS、Map/Reduce)真实过程概述。组件、源码下载。无法解决:Response status code does not indicate success: 500。
一.Hadoop环境配置概述 三台虚拟机,操作系统为:Ubuntu 16.04. Hadoop版本:2.7.2 NameNode:192.168.72.132 DataNode:192.168.72. ...
- 【hadoop】如何向map和reduce脚本传递参数,加载文件和目录
本文主要讲解三个问题: 1 使用Java编写MapReduce程序时,如何向map.reduce函数传递参数. 2 使用Streaming编写MapReduce程序(C/C++ ...
- Hadoop 2.4.1 Map/Reduce小结【原创】
看了下MapReduce的例子.再看了下Mapper和Reducer源码,理清了参数的意义,就o了. public class Mapper<KEYIN, VALUEIN, KEYOUT, VA ...
- hadoop 2.2.0 编译报错: [ERROR] class file for org.mortbay.component.AbstractLifeCycle not found
[ERROR] class file for org.mortbay.component.AbstractLifeCycle not found 错误堆栈如下: [ERROR] COMPILATIO ...
- hadoop用mutipleInputs实现map读取不同格式的文件
mapmap读取不同格式的文件这个问题一直就有,之前的读取方式是在map里获取文件的名称,依照名称不同分不同的方式读取,比如以下的方式 //取文件名 InputSplit inputSplit = c ...
- Weekly Contest 78-------->811. Subdomain Visit Count (split string with space and hash map)
A website domain like "discuss.leetcode.com" consists of various subdomains. At the top le ...
随机推荐
- 应用js函数柯里化currying 与ajax 局部刷新dom
直接上代码吧 最近读javascript核心概念及实践的代码 感觉很有用 备忘. <div id="request"></div> <script t ...
- os.system() 和 os.popen()
1.os.popen(command[, mode[, bufsize]]) os.system(command) 2.os.popen() 功能强于os.system() , os.popen() ...
- nrm+nvm
一.nvm的安装和使用 nvm全称Node Version Manager是 Nodejs 版本管理器,它让我们能方便的对 Nodejs 的版 本进行切换. nvm 的官方版本只支持 Linux ...
- Go语言入门——数组、切片和映射(下)
上篇主要介绍了Go语言里面常见的复合数据类型的声明和初始化. 这篇主要针对数组.切片和映射这些复合数据类型从其他几个方面介绍比较下. 1.遍历 不管是数组.切片还是映射结构,都是一种集合类型,要从这些 ...
- Network | Public-key cryptography
公开密钥加密public-key cryptography,也称为非对称(密钥)加密. 非对称密钥,是指一对加密密钥与解密密钥,这两个密钥是数学相关,用某用户密钥加密后所得的信息,只能用该用户的解密密 ...
- git history 记录(上传到 issu-170 )
一.上传到gitlab 本地issu-170落后git很多,发生冲突的要手动修改. 2000 cd robot_demo_0226_ws/ 2001 ls 2002 cd IGV01-SW-170 2 ...
- 字符串(NSString)及常见字符串处理函数
从本系列文章的开始,我们就使用过字符串对象,但是我们却还没有比较详细的介绍过它.使用@符,再一对双引号将一组字符串引用起来,例如: @”In fact, Objective-C is very sim ...
- 邁向IT專家成功之路的三十則鐵律 鐵律二十九 IT人富足之道-信仰
天地自然的循環法則,讓每一個人都必須經歷生.老.病.死.喜.怒.哀.樂,然而至始至終無盡的煩惱總是遠多於快樂,因此筆者深信每一個人在一生當中,都必須要有適合自己的正確信仰,他可以在你無法以物質力量來解 ...
- Opencv 改进的外接矩形合并拼接方法
上一篇中的方法存在的问题是矩形框不够精确,而且效果不能达到要求 这里使用凸包检测的方法,并将原来膨胀系数由20缩小到5,达到了更好的效果 效果图: 效果图: 代码: #include <open ...
- C#报错"线程间操作无效: 从不是创建控件“XXX”的线程访问它"--解决示例
C# Winform程序中,使用线程对界面进行更新需要特殊处理,否则会出现异常“线程间操作无效: 从不是创建控件“taskView”的线程访问它.” 在网文“http://www.cnblogs.co ...