splittability A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be.
splittability
- Created by Confluence Administrator, last modified by Lefty Leverenz on Sep 19, 2017
Compressed Data Storage
Keeping data compressed in Hive tables has, in some cases, been known to give better performance than uncompressed storage; both in terms of disk usage and query performance.
You can import text files compressed with Gzip or Bzip2 directly into a table stored as TextFile. The compression will be detected automatically and the file will be decompressed on-the-fly during query execution. For example:
CREATE TABLE raw (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' ; LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw; |
The table 'raw' is stored as a TextFile, which is the default storage. However, in this case Hadoop will not be able to split your file into chunks/blocks and run multiple maps in parallel. This can cause underutilization of your cluster's 'mapping' power.
【 A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be.】
The recommended practice is to insert data into another table, which is stored as a SequenceFile. A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be. For example:
CREATE TABLE raw (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' ; CREATE TABLE raw_sequence (line STRING) STORED AS SEQUENCEFILE; LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw; SET hive.exec.compress.output= true ; SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below) INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw; |
The value for io.seqfile.compression.type determines how the compression is performed. Record compresses each value individually while BLOCK buffers up 1MB (default) before doing compression.
LZO Compression
See LZO Compression for information about using LZO with Hive.
splittability A SequenceFile can be split by Hadoop and distributed across map jobs whereas a GZIP file cannot be.的更多相关文章
- 初次启动hive,解决 ls: cannot access /home/hadoop/spark-2.2.0-bin-hadoop2.6/lib/spark-assembly-*.jar: No such file or directory问题
>>提君博客原创 http://www.cnblogs.com/tijun/ << 刚刚安装好hive,进行第一次启动 提君博客原创 [hadoop@ltt1 bin]$ ...
- hadoop输入分片计算(Map Task个数的确定)
作业从JobClient端的submitJobInternal()方法提交作业的同时,调用InputFormat接口的getSplits()方法来创建split.默认是使用InputFormat的子类 ...
- Hadoop 使用Combiner提高Map/Reduce程序效率
众所周知,Hadoop框架使用Mapper将数据处理成一个<key,value>键值对,再网络节点间对其进行整理(shuffle),然后使用Reducer处理数据并进行最终输出. 在上述过 ...
- C#、JAVA操作Hadoop(HDFS、Map/Reduce)真实过程概述。组件、源码下载。无法解决:Response status code does not indicate success: 500。
一.Hadoop环境配置概述 三台虚拟机,操作系统为:Ubuntu 16.04. Hadoop版本:2.7.2 NameNode:192.168.72.132 DataNode:192.168.72. ...
- 【hadoop】如何向map和reduce脚本传递参数,加载文件和目录
本文主要讲解三个问题: 1 使用Java编写MapReduce程序时,如何向map.reduce函数传递参数. 2 使用Streaming编写MapReduce程序(C/C++ ...
- Hadoop 2.4.1 Map/Reduce小结【原创】
看了下MapReduce的例子.再看了下Mapper和Reducer源码,理清了参数的意义,就o了. public class Mapper<KEYIN, VALUEIN, KEYOUT, VA ...
- hadoop 2.2.0 编译报错: [ERROR] class file for org.mortbay.component.AbstractLifeCycle not found
[ERROR] class file for org.mortbay.component.AbstractLifeCycle not found 错误堆栈如下: [ERROR] COMPILATIO ...
- hadoop用mutipleInputs实现map读取不同格式的文件
mapmap读取不同格式的文件这个问题一直就有,之前的读取方式是在map里获取文件的名称,依照名称不同分不同的方式读取,比如以下的方式 //取文件名 InputSplit inputSplit = c ...
- Weekly Contest 78-------->811. Subdomain Visit Count (split string with space and hash map)
A website domain like "discuss.leetcode.com" consists of various subdomains. At the top le ...
随机推荐
- vue.js源码学习分享(四)
/** * Generate a static keys string from compiler modules.//从编译器生成一个静态键字符串模块. */ function genStaticK ...
- 标准C程序设计七---50
Linux应用 编程深入 语言编程 标准C程序设计七---经典C11程序设计 以下内容为阅读: <标准C程序设计>(第7版) 作者 ...
- 转载 关于malloc
1.函数原型及说明: void *malloc(long NumBytes):该函数分配了NumBytes个字节,并返回了指向这块内存的指针.如果分配失败,则返回一个空指针(NULL). 关于分配失败 ...
- 什麼是 struct,union,enumeration 的 tag ?
struct tag { member-list }; union tag { member-list }; enum tag { member-list }; union test1 { int a ...
- html --- rem
// rem (function(doc, win) { var docEle = doc.documentElement, evt = "onorientati ...
- BZOJ 1044 木棍分割(二分答案 + DP优化)
题目链接 木棍分割 1044: [HAOI2008]木棍分割 Time Limit: 10 Sec Memory Limit: 162 MBSubmit: 3830 Solved: 1453[S ...
- 着陆攻击LAND Attack
着陆攻击LAND Attack 着陆攻击LAND Attack也是一种拒绝服务攻击DOS.LAND是Local Area Network Denial的缩写,意思是局域网拒绝服务攻击,翻译为着陆攻 ...
- HDU3625 Examining the Rooms
@(HDU)[Stirling數] Problem Description A murder happened in the hotel. As the best detective in the t ...
- kafka生产者客户端
kafka的生产者 1. 生产者客户端开发 熟悉kafka的朋友都应该知道kafka客户端有新旧版本,老版本采用scala编写,新版本采用java编写.随着kafka版本的升级,旧版本客户端已经快 ...
- 使用viewPage实现图片轮播
概述 图片循环播放这种效果,在许多的场合都能看到,只要一打开各大主流网站的首页几乎都有一个这样的组件,它可以很显目的提供给用户最近最火热的信息.因为它应用得如此之广泛,今天,我们就来写一下这个组件. ...