hive中使用rcfile

（1）建student & student1 表：（hive 托管）
create table student(id INT, age INT, name STRING)
partitioned by(stat_date STRING)
clustered by(id) sorted by(age) into 4 buckets
row format delimited fields terminated by ',';

create table studentrc(id INT, age INT, name STRING)
partitioned by(stat_date STRING)
clustered by(id) sorted by(age) into 4 buckets
row format delimited fields terminated by ',' stored as rcfile;

create table studentlzo(id INT, age INT, name STRING)
partitioned by(stat_date STRING)
clustered by(id) sorted by(age) into 4 buckets
row format delimited fields terminated by ',' stored as rcfile;

文件格式 textfile， sequencefile， rcfile
（2）设置环境变量：
set hive.enforce.bucketing = true;
（3）插入数据：
LOAD DATA local INPATH '/home/hadoop/hivetest1.txt' OVERWRITE INTO TABLE student partition(stat_date="20120802");

(CPU使用率很高)
from student
insert overwrite table student1 partition(stat_date="20120802")
select id,age,name where stat_date="20120802" sort by age;

查看数据
select id, age, name from student distribute by id ; // distribute相当于mapreduce中的key

抽选数据(一般测试的情况下使用)
select * from student tablesample(bucket 1 out of 2 on id);
TABLESAMPLE(BUCKET x OUT OF y)
其中, x必须比y小, y必须是在创建表的时候bucket on的数量的因子或者倍数, hive会根据y的大小来决定抽样多少, 比如原本分了32分, 当y=16时, 抽取32/16=2分, 这时TABLESAMPLE(BUCKET 3 OUT OF 16) 就意味着要抽取第3和第16+3=19分的样品. 如果y=64，这要抽取 32/64=1/2份数据, 这时TABLESAMPLE(BUCKET 3 OUT OF 64) 意味着抽取第3份数据的一半来进行.

rcfile操作

// 导入(gzip压缩)
set hive.enforce.bucketing=true;
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
from student
insert overwrite table studentrc partition(stat_date="20120802")
select id,age,name where stat_date="20120802" sort by age;

// lzo压缩
set hive.io.rcfile.record.buffer.size = 16777216; // 16 * 1024 * 1024
set io.file.buffer.size = 131072; // 缓冲区大小 128 * 1024

set hive.enforce.bucketing=true;
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
set io.compression.codecs=com.hadoop.compression.lzo.LzoCodec;
from student
insert overwrite table studentlzo partition(stat_date="20120802")
select id,age,name where stat_date="20120802" sort by age;

// sequencefile导入
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
insert overwrite table studentseq select * from student;

hive中使用rcfile的更多相关文章

hive中rcfile格式(收藏文)
首先声明,此文是属于纯粹收藏文,感觉讲的很不错. 本文介绍了Facebook公司数据分析系统中的RCFile存储结构,该结构集行存储和列存储的优点于一身,在MapReduce环境下的大规模数据分析中扮 ...
Hive中的数据库(Database)和表(Table)
在前面的文章中,介绍了可以把Hive当成一个"数据库",它也具备传统数据库的数据单元,数据库(Database/Schema)和表(Table). 本文介绍一下Hive中的数据库( ...
Hive中的HiveServer2、Beeline及数据的压缩和存储
1.使用HiveServer2及Beeline HiveServer2的作用:将hive变成一种server服务对外开放,多个客户端可以连接. 启动namenode.datanode.resource ...
Hive存储格式之RCFile详解，RCFile的过去现在和未来
我在整理Hive的存储格式和压缩格式,本来打算一篇发出来,结果其中一小节就有很多内容,于是打算写成Hive存储格式和压缩格式系列. 本节主要讲一下Hive存储格式最早的典型的列式存储格式RCFile. ...
SparkSQL读取Hive中的数据
由于我Spark采用的是Cloudera公司的CDH,并且安装的时候是在线自动安装和部署的集群.最近在学习SparkSQL,看到SparkSQL on HIVE.下面主要是介绍一下如何通过SparkS ...
hive中分析函数window子句
hive中有些分析函数功能确实很强大,在和sum,max等聚合函数结合起来能实现不少功能. 直接上代码演示吧原始数据 channel1 2016-11-10 1 channel1 2016-11-1 ...
hive中的一种假NULL现象
使用hive时,我们偶尔会遇到这样的问题,当你将结果输出到屏幕时,查出的数据往往显示为null,但是当你将结果输出到文本时,却显示为空(即未填充),这是为什么呢? 在hive中有一种假NULL,它看起 ...
hive中导入json格式的数据（hive分区表）
hive中建立外部分区表,外部数据格式是json的如何导入呢? json格式的数据表不必含有分区字段,只需要在hdfs目录结构中体现出分区就可以了 This is all according to t ...
sqoop将关系型数据库的表导入hive中
1.sqoop 将关系型数据库的数据导入hive的参数说明:

随机推荐

Linux利器strace
strace常用来跟踪进程执行时的系统调用和所接收的信号. 在Linux世界,进程不能直接访问硬件设备,当进程需要访问硬件设备(比如读取磁盘文件,接收网络数据等等)时,必须由用户态模式切换至内核态模式 ...
Linux安装PHP加速器Xcache
XCache 是一个又快又稳定的 PHP opcoolcode 缓存器. 经过良好的测试并在大流量/高负载的生产机器上稳定运行. 经过(在linux 上)测试并支持所有现行 PHP 分支的最新发布版本 ...
php 中全局变量global 的使用
简介即使开发一个新的大型PHP程序,你也不可避免的要使用到全局数据,因为有些数据是需要用到你的代码的不同部分的.一些常见的全局数据有:程序设定类.数据库连接类.用户资料等等.有很多方法能够使这些数 ...
一个典型案例为你解读TDSQL 全时态数据库系统
欢迎大家前往腾讯云+社区,获取更多腾讯海量技术实践干货哦~ 本文由腾讯技术工程官方号发表在腾讯云+社区经典案例增量抽取.增量计算等都是T-TDSQL的经典案例.如下以增量计算为例,来分析T-TDS ...
IBM Rational Appscan: Part 2 ---reference
http://resources.infosecinstitute.com/appscan-part-2/ By Rohit T|August 16th, 2012 ----------------- ...
生成ks.cfg文件
RHEL 7下生成ks.cfg文件环境 + RHEL 7 + 字符界面, 没有安装图形界面软件包安装 + `yum install system-config-kickstart -y` + `y ...
命令行编译java项目
命令行编译java项目项目名: testproj 目录 src -> cn -> busix -> test bin lib 编译项目 cd testproj javac -d . ...
MySQL出现时区错误的解决方法
目录环境问题分析解决方法环境 windows10 MySQL 8.0.13 IDEA 问题 The server time zone value 'ÖÐ¹ú±ê×¼Ê±¼ä' is unre ...
Scrum之初体验
一.前言入职两个月,作为新人,没有参加过一次早晨的scrum会议. 最大的感触就是,有一天中午,带我的开发哥哥突然说产品今天下午提测,我突然就懵了. 这算是我体会的最大的团队中人没有参加scrum, ...
css3 border-image及连续的图像边框
border-image 它是下面几个值的简写: border-image-source // 使用绝对或相对地址url,引入图片 border-image-slice //切割图片,取值支持:& ...

hive中使用rcfile

hive中使用rcfile的更多相关文章

随机推荐

热门专题