File size differences across Hive table storage formats, with no duplicate data

-- Key point: the target table contains no duplicate data

-- dbName.num_result has no duplicate records

-- Insert data

CREATE TABLE dbName.test_textfile(
  `key` string,
  `value` string,
  `p_key` string,
  `p_key2` string)
STORED AS textfile
;

insert overwrite table dbName.test_textfile select * from dbName.num_result where p_key='' and p_key2='';

drop table if exists dbName.test_orcfile;

CREATE TABLE dbName.test_orcfile(
  `key` string,
  `value` string,
  `p_key` string,
  `p_key2` string)
STORED AS orc
;

insert overwrite table dbName.test_orcfile select * from dbName.test_textfile;

CREATE TABLE dbName.test_rcfile(
  `key` string,
  `value` string,
  `p_key` string,
  `p_key2` string)
STORED AS rcfile
;

insert overwrite table dbName.test_rcfile select * from dbName.test_textfile;

CREATE TABLE dbName.test_parquet(
  `key` string,
  `value` string,
  `p_key` string,
  `p_key2` string)
STORED AS parquet
;

insert overwrite table dbName.test_parquet select * from dbName.test_textfile;

-- Count the rows in each table

select count(1) as cnt from dbName.test_textfile;

select count(1) as cnt from dbName.test_orcfile;

select count(1) as cnt from dbName.test_rcfile;

select count(1) as cnt from dbName.test_parquet;

-- Measure the file sizes

dfs -du -s -h hdfs://nameservice1/user/hive/warehouse/dbName.db/test_text*;

dfs -du -s -h hdfs://nameservice1/user/hive/warehouse/dbName.db/test_par*;

dfs -du -s -h hdfs://nameservice1/user/hive/warehouse/dbName.db/test_rc*;

dfs -du -s -h hdfs://nameservice1/user/hive/warehouse/dbName.db/test_orc*;

1.0 G  3.1 G  hdfs://nameNode/user/hive/warehouse/dbName.db/test_textfile

1.1 G  3.3 G  hdfs://nameNode/user/hive/warehouse/dbName.db/test_parquet

984.0 M  2.9 G  hdfs://nameNode/user/hive/warehouse/dbName.db/test_rcfile

470.0 M  1.4 G  hdfs://nameNode/user/hive/warehouse/dbName.db/test_orcfile

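The results show that when the data contains no duplicates, Parquet's compression has nothing to work with: test_parquet (1.1 G) takes up more space than test_textfile (1.0 G). RCFile is only slightly smaller than textfile (984.0 M), and ORC is the most compact format by a wide margin (470.0 M). The DDL above sets no compression codec, so each format is running with its default settings.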

hive (dbName)> dfs -du -s hdfs://nameNode/user/hive/warehouse/dbName.db/test_text*;

1110741501  3332224503  hdfs://nameNode/user/hive/warehouse/dbName.db/test_textfile

hive (dbName)> dfs -du -s hdfs://nameNode/user/hive/warehouse/dbName.db/test_par*;

1167366639  3502099917  hdfs://nameNode/user/hive/warehouse/dbName.db/test_parquet

hive (dbName)> dfs -du -s hdfs://nameNode/user/hive/warehouse/dbName.db/test_rc*;

1031774688  3095324064  hdfs://nameNode/user/hive/warehouse/dbName.db/test_rcfile

hive (dbName)> dfs -du -s hdfs://nameNode/user/hive/warehouse/dbName.db/test_orc*;

492795434  1478386302  hdfs://nameNode/user/hive/warehouse/dbName.db/test_orcfile
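
To make the comparison easier to read, the byte counts can be expressed relative to the textfile table. The following is a minimal Python sketch, not part of the original test: the dictionary `SIZES` and the helper `size_ratios` are names made up for illustration, and the numbers are copied from the first column of the `dfs -du -s` output above. The second column in that output is exactly three times the first for every table, which is consistent with it reporting disk usage under an HDFS replication factor of 3 (an inference from the numbers, not something the original states).

```python
# Sizes in bytes, copied from the first column of the `dfs -du -s` output above.
# SIZES and size_ratios are illustrative names, not part of the original test.
SIZES = {
    "test_textfile": 1110741501,
    "test_parquet": 1167366639,
    "test_rcfile": 1031774688,
    "test_orcfile": 492795434,
}


def size_ratios(sizes, baseline="test_textfile"):
    """Return each table's size divided by the baseline table's size."""
    base = sizes[baseline]
    return {name: size / base for name, size in sizes.items()}


if __name__ == "__main__":
    # Print the tables from smallest to largest relative to textfile.
    ratios = size_ratios(SIZES)
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{name:<14} {ratio:6.1%} of textfile")
```

With these figures, ORC comes out at roughly 44% of the textfile size, RCFile at roughly 93%, and Parquet at roughly 105%.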
