Write Custom Java to Create LZO Files
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO
- Created by Lefty Leverenz, last modified on Sep 19, 2017
LZO Compression
General LZO Concepts
LZO is a lossless data compression library that favors speed over compression ratio. See http://www.oberhumer.com/opensource/lzo and http://www.lzop.org for general information about LZO and see Compressed Data Storage for information about compression in Hive.
Imagine a simple data file that has three columns:
- id
- first name
- last name
Let's populate a data file containing 4 records:
19630001 john lennon
19630002 paul mccartney
19630003 george harrison
19630004 ringo starr
Let's call the data file /path/to/dir/names.txt.
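As a minimal sketch, the sample file could also be generated from Java. The class name NamesFileWriter is hypothetical, and the tab separator is an assumption chosen to match the FIELDS TERMINATED BY '\t' clause in the table definition later in this document; the listing above does not show which delimiter the original file used.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hive or hadoop-lzo.
class NamesFileWriter {
    // The four sample records: id, first name, last name.
    static final String[][] RECORDS = {
        {"19630001", "john", "lennon"},
        {"19630002", "paul", "mccartney"},
        {"19630003", "george", "harrison"},
        {"19630004", "ringo", "starr"},
    };

    // Joins each record's columns with a tab (assumed delimiter)
    // and writes one record per line.
    static void write(Path target) throws IOException {
        List<String> lines = new ArrayList<>();
        for (String[] record : RECORDS) {
            lines.add(String.join("\t", record));
        }
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Falls back to names.txt in the working directory when no path is given.
        Path target = Paths.get(args.length > 0 ? args[0] : "names.txt");
        write(target);
        System.out.println("Wrote " + RECORDS.length + " records to " + target);
    }
}
```

Passing /path/to/dir/names.txt as the first argument would write the file to the location used in the rest of this walkthrough.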
To turn it into an LZO file, run the lzop utility on it; this creates a names.txt.lzo file.
Now copy the file names.txt.lzo to HDFS.
Prerequisites
Lzo/Lzop Installations
lzo and lzop need to be installed on every node in the Hadoop cluster. The details of these installations are beyond the scope of this document.
core-site.xml
Add the following to your core-site.xml:
com.hadoop.compression.lzo.LzoCodec
com.hadoop.compression.lzo.LzopCodec
For example:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
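A quick way to sanity-check the edited file is to confirm that both LZO codec classes appear in io.compression.codecs. The sketch below does this with the JDK's built-in XML parser only. The class and method names (LzoCodecCheck, missingCodecs) are hypothetical, and the check assumes the usual core-site.xml layout of property elements with name and value children.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Hadoop or hadoop-lzo.
class LzoCodecCheck {
    static final List<String> REQUIRED = Arrays.asList(
        "com.hadoop.compression.lzo.LzoCodec",
        "com.hadoop.compression.lzo.LzopCodec");

    // Returns the required codec classes that are absent from io.compression.codecs.
    static List<String> missingCodecs(String coreSiteXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(coreSiteXml.getBytes(StandardCharsets.UTF_8)));
        List<String> configured = new ArrayList<>();
        NodeList properties = doc.getElementsByTagName("property");
        for (int i = 0; i < properties.getLength(); i++) {
            Element property = (Element) properties.item(i);
            if ("io.compression.codecs".equals(text(property, "name"))) {
                // The value is a comma-separated list of codec class names.
                for (String codec : text(property, "value").split(",")) {
                    configured.add(codec.trim());
                }
            }
        }
        List<String> missing = new ArrayList<>(REQUIRED);
        missing.removeAll(configured);
        return missing;
    }

    // Text of the first descendant element with the given tag, or "" if there is none.
    static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Inline sample that deliberately omits LzopCodec.
        String xml = "<configuration><property><name>io.compression.codecs</name>"
            + "<value>org.apache.hadoop.io.compress.GzipCodec,"
            + "com.hadoop.compression.lzo.LzoCodec</value></property></configuration>";
        System.out.println("Missing: " + missingCodecs(xml));
    }
}
```

An empty result means both classes are listed; it does not prove that the native lzo libraries are installed on every node.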
Next we run the command to create an LZO index file:
hadoop jar /path/to/jar/hadoop-lzo-cdh4-0.4.15-gplextras.jar com.hadoop.compression.lzo.LzoIndexer /path/to/HDFS/dir/containing/lzo/files
This creates the index file names.txt.lzo.index next to names.txt.lzo on HDFS.
Table Definition
The following hive -e command creates an LZO-compressed external table:
hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS hive_table_name (column_1 datatype_1......column_N datatype_N)
PARTITIONED BY (partition_col_1 datatype_1 ....col_P datatype_P)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT \"com.hadoop.mapred.DeprecatedLzoTextInputFormat\"
OUTPUTFORMAT \"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat\";"
Note: The double quotes have to be escaped so that the 'hive -e' command works correctly.
See CREATE TABLE and Hive CLI for information about command syntax.
Hive Queries
Option 1: Directly Create LZO Files
- Directly create LZO files as the output of the Hive query.
- Use the lzop command utility or your custom Java to generate .lzo.index for the .lzo files.
Hive Query Parameters
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec
SET hive.exec.compress.output=true
SET mapreduce.output.fileoutputformat.compress=true
For example:
hive -e "SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec; SET hive.exec.compress.output=true;SET mapreduce.output.fileoutputformat.compress=true; <query-string>"
Note: This option does not work if the data sets are large or the number of output files is large.
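When the Option 1 query is launched from a program rather than typed at a shell, the SET prefix can be assembled in code. The following is only a sketch: the class HiveLzoQuery and its methods are hypothetical names, and it builds an argument list suitable for java.lang.ProcessBuilder without starting Hive itself.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hive.
class HiveLzoQuery {
    // The three Option 1 parameters, in the order listed above.
    static Map<String, String> lzoOutputSettings() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("mapreduce.output.fileoutputformat.compress.codec",
            "com.hadoop.compression.lzo.LzoCodec");
        settings.put("hive.exec.compress.output", "true");
        settings.put("mapreduce.output.fileoutputformat.compress", "true");
        return settings;
    }

    // Prefixes the query with one "SET key=value;" statement per setting.
    static String script(Map<String, String> settings, String query) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            sb.append("SET ").append(e.getKey()).append('=').append(e.getValue()).append("; ");
        }
        return sb.append(query).toString();
    }

    // Argument vector for ProcessBuilder; no shell is involved,
    // so the script is passed as a single argument without extra quoting.
    static List<String> command(String query) {
        return Arrays.asList("hive", "-e", script(lzoOutputSettings(), query));
    }

    public static void main(String[] args) {
        // "<query-string>" is the same placeholder used in the example above.
        System.out.println(command("<query-string>"));
    }
}
```

Because the settings live in a map, the same script method could be reused with different values, for instance the two false settings that Option 2 requires.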
Option 2: Write Custom Java to Create LZO Files
- Create text files as the output of the Hive query.
- Write custom Java code to
  - convert Hive query generated text files to .lzo files
  - generate .lzo.index files for the .lzo files generated above
Hive Query Parameters
Prefix the query string with these parameters:
SET hive.exec.compress.output=false
SET mapreduce.output.fileoutputformat.compress=false
For example:
hive -e "SET hive.exec.compress.output=false;SET mapreduce.output.fileoutputformat.compress=false;<query-string>"
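One possible shape for the custom Java is a thin driver that shells out to the tools already used in this document (lzop, hadoop fs -put, and the LzoIndexer class) instead of calling the hadoop-lzo classes directly. This is a sketch under stated assumptions, not the method the original authors used: the class LzoConversionDriver is hypothetical, the text files are assumed to be available on the local file system already, and the driver only prints its plan unless the system property run is set to true.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical driver, not part of Hive or hadoop-lzo.
class LzoConversionDriver {
    static final String INDEXER_CLASS = "com.hadoop.compression.lzo.LzoIndexer";

    // Plans the commands: compress and upload each text file,
    // then index the HDFS directory once at the end.
    static List<List<String>> plan(List<String> localTextFiles, String hdfsDir, String hadoopLzoJar) {
        List<List<String>> commands = new ArrayList<>();
        for (String file : localTextFiles) {
            // lzop writes <file>.lzo next to the input and keeps the original.
            commands.add(Arrays.asList("lzop", file));
            commands.add(Arrays.asList("hadoop", "fs", "-put", file + ".lzo", hdfsDir));
        }
        commands.add(Arrays.asList("hadoop", "jar", hadoopLzoJar, INDEXER_CLASS, hdfsDir));
        return commands;
    }

    // Runs one command with inherited stdout/stderr and fails on a non-zero exit code.
    static void run(List<String> command) throws IOException, InterruptedException {
        int exit = new ProcessBuilder(command).inheritIO().start().waitFor();
        if (exit != 0) {
            throw new IOException("Command failed (" + exit + "): " + String.join(" ", command));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.err.println("usage: LzoConversionDriver <hdfsDir> <hadoopLzoJar> <textFile>...");
            return;
        }
        List<String> files = Arrays.asList(args).subList(2, args.length);
        // Dry run by default; start the JVM with -Drun=true to execute the plan.
        boolean execute = Boolean.getBoolean("run");
        for (List<String> command : plan(files, args[0], args[1])) {
            System.out.println(String.join(" ", command));
            if (execute) {
                run(command);
            }
        }
    }
}
```

Indexing once after all uploads keeps the number of Hadoop invocations low. A driver built on the hadoop-lzo Java classes would avoid spawning processes, but it needs the hadoop-lzo jar and native libraries on the classpath.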