Garbled Chinese characters in Hadoop output
This article is reposted from:
http://www.aboutyun.com/thread-7358-1-1.html
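By default, Hadoop writes all text output as UTF-8 without a BOM. Windows, however, defaults to GBK for Chinese text, so some file types, CSV for example, show garbled characters when a BOM-less UTF-8 file is opened in Excel; only an editor such as UltraEdit (UE) or Notepad displays it correctly. Changing Hadoop's default output encoding to GBK is therefore a very common requirement.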
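The mismatch described above can be reproduced outside Hadoop. The following is a minimal sketch in plain Java, assuming the JVM ships the GBK charset (standard JDK builds do); the class name EncodingMismatchDemo and the helper readAsGbk are made up for illustration. It encodes the same two Chinese characters both ways and then decodes the UTF-8 bytes as GBK, which is roughly what a GBK-defaulting reader does with a BOM-less UTF-8 file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // Assumption: GBK is available in this JVM; Charset.forName throws otherwise.
    static final Charset GBK = Charset.forName("GBK");

    /** Decodes bytes as GBK, regardless of how they were actually encoded. */
    static String readAsGbk(byte[] bytes) {
        return new String(bytes, GBK);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes so the source file stays ASCII.
        String text = "\u4e2d\u6587";

        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        byte[] gbkBytes = text.getBytes(GBK);

        // UTF-8 uses 3 bytes per character here, GBK uses 2.
        System.out.println("UTF-8 byte count: " + utf8Bytes.length);
        System.out.println("GBK byte count: " + gbkBytes.length);

        // Reading UTF-8 bytes as GBK does not give back the original text.
        System.out.println("UTF-8 read as GBK matches: " + readAsGbk(utf8Bytes).equals(text));
        // Reading GBK bytes as GBK does.
        System.out.println("GBK read as GBK matches: " + readAsGbk(gbkBytes).equals(text));
    }
}
```

The byte counts differ (6 versus 4), and only the GBK bytes survive a GBK decode, which is why the output has to be produced in the encoding the reader expects.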
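By default, the statement in the main MapReduce program that determines the output encoding is the one that sets the output format: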
- job.setOutputFormatClass(TextOutputFormat.class);
The source code of TextOutputFormat.class is as follows:
- /**
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
- package org.apache.hadoop.mapreduce.lib.output;
- import java.io.DataOutputStream;
- import java.io.IOException;
- import java.io.UnsupportedEncodingException;
- import org.apache.hadoop.classification.InterfaceAudience;
- import org.apache.hadoop.classification.InterfaceStability;
- import org.apache.hadoop.conf.Configuration;
- import org.apache.hadoop.fs.FileSystem;
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.fs.FSDataOutputStream;
- import org.apache.hadoop.io.NullWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.io.compress.CompressionCodec;
- import org.apache.hadoop.io.compress.GzipCodec;
- import org.apache.hadoop.mapreduce.OutputFormat;
- import org.apache.hadoop.mapreduce.RecordWriter;
- import org.apache.hadoop.mapreduce.TaskAttemptContext;
- import org.apache.hadoop.util.*;
- /** An {@link OutputFormat} that writes plain text files. */
- @InterfaceAudience.Public
- @InterfaceStability.Stable
- public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {
- public static String SEPERATOR = "mapreduce.output.textoutputformat.separator";
- protected static class LineRecordWriter<K, V>
- extends RecordWriter<K, V> {
- private static final String utf8 = "UTF-8"; // change UTF-8 to GBK
- private static final byte[] newline;
- static {
- try {
- newline = "\n".getBytes(utf8);
- } catch (UnsupportedEncodingException uee) {
- throw new IllegalArgumentException("can't find " + utf8 + " encoding");
- }
- }
- protected DataOutputStream out;
- private final byte[] keyValueSeparator;
- public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
- this.out = out;
- try {
- this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
- } catch (UnsupportedEncodingException uee) {
- throw new IllegalArgumentException("can't find " + utf8 + " encoding");
- }
- }
- public LineRecordWriter(DataOutputStream out) {
- this(out, "\t");
- }
- /**
- * Write the object to the byte stream, handling Text as a special
- * case.
- * @param o the object to print
- * @throws IOException if the write throws, we pass it on
- */
- private void writeObject(Object o) throws IOException {
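- if (o instanceof Text) { // comment out this line
- Text to = (Text) o; // comment out this line
- out.write(to.getBytes(), 0, to.getLength()); // comment out this line
- } else { // comment out this line
- out.write(o.toString().getBytes(utf8));
- } // comment out this line, so that every object, Text or not, is written through toString()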
- }
- public synchronized void write(K key, V value)
- throws IOException {
- boolean nullKey = key == null || key instanceof NullWritable;
- boolean nullValue = value == null || value instanceof NullWritable;
- if (nullKey && nullValue) {
- return;
- }
- if (!nullKey) {
- writeObject(key);
- }
- if (!(nullKey || nullValue)) {
- out.write(keyValueSeparator);
- }
- if (!nullValue) {
- writeObject(value);
- }
- out.write(newline);
- }
- public synchronized
- void close(TaskAttemptContext context) throws IOException {
- out.close();
- }
- }
- public RecordWriter<K, V>
- getRecordWriter(TaskAttemptContext job
- ) throws IOException, InterruptedException {
- Configuration conf = job.getConfiguration();
- boolean isCompressed = getCompressOutput(job);
- String keyValueSeparator= conf.get(SEPERATOR, "\t");
- CompressionCodec codec = null;
- String extension = "";
- if (isCompressed) {
- Class<? extends CompressionCodec> codecClass =
- getOutputCompressorClass(job, GzipCodec.class);
- codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
- extension = codec.getDefaultExtension();
- }
- Path file = getDefaultWorkFile(job, extension);
- FileSystem fs = file.getFileSystem(conf);
- if (!isCompressed) {
- FSDataOutputStream fileOut = fs.create(file, false);
- return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
- } else {
- FSDataOutputStream fileOut = fs.create(file, false);
- return new LineRecordWriter<K, V>(new DataOutputStream
- (codec.createOutputStream(fileOut)),
- keyValueSeparator);
- }
- }
- }
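As the utf8 constant in LineRecordWriter shows, Hadoop hard-codes UTF-8 for this output format. To change the text encoding of Hadoop's output, it is therefore enough to define a class GbkOutputFormat that copies TextOutputFormat and likewise extends FileOutputFormat (note that this is org.apache.hadoop.mapreduce.lib.output.FileOutputFormat), as in the following code: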
- import java.io.DataOutputStream;
- import java.io.IOException;
- import java.io.UnsupportedEncodingException;
- import org.apache.hadoop.classification.InterfaceAudience;
- import org.apache.hadoop.classification.InterfaceStability;
- import org.apache.hadoop.conf.Configuration;
- import org.apache.hadoop.fs.FileSystem;
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.fs.FSDataOutputStream;
- import org.apache.hadoop.io.NullWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.io.compress.CompressionCodec;
- import org.apache.hadoop.io.compress.GzipCodec;
- import org.apache.hadoop.mapreduce.OutputFormat;
- import org.apache.hadoop.mapreduce.RecordWriter;
- import org.apache.hadoop.mapreduce.TaskAttemptContext;
- import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
- import org.apache.hadoop.util.*;
- @InterfaceAudience.Public
- @InterfaceStability.Stable
- public class GbkOutputFormat<K, V> extends FileOutputFormat<K, V> {
- public static String SEPERATOR = "mapreduce.output.textoutputformat.separator";
- protected static class LineRecordWriter<K, V>
- extends RecordWriter<K, V> {
- private static final String utf8 = "GBK"; // field name kept from TextOutputFormat; the value is now GBK
- private static final byte[] newline;
- static {
- try {
- newline = "\n".getBytes(utf8);
- } catch (UnsupportedEncodingException uee) {
- throw new IllegalArgumentException("can't find " + utf8 + " encoding");
- }
- }
- protected DataOutputStream out;
- private final byte[] keyValueSeparator;
- public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
- this.out = out;
- try {
- this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
- } catch (UnsupportedEncodingException uee) {
- throw new IllegalArgumentException("can't find " + utf8 + " encoding");
- }
- }
- public LineRecordWriter(DataOutputStream out) {
- this(out, "\t");
- }
- /**
- * Write the object to the byte stream, re-encoding its string form as GBK.
- * Text is no longer a special case, because its raw bytes are UTF-8.
- * @param o the object to print
- * @throws IOException if the write throws, we pass it on
- */
- private void writeObject(Object o) throws IOException {
- // No instanceof check: keeping only the Text branch would silently drop non-Text keys and values.
- out.write(o.toString().getBytes(utf8));
- }
- public synchronized void write(K key, V value)
- throws IOException {
- boolean nullKey = key == null || key instanceof NullWritable;
- boolean nullValue = value == null || value instanceof NullWritable;
- if (nullKey && nullValue) {
- return;
- }
- if (!nullKey) {
- writeObject(key);
- }
- if (!(nullKey || nullValue)) {
- out.write(keyValueSeparator);
- }
- if (!nullValue) {
- writeObject(value);
- }
- out.write(newline);
- }
- public synchronized
- void close(TaskAttemptContext context) throws IOException {
- out.close();
- }
- }
- public RecordWriter<K, V>
- getRecordWriter(TaskAttemptContext job
- ) throws IOException, InterruptedException {
- Configuration conf = job.getConfiguration();
- boolean isCompressed = getCompressOutput(job);
- String keyValueSeparator= conf.get(SEPERATOR, "\t");
- CompressionCodec codec = null;
- String extension = "";
- if (isCompressed) {
- Class<? extends CompressionCodec> codecClass =
- getOutputCompressorClass(job, GzipCodec.class);
- codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
- extension = codec.getDefaultExtension();
- }
- Path file = getDefaultWorkFile(job, extension);
- FileSystem fs = file.getFileSystem(conf);
- if (!isCompressed) {
- FSDataOutputStream fileOut = fs.create(file, false);
- return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
- } else {
- FSDataOutputStream fileOut = fs.create(file, false);
- return new LineRecordWriter<K, V>(new DataOutputStream
- (codec.createOutputStream(fileOut)),
- keyValueSeparator);
- }
- }
- }
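The effect of the modified write path can be checked without a cluster. The sketch below is not Hadoop code: GbkLineWriterSketch and encodeLine are hypothetical names, and the method only mimics what the record writer above does for one record (key, separator, value, newline, all converted through toString() and encoded as GBK). It assumes the JVM provides the GBK charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;

class GbkLineWriterSketch {
    // Assumption: GBK is available in this JVM; Charset.forName throws otherwise.
    static final Charset GBK = Charset.forName("GBK");

    /** Encodes one key/value record as a GBK line, in the same order the record writer uses. */
    static byte[] encodeLine(Object key, Object value, String separator) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            // Every object goes through toString(), so the target charset applies to all types.
            out.write(key.toString().getBytes(GBK));
            out.write(separator.getBytes(GBK));
            out.write(value.toString().getBytes(GBK));
            out.write("\n".getBytes(GBK));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // A two-character Chinese key (Unicode escapes keep the source ASCII) and an integer value.
        byte[] line = encodeLine("\u4e2d\u6587", 42, "\t");

        // 4 bytes for the key, 1 for the tab, 2 for "42", 1 for the newline.
        System.out.println("bytes written: " + line.length);
        // Decoding as GBK restores the record exactly.
        System.out.println("decodes as GBK: " + new String(line, GBK).equals("\u4e2d\u6587\t42\n"));
    }
}
```

Because the integer value is written as well as the string key, the sketch also shows why the conversion should apply to every object rather than only to Text.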
Finally, set the job's output format class to GbkOutputFormat.class, for example:
- job.setOutputFormatClass(GbkOutputFormat.class);
Reference:
- http://semantic.iteye.com/blog/1846238