Android: Detecting the Encoding of a Text File

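Method 1: rely on how Windows marks the encoding of text files.

On Windows, txt files saved as Unicode, Unicode big endian, or UTF-8 begin with a few extra bytes (a byte order mark, BOM): FF FE for Unicode (UTF-16LE), FE FF for Unicode big endian (UTF-16BE), and EF BB BF for UTF-8.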
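These byte sequences are the single character U+FEFF encoded in each scheme. The following is a minimal sketch that checks this with the JDK's own encoders; the class name `BomBytes` and its helper methods are made up for illustration and are not part of the original code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomBytes {
    // Formats bytes as upper-case hex pairs separated by spaces.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the single character U+FEFF with the given charset.
    public static String bomFor(Charset cs) {
        return hex("\uFEFF".getBytes(cs));
    }

    public static void main(String[] args) {
        System.out.println("UTF-16LE: " + bomFor(StandardCharsets.UTF_16LE)); // FF FE
        System.out.println("UTF-16BE: " + bomFor(StandardCharsets.UTF_16BE)); // FE FF
        System.out.println("UTF-8:    " + bomFor(StandardCharsets.UTF_8));    // EF BB BF
    }
}
```

The two UTF-16 results are the same two bytes in opposite order, which is why the mark also reveals the byte order.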

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public static String getCharset(File file) {
    String charset = "GBK"; // default when nothing else matches
    byte[] first3Bytes = new byte[3];
    BufferedInputStream bis = null;
    try {
        boolean checked = false;
        bis = new BufferedInputStream(new FileInputStream(file));
        bis.mark(3); // the read limit must cover the 3 bytes read before reset()
        int read = bis.read(first3Bytes, 0, 3);
        if (read == -1)
            return charset; // empty file
        if (first3Bytes[0] == (byte) 0xFF && first3Bytes[1] == (byte) 0xFE) {
            charset = "UTF-16LE";
            checked = true;
        } else if (first3Bytes[0] == (byte) 0xFE
                && first3Bytes[1] == (byte) 0xFF) {
            charset = "UTF-16BE";
            checked = true;
        } else if (first3Bytes[0] == (byte) 0xEF
                && first3Bytes[1] == (byte) 0xBB
                && first3Bytes[2] == (byte) 0xBF) {
            charset = "UTF-8";
            checked = true;
        }
        bis.reset();
        if (!checked) {
            // No BOM: scan the bytes for a pattern typical of UTF-8.
            int loc = 0;
            while ((read = bis.read()) != -1) {
                loc++;
                if (read >= 0xF0)
                    break;
                // A lone byte in 0x80-0xBF is also treated as GBK.
                if (0x80 <= read && read <= 0xBF)
                    break;
                if (0xC0 <= read && read <= 0xDF) {
                    read = bis.read();
                    // Two-byte pattern (0xC0-0xDF)(0x80-0xBF): it can also
                    // occur in GB encodings, so keep scanning.
                    if (0x80 <= read && read <= 0xBF)
                        continue;
                    else
                        break;
                    // This can still be wrong, but the probability is small.
                } else if (0xE0 <= read && read <= 0xEF) {
                    read = bis.read();
                    if (0x80 <= read && read <= 0xBF) {
                        read = bis.read();
                        if (0x80 <= read && read <= 0xBF) {
                            // Three-byte pattern: treat the file as UTF-8.
                            charset = "UTF-8";
                            break;
                        } else
                            break;
                    } else
                        break;
                }
            }
            // Debug output: position and value of the byte that ended the scan.
            System.out.println(loc + " " + Integer.toHexString(read));
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // Close the stream on every path, including the early return.
        if (bis != null) {
            try {
                bis.close();
            } catch (IOException ignored) {
            }
        }
    }
    return charset;
}

Drawback: files created on Linux cannot be detected this way.
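For files without a BOM, one alternative is to test whether the bytes decode cleanly as UTF-8 using the JDK's strict decoder. This is only a sketch, not part of the original method: it assumes the whole file content (or a chunk that does not end in the middle of a character) is already in memory, and it keeps GBK as the fallback, as Method 1 does. The names `Utf8Check`, `isValidUtf8`, and `guessCharset` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true when the bytes form well-formed UTF-8.
    public static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // With REPORT, any malformed sequence raises an exception
            // instead of being replaced silently.
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Hypothetical helper: UTF-8 when the bytes validate, otherwise GBK.
    public static String guessCharset(byte[] data) {
        return isValidUtf8(data) ? "UTF-8" : "GBK";
    }
}
```

Pure ASCII input also passes the check, which is harmless because ASCII is a subset of UTF-8. Short GBK text can occasionally happen to be valid UTF-8, so the result is still a guess.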

Method 2: the open-source project JCharDet

http://www.iteye.com/topic/266501

package org.mozilla.intl.chardet;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

/**
 * Determines the character set of a file with the help of JCharDet.
 * @author icer
 * PS:
 * JCharDet is a Java port of Mozilla's automatic charset detection
 * algorithm. Its official home page is:
 *      http://jchardet.sourceforge.net/
 * @date	2008/11/13
 */
public class FileCharsetDetector {
	private boolean found = false;

	/**
	 * If a charset detection algorithm matches completely, this field holds
	 * the name of that charset. Otherwise (for example, a binary file) it
	 * keeps its default value null, and the probable charsets are consulted
	 * instead.
	 */
	private String encoding = null;

	public static void main(String[] argv) throws Exception {
		if (argv.length != 1 && argv.length != 2) {
			System.out
					.println("Usage: FileCharsetDetector <path> [<languageHint>]");
			System.out.println("");
			System.out.println("Where <path> is d:/demo.txt");
			System.out.println("For optional <languageHint>. Use following...");
			System.out.println("		1 => Japanese");
			System.out.println("		2 => Chinese");
			System.out.println("		3 => Simplified Chinese");
			System.out.println("		4 => Traditional Chinese");
			System.out.println("		5 => Korean");
			System.out.println("		6 => Don't know (default)");
			return;
		} else {
			String encoding = null;
			if (argv.length == 2) {
				encoding = new FileCharsetDetector().guestFileEncoding(argv[0],
						Integer.valueOf(argv[1]));
			} else {
				encoding = new FileCharsetDetector().guestFileEncoding(argv[0]);
			}
			System.out.println("File encoding: " + encoding);
		}
	}

	/**
	 * Detects the encoding of the given File object.
	 *
	 * @param file
	 *            the File instance
	 * @return the file encoding, or null if none was found
	 * @throws FileNotFoundException
	 * @throws IOException
	 */
	public String guestFileEncoding(File file) throws FileNotFoundException,
			IOException {
		return geestFileEncoding(file, new nsDetector());
	}

	/**
	 * Detects the encoding of a file.
	 *
	 * @param file
	 *            the File instance
	 * @param languageHint
	 *            language hint code, e.g. 1 : Japanese; 2 : Chinese;
	 *            3 : Simplified Chinese; 4 : Traditional Chinese;
	 *            5 : Korean; 6 : Don't know (default)
	 * @return the file encoding, e.g. UTF-8, GBK, GB2312, or null if none
	 *         was found
	 * @throws FileNotFoundException
	 * @throws IOException
	 */
	public String guestFileEncoding(File file, int languageHint)
			throws FileNotFoundException, IOException {
		return geestFileEncoding(file, new nsDetector(languageHint));
	}

	/**
	 * Detects the encoding of a file.
	 *
	 * @param path
	 *            the file path
	 * @return the file encoding, e.g. UTF-8, GBK, GB2312, or null if none
	 *         was found
	 * @throws FileNotFoundException
	 * @throws IOException
	 */
	public String guestFileEncoding(String path) throws FileNotFoundException,
			IOException {
		return guestFileEncoding(new File(path));
	}

	/**
	 * Detects the encoding of a file.
	 *
	 * @param path
	 *            the file path
	 * @param languageHint
	 *            language hint code, e.g. 1 : Japanese; 2 : Chinese;
	 *            3 : Simplified Chinese; 4 : Traditional Chinese;
	 *            5 : Korean; 6 : Don't know (default)
	 * @return the file encoding, or null if none was found
	 * @throws FileNotFoundException
	 * @throws IOException
	 */
	public String guestFileEncoding(String path, int languageHint)
			throws FileNotFoundException, IOException {
		return guestFileEncoding(new File(path), languageHint);
	}

	/**
	 * Detects the encoding of a file with the given detector.
	 *
	 * @param file
	 *            the File instance
	 * @param det
	 *            the detector to use
	 * @return the file encoding, or null if none was found
	 * @throws FileNotFoundException
	 * @throws IOException
	 */
	private String geestFileEncoding(File file, nsDetector det)
			throws FileNotFoundException, IOException {
		// Set an observer...
		// The Notify() will be called when a matching charset is found.
		det.Init(new nsICharsetDetectionObserver() {
			public void Notify(String charset) {
				found = true;
				encoding = charset;
			}
		});
		BufferedInputStream imp = new BufferedInputStream(new FileInputStream(
				file));
		byte[] buf = new byte[1024];
		int len;
		boolean done = false;
		boolean isAscii = true;
		try {
			while ((len = imp.read(buf, 0, buf.length)) != -1) {
				// Check if the stream is only ascii.
				if (isAscii)
					isAscii = det.isAscii(buf, len);
				// DoIt if non-ascii and not done yet.
				if (!isAscii && !done)
					done = det.DoIt(buf, len, false);
			}
		} finally {
			// Close the stream even if reading fails.
			imp.close();
		}
		det.DataEnd();
		if (isAscii) {
			encoding = "ASCII";
			found = true;
		}
		if (!found) {
			String prob[] = det.getProbableCharsets();
			if (prob.length > 0) {
				// Nothing matched exactly, so take the first probable charset.
				encoding = prob[0];
			} else {
				return null;
			}
		}
		return encoding;
	}
}

JAR download: http://download.csdn.net/detail/u012587637/8047697

Method 3: the open-source project juniversalchardet

http://code.google.com/p/juniversalchardet/

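import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.mozilla.universalchardet.UniversalDetector;

public static String getFileIncode(File file) {
	if (!file.exists()) {
		System.err.println("getFileIncode: file does not exist!");
		return null;
	}
	byte[] buf = new byte[4096];
	FileInputStream fis = null;
	try {
		fis = new FileInputStream(file);
		// (1) Create the detector.
		UniversalDetector detector = new UniversalDetector(null);
		// (2) Feed it data until it is done or the file ends.
		int nread;
		while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
			detector.handleData(buf, 0, nread);
		}
		// (3) Tell the detector that there is no more data.
		detector.dataEnd();
		// (4) Read the result; null means no encoding was detected.
		String encoding = detector.getDetectedCharset();
		if (encoding != null) {
			System.out.println("Detected encoding = " + encoding);
		} else {
			System.out.println("No encoding detected.");
		}
		// (5) Reset the detector before it is reused.
		detector.reset();
		return encoding;
	} catch (Exception e) {
		e.printStackTrace();
	} finally {
		// Close the stream on every path, including failures.
		if (fis != null) {
			try {
				fis.close();
			} catch (IOException ignored) {
			}
		}
	}
	return null;
}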

How to add the JAR to the project:

Put the JAR in the libs directory.

Select the JAR, then right-click --> Build Path --> Add to Build Path.

JAR download: http://download.csdn.net/detail/u012587637/8041181

Note: the third method is somewhat faster than the second and is also newer, so the third is the recommended one.
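Whichever method is used, the result is a charset name that still has to be applied when the file is read. The following is a rough sketch of that last step using only JDK classes; `DetectedTextReader`, `resolve`, and `readAll` are hypothetical names, and the fallback charset is an assumption the caller has to supply. It also drops a leading U+FEFF, because Java's UTF-8, UTF-16LE, and UTF-16BE decoders hand the BOM through as an ordinary character.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class DetectedTextReader {
    // Resolves a detected charset name, using the fallback when the
    // name is null, malformed, or not supported by this JVM.
    public static Charset resolve(String detected, String fallback) {
        try {
            if (detected != null && Charset.isSupported(detected)) {
                return Charset.forName(detected);
            }
        } catch (IllegalArgumentException e) {
            // Malformed charset name: fall through to the fallback.
        }
        return Charset.forName(fallback);
    }

    // Reads the whole file as text with the resolved charset.
    public static String readAll(File file, String detected, String fallback)
            throws IOException {
        Charset cs = resolve(detected, fallback);
        StringBuilder sb = new StringBuilder();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), cs));
        try {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } finally {
            reader.close();
        }
        // Drop a leading byte order mark if the decoder passed it through.
        if (sb.length() > 0 && sb.charAt(0) == '\uFEFF') {
            sb.deleteCharAt(0);
        }
        return sb.toString();
    }
}
```

A call such as `DetectedTextReader.readAll(file, getFileIncode(file), "GBK")` would combine this with Method 3; the GBK fallback there simply mirrors the default used in Method 1.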
