[NLP] Computing entropy and KL distance: recognizing Chinese characters and English words in Java, and reading variable-length UTF-8 characters

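The task:

1. Given a file, compute the relative frequency of every character in it (the relative frequency is the character's estimated probability: its number of occurrences divided by the total number of characters), then compute the entropy of the file.

2. Given a second file, compute its character distribution in the same way, then compute the KL distance between the character distributions of the two files.

(Entropy and KL distance are standard terms in natural language processing. Each involves only a formula or two, so they should not get in the way of reading the code. Just try it!)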
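For reference, the two quantities can be written as below. This is a sketch of the standard definitions, using the natural logarithm to match the Math.log calls in the code; the symbols c(x) (occurrence count of unit x), N (total count), P and Q (the distributions of the two files) are notation introduced here, not the article's.

```latex
p(x) = \frac{c(x)}{N}, \qquad
H(P) = -\sum_{x} p(x) \ln p(x), \qquad
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} p(x) \ln \frac{p(x)}{q(x)}
```

Note that KL is not symmetric: D(P||Q) generally differs from D(Q||P), so "distance" is used loosely here.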

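Notes:

1. The two files may both be Chinese, both be English, or both be mixed Chinese and English. For Chinese the unit counted is the character; for English it is the word.

2. Spaces, line breaks, and punctuation are not valid characters.

3. Chinese characters, English words, and the remaining non-valid characters are written to three separate files, each together with its occurrence count.

4. The code is written in Java.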
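The counting rule in the notes above (one unit per Chinese character, one unit per English word, everything else ignored) can be sketched with a single regular expression. This is only an illustration, not the approach the article's class takes: the class name MixedTokenCounter and its count method are hypothetical, and the sketch assumes that "Chinese character" means any code point in the Unicode Han script and that an English word is a run of ASCII letters.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the article's NLPFileUnit class.
class MixedTokenCounter {
    // Either one Han ideograph, or a run of ASCII letters (one English word).
    // Whitespace, digits, and punctuation match neither branch and are skipped.
    private static final Pattern TOKEN = Pattern.compile("\\p{IsHan}|[A-Za-z]+");

    // Count every token found in the text, keeping first-seen order.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // \u6211 and \u7231 are the Chinese characters for "I" and "love";
        // escapes keep the source file pure ASCII.
        System.out.println(count("\u6211\u7231NLP, \u6211 love NLP!"));
    }
}
```

For the sample string, the map holds four entries: the character U+6211 twice, U+7231 once, "NLP" twice, and "love" once; the comma, spaces, and exclamation mark are dropped.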

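Key points of this article:

1. How to tell that a character is a Chinese character rather than ASCII, punctuation, Japanese, Arabic, and so on.

2. How Chinese characters are encoded. UTF-8 is the kind of thing that can easily cost you a whole afternoon to figure out.

3. Regular expressions. Anyone with a computer science background should already be familiar with them, so I won't go over them here.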
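The first two points can be tried out in isolation before reading the full listing. The sketch below is an assumption-laden illustration, not the article's code: the class HanAndUtf8Sketch and its two methods are hypothetical names, Chinese characters are identified through the JDK's Character.UnicodeScript (Han script), and the length of a UTF-8 sequence is derived from the bit pattern of its first byte.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the article's NLPFileUnit class.
class HanAndUtf8Sketch {
    // True if the code point belongs to the Han script (Chinese ideographs).
    // Kana, Arabic letters, ASCII, and punctuation all belong to other scripts.
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    // Length in bytes of a UTF-8 sequence, judged from its first byte.
    // Returns -1 for a continuation byte (10xxxxxx) or an invalid lead byte.
    static int utf8Length(int leadByte) {
        int v = leadByte & 0xFF;            // treat the byte as unsigned
        if (v < 0x80) return 1;             // 0xxxxxxx: ASCII
        if ((v & 0xE0) == 0xC0) return 2;   // 110xxxxx
        if ((v & 0xF0) == 0xE0) return 3;   // 1110xxxx: most Chinese characters
        if ((v & 0xF8) == 0xF0) return 4;   // 11110xxx
        return -1;
    }

    public static void main(String[] args) {
        String han = "\u6C49";              // U+6C49, the "han" of "hanzi"
        byte[] bytes = han.getBytes(StandardCharsets.UTF_8);
        System.out.println(isHan(han.codePointAt(0)));  // Han script
        System.out.println(bytes.length);               // bytes used in UTF-8
        System.out.println(utf8Length(bytes[0]));       // length read from the lead byte
        System.out.println(isHan('\u3042'));            // hiragana "a" is not Han
        System.out.println(isHan('\u3002'));            // ideographic full stop is not Han
    }
}
```

Checking the script rather than a single code block also covers the ideograph extension blocks, at the cost of treating Japanese kanji as Han too, since they share the same code points.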

The code is as follows:

 import java.io.BufferedReader;

 import java.io.FileInputStream;

 import java.io.FileReader;

 import java.io.FileWriter;

 import java.util.HashMap;

 import java.util.Iterator;

 import java.util.Map.Entry;

 import java.util.regex.Matcher;

 import java.util.regex.Pattern;

 public class NLPFileUnit {

     public HashMap<String, Integer> WordOccurrenceNumber;//occurrence count of each Chinese character

     //or English word in the file

     public HashMap<String, Float> WordProbability;//relative frequency of each Chinese character or English word

     public HashMap<String, Integer> Punctuations;//punctuation and other non-valid units filtered out of the file

     public float entropy;//entropy of the file, computed over single Chinese characters and single English words

     private String filePath;

     //constructor

     public NLPFileUnit(String filePath) throws Exception {

         this.filePath = filePath;

         WordOccurrenceNumber = createHash(createReader(filePath));

         Punctuations = filterPunctuation(WordOccurrenceNumber);

         WordProbability = calProbability(WordOccurrenceNumber);

         this.entropy = calEntropy(this.WordProbability);

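         //add the suffix before the extension only; filePath.replace(".", ...) would rewrite every dot in the path

         int sep = Math.max(filePath.lastIndexOf('/'), filePath.lastIndexOf('\\'));

         int dot = filePath.lastIndexOf('.');

         String stem = dot > sep ? filePath.substring(0, dot) : filePath;

         String ext = dot > sep ? filePath.substring(dot) : "";

         String puncPath = stem + "_punctuation" + ext;

         String wordsPath = stem + "_AllWords" + ext;

         System.out.println("all punctuations were saved at " + puncPath + "!");

         this.saveFile(Punctuations, puncPath);

         System.out.println("all words(En & Ch) were saved at " + wordsPath + "!");

         this.saveFile(this.WordOccurrenceNumber, wordsPath);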

     }

     /**

      * collect the English words from the file into the HashMap

      * @param hash map from unit to occurrence count; updated in place

      * @param path path of the UTF-8 file to scan

      * @throws Exception if the file cannot be read

      */

     public void getEnWords(HashMap<String, Integer> hash, String path) throws Exception {

         //decode explicitly as UTF-8 instead of relying on the platform default charset

         BufferedReader br = new BufferedReader(

                 new java.io.InputStreamReader(new FileInputStream(path), "UTF-8"));

         //read all lines into content; the space keeps words on adjacent lines from merging

         StringBuilder content = new StringBuilder();

         String line = null;

         while((line = br.readLine())!=null){

             content.append(line).append(' ');

         }

         br.close();

         //extract words with a regular expression

         Pattern enWordsPattern = Pattern.compile("([A-Za-z]+)");

         Matcher matcher = enWordsPattern.matcher(content);

         while (matcher.find()) {

             String word = matcher.group();

             if(hash.containsKey(word))

                 hash.put(word, 1 + hash.get(word));

             else{

                 hash.put(word, 1);

             }

         }

     }

     private boolean isPunctuation(String tmp) {

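         //punctuation is anything that is neither an English word nor a Chinese character;

         //the block below covers the basic CJK ideographs, U+4E00 to U+9FFF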

         final String cnregex = "\\p{InCJK Unified Ideographs}";

         final String enregex = "[A-Za-z]+";

         return !(tmp.matches(cnregex) || tmp.matches(enregex)) ;

     }

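     /**

      * check whether the file starts with the UTF-8 byte order mark (BOM).

      * Note: this consumes the first three bytes of the stream.

      * @param fs stream positioned at the start of the file

      * @return true if the first three bytes are EF BB BF

      * @throws Exception if reading fails

      */

     private boolean isUTF8(FileInputStream fs) throws Exception {

         //only UTF-8 files saved with a BOM start with 0xEF 0xBB 0xBF; UTF-8 files without a BOM are rejected here

         if (fs.read() == 0xEF && fs.read() == 0xBB && fs.read() == 0xBF)

             return true;

         return false;

     }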

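     /**

      * In UTF-8 the bit pattern of the first byte gives the length of the character

      * (a Chinese character usually takes three bytes; ASCII takes one byte).

      * @param b first byte of the character

      * @return number of bytes the character occupies

      */

     private int getlength(byte b) {

         int v = b & 0xff;//convert the signed byte to an unsigned value

         // 11110xxx; >= so that the lead byte 0xF0 itself is included

         if (v >= 0xF0) {

             return 4;

         }

         // 1110xxxx

         else if (v >= 0xE0) {

             return 3;

         }

         // 110xxxxx

         else if (v >= 0xC0) {

             return 2;//the character takes 2 bytes

         }

         return 1;

     }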

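     /**

      * Read the first byte to find out how many bytes the character takes, then read

      * the whole character; e.g. 1110xxxx means the character takes three bytes.

      * @param fs

      * @return the next character as a String, or null at end of file

      * @throws Exception

      */

     private String readUnit(FileInputStream fs) throws Exception {

         int first = fs.read();//keep the int so that end of file (-1) is not confused with the byte 0xFF

         if (first == -1)

             return null;

         int len = getlength((byte) first);

         byte[] units = new byte[len];

         units[0] = (byte) first;

         for (int i = 1; i < len; i++) {

             units[i] = (byte) fs.read();

         }

         String ret = new String(units, "UTF-8");

         return ret;

     }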

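     /**

      * read every unit (words, punctuation, Chinese characters, ...) into a HashMap

      * @param inputStream

      * @return map from unit to occurrence count

      * @throws Exception

      */

     private HashMap<String, Integer> createHash(FileInputStream inputStream)

             throws Exception {

         HashMap<String, Integer> hash = new HashMap<String, Integer>();

         String key = null;

         while ((key = readUnit(inputStream)) != null) {

             //skip single ASCII letters: English is counted per word by getEnWords below,

             //so counting the letters here as well would count English text twice

             if (key.matches("[A-Za-z]"))

                 continue;

             if (hash.containsKey(key)) {

                 hash.put(key, 1 + hash.get(key));

             } else {

                 hash.put(key, 1);

             }

         }

         inputStream.close();

         getEnWords(hash, this.filePath);

         return hash;

     }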

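     /**

      * open the file with a FileInputStream; fail if the file does not start with a UTF-8 BOM

      * @param path

      * @return the stream, positioned just after the BOM

      * @throws Exception

      */

     private FileInputStream createReader(String path) throws Exception {

         FileInputStream br = new FileInputStream(path);

         if (!isUTF8(br)) {

             br.close();

             //returning null here would only surface later as a NullPointerException in createHash

             throw new java.io.IOException("not a UTF-8 file with BOM: " + path);

         }

         return br;

     }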

     /**

      * move the punctuation found in (HashMap)hash into (HashMap)puncs

      * @param hash source map; punctuation entries are removed from it at the same time

      * @return the map of removed punctuation entries

      */

     private HashMap<String, Integer> filterPunctuation(

             HashMap<String, Integer> hash) {

         HashMap<String, Integer> puncs = new HashMap<String, Integer>();

         Iterator<?> iterator = hash.entrySet().iterator();

         while (iterator.hasNext()) {

             Entry<?, ?> entry = (Entry<?, ?>) iterator.next();

             String key = entry.getKey().toString();

             if (isPunctuation(key)) {

                 puncs.put(key, hash.get(key));

                 iterator.remove();

             }

         }

         return puncs;

     }

     /**

      * calculate the probability of the word in hash

      * @param hash

      * @return

      */

     private HashMap<String, Float> calProbability(HashMap<String, Integer> hash) {

         float count = countWords(hash);

         HashMap<String, Float> prob = new HashMap<String, Float>();

         Iterator<?> iterator = hash.entrySet().iterator();

         while (iterator.hasNext()) {

             Entry<?, ?> entry = (Entry<?, ?>) iterator.next();

             String key = entry.getKey().toString();

             prob.put(key, hash.get(key) / count);

         }

         return prob;

     }

     /**

      * save the content in the hash into file.txt

      * @param hash

      * @param path

      * @throws Exception

      */

     private void saveFile(HashMap<String, Integer> hash, String path)

             throws Exception {

         FileWriter fw = new FileWriter(path);

         fw.write(hash.toString());

         fw.close();

     }

     /**

      * calculate the total words in hash

      * @param hash

      * @return

      */

     private int countWords(HashMap<String, Integer> hash) {

         int count = 0;

         for (Entry<String, Integer> entry : hash.entrySet()) {

             count += entry.getValue();

         }

         return count;

     }

     /**

      * calculate the entropy of the characters (Math.log is the natural logarithm, so the result is in nats)

      * @param hash

      * @return

      */

     private float calEntropy(HashMap<String, Float> hash) {

         float entropy = 0;

         Iterator<Entry<String, Float>> iterator = hash.entrySet().iterator();

         while (iterator.hasNext()) {

             Entry<String, Float> entry = (Entry<String, Float>) iterator.next();

             Float prob = entry.getValue();//get the probability of the characters

             entropy += 0 - (prob * Math.log(prob));//calculate the entropy of the characters

         }

         return entropy;

     }

 }

 import java.io.BufferedReader;

 import java.io.FileNotFoundException;

 import java.io.IOException;

 import java.io.InputStreamReader;

 import java.util.HashMap;

 import java.util.Iterator;

 import java.util.Map.Entry;

 public class NLPWork {

     /**

      * calculate the KL distance from file u1 to file u2; units that do not occur in u2

      * are skipped, so the sum covers only the units the two files share

      * @param u1

      * @param u2

      * @return

      */

     public static float calKL(NLPFileUnit u1, NLPFileUnit u2) {

         HashMap<String, Float> hash1 = u1.WordProbability;

         HashMap<String, Float> hash2 = u2.WordProbability;

         float KLdistance = 0;

         Iterator<Entry<String, Float>> iterator = hash1.entrySet().iterator();

         while (iterator.hasNext()) {

             Entry<String, Float> entry = iterator.next();

             String key = entry.getKey().toString();

             if (hash2.containsKey(key)) {

                 Float value1 = entry.getValue();

                 Float value2 = hash2.get(key);

                 KLdistance += value1 * Math.log(value1 / value2);

             }

         }

         return KLdistance;

     }

     public static void main(String[] args) throws IOException, Exception {

         //all punctuation will be saved under working directory

         System.out.println("Now only UTF8 encoded file is supported!!!");

         System.out.println("PLS input file 1 path:");

         BufferedReader cin = new BufferedReader(

                 new InputStreamReader(System.in));

         String file1 = cin.readLine();

         System.out.println("PLS input file 2 path:");

         String file2 = cin.readLine();

         NLPFileUnit u1 = null;

         NLPFileUnit u2 = null;

         try{

             u1 = new NLPFileUnit(file1);//NLP: Natural Language Processing

             u2 = new NLPFileUnit(file2);

         }

         catch(FileNotFoundException e){

             System.out.println("File Not Found!!");

             e.printStackTrace();

             return;

         }

         float KLdistance = calKL(u1, u2);

         System.out.println("KLdistance is :" + KLdistance);

         System.out.println("File 1 Entropy: " + u1.entropy);

         System.out.println("File 2 Entropy: " + u2.entropy);

     }

 }
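To check the two formulas by hand, it helps to run them on tiny distributions whose answers can be worked out on paper. The sketch below is not the article's method: the class EntropyKlSketch is a hypothetical name, it uses double instead of float, and where calKL above skips units missing from the second file, it substitutes a small floor value EPSILON for the missing probability. That floor is a crude illustrative assumption (it does not renormalize the second distribution), chosen only so that a missing unit raises the divergence instead of being ignored.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the article's NLPWork class.
class EntropyKlSketch {
    // Assumed floor for units that the second distribution never saw.
    static final double EPSILON = 1e-9;

    // H(P) = -sum p ln p, in nats.
    static double entropy(Map<String, Double> p) {
        double h = 0.0;
        for (double pi : p.values()) {
            if (pi > 0) h -= pi * Math.log(pi);
        }
        return h;
    }

    // D(P||Q) = sum p ln(p/q), in nats; missing q values fall back to EPSILON.
    static double kl(Map<String, Double> p, Map<String, Double> q) {
        double d = 0.0;
        for (Map.Entry<String, Double> e : p.entrySet()) {
            double pi = e.getValue();
            if (pi <= 0) continue;          // a zero-probability term contributes nothing
            double qi = q.getOrDefault(e.getKey(), EPSILON);
            d += pi * Math.log(pi / qi);
        }
        return d;
    }

    public static void main(String[] args) {
        Map<String, Double> p = new HashMap<>();
        p.put("a", 0.5);
        p.put("b", 0.5);
        Map<String, Double> q = new HashMap<>();
        q.put("a", 0.25);
        q.put("b", 0.75);
        System.out.println("H(P)    = " + entropy(p)); // ln 2, about 0.693
        System.out.println("D(P||Q) = " + kl(p, q));   // 0.5 * ln(4/3), about 0.144
        System.out.println("D(Q||P) = " + kl(q, p));   // differs: KL is not symmetric
    }
}
```

With the natural logarithm, a uniform distribution over two units has entropy ln 2, and the divergence of a distribution from itself is 0, which makes both functions easy to sanity-check.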

Results:
