LingPipe: Text Classification
What is Text Classification?
Text classification typically involves assigning a document to a category by automated or human means. LingPipe provides a classification facility that takes examples of text classifications, typically generated by a human, and uses language models to learn how to classify further documents. There are many other ways to construct classifiers, but language models are particularly good at some versions of this task.
20 Newsgroups Demo
A publicly available data set to work with is the 20 newsgroups data, available from the 20 Newsgroups Home Page.
4 Newsgroups Sample
We have included a sample of 4 newsgroups with the LingPipe distribution so that you can run the tutorial out of the box. You may also download the entire 20 newsgroups data set and run over it. LingPipe's performance over the whole data set is state of the art.
Quick Start
Once you have downloaded and installed LingPipe, change directories to the one containing this read-me:
cd demos/tutorial/classify
You may then run the demo from the command line (with the whole command on one line).
On Windows:
java -cp "../../../lingpipe-4.1.0.jar;classifyNews.jar" ClassifyNews
or through Ant:
ant classifyNews
The demo will then train on the data in demos/fourNewsGroups/4news-train/ and evaluate on demos/4newsgroups/4news-test. The results of scoring are printed to the command line and explained in the rest of this tutorial.
The Code
The entire source for the example is ClassifyNews.java. We will be using the API from Classifier and its subclasses to train the classifier, and Classification to evaluate it. The code should be fairly self-explanatory in terms of how training and evaluation are done. The API calls are covered below.
Training
We are going to train a set of character-based language models (one per newsgroup, as named in the static array CATEGORIES) that process data in 6-character sequences, as specified by the NGRAM_SIZE constant.
private static String[] CATEGORIES
    = { "soc.religion.christian",
        "talk.religion.misc",
        "alt.atheism",
        "misc.forsale" };

private static int NGRAM_SIZE = 6;
Generally, the smaller your data set, the smaller the n-gram size should be, but you can experiment with different values. Reasonable values range from 1 to 16, with 6 being a good general starting point.
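To see why small data sets favor a small n-gram size, it helps to count how many distinct character n-grams a text contains as n grows: longer n-grams repeat less often, so each one is backed by fewer observations. The following is a minimal standalone sketch using only the Java standard library. The class and method names are hypothetical, it does not use LingPipe, and it counts n-grams of one fixed length only.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative helper, not part of LingPipe.
class CharNGramSketch {

    // Returns the distinct character n-grams of exactly length n,
    // in order of first appearance.
    static Set<String> distinctNGrams(CharSequence text, int n) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.subSequence(i, i + n).toString());
        }
        return grams;
    }

    public static void main(String[] args) {
        String text = "for sale: bike, for sale: desk";
        // As n grows, fewer n-grams repeat, so the number of distinct
        // n-grams approaches the number of n-gram positions in the text.
        for (int n = 1; n <= 8; n++) {
            int positions = Math.max(0, text.length() - n + 1);
            System.out.println("n=" + n
                + " positions=" + positions
                + " distinct=" + distinctNGrams(text, n).size());
        }
    }
}
```

When nearly every n-gram position yields a distinct n-gram, the counts carry little statistical evidence, which is one reason to lower the n-gram size when training data is scarce.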
The actual classifier involves one language model per category. In this case, we are going to use process language models (LanguageModel.Process). There is a factory method in DynamicLMClassifier to construct actual models.
// Process language models, matching the description above
DynamicLMClassifier<NGramProcessLM> classifier
    = DynamicLMClassifier
      .createNGramProcess(CATEGORIES,
                          NGRAM_SIZE);
Two other kinds of language model classifiers may be constructed: one for bounded character language models and one for tokenized language models.
Training a classifier simply involves providing examples of text for the various categories. This is done through the handle method, after first constructing a classification from the category and then a classified object from the classification and the text:
Classification classification
    = new Classification(CATEGORIES[i]);
Classified<CharSequence> classified
    = new Classified<CharSequence>(text, classification);
classifier.handle(classified);
That's all you need to train a language model classifier. Now we can see what it can do with some evaluation data.
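To make the idea of "one language model per category" concrete, here is a toy character n-gram classifier written from scratch with only the Java standard library. It is a sketch under stated assumptions, not LingPipe's implementation: it counts fixed-length n-grams and uses simple add-one smoothing, whereas LingPipe's language models are more sophisticated. All class and method names here are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy stand-in for a language-model classifier; NOT LingPipe's implementation.
class ToyNGramClassifier {
    private final int n;
    // category -> (n-gram -> count)
    private final Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
    // category -> total number of n-grams seen in training
    private final Map<String, Integer> totals = new HashMap<>();
    // every distinct n-gram seen in any category (used for smoothing)
    private final Set<String> vocabulary = new HashSet<>();

    ToyNGramClassifier(String[] categories, int n) {
        this.n = n;
        for (String c : categories) {
            counts.put(c, new HashMap<>());
            totals.put(c, 0);
        }
    }

    // Training: count the n-grams of one labeled example.
    void train(String category, CharSequence text) {
        Map<String, Integer> catCounts = counts.get(category);
        for (int i = 0; i + n <= text.length(); i++) {
            String gram = text.subSequence(i, i + n).toString();
            catCounts.merge(gram, 1, Integer::sum);
            totals.merge(category, 1, Integer::sum);
            vocabulary.add(gram);
        }
    }

    // Sum of log2 add-one-smoothed n-gram probabilities under one category.
    double log2Score(String category, CharSequence text) {
        Map<String, Integer> catCounts = counts.get(category);
        // Vocabulary size plus one extra slot reserved for unseen n-grams.
        double denom = totals.get(category) + vocabulary.size() + 1.0;
        double sum = 0.0;
        for (int i = 0; i + n <= text.length(); i++) {
            String gram = text.subSequence(i, i + n).toString();
            int count = catCounts.getOrDefault(gram, 0);
            sum += Math.log((count + 1.0) / denom) / Math.log(2.0);
        }
        return sum;
    }

    // The category whose model gives the text the highest score.
    String bestCategory(CharSequence text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String c : counts.keySet()) {
            double score = log2Score(c, text);
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }
}
```

A caller would construct it with the category names and an n-gram size, call train once per labeled example, and then call bestCategory on unseen text. Each category keeps its own counts, so a text is scored against every category's model and the best score wins.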
Classifying News Articles
The DynamicLMClassifier is fairly slow at classification, so it is generally worth going through a compile step to produce the more efficient compiled version, which classifies character sequences into joint classification results. A simple way to do that, as in the code, is:
JointClassifier<CharSequence> compiledClassifier
    = (JointClassifier<CharSequence>)
      AbstractExternalizable.compile(classifier);
Now the rubber hits the road and we can see how well the machine learning is doing. The example code both reports classifications to the console and evaluates the performance. The crucial lines of code are:
JointClassification jc = compiledClassifier.classifyJoint(text);
String bestCategory = jc.bestCategory();
String details = jc.toString();
The text is an article that was not trained on, and the JointClassification is the result of evaluating the text against all the language models. It contains a bestCategory() method that returns the name of the highest-scoring language model for the text. To show the statistics involved, the toString() method dumps out all the results, which are presented as:
Testing on soc.religion.christian/21417
Best Cat: soc.religion.christian
Rank  Cat                       Score   P(Cat|In)   log2 P(Cat,In)
0=soc.religion.christian        -1.56   0.45        -1.56
1=talk.religion.misc            -2.68   0.20        -2.68
2=alt.atheism                   -2.70   0.20        -2.70
3=misc.forsale                  -3.25   0.13        -3.25
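The P(Cat|In) column can be related to the log2 P(Cat,In) column by Bayes' rule: the conditional probability of a category is its joint probability divided by the sum of the joint probabilities over all categories. The sketch below applies that normalization to the four values shown above. It is a standalone illustration with hypothetical names, not LingPipe code, and the assumption that the printed values can be normalized directly is mine. With these inputs the results come out within about 0.01 of the printed P(Cat|In) column.

```java
// Illustrative helper, not part of LingPipe.
class JointToConditionalSketch {

    // Converts log2 joint scores into conditional probabilities that sum to 1.
    static double[] conditional(double[] log2Joint) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : log2Joint) {
            max = Math.max(max, v);
        }
        double[] p = new double[log2Joint.length];
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            // Subtracting the maximum avoids underflow for very negative scores
            // and does not change the normalized result.
            p[i] = Math.pow(2.0, log2Joint[i] - max);
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum;
        }
        return p;
    }

    public static void main(String[] args) {
        // The log2 P(Cat,In) values from the sample output above.
        double[] scores = { -1.56, -2.68, -2.70, -3.25 };
        String[] cats = { "soc.religion.christian", "talk.religion.misc",
                          "alt.atheism", "misc.forsale" };
        double[] p = conditional(scores);
        for (int i = 0; i < p.length; i++) {
            System.out.printf("%-24s %.4f%n", cats[i], p[i]);
        }
    }
}
```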
Scoring Accuracy
The remaining API of note is how the system is scored against a gold standard, in this case our testing data. Since we know which newsgroup each article came from, we can evaluate how well the software is doing with the JointClassifierEvaluator class.
boolean storeInputs = true;
JointClassifierEvaluator<CharSequence> evaluator
    = new JointClassifierEvaluator<CharSequence>(compiledClassifier,
                                                 CATEGORIES,
                                                 storeInputs);
This class wraps the compiledClassifier in an evaluation framework that provides very rich reporting of how well the system is doing. Later in the code it is populated with data points through the handle() method, after first constructing a classified object as for training:
Classification classification
    = new Classification(CATEGORIES[i]);
Classified<CharSequence> classified
    = new Classified<CharSequence>(text, classification);
evaluator.handle(classified);
This gets a JointClassification for the text and keeps track of the results for later reporting. After all the data has been run, many methods are available to see how well the software did. In the demo code we just print out the total accuracy via the ConfusionMatrix class, but it is well worth looking at the relevant Javadoc to see what reporting is available.
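Total accuracy from a confusion matrix is the number of cases on the diagonal (reference category equals response category) divided by the total number of cases. The following is a minimal standalone sketch of that bookkeeping. The class and its method names are hypothetical and it is not LingPipe's ConfusionMatrix class, whose Javadoc should be consulted for the real API.

```java
// Toy confusion matrix; NOT LingPipe's ConfusionMatrix class.
class ConfusionSketch {
    private final String[] categories;
    // counts[reference][response]
    private final int[][] counts;

    ConfusionSketch(String[] categories) {
        this.categories = categories.clone();
        this.counts = new int[categories.length][categories.length];
    }

    private int indexOf(String category) {
        for (int i = 0; i < categories.length; i++) {
            if (categories[i].equals(category)) {
                return i;
            }
        }
        throw new IllegalArgumentException("Unknown category: " + category);
    }

    // Records one test case: the gold-standard category and the classifier's answer.
    void increment(String reference, String response) {
        counts[indexOf(reference)][indexOf(response)]++;
    }

    // Fraction of cases where the response matched the reference.
    double totalAccuracy() {
        int correct = 0;
        int total = 0;
        for (int i = 0; i < counts.length; i++) {
            for (int j = 0; j < counts[i].length; j++) {
                total += counts[i][j];
                if (i == j) {
                    correct += counts[i][j];
                }
            }
        }
        return total == 0 ? Double.NaN : (double) correct / total;
    }
}
```

The off-diagonal cells are what make a confusion matrix more informative than a single accuracy figure: they show which categories get mistaken for which.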
Cross-Validation
Running Cross-Validation
There's an Ant target crossValidateNews which cross-validates the news classifier over 10 folds. Here's what a run looks like:
> cd $LINGPIPE/demos/tutorial/classify
> ant crossValidateNews
Reading data.
Num instances=250.
Permuting corpus.
FOLD ACCU
0 1.00 +/- 0.00
1 0.96 +/- 0.08
2 0.84 +/- 0.14
3 0.92 +/- 0.11
4 1.00 +/- 0.00
5 0.96 +/- 0.08
6 0.88 +/- 0.13
7 0.84 +/- 0.14
8 0.88 +/- 0.13
9 0.84 +/- 0.14
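In k-fold cross-validation the permuted corpus is split into k contiguous folds; each fold in turn is held out for testing while the rest is used for training. With 250 instances and 10 folds, each fold holds 25 test items. The text above does not say how the +/- column is computed, but the printed values are consistent with a 95% normal-approximation binomial interval over 25 items, and the sketch below assumes that. It is a standalone illustration with hypothetical names, not the demo's code.

```java
// Illustrative helper, not part of LingPipe.
class CrossValidationSketch {

    // Start index (inclusive) of the given fold when n items are split
    // into numFolds contiguous folds; fold == numFolds gives the end index.
    static int foldStart(int n, int numFolds, int fold) {
        return (int) ((long) n * fold / numFolds);
    }

    // Half-width of a 95% normal-approximation binomial interval.
    // ASSUMPTION: this is how the +/- column is derived; the source does not say.
    static double halfWidth95(double accuracy, int numCases) {
        return 1.96 * Math.sqrt(accuracy * (1.0 - accuracy) / numCases);
    }

    public static void main(String[] args) {
        int numInstances = 250;
        int numFolds = 10;
        for (int fold = 0; fold < numFolds; fold++) {
            int start = foldStart(numInstances, numFolds, fold);
            int end = foldStart(numInstances, numFolds, fold + 1);
            System.out.println("fold " + fold + ": test items [" + start + ", " + end
                + "), training on the remaining " + (numInstances - (end - start)));
        }
        System.out.printf("accuracy 0.96 over 25 items: +/- %.4f%n", halfWidth95(0.96, 25));
    }
}
```

Because each fold has only 25 test items, the per-fold intervals are wide; averaging accuracy across all folds gives a steadier estimate than any single fold.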
Original: http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.html
My English is too weak, and the rest was too much of a struggle to continue.