Twenty Newsgroups Classification Task, Part 2: seq2sparse

seq2sparse corresponds to org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles in Mahout. The job monitoring page from yesterday's run shows that this step consists of 7 jobs: (1) DocumentTokenizer, (2) WordCount, (3) MakePartialVectors, (4) MergePartialVectors, (5) VectorTfIdf Document Frequency Count, (6) MakePartialVectors, (7) MergePartialVectors. Printing the parameter help for SparseVectorsFromSequenceFiles gives the following:

Usage:
 [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize
<chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma
<maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>
--minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>
--overwrite --help --sequentialAccessVector --namedVector --logNormalize]
Options
  --minSupport (-s) minSupport        (Optional) Minimum Support. Default
                                      Value: 2
  --analyzerName (-a) analyzerName    The class name of the analyzer
  --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB
  --output (-o) output                The directory pathname for output.
  --input (-i) input                  Path to job input directory.
  --minDF (-md) minDF                 The minimum document frequency.  Default
                                      is 1
  --maxDFSigma (-xs) maxDFSigma       What portion of the tf (tf-idf) vectors
                                      to be used, expressed in times the
                                      standard deviation (sigma) of the
                                      document frequencies of these vectors.
                                      Can be used to remove really high
                                      frequency terms. Expressed as a double
                                      value. Good value to be specified is 3.0.
                                      In case the value is less then 0 no
                                      vectors will be filtered out. Default is
                                      -1.0.  Overrides maxDFPercent
  --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.
                                      Can be used to remove really high
                                      frequency terms. Expressed as an integer
                                      between 0 and 100. Default is 99.  If
                                      maxDFSigma is also set, it will override
                                      this value.
  --weight (-wt) weight               The kind of weight to use. Currently TF
                                      or TFIDF
  --norm (-n) norm                    The norm to use, expressed as either a
                                      float or "INF" if you want to use the
                                      Infinite norm.  Must be greater or equal
                                      to 0.  The default is not to normalize
  --minLLR (-ml) minLLR               (Optional)The minimum Log Likelihood
                                      Ratio(Float)  Default is 1.0
  --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.
                                      Default Value: 1
  --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
                                      create (2 = bigrams, 3 = trigrams, etc)
                                      Default Value:1
  --overwrite (-ow)                   If set, overwrite the output directory
  --help (-h)                         Print out help
  --sequentialAccessVector (-seq)     (Optional) Whether output vectors should
                                      be SequentialAccessVectors. If set true
                                      else false
  --namedVector (-nv)                 (Optional) Whether output vectors should
                                      be NamedVectors. If set true else false
  --logNormalize (-lnorm)             (Optional) Whether output vectors should
                                      be logNormalize. If set true else false
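
To make the --minDF and --maxDFPercent options more concrete, here is a minimal sketch of how document-frequency thresholds of this kind could be applied. It is not Mahout's code: the class DfPruningSketch, the helper keptTerms and the sample documents are all hypothetical, and the real job may compute its cutoff differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not part of Mahout.
final class DfPruningSketch {

    // Keeps the terms whose document frequency df satisfies
    // minDF <= df <= (number of documents * maxDFPercent / 100).
    static Set<String> keptTerms(List<List<String>> docs, int minDF, int maxDFPercent) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : docs) {
            // Count each term at most once per document
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        long maxDF = (long) docs.size() * maxDFPercent / 100;
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            if (e.getValue() >= minDF && e.getValue() <= maxDF) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("mahout", "vector", "tfidf"),
                Arrays.asList("mahout", "hadoop"),
                Arrays.asList("mahout", "vector"),
                Arrays.asList("mahout", "lucene"));
        // "mahout" occurs in every document and is dropped as too frequent;
        // terms that occur in only one document fall below minDF = 2.
        System.out.println(keptTerms(docs, 2, 75)); // prints [vector]
    }
}
```

With the defaults from the help text (minDF 1, maxDFPercent 99), only terms that appear in practically every document would be removed under this reading.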

In the terminal output from yesterday's run, this step was invoked with the following command:

./bin/mahout seq2sparse -i /home/mahout/mahout-work-mahout/20news-seq -o /home/mahout/mahout-work-mahout/20news-vectors -lnorm -nv -wt tfidf

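Looking only at the parameters actually used: -lnorm means the output vectors are log-normalized (true when the flag is set). -nv means the output vectors are NamedVectors. What does "named" mean here? I am not sure yet; judging by the class name, each vector presumably carries a name, such as its document key, alongside its values. -wt tfidf selects the weighting scheme; for details see http://zh.wikipedia.org/wiki/TF-IDF .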
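
As a rough illustration of what a tf-idf weight is, the sketch below computes the textbook form from the Wikipedia article linked above: term frequency multiplied by the logarithm of (number of documents / document frequency). The class TfIdfSketch, the method tfIdf and the use of the natural logarithm are assumptions made for this example; Mahout's own weighting may differ in detail, for instance by damping or smoothing these terms.

```java
// Hypothetical illustration of textbook tf-idf; not Mahout's implementation.
final class TfIdfSketch {

    // termFreq: occurrences of the term in one document
    // docFreq:  number of documents that contain the term (must be > 0)
    // numDocs:  total number of documents in the corpus
    static double tfIdf(int termFreq, int docFreq, int numDocs) {
        return termFreq * Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // A term that is rare in the corpus gets a high weight
        System.out.println(tfIdf(3, 10, 1000));
        // A term that occurs in every document gets weight 0
        System.out.println(tfIdf(5, 20, 20));
    }
}
```

The second call shows why tf-idf suppresses very common words: when docFreq equals numDocs the logarithm is zero, so the term contributes nothing however often it occurs.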

Step (1) is at line 253 of SparseVectorsFromSequenceFiles:

DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);

Stepping into this call shows that the Mapper used is SequenceFileTokenizerMapper and that no Reducer is used. The Mapper code is as follows:

  protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
    TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    StringTuple document = new StringTuple();
    stream.reset();
    while (stream.incrementToken()) {
      if (termAtt.length() > 0) {
        document.add(new String(termAtt.buffer(), 0, termAtt.length()));
      }
    }
    context.write(key, document);
  }

The Mapper's setup function mainly configures the Analyzer. For the Analyzer API see http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html . The method used in map is reusableTokenStream(String fieldName, Reader reader): Creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method.
I wrote the following test program:

package mahout.fansy.test.bayes;

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.common.ClassUtils;
import org.apache.mahout.common.StringTuple;

public class TestSequenceFileTokenizerMapper {

	// Instantiate the analyzer by class name
	private static Analyzer analyzer = ClassUtils.instantiateAs(
			"org.apache.mahout.vectorizer.DefaultAnalyzer", Analyzer.class);

	public static void main(String[] args) throws IOException {
		testMap();
	}

	// Mirrors the body of SequenceFileTokenizerMapper.map() for a single record
	public static void testMap() throws IOException {
		Text key = new Text("4096");
		Text value = new Text("today is also late.what about tomorrow?");
		TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
		CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
		StringTuple document = new StringTuple();
		stream.reset();
		// Collect every non-empty token emitted by the analyzer
		while (stream.incrementToken()) {
			if (termAtt.length() > 0) {
				document.add(new String(termAtt.buffer(), 0, termAtt.length()));
			}
		}
		System.out.println("key:" + key.toString() + ",document" + document);
	}
}

The result is:

key:4096,document[today, also, late.what, about, tomorrow]

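Note that the TokenStream has a stopwords attribute whose value is: [but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of]. Whenever one of these words is encountered it is skipped, which is why "is" is missing from the output above.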
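
The effect of the stop-word list can be approximated without Lucene. The sketch below is a simplification and not how the Lucene analyzer actually works: it lowercases the text, splits on whitespace, strips punctuation only from the ends of each token, and drops the stop words listed above. The class StopWordTokenizerSketch and its tokenize method are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical approximation of tokenizing plus stop-word filtering; not Lucene's algorithm.
final class StopWordTokenizerSketch {

    // The stop-word list observed on the TokenStream in the post
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "but", "be", "with", "such", "then", "for", "no", "will", "not", "are",
            "and", "their", "if", "this", "on", "into", "a", "or", "there", "in",
            "that", "they", "was", "is", "it", "an", "the", "as", "at", "these",
            "by", "to", "of"));

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Strip punctuation at the start and end only, so "late.what" stays one token
            String token = raw.replaceAll("^[^a-z0-9]+|[^a-z0-9]+$", "");
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints key:4096,document[today, also, late.what, about, tomorrow]
        System.out.println("key:4096,document" + tokenize("today is also late.what about tomorrow?"));
    }
}
```

For this particular sentence the sketch gives the same tokens as the test program above, but that should not be expected for arbitrary text, since the real analyzer has its own tokenization rules.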

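Ugh, it has gotten too late again. I have been sleepy for a while now; off to floss...

Share, enjoy, grow

Please credit the source when reposting: http://blog.csdn.net/fansy1990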
