Mahout Naive Bayes Algorithm Development Approach (Extension) 2
If you just want to call the packaged algorithm below, you can download it directly from the "mahout贝叶斯算法拓展" (Mahout Bayes extension) download page; the algorithm is invoked as follows:
$HADOOP_HOME/bin/hadoop jar mahout.jar mahout.fansy.bayes.BayerRunner -i hdfs_input_path -o hdfs_output_path -scl : -scv ,
The supported parameters are:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
  -archives <paths>             comma separated archives to be unarchived on the compute machines.
  -conf <configuration file>    specify an application configuration file
  -D <property=value>           use value for given property
  -files <paths>                comma separated files to be copied to the map reduce cluster
  -fs <local|namenode:port>     specify a namenode
  -jt <local|jobtracker:port>   specify a job tracker
  -libjars <paths>              comma separated jar files to include in the classpath.
  -tokenCacheFile <tokensFile>  name of the file with the tokens
Job-Specific Options:
  --input (-i) input                                   Path to job input directory.
  --output (-o) output                                 The directory pathname for output.
  --splitCharacterVector (-scv) splitCharacterVector   Vector split character, default is ','
  --splitCharacterLabel (-scl) splitCharacterLabel     Vector and Label split character, default is ':'
  --help (-h)                                          Print out help
  --tempDir tempDir                                    Intermediate output directory
  --startPhase startPhase                              First phase to run
  --endPhase endPhase                                  Last phase to run
Continuing from the previous post, the remaining steps are analyzed below:
4. Obtain the Bayes model attribute values (part 2):
This step corresponds to the second prepareJob of TrainNaiveBayesJob; its mapper and reducer are taken from that job with essentially no code changes. The code is as follows:
package mahout.fansy.bayes;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.classifier.naivebayes.training.WeightsMapper;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.mapreduce.VectorSumReducer;
import org.apache.mahout.math.VectorWritable;
/**
 * The second job of the Bayes algorithm, equivalent to the second prepareJob of TrainNaiveBayesJob.
 * The original Mapper and Reducer are reused.
 * @author Administrator
 *
 */
public class BayesJob2 extends AbstractJob {
/**
* @param args
* @throws Exception
*/
public static void main(String[] args) throws Exception {
ToolRunner.run(new Configuration(), new BayesJob2(),args);
}

@Override
public int run(String[] args) throws Exception {
addInputOption();
addOutputOption();
addOption("labelNumber","ln", "The number of the labele ");
if (parseArguments(args) == null) {
return -1;
}
Path input = getInputPath();
Path output = getOutputPath();
String labelNumber=getOption("labelNumber");
Configuration conf=getConf();
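// WeightsMapper reads the label count from this "<class name>.numLabels" configuration key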
conf.set(WeightsMapper.class.getName() + ".numLabels",labelNumber);
HadoopUtil.delete(conf, output);
Job job=new Job(conf);
job.setJobName("job2 get weightsFeture and weightsLabel by job1's output:"+input.toString());
job.setJarByClass(BayesJob2.class);
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setMapperClass(WeightsMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(VectorWritable.class);
job.setCombinerClass(VectorSumReducer.class);
job.setReducerClass(VectorSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(VectorWritable.class);
SequenceFileInputFormat.setInputPaths(job, input);
SequenceFileOutputFormat.setOutputPath(job, output);
if(job.waitForCompletion(true)){
return 0;
}
return -1;
}
}
Its standalone usage is as follows:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
  -archives <paths>             comma separated archives to be unarchived on the compute machines.
  -conf <configuration file>    specify an application configuration file
  -D <property=value>           use value for given property
  -files <paths>                comma separated files to be copied to the map reduce cluster
  -fs <local|namenode:port>     specify a namenode
  -jt <local|jobtracker:port>   specify a job tracker
  -libjars <paths>              comma separated jar files to include in the classpath.
  -tokenCacheFile <tokensFile>  name of the file with the tokens
Job-Specific Options:
  --input (-i) input                Path to job input directory.
  --output (-o) output              The directory pathname for output.
  --labelNumber (-ln) labelNumber   The number of labels
  --help (-h)                       Print out help
  --tempDir tempDir                 Intermediate output directory
  --startPhase startPhase           First phase to run
  --endPhase endPhase               Last phase to run
In effect this job only adds an option for the number of labels; everything else follows AbstractJob's default parameters. A sample invocation is shown below.
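A hypothetical standalone invocation, in the same style as the command at the top of this post (the jar name mahout.jar and the label count 4 are assumptions, not values fixed by the code):

$HADOOP_HOME/bin/hadoop jar mahout.jar mahout.fansy.bayes.BayesJob2 -i hdfs_input_path -o hdfs_output_path -ln 4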
5. Write the Bayes model to a file:
This step converts the outputs of steps 3 and 4, assembles them into the Bayes model, and writes the model to a file; both the conversion and the file writing follow the corresponding methods in BayesUtils. The code is as follows:
package mahout.fansy.bayes;

import java.io.IOException;

import mahout.fansy.bayes.util.OperateArgs;

import org.apache.commons.cli.ParseException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.training.ThetaMapper;
import org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.SparseMatrix;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

import com.google.common.base.Preconditions;

public class WriteBayesModel extends OperateArgs{
/**
 * @param args: the -i and -o options are unused here; the real inputs are the outputs of job 1 and job 2,
 * and the output is the model path.
 * The model is stored as the file naiveBayesModel.bin under the output path.
 * @throws ParseException
 * @throws IOException
 */
public static void main(String[] args) throws IOException, ParseException {
String[] arg={"-jt","ubuntu:9001",
"-i","",
"-o","",
"-mp","hdfs://ubuntu:9000/user/mahout/output_bayes/bayesModel",
"-bj1","hdfs://ubuntu:9000/user/mahout/output_bayes/job1",
"-bj2","hdfs://ubuntu:9000/user/mahout/output_bayes/job2"};
new WriteBayesModel().run(arg);
}
/**
* Write the model to a file
* @param args
* @throws IOException
* @throws ParseException
*/
public int run(String[] args) throws IOException, ParseException{
// modelPath
setOption("mp","modelPath",true,"the path for bayesian model to store",true);
// bayes job 1 path
setOption("bj1","bayesJob1",true,"the path for bayes job 1",true);
// bayes job 2 path
setOption("bj2","bayesJob2",true,"the path for bayes job 2",true);
if(!parseArgs(args)){
return -1;
}
String job1Path=getNameValue("bj1");
String job2Path=getNameValue("bj2");
Configuration conf=getConf();
String modelPath=getNameValue("mp");
NaiveBayesModel naiveBayesModel=readFromPaths(job1Path,job2Path,conf);
naiveBayesModel.validate();
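// serialize() stores the model as <modelPath>/naiveBayesModel.bin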
naiveBayesModel.serialize(new Path(modelPath), getConf());
System.out.println("Write bayesian model to '"+modelPath+"/naiveBayesModel.bin'");
return 0;
}
/**
* Taken from the readModelFromDir method of BayesUtils; only the relevant paths were changed.
* @param job1Path
* @param job2Path
* @param conf
* @return
*/
public NaiveBayesModel readFromPaths(String job1Path,String job2Path,Configuration conf){
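// alphaI is the Laplace smoothing parameter; 1.0f is the fallback when the training jobs did not set ThetaMapper.ALPHA_I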
float alphaI = conf.getFloat(ThetaMapper.ALPHA_I, 1.0f);
// read feature sums and label sums
Vector scoresPerLabel = null;
Vector scoresPerFeature = null;
for (Pair<Text,VectorWritable> record : new SequenceFileDirIterable<Text, VectorWritable>(
new Path(job2Path), PathType.LIST, PathFilters.partFilter(), conf)) {
String key = record.getFirst().toString();
VectorWritable value = record.getSecond();
if (key.equals(TrainNaiveBayesJob.WEIGHTS_PER_FEATURE)) {
scoresPerFeature = value.get();
} else if (key.equals(TrainNaiveBayesJob.WEIGHTS_PER_LABEL)) {
scoresPerLabel = value.get();
}
}
Preconditions.checkNotNull(scoresPerFeature);
Preconditions.checkNotNull(scoresPerLabel);
Matrix scoresPerLabelAndFeature = new SparseMatrix(scoresPerLabel.size(), scoresPerFeature.size());
for (Pair<IntWritable,VectorWritable> entry : new SequenceFileDirIterable<IntWritable,VectorWritable>(
new Path(job1Path), PathType.LIST, PathFilters.partFilter(), conf)) {
scoresPerLabelAndFeature.assignRow(entry.getFirst().get(), entry.getSecond().get());
}
Vector perlabelThetaNormalizer = scoresPerLabel.like();
return new NaiveBayesModel(scoresPerLabelAndFeature, scoresPerFeature, scoresPerLabel, perlabelThetaNormalizer,
alphaI);
}
}
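To sanity-check the written model, it can be loaded back with NaiveBayesModel.materialize, the same call the classification mapper in step 6 uses. Below is a minimal sketch, assuming the example model path from main above and a reachable HDFS; the class name CheckBayesModel is hypothetical:

package mahout.fansy.bayes;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;

public class CheckBayesModel {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // materialize() reads the naiveBayesModel.bin file that serialize() wrote under this path
        NaiveBayesModel model = NaiveBayesModel.materialize(
                new Path("hdfs://ubuntu:9000/user/mahout/output_bayes/bayesModel"), conf);
        // the same consistency check WriteBayesModel runs before serializing
        model.validate();
        System.out.println("model loaded: " + model.numLabels() + " labels, "
                + model.numFeatures() + " features");
    }
}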
6. Apply the Bayes model to classify the original data:
This part of the code is also largely taken from the Bayes source code in Mahout; only the input-parsing part was modified. It looks as follows:
package mahout.fansy.bayes;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.classifier.naivebayes.AbstractNaiveBayesClassifier;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.classifier.naivebayes.training.WeightsMapper;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
/**
* Job that applies the Bayes model for classification
* @author Administrator
*
*/
public class BayesClassifyJob extends AbstractJob {
/**
* @param args
* @throws Exception
*/
public static void main(String[] args) throws Exception {
ToolRunner.run(new Configuration(), new BayesClassifyJob(),args);
}

@Override
public int run(String[] args) throws Exception {
addInputOption();
addOutputOption();
addOption("model","m", "The file where bayesian model store ");
addOption("labelNumber","ln", "The labels number ");
if (parseArguments(args) == null) {
return -1;
}
Path input = getInputPath();
Path output = getOutputPath();
String labelNumber=getOption("labelNumber");
String modelPath=getOption("model");
Configuration conf=getConf();
conf.set(WeightsMapper.class.getName() + ".numLabels",labelNumber);
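// ship the serialized model to every mapper through the distributed cache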
HadoopUtil.cacheFiles(new Path(modelPath), conf);
HadoopUtil.delete(conf, output);
Job job=new Job(conf);
job.setJobName("Use bayesian model to classify the input:"+input.getName());
job.setJarByClass(BayesClassifyJob.class);
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setMapperClass(BayesClassifyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(VectorWritable.class);
job.setNumReduceTasks(0);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(VectorWritable.class);
SequenceFileInputFormat.setInputPaths(job, input);
SequenceFileOutputFormat.setOutputPath(job, output);
if(job.waitForCompletion(true)){
return 0;
}
return -1;
}
/**
 * Custom Mapper; only the input-parsing part of the code was modified.
 * @author Administrator
 *
 */
public static class BayesClassifyMapper extends Mapper<Text, VectorWritable, Text, VectorWritable>{
private AbstractNaiveBayesClassifier classifier;
@Override
public void setup(Context context) throws IOException, InterruptedException {
System.out.println("Setup");
Configuration conf = context.getConfiguration();
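// fetch the model location that run() registered via HadoopUtil.cacheFiles()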
Path modelPath = HadoopUtil.cachedFile(conf);
NaiveBayesModel model = NaiveBayesModel.materialize(modelPath, conf);
classifier = new StandardNaiveBayesClassifier(model);
}

@Override
public void map(Text key, VectorWritable value, Context context) throws IOException, InterruptedException {
Vector result = classifier.classifyFull(value.get());
//the key is the expected value
context.write(new Text(key.toString()), new VectorWritable(result));
}
}
}
To run this step standalone, refer to the following usage:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
  -archives <paths>             comma separated archives to be unarchived on the compute machines.
  -conf <configuration file>    specify an application configuration file
  -D <property=value>           use value for given property
  -files <paths>                comma separated files to be copied to the map reduce cluster
  -fs <local|namenode:port>     specify a namenode
  -jt <local|jobtracker:port>   specify a job tracker
  -libjars <paths>              comma separated jar files to include in the classpath.
  -tokenCacheFile <tokensFile>  name of the file with the tokens
Job-Specific Options:
  --input (-i) input                Path to job input directory.
  --output (-o) output              The directory pathname for output.
  --model (-m) model                The file where the bayesian model is stored
  --labelNumber (-ln) labelNumber   The number of labels
  --help (-h)                       Print out help
  --tempDir tempDir                 Intermediate output directory
  --startPhase startPhase           First phase to run
  --endPhase endPhase               Last phase to run
Only two extra parameters are needed: the model path and the number of labels; a sample invocation is shown below.
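As before, a hypothetical standalone invocation (the jar name and the label count are assumptions):

$HADOOP_HOME/bin/hadoop jar mahout.jar mahout.fansy.bayes.BayesClassifyJob -i hdfs_input_path -o hdfs_output_path -m hdfs_model_path -ln 4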
7. Evaluate the classification results from step 6. The code for this part is as follows:
package mahout.fansy.bayes;

import java.io.IOException;
import java.util.Map;

import mahout.fansy.bayes.util.OperateArgs;

import org.apache.commons.cli.ParseException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.ResultAnalyzer;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class AnalyzeBayesModel extends OperateArgs{
/**
 * The input is the output of BayesClassifyJob;
 * the -o argument is unused.
 */
private static final Logger log = LoggerFactory.getLogger(AnalyzeBayesModel.class);
public static void main(String[] args) throws IOException, ParseException {
String[] arg={"-jt","ubuntu:9001",
"-i","hdfs://ubuntu:9000/user/mahout/output_bayes/classifyJob",
"-o","",
"-li","hdfs://ubuntu:9000/user/mahout/output_bayes/index.bin"
};
new AnalyzeBayesModel().run(arg);
}
/**
* Compare the BayesClassifyJob output files against the labelIndex and compute the classification accuracy.
* @param args
* @throws IOException
* @throws ParseException
*/
public int run(String[] args) throws IOException, ParseException{
// labelIndex
setOption("li","labelIndex",true,"the path where the labelIndex file is stored",true);
if(!parseArgs(args)){
return -1;
}
Configuration conf=getConf();
String labelIndex=getNameValue("labelIndex");
String input=getInput();
Path inputPath=new Path(input);
//load the labels
Map<Integer, String> labelMap = BayesUtils.readLabelIndex(getConf(), new Path(labelIndex));
// loop over the results and create the confusion matrix
SequenceFileDirIterable<Text, VectorWritable> dirIterable =
new SequenceFileDirIterable<Text, VectorWritable>(inputPath,
PathType.LIST,
PathFilters.partFilter(),
conf);
ResultAnalyzer analyzer = new ResultAnalyzer(labelMap.values(), "DEFAULT");
analyzeResults(labelMap, dirIterable, analyzer);
log.info("{} Results: {}", "Standard NB", analyzer);
return 0;
}
/**
* Taken from the analyzeResults method of TestNaiveBayesDriver.
*/
private void analyzeResults(Map<Integer, String> labelMap,
SequenceFileDirIterable<Text, VectorWritable> dirIterable,
ResultAnalyzer analyzer) {
for (Pair<Text, VectorWritable> pair : dirIterable) {
int bestIdx = Integer.MIN_VALUE;
double bestScore = Long.MIN_VALUE;
for (Vector.Element element : pair.getSecond().get()) {
if (element.get() > bestScore) {
bestScore = element.get();
bestIdx = element.index();
}
}
if (bestIdx != Integer.MIN_VALUE) {
ClassifierResult classifierResult = new ClassifierResult(labelMap.get(bestIdx), bestScore);
analyzer.addInstance(pair.getFirst().toString(), classifierResult);
}
}
}
}
Running the model on the data from extension post 1 yields the following classification results:
13/09/14 14:52:13 INFO bayes.AnalyzeBayesModel: Standard NB Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances   :   7    70%
Incorrectly Classified Instances :   3    30%
Total Classified Instances       :  10
=======================================================
Confusion Matrix
-------------------------------------------------------
a   b   c   d   <--Classified as
3   0   0   0   |  3   a = 1
0   1   0   1   |  2   b = 2
1   1   2   0   |  4   c = 3
0   0   0   1   |  1   d = 4
After the run, the corresponding output folders can be seen on HDFS, and the submitted jobs appear in the task list (screenshots not reproduced here).
Share, grow, be happy.
Please credit the original blog when reposting: http://blog.csdn.net/fansy1990