MapReduce -- TF-IDF

Computing TF-IDF values with MapReduce.

Input data, one article per line: article ID, a tab, then the article content

    今天约了姐妹去逛街吃美食，周末玩得很开心啊！

......

......

Result data:

    开心:0.28558719539400335    吃:0.21277211221173534    了:0.1159152517783012    美食:0.29174432675350614    去:0.18044286652763497    玩:0.27205714412756765    啊:0.26272169358877784    姐妹:0.3983823545319593    逛街:0.33320559604063593    得很:0.45170136842118586    周末:0.2672478858982343    今天:0.16923426566752778    约:0.0946874743049455

......

......

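The whole process is completed in two steps.

Step 1 mainly produces three kinds of output files.

1. Use a word segmentation tool to split each article's content into terms, and record the total number of terms in the article. For how to use the segmentation tool, see the TF-IDF post.

Result after this processing: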

今天_3823890378201539    A:,B:,

周末_3823890378201539    A:,B:,

得很_3823890378201539    A:,B:,

约_3823890378201539    B:,A:,

......

2. Record how many articles each term appears in

Result after processing:

今天

周末

约

......

3. Record the total number of articles

Result after processing:

counter
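
The three kinds of records above can be sketched outside Hadoop as follows. This is only an illustration under stated assumptions: the class name `Step1RecordsSketch` and the method `emit` are hypothetical, and a pre-tokenized list stands in for the output of the segmentation tool. It builds the same key/value pairs that the Step 1 mapper writes for one article, with key and value joined by a tab.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original job.
class Step1RecordsSketch {

    // Builds the records emitted for one article from an already segmented token list.
    static List<String> emit(String articleId, List<String> tokens) {
        // Count occurrences of each term; LinkedHashMap keeps first-seen order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }

        List<String> records = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String wordAndId = e.getKey() + "_" + articleId;
            records.add(wordAndId + "\tA:" + e.getValue());  // occurrences of the term in this article
            records.add(wordAndId + "\tB:" + tokens.size()); // total number of terms in this article
            records.add(e.getKey() + "\t1");                 // the term appears in one more article
        }
        records.add("counter\t1");                           // one more article overall
        return records;
    }

    public static void main(String[] args) {
        // Made-up article ID and tokens, for illustration only.
        for (String record : emit("id1", Arrays.asList("a", "b", "a"))) {
            System.out.println(record);
        }
    }
}
```

The reducer then concatenates the `A:` and `B:` values per `word_articleId` key and sums the `1` values for the plain term keys and for `counter`.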

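Step 2 loads files 2 and 3 into the distributed cache, then runs a second MapReduce job over file 1 that uses the cached contents to compute the TF-IDF values.

Loading into the cache works when the data is not very large; if the data is too large, this approach should not be used.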
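
The score computed in the second step can be sketched in isolation. This is a minimal illustration, assuming the same formula that Step2.java uses: term frequency is the term's count divided by the article's total term count, and inverse document frequency is the natural logarithm of the total number of articles divided by the number of articles containing the term. The class name `TfIdfSketch` and the numbers in `main` are hypothetical.

```java
// Hypothetical helper, not part of the original job.
class TfIdfSketch {

    /**
     * @param termCount  occurrences of the term in the article (the "A:" value)
     * @param totalTerms total number of terms in the article (the "B:" value)
     * @param totalDocs  total number of articles (the "counter" value)
     * @param docFreq    number of articles that contain the term (file 2)
     */
    static double tfIdf(double termCount, double totalTerms, double totalDocs, double docFreq) {
        double tf = termCount / totalTerms;
        double idf = Math.log(totalDocs / docFreq); // natural logarithm
        return tf * idf;
    }

    public static void main(String[] args) {
        // Made-up numbers: the term occurs twice in a 10-term article
        // and appears in 10 of 100 articles.
        System.out.println(tfIdf(2, 10, 100, 10));
    }
}
```

A term that appears in every article gets an IDF of zero under this formula, so its score is zero regardless of how often it occurs.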

Source code

Step1.java:

 import org.apache.hadoop.conf.Configuration;

 import org.apache.hadoop.fs.FileSystem;

 import org.apache.hadoop.fs.Path;

 import org.apache.hadoop.io.LongWritable;

 import org.apache.hadoop.io.Text;

 import org.apache.hadoop.mapreduce.Job;

 import org.apache.hadoop.mapreduce.Mapper;

 import org.apache.hadoop.mapreduce.Reducer;

 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 import org.wltea.analyzer.core.IKSegmenter;

 import org.wltea.analyzer.core.Lexeme;

 import java.io.IOException;

 import java.io.StringReader;

 import java.util.HashMap;

 import java.util.Map;

 import java.util.Map.Entry;

 /**

  * Created by Edward on 2016/7/21.

  */

 public class Step1 {

     public static void main(String[] args)

     {

         //access hdfs's user

         //System.setProperty("HADOOP_USER_NAME","root");

         Configuration conf = new Configuration();

         conf.set("fs.defaultFS", "hdfs://node1:8020");

         try {

             FileSystem fs = FileSystem.get(conf);

             Job job = Job.getInstance(conf);

             job.setJarByClass(Step1.class);

             job.setMapperClass(MyMapper.class);

             job.setReducerClass(MyReducer.class);

             job.setPartitionerClass(FilterPartition.class);

             // output key and value types (also used for the map output, since both are Text)

             job.setOutputKeyClass(Text.class);

             job.setOutputValueClass(Text.class);

             // number of reduce tasks (FilterPartition needs at least 3)

             job.setNumReduceTasks(4);

             FileInputFormat.addInputPath(job, new Path("/test/tfidf/input"));

             Path path = new Path("/test/tfidf/output");

             if(fs.exists(path))// delete the output directory if it already exists

             {

                 fs.delete(path,true);

             }

             FileOutputFormat.setOutputPath(job, path);

             boolean b = job.waitForCompletion(true);

             if(b)

             {

                 System.out.println("OK");

             }

         } catch (Exception e) {

             e.printStackTrace();

         }

     }

     public static class MyMapper extends Mapper<LongWritable, Text, Text, Text > {

         @Override

         protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

             Map<String, Integer> map = new HashMap<String, Integer>();

             String[] str = value.toString().split("\t");

             StringReader stringReader = new StringReader(str[1]);

             IKSegmenter ikSegmenter = new IKSegmenter(stringReader, true);

             Lexeme lexeme = null;

             long count = 0L;

             while((lexeme = ikSegmenter.next())!=null) {

                 String word = lexeme.getLexemeText();

                 if(map.containsKey(word)) {

                     map.put(word, map.get(word)+1);

                 }

                 else{

                     map.put(word, 1);

                 }

                 count++;

             }

             for(Entry<String, Integer> entry: map.entrySet())

             {

                 context.write(new Text(entry.getKey()+"_"+str[0]), new Text("A:"+entry.getValue()));// tf: occurrences of the term in this article

                 context.write(new Text(entry.getKey()+"_"+str[0]), new Text("B:"+count));// total number of terms in this article

                 context.write(new Text(entry.getKey()),new Text("1"));// the term appears in this article: +1, used to count how many articles contain it

             }

             context.write(new Text("counter"), new Text(1+""));// article counter

         }

     }

     public static class MyReducer extends Reducer<Text, Text, Text, Text> {

         @Override

         protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {

             // compute the total number of articles

             if(key.toString().equals("counter")) {

                 long sum = 0l;

                 for(Text v :values)

                 {

                     sum += Long.parseLong(v.toString());

                 }

                 context.write(key, new Text(sum+""));

             }

             else{

                 if(key.toString().contains("_")) {

                     StringBuilder stringBuilder = new StringBuilder();

                     for (Text v : values) {

                         stringBuilder.append(v.toString());

                         stringBuilder.append(",");

                     }

                     context.write(key, new Text(stringBuilder.toString()));

                 }

                 else {// count how many articles the term appears in

                     long sum = 0l;

                     for(Text v :values)

                     {

                         sum += Long.parseLong(v.toString());

                     }

                     context.write(key, new Text(sum+""));

                 }

             }

         }

     }

 }

FilterPartition.java

 import org.apache.hadoop.io.Text;

 import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

 /**

  * Created by Edward on 2016/7/22.

  */

 public class FilterPartition extends HashPartitioner<Text, Text> {

     @Override

     public int getPartition(Text key, Text value, int numReduceTasks) {

         if(key.toString().equals("counter"))

         {

             return numReduceTasks-1;

         }

         if(key.toString().contains("_"))

         {

             return super.getPartition(key, value, numReduceTasks-2);

         }

         else

         {

             return numReduceTasks-2;

         }

     }

 }
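
The routing performed by the partitioner can be checked without a cluster. The sketch below is an illustration only: `PartitionRoutingSketch` and `route` are hypothetical names, `String.hashCode()` stands in for the hash that Hadoop's `HashPartitioner` takes from the `Text` key, and the `counter` key is matched exactly. With 4 reduce tasks, `word_articleId` keys go to partitions 0 and 1, plain term keys go to partition 2, and `counter` goes to partition 3, which is why Step 2 reads `part-r-00002` and `part-r-00003` from the cache.

```java
// Hypothetical stand-in for the partitioner's routing logic.
class PartitionRoutingSketch {

    static int route(String key, int numReduceTasks) {
        if (key.equals("counter")) {
            return numReduceTasks - 1; // last partition: total article count
        }
        if (key.contains("_")) {
            // word_articleId records are spread over the first numReduceTasks-2 partitions
            return (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 2);
        }
        return numReduceTasks - 2;     // second-to-last partition: document frequencies
    }

    public static void main(String[] args) {
        // Made-up keys, for illustration only.
        System.out.println(route("counter", 4));
        System.out.println(route("word", 4));
        System.out.println(route("word_123", 4));
    }
}
```

This layout requires at least 3 reduce tasks; with fewer, `numReduceTasks - 2` is zero or negative and the modulo step fails.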

Step2.java

 import org.apache.hadoop.conf.Configuration;

 import org.apache.hadoop.fs.FileSystem;

 import org.apache.hadoop.fs.Path;

 import org.apache.hadoop.io.LongWritable;

 import org.apache.hadoop.io.Text;

 import org.apache.hadoop.mapreduce.Job;

 import org.apache.hadoop.mapreduce.Mapper;

 import org.apache.hadoop.mapreduce.Reducer;

 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 import java.io.BufferedReader;

 import java.io.FileReader;

 import java.io.IOException;

 import java.net.URI;

 import java.util.HashMap;

 import java.util.Map;

 /**

  * Created by Edward on 2016/7/22.

  */

 public class Step2 {

     public static void main(String[] args)

     {

         //access hdfs's user

         //System.setProperty("HADOOP_USER_NAME","root");

         Configuration conf = new Configuration();

         conf.set("fs.defaultFS", "hdfs://node1:8020");

         try {

             FileSystem fs = FileSystem.get(conf);

             Job job = Job.getInstance(conf);

             job.setJarByClass(Step2.class);

             job.setMapperClass(MyMapper.class);

             job.setReducerClass(MyReducer.class);

             // output key and value types (also used for the map output, since both are Text)

             job.setOutputKeyClass(Text.class);

             job.setOutputValueClass(Text.class);

             // distributed cache: every worker node can read these files

             // number of articles each term appears in

             job.addCacheFile(new Path("/test/tfidf/output/part-r-00002").toUri());

             // total number of articles

             job.addCacheFile(new Path("/test/tfidf/output/part-r-00003").toUri());

             FileInputFormat.addInputPath(job, new Path("/test/tfidf/output"));

             Path path = new Path("/test/tfidf/output1");

             if(fs.exists(path))// delete the output directory if it already exists

             {

                 fs.delete(path,true);

             }

             FileOutputFormat.setOutputPath(job, path);

             boolean b = job.waitForCompletion(true);

             if(b)

             {

                 System.out.println("OK");

             }

         } catch (Exception e) {

             e.printStackTrace();

         }

     }

     public static class MyMapper extends Mapper<LongWritable, Text, Text, Text > {

         public static Map<String, Double> dfmap = new HashMap<String, Double>();

         public static Map<String, Double> totalmap = new HashMap<String, Double>();

         @Override

         protected void setup(Context context) throws IOException, InterruptedException {

             URI[] cacheFiles = context.getCacheFiles();

             Path pArtNum = new Path(cacheFiles[0].getPath());

             Path pArtTotal = new Path(cacheFiles[1].getPath());

             // load the number of articles each term appears in

             try (BufferedReader buffer = new BufferedReader(new FileReader(pArtNum.getName()))) {

                 String line;

                 while((line = buffer.readLine()) != null){

                     String[] str = line.split("\t");

                     dfmap.put(str[0], Double.parseDouble(str[1]));

                 }

             }

             // load the total number of articles

             try (BufferedReader buffer = new BufferedReader(new FileReader(pArtTotal.getName()))) {

                 String line;

                 while((line = buffer.readLine()) != null){

                     String[] str = line.split("\t");

                     totalmap.put(str[0], Double.parseDouble(str[1]));

                 }

             }

         }

         @Override

         protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

             String[] strings = value.toString().split("\t");

             String k = strings[0];

             if(k.equals("counter")) {

                 // skip the total article count record

             }

             else if(k.contains("_")){

                 String word = k.split("_")[0];

                 String[] info = strings[1].split(",");

                 String n=null;

                 String num=null;

                 if(info[0].contains("A")){

                     n = info[0].substring(info[0].indexOf(":")+1);

                     num = info[1].substring(info[1].indexOf(":")+1);

                 }

                 if(info[0].contains("B")){

                     num = info[0].substring(info[0].indexOf(":")+1);

                     n = info[1].substring(info[1].indexOf(":")+1);

                 }

                 double result = 0.0;

                 result = (Double.parseDouble(n)/Double.parseDouble(num)) * Math.log( totalmap.get("counter")/dfmap.get(word));

                 System.out.println("n=" + Double.parseDouble(n));

                 System.out.println("num=" + Double.parseDouble(num));

                 System.out.println("counter=" + totalmap.get("counter"));

                 System.out.println("wordnum=" + dfmap.get(word));

                 context.write(new Text(k.split("_")[1]), new Text(word+":"+result));

             }

             else{

                 // skip the records for how many articles each term appears in

             }

         }

     }

     public static class MyReducer extends Reducer<Text, Text, Text, Text> {

         @Override

         protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {

             StringBuilder stringBuilder = new StringBuilder();

             for(Text t: values){

                 stringBuilder.append(t.toString());

                 stringBuilder.append("\t");

             }

             context.write(key, new Text(stringBuilder.toString()) );

         }

     }

 }
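
The Step 2 mapper has to handle both `A:…,B:…,` and `B:…,A:…,` because the reducer in Step 1 does not guarantee the order of the values. One way to make that parsing independent of position is to dispatch on the field prefix. The sketch below is a hypothetical alternative, not the code used in the job: `Step1ValueParser` and `parse` are made-up names, and the sample values are made up as well.

```java
// Hypothetical helper, not part of the original job.
class Step1ValueParser {

    // Returns {termCount, totalTerms} from a value such as "A:1,B:13," or "B:13,A:1,".
    static long[] parse(String value) {
        long termCount = -1;
        long totalTerms = -1;
        // split(",") drops the trailing empty field produced by the final comma
        for (String field : value.split(",")) {
            if (field.startsWith("A:")) {
                termCount = Long.parseLong(field.substring(2));
            } else if (field.startsWith("B:")) {
                totalTerms = Long.parseLong(field.substring(2));
            }
        }
        if (termCount < 0 || totalTerms < 0) {
            throw new IllegalArgumentException("missing A: or B: field: " + value);
        }
        return new long[] {termCount, totalTerms};
    }

    public static void main(String[] args) {
        long[] parsed = parse("B:13,A:1,");
        System.out.println(parsed[0] + " / " + parsed[1]);
    }
}
```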
