Implementing the Apriori Algorithm with MapReduce

An implementation of the Apriori algorithm on Hadoop MapReduce.

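Input format:

Each line is one basket (transaction): a whitespace-separated list of integer item IDs. The subset check in the code walks both lists in order, so the IDs on each line must be sorted in ascending order, as in this sample: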

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74

1 3 5 7 9 12 13 15 17 19 21 23 25 27 29 31 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74

1 3 5 7 9 12 13 16 17 19 21 23 25 27 29 31 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74

1 3 5 7 9 11 13 15 17 20 21 23 25 27 29 31 34 36 38 40 42 44 47 48 50 52 54 56 58 60 62 64 66 68 70 72 74

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 34 36 38 40 42 44 46 48 51 52 54 56 58 60 62 64 66 68 70 72 74

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 34 36 38 40 42 44 46 48 51 52 54 56 58 60 63 64 66 68 70 72 74

1 3 5 7 9 11 13 15 17 20 21 23 25 27 29 31 34 36 38 40 42 44 47 48 51 52 54 56 58 60 62 64 66 68 70 72 74

1 3 5 7 9 12 13 15 17 19 21 24 25 27 29 31 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74

1 3 5 7 9 11 13 15 17 19 21 24 25 27 29 31 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 65 66 68 70 72 74

1 3 5 7 9 11 13 16 17 19 21 24 25 27 29 31 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74

1 3 5 7 9 12 13 16 17 19 21 24 25 27 29 31 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74

1 3 5 7 9 11 13 15 17 20 21 24 25 27 29 31 34 36 38 40 42 44 47 48 50 52 54 56 58 60 62 64 66 68 70 72 74

1 3 5 7 9 11 13 15 17 20 21 24 25 27 29 31 34 36 38 40 42 44 47 48 50 52 54 56 58 60 62 65 66 68 70 72 74

1 3 5 7 9 11 13 15 17 20 21 24 25 27 29 31 34 36 38 40 43 44 47 48 50 52 54 56 58 60 62 65 66 68 70 72 74
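To make the input format concrete, here is a minimal single-machine sketch of what the first pass computes: it splits each basket line on whitespace, counts every item ID, and keeps the items whose count is strictly greater than the threshold. The class name `Pass1Sketch` and the method `frequentItems` are hypothetical and are not part of the Hadoop code in this post; the sketch only mirrors the logic of the pass 1 mapper and the shared reducer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the original job: a local stand-in for pass 1.
class Pass1Sketch {

    // Counts how many baskets mention each item ID and keeps only the
    // items whose count is strictly greater than minSup.
    static Map<String, Integer> frequentItems(List<String> baskets, int minSup) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String basket : baskets) {
            for (String id : basket.trim().split("\\s+")) {
                if (id.isEmpty()) {
                    continue; // ignore blank lines
                }
                counts.merge(id, 1, Integer::sum);
            }
        }
        // Same filter as the reducer: drop everything that is not above the threshold.
        counts.values().removeIf(c -> c <= minSup);
        return counts;
    }

    public static void main(String[] args) {
        List<String> baskets = Arrays.asList("1 3 5", "1 3 6", "1 4 5");
        // prints {1=3, 3=2, 5=2}
        System.out.println(frequentItems(baskets, 1));
    }
}
```

In the MapReduce version the counting is split in two: the mapper emits (item, 1) pairs and the reducer performs the sum and the threshold filter.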

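Output format:

<item1,item2,...itemK, frequency>

Each output line holds one frequent itemset as a comma-separated list of item IDs, followed by its support count. Later passes read this output back to build their candidate itemsets.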
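As an illustration, the sketch below parses one such output line back into an itemset and its count, in the same way the later passes split the line on whitespace and then on commas. It assumes Hadoop's default text output, which puts a tab between key and value; the class `OutputLineSketch` and its methods are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original job.
class OutputLineSketch {

    // Returns the item IDs from a line such as "1,3,5<TAB>12".
    static List<Integer> parseItemset(String line) {
        String itemsStr = line.trim().split("\\s+")[0];
        List<Integer> itemset = new ArrayList<>();
        for (String itemStr : itemsStr.split(",")) {
            itemset.add(Integer.parseInt(itemStr));
        }
        return itemset;
    }

    // Returns the support count, i.e. the second whitespace-separated field.
    static int parseCount(String line) {
        return Integer.parseInt(line.trim().split("\\s+")[1]);
    }

    public static void main(String[] args) {
        String line = "1,3,5\t12";
        System.out.println(parseItemset(line)); // prints [1, 3, 5]
        System.out.println(parseCount(line));   // prints 12
    }
}
```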

Code:

 package apriori;

 import java.io.IOException;

 import java.util.Iterator;

 import java.util.StringTokenizer;

 import java.util.List;

 import java.util.ArrayList;

 import java.util.Collections;

 import java.util.Map;

 import java.util.HashMap;

 import java.io.*;

 import org.apache.hadoop.conf.Configuration;

 import org.apache.hadoop.conf.Configured;

 import org.apache.hadoop.fs.Path;

 import org.apache.hadoop.fs.FileSystem;

 import org.apache.hadoop.io.Text;

 import org.apache.hadoop.io.IntWritable;

 import org.apache.hadoop.mapreduce.Job;

 import org.apache.hadoop.mapreduce.Mapper;

 import org.apache.hadoop.mapreduce.Mapper.Context;

 import org.apache.hadoop.mapreduce.Reducer;

 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

 import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;

 import org.apache.hadoop.util.Tool;

 import org.apache.hadoop.util.ToolRunner;

 class AprioriPass1Mapper extends Mapper<Object,Text,Text,IntWritable>{

     private final static IntWritable one = new IntWritable(1);

     private Text number = new Text();

     //The first-pass mapper simply emits (item, 1) for every item in the basket

     public void map(Object key,Text value,Context context) throws IOException,InterruptedException{

         String[] ids = value.toString().split("[\\s\\t]+");

         for(int i = 0;i < ids.length;i++){

             context.write(new Text(ids[i]),one);

         }

     }

 }

 class AprioriReducer extends Reducer<Text,IntWritable,Text,IntWritable>{

     private IntWritable result = new IntWritable();

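     //All passes share this reducer: it sums the count of each itemset and keeps only those whose count is strictly greater than the threshold s (minSup)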

     public void reduce(Text key,Iterable<IntWritable> values,Context context) throws IOException,InterruptedException{

         int sum = 0;

         int minSup = context.getConfiguration().getInt("minSup",5);

         for(IntWritable val : values){

             sum += val.get();

         }

         result.set(sum);

         if(sum > minSup){

             context.write(key,result);

         }

     }

 }

 class AprioriPassKMapper extends Mapper<Object,Text,Text,IntWritable>{

     private final static IntWritable one = new IntWritable(1);

     private Text item = new Text();

     private List< List<Integer> > prevItemsets = new ArrayList< List<Integer> >();

     private List< List<Integer> > candidateItemsets = new ArrayList< List<Integer> >();

     private Map<String,Boolean> candidateItemsetsMap = new HashMap<String,Boolean>();

     //Passes after the first use this mapper. Before map() runs, setup() builds the candidate itemsets from the output of pass k-1, following the Apriori algorithm

     @Override

     public void setup(Context context) throws IOException, InterruptedException{

         int passNum = context.getConfiguration().getInt("passNum",2);

         String prefix = context.getConfiguration().get("hdfsOutputDirPrefix","");

         String lastPass1 = context.getConfiguration().get("fs.default.name") + "/user/hadoop/chess-" + (passNum - 1) + "/part-r-00000";

         String lastPass = context.getConfiguration().get("fs.default.name") + prefix + (passNum - 1) + "/part-r-00000";

         //try-with-resources closes the reader once the previous pass's output has been read

         try(BufferedReader fis = new BufferedReader(new InputStreamReader(FileSystem.get(context.getConfiguration()).open(new Path(lastPass))))){

             String line = null;

             while((line = fis.readLine()) != null){

                 List<Integer> itemset = new ArrayList<Integer>();

                 String itemsStr = line.split("[\\s\\t]+")[0];

                 for(String itemStr : itemsStr.split(",")){

                     itemset.add(Integer.parseInt(itemStr));

                 }

                 prevItemsets.add(itemset);

             }

         }catch (Exception e){

             e.printStackTrace();

         }

         //get candidate itemsets from the prev itemsets

         candidateItemsets = getCandidateItemsets(prevItemsets,passNum - 1);

     }

     public void map(Object key,Text value,Context context) throws IOException,InterruptedException{

         String[] ids = value.toString().split("[\\s\\t]+");

         List<Integer> itemset = new ArrayList<Integer>();

         for(String id : ids){

             itemset.add(Integer.parseInt(id));

         }

         //Iterate over all candidate itemsets

         for(List<Integer> candidateItemset : candidateItemsets){

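             //If the input line contains this candidate itemset, emit 1 so that the number of baskets containing each candidate gets counted

             //This subset check consumes most of the running time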

             if(contains(candidateItemset,itemset)){

                 String outputKey = "";

                 for(int i = 0;i < candidateItemset.size();i++){

                     outputKey += candidateItemset.get(i) + ",";

                 }

                 outputKey = outputKey.substring(0,outputKey.length() - 1);

                 context.write(new Text(outputKey),one);

             }

         }

     }

     //Returns whether items is a subset of allItems; both lists must be sorted in ascending order

     private boolean contains(List<Integer> items,List<Integer> allItems){

         int i = 0;

         int j = 0;

         while(i < items.size() && j < allItems.size()){

             if(allItems.get(j) > items.get(i)){

                 return false;

             }else if(allItems.get(j).equals(items.get(i))){

                 j++;

                 i++;

             }else{

                 j++;

             }

         }

         if(i != items.size()){

             return false;

         }

         return true;

     }

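     //Generates all candidate itemsets, following the Apriori algorithm; here passNum is the size of the itemsets in prevItemsets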

     private List< List<Integer> > getCandidateItemsets(List< List<Integer> > prevItemsets, int passNum){

         List< List<Integer> > candidateItemsets = new ArrayList<List<Integer> >();

         //Pick two itemsets from the previous pass's output and join them into a candidate that is one item larger

         for(int i = 0;i < prevItemsets.size();i++){

             for(int j = i + 1;j < prevItemsets.size();j++){

                 List<Integer> outerItems = prevItemsets.get(i);

                 List<Integer> innerItems = prevItemsets.get(j);

                 List<Integer> newItems = null;

                 if(passNum == 1){

                     newItems = new ArrayList<Integer>();

                     newItems.add(outerItems.get(0));

                     newItems.add(innerItems.get(0));

                 }

                 else{

                     int nDifferent = 0;

                     int index = -1;

                     for(int k = 0; k < passNum && nDifferent < 2;k++){

                         if(!innerItems.contains(outerItems.get(k))){

                             nDifferent++;

                             index = k;

                         }

                     }

                     if(nDifferent == 1){

                         //System.out.println("inner " + innerItems + " outer : " + outerItems);

                         newItems = new ArrayList<Integer>();

                         newItems.addAll(innerItems);

                         newItems.add(outerItems.get(index));

                     }

                 }

                 if(newItems == null){continue;}

                 Collections.sort(newItems);

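                 //Every subset of a candidate obtained by removing one item must appear in the previous pass's output; isCandidate checks this, and candidates that pass are added to the candidate list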

                 if(isCandidate(newItems,prevItemsets) && !candidateItemsets.contains(newItems)){

                     candidateItemsets.add(newItems);

                     //System.out.println(newItems);

                 }

             }

         }

         return candidateItemsets;

     }

     private boolean isCandidate(List<Integer> newItems,List< List<Integer> > prevItemsets){

         List<List<Integer>> subsets = getSubsets(newItems);     

         for(List<Integer> subset : subsets){

             if(!prevItemsets.contains(subset)){

                 return false;

             }

         }

         return true;

     }

     private List<List<Integer>> getSubsets(List<Integer> items){

         List<List<Integer>> subsets = new ArrayList<List<Integer>>();

         for(int i = 0;i < items.size();i++){

             List<Integer> subset = new ArrayList<Integer>(items);

             subset.remove(i);

             subsets.add(subset);

         }

         return subsets;

     }

 }

 public class Apriori extends Configured implements Tool{

     public static int s;

     public static int k;

     public int run(String[] args)throws IOException,InterruptedException,ClassNotFoundException{

         long startTime = System.currentTimeMillis();

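         String hdfsInputDir = args[0];        //argument 1: input data directory

         String hdfsOutputDirPrefix = args[1];    //argument 2: output prefix; the pass number is appended to form each pass's output directory

         s = Integer.parseInt(args[2]);        //support threshold

         k = Integer.parseInt(args[3]);        //number of passes

         //Run k passes in a loop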

         for(int pass = 1; pass <= k;pass++){

             long passStartTime = System.currentTimeMillis();

             //Configure and run the job for this pass

             if(!runPassKMRJob(hdfsInputDir,hdfsOutputDirPrefix,pass)){

                 return -1;

             }

             long passEndTime = System.currentTimeMillis();

             System.out.println("pass " + pass + " time : " + (passEndTime - passStartTime));

         }

         long endTime = System.currentTimeMillis();

         System.out.println("total time : " + (endTime - startTime));

         return 0;

     }

     private static boolean runPassKMRJob(String hdfsInputDir,String hdfsOutputDirPrefix,int passNum)

             throws IOException,InterruptedException,ClassNotFoundException{

             Configuration passNumMRConf = new Configuration();

             passNumMRConf.setInt("passNum",passNum);

             passNumMRConf.set("hdfsOutputDirPrefix",hdfsOutputDirPrefix);

             passNumMRConf.setInt("minSup",s);

             Job passNumMRJob = new Job(passNumMRConf,"" + passNum);

             passNumMRJob.setJarByClass(Apriori.class);

             if(passNum == 1){

                 //The first pass uses its own mapper, which does not need to build candidate itemsets

                 passNumMRJob.setMapperClass(AprioriPass1Mapper.class);

             }

             else{

                 //Passes after the first use a mapper that builds candidate itemsets from the previous pass's output

                 passNumMRJob.setMapperClass(AprioriPassKMapper.class);

             }

             passNumMRJob.setReducerClass(AprioriReducer.class);

             passNumMRJob.setOutputKeyClass(Text.class);

             passNumMRJob.setOutputValueClass(IntWritable.class);

             FileInputFormat.addInputPath(passNumMRJob,new Path(hdfsInputDir));

             FileOutputFormat.setOutputPath(passNumMRJob,new Path(hdfsOutputDirPrefix + passNum));

             return passNumMRJob.waitForCompletion(true);

     }

     public static void main(String[] args) throws Exception{

         int exitCode = ToolRunner.run(new Apriori(),args);

         System.exit(exitCode);

     }

 }
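The candidate generation step is the core of each later pass, so it may help to see it in isolation. The sketch below is a plain-Java illustration, not the code used by the job above: it swaps in the textbook prefix-based join (two sorted itemsets of size k-1 are merged when their first k-2 items agree) instead of the pairwise "differ in exactly one item" join in `getCandidateItemsets`, and it uses a `HashSet` for the prune lookup. The class name `CandidateSketch` and its methods are hypothetical, and the sketch assumes every itemset in the input is sorted, distinct, and of the same size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the original job.
class CandidateSketch {

    // Join step: merge two sorted (k-1)-itemsets that share their first k-2 items.
    // Prune step: keep the result only if all of its (k-1)-subsets are frequent.
    static List<List<Integer>> generateCandidates(List<List<Integer>> prev) {
        Set<List<Integer>> frequent = new HashSet<>(prev);
        List<List<Integer>> candidates = new ArrayList<>();
        for (int i = 0; i < prev.size(); i++) {
            for (int j = i + 1; j < prev.size(); j++) {
                List<Integer> a = prev.get(i);
                List<Integer> b = prev.get(j);
                int last = a.size() - 1;
                if (a.equals(b) || !a.subList(0, last).equals(b.subList(0, last))) {
                    continue; // duplicates or different prefixes cannot be joined
                }
                List<Integer> joined = new ArrayList<>(a);
                joined.add(b.get(last));
                Collections.sort(joined);
                if (allSubsetsFrequent(joined, frequent)) {
                    candidates.add(joined);
                }
            }
        }
        return candidates;
    }

    // Checks every subset obtained by dropping exactly one item.
    static boolean allSubsetsFrequent(List<Integer> itemset, Set<List<Integer>> frequent) {
        for (int i = 0; i < itemset.size(); i++) {
            List<Integer> subset = new ArrayList<>(itemset);
            subset.remove(i); // int argument, so this removes by index
            if (!frequent.contains(subset)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<List<Integer>> prev = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(1, 3),
                Arrays.asList(2, 3), Arrays.asList(2, 4));
        // [2, 3, 4] is pruned because [3, 4] is not frequent.
        System.out.println(generateCandidates(prev)); // prints [[1, 2, 3]]
    }
}
```

With the prefix join each candidate is produced by exactly one pair, so no duplicate check against the candidate list is needed, and the set lookup avoids the linear scans of `List.contains`.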
