Text Mining: Text Clustering (MapReduce)
刘勇 Email: lyssym@sina.com
Introduction
When a large collection of texts is processed in a single thread, the processing itself takes a long time, the I/O on that much data takes a long time as well, and the memory footprint is heavy. This article therefore adopts the MapReduce computing model and processes the texts in a distributed fashion, with the aim of raising throughput. The algorithm presented here combines ideas from K-means and DBSCAN: it borrows the fixed, predetermined number of clusters from K-means and the density-based view from DBSCAN, and it uses multiple Reducers to merge intermediate results. Testing shows that with 457 texts and 50 iterations the algorithm is feasible; on such a small dataset it is actually slower than a single-threaded implementation, but as the data volume grows past a certain scale, the speed advantage of the distributed version becomes increasingly clear.
Related Models
From a practical engineering standpoint, the mathematical models involved are described briefly below; for more detail, please refer to the earlier articles in this series:
1) Cosine similarity
To judge the similarity of two Web texts, this article uses the cosine measure: each text is first segmented (Chinese word segmentation) and turned into a term-frequency vector, and the cosine of the angle between the two vectors is then computed as the similarity score. To make the results more accurate, a synonym dictionary (同义词词林) is loaded so that synonyms are treated as the same term. For details, see the earlier article in this series on text similarity.
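After segmentation, each text becomes a term-frequency vector over the merged term list. For two such vectors A = (a1, ..., an) and B = (b1, ..., bn), the score is the standard cosine:

cos(A, B) = (a1·b1 + ... + an·bn) / ( √(a1² + ... + an²) · √(b1² + ... + bn²) )

This is exactly what the countCosSimilarity method in the source code below computes.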
2) DBSCAN
DBSCAN involves two key parameters: the e-neighborhood (roughly, a radius) and the minimum count minPts (a fixed value that expresses density). If the e-neighborhood of an object (here, a Web text) contains more than minPts objects, that object is a core object. Starting from a core object, the algorithm visits each object i in its e-neighborhood, then the core or border objects in i's e-neighborhood, and so on recursively, collecting everything reached into one cluster. The goal of DBSCAN is to find the maximal set of density-connected objects; informally, if B can be reached from A and C can be reached from B, then A, B and C belong to the same cluster. For details, see the earlier article in this series on DBSCAN text clustering. A minimal code sketch of the core-object test follows.
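To make the two parameters concrete, the core-object test can be sketched as below. TextCosine and ElementDict are the classes listed later in this article; the sketch's own class and method names are illustrative, and since cosine similarity stands in for distance here, "inside the e-neighborhood" means "similarity >= eps".

import java.util.List;
import com.gta.cosine.TextCosine;
import com.gta.cosine.ElementDict;

public class CoreObjectCheck {
    // A text is a core object if at least minPts texts fall inside its e-neighborhood
    public static boolean isCoreObject(String doc, List<String> corpus,
                                       double eps, int minPts, TextCosine cosine) {
        int neighbors = 0;
        List<ElementDict> vec1 = cosine.tokenizer(doc);
        for (String other : corpus) {
            List<ElementDict> vec2 = cosine.tokenizer(other);
            if (cosine.analysisText(vec1, vec2) >= eps)  // inside the e-neighborhood
                neighbors++;
        }
        return neighbors >= minPts;
    }
}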
3) K-means
K-means first initializes K cluster centers (centroids); K is fixed before clustering begins, whereas DBSCAN cannot know the number of clusters in advance. Each text is then compared against every centroid and assigned to the one it fits best (here, the one with the highest cosine similarity). After each round the centroids are recomputed, and the process iterates until it converges or the iteration limit is reached. When recomputing a centroid, this article borrows the density idea from DBSCAN: within a cluster, the vector with the highest density becomes the new centroid, which departs from the usual distance-averaging approaches.
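In symbols: for a cluster C and neighborhood threshold e, the new centroid is the member whose e-neighborhood inside C is largest,

c_new = argmax over x in C of |{ y in C : cos(x, y) >= e }|

and this is precisely the rule that the getDensityCenter method of the DensityCenter class below implements.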
Implementing the Improved Algorithm on MapReduce
Figure 1. Framework of one pass of the improved MapReduce-based algorithm
Figure 1 shows the framework of a single pass of the improved algorithm under MapReduce. Its core parts are explained below:
1) On the Mapper side, K categories and their initial centroids are determined as in K-means; all texts are then clustered once against these centroids, each text being assigned to the category of the centroid it is most similar to.
2) On the Reducer side, borrowing from DBSCAN, the size of the e-neighborhood of each text in a category is computed, and the text whose e-neighborhood contains the most members, i.e. the densest one, becomes the new centroid.
3) On the Reducer side, five Reducers are used so that the category centroids are recomputed in parallel, which speeds up the job.
4) On the Mapper side, the new centroids needed by each iteration are obtained by reading cache files.
The inputs and outputs of the Mapper and the Reducer in this design are as follows:
Mapper: <Object, Text> --> <IntWritable, Text>
Input: the key is unused; the value is a Web text.
Output: the key is the category ID; the value is the Web text.
Design goal of the Mapper: compute, for each text, the category it belongs to, i.e. its category ID.
Reducer: <IntWritable, Text> --> <NullWritable, Text>
Input: the key is a category ID; the values are the Web texts of that category.
Output: the key is null; the value is a Web text, namely the new centroid.
Design goal of the Reducer: determine the new centroid of each category.
Test Results and Performance Analysis
Because the aim of this test is to verify the feasibility of the MapReduce-based text clustering algorithm, the dataset was deliberately kept small: 457 Web page titles grabbed at random from the Internet, clustered over 50 iterations. The iterations are what allow the centroid of each category to converge.
Table 1. Test results of the improved K-means/DBSCAN text clustering algorithm
The results in Table 1 show that on a small dataset the single-threaded implementation is clearly faster than MapReduce. The main reason is that under the MapReduce framework every iteration has to reload the dictionaries and read/write the cache files that hold the centroids; with little data, the time spent actually processing it does not even reach the I/O time of those files, so the framework's advantages cannot show. The author tried to work around this by loading the data objects through Java reflection, but with little effect.
Nevertheless, under MapReduce, using several Reducers to compute the new centroids clearly improves the speed of the reduction step; compared with single-threaded processing it saves memory while keeping the logic simple and convenient. Given that text collections will only keep growing, using a distributed framework to process massive amounts of text is becoming the norm, so the algorithm proposed here has practical significance.
Program source code:
public class ElementDict {
    private String term;
    private int freq;

    public ElementDict(String term, int freq) {
        this.term = term;
        this.freq = freq;
    }

    public void setFreq(int freq) {
        this.freq = freq;
    }

    public String getTerm() {
        return term;
    }

    public int getFreq() {
        return freq;
    }

    // Override equals/hashCode properly so collection lookups behave consistently
    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (!(obj instanceof ElementDict))
            return false;
        ElementDict e = (ElementDict) obj;
        return term.equals(e.term) && freq == e.freq;
    }

    @Override
    public int hashCode() {
        return 31 * term.hashCode() + freq;
    }
}
Class ElementDict
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;
public class TextCosine {
    private static final String PATH = "hdfs://10.1.130.10:9000";
    private static Logger logger = LogManager.getLogger(TextCosine.class);
    private Map<String, String> map = null;
    private double common;
    private double special;

    public TextCosine() {
        map = new HashMap<String, String>();
        try {
            // Load the synonym dictionary from HDFS; each line maps a term
            // to its synonym representative, separated by "→"
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(PATH), conf);
            Path path = new Path("/user/hadoop/doc/synonyms.dict");
            FSDataInputStream is = fs.open(path);
            BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
            String s = null;
            while ((s = br.readLine()) != null) {
                String[] synonymsEnum = s.split("→");
                map.put(synonymsEnum[0], synonymsEnum[1]);
            }
            br.close();
        } catch (IOException e) {
            logger.error("TextCosine IOException!");
        }
    }

    public TextCosine(double common, double special) {
        this();  // reuse the dictionary loading above instead of duplicating it
        this.common = common;
        this.special = special;
    }
    public void setCommon(double common) {
        this.common = common;
    }

    public void setSpecial(double special) {
        this.special = special;
    }

    // Tokenize the text with IK Analyzer and count term frequencies
    public List<ElementDict> tokenizer(String str) {
        List<ElementDict> list = new ArrayList<ElementDict>();
        IKAnalyzer analyzer = new IKAnalyzer(true);
        try {
            TokenStream stream = analyzer.tokenStream("", str);
            CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            int index = -1;
            while (stream.incrementToken()) {
                if ((index = isContain(cta.toString(), list)) >= 0) {
                    list.get(index).setFreq(list.get(index).getFreq() + 1);
                } else {
                    list.add(new ElementDict(cta.toString(), 1));
                }
            }
            analyzer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return list;
    }
    // Return the index of the term (or of its synonym) in the list, or -1
    public int isContain(String str, List<ElementDict> list) {
        for (ElementDict ed : list) {
            if (ed.getTerm().equals(str)) {
                return list.indexOf(ed);
            } else if (map.get(ed.getTerm()) != null && map.get(ed.getTerm()).equals(str)) {
                return list.indexOf(ed);
            }
        }
        return -1;
    }
    // Merge the terms of the two lists to align the vectors; a term is added
    // only if neither it nor its synonym is already present
    public List<String> mergeTerms(List<ElementDict> list1, List<ElementDict> list2) {
        List<String> list = new ArrayList<String>();
        for (ElementDict ed : list1) {
            if (!list.contains(ed.getTerm()) && !list.contains(map.get(ed.getTerm()))) {
                list.add(ed.getTerm());
            }
        }
        for (ElementDict ed : list2) {
            if (!list.contains(ed.getTerm()) && !list.contains(map.get(ed.getTerm()))) {
                list.add(ed.getTerm());
            }
        }
        return list;
    }
    // Slide a window over the longer text and return the maximum cosine score
    public double analysisText(List<ElementDict> list1, List<ElementDict> list2) {
        int len1 = list1.size();
        int len2 = list2.size();
        double ret = 0;
        if (len2 >= len1 * 1.5) {
            for (int i = 0; i + len1 <= len2; i++) {
                List<ElementDict> newList = new ArrayList<ElementDict>();  // fresh window each step
                for (int j = 0; j < len1; j++)
                    newList.add(list2.get(i + j));
                newList = adjustList(newList, list2, len2, len1, i);
                double tmp = analysis(list1, newList);
                if (tmp > ret)
                    ret = tmp;
            }
        } else if (len1 >= len2 * 1.5) {
            for (int i = 0; i + len2 <= len1; i++) {
                List<ElementDict> newList = new ArrayList<ElementDict>();
                for (int j = 0; j < len2; j++)
                    newList.add(list1.get(i + j));
                newList = adjustList(newList, list1, len1, len2, i);
                double tmp = analysis(list2, newList);  // compare the window against the shorter text
                if (tmp > ret)
                    ret = tmp;
            }
        } else {
            ret = analysis(list1, list2);
        }
        return ret;
    }
    // Pad the window with a few extra terms from the longer list
    public List<ElementDict> adjustList(List<ElementDict> newList, List<ElementDict> list, int lenBig, int lenSmall, int index) {
        int gap = lenBig - lenSmall;
        int size = (gap / 2 > 2) ? 2 : gap / 2;
        if (index < gap / 2) {
            for (int i = 0; i < size; i++) {
                newList.add(list.get(lenSmall + index + i));
            }
        } else {
            for (int i = 0; i < size; i++) {
                newList.add(list.get(lenBig - index - i));
            }
        }
        return newList;
    }
    // Compute the cosine similarity of two term lists
    public double analysis(List<ElementDict> list1, List<ElementDict> list2) {
        List<String> list = mergeTerms(list1, list2);
        List<Integer> weightList1 = assignWeight(list, list1);
        List<Integer> weightList2 = assignWeight(list, list2);
        return countCosSimilarity(weightList1, weightList2);
    }

    // Assign each merged term its frequency in the given list as weight
    public List<Integer> assignWeight(List<String> list, List<ElementDict> list1) {
        List<Integer> vecList = new ArrayList<Integer>(list.size());
        boolean isEqual = false;
        for (String str : list) {
            for (ElementDict ed : list1) {
                if (ed.getTerm().equals(str)) {
                    isEqual = true;
                    vecList.add(Integer.valueOf(ed.getFreq()));
                    break;  // add at most one weight per merged term
                } else if (map.get(ed.getTerm()) != null && map.get(ed.getTerm()).equals(str)) {
                    isEqual = true;
                    vecList.add(Integer.valueOf(ed.getFreq()));
                    break;
                }
            }
            if (!isEqual) {
                vecList.add(Integer.valueOf(0));
            }
            isEqual = false;
        }
        return vecList;
    }
    // Compute the cosine of the two weight vectors
    public double countCosSimilarity(List<Integer> list1, List<Integer> list2) {
        double countScores = 0;
        int element = 0;
        int denominator1 = 0;
        int denominator2 = 0;
        int index = -1;
        for (Integer it : list1) {
            index++;
            int left = it.intValue();
            int right = list2.get(index).intValue();
            element += left * right;
            denominator1 += left * left;
            denominator2 += right * right;
        }
        // Guard against an all-zero vector: floating-point division by zero
        // yields NaN rather than an ArithmeticException
        double denominator = Math.sqrt((double) denominator1 * denominator2);
        if (denominator > 0)
            countScores = element / denominator;
        return countScores;
    }

    public boolean isSimilarity(double param, double score) {
        return score >= param;
    }
    public boolean assertSimilarity(List<ElementDict> list1, List<ElementDict> list2) {
        int len1 = list1.size();
        int len2 = list2.size();
        if (len2 >= len1 * 1.5) {
            for (int i = 0; i + len1 <= len2; i++) {
                List<ElementDict> newList = new ArrayList<ElementDict>();
                for (int j = 0; j < len1; j++)
                    newList.add(list2.get(i + j));
                newList = adjustList(newList, list2, len2, len1, i);
                if (isSimilarity(special, analysis(list1, newList)))
                    return true;
            }
        } else if (len1 >= len2 * 1.5) {
            for (int i = 0; i + len2 <= len1; i++) {
                List<ElementDict> newList = new ArrayList<ElementDict>();
                for (int j = 0; j < len2; j++)
                    newList.add(list1.get(i + j));
                newList = adjustList(newList, list1, len1, len2, i);
                if (isSimilarity(special, analysis(list2, newList)))
                    return true;
            }
        } else {
            if (isSimilarity(common, analysis(list1, list2)))
                return true;
        }
        return false;
    }
}
Class TextCosine
import java.util.Collections;
import java.util.List;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import com.gta.cosine.TextCosine;
import com.gta.cosine.ElementDict;

public class DensityCenter {
    private Logger logger = LogManager.getLogger(DensityCenter.class);
    private double eps;
    private TextCosine cosine;

    public DensityCenter(double eps, TextCosine cosine) {
        this.eps = eps;
        this.cosine = cosine;
    }

    // Cosine similarity between two raw texts
    public double cosineDistance(String src, String dst) {
        List<ElementDict> vec1 = cosine.tokenizer(src);
        List<ElementDict> vec2 = cosine.tokenizer(dst);
        return cosine.analysisText(vec1, vec2);
    }

    // Count the texts whose similarity to src reaches eps (the e-neighborhood)
    public int getNeighbors(String src, List<String> dst) {
        int ret = 0;
        double score = 0;
        for (String s : dst) {
            score = cosineDistance(src, s);
            if (score >= eps)
                ret++;
        }
        return ret;
    }

    // The text with the most neighbors, i.e. the highest density, is the new center
    public String getDensityCenter(List<String> text) {
        int max = 0;
        int i = 0;
        int index = 0;
        for (String s : text) {
            int ret = getNeighbors(s, text);
            if (ret > max) {
                index = i;
                max = ret;
            }
            i++;
        }
        return text.get(index);
    }

    public boolean compareCenters(List<String> oldCenters, List<String> newCenters) {
        boolean ret = false;
        Collections.sort(oldCenters);
        Collections.sort(newCenters);
        int oldSize = oldCenters.size();
        int newSize = newCenters.size();
        logger.info("oldSize : " + oldSize);
        logger.info("newSize : " + newSize);
        int size = oldSize > newSize ? newSize : oldSize;
        int index = 0;
        int count = 0;
        for (String s : oldCenters) {
            if (s.equals(newCenters.get(index)))
                count++;
            index++;
            if (index >= size)  // guard against lists of different sizes
                break;
        }
        logger.info("count : " + count);
        if (count == index)
            ret = true;
        return ret;
    }
}
Class DensityCenter
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import com.gta.cosine.TextCosine;
import com.gta.cosine.ElementDict;
import com.gta.util.DensityCenter;

public class KMeansProcess {

    public static class TextMapper extends Mapper<Object, Text, IntWritable, Text> {
        private static Logger logger = LogManager.getLogger(TextMapper.class);
        // Static so the centers and the cosine tool are initialized only once per JVM
        public static List<String> centersList = new ArrayList<String>();
        public static TextCosine cosine = new TextCosine();

        public void setup(Context context) {
            int iteration = context.getConfiguration().getInt("ITERATION", 100);
            if (iteration == 0) {
                // On the first iteration, read the initial centers from the cache files;
                // later iterations reuse the static centersList refreshed by the driver
                int task = context.getConfiguration().getInt("TASK", 0);
                try {
                    URI[] caches = context.getCacheFiles();
                    if (caches == null || caches.length <= 0) {
                        System.exit(1);
                    }
                    for (int i = 0; i < task; i++) {
                        FileSystem fs = FileSystem.get(caches[i], context.getConfiguration());
                        FSDataInputStream is = fs.open(new Path(caches[i].toString()));
                        BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
                        String s = null;
                        while ((s = br.readLine()) != null) {
                            centersList.add(s);
                        }
                        br.close();
                    }
                } catch (IOException e) {
                    logger.error(e.getMessage());
                }
            }
        }

        // Emit (category ID, text): each text goes to its most similar center
        public void map(Object key, Text value, Context context) {
            try {
                String str = value.toString();
                double score = 0;
                double countTmp = 0;
                int clusterID = 0;
                int index = 0;
                List<ElementDict> vec1 = cosine.tokenizer(str);
                for (String s : centersList) {
                    List<ElementDict> vec2 = cosine.tokenizer(s);
                    countTmp = cosine.analysisText(vec1, vec2);
                    if (countTmp > score) {
                        clusterID = index;
                        score = countTmp;
                    }
                    index++;
                }
                context.write(new IntWritable(clusterID), new Text(str));
            } catch (IOException e) {
                logger.error(e.getMessage());
            } catch (InterruptedException e) {
                logger.error(e.getMessage());
            }
        }
    }

    public static class TextReducer extends Reducer<IntWritable, Text, NullWritable, Text> {
        private static Logger logger = LogManager.getLogger(TextReducer.class);
        public static DensityCenter center = new DensityCenter(0.75, KMeansProcess.TextMapper.cosine);

        // Output the densest member of each category as its new center
        public void reduce(IntWritable key, Iterable<Text> values, Context context) {
            try {
                List<String> list = new ArrayList<String>();
                for (Text val : values) {
                    list.add(val.toString());
                }
                context.write(NullWritable.get(), new Text(center.getDensityCenter(list)));
            } catch (IOException e) {
                logger.error(e.getMessage());
            } catch (InterruptedException e) {
                logger.error(e.getMessage());
            }
        }
    }
}
Class KMeansProcess
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.List;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import com.gta.cosine.TextCosine;
import com.gta.cosine.ElementDict;

public class KMeans {

    public static class KMeansMapper extends Mapper<Object, Text, IntWritable, Text> {
        private static Logger logger = LogManager.getLogger(KMeans.KMeansMapper.class);
        // Reuse the objects already initialized by KMeansProcess
        private List<String> centersList = KMeansProcess.TextMapper.centersList;
        private TextCosine cosine = KMeansProcess.TextMapper.cosine;

        public void setup(Context context) {
            // Read the final centers from the cache files
            int task = context.getConfiguration().getInt("TASK", 0);
            try {
                URI[] caches = context.getCacheFiles();
                if (caches == null || caches.length <= 0) {
                    System.exit(1);
                }
                for (int i = 0; i < task; i++) {
                    FileSystem fs = FileSystem.get(caches[i], context.getConfiguration());
                    FSDataInputStream is = fs.open(new Path(caches[i].toString()));
                    BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
                    String s = null;
                    while ((s = br.readLine()) != null)
                        centersList.add(s);
                    br.close();
                }
            } catch (IOException e) {
                logger.error(e.getMessage());
            }
        }

        // Label every text with the ID of its most similar final center
        public void map(Object key, Text value, Context context) {
            try {
                String str = value.toString();
                double score = 0;
                double countTmp = 0;
                int clusterID = 0;
                int index = 0;
                List<ElementDict> vec1 = cosine.tokenizer(str);
                for (String s : centersList) {
                    List<ElementDict> vec2 = cosine.tokenizer(s);
                    countTmp = cosine.analysisText(vec1, vec2);
                    if (countTmp > score) {
                        clusterID = index;
                        score = countTmp;
                    }
                    index++;
                }
                context.write(new IntWritable(clusterID), new Text(str));
            } catch (IOException e) {
                logger.error(e.getMessage());
            } catch (InterruptedException e) {
                logger.error(e.getMessage());
            }
        }

        public void cleanup(Context context) {
            centersList.clear();
        }
    }

    public static class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        private static Logger logger = LogManager.getLogger(KMeans.KMeansReducer.class);

        public void reduce(IntWritable key, Iterable<Text> values, Context context) {
            try {
                for (Text val : values) {
                    context.write(key, val);
                }
            } catch (IOException e) {
                logger.error(e.getMessage());
            } catch (InterruptedException e) {
                logger.error(e.getMessage());
            }
        }
    }
}
Class KMeans
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import com.gta.cluster.KMeans.KMeansMapper;
import com.gta.cluster.KMeans.KMeansReducer;
import com.gta.cluster.KMeansProcess.TextMapper;
import com.gta.cluster.KMeansProcess.TextReducer;

public class Cluster {
    public static final int MAX = 50;  // maximum number of iterations
    public static final String INPUT_PATH = "hdfs://10.1.130.10:9000/user/hadoop/input/";
    public static final String OUTPUT_PATH = "hdfs://10.1.130.10:9000/user/hadoop/output/";
    public static final String TMP_PATH = "hdfs://10.1.130.10:9000/user/hadoop/tmp/";
    public static final int TASK = 5;  // number of Reducers
    public static Logger logger = LogManager.getLogger(Cluster.class);
    private Configuration conf;
    private int iteration = 0;

    public Cluster() {
        this.conf = new Configuration();
        conf.setInt("TASK", TASK);
    }

    // Iterate the KMeansProcess job until the centers converge or MAX is reached
    public void run() throws IOException, InterruptedException, ClassNotFoundException {
        while (iteration < MAX) {
            logger.info("iteration : " + (iteration + 1));
            conf.setInt("ITERATION", iteration);
            Job job = Job.getInstance(conf, "KMeans Process");
            if (iteration == 0) {
                // The initial centers are read from cache files on the first pass only
                String cacheFile = TMP_PATH + iteration + "/part-r-0000";
                for (int i = 0; i < TASK; i++)
                    job.addCacheFile(URI.create(cacheFile + i));
            }
            job.setJarByClass(KMeansProcess.class);
            job.setMapperClass(TextMapper.class);
            job.setNumReduceTasks(TASK);
            job.setReducerClass(TextReducer.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(Text.class);
            iteration++;
            String outFile = TMP_PATH + iteration;
            FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
            FileOutputFormat.setOutputPath(job, new Path(outFile));
            job.waitForCompletion(true);
            conf.unset("ITERATION");
            List<String> tmpList = getCenterList(outFile);
            // Stop once the new centers equal the old ones
            if (KMeansProcess.TextReducer.center.compareCenters(KMeansProcess.TextMapper.centersList, tmpList))
                break;
            else {
                KMeansProcess.TextMapper.centersList.clear();
                for (String s : tmpList) {
                    KMeansProcess.TextMapper.centersList.add(s);
                }
            }
        }
    }

    // One final pass that labels every text with its category ID
    public void lastRun() throws IOException, InterruptedException, ClassNotFoundException {
        String cacheFile = TMP_PATH + iteration + "/part-r-0000";
        Job job = Job.getInstance(conf, "KMeans");
        for (int i = 0; i < TASK; i++)
            job.addCacheFile(URI.create(cacheFile + i));
        job.setJarByClass(KMeans.class);
        job.setMapperClass(KMeansMapper.class);
        job.setReducerClass(KMeansReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
        job.waitForCompletion(true);
    }

    // Collect the centers written by the TASK Reducers of the last job
    public List<String> getCenterList(String outFile) {
        List<String> centerList = new ArrayList<String>();
        String fileName = outFile + "/part-r-0000";
        try {
            for (int i = 0; i < TASK; i++) {
                FileSystem fs = FileSystem.get(URI.create(fileName + i), conf);
                FSDataInputStream is = fs.open(new Path(fileName + i));
                BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
                String s = null;
                while ((s = br.readLine()) != null)
                    centerList.add(s);
                br.close();
            }
        } catch (IOException e) {
            logger.info(e.getMessage());
        }
        return centerList;
    }

    public static void main(String[] args) {
        Cluster cluster = new Cluster();
        try {
            long start = System.currentTimeMillis();
            cluster.run();
            cluster.lastRun();
            long end = System.currentTimeMillis();
            Cluster.logger.info("elapsed time (ms) : " + (end - start));
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
Class Cluster
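Assuming the classes above live in the packages shown in the imports (com.gta.cluster and com.gta.cosine) and are packed into a jar, the driver can be launched in the usual Hadoop way, e.g. hadoop jar text-cluster.jar com.gta.cluster.Cluster (the jar name here is illustrative). Note that the input titles must already sit under /user/hadoop/input/ and the initial centroids under /user/hadoop/tmp/0/, since the first iteration reads its cache files from there.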
Because, in a distributed environment, every iteration has to re-read the cache files, static variables are used here to avoid repeatedly initializing objects such as TextCosine, which improves processing speed. The author has repeatedly tried to pass object instances directly into the Job, but every attempt failed; if you have a better solution, please contact me.
Author: 志青云集
Source: http://www.cnblogs.com/lyssym
The copyright of this article is shared by the author and cnblogs.com. Reposting is welcome, but if done without the author's consent, this statement must be retained and a visible link to the original must be given on the article page.