刘勇 (Liu Yong)  Email: lyssym@sina.com

Introduction

  When a large volume of text is processed in a single thread, the processing itself takes a long time, the I/O on that much data takes a long time as well, and memory consumption is also high. This article therefore adopts the MapReduce computing model and processes the text in a distributed fashion, with the aim of raising the processing rate. The algorithm presented here combines and improves on Kmeans and DBSCAN: it borrows from Kmeans the fixed number of clusters and from DBSCAN the density-based view, and it uses several Reducers to merge the data during processing. The test results show that with 457 texts and 50 iterations the algorithm is feasible; on a small data set its speed is somewhat worse than single-threaded processing, but once the data volume grows to a sufficient scale, the speed advantage of the distributed algorithm becomes increasingly evident.

Related Models

  The mathematical models involved are described briefly below from a practical engineering point of view; for more detail, please refer to the earlier articles in this series:

  1) Cosine similarity

  To judge the similarity of two Web texts, this article uses the cosine measure: the two texts are first segmented (Chinese word segmentation) and turned into term-frequency vectors, the cosine of the two vectors is computed, and the result is used to decide how similar the texts are. To make the result more accurate, a synonym dictionary (同义词词林) is loaded so that synonyms count as the same term. The details are not repeated here; see the earlier article in this series, 文本挖掘之文本相似度判定.
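  For reference, if the two segmented texts are represented as term-frequency vectors a and b over their merged term list, the similarity used throughout this article is the ordinary cosine (identical texts score 1; texts sharing no terms or synonyms score 0):

\[
\cos(a, b) = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^{2}}\,\sqrt{\sum_{i} b_i^{2}}}
\]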

  2) DBSCAN

  DBSCAN involves two important parameters: the ε-neighborhood (roughly, a radius) and the minimum count minPts (a fixed value that characterizes density). If the number of objects (here, Web texts) within an object's ε-neighborhood is greater than minPts, that object is a core object. Starting from a core object, each object i in its ε-neighborhood is examined, the core or border objects within i's ε-neighborhood are taken in as well, and the search recurses, so that all objects found this way form one cluster. The goal of DBSCAN is to find the maximal set of density-connected objects; put simply, if B can be reached from A and C can be reached from B, then A, B and C belong to the same cluster. The details are not repeated here; see 文本挖掘之文本聚类(DBSCAN) in this series.
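  As a minimal illustration of the core-object test only (this is not code from the project below; the class name, the sim function and the eps/minPts arguments are placeholders for TextCosine and the thresholds used later):

import java.util.List;
import java.util.function.BiFunction;

// A text is a "core object" if at least minPts texts of the corpus lie in its
// eps-neighborhood, where the neighborhood is defined by cosine similarity >= eps.
public class CoreObjectCheck {
    public static boolean isCore(String text, List<String> corpus,
                                 BiFunction<String, String, Double> sim,
                                 double eps, int minPts) {
        int neighbors = 0;
        for (String other : corpus) {
            if (sim.apply(text, other) >= eps) neighbors++;
        }
        return neighbors >= minPts;
    }
}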

  3) Kmeans

  Kmeans first initializes K cluster centers (centroids); K is fixed before clustering starts (note that with DBSCAN the number of clusters cannot be known in advance). Each text is then compared with every centroid (in this article, by cosine similarity) and assigned to the centroid it fits best. After each round, the centroid of every cluster is recomputed, and the process iterates until the centroids converge or the iteration limit is reached. When recomputing the centroids, this article borrows the density idea from DBSCAN: within a cluster, the vector with the highest density becomes the new centroid, which is a departure from algorithms that pick the centroid from distance statistics. The details are not repeated here.

Implementation of the Improved Algorithm Based on MapReduce

Figure 1: Framework of the improved MapReduce-based algorithm

  Figure 1 shows the framework of one run of the improved MapReduce-based algorithm. Some of the core points of this framework are explained as follows:

  1) On the Mapper side, as in Kmeans, K clusters and their initial centroids are determined; every text is then clustered once against these centroids, i.e. each text is assigned to the cluster whose centroid it is most similar to.

  2) On the Reducer side, following DBSCAN, the number of objects in the ε-neighborhood of each member of a cluster is counted; the member whose ε-neighborhood contains the most objects, i.e. the densest member, becomes the new centroid of that cluster.

  3) On the Reducer side, to speed up processing, 5 Reducers are used to recompute the cluster centroids.

  4) On the Mapper side, the new centroids required by each iteration are obtained by reading cache files (a single-process sketch of one iteration follows below).
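  To make the four points above concrete, the following single-process sketch shows what one iteration computes. It is not the MapReduce code itself (that is listed later in this article); IterationSketch, the sim function and the eps argument are illustrative stand-ins for TextCosine.analysisText and the 0.75 threshold used in the Reducer.

import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Single-process sketch of one iteration of the improved algorithm:
// step 1 mirrors the Mapper (assign each text to the most similar centroid),
// step 2 mirrors the Reducer (the densest member of each cluster becomes the new centroid).
public class IterationSketch {
    public static List<String> iterate(List<String> texts, List<String> centers,
                                       BiFunction<String, String, Double> sim, double eps) {
        // Step 1: Mapper side - assign every text to the most similar centroid
        List<List<String>> clusters = new ArrayList<List<String>>();
        for (int i = 0; i < centers.size(); i++) clusters.add(new ArrayList<String>());
        for (String t : texts) {
            int best = 0;
            double bestScore = -1;
            for (int i = 0; i < centers.size(); i++) {
                double s = sim.apply(t, centers.get(i));
                if (s > bestScore) { bestScore = s; best = i; }
            }
            clusters.get(best).add(t);
        }
        // Step 2: Reducer side - the member with the most eps-neighbors becomes the new centroid
        List<String> newCenters = new ArrayList<String>();
        for (List<String> cluster : clusters) {
            String densest = null;
            int maxNeighbors = -1;
            for (String candidate : cluster) {
                int neighbors = 0;
                for (String other : cluster)
                    if (sim.apply(candidate, other) >= eps) neighbors++;
                if (neighbors > maxNeighbors) { maxNeighbors = neighbors; densest = candidate; }
            }
            if (densest != null) newCenters.add(densest);
        }
        return newCenters;
    }
}

  The driver repeats this step until the returned centers stop changing or the iteration limit (50 here) is reached.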

  The inputs and outputs of the Mapper and the Reducer in this design are as follows:

  Mapper: <Object, Text> --> <IntWritable, Text>

  Input: key is unused; value is one Web text;

  Output: key is the cluster ID; value is the Web text.

  Goal of the Mapper: compute, for every text, the cluster it belongs to, i.e. its cluster ID.

  Reducer: <IntWritable, Text> --> <NullWritable, Text>

  Input: key is the cluster ID; value is the Web texts belonging to that cluster;

  Output: key is Null; value is one Web text, namely the new centroid.

  Goal of the Reducer: determine the new centroid for each cluster.
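  As a made-up illustration of this key/value flow (cluster ID 3 and the titles are hypothetical):

  Mapper  input : (offset, "title A")                  output: (3, "title A")
  Mapper  input : (offset, "title B")                  output: (3, "title B")
  Reducer input : (3, ["title A", "title B", ...])     output: (null, "title B")    // densest member = new centroid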

Test Results and Performance Analysis

  Since the purpose of this test is to check whether the MapReduce-based text clustering algorithm is feasible, the data set was deliberately kept small: 457 Web page titles randomly crawled from the Internet, clustered for 50 iterations. The purpose of iterating is to let the centroid of each cluster converge.

Table 1: Test results of the improved Kmeans/DBSCAN text clustering algorithm

  Table 1 shows that on a small data set, single-threaded processing is clearly faster than MapReduce. The main reason is that under the MapReduce framework every iteration has to reload the dictionary and read/write the cache files to obtain or update the centroids; with so little data, the time spent actually processing the texts is less than the time spent on this file I/O, so the advantage of the framework does not show. The author also tried using Java reflection to load the data objects in order to work around this problem, but with little effect.

  Nevertheless, under the MapReduce framework, using several Reducers to compute the new centroids clearly improves the speed of the reduction step; compared with single-threaded processing it not only saves memory but is also simple and convenient. Considering that text collections keep growing, using a distributed framework to process massive text data is becoming the norm, so the algorithm proposed here has some practical value.

  Program source code:

public class ElementDict {
    private String term;
    private int freq;

    public ElementDict(String term, int freq) {
        this.term = term;
        this.freq = freq;
    }

    public void setFreq(int freq) {
        this.freq = freq;
    }

    public String getTerm() {
        return term;
    }

    public int getFreq() {
        return freq;
    }

    public boolean equals(ElementDict e) {
        return term.equals(e.getTerm()) && freq == e.getFreq();
    }
}

Class ElementDict

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class TextCosine {
    private Map<String, String> map = null;
    private double common;
    private double special;
    private static final String PATH = "hdfs://10.1.130.10:9000";
    private static Logger logger = LogManager.getLogger(TextCosine.class);

    public TextCosine() {
        loadSynonyms();
    }

    public TextCosine(double common, double special) {
        loadSynonyms();
        this.common = common;
        this.special = special;
    }

    // load the synonym dictionary from HDFS (shared by both constructors)
    private void loadSynonyms() {
        map = new HashMap<String, String>();
        try {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(PATH), conf);
            Path path = new Path("/user/hadoop/doc/synonyms.dict");
            FSDataInputStream is = fs.open(path);
            BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
            String s = null;
            while ((s = br.readLine()) != null) {
                String[] synonymsEnum = s.split("→");
                map.put(synonymsEnum[0], synonymsEnum[1]);
            }
            br.close();
        } catch (IOException e) {
            logger.error("TextCosine IOException!");
        }
    }

    public void setCommon(double common) {
        this.common = common;
    }

    public void setSpecial(double special) {
        this.special = special;
    }

    // get the words with IK Analyzer
    public List<ElementDict> tokenizer(String str) {
        List<ElementDict> list = new ArrayList<ElementDict>();
        IKAnalyzer analyzer = new IKAnalyzer(true);
        try {
            TokenStream stream = analyzer.tokenStream("", str);
            CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            int index = -1;
            while (stream.incrementToken()) {
                if ((index = isContain(cta.toString(), list)) >= 0) {
                    list.get(index).setFreq(list.get(index).getFreq() + 1);
                } else {
                    list.add(new ElementDict(cta.toString(), 1));
                }
            }
            analyzer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return list;
    }

    // assert one term is in the List (a synonym counts as the same term)
    public int isContain(String str, List<ElementDict> list) {
        for (ElementDict ed : list) {
            if (ed.getTerm().equals(str)) {
                return list.indexOf(ed);
            } else if (map.get(ed.getTerm()) != null && map.get(ed.getTerm()).equals(str)) {
                return list.indexOf(ed);
            }
        }
        return -1;
    }

    // merge the two term lists to align the vectors
    public List<String> mergeTerms(List<ElementDict> list1, List<ElementDict> list2) {
        List<String> list = new ArrayList<String>();
        for (ElementDict ed : list1) {
            if (!list.contains(ed.getTerm())) {
                list.add(ed.getTerm());
            } else if (!list.contains(map.get(ed.getTerm()))) {
                list.add(ed.getTerm());
            }
        }

        for (ElementDict ed : list2) {
            if (!list.contains(ed.getTerm())) {
                list.add(ed.getTerm());
            } else if (!list.contains(map.get(ed.getTerm()))) {
                list.add(ed.getTerm());
            }
        }
        return list;
    }

    // get the max cosine: when one text is much longer, slide a window of the
    // shorter text's length over the longer one and keep the best score
    public double analysisText(List<ElementDict> list1, List<ElementDict> list2) {
        int len1 = list1.size();
        int len2 = list2.size();
        double ret = 0;
        if (len2 >= len1 * 1.5) {
            for (int i = 0; i + len1 <= len2; i++) {
                // start from a fresh window each iteration (original reused the list across windows)
                List<ElementDict> newList = new ArrayList<ElementDict>();
                for (int j = 0; j < len1; j++)
                    newList.add(list2.get(i + j));

                newList = adjustList(newList, list2, len2, len1, i);
                double tmp = analysis(list1, newList);
                if (tmp > ret)
                    ret = tmp;
            }
        } else if (len1 >= len2 * 1.5) {
            for (int i = 0; i + len2 <= len1; i++) {
                List<ElementDict> newList = new ArrayList<ElementDict>();
                for (int j = 0; j < len2; j++)
                    newList.add(list1.get(i + j));

                newList = adjustList(newList, list1, len1, len2, i);
                // compare the window against the other text (original compared list1 with its own slice)
                double tmp = analysis(list2, newList);
                if (tmp > ret)
                    ret = tmp;
            }
        } else {
            ret = analysis(list1, list2);
        }
        return ret;
    }

    // pad the window with a couple of extra terms from the longer list so its
    // length is closer to the original
    public List<ElementDict> adjustList(List<ElementDict> newList, List<ElementDict> list, int lenBig, int lenSmall, int index) {
        int gap = lenBig - lenSmall;
        int size = (gap / 2 > 2) ? 2 : gap / 2;
        if (index < gap / 2) {
            for (int i = 0; i < size; i++) {
                newList.add(list.get(lenSmall + index + i));
            }
        } else {
            for (int i = 0; i < size; i++) {   // original condition "i > size" made this loop dead code
                newList.add(list.get(lenBig - index - i));
            }
        }
        return newList;
    }

    // analyse the cosine for two term lists
    public double analysis(List<ElementDict> list1, List<ElementDict> list2) {
        List<String> list = mergeTerms(list1, list2);
        List<Integer> weightList1 = assignWeight(list, list1);
        List<Integer> weightList2 = assignWeight(list, list2);
        return countCosSimilarity(weightList1, weightList2);
    }

    // according to the term frequency, assign the weight for every merged term
    public List<Integer> assignWeight(List<String> list, List<ElementDict> list1) {
        List<Integer> vecList = new ArrayList<Integer>(list.size());
        boolean isEqual = false;
        for (String str : list) {
            for (ElementDict ed : list1) {
                if (ed.getTerm().equals(str)) {
                    isEqual = true;
                    vecList.add(Integer.valueOf(ed.getFreq()));
                    break;   // add at most one weight per merged term
                } else if (map.get(ed.getTerm()) != null && map.get(ed.getTerm()).equals(str)) {
                    isEqual = true;
                    vecList.add(Integer.valueOf(ed.getFreq()));
                    break;
                }
            }

            if (!isEqual) {
                vecList.add(Integer.valueOf(0));
            }
            isEqual = false;
        }
        return vecList;
    }

    // count the cosine of the two weight vectors
    public double countCosSimilarity(List<Integer> list1, List<Integer> list2) {
        double countScores = 0;
        int element = 0;
        int denominator1 = 0;
        int denominator2 = 0;
        int index = -1;
        for (Integer it : list1) {
            index++;
            int left = it.intValue();
            int right = list2.get(index).intValue();
            element += left * right;
            denominator1 += left * left;
            denominator2 += right * right;
        }
        // dividing doubles never throws ArithmeticException, so guard against a zero denominator instead
        double denominator = Math.sqrt((double) denominator1 * denominator2);
        if (denominator != 0) {
            countScores = element / denominator;
        }
        return countScores;
    }

    public boolean isSimilarity(double param, double score) {
        return score >= param;
    }

    public boolean assertSimilarity(List<ElementDict> list1, List<ElementDict> list2) {
        int len1 = list1.size();
        int len2 = list2.size();
        if (len2 >= len1 * 1.5) {
            for (int i = 0; i + len1 <= len2; i++) {
                List<ElementDict> newList = new ArrayList<ElementDict>();
                for (int j = 0; j < len1; j++)
                    newList.add(list2.get(i + j));

                newList = adjustList(newList, list2, len2, len1, i);
                if (isSimilarity(special, analysis(list1, newList)))
                    return true;
            }
        } else if (len1 >= len2 * 1.5) {
            for (int i = 0; i + len2 <= len1; i++) {
                List<ElementDict> newList = new ArrayList<ElementDict>();
                for (int j = 0; j < len2; j++)
                    newList.add(list1.get(i + j));

                newList = adjustList(newList, list1, len1, len2, i);
                // compare the window against the other text (original compared list1 with its own slice)
                if (isSimilarity(special, analysis(list2, newList)))
                    return true;
            }
        } else {
            if (isSimilarity(common, analysis(list1, list2)))
                return true;
        }
        return false;
    }
}

Class TextCosine

import java.util.Collections;
import java.util.List;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import com.gta.cosine.TextCosine;
import com.gta.cosine.ElementDict;

public class DensityCenter {
    private Logger logger = LogManager.getLogger(DensityCenter.class);
    private double eps;
    private TextCosine cosine;

    public DensityCenter(double eps, TextCosine cosine) {
        this.eps = eps;
        this.cosine = cosine;
    }

    public double cosineDistance(String src, String dst) {
        List<ElementDict> vec1 = cosine.tokenizer(src);
        List<ElementDict> vec2 = cosine.tokenizer(dst);
        return cosine.analysisText(vec1, vec2);
    }

    // number of texts in dst whose similarity to src is at least eps
    public int getNeighbors(String src, List<String> dst) {
        int ret = 0;
        double score = 0;
        for (String s : dst) {
            score = cosineDistance(src, s);
            if (score >= eps)
                ret++;
        }
        return ret;
    }

    // the text with the most eps-neighbors is the densest one, i.e. the new centroid
    public String getDensityCenter(List<String> text) {
        int max = 0;
        int i = 0;
        int index = 0;
        for (String s : text) {
            int ret = getNeighbors(s, text);
            if (ret > max) {
                index = i;
                max = ret;
            }
            i++;
        }
        return text.get(index);
    }

    public boolean compareCenters(List<String> oldCenters, List<String> newCenters) {
        boolean ret = false;
        Collections.sort(oldCenters);
        Collections.sort(newCenters);
        int oldSize = oldCenters.size();
        int newSize = newCenters.size();
        logger.info("oldSize : " + oldSize);
        logger.info("newSize : " + newSize);
        int size = oldSize > newSize ? newSize : oldSize;
        int index = 0;
        int count = 0;
        for (String s : oldCenters) {
            if (s.equals(newCenters.get(index)))
                count++;

            index++;
            if (index >= size) // avoid going out of range when the two lists differ in size
                break;
        }
        logger.info("count : " + count);
        if (count == index)
            ret = true;

        return ret;
    }
}

Class DensityCenter

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import com.gta.cosine.TextCosine;
import com.gta.cosine.ElementDict;
import com.gta.util.DensityCenter;

public class KMeansProcess {

    public static class TextMapper extends Mapper<Object, Text, IntWritable, Text> {
        private static Logger logger = LogManager.getLogger(TextMapper.class);
        // static so the driver (Cluster) and the final KMeans job can reuse the centroids and the tokenizer
        public static List<String> centersList = new ArrayList<String>();
        public static TextCosine cosine = new TextCosine();

        @Override
        public void setup(Context context) {
            int iteration = context.getConfiguration().getInt("ITERATION", 100);
            if (iteration == 0) {   // the initial centroids come from the cache files
                int task = context.getConfiguration().getInt("TASK", 0);
                try {
                    URI[] caches = context.getCacheFiles();
                    if (caches == null || caches.length <= 0) {
                        System.exit(1);
                    }
                    for (int i = 0; i < task; i++) {
                        FileSystem fs = FileSystem.get(caches[i], context.getConfiguration());
                        FSDataInputStream is = fs.open(new Path(caches[i].toString()));
                        BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
                        String s = null;
                        while ((s = br.readLine()) != null) {
                            centersList.add(s);
                        }
                        br.close();
                    }
                } catch (IOException e) {
                    logger.error(e.getMessage());
                }
            }
        }

        @Override
        public void map(Object key, Text value, Context context) {
            try {
                String str = value.toString();
                double score = 0;
                double countTmp = 0;
                int clusterID = 0;
                int index = 0;
                List<ElementDict> vec1 = cosine.tokenizer(str);
                // assign the text to the most similar centroid
                for (String s : centersList) {
                    List<ElementDict> vec2 = cosine.tokenizer(s);
                    countTmp = cosine.analysisText(vec1, vec2);
                    if (countTmp > score) {
                        clusterID = index;
                        score = countTmp;
                    }
                    index++;
                }
                context.write(new IntWritable(clusterID), new Text(str));
            } catch (IOException e) {
                logger.error(e.getMessage());
            } catch (InterruptedException e) {
                logger.error(e.getMessage());
            }
        }
    }

    public static class TextReducer extends Reducer<IntWritable, Text, NullWritable, Text> {
        private static Logger logger = LogManager.getLogger(TextReducer.class);
        public static DensityCenter center = new DensityCenter(0.75, KMeansProcess.TextMapper.cosine);

        @Override
        public void reduce(IntWritable key, Iterable<Text> values, Context context) {
            try {
                List<String> list = new ArrayList<String>();
                for (Text val : values) {
                    list.add(val.toString());
                }
                // the densest member of the cluster becomes the new centroid
                context.write(NullWritable.get(), new Text(center.getDensityCenter(list)));
            } catch (IOException e) {
                logger.error(e.getMessage());
            } catch (InterruptedException e) {
                logger.error(e.getMessage());
            }
        }
    }
}

Class KMeansProcess

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.List;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import com.gta.cosine.TextCosine;
import com.gta.cosine.ElementDict;

public class KMeans {

    public static class KMeansMapper extends Mapper<Object, Text, IntWritable, Text> {
        private List<String> centersList = KMeansProcess.TextMapper.centersList;
        private static Logger logger = LogManager.getLogger(KMeans.KMeansMapper.class);
        private TextCosine cosine = KMeansProcess.TextMapper.cosine;

        @Override
        public void setup(Context context) {
            int task = context.getConfiguration().getInt("TASK", 0);
            try {
                URI[] caches = context.getCacheFiles();
                if (caches == null || caches.length <= 0) {
                    System.exit(1);
                }
                // read the final centroids from the cache files
                for (int i = 0; i < task; i++) {
                    FileSystem fs = FileSystem.get(caches[i], context.getConfiguration());
                    FSDataInputStream is = fs.open(new Path(caches[i].toString()));
                    BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
                    String s = null;
                    while ((s = br.readLine()) != null)
                        centersList.add(s);
                    br.close();
                }
            } catch (IOException e) {
                logger.error(e.getMessage());
            }
        }

        @Override
        public void map(Object key, Text value, Context context) {
            try {
                String str = value.toString();
                double score = 0;
                double countTmp = 0;
                int clusterID = 0;
                int index = 0;
                List<ElementDict> vec1 = cosine.tokenizer(str);
                for (String s : centersList) {
                    List<ElementDict> vec2 = cosine.tokenizer(s);
                    countTmp = cosine.analysisText(vec1, vec2);
                    if (countTmp > score) {
                        clusterID = index;
                        score = countTmp;
                    }
                    index++;
                }
                context.write(new IntWritable(clusterID), new Text(str));
            } catch (IOException e) {
                logger.error(e.getMessage());
            } catch (InterruptedException e) {
                logger.error(e.getMessage());
            }
        }

        @Override
        public void cleanup(Context context) {
            centersList.clear();
        }
    }

    public static class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        private static Logger logger = LogManager.getLogger(KMeans.KMeansReducer.class);

        @Override
        public void reduce(IntWritable key, Iterable<Text> values, Context context) {   // was misspelled "ruduce" and therefore never invoked
            try {
                for (Text val : values) {
                    context.write(key, val);
                }
            } catch (IOException e) {
                logger.error(e.getMessage());
            } catch (InterruptedException e) {
                logger.error(e.getMessage());
            }
        }
    }

}

Class KMeans

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;
import java.util.ArrayList;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.gta.cluster.KMeans.KMeansMapper;
import com.gta.cluster.KMeans.KMeansReducer;
import com.gta.cluster.KMeansProcess.TextMapper;
import com.gta.cluster.KMeansProcess.TextReducer;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class Cluster {
    public static final int MAX = 50;
    public static final String INPUT_PATH = "hdfs://10.1.130.10:9000/user/hadoop/input/";
    public static final String OUTPUT_PATH = "hdfs://10.1.130.10:9000/user/hadoop/output/";
    public static final String TMP_PATH = "hdfs://10.1.130.10:9000/user/hadoop/tmp/";
    public static final int TASK = 5;   // number of Reducers
    public static Logger logger = LogManager.getLogger(Cluster.class);
    private Configuration conf;
    private int iteration = 0;

    public Cluster() {
        this.conf = new Configuration();
        conf.setInt("TASK", TASK);
    }

    // iterate the KMeansProcess job until the centroids stop changing or MAX is reached
    public void run() throws IOException, InterruptedException, ClassNotFoundException {
        while (iteration < MAX) {
            logger.info("Iteration: " + (iteration + 1));
            conf.setInt("ITERATION", iteration);
            Job job = Job.getInstance(conf, "KMeans Process");
            if (iteration == 0) {
                // the initial centroids are read from tmp/0/part-r-00000 .. part-r-00004
                String cacheFile = TMP_PATH + iteration + "/part-r-0000";
                for (int i = 0; i < TASK; i++)
                    job.addCacheFile(URI.create(cacheFile + i));
            }
            job.setJarByClass(KMeansProcess.class);
            job.setMapperClass(TextMapper.class);
            job.setNumReduceTasks(TASK);
            job.setReducerClass(TextReducer.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(Text.class);
            iteration++;
            String outFile = TMP_PATH + iteration;
            FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
            FileOutputFormat.setOutputPath(job, new Path(outFile));
            job.waitForCompletion(true);
            conf.unset("ITERATION");
            List<String> tmpList = getCenterList(outFile);
            if (KMeansProcess.TextReducer.center.compareCenters(KMeansProcess.TextMapper.centersList, tmpList))
                break;
            else {
                KMeansProcess.TextMapper.centersList.clear();
                for (String s : tmpList) {
                    KMeansProcess.TextMapper.centersList.add(s);
                }
            }
        }
    }

    // one final pass that labels every text with its cluster ID
    public void lastRun() throws IOException, InterruptedException, ClassNotFoundException {
        String cacheFile = TMP_PATH + iteration + "/part-r-0000";
        Job job = Job.getInstance(conf, "KMeans");
        for (int i = 0; i < TASK; i++)
            job.addCacheFile(URI.create(cacheFile + i));
        job.setJarByClass(KMeans.class);
        job.setMapperClass(KMeansMapper.class);
        job.setReducerClass(KMeansReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
        job.waitForCompletion(true);
    }

    // collect the new centroids written by the TASK reducers
    public List<String> getCenterList(String outFile) {
        List<String> centerList = new ArrayList<String>();
        String fileName = outFile + "/part-r-0000";
        try {
            for (int i = 0; i < TASK; i++) {
                FileSystem fs = FileSystem.get(URI.create(fileName + i), conf);
                FSDataInputStream is = fs.open(new Path(fileName + i));
                BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
                String s = null;
                while ((s = br.readLine()) != null)
                    centerList.add(s);
                br.close();
            }
        } catch (IOException e) {
            logger.info(e.getMessage());
        }

        return centerList;
    }

    public static void main(String[] args) {
        Cluster cluster = new Cluster();
        try {
            long start = System.currentTimeMillis();
            cluster.run();       // iterate until the centroids converge
            cluster.lastRun();   // final labeling pass
            long end = System.currentTimeMillis();
            Cluster.logger.info("Elapsed time (ms): " + (end - start));
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

Class Cluster

  Since in a distributed environment the cache files have to be read again for every iteration, static variables are introduced to avoid repeatedly initializing objects such as TextCosine and thereby to speed up text processing. The author tried many times to pass object instances directly into the Job, but every attempt failed; if you have a better solution, please contact me.
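  A common workaround (a sketch only, under the assumption that the needed state can be reduced to plain strings and numbers; the class and parameter names below are made up and are not part of this project) is to put only simple values into the Configuration and rebuild heavyweight objects such as TextCosine once per task in setup(), instead of trying to ship live objects into the Job:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import com.gta.cosine.TextCosine;

public class ConfigSketch {

    // Task side: rebuild the heavyweight helper once per task instead of shipping the object.
    public static class SketchMapper extends Mapper<Object, Text, IntWritable, Text> {
        private TextCosine cosine;
        private double eps;

        @Override
        protected void setup(Context context) {
            eps = Double.parseDouble(context.getConfiguration().get("cosine.eps", "0.75"));
            cosine = new TextCosine();   // loads the synonym dictionary once per map task
        }
    }

    // Driver side: only plain strings/numbers go into the Configuration.
    public static Job buildJob() throws IOException {
        Configuration conf = new Configuration();
        conf.set("cosine.eps", "0.75");   // illustrative parameter name and value
        Job job = Job.getInstance(conf, "KMeans Process (config sketch)");
        job.setMapperClass(SketchMapper.class);
        return job;
    }
}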


  Author: 志青云集
  Source: http://www.cnblogs.com/lyssym
  Copyright of this article is shared by the author and 博客园 (cnblogs). Reposting is welcome, but unless the author agrees otherwise this statement must be retained and a link to the original article must be given in a prominent position on the page.

