Original post: http://blog.fens.me/hadoop-mahout-kmeans/        Thanks!

Mahout Distributed Program Development: K-means Clustering

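The Hadoop Family series of articles introduces the products of the Hadoop family. Commonly used projects include Hadoop, Hive, Pig, HBase, Sqoop, Mahout, Zookeeper, Avro, Ambari, and Chukwa; newer additions include YARN, Hcatalog, Oozie, Cassandra, Hama, Whirr, Flume, Bigtop, Crunch, and Hue.

Since 2011, China has entered a turbulent era of big data, and the family of software represented by Hadoop has claimed a vast share of big data processing. Across the open-source community and among vendors, data software of every kind is converging on Hadoop. Hadoop has gone from a niche, elite field to the standard for big data development. On top of the original Hadoop technology, a family of Hadoop products has emerged that keeps innovating around the concept of "big data" and drives technical progress.

As developers in the IT industry, we too need to keep pace, seize the opportunity, and rise together with Hadoop!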

About the author:

  • Zhang Dan (Conan), programmer: Java, R, PHP, Javascript
  • weibo:@Conan_Z
  • blog: http://blog.fens.me
  • email: bsspirit@gmail.com

Please credit the source when reposting:
http://blog.fens.me/hadoop-mahout-kmeans/

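Preface

Mahout is a Hadoop-based framework for developing machine learning programs. It packages three major categories of machine learning algorithms, one of which is clustering. K-means is one of the clustering algorithms we mention and use most often; in particular, when handling an unfamiliar dataset, we often cluster it first to see what patterns it contains.

This article explains how to implement a distributed k-means algorithm through Mahout program development.

Contents

  1. The k-means clustering algorithm
  2. Introduction to the Mahout development environment
  3. Implementing k-means clustering with Mahout
  4. Visualizing the results with R
  5. Uploading the template project to github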

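1. The k-means clustering algorithm

Cluster analysis is one of the key problems in data mining and machine learning. It is widely applied in data mining, pattern recognition, decision support, machine learning, and image segmentation, and is one of the most important data analysis methods. Clustering looks for subsets of similar data within a given dataset; each subset forms a cluster, and data within the same cluster are more similar to one another. Clustering algorithms can broadly be divided into partition-based, hierarchy-based, density-based, grid-based, and model-based methods.

The k-means algorithm is one of the most widely used partition-based clustering algorithms. It divides n objects into k clusters so that similarity within each cluster is high. Similarity is computed from the mean of the objects in a cluster. It closely resembles the expectation-maximization algorithm for mixtures of normal distributions, since both try to find the centers of the natural clusters in the data.

The algorithm first randomly selects k objects, each of which initially represents the mean, or center, of one cluster. Each remaining object is assigned to the nearest cluster according to its distance from each cluster center, and then the mean of each cluster is recomputed. This process repeats until the criterion function (typically the within-cluster sum of squared distances) converges.

The k-means introduction is excerpted from: http://zh.wikipedia.org/wiki/K平均算法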
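
The steps described above can be sketched on a single machine as follows. This is only a minimal illustration in plain Java, not Mahout's MapReduce implementation: the class name KMeansSketch, its method names, and the sample points are made up for this example, and it assumes Euclidean distance and stops once no center moves farther than a small threshold.

```java
import java.util.Arrays;
import java.util.Random;

/** Minimal single-machine k-means on dense points; illustrative only. */
final class KMeansSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the center closest to the point (ties go to the lower index). */
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs k-means and returns the final centers.
     * Stops when no center moves farther than delta, or after maxIterations passes.
     * Requires 1 <= k <= points.length.
     */
    static double[][] cluster(double[][] points, int k, int maxIterations, double delta, Random rnd) {
        int dim = points[0].length;
        // Step 1: pick k distinct input points as the initial centers
        // (partial Fisher-Yates shuffle of the point indices).
        int[] order = new int[points.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        for (int i = 0; i < k; i++) {
            int j = i + rnd.nextInt(order.length - i);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = points[order[c]].clone();

        for (int iter = 0; iter < maxIterations; iter++) {
            // Step 2: assign every point to its nearest center and accumulate sums.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int d = 0; d < dim; d++) sums[c][d] += p[d];
            }
            // Step 3: move each center to the mean of the points assigned to it.
            double maxShift = 0.0;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // an empty cluster keeps its old center
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) updated[d] = sums[c][d] / counts[c];
                maxShift = Math.max(maxShift, Math.sqrt(squaredDistance(centers[c], updated)));
                centers[c] = updated;
            }
            // Step 4: stop once the centers have (almost) stopped moving.
            if (maxShift <= delta) break;
        }
        return centers;
    }

    public static void main(String[] args) {
        // Hypothetical sample: two small groups of 2-D points.
        double[][] points = {
            {0.0, 0.0}, {0.2, -0.1}, {-0.1, 0.1},
            {5.0, 5.0}, {5.1, 4.9}, {4.9, 5.2}
        };
        for (double[] c : cluster(points, 2, 10, 0.01, new Random())) {
            System.out.println(Arrays.toString(c));
        }
    }
}
```

The maxIterations and delta parameters here play the same role as the iteration limit and convergence threshold that a k-means driver takes; the result depends on the random initial centers, so different runs may print the centers in a different order.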

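2. Introduction to the Mahout development environment

This continues from the previous article: Mahout Distributed Program Development: Item-Based Collaborative Filtering (ItemCF)

All environment variables and system configuration are the same as in that article!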

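3. Implementing k-means clustering with Mahout

Implementation steps:

  • 1. Prepare the data file: randomData.csv
  • 2. Java program: KmeansHadoop.java
  • 3. Run the program
  • 4. Interpret the clustering results
  • 5. Directories created on HDFS

1). Prepare the data file: randomData.csv
The data file randomData.csv was generated by an R program using a random normal distribution function. For the single-machine, in-memory experiment, see the article:
Building a Mahout Project with Maven

Raw data file: only part of the data is shown here.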


  ~ vi datafile/randomData.csv
  -0.883033363823402 -3.31967192630249
  -2.39312626419456 3.34726861118871
  2.66976353341256 1.85144276077058
  -1.09922906899594 -6.06261735207489
  -4.36361936997216 1.90509905380532
  -0.00351835125495037 -0.610105996559153
  -2.9962958796338 -3.60959839525735
  -3.27529418132066 0.0230099799641799
  2.17665594420569 6.77290756817957
  -2.47862038335637 2.53431833167278
  5.53654901906814 2.65089785582474
  5.66257474538338 6.86783609641077
  -0.558946883114376 1.22332819416237
  5.11728525486132 3.74663871584768
  1.91240516693351 2.95874731384062
  -2.49747101306535 2.05006504756875
  3.98781883213459 1.00780938946366
  5.47470532716682 5.35084411045171

Note: the default delimiter for the k-means algorithm in Mahout is " " (a space), so I converted the comma-delimited data file into a space-delimited one.
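
The conversion mentioned in the note can be done in many ways. As one possible sketch in plain Java (the author does not say how the file was actually converted), the helper below rewrites each comma-delimited line as a space-delimited one and checks that every field parses as a number. The class name CsvToSpaceDelimited and its methods are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Rewrites a comma-delimited numeric file as a space-delimited one; illustrative only. */
final class CsvToSpaceDelimited {

    /** Converts one line such as "1.5,-2.0" into "1.5 -2.0", validating each field as a number. */
    static String convertLine(String line) {
        String[] fields = line.trim().split("\\s*,\\s*");
        StringBuilder out = new StringBuilder();
        for (String field : fields) {
            Double.parseDouble(field); // throws NumberFormatException on a malformed field
            if (out.length() > 0) out.append(' ');
            out.append(field);
        }
        return out.toString();
    }

    /** Converts every non-blank line of the input file and writes the result to the output file. */
    static void convertFile(Path in, Path out) throws IOException {
        List<String> converted = new ArrayList<>();
        for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
            if (!line.trim().isEmpty()) converted.add(convertLine(line));
        }
        Files.write(out, converted, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // Usage: java CsvToSpaceDelimited <input.csv> <output.csv>
            convertFile(Paths.get(args[0]), Paths.get(args[1]));
        } else {
            // No paths given: convert a single sample line instead.
            System.out.println(convertLine("2.66976353341256,1.85144276077058"));
        }
    }
}
```

The original text of each field is kept rather than reformatting the parsed number, so no digits are lost in the conversion.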

2). Java program: KmeansHadoop.java

For the implementation of the k-means algorithm, see Mahout in Action.


  package org.conan.mymahout.cluster08;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.mahout.clustering.conversion.InputDriver;
  import org.apache.mahout.clustering.kmeans.KMeansDriver;
  import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
  import org.apache.mahout.common.distance.DistanceMeasure;
  import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
  import org.apache.mahout.utils.clustering.ClusterDumper;
  import org.conan.mymahout.hdfs.HdfsDAO;

  public class KmeansHadoop {

      private static final String HDFS = "hdfs://192.168.1.210:9000";

      public static void main(String[] args) throws Exception {
          String localFile = "datafile/randomData.csv";
          String inPath = HDFS + "/user/hdfs/mix_data";
          String seqFile = inPath + "/seqfile";
          String seeds = inPath + "/seeds";
          String outPath = inPath + "/result";
          String clusteredPoints = outPath + "/clusteredPoints";

          // Recreate the HDFS input directory and upload the local data file.
          JobConf conf = config();
          HdfsDAO hdfs = new HdfsDAO(HDFS, conf);
          hdfs.rmr(inPath);
          hdfs.mkdirs(inPath);
          hdfs.copyFile(localFile, inPath);
          hdfs.ls(inPath);

          // Convert the space-delimited text file into a sequence file of vectors.
          InputDriver.runJob(new Path(inPath), new Path(seqFile), "org.apache.mahout.math.RandomAccessSparseVector");

          // Pick k random input vectors as the initial cluster centers.
          int k = 3;
          Path seqFilePath = new Path(seqFile);
          Path clustersSeeds = new Path(seeds);
          DistanceMeasure measure = new EuclideanDistanceMeasure();
          clustersSeeds = RandomSeedGenerator.buildRandom(conf, seqFilePath, clustersSeeds, k, measure);

          // Run k-means with a convergence delta of 0.01 and at most 10 iterations,
          // then assign the points to the final clusters (as MapReduce jobs, not sequentially).
          KMeansDriver.run(conf, seqFilePath, clustersSeeds, new Path(outPath), measure, 0.01, 10, true, 0.01, false);

          // Print the final clusters together with the points assigned to them.
          Path outGlobPath = new Path(outPath, "clusters-*-final");
          Path clusteredPointsPath = new Path(clusteredPoints);
          System.out.printf("Dumping out clusters from clusters: %s and clusteredPoints: %s\n", outGlobPath, clusteredPointsPath);
          ClusterDumper clusterDumper = new ClusterDumper(outGlobPath, clusteredPointsPath);
          clusterDumper.printClusters(null);
      }

      // Build the job configuration from the Hadoop config files on the classpath.
      public static JobConf config() {
          JobConf conf = new JobConf(KmeansHadoop.class);
          conf.setJobName("KmeansHadoop");
          conf.addResource("classpath:/hadoop/core-site.xml");
          conf.addResource("classpath:/hadoop/hdfs-site.xml");
          conf.addResource("classpath:/hadoop/mapred-site.xml");
          return conf;
      }
  }
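
In the program above, RandomSeedGenerator.buildRandom chooses k input vectors at random to serve as the initial cluster centers. One common way to draw k items uniformly from data that is read only once, as a stream, is reservoir sampling; the sketch below illustrates that general technique in plain Java. It is not taken from Mahout's source and may differ from what Mahout does internally. The class name SeedSampler and the sample points (rounded from the data excerpt) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Picks k items uniformly at random from a stream in one pass (reservoir sampling). */
final class SeedSampler {

    /** Returns up to k items; fewer only when the input itself has fewer than k items. */
    static <T> List<T> sample(Iterable<T> items, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0;
        for (T item : items) {
            seen++;
            if (reservoir.size() < k) {
                // The first k items fill the reservoir directly.
                reservoir.add(item);
            } else {
                // Item number 'seen' replaces a random slot with probability k / seen.
                int j = rnd.nextInt(seen);
                if (j < k) reservoir.set(j, item);
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // Hypothetical input: a few 2-D points rounded from the data excerpt.
        List<double[]> points = Arrays.asList(
            new double[] {-0.88, -3.32}, new double[] {-2.39, 3.35},
            new double[] {2.67, 1.85}, new double[] {-1.10, -6.06},
            new double[] {-4.36, 1.91});
        for (double[] seed : sample(points, 3, new Random())) {
            System.out.println(Arrays.toString(seed));
        }
    }
}
```

Because the seeds are random, two runs of k-means on the same data can start from different centers and may end in different clusterings.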

3). Run the program
Console output:


  1. Delete: hdfs://192.168.1.210:9000/user/hdfs/mix_data
  2. Create: hdfs://192.168.1.210:9000/user/hdfs/mix_data
  3. copy from: datafile/randomData.csv to hdfs://192.168.1.210:9000/user/hdfs/mix_data
  4. ls: hdfs://192.168.1.210:9000/user/hdfs/mix_data
  5. ==========================================================
  6. name: hdfs://192.168.1.210:9000/user/hdfs/mix_data/randomData.csv, folder: false, size: 36655
  7. ==========================================================
  8. SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
  9. SLF4J: Defaulting to no-operation (NOP) logger implementation
  10. SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
  11. 2013-10-14 15:39:31 org.apache.hadoop.util.NativeCodeLoader
  12. 警告: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  13. 2013-10-14 15:39:31 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
  14. 警告: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
  15. 2013-10-14 15:39:31 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
  16. 信息: Total input paths to process : 1
  17. 2013-10-14 15:39:31 org.apache.hadoop.io.compress.snappy.LoadSnappy
  18. 警告: Snappy native library not loaded
  19. 2013-10-14 15:39:31 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  20. 信息: Running job: job_local_0001
  21. 2013-10-14 15:39:31 org.apache.hadoop.mapred.Task initialize
  22. 信息: Using ResourceCalculatorPlugin : null
  23. 2013-10-14 15:39:31 org.apache.hadoop.mapred.Task done
  24. 信息: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
  25. 2013-10-14 15:39:31 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  26. 信息:
  27. 2013-10-14 15:39:31 org.apache.hadoop.mapred.Task commit
  28. 信息: Task attempt_local_0001_m_000000_0 is allowed to commit now
  29. 2013-10-14 15:39:31 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
  30. 信息: Saved output of task 'attempt_local_0001_m_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/seqfile
  31. 2013-10-14 15:39:31 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  32. 信息:
  33. 2013-10-14 15:39:31 org.apache.hadoop.mapred.Task sendDone
  34. 信息: Task 'attempt_local_0001_m_000000_0' done.
  35. 2013-10-14 15:39:32 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  36. 信息: map 100% reduce 0%
  37. 2013-10-14 15:39:32 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  38. 信息: Job complete: job_local_0001
  39. 2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
  40. 信息: Counters: 11
  41. 2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
  42. 信息: File Output Format Counters
  43. 2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
  44. 信息: Bytes Written=31390
  45. 2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
  46. 信息: File Input Format Counters
  47. 2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
  48. 信息: Bytes Read=36655
  49. 2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
  50. 信息: FileSystemCounters
  51. 2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
  52. 信息: FILE_BYTES_READ=475910
  53. 2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
  54. 信息: HDFS_BYTES_READ=36655
  55. 2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
  56. 信息: FILE_BYTES_WRITTEN=506350
  57. 2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
  58. 信息: HDFS_BYTES_WRITTEN=68045
  59. 2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
  60. 信息: Map-Reduce Framework
  61. 2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
  62. 信息: Map input records=1000
  63. 2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
  64. 信息: Spilled Records=0
  65. 2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
  66. 信息: Total committed heap usage (bytes)=188284928
  67. 2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
  68. 信息: SPLIT_RAW_BYTES=124
  69. 2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
  70. 信息: Map output records=1000
  71. 2013-10-14 15:39:32 org.apache.hadoop.io.compress.CodecPool getCompressor
  72. 信息: Got brand-new compressor
  73. 2013-10-14 15:39:32 org.apache.hadoop.io.compress.CodecPool getDecompressor
  74. 信息: Got brand-new decompressor
  75. 2013-10-14 15:39:32 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
  76. 警告: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
  77. 2013-10-14 15:39:32 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
  78. 信息: Total input paths to process : 1
  79. 2013-10-14 15:39:32 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  80. 信息: Running job: job_local_0002
  81. 2013-10-14 15:39:32 org.apache.hadoop.mapred.Task initialize
  82. 信息: Using ResourceCalculatorPlugin : null
  83. 2013-10-14 15:39:32 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
  84. 信息: io.sort.mb = 100
  85. 2013-10-14 15:39:32 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
  86. 信息: data buffer = 79691776/99614720
  87. 2013-10-14 15:39:32 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
  88. 信息: record buffer = 262144/327680
  89. 2013-10-14 15:39:33 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
  90. 信息: Starting flush of map output
  91. 2013-10-14 15:39:33 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
  92. 信息: Finished spill 0
  93. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Task done
  94. 信息: Task:attempt_local_0002_m_000000_0 is done. And is in the process of commiting
  95. 2013-10-14 15:39:33 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  96. 信息:
  97. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Task sendDone
  98. 信息: Task 'attempt_local_0002_m_000000_0' done.
  99. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Task initialize
  100. 信息: Using ResourceCalculatorPlugin : null
  101. 2013-10-14 15:39:33 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  102. 信息:
  103. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Merger$MergeQueue merge
  104. 信息: Merging 1 sorted segments
  105. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Merger$MergeQueue merge
  106. 信息: Down to the last merge-pass, with 1 segments left of total size: 623 bytes
  107. 2013-10-14 15:39:33 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  108. 信息:
  109. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Task done
  110. 信息: Task:attempt_local_0002_r_000000_0 is done. And is in the process of commiting
  111. 2013-10-14 15:39:33 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  112. 信息:
  113. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Task commit
  114. 信息: Task attempt_local_0002_r_000000_0 is allowed to commit now
  115. 2013-10-14 15:39:33 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
  116. 信息: Saved output of task 'attempt_local_0002_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-1
  117. 2013-10-14 15:39:33 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  118. 信息: reduce > reduce
  119. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Task sendDone
  120. 信息: Task 'attempt_local_0002_r_000000_0' done.
  121. 2013-10-14 15:39:33 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  122. 信息: map 100% reduce 100%
  123. 2013-10-14 15:39:33 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  124. 信息: Job complete: job_local_0002
  125. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  126. 信息: Counters: 19
  127. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  128. 信息: File Output Format Counters
  129. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  130. 信息: Bytes Written=695
  131. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  132. 信息: FileSystemCounters
  133. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  134. 信息: FILE_BYTES_READ=4239303
  135. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  136. 信息: HDFS_BYTES_READ=203963
  137. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  138. 信息: FILE_BYTES_WRITTEN=4457168
  139. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  140. 信息: HDFS_BYTES_WRITTEN=140321
  141. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  142. 信息: File Input Format Counters
  143. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  144. 信息: Bytes Read=31390
  145. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  146. 信息: Map-Reduce Framework
  147. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  148. 信息: Map output materialized bytes=627
  149. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  150. 信息: Map input records=1000
  151. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  152. 信息: Reduce shuffle bytes=0
  153. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  154. 信息: Spilled Records=6
  155. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  156. 信息: Map output bytes=612
  157. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  158. 信息: Total committed heap usage (bytes)=376569856
  159. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  160. 信息: SPLIT_RAW_BYTES=130
  161. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  162. 信息: Combine input records=0
  163. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  164. 信息: Reduce input records=3
  165. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  166. 信息: Reduce input groups=3
  167. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  168. 信息: Combine output records=0
  169. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  170. 信息: Reduce output records=3
  171. 2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
  172. 信息: Map output records=3
  173. 2013-10-14 15:39:34 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
  174. 警告: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
  175. 2013-10-14 15:39:34 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
  176. 信息: Total input paths to process : 1
  177. 2013-10-14 15:39:34 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  178. 信息: Running job: job_local_0003
  179. 2013-10-14 15:39:34 org.apache.hadoop.mapred.Task initialize
  180. 信息: Using ResourceCalculatorPlugin : null
  181. 2013-10-14 15:39:34 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
  182. 信息: io.sort.mb = 100
  183. 2013-10-14 15:39:34 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
  184. 信息: data buffer = 79691776/99614720
  185. 2013-10-14 15:39:34 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
  186. 信息: record buffer = 262144/327680
  187. 2013-10-14 15:39:34 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
  188. 信息: Starting flush of map output
  189. 2013-10-14 15:39:34 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
  190. 信息: Finished spill 0
  191. 2013-10-14 15:39:34 org.apache.hadoop.mapred.Task done
  192. 信息: Task:attempt_local_0003_m_000000_0 is done. And is in the process of commiting
  193. 2013-10-14 15:39:34 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  194. 信息:
  195. 2013-10-14 15:39:34 org.apache.hadoop.mapred.Task sendDone
  196. 信息: Task 'attempt_local_0003_m_000000_0' done.
  197. 2013-10-14 15:39:34 org.apache.hadoop.mapred.Task initialize
  198. 信息: Using ResourceCalculatorPlugin : null
  199. 2013-10-14 15:39:34 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  200. 信息:
  201. 2013-10-14 15:39:34 org.apache.hadoop.mapred.Merger$MergeQueue merge
  202. 信息: Merging 1 sorted segments
  203. 2013-10-14 15:39:34 org.apache.hadoop.mapred.Merger$MergeQueue merge
  204. 信息: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
  205. 2013-10-14 15:39:34 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  206. 信息:
  207. 2013-10-14 15:39:34 org.apache.hadoop.mapred.Task done
  208. 信息: Task:attempt_local_0003_r_000000_0 is done. And is in the process of commiting
  209. 2013-10-14 15:39:34 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  210. 信息:
  211. 2013-10-14 15:39:34 org.apache.hadoop.mapred.Task commit
  212. 信息: Task attempt_local_0003_r_000000_0 is allowed to commit now
  213. 2013-10-14 15:39:34 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
  214. 信息: Saved output of task 'attempt_local_0003_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-2
  215. 2013-10-14 15:39:34 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  216. 信息: reduce > reduce
  217. 2013-10-14 15:39:34 org.apache.hadoop.mapred.Task sendDone
  218. 信息: Task 'attempt_local_0003_r_000000_0' done.
  219. 2013-10-14 15:39:35 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  220. 信息: map 100% reduce 100%
  221. 2013-10-14 15:39:35 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  222. 信息: Job complete: job_local_0003
  223. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  224. 信息: Counters: 19
  225. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  226. 信息: File Output Format Counters
  227. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  228. 信息: Bytes Written=695
  229. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  230. 信息: FileSystemCounters
  231. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  232. 信息: FILE_BYTES_READ=7527467
  233. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  234. 信息: HDFS_BYTES_READ=271193
  235. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  236. 信息: FILE_BYTES_WRITTEN=7901744
  237. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  238. 信息: HDFS_BYTES_WRITTEN=142099
  239. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  240. 信息: File Input Format Counters
  241. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  242. 信息: Bytes Read=31390
  243. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  244. 信息: Map-Reduce Framework
  245. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  246. 信息: Map output materialized bytes=681
  247. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  248. 信息: Map input records=1000
  249. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  250. 信息: Reduce shuffle bytes=0
  251. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  252. 信息: Spilled Records=6
  253. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  254. 信息: Map output bytes=666
  255. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  256. 信息: Total committed heap usage (bytes)=575930368
  257. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  258. 信息: SPLIT_RAW_BYTES=130
  259. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  260. 信息: Combine input records=0
  261. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  262. 信息: Reduce input records=3
  263. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  264. 信息: Reduce input groups=3
  265. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  266. 信息: Combine output records=0
  267. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  268. 信息: Reduce output records=3
  269. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
  270. 信息: Map output records=3
  271. 2013-10-14 15:39:35 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
  272. 警告: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
  273. 2013-10-14 15:39:35 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
  274. 信息: Total input paths to process : 1
  275. 2013-10-14 15:39:35 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  276. 信息: Running job: job_local_0004
  277. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Task initialize
  278. 信息: Using ResourceCalculatorPlugin : null
  279. 2013-10-14 15:39:35 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
  280. 信息: io.sort.mb = 100
  281. 2013-10-14 15:39:35 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
  282. 信息: data buffer = 79691776/99614720
  283. 2013-10-14 15:39:35 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
  284. 信息: record buffer = 262144/327680
  285. 2013-10-14 15:39:35 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
  286. 信息: Starting flush of map output
  287. 2013-10-14 15:39:35 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
  288. 信息: Finished spill 0
  289. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Task done
  290. 信息: Task:attempt_local_0004_m_000000_0 is done. And is in the process of commiting
  291. 2013-10-14 15:39:35 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  292. 信息:
  293. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Task sendDone
  294. 信息: Task 'attempt_local_0004_m_000000_0' done.
  295. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Task initialize
  296. 信息: Using ResourceCalculatorPlugin : null
  297. 2013-10-14 15:39:35 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  298. 信息:
  299. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Merger$MergeQueue merge
  300. 信息: Merging 1 sorted segments
  301. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Merger$MergeQueue merge
  302. 信息: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
  303. 2013-10-14 15:39:35 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  304. 信息:
  305. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Task done
  306. 信息: Task:attempt_local_0004_r_000000_0 is done. And is in the process of commiting
  307. 2013-10-14 15:39:35 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  308. 信息:
  309. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Task commit
  310. 信息: Task attempt_local_0004_r_000000_0 is allowed to commit now
  311. 2013-10-14 15:39:35 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
  312. 信息: Saved output of task 'attempt_local_0004_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-3
  313. 2013-10-14 15:39:35 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  314. 信息: reduce > reduce
  315. 2013-10-14 15:39:35 org.apache.hadoop.mapred.Task sendDone
  316. 信息: Task 'attempt_local_0004_r_000000_0' done.
  317. 2013-10-14 15:39:36 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  318. 信息: map 100% reduce 100%
  319. 2013-10-14 15:39:36 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  320. 信息: Job complete: job_local_0004
  321. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  322. 信息: Counters: 19
  323. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  324. 信息: File Output Format Counters
  325. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  326. 信息: Bytes Written=695
  327. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  328. 信息: FileSystemCounters
  329. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  330. 信息: FILE_BYTES_READ=10815685
  331. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  332. 信息: HDFS_BYTES_READ=338143
  333. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  334. 信息: FILE_BYTES_WRITTEN=11346320
  335. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  336. 信息: HDFS_BYTES_WRITTEN=143877
  337. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  338. 信息: File Input Format Counters
  339. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  340. 信息: Bytes Read=31390
  341. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  342. 信息: Map-Reduce Framework
  343. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  344. 信息: Map output materialized bytes=681
  345. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  346. 信息: Map input records=1000
  347. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  348. 信息: Reduce shuffle bytes=0
  349. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  350. 信息: Spilled Records=6
  351. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  352. 信息: Map output bytes=666
  353. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  354. 信息: Total committed heap usage (bytes)=775290880
  355. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  356. 信息: SPLIT_RAW_BYTES=130
  357. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  358. 信息: Combine input records=0
  359. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  360. 信息: Reduce input records=3
  361. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  362. 信息: Reduce input groups=3
  363. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  364. 信息: Combine output records=0
  365. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  366. 信息: Reduce output records=3
  367. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
  368. 信息: Map output records=3
  369. 2013-10-14 15:39:36 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
  370. 警告: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
  371. 2013-10-14 15:39:36 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
  372. 信息: Total input paths to process : 1
  373. 2013-10-14 15:39:36 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  374. 信息: Running job: job_local_0005
  375. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Task initialize
  376. 信息: Using ResourceCalculatorPlugin : null
  377. 2013-10-14 15:39:36 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
  378. 信息: io.sort.mb = 100
  379. 2013-10-14 15:39:36 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
  380. 信息: data buffer = 79691776/99614720
  381. 2013-10-14 15:39:36 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
  382. 信息: record buffer = 262144/327680
  383. 2013-10-14 15:39:36 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
  384. 信息: Starting flush of map output
  385. 2013-10-14 15:39:36 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
  386. 信息: Finished spill 0
  387. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Task done
  388. 信息: Task:attempt_local_0005_m_000000_0 is done. And is in the process of commiting
  389. 2013-10-14 15:39:36 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  390. 信息:
  391. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Task sendDone
  392. 信息: Task 'attempt_local_0005_m_000000_0' done.
  393. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Task initialize
  394. 信息: Using ResourceCalculatorPlugin : null
  395. 2013-10-14 15:39:36 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  396. 信息:
  397. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Merger$MergeQueue merge
  398. 信息: Merging 1 sorted segments
  399. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Merger$MergeQueue merge
  400. 信息: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
  401. 2013-10-14 15:39:36 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  402. 信息:
  403. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Task done
  404. 信息: Task:attempt_local_0005_r_000000_0 is done. And is in the process of commiting
  405. 2013-10-14 15:39:36 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  406. 信息:
  407. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Task commit
  408. 信息: Task attempt_local_0005_r_000000_0 is allowed to commit now
  409. 2013-10-14 15:39:36 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
  410. 信息: Saved output of task 'attempt_local_0005_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-4
  411. 2013-10-14 15:39:36 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  412. 信息: reduce > reduce
  413. 2013-10-14 15:39:36 org.apache.hadoop.mapred.Task sendDone
  414. 信息: Task 'attempt_local_0005_r_000000_0' done.
  415. 2013-10-14 15:39:37 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  416. 信息: map 100% reduce 100%
  417. 2013-10-14 15:39:37 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  418. 信息: Job complete: job_local_0005
  419. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  420. 信息: Counters: 19
  421. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  422. 信息: File Output Format Counters
  423. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  424. 信息: Bytes Written=695
  425. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  426. 信息: FileSystemCounters
  427. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  428. 信息: FILE_BYTES_READ=14103903
  429. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  430. 信息: HDFS_BYTES_READ=405093
  431. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  432. 信息: FILE_BYTES_WRITTEN=14790888
  433. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  434. 信息: HDFS_BYTES_WRITTEN=145655
  435. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  436. 信息: File Input Format Counters
  437. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  438. 信息: Bytes Read=31390
  439. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  440. 信息: Map-Reduce Framework
  441. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  442. 信息: Map output materialized bytes=681
  443. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  444. 信息: Map input records=1000
  445. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  446. 信息: Reduce shuffle bytes=0
  447. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  448. 信息: Spilled Records=6
  449. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  450. 信息: Map output bytes=666
  451. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  452. 信息: Total committed heap usage (bytes)=974651392
  453. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  454. 信息: SPLIT_RAW_BYTES=130
  455. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  456. 信息: Combine input records=0
  457. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  458. 信息: Reduce input records=3
  459. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  460. 信息: Reduce input groups=3
  461. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  462. 信息: Combine output records=0
  463. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  464. 信息: Reduce output records=3
  465. 2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
  466. 信息: Map output records=3
2013-10-14 15:39:37 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:37 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
2013-10-14 15:39:37 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0006
2013-10-14 15:39:37 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:37 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: io.sort.mb = 100
2013-10-14 15:39:37 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: data buffer = 79691776/99614720
2013-10-14 15:39:37 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: record buffer = 262144/327680
2013-10-14 15:39:37 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
2013-10-14 15:39:37 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
2013-10-14 15:39:37 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0006_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:37 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:37 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0006_m_000000_0' done.
2013-10-14 15:39:37 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:37 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:37 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
2013-10-14 15:39:37 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:37 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:37 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0006_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:37 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:37 org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0006_r_000000_0 is allowed to commit now
2013-10-14 15:39:37 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0006_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-5
2013-10-14 15:39:37 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce > reduce
2013-10-14 15:39:37 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0006_r_000000_0' done.
2013-10-14 15:39:38 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 100% reduce 100%
2013-10-14 15:39:38 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0006
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: Counters: 19
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: File Output Format Counters
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: Bytes Written=695
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: FileSystemCounters
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_READ=17392121
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_READ=472043
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_WRITTEN=18235456
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_WRITTEN=147433
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: File Input Format Counters
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: Bytes Read=31390
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: Map-Reduce Framework
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: Map output materialized bytes=681
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: Map input records=1000
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: Reduce shuffle bytes=0
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: Spilled Records=6
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: Map output bytes=666
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: Total committed heap usage (bytes)=1174011904
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: SPLIT_RAW_BYTES=130
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: Combine input records=0
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: Reduce input records=3
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: Reduce input groups=3
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: Combine output records=0
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: Reduce output records=3
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
INFO: Map output records=3
2013-10-14 15:39:38 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:38 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
2013-10-14 15:39:38 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0007
2013-10-14 15:39:38 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:38 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: io.sort.mb = 100
2013-10-14 15:39:38 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: data buffer = 79691776/99614720
2013-10-14 15:39:38 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: record buffer = 262144/327680
2013-10-14 15:39:38 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
2013-10-14 15:39:38 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
2013-10-14 15:39:38 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0007_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:38 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:38 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0007_m_000000_0' done.
2013-10-14 15:39:38 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:38 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:38 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
2013-10-14 15:39:38 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:38 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:38 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0007_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:38 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:38 org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0007_r_000000_0 is allowed to commit now
2013-10-14 15:39:38 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0007_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-6
2013-10-14 15:39:38 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce > reduce
2013-10-14 15:39:38 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0007_r_000000_0' done.
2013-10-14 15:39:39 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 100% reduce 100%
2013-10-14 15:39:39 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0007
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: Counters: 19
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: File Output Format Counters
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: Bytes Written=695
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: FileSystemCounters
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_READ=20680339
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_READ=538993
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_WRITTEN=21680040
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_WRITTEN=149211
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: File Input Format Counters
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: Bytes Read=31390
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: Map-Reduce Framework
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: Map output materialized bytes=681
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: Map input records=1000
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: Reduce shuffle bytes=0
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: Spilled Records=6
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: Map output bytes=666
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: Total committed heap usage (bytes)=1373372416
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: SPLIT_RAW_BYTES=130
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: Combine input records=0
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: Reduce input records=3
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: Reduce input groups=3
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: Combine output records=0
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: Reduce output records=3
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
INFO: Map output records=3
2013-10-14 15:39:39 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:39 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
2013-10-14 15:39:39 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0008
2013-10-14 15:39:39 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:39 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: io.sort.mb = 100
2013-10-14 15:39:39 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: data buffer = 79691776/99614720
2013-10-14 15:39:39 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: record buffer = 262144/327680
2013-10-14 15:39:39 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
2013-10-14 15:39:40 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
2013-10-14 15:39:40 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0008_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:40 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:40 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0008_m_000000_0' done.
2013-10-14 15:39:40 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:40 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:40 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
2013-10-14 15:39:40 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:40 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:40 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0008_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:40 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:40 org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0008_r_000000_0 is allowed to commit now
2013-10-14 15:39:40 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0008_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-7
2013-10-14 15:39:40 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce > reduce
2013-10-14 15:39:40 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0008_r_000000_0' done.
2013-10-14 15:39:40 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 100% reduce 100%
2013-10-14 15:39:40 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0008
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: Counters: 19
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: File Output Format Counters
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: Bytes Written=695
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: FileSystemCounters
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_READ=23968557
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_READ=605943
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_WRITTEN=25124624
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_WRITTEN=150989
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: File Input Format Counters
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: Bytes Read=31390
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: Map-Reduce Framework
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: Map output materialized bytes=681
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: Map input records=1000
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: Reduce shuffle bytes=0
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: Spilled Records=6
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: Map output bytes=666
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: Total committed heap usage (bytes)=1572732928
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: SPLIT_RAW_BYTES=130
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: Combine input records=0
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: Reduce input records=3
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: Reduce input groups=3
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: Combine output records=0
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: Reduce output records=3
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
INFO: Map output records=3
2013-10-14 15:39:41 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:41 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
2013-10-14 15:39:41 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0009
2013-10-14 15:39:41 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:41 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: io.sort.mb = 100
2013-10-14 15:39:41 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: data buffer = 79691776/99614720
2013-10-14 15:39:41 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: record buffer = 262144/327680
2013-10-14 15:39:41 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
2013-10-14 15:39:41 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
2013-10-14 15:39:41 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0009_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:41 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:41 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0009_m_000000_0' done.
2013-10-14 15:39:41 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:41 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:41 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
2013-10-14 15:39:41 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:41 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:41 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0009_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:41 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:41 org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0009_r_000000_0 is allowed to commit now
2013-10-14 15:39:41 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0009_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-8
2013-10-14 15:39:41 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce > reduce
2013-10-14 15:39:41 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0009_r_000000_0' done.
2013-10-14 15:39:42 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 100% reduce 100%
2013-10-14 15:39:42 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0009
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Counters: 19
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: File Output Format Counters
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Bytes Written=695
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: FileSystemCounters
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_READ=27256775
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_READ=673669
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_WRITTEN=28569192
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_WRITTEN=152767
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: File Input Format Counters
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Bytes Read=31390
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Map-Reduce Framework
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Map output materialized bytes=681
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Map input records=1000
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Reduce shuffle bytes=0
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Spilled Records=6
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Map output bytes=666
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Total committed heap usage (bytes)=1772093440
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: SPLIT_RAW_BYTES=130
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Combine input records=0
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Reduce input records=3
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Reduce input groups=3
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Combine output records=0
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Reduce output records=3
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Map output records=3
2013-10-14 15:39:42 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:42 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
2013-10-14 15:39:42 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0010
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:42 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: io.sort.mb = 100
2013-10-14 15:39:42 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: data buffer = 79691776/99614720
2013-10-14 15:39:42 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: record buffer = 262144/327680
2013-10-14 15:39:42 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
2013-10-14 15:39:42 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0010_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:42 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0010_m_000000_0' done.
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:42 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:42 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
2013-10-14 15:39:42 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:42 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0010_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:42 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0010_r_000000_0 is allowed to commit now
2013-10-14 15:39:42 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0010_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-9
2013-10-14 15:39:42 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce > reduce
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0010_r_000000_0' done.
2013-10-14 15:39:43 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 100% reduce 100%
2013-10-14 15:39:43 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0010
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Counters: 19
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: File Output Format Counters
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Bytes Written=695
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: FileSystemCounters
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_READ=30544993
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_READ=741007
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_WRITTEN=32013760
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_WRITTEN=154545
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: File Input Format Counters
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Bytes Read=31390
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Map-Reduce Framework
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Map output materialized bytes=681
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Map input records=1000
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Reduce shuffle bytes=0
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Spilled Records=6
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Map output bytes=666
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Total committed heap usage (bytes)=1966735360
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: SPLIT_RAW_BYTES=130
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Combine input records=0
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Reduce input records=3
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Reduce input groups=3
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Combine output records=0
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Reduce output records=3
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Map output records=3
2013-10-14 15:39:43 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:43 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
2013-10-14 15:39:43 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0011
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:43 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: io.sort.mb = 100
2013-10-14 15:39:43 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: data buffer = 79691776/99614720
2013-10-14 15:39:43 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: record buffer = 262144/327680
2013-10-14 15:39:43 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
2013-10-14 15:39:43 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0011_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:43 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0011_m_000000_0' done.
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:43 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:43 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
2013-10-14 15:39:43 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:43 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0011_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:43 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0011_r_000000_0 is allowed to commit now
2013-10-14 15:39:43 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0011_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-10
2013-10-14 15:39:43 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce > reduce
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0011_r_000000_0' done.
2013-10-14 15:39:44 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 100% reduce 100%
2013-10-14 15:39:44 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0011
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Counters: 19
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: File Output Format Counters
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Bytes Written=695
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: FileSystemCounters
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_READ=33833211
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_READ=808345
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_WRITTEN=35458320
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_WRITTEN=156323
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: File Input Format Counters
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Bytes Read=31390
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Map-Reduce Framework
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Map output materialized bytes=681
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Map input records=1000
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Reduce shuffle bytes=0
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Spilled Records=6
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Map output bytes=666
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Total committed heap usage (bytes)=2166095872
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: SPLIT_RAW_BYTES=130
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Combine input records=0
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Reduce input records=3
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Reduce input groups=3
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Combine output records=0
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Reduce output records=3
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Map output records=3
  1055. 2013-10-14 15:39:44 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
  1056. 警告: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
  1057. 2013-10-14 15:39:44 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
  1058. 信息: Total input paths to process : 1
  1059. 2013-10-14 15:39:44 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  1060. 信息: Running job: job_local_0012
  1061. 2013-10-14 15:39:44 org.apache.hadoop.mapred.Task initialize
  1062. 信息: Using ResourceCalculatorPlugin : null
  1063. 2013-10-14 15:39:44 org.apache.hadoop.mapred.Task done
  1064. 信息: Task:attempt_local_0012_m_000000_0 is done. And is in the process of commiting
  1065. 2013-10-14 15:39:44 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  1066. 信息:
  1067. 2013-10-14 15:39:44 org.apache.hadoop.mapred.Task commit
  1068. 信息: Task attempt_local_0012_m_000000_0 is allowed to commit now
  1069. 2013-10-14 15:39:44 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
  1070. 信息: Saved output of task 'attempt_local_0012_m_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusteredPoints
  1071. 2013-10-14 15:39:44 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
  1072. 信息:
  1073. 2013-10-14 15:39:44 org.apache.hadoop.mapred.Task sendDone
  1074. 信息: Task 'attempt_local_0012_m_000000_0' done.
  1075. 2013-10-14 15:39:45 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  1076. 信息: map 100% reduce 0%
  1077. 2013-10-14 15:39:45 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
  1078. 信息: Job complete: job_local_0012
  1079. 2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
  1080. 信息: Counters: 11
  1081. 2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
  1082. 信息: File Output Format Counters
  1083. 2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
  1084. 信息: Bytes Written=41520
  1085. 2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
  1086. 信息: File Input Format Counters
  1087. 2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
  1088. 信息: Bytes Read=31390
  1089. 2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
  1090. 信息: FileSystemCounters
  1091. 2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
  1092. 信息: FILE_BYTES_READ=18560374
  1093. 2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
  1094. 信息: HDFS_BYTES_READ=437203
  1095. 2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
  1096. 信息: FILE_BYTES_WRITTEN=19450325
  1097. 2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
  1098. 信息: HDFS_BYTES_WRITTEN=120417
  1099. 2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
  1100. 信息: Map-Reduce Framework
  1101. 2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
  1102. 信息: Map input records=1000
  1103. 2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
  1104. 信息: Spilled Records=0
  1105. 2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
  1106. 信息: Total committed heap usage (bytes)=1083047936
  1107. 2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
  1108. 信息: SPLIT_RAW_BYTES=130
  1109. 2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
  1110. 信息: Map output records=1000
  Dumping out clusters from clusters: hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-*-final and clusteredPoints: hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusteredPoints
  CL-552{n=443 c=[1.631, -0.412] r=[1.563, 1.407]}
  Weight : [props - optional]: Point:
  1.0: [-2.393, 3.347]
  1.0: [-4.364, 1.905]
  1.0: [-3.275, 0.023]
  1.0: [-2.479, 2.534]
  1.0: [-0.559, 1.223]
  ...
  CL-847{n=77 c=[-2.953, -0.971] r=[1.767, 2.189]}
  Weight : [props - optional]: Point:
  1.0: [-0.883, -3.320]
  1.0: [-1.099, -6.063]
  1.0: [-0.004, -0.610]
  1.0: [-2.996, -3.610]
  1.0: [3.988, 1.008]
  ...
  CL-823{n=480 c=[0.219, 2.600] r=[1.479, 1.385]}
  Weight : [props - optional]: Point:
  1.0: [2.670, 1.851]
  1.0: [2.177, 6.773]
  1.0: [5.537, 2.651]
  1.0: [5.663, 6.868]
  1.0: [5.117, 3.747]
  1.0: [1.912, 2.959]
  ...

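4). Interpreting the clustering results
The log above can be read in 3 parts:

  • a. Environment initialization
  • b. Algorithm execution
  • c. Printing the clustering results

a. Environment initialization
Initialize the HDFS data directory and working directory, then upload the data file.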


  Delete: hdfs://192.168.1.210:9000/user/hdfs/mix_data
  Create: hdfs://192.168.1.210:9000/user/hdfs/mix_data
  copy from: datafile/randomData.csv to hdfs://192.168.1.210:9000/user/hdfs/mix_data
  ls: hdfs://192.168.1.210:9000/user/hdfs/mix_data
  ==========================================================
  name: hdfs://192.168.1.210:9000/user/hdfs/mix_data/randomData.csv, folder: false, size: 36655

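b. Algorithm execution
The algorithm runs in 3 steps.

  • 1): Convert the raw data randomData.csv into Mahout sequence files of VectorWritable.
  • 2): Randomly pick 3 kmeans centers as the initial clusters.
  • 3): Run MapReduce for the configured number of iterations.

1): Convert the raw data randomData.csv into Mahout sequence files of VectorWritable.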

Program source:


  InputDriver.runJob(new Path(inPath), new Path(seqFile), "org.apache.mahout.math.RandomAccessSparseVector");

Log output:

  Job complete: job_local_0001
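
To make the conversion step more concrete, here is a minimal plain-Java sketch of the parsing involved: each whitespace-separated line of randomData.csv becomes one numeric vector. This is only an illustration, not Mahout's InputDriver code. The class name VectorLineParser and its methods are hypothetical, and the real job additionally wraps each vector in a VectorWritable and writes it to a sequence file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Mahout.
final class VectorLineParser {

    /** Splits one whitespace-separated line into a dense vector of doubles. */
    static double[] parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        double[] vector = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            vector[i] = Double.parseDouble(tokens[i]);
        }
        return vector;
    }

    /** Parses every non-blank line; blank lines are skipped. */
    static List<double[]> parseAll(List<String> lines) {
        List<double[]> vectors = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                vectors.add(parseLine(line));
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        // The first two rows of randomData.csv shown earlier in this article.
        List<String> sample = Arrays.asList(
                "-0.883033363823402 -3.31967192630249",
                "-2.39312626419456 3.34726861118871");
        for (double[] v : parseAll(sample)) {
            System.out.println(Arrays.toString(v));
        }
    }
}
```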

2): Randomly pick 3 kmeans centers as the initial clusters.

Program source:


  int k = 3;
  Path seqFilePath = new Path(seqFile);
  Path clustersSeeds = new Path(seeds);
  DistanceMeasure measure = new EuclideanDistanceMeasure();
  clustersSeeds = RandomSeedGenerator.buildRandom(conf, seqFilePath, clustersSeeds, k, measure);

Log output:

  Job complete: job_local_0002
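
One common way to pick k random points from a data set in a single pass is reservoir sampling. The sketch below shows that idea in plain Java. It is an assumption about how such a seeding step can be done, not a copy of RandomSeedGenerator, and the class name RandomSeedSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of random seeding; not Mahout's RandomSeedGenerator.
final class RandomSeedSketch {

    /**
     * Picks up to k input points uniformly at random in one pass
     * (reservoir sampling). If there are fewer than k points, all are returned.
     */
    static List<double[]> pickSeeds(List<double[]> points, int k, Random rng) {
        List<double[]> seeds = new ArrayList<>(k);
        int seen = 0;
        for (double[] p : points) {
            seen++;
            if (seeds.size() < k) {
                // Fill the reservoir with the first k points.
                seeds.add(p.clone());
            } else {
                // Replace a random slot with probability k / seen.
                int slot = rng.nextInt(seen);
                if (slot < k) {
                    seeds.set(slot, p.clone());
                }
            }
        }
        return seeds;
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        points.add(new double[] {-0.883, -3.320});
        points.add(new double[] {-2.393, 3.347});
        points.add(new double[] {2.670, 1.851});
        points.add(new double[] {-1.099, -6.063});
        points.add(new double[] {-4.364, 1.905});
        for (double[] seed : pickSeeds(points, 3, new Random())) {
            System.out.println(seed[0] + " " + seed[1]);
        }
    }
}
```

Because the seeds are random, two runs over the same data can start from different centers and end with different clusters.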

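3): Run MapReduce for the configured number of iterations.
Program source:


  // Arguments after measure: convergenceDelta=0.01, maxIterations=10, runClustering=true,
  // clusterClassificationThreshold=0.01, runSequential=false
  KMeansDriver.run(conf, seqFilePath, clustersSeeds, new Path(outPath), measure, 0.01, 10, true, 0.01, false);

Log output (job_local_0011 writes clusters-10; job_local_0012 is a map-only job that writes clusteredPoints):


  Job complete: job_local_0003
  Job complete: job_local_0004
  Job complete: job_local_0005
  Job complete: job_local_0006
  Job complete: job_local_0007
  Job complete: job_local_0008
  Job complete: job_local_0009
  Job complete: job_local_0010
  Job complete: job_local_0011
  Job complete: job_local_0012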
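
The roles of the convergence threshold (0.01) and the iteration limit (10) are easier to see in a single-machine sketch. The plain-Java loop below assigns every point to its nearest center, recomputes each center as the mean of its points, and stops early once no center moves farther than the threshold. It is a simplified illustration under those assumptions, not Mahout's MapReduce implementation, and the class name KMeansSketch is hypothetical.

```java
import java.util.Arrays;

// Hypothetical single-machine sketch of the k-means iteration; not Mahout code.
final class KMeansSketch {

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum); // Euclidean distance
    }

    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (distance(p, centers[c]) < distance(p, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Updates centers in place and returns the number of iterations executed. */
    static int run(double[][] points, double[][] centers,
                   double convergenceDelta, int maxIterations) {
        int dim = centers[0].length;
        for (int iter = 1; iter <= maxIterations; iter++) {
            double[][] sums = new double[centers.length][dim];
            int[] counts = new int[centers.length];
            // Assignment step: each point goes to its nearest center.
            for (double[] p : points) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: each center becomes the mean of its points.
            boolean converged = true;
            for (int c = 0; c < centers.length; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its old center
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (distance(updated, centers[c]) > convergenceDelta) {
                    converged = false;
                }
                centers[c] = updated;
            }
            if (converged) {
                return iter; // every center moved by at most convergenceDelta
            }
        }
        return maxIterations;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centers = {{0, 0}, {10, 10}};
        int iterations = run(points, centers, 0.01, 10);
        System.out.println(iterations + " iterations: " + Arrays.deepToString(centers));
    }
}
```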

c. Printing the clustering results


  Dumping out clusters from clusters: hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-*-final and clusteredPoints: hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusteredPoints
  CL-552{n=443 c=[1.631, -0.412] r=[1.563, 1.407]}
  CL-847{n=77 c=[-2.953, -0.971] r=[1.767, 2.189]}
  CL-823{n=480 c=[0.219, 2.600] r=[1.479, 1.385]}

Result: there are 3 centers.

  • Cluster1: 443 points, center [1.631, -0.412]
  • Cluster2: 77 points, center [-2.953, -0.971]
  • Cluster3: 480 points, center [0.219, 2.600]
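
If you want to read these summary lines programmatically, a small regular expression is enough for lines formatted exactly like the three above. The sketch below is a hypothetical helper (the class name ClusterSummary is not part of Mahout), and it assumes the dense `c=[...] r=[...]` layout shown in this log.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for summary lines like "CL-552{n=443 c=[...] r=[...]}".
final class ClusterSummary {
    final String id;
    final int n;
    final double[] center;
    final double[] radius;

    private static final Pattern LINE = Pattern.compile(
            "([^{\\s]+)\\{n=(\\d+) c=\\[([^\\]]*)\\] r=\\[([^\\]]*)\\]\\}");

    private ClusterSummary(String id, int n, double[] center, double[] radius) {
        this.id = id;
        this.n = n;
        this.center = center;
        this.radius = radius;
    }

    /** Parses one summary line, or throws if the line has a different shape. */
    static ClusterSummary parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a cluster summary: " + line);
        }
        return new ClusterSummary(m.group(1), Integer.parseInt(m.group(2)),
                toDoubles(m.group(3)), toDoubles(m.group(4)));
    }

    private static double[] toDoubles(String commaSeparated) {
        String[] parts = commaSeparated.split(",");
        double[] out = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = Double.parseDouble(parts[i].trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String[] lines = {
            "CL-552{n=443 c=[1.631, -0.412] r=[1.563, 1.407]}",
            "CL-847{n=77 c=[-2.953, -0.971] r=[1.767, 2.189]}",
            "CL-823{n=480 c=[0.219, 2.600] r=[1.479, 1.385]}"
        };
        int total = 0;
        for (String line : lines) {
            total += parse(line).n;
        }
        System.out.println("total points: " + total);
    }
}
```

The three cluster sizes add up to 443 + 77 + 480 = 1000, which matches the "Map input records=1000" counter in the log.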

5). Directories created in HDFS


  # Root directory
  ~ hadoop fs -ls /user/hdfs/mix_data
  Found 4 items
  -rw-r--r-- 3 Administrator supergroup 36655 2013-10-04 15:31 /user/hdfs/mix_data/randomData.csv
  drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result
  drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/seeds
  drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/seqfile
  # Output directory
  ~ hadoop fs -ls /user/hdfs/mix_data/result
  Found 13 items
  -rw-r--r-- 3 Administrator supergroup 194 2013-10-04 15:31 /user/hdfs/mix_data/result/_policy
  drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusteredPoints
  drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-0
  drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-1
  drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-10-final
  drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-2
  drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-3
  drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-4
  drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-5
  drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-6
  drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-7
  drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-8
  drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-9
  # Directory of the generated random center seeds
  ~ hadoop fs -ls /user/hdfs/mix_data/seeds
  Found 1 items
  -rw-r--r-- 3 Administrator supergroup 599 2013-10-04 15:31 /user/hdfs/mix_data/seeds/part-randomSeed
  # Directory where the input file is converted to Mahout format
  ~ hadoop fs -ls /user/hdfs/mix_data/seqfile
  Found 2 items
  -rw-r--r-- 3 Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/seqfile/_SUCCESS
  -rw-r--r-- 3 Administrator supergroup 31390 2013-10-04 15:31 /user/hdfs/mix_data/seqfile/part-m-00000
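
Note that the result listing is sorted by name, so clusters-10-final appears between clusters-1 and clusters-2 rather than at the end. If you process these directories in code and need them in iteration order, you can sort on the number in the name. The plain-Java sketch below assumes names of the form clusters-N or clusters-N-final, as in the listing. The class name ClusterDirOrder is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: orders "clusters-N" directory names by iteration number.
final class ClusterDirOrder {

    /** "clusters-10-final" -> 10, "clusters-2" -> 2. */
    static int iteration(String name) {
        String[] parts = name.split("-");
        return Integer.parseInt(parts[1]);
    }

    /** Returns a new list sorted by iteration number; the input is not modified. */
    static List<String> sortByIteration(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.comparingInt(ClusterDirOrder::iteration));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> listed = Arrays.asList(
                "clusters-0", "clusters-1", "clusters-10-final", "clusters-2", "clusters-3");
        System.out.println(sortByIteration(listed));
    }
}
```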

4. Visualizing the results with R

Save the clustered points of each cluster to a separate cluster*.csv file, then plot them with R.


  # Load the points of each cluster
  c1<-read.csv(file="cluster1.csv",sep=",",header=FALSE)
  c2<-read.csv(file="cluster2.csv",sep=",",header=FALSE)
  c3<-read.csv(file="cluster3.csv",sep=",",header=FALSE)
  y<-rbind(c1,c2,c3)
  # Color each point by the cluster it belongs to
  cols<-c(rep(1,nrow(c1)),rep(2,nrow(c2)),rep(3,nrow(c3)))
  plot(y, col=c("black","blue","green")[cols])
  # Overlay the 3 centers reported by Mahout
  center<-matrix(c(1.631, -0.412,-2.953, -0.971,0.219, 2.600),ncol=2,byrow=TRUE)
  points(center, col="violetred", pch = 19)

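In the resulting plot, the hollow points in black, blue, and green are the original data, colored by cluster.
The 3 solid violet-red points are the 3 centers produced by Mahout's kmeans.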
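
The article does not show how the cluster*.csv files were produced. One possible approach, sketched here in plain Java, is to walk through the dump text printed at the end of the log and collect the points listed under each cluster header as comma-separated rows. This assumes the dump layout shown above. The class name DumpToCsv is hypothetical, and the output file names follow the cluster1.csv, cluster2.csv pattern used by the R script.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: groups dumped points by cluster as CSV rows.
final class DumpToCsv {

    // Matches point lines such as "1.0: [-2.393, 3.347]".
    private static final Pattern POINT = Pattern.compile("[0-9.]+\\s*:\\s*\\[([^\\]]*)\\]");

    /** Returns cluster id -> CSV rows, in the order the clusters appear. */
    static Map<String, List<String>> split(List<String> dumpLines) {
        Map<String, List<String>> rows = new LinkedHashMap<>();
        List<String> current = null;
        for (String raw : dumpLines) {
            String line = raw.trim();
            int brace = line.indexOf("{n=");
            if (brace > 0) {
                // A header such as "CL-552{n=443 ...}" starts a new cluster.
                current = new ArrayList<>();
                rows.put(line.substring(0, brace), current);
                continue;
            }
            Matcher m = POINT.matcher(line);
            if (current != null && m.matches()) {
                current.add(m.group(1).replace(" ", ""));
            }
        }
        return rows;
    }

    /** Writes cluster1.csv, cluster2.csv, ... into the given directory. */
    static void writeAll(Map<String, List<String>> rows, Path dir) throws IOException {
        int index = 1;
        for (List<String> clusterRows : rows.values()) {
            Files.write(dir.resolve("cluster" + index + ".csv"), clusterRows);
            index++;
        }
    }

    public static void main(String[] args) {
        List<String> dump = Arrays.asList(
                "CL-552{n=443 c=[1.631, -0.412] r=[1.563, 1.407]}",
                "Weight : [props - optional]: Point:",
                "1.0: [-2.393, 3.347]",
                "1.0: [-4.364, 1.905]",
                "...",
                "CL-847{n=77 c=[-2.953, -0.971] r=[1.767, 2.189]}",
                "Weight : [props - optional]: Point:",
                "1.0: [-0.883, -3.320]");
        System.out.println(split(dump));
    }
}
```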

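Compared with the kmeans implemented in R in the article Building a Mahout Project with Maven (用Maven构建Mahout项目), both the cluster assignments and the centers are somewhat different.

To sum up: kmeans results vary with the distance measure, the threshold, the initial centers, and the number of iterations. With kmeans we therefore usually get only a rough grouping. That grouping is very helpful for getting to know an unfamiliar data set, but it should not be used as a precise measure of the data.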

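5. Template project uploaded to github

https://github.com/bsspirit/maven_mahout_template/tree/mahout-0.8

You can download this project and use it as a starting point for development.


  ~ git clone https://github.com/bsspirit/maven_mahout_template
  ~ cd maven_mahout_template
  ~ git checkout mahout-0.8

This completes the distributed implementation of Mahout's Kmeans clustering algorithm. Next, we will continue with experiments on classification in Mahout!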
