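Mahout Algorithm Source Code Analysis: Item-based Collaborative Filtering (3): Verifying RowSimilarityJob

Mahout version: 0.7, Hadoop version: 1.0.4, JDK: 1.7.0_25 64bit.

This post checks whether the analysis in the previous post was correct, mainly by writing code that reads the output files discussed there and by adding log statements that print the relevant variables.

First, write the following test class to inspect all of the outputs: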

package mahout.fansy.item;

import java.io.IOException;
import java.util.Map;

import mahout.fansy.utils.read.ReadArbiKV;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.hadoop.similarity.cooccurrence.Vectors;

import junit.framework.TestCase;

public class ReadRowSimilarityJobOut extends TestCase {

	// Read the weights output
	public void testWeights() throws IOException{
		String path="hdfs://ubuntu:9000/user/mahout/item/temp/weights/part-r-00000";
		Map<Writable,Writable> map= ReadArbiKV.readFromFile(path);
		System.out.println("weights=================");
		System.out.println(map);
	}

	// Read norms.bin (normsPath)
	public void testNormsPath() throws IOException{
		String path="hdfs://ubuntu:9000/user/mahout/item/temp/norms.bin";
		Vector vector=getVector(path);
		System.out.println("normsPath=================");
		System.out.println(vector);
	}

	// Read maxValues.bin
	public void testMaxValues() throws IOException{
		String path="hdfs://ubuntu:9000/user/mahout/item/temp/maxValues.bin";
		Vector vector=getVector(path);
		System.out.println("maxValues=================");
		System.out.println(vector);
	}

	// Read numNonZeroEntries.bin
	public void testNumNonZeroEntries() throws IOException{
		String path="hdfs://ubuntu:9000/user/mahout/item/temp/numNonZeroEntries.bin";
		Vector vector=getVector(path);
		System.out.println("numNonZeroEntries=================");
		System.out.println(vector);
	}

	// Read the pairwiseSimilarity output (pairwiseSimilarityPath)
	public void testPairwiseSimilarityPath() throws IOException{
		String path="hdfs://ubuntu:9000/user/mahout/item/temp/pairwiseSimilarity/part-r-00000";
		Map<Writable,Writable> map= ReadArbiKV.readFromFile(path);
		System.out.println("pairwiseSimilarityPath=================");
		System.out.println(map);
	}

	// Read the similarityMatrix output
	public void testSimilarityMatrix() throws IOException{
		String path="hdfs://ubuntu:9000/user/mahout/item/temp/similarityMatrix/part-r-00000";
		Map<Writable,Writable> map= ReadArbiKV.readFromFile(path);
		System.out.println("similarityMatrix=================");
		System.out.println(map);
	}

	// Read a .bin file into a Vector; returns null if the read fails
	public Vector getVector(String path){
		Configuration conf=new Configuration();
		conf.set("mapred.job.tracker", "ubuntu:9001");
		Vector vector=null;
		try {
			vector = Vectors.read(new Path(path), conf);
		} catch (IOException e) {
			e.printStackTrace();
		}
		return vector;
	}
}

Running the class above produces the following output:

weights=================

{1={103:2.5,102:3.0,101:5.0}, 2={101:2.0,104:2.0,103:5.0,102:2.5}, 3={101:2.5,107:5.0,105:4.5,104:4.0}, 4={101:5.0,106:4.0,104:4.5,103:3.0}, 5={106:4.0,105:3.5,104:4.0,103:2.0,102:3.0,101:4.0}}

normsPath=================

{107:25.0,106:32.0,105:32.5,104:56.25,103:44.25,102:24.25,101:76.25}

maxValues=================

{}

numNonZeroEntries=================

{}

pairwiseSimilarityPath=================

{102={106:0.14972506706560876,105:0.14328432723886902,104:0.12789210656028413,103:0.1975496259559987}, 103={106:0.1424339656566283,105:0.11208890297777215,104:0.14037600977966974}, 101={107:0.10275248635596666,106:0.1424339656566283,105:0.1158457425543559,104:0.16015261286229274,103:0.15548737703860027,102:0.14201473202245876}, 106={}, 107={}, 104={107:0.13472338607037426,106:0.18181818181818182,105:0.16736577623297264}, 105={107:0.2204812092115424,106:0.14201473202245876}}

similarityMatrix=================

{102={101:0.14201473202245876,106:0.14972506706560876,105:0.14328432723886902,104:0.12789210656028413,103:0.1975496259559987}, 103={101:0.15548737703860027,106:0.1424339656566283,105:0.11208890297777215,104:0.14037600977966974,102:0.1975496259559987}, 101={107:0.10275248635596666,106:0.1424339656566283,105:0.1158457425543559,104:0.16015261286229274,103:0.15548737703860027,102:0.14201473202245876}, 106={101:0.1424339656566283,105:0.14201473202245876,104:0.18181818181818182,103:0.1424339656566283,102:0.14972506706560876}, 107={105:0.2204812092115424,104:0.13472338607037426,101:0.10275248635596666}, 104={107:0.13472338607037426,106:0.18181818181818182,105:0.16736577623297264,103:0.14037600977966974,102:0.12789210656028413,101:0.16015261286229274}, 105={107:0.2204812092115424,106:0.14201473202245876,104:0.16736577623297264,103:0.11208890297777215,102:0.14328432723886902,101:0.1158457425543559}}

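The first output, weights, matches the earlier analysis exactly, so it is not covered in detail here. That leaves pairwiseSimilarityPath and similarityMatrix to analyze:

(1) pairwiseSimilarityPath:

The earlier analysis of the final reducer for this job was wrong, or more precisely, incomplete, as the figure below shows (the screenshot shows variable values printed through log statements):

The previous post only got as far as the second row (the second and third rows are identical) and never reached the final output. What it missed was just one while loop: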

while (dotsWith.hasNext()) {
  Vector.Element b = dotsWith.next();
  double similarityValue = similarity.similarity(b.get(), normA, norms.getQuick(b.index()), numberOfColumns);
  if (similarityValue >= treshold) {
    similarities.set(b.index(), similarityValue);
  }
}

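Now let's work out how the value in the fourth row is derived from the value in the second row. First, normA is the value stored for 102 in norms, i.e. 24.25. Then look at the similarity function: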

public double similarity(double dots, double normA, double normB, int numberOfColumns) {
  double euclideanDistance = Math.sqrt(normA - 2 * dots + normB);
  return 1.0 / (1.0 + euclideanDistance);
}

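For item 106 the call should be similarity(12.0, 24.25, 32.0, 5), so the return value is 1/(1+sqrt(24.25-2*12+32)) = 1/(1+sqrt(32.25)) = 0.149725067, which matches the value in the fourth row. The final output for row 102 has no entry for 102 itself because of the statement similarities.setQuick(row.get(), 0); which sets that entry to 0, so it is not written out.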
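The arithmetic above can be reproduced without a cluster. The following is a minimal sketch in plain Java, not Mahout code: the class name EuclideanSimilarityCheck and the helpers row, sampleWeights, norm and dot are made up for illustration, and the ratings are copied from the weights output above. It assumes, as the normsPath output suggests, that each norm is the sum of squared ratings of one item and that dots is the dot product of two item columns.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EuclideanSimilarityCheck {

    // Same formula as the similarity(...) method quoted above; the
    // numberOfColumns parameter is not used by it, so this sketch drops it.
    public static double similarity(double dots, double normA, double normB) {
        double euclideanDistance = Math.sqrt(normA - 2 * dots + normB);
        return 1.0 / (1.0 + euclideanDistance);
    }

    // Hypothetical helper: builds one row from (index, rating) pairs.
    public static Map<Integer, Double> row(double... pairs) {
        Map<Integer, Double> r = new LinkedHashMap<Integer, Double>();
        for (int i = 0; i < pairs.length; i += 2) {
            r.put((int) pairs[i], pairs[i + 1]);
        }
        return r;
    }

    // The five rows of the weights output, keyed 1..5.
    public static Map<Integer, Map<Integer, Double>> sampleWeights() {
        Map<Integer, Map<Integer, Double>> w =
                new LinkedHashMap<Integer, Map<Integer, Double>>();
        w.put(1, row(103, 2.5, 102, 3.0, 101, 5.0));
        w.put(2, row(101, 2.0, 104, 2.0, 103, 5.0, 102, 2.5));
        w.put(3, row(101, 2.5, 107, 5.0, 105, 4.5, 104, 4.0));
        w.put(4, row(101, 5.0, 106, 4.0, 104, 4.5, 103, 3.0));
        w.put(5, row(106, 4.0, 105, 3.5, 104, 4.0, 103, 2.0, 102, 3.0, 101, 4.0));
        return w;
    }

    // Assumed definition of a norm: sum of squared ratings of one item.
    public static double norm(Map<Integer, Map<Integer, Double>> weights, int item) {
        double sum = 0.0;
        for (Map<Integer, Double> ratings : weights.values()) {
            Double r = ratings.get(item);
            if (r != null) {
                sum += r * r;
            }
        }
        return sum;
    }

    // Dot product of two items over the rows that contain both.
    public static double dot(Map<Integer, Map<Integer, Double>> weights, int a, int b) {
        double sum = 0.0;
        for (Map<Integer, Double> ratings : weights.values()) {
            Double ra = ratings.get(a);
            Double rb = ratings.get(b);
            if (ra != null && rb != null) {
                sum += ra * rb;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<Integer, Map<Integer, Double>> w = sampleWeights();
        double normA = norm(w, 102);
        double normB = norm(w, 106);
        double dots = dot(w, 102, 106);
        System.out.println("normA=" + normA + " normB=" + normB + " dots=" + dots);
        System.out.println("similarity(102,106)=" + similarity(dots, normA, normB));
    }
}
```

Running main should give a similarity that agrees with the 106 entry of row 102 in the pairwiseSimilarityPath output. Changing the two item ids in main is a quick way to spot-check other entries, for example 101 and 102.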

(2) similarityMatrix

From the analysis in (1), the input to (2) is:

{102={106:0.14972506706560876,105:0.14328432723886902,104:0.12789210656028413,103:0.1975496259559987},

103={106:0.1424339656566283,105:0.11208890297777215,104:0.14037600977966974},

101={107:0.10275248635596666,106:0.1424339656566283,105:0.1158457425543559,104:0.16015261286229274,103:0.15548737703860027,102:0.14201473202245876},

106={},

107={},

104={107:0.13472338607037426,106:0.18181818181818182,105:0.16736577623297264},

105={107:0.2204812092115424,106:0.14201473202245876}}

The earlier analysis of this job's mapper was correct, but the analysis of the merge method used in the combiner was not. The merge code is as follows:

public static Vector merge(Iterable<VectorWritable> partialVectors) {
  Iterator<VectorWritable> vectors = partialVectors.iterator();
  Vector accumulator = vectors.next().get();
  while (vectors.hasNext()) {
    VectorWritable v = vectors.next();
    if (v != null) {
      Iterator<Vector.Element> nonZeroElements = v.get().iterateNonZero();
      while (nonZeroElements.hasNext()) {
        Vector.Element nonZeroElement = nonZeroElements.next();
        accumulator.setQuick(nonZeroElement.index(), nonZeroElement.get());
      }
    }
  }
  return accumulator;
}

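What this code does is take the first value for a key as the accumulator and copy every non-zero entry of the remaining values for that key into it; an entry at an index that is already present is overwritten, not summed. The log output is as follows:

First, the map output (keys 101 to 103):

(keys 104 to 107):

The combiner output:

With the data laid out like this, the combiner's behavior is easy to understand.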
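The merge step can be imitated with ordinary maps. This is a rough sketch, not the Mahout implementation: MergeSketch, single and mergedRow106 are hypothetical names, a TreeMap stands in for a sparse Vector, and the sketch copies the first partial vector whereas the quoted code reuses it as the accumulator. It also assumes that the mapper emits one single-entry partial vector per similarity value for key 106; the five values are the 106 entries of rows 101 to 105 in the input listed above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MergeSketch {

    // Plain-Java stand-in for the quoted merge(...): start from the first
    // partial vector and write every later non-zero entry into it.
    // put(...) overwrites, mirroring accumulator.setQuick(index, value).
    public static Map<Integer, Double> merge(List<Map<Integer, Double>> partialVectors) {
        Map<Integer, Double> accumulator =
                new TreeMap<Integer, Double>(partialVectors.get(0));
        for (Map<Integer, Double> v : partialVectors.subList(1, partialVectors.size())) {
            if (v == null) {
                continue;
            }
            for (Map.Entry<Integer, Double> e : v.entrySet()) {
                if (e.getValue() != 0.0) {
                    accumulator.put(e.getKey(), e.getValue());
                }
            }
        }
        return accumulator;
    }

    // Hypothetical helper: a partial vector with exactly one entry.
    public static Map<Integer, Double> single(int index, double value) {
        return Collections.singletonMap(index, value);
    }

    // Assumed partial vectors arriving for key 106.
    public static Map<Integer, Double> mergedRow106() {
        return merge(Arrays.asList(
                single(101, 0.1424339656566283),
                single(102, 0.14972506706560876),
                single(103, 0.1424339656566283),
                single(104, 0.18181818181818182),
                single(105, 0.14201473202245876)));
    }

    public static void main(String[] args) {
        // Five single-entry vectors collapse into one row with five entries.
        System.out.println("106=" + mergedRow106());
        // Same index twice: the later value wins, nothing is added up.
        System.out.println(merge(Arrays.asList(single(1, 1.0), single(1, 2.0))));
    }
}
```

The merged row holds the same five entries as row 106 of the similarityMatrix output, which is consistent with the empty row 106 in the input being filled in from the rows that mention 106.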

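Finally, the reducer. My guess was that it simply sorts the combiner's output:

However, judging from the log output above, that does not seem to be what happens. I have not read the Vectors.topKElements method closely; it probably does something different from what I guessed. I will look at it next time.

Share, grow, enjoy.

If you repost this, please cite the blog address: http://blog.csdn.net/fansy1990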
