Mahout推荐算法之SlopOne

一、       算法原理

有别于基于用户的协同过滤和基于item的协同过滤,SlopeOne采用简单的线性模型估计用户对item的评分。如下图,估计UserB对ItemJ的偏好

图(1)

在真实情况下,该方法有如下几个问题:

1.  为什么要选择UserA计算?

2.  对大量稀疏的情况如何处理,而这种情况是最为普遍的。

图(2)

Item1和item2的相似度:((5-3)+(3-4))/2=0.5

Item1和Item3的相似度:(5-2)/1=3

Lucy对Item1的评估预估计为:((2+0.5)*2+(3+5)*1)/(2+1)=4.333

          

Item3和Item1的相似度:(2-3)/1=-1

Item3和Item2的相似度:(5-2)/1=3

Make对item3的评分预估计为:((4+3)*1+(3-1)*1)/(1+1)=4.5

通过以上例子可以看出:需要计算item对之间的平均差别,以及item对之间的差别次数。

Mahout给出的训练伪代码:

for every item i

for every other item j

for every user u expressing preference for both i and j

add the difference in u’s preference for i and j to an average

推荐伪代码:

for every item i the user u expresses no preference for
  for every item j that user u expresses a preference for 
find the average preference difference between j and i
    add this diff to u’s preference value for j
    add this to a running average
return the top items, ranked by these averages

二、       单机模型实现

(一)  构建difference

1.        单机模型构建(MemoryDiffStorage)

private void buildAverageDiffs() throws TasteException {
    log.info("Building average diffs...");
    try {
      buildAverageDiffsLock.writeLock().lock();
      averageDiffs.clear();
      long averageCount = 0L;
      LongPrimitiveIterator it = dataModel.getUserIDs();
      while (it.hasNext()) {
        averageCount = processOneUser(averageCount, it.nextLong());
      }
      
      pruneInconsequentialDiffs();
      updateAllRecommendableItems();
      
    } finally {
      buildAverageDiffsLock.writeLock().unlock();
    }
  }
  
  private void pruneInconsequentialDiffs() {
    // Go back and prune inconsequential diffs. "Inconsequential" means, here, only represented by one
    // data point, so possibly unreliable
    Iterator<Map.Entry<Long,FastByIDMap<RunningAverage>>> it1 = averageDiffs.entrySet().iterator();
    while (it1.hasNext()) {
      FastByIDMap<RunningAverage> map = it1.next().getValue();
      Iterator<Map.Entry<Long,RunningAverage>> it2 = map.entrySet().iterator();
      while (it2.hasNext()) {
        RunningAverage average = it2.next().getValue();
        if (average.getCount() <= 1) {
          it2.remove();
        }
      }
      if (map.isEmpty()) {
        it1.remove();
      } else {
        map.rehash();
      }
    }
    averageDiffs.rehash();
  }
  
  private void updateAllRecommendableItems() throws TasteException {
    FastIDSet ids = new FastIDSet(dataModel.getNumItems());
    for (Map.Entry<Long,FastByIDMap<RunningAverage>> entry : averageDiffs.entrySet()) {
      ids.add(entry.getKey());
      LongPrimitiveIterator it = entry.getValue().keySetIterator();
      while (it.hasNext()) {
        ids.add(it.next());
      }
    }
    allRecommendableItemIDs.clear();
    allRecommendableItemIDs.addAll(ids);
    allRecommendableItemIDs.rehash();
  }
  
  private long processOneUser(long averageCount, long userID) throws TasteException {
    log.debug("Processing prefs for user {}", userID);
    // Save off prefs for the life of this loop iteration
    PreferenceArray userPreferences = dataModel.getPreferencesFromUser(userID);
    int length = userPreferences.length();
    for (int i = 0; i < length; i++) { // Loop to length-1, not length-2, not for diffs but average item pref
      float prefAValue = userPreferences.getValue(i);
      long itemIDA = userPreferences.getItemID(i);
      FastByIDMap<RunningAverage> aMap = averageDiffs.get(itemIDA);
      if (aMap == null) {
        aMap = new FastByIDMap<RunningAverage>();
        averageDiffs.put(itemIDA, aMap);
      }
      for (int j = i + 1; j < length; j++) {
        // This is a performance-critical block
        long itemIDB = userPreferences.getItemID(j);
        RunningAverage average = aMap.get(itemIDB);
        if (average == null && averageCount < maxEntries) {
          average = buildRunningAverage();
          aMap.put(itemIDB, average);
          averageCount++;
        }
        if (average != null) {
          average.addDatum(userPreferences.getValue(j) - prefAValue);
        }
      }
      RunningAverage itemAverage = averageItemPref.get(itemIDA);
      if (itemAverage == null) {
        itemAverage = buildRunningAverage();
        averageItemPref.put(itemIDA, itemAverage);
      }
      itemAverage.addDatum(prefAValue);
    }
    return averageCount;
  }
  
  private RunningAverage buildRunningAverage() {
    return stdDevWeighted ? new FullRunningAverageAndStdDev() : new FullRunningAverage();
  }

2.        MapReduce模式构建(FileDiffStorage)

用MapReduce模式计算difference的部分参看下文。该方式是离线计算模式,不能实施更新,适合大数据量。由于mapreduce模式计算了所有item之间的全部值,故比单机模式更准确。构建好之后拷贝到本地,使用用FileDiffStorage(newFile("diff"),
500) 即可。FileDiffStorage不支持添加和删除pereference(实际上也是不能这么做的);

private void buildDiffs() {

if (buildAverageDiffsLock.writeLock().tryLock()) {

try {

averageDiffs.clear();

allRecommendableItemIDs.clear();

FileLineIterator iterator = new FileLineIterator(dataFile, false);

String firstLine = iterator.peek();

while (firstLine.isEmpty() || firstLine.charAt(0) == COMMENT_CHAR) {

iterator.next();

firstLine = iterator.peek();

}

long averageCount = 0L;

while (iterator.hasNext()) {

averageCount = processLine(iterator.next(), averageCount);

}

pruneInconsequentialDiffs();

updateAllRecommendableItems();

} catch (IOException ioe) {

log.warn("Exception while reloading", ioe);

} finally {

buildAverageDiffsLock.writeLock().unlock();

}

}

}

private long processLine(String line, long averageCount) {

if (line.isEmpty() || line.charAt(0) == COMMENT_CHAR) {

return averageCount;

}

String[] tokens = SEPARATOR.split(line);

Preconditions.checkArgument(tokens.length >= 3 && tokens.length != 5, "Bad line: %s", line);

long itemID1 = Long.parseLong(tokens[0]);

long itemID2 = Long.parseLong(tokens[1]);

double diff = Double.parseDouble(tokens[2]);

int count = tokens.length >= 4 ? Integer.parseInt(tokens[3]) : 1;

boolean hasMkSk = tokens.length >= 5;

if (itemID1 > itemID2) {

long temp = itemID1;

itemID1 = itemID2;

itemID2 = temp;

}

FastByIDMap<RunningAverage> level1Map = averageDiffs.get(itemID1);

if (level1Map == null) {

level1Map = new FastByIDMap<RunningAverage>();

averageDiffs.put(itemID1, level1Map);

}

RunningAverage average = level1Map.get(itemID2);

if (average != null) {

throw new IllegalArgumentException("Duplicated line for item-item pair " + itemID1 + " / " + itemID2);

}

if (averageCount < maxEntries) {

if (hasMkSk) {

double mk = Double.parseDouble(tokens[4]);

double sk = Double.parseDouble(tokens[5]);

average = new FullRunningAverageAndStdDev(count, diff, mk, sk);

} else {

average = new FullRunningAverage(count, diff);

}

level1Map.put(itemID2, average);

averageCount++;

}

allRecommendableItemIDs.add(itemID1);

allRecommendableItemIDs.add(itemID2);

return averageCount;

}

private void pruneInconsequentialDiffs() {

// Go back and prune inconsequential diffs. "Inconsequential" means, here, only represented by one

// data point, so possibly unreliable

Iterator<Map.Entry<Long,FastByIDMap<RunningAverage>>> it1 = averageDiffs.entrySet().iterator();

while (it1.hasNext()) {

FastByIDMap<RunningAverage> map = it1.next().getValue();

Iterator<Map.Entry<Long,RunningAverage>> it2 = map.entrySet().iterator();

while (it2.hasNext()) {

RunningAverage average = it2.next().getValue();

if (average.getCount() <= 1) {

it2.remove();

}

}

if (map.isEmpty()) {

it1.remove();

} else {

map.rehash();

}

}

averageDiffs.rehash();

}

private void updateAllRecommendableItems() {

for (Map.Entry<Long,FastByIDMap<RunningAverage>> entry : averageDiffs.entrySet()) {

allRecommendableItemIDs.add(entry.getKey());

LongPrimitiveIterator it = entry.getValue().keySetIterator();

while (it.hasNext()) {

allRecommendableItemIDs.add(it.next());

}

}

allRecommendableItemIDs.rehash();

}

(二)  估值

private float doEstimatePreference(long userID, long itemID) throws TasteException {
    double count = 0.0;
    double totalPreference = 0.0;
    PreferenceArray prefs = getDataModel().getPreferencesFromUser(userID);
    RunningAverage[] averages = diffStorage.getDiffs(userID, itemID, prefs);
    int size = prefs.length();
    for (int i = 0; i < size; i++) {
      RunningAverage averageDiff = averages[i];
      if (averageDiff != null) {
        double averageDiffValue = averageDiff.getAverage();
        if (weighted) {
          double weight = averageDiff.getCount();
          if (stdDevWeighted) {
            double stdev = ((RunningAverageAndStdDev) averageDiff).getStandardDeviation();
            if (!Double.isNaN(stdev)) {
              weight /= 1.0 + stdev;
            }
            // If stdev is NaN, then it is because count is 1. Because we're weighting by count,
            // the weight is already relatively low. We effectively assume stdev is 0.0 here and
            // that is reasonable enough. Otherwise, dividing by NaN would yield a weight of NaN
            // and disqualify this pref entirely
          
          }
          totalPreference += weight * (prefs.getValue(i) + averageDiffValue);
          count += weight;
        } else {
          totalPreference += prefs.getValue(i) + averageDiffValue;
          count += 1.0;
        }
      }
    }
    if (count <= 0.0) {
      RunningAverage itemAverage = diffStorage.getAverageItemPref(itemID);
      return itemAverage == null ? Float.NaN : (float) itemAverage.getAverage();
    } else {
      return (float) (totalPreference / count);
    }
  }

(三)  推荐

对于在线推荐系统,允许只有一个SlopeOneRecommender实例。

方法签名

说明

备注

public void setPreference(long userID,
long
itemID, float value)

添加偏好,线上系统经常需要。

动态添加偏好,添加之后会更新ItemID的和其他Item之间的相似度

public void removePreference(long userID,
long
itemID)

删除偏好,很少用。

删除偏好后,会更新itemId和其他Item之间的相似度

public List<RecommendedItem> recommend(long userID, int howMany, IDRescorer rescorer)

提供推荐。IDRescorer用于商业规则,调整item的得分

1.获取userId还未评分的item作为候选。2.估计每个Item的得分,选取topk 返回。

public float estimatePreference(long userID,long itemID)

估计userId对ItemId的评分

如userId对itemId有真实的值,则返回,否则估计。

1.        推荐接口

public List<RecommendedItem> recommend(long userID, int howMany, IDRescorer rescorer) throws TasteException {
    Preconditions.checkArgument(howMany >= 1, "howMany must be at least 1");
    log.debug("Recommending items for user ID '{}'", userID);
 
    FastIDSet possibleItemIDs = diffStorage.getRecommendableItemIDs(userID);
 
    TopItems.Estimator<Long> estimator = new Estimator(userID);
 
    List<RecommendedItem> topItems = TopItems.getTopItems(howMany, possibleItemIDs.iterator(), rescorer, estimator);
  
    log.debug("Recommendations are: {}", topItems);
    return topItems;
  }

2.        获取推荐候选项

public FastIDSet getRecommendableItemIDs(long userID) throws TasteException {
    FastIDSet result;
    try {
      buildAverageDiffsLock.readLock().lock();
      result = allRecommendableItemIDs.clone();
    } finally {
      buildAverageDiffsLock.readLock().unlock();
    }
    Iterator<Long> it = result.iterator();
    while (it.hasNext()) {
      if (dataModel.getPreferenceValue(userID, it.next()) != null) {
        it.remove();
      }
    }
    return result;
  }

3.        估计候选项的得分,返回topK个推荐列表

public static List<RecommendedItem> getTopItems(int howMany,
                                                  LongPrimitiveIterator possibleItemIDs,
                                                  IDRescorer rescorer,
                                                  Estimator<Long> estimator) throws TasteException {
    Preconditions.checkArgument(possibleItemIDs != null, "argument is null");
    Preconditions.checkArgument(estimator != null, "argument is null");
 
    Queue<RecommendedItem> topItems = new PriorityQueue<RecommendedItem>(howMany + 1,
      Collections.reverseOrder(ByValueRecommendedItemComparator.getInstance()));
    boolean full = false;
    double lowestTopValue = Double.NEGATIVE_INFINITY;
    while (possibleItemIDs.hasNext()) {
      long itemID = possibleItemIDs.next();
      if (rescorer == null || !rescorer.isFiltered(itemID)) {
        double preference;
        try {
          preference = estimator.estimate(itemID);
        } catch (NoSuchItemException nsie) {
          continue;
        }
        double rescoredPref = rescorer == null ? preference : rescorer.rescore(itemID, preference);
        if (!Double.isNaN(rescoredPref) && (!full || rescoredPref > lowestTopValue)) {
          topItems.add(new GenericRecommendedItem(itemID, (float) rescoredPref));
          if (full) {
            topItems.poll();
          } else if (topItems.size() > howMany) {
            full = true;
            topItems.poll();
          }
          lowestTopValue = topItems.peek().getValue();
        }
      }
    }
    int size = topItems.size();
    if (size == 0) {
      return Collections.emptyList();
    }
    List<RecommendedItem> result = Lists.newArrayListWithCapacity(size);
    result.addAll(topItems);
    Collections.sort(result, ByValueRecommendedItemComparator.getInstance());
    return result;
}

三、       MapReduce实现(计算diff)

1.  计算每个user的item之间的差值

Map: 输入,文本文件,格式为:userId\t itemId\t val
输出:key userId,value itemId\t val
 
Reduce:
for(user u :users){
        items of u
        for(int I  =0 ;i<items.size;i++){
               itema =items[i];
        for(int j =i+1;j<items.size;j++){
               itemb= items[j];
               itemABdiff=itemb-itema;
        out.write(itemA\t itemb, itemABdiff);
}
}
}

2.  计算itemPair的全局平均

Map:输出数据不做处理,将item相同的数据传递到同一个reduce中。
Reduce: 输入 key itemA\t itemb ,val itemABdiff
计算改组数据的平均值(FullRunningAverageAndStdDev)
输出:
key EntityEntityWritable ,valueFullRunningAverageAndStdDevWritable

四、       实例演示

(一) 单机模式

MemoryDiffStorage mds =new MemoryDiffStorage(new FileDataModel(new File("pereference")), Weighting.WEIGHTED, 1000);
               SlopeOneRecommender sr =new SlopeOneRecommender(new FileDataModel(new File("pereference")),Weighting.WEIGHTED,Weighting.WEIGHTED,mds);
        System.out.println(sr.recommend(1, 10,new IDRescorer() {
               
               @Override
               public double rescore(long id, double originalScore) {
                               int clickCount =10;//id的点击量
                               return originalScore*clickCount;
                       }
                       
               @Override
               public boolean isFiltered(long id) {
               //如果id和要推荐的item的id属于同一个类型,return false ,否则return true ;
                       return false;
               }
        }));

(二) MapReduce模式

String  [] arg ={"-i","p","-o","diff"};
SlopeOneAverageDiffsJob.main(arg);
DiffStorage ds  =new FileDiffStorage(new File("diff"), 1000);
SlopeOneRecommender sr =new SlopeOneRecommender(new FileDataModel(new File("pereference")),Weighting.WEIGHTED,Weighting.WEIGHTED,mds);
        System.out.println(sr.recommend(1, 10,new IDRescorer() {
               
               @Override
               public double rescore(long id, double originalScore) {
                               int clickCount =10;//id的点击量
                               return originalScore*clickCount;
                       }
                       
               @Override
               public boolean isFiltered(long id) {
               //如果id和要推荐的item的id属于同一个类型,return false ,否则return true ;
                       return false;
               }

}));

五、       参考文献

1.  http://en.wikipedia.org/wiki/Slope_One

2.  DanielLemire, Anna Maclachlan, SlopeOne Predictors for Online Rating-Based Collaborative Filtering

3.  PuWang, HongWu Ye, A Personalized Recommendation Algorithm Combining Slope OneScheme and User Based Collaborative Filtering

4.  DeJiaZhang, An Item-based Collaborative Filtering Recommendation AlgorithmUsing Slope One Scheme Smoothing

5.  Mi,Zhenzhen and Xu, Congfu, A Recommendation Algorithm Combining Clustering Methodand Slope One Scheme

1.  BadrulM. Sarwar, George Karypis, Joseph A. Konstan, John Riedl: Item-basedcollaborative filtering recommendation algorithms

2.  GregLinden, Brent Smith, Jeremy York, "Amazon.com Recommendations:Item-to-Item Collaborative Filterin

Mahout推荐算法之SlopOne的更多相关文章

  1. Mahout推荐算法API详解

    转载自:http://blog.fens.me/mahout-recommendation-api/ Hadoop家族系列文章,主要介绍Hadoop家族产品,常用的项目包括Hadoop, Hive, ...

  2. Mahout推荐算法基础

    转载自(http://www.geek521.com/?p=1423) Mahout推荐算法分为以下几大类 GenericUserBasedRecommender 算法: 1.基于用户的相似度 2.相 ...

  3. 转】Mahout推荐算法API详解

    原博文出自于: http://blog.fens.me/mahout-recommendation-api/ 感谢! Posted: Oct 21, 2013 Tags: itemCFknnMahou ...

  4. [转]Mahout推荐算法API详解

    Mahout推荐算法API详解 Hadoop家族系列文章,主要介绍Hadoop家族产品,常用的项目包括Hadoop, Hive, Pig, HBase, Sqoop, Mahout, Zookeepe ...

  5. Mahout推荐算法API具体解释【一起学Mahout】

    阅读导读: 1.mahout单机内存算法实现和分布式算法实现分别存在哪些问题? 2.算法评判标准有哪些? 3.什么会影响算法的评分? 1. Mahout推荐算法介绍 Mahout推荐算法,从数据处理能 ...

  6. Mahout推荐算法ItemBased

    Mahout推荐的ItemBased 一.   算法原理 (一)    基本的 下面的例子,参见图评分矩阵:表现user,归类为item. 图(1) 该算法的原理: 1.  计算Item之间的相似度. ...

  7. Mahout推荐算法之ItemBased

    Mahout推荐之ItemBased 一.   算法原理 (一)    基本原理 如下图评分矩阵所示:行为user,列为item. 图(1) 该算法的原理: 1.  计算Item之间的相似度. 2.  ...

  8. 从源代码剖析Mahout推荐引擎

    转载自:http://blog.fens.me/mahout-recommend-engine/ Hadoop家族系列文章,主要介绍Hadoop家族产品,常用的项目包括Hadoop, Hive, Pi ...

  9. 转】从源代码剖析Mahout推荐引擎

    原博文出自于: http://blog.fens.me/mahout-recommend-engine/ 感谢! 从源代码剖析Mahout推荐引擎 Hadoop家族系列文章,主要介绍Hadoop家族产 ...

随机推荐

  1. WPF TextBlock 判断 isTextTrimmed 文本是否超出

    WPF TextBlock 设置TextTrimming情况下 判断 isTextTrimmed(Text 文本是否超出 是否出现了省略号) private bool HasTextTrimmed(T ...

  2. 一个任务:(小甲鱼python视频第29讲) 代码整理与总结

    任务:将文件(record.txt)中的数据进行分割,并安装以下规则保存起来.  1.小甲鱼的对话单独保存为boy_*.txt的文件(去掉"小甲鱼:")  2.小客服的对话单独保存 ...

  3. 如何搭建samba服务?

    为了日后便于查询,本文所涉及到的所有命令集合如下: chkconfig iptables off #关闭防火墙命令 在Centos7中使用的是chkconfig firewalld off seten ...

  4. Node.js 控制台

    稳定性: 4 - 冻结 {Object} 用于打印输出字符到 stdout 和 stderr.和多数浏览器提供的 console 对象函数一样,Node 也是输出到 stdout 和 stderr. ...

  5. Docker内核能力机制

    能力机制(Capability)是 Linux 内核一个强大的特性,可以提供细粒度的权限访问控制. Linux 内核自 2.2 版本起就支持能力机制,它将权限划分为更加细粒度的操作能力,既可以作用在进 ...

  6. IP_ADD_MEMBERSHIP 失败

    /*将本机加入多播组*/ err = setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,&mreq, sizeof(mreq)); if (err &l ...

  7. Web自动化框架LazyUI使用手册(5)--模板工程:LazyUI-template详解

    概述: LazyUI-template: 提供Maven管理的,基于Spring+Testng的,包含常用浏览器driver的,方便连接各种数据库的java模板工程,并提供以百度搜索为例的第一个测试用 ...

  8. django-redis

    linuxapt-get install redis-serverpip install django-redis vim /etc/redis/redis.conf maxmemory 20mb s ...

  9. [error]error while loading shared libraries: libpcre.so.1 解决

    nginx 安装好之后,启动的时候报错 [root@localhost nginx-1.6.2]# /usr/local/nginx/sbin/nginx  /usr/local/nginx/sbin ...

  10. EJB_开发消息驱动bean

    开发消息驱动bean Java消息服务(Java MessageService) Java 消息服务(Java Message Service,简称 JMS)是用于访问企业消息系统的开发商中立的API ...