(Data Mining - Getting Started - 3) User-Based Collaborative Filtering with k-Nearest Neighbors

Contents:

1. k-nearest neighbors

2. Python implementation

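1. What is k-nearest neighbors (KNN)?

Getting Started 1 implemented a simple nearest-neighbor algorithm for user-based collaborative filtering. Nearest neighbor means finding the single user who is closest, or most similar, to the target user and recommending that user's items.

Here, k-nearest neighbors (KNN) means finding the k closest or most similar users and recommending the highest-scoring items, where an item's score is the sum of those neighbors' ratings weighted by similarity.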
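The weighted scoring described above can be sketched in a few lines. This is a minimal illustration, not the implementation used later in this post: the function name knn_scores, the (similarity, ratings) pair layout, and the toy data are all assumptions made for the example. Each of the k most similar neighbors gets a weight equal to its share of the total similarity, and every item the target user has not rated accumulates rating * weight.

```python
def knn_scores(target_ratings, neighbors, k):
    """Score items the target user has not rated yet.

    neighbors is a list of (similarity, ratings_dict) pairs; this layout
    is hypothetical and chosen only to keep the example short.
    """
    # keep the k most similar neighbors
    top_k = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(sim for sim, _ in top_k)
    if total == 0:
        return {}  # weights are undefined without a non-zero total
    scores = {}
    for sim, ratings in top_k:
        weight = sim / total  # this neighbor's share of the total similarity
        for item, rating in ratings.items():
            if item not in target_ratings:
                scores[item] = scores.get(item, 0.0) + rating * weight
    return scores


target = {"A": 5.0}
neighbors = [
    (0.75, {"A": 4.0, "B": 4.0, "C": 2.0}),
    (0.25, {"B": 2.0, "D": 4.0}),
    (0.10, {"E": 5.0}),
]
scores = knn_scores(target, neighbors, k=2)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
# prints [('B', 3.5), ('C', 1.5), ('D', 1.0)]
```

With k=2 the third neighbor is ignored, so item E never appears; item B ranks first because both selected neighbors rated it.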

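2. Python implementation

The code uses two datasets:

one is the users dictionary written directly in the code;

the other is contained in the files BX-Book-Ratings.csv, BX-Books.csv and BX-Users.csv (download: http://www.guidetodatamining.com/assets/data/BX-Dump.zip).

Code: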

import codecs
from math import sqrt

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
                      "Norah Jones": 4.5, "Phoenix": 5.0,
                      "Slightly Stoopid": 1.5,
                      "The Strokes": 2.5, "Vampire Weekend": 2.0},
         "Bill": {"Blues Traveler": 2.0, "Broken Bells": 3.5,
                  "Deadmau5": 4.0, "Phoenix": 2.0,
                  "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
                  "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
                  "Slightly Stoopid": 1.0},
         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
                 "Deadmau5": 4.5, "Phoenix": 3.0,
                 "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                 "Vampire Weekend": 2.0},
         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
                    "Norah Jones": 4.0, "The Strokes": 4.0,
                    "Vampire Weekend": 1.0},
         "Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0,
                    "Norah Jones": 5.0, "Phoenix": 5.0,
                    "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                    "Vampire Weekend": 4.0},
         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,
                 "Norah Jones": 3.0, "Phoenix": 5.0,
                 "Slightly Stoopid": 4.0, "The Strokes": 5.0},
         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
                      "Phoenix": 4.0, "Slightly Stoopid": 2.5,
                      "The Strokes": 3.0}
        }

class recommender:

    def __init__(self, data, k=1, metric='pearson', n=5):
        """ initialize recommender
        currently, if data is a dictionary the recommender is
        initialized to it.
        For all other data types, no initialization occurs
        k is the k value for k nearest neighbor
        metric is which distance formula to use
        n is the maximum number of recommendations to make"""
        self.k = k
        self.n = n
        self.username2id = {}
        self.userid2name = {}
        self.productid2name = {}
        # keep the name of the metric for reference
        self.metric = metric
        # only 'pearson' is implemented in this class
        if self.metric == 'pearson':
            self.fn = self.pearson
        #
        # if data is a dictionary set recommender data to it
        #
        if isinstance(data, dict):
            self.data = data

    def convertProductID2name(self, id):
        """Given product id number return product name"""
        if id in self.productid2name:
            return self.productid2name[id]
        else:
            return id

    def userRatings(self, id, n):
        """Return n top ratings for user with id"""
        print("Ratings for " + self.userid2name[id])
        ratings = self.data[id]
        print(len(ratings))
        ratings = list(ratings.items())
        ratings = [(self.convertProductID2name(k), v)
                   for (k, v) in ratings]
        # finally sort and return
        ratings.sort(key=lambda artistTuple: artistTuple[1],
                     reverse=True)
        ratings = ratings[:n]
        for rating in ratings:
            print("%s\t%i" % (rating[0], rating[1]))

    def loadBookDB(self, path=''):
        """loads the BX book dataset. Path is where the BX files are
        located"""
        self.data = {}
        i = 0
        #
        # First load book ratings into self.data
        #
        with codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8') as f:
            for line in f:
                i += 1
                # separate line into fields
                fields = line.split(';')
                user = fields[0].strip('"')
                book = fields[1].strip('"')
                rating = int(fields[2].strip().strip('"'))
                if user in self.data:
                    currentRatings = self.data[user]
                else:
                    currentRatings = {}
                currentRatings[book] = rating
                self.data[user] = currentRatings
        #
        # Now load books into self.productid2name
        # Books contains isbn, title, and author among other fields
        #
        with codecs.open(path + "BX-Books.csv", 'r', 'utf8') as f:
            for line in f:
                i += 1
                # separate line into fields
                fields = line.split(';')
                isbn = fields[0].strip('"')
                title = fields[1].strip('"')
                author = fields[2].strip().strip('"')
                title = title + ' by ' + author
                self.productid2name[isbn] = title
        #
        #  Now load user info into both self.userid2name and
        #  self.username2id
        #
        with codecs.open(path + "BX-Users.csv", 'r', 'utf8') as f:
            for line in f:
                i += 1
                # separate line into fields
                fields = line.split(';')
                userid = fields[0].strip('"')
                location = fields[1].strip('"')
                if len(fields) > 3:
                    age = fields[2].strip().strip('"')
                else:
                    age = 'NULL'
                if age != 'NULL':
                    value = location + '  (age: ' + age + ')'
                else:
                    value = location
                self.userid2name[userid] = value
                self.username2id[location] = userid
        # total number of lines read from the three files
        print(i)

    def pearson(self, rating1, rating2):
        """Pearson correlation of two rating dictionaries, computed in
        a single pass over the items both users have rated.
        Returns 0 when there are no shared items or when either
        user's shared ratings have zero variance"""
        sum_xy = 0
        sum_x = 0
        sum_y = 0
        sum_x2 = 0
        sum_y2 = 0
        n = 0
        for key in rating1:
            if key in rating2:
                n += 1
                x = rating1[key]
                y = rating2[key]
                sum_xy += x * y
                sum_x += x
                sum_y += y
                sum_x2 += pow(x, 2)
                sum_y2 += pow(y, 2)
        if n == 0:
            return 0
        # now compute denominator
        denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)
                       * sqrt(sum_y2 - pow(sum_y, 2) / n))
        if denominator == 0:
            return 0
        else:
            return (sum_xy - (sum_x * sum_y) / n) / denominator

    def computeNearestNeighbor(self, username):
        """creates a list of (user, similarity) tuples for all other
        users, sorted by their similarity to username"""
        distances = []
        for instance in self.data:
            if instance != username:
                distance = self.fn(self.data[username],
                                   self.data[instance])
                distances.append((instance, distance))
        # sort by similarity -- most similar (closest) first
        distances.sort(key=lambda artistTuple: artistTuple[1],
                       reverse=True)
        return distances

    def recommend(self, user):
        """Give list of recommendations"""
        recommendations = {}
        # first get list of users ordered by nearness
        nearest = self.computeNearestNeighbor(user)
        # now get the ratings for the user
        userRatings = self.data[user]
        # never look at more neighbors than exist
        k = min(self.k, len(nearest))
        # determine the total distance
        totalDistance = 0.0
        for i in range(k):
            totalDistance += nearest[i][1]
        # the weights below are undefined when the total is zero
        if totalDistance == 0:
            return []
        # now iterate through the k nearest neighbors
        # accumulating their ratings
        for i in range(k):
            # compute slice of pie
            weight = nearest[i][1] / totalDistance
            # get the name of the person
            name = nearest[i][0]
            # get the ratings for this person
            neighborRatings = self.data[name]
            # now find bands neighbor rated that user didn't
            for artist in neighborRatings:
                if artist not in userRatings:
                    if artist not in recommendations:
                        recommendations[artist] = (neighborRatings[artist]
                                                   * weight)
                    else:
                        recommendations[artist] = (recommendations[artist]
                                                   + neighborRatings[artist]
                                                   * weight)
        # now make list from dictionary
        recommendations = list(recommendations.items())
        recommendations = [(self.convertProductID2name(k), v)
                           for (k, v) in recommendations]
        # finally sort and return
        recommendations.sort(key=lambda artistTuple: artistTuple[1],
                             reverse=True)
        # Return the first n items
        return recommendations[:self.n]

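if __name__ == '__main__':
    # users as dataset
    r = recommender(users)
    print(r.recommend('Jordyn'))
    print(r.recommend('Hailey'))
    # file as dataset
    r.loadBookDB('BX-Dump/BX-Dump/')
    # '' is a placeholder: substitute a user id that exists in BX-Users.csv
    print(r.recommend(''))
    # userRatings prints its own output and returns nothing
    r.userRatings('', 5)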
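The pearson method in the code above uses a single-pass computational form (running sums of x, y, xy, x^2 and y^2). As a sanity check, here is a minimal sketch of the two-pass textbook definition over the same co-rated items; the function name pearson_two_pass and the toy dictionaries are assumptions made for this example and are not part of the original code. On well-behaved inputs the two forms should agree up to floating-point rounding.

```python
from math import sqrt


def pearson_two_pass(rating1, rating2):
    """Pearson correlation over co-rated items, from the definition."""
    shared = [key for key in rating1 if key in rating2]
    n = len(shared)
    if n == 0:
        return 0
    # first pass: means of the shared ratings
    mean_x = sum(rating1[key] for key in shared) / n
    mean_y = sum(rating2[key] for key in shared) / n
    # second pass: covariance and variances around those means
    cov = sum((rating1[key] - mean_x) * (rating2[key] - mean_y)
              for key in shared)
    var_x = sum((rating1[key] - mean_x) ** 2 for key in shared)
    var_y = sum((rating2[key] - mean_y) ** 2 for key in shared)
    denominator = sqrt(var_x) * sqrt(var_y)
    if denominator == 0:
        return 0
    return cov / denominator


a = {"x": 1.0, "y": 2.0, "z": 3.0}
b = {"x": 2.0, "y": 4.0, "z": 6.0, "w": 1.0}  # "w" is ignored: a has no rating for it
c = {"x": 3.0, "y": 2.0, "z": 1.0}
print(pearson_two_pass(a, b))  # close to 1.0: same ordering of preferences
print(pearson_two_pass(a, c))  # close to -1.0: opposite ordering
```

Because the correlation depends only on how ratings move together, a user who rates everything twice as high (b versus a) still counts as a near-perfect neighbor, which is why this post uses Pearson rather than a raw distance.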
