CS100.1x-lab4_machine_learning_student
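This is the first lab in the course that touches machine learning; the topic is Predicting Movie Ratings. It is a bit easier than the previous assignment, which was genuinely hard. The related ipynb file is on my GitHub.
Here we use the Alternating Least Squares method from Spark MLlib to do something more involved than before. The dataset for this lab is roughly 500,000 movie ratings (487,650 in the subset loaded below), and the environment comes preconfigured. The dataset can be downloaded from here.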
Part 0 Preliminaries
This part reads the data, converts it to RDDs, and parses each line. The ratings data has the format:
UserID::MovieID::Rating::Timestamp
The movies data has the format:
MovieID::Title::Genres
where Genres has the format:
Genres1|Genres2|Genres3|...
We parse the ratings data into (UserID, MovieID, Rating) and the movies data into (MovieID, Title). The genres feature is dropped here because this lab stays fairly shallow and does not need it.
import sys
import os
from test_helper import Test
baseDir = os.path.join('data')
inputPath = os.path.join('cs100', 'lab4', 'small')
ratingsFilename = os.path.join(baseDir, inputPath, 'ratings.dat.gz')
moviesFilename = os.path.join(baseDir, inputPath, 'movies.dat')
numPartitions = 2
rawRatings = sc.textFile(ratingsFilename).repartition(numPartitions)
rawMovies = sc.textFile(moviesFilename)
def get_ratings_tuple(entry):
    """ Parse a line in the ratings dataset
    Args:
        entry (str): a line in the ratings dataset in the form of UserID::MovieID::Rating::Timestamp
    Returns:
        tuple: (UserID, MovieID, Rating)
    """
    items = entry.split('::')
    return int(items[0]), int(items[1]), float(items[2])

def get_movie_tuple(entry):
    """ Parse a line in the movies dataset
    Args:
        entry (str): a line in the movies dataset in the form of MovieID::Title::Genres
    Returns:
        tuple: (MovieID, Title)
    """
    items = entry.split('::')
    return int(items[0]), items[1]
ratingsRDD = rawRatings.map(get_ratings_tuple).cache()
moviesRDD = rawMovies.map(get_movie_tuple).cache()
ratingsCount = ratingsRDD.count()
moviesCount = moviesRDD.count()
print 'There are %s ratings and %s movies in the datasets' % (ratingsCount, moviesCount)
print 'Ratings: %s' % ratingsRDD.take(3)
print 'Movies: %s' % moviesRDD.take(3)
assert ratingsCount == 487650
assert moviesCount == 3883
assert moviesRDD.filter(lambda (id, title): title == 'Toy Story (1995)').count() == 1
assert (ratingsRDD.takeOrdered(1, key=lambda (user, movie, rating): movie)
== [(1, 1, 5.0)])
The output is:
There are 487650 ratings and 3883 movies in the datasets
Ratings: [(1, 1193, 5.0), (1, 914, 3.0), (1, 2355, 5.0)]
Movies: [(1, u'Toy Story (1995)'), (2, u'Jumanji (1995)'), (3, u'Grumpier Old Men (1995)')]
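The parsing above discards the Genres field. If it were needed later, a variant of the movie parser could keep it by splitting on `|`. The sketch below is plain Python with no Spark; the function name `get_movie_tuple_with_genres` and the sample line are illustrative assumptions, not part of the lab.

```python
def get_movie_tuple_with_genres(entry):
    """Parse 'MovieID::Title::Genres' into (MovieID, Title, [genres])."""
    movie_id, title, genres = entry.split('::')
    # Genres are separated by '|', e.g. 'Genres1|Genres2|Genres3'.
    return int(movie_id), title, genres.split('|')

# A sample line in the documented format (assumed for illustration).
sample = "1::Toy Story (1995)::Animation|Children's|Comedy"
parsed = get_movie_tuple_with_genres(sample)
print(parsed)
```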
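Later we will do a lot of sorting, which is usually handled with sortByKey(). In practice, though, the result of that method can be non-deterministic: when several entries share the same key, their relative order is not guaranteed.
A better approach is to sort on both the key and the value. For example, implement the function below and pass it to sortBy(), which makes the ordering deterministic.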
def sortFunction(tuple):
    """ Construct the sort string (does not perform actual sorting)
    Args:
        tuple: (rating, MovieName)
    Returns:
        sortString: the value to sort with, 'rating MovieName'
    """
    key = unicode('%.3f' % tuple[0])
    value = tuple[1]
    return (key + ' ' + value)
print oneRDD.sortBy(sortFunction, True).collect()
print twoRDD.sortBy(sortFunction, True).collect()
When we only need to look at part of the sorted result, we can use the takeOrdered method.
oneSorted1 = oneRDD.takeOrdered(oneRDD.count(),key=sortFunction)
twoSorted1 = twoRDD.takeOrdered(twoRDD.count(),key=sortFunction)
print 'one is %s' % oneSorted1
print 'two is %s' % twoSorted1
assert oneSorted1 == twoSorted1
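The tie problem can be reproduced without Spark. The sketch below uses plain Python lists and made-up (rating, title) pairs; `sort_function` is a stand-in that builds the same 'rating title' string as sortFunction. Sorting by rating alone leaves tied entries in whatever order they arrived, while sorting by the combined string gives one answer regardless of input order.

```python
# The same (rating, title) pairs in two different input orders (made-up data).
one = [(1.0, u'alpha'), (1.0, u'beta'), (2.0, u'alpha'), (2.0, u'delta')]
two = [(2.0, u'delta'), (1.0, u'beta'), (2.0, u'alpha'), (1.0, u'alpha')]

def sort_function(pair):
    """Build the 'rating title' sort string, mirroring sortFunction."""
    return u'%.3f %s' % (pair[0], pair[1])

# Sorting on the rating only: ties keep their input order, so results differ.
by_key_one = sorted(one, key=lambda p: p[0])
by_key_two = sorted(two, key=lambda p: p[0])

# Sorting on rating and title together: ties are broken by the title.
by_both_one = sorted(one, key=sort_function)
by_both_two = sorted(two, key=sort_function)

print(by_key_one == by_key_two)    # False
print(by_both_one == by_both_two)  # True
```

Because the sort value is a string, it only orders ratings correctly while they all have the same number of integer digits, which holds for a 1 to 5 scale.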
Part 1 Basic Recommendations
A simple and effective way to recommend movies is to recommend the ones with the highest ratings. That is what we implement in this part.
Number of Ratings and Average Ratings for a Movie
Here we implement a function whose input is (MovieID, (Rating1, Rating2, Rating3, ...)) and whose output is (MovieID, (number of ratings, averageRating)).
# TODO: Replace <FILL IN> with appropriate code
# First, implement a helper function `getCountsAndAverages` using only Python
def getCountsAndAverages(IDandRatingsTuple):
    """ Calculate average rating
    Args:
        IDandRatingsTuple: a single tuple of (MovieID, (Rating1, Rating2, Rating3, ...))
    Returns:
        tuple: a tuple of (MovieID, (number of ratings, averageRating))
    """
    Mid = IDandRatingsTuple[0]
    ratingTuple = IDandRatingsTuple[1]
    newTuple = (len(ratingTuple), float(sum(ratingTuple))/len(ratingTuple))
    retuple = (Mid, newTuple)
    return retuple
Movies with Highest Average Ratings
This takes three steps: extract (MovieID, Rating) from the ratings data; compute the average rating of each movie; join that result with the movies data to get (average rating, movie name, number of ratings).
# TODO: Replace <FILL IN> with appropriate code
# From ratingsRDD with tuples of (UserID, MovieID, Rating) create an RDD with tuples of
# the (MovieID, iterable of Ratings for that MovieID)
movieIDsWithRatingsRDD = ratingsRDD.map(lambda tuple:(tuple[1],tuple[2])).groupByKey()
print 'movieIDsWithRatingsRDD: %s\n' % movieIDsWithRatingsRDD.take(3)
# Using `movieIDsWithRatingsRDD`, compute the number of ratings and average rating for each movie to
# yield tuples of the form (MovieID, (number of ratings, average rating))
movieIDsWithAvgRatingsRDD = movieIDsWithRatingsRDD.map(getCountsAndAverages)
print 'movieIDsWithAvgRatingsRDD: %s\n' % movieIDsWithAvgRatingsRDD.take(3)
# To `movieIDsWithAvgRatingsRDD`, apply RDD transformations that use `moviesRDD` to get the movie
# names for `movieIDsWithAvgRatingsRDD`, yielding tuples of the form
# (average rating, movie name, number of ratings)
movieNameWithAvgRatingsRDD = moviesRDD.join(movieIDsWithAvgRatingsRDD).map(lambda x : (x[1][1][1], x[1][0],x[1][1][0]))
print 'movieNameWithAvgRatingsRDD: %s\n' % movieNameWithAvgRatingsRDD.take(3)
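The three steps can also be traced on a tiny example in plain Python, which makes the tuple shapes easier to follow. This is only a sketch with made-up data; the dictionary-based grouping and join stand in for groupByKey() and join() and are not how Spark executes them.

```python
from collections import defaultdict

# Made-up inputs with the same shapes as ratingsRDD and moviesRDD.
ratings = [(1, 10, 5.0), (2, 10, 3.0), (1, 20, 4.0)]
movies = [(10, u'Movie A'), (20, u'Movie B'), (30, u'Movie C')]

# Step 1: group ratings by MovieID, like map(...).groupByKey().
ratings_by_movie = defaultdict(list)
for _user, movie_id, rating in ratings:
    ratings_by_movie[movie_id].append(rating)

# Step 2: (MovieID, (number of ratings, average rating)).
stats = {m: (len(rs), sum(rs) / float(len(rs)))
         for m, rs in ratings_by_movie.items()}

# Step 3: inner join with the titles -> (average rating, title, count).
# Movie C has no ratings, so it drops out, as it would with join().
titles = dict(movies)
name_with_avg = sorted((avg, titles[m], n)
                       for m, (n, avg) in stats.items() if m in titles)
print(name_with_avg)
```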
Movies with Highest Average Ratings and more than 500 reviews
Here we pick the 20 highest-rated movies among those with more than 500 ratings.
# TODO: Replace <FILL IN> with appropriate code
# Apply an RDD transformation to `movieNameWithAvgRatingsRDD` to limit the results to movies with
# ratings from more than 500 people. We then use the `sortFunction()` helper function to sort by the
# average rating to get the movies in order of their rating (highest rating first)
movieLimitedAndSortedByRatingRDD = (movieNameWithAvgRatingsRDD
                                    .filter(lambda x : x[2] > 500)
                                    .sortBy(sortFunction, False))
print 'Movies with highest ratings: %s' % movieLimitedAndSortedByRatingRDD.take(20)
Part 2 Collaborative Filtering
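Now we start using Spark MLlib, with a method called collaborative filtering. The main idea is to make recommendations for you by finding other users who are similar to you. For movie recommendation, we start from a user-by-movie matrix: each row is a user, each column is a movie, and each entry is that user's rating of that movie. In practice this matrix is very sparse, so the usual approach is to factor it into two matrices, one describing latent properties of the users and one describing latent properties of the movies.
How do we find these two factor matrices? We minimize a squared-error loss, and the method used here is ALS (Alternating Least Squares). ALS fills both matrices with random values, then updates them iteratively to reduce the loss. "Alternating" refers to how each update holds one matrix fixed and updates the other, with the two taking turns.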
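To make the alternating updates concrete, here is a minimal plain-Python sketch for the simplest case, rank 1, where each user and each movie has a single latent factor. It assumes made-up ratings and a plain L2 penalty `lam`; it is not MLlib's implementation, which handles higher ranks and runs distributed. With one side held fixed, each factor on the other side has a closed-form least-squares solution, so the regularized loss cannot go up from one step to the next.

```python
import random

# Made-up observed ratings: (user, movie, rating). Missing pairs are unknown.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]
num_users, num_movies, lam = 3, 3, 0.1

# Random initialization, one latent factor per user and per movie.
rng = random.Random(0)
users = [rng.random() + 0.1 for _ in range(num_users)]
movies = [rng.random() + 0.1 for _ in range(num_movies)]

def objective(users, movies):
    """Squared error over observed ratings plus an L2 penalty."""
    err = sum((r - users[u] * movies[m]) ** 2 for u, m, r in ratings)
    reg = lam * (sum(x * x for x in users) + sum(x * x for x in movies))
    return err + reg

def update(num_rows, fixed, row_pos, col_pos):
    """Closed-form rank-1 update of one side with the other side held fixed."""
    updated = []
    for i in range(num_rows):
        rows = [t for t in ratings if t[row_pos] == i]
        num = sum(t[2] * fixed[t[col_pos]] for t in rows)
        den = sum(fixed[t[col_pos]] ** 2 for t in rows) + lam
        updated.append(num / den)
    return updated

history = [objective(users, movies)]
for _ in range(5):
    users = update(num_users, movies, 0, 1)   # movies fixed, solve for users
    history.append(objective(users, movies))
    movies = update(num_movies, users, 1, 0)  # users fixed, solve for movies
    history.append(objective(users, movies))
print(history)
```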
Creating a Training Set
Before starting on the machine learning, we split the data into three parts:
- A training set (RDD), which we will use to train models
- A validation set (RDD), which we will use to choose the best model
- A test set (RDD), which we will use for our experiments
We can use the randomSplit() function to split the data randomly.
trainingRDD, validationRDD, testRDD = ratingsRDD.randomSplit([6, 2, 2], seed=0L)
print 'Training: %s, validation: %s, test: %s\n' % (trainingRDD.count(),
                                                    validationRDD.count(),
                                                    testRDD.count())
print trainingRDD.take(3)
print validationRDD.take(3)
print testRDD.take(3)
assert trainingRDD.count() == 292716
assert validationRDD.count() == 96902
assert testRDD.count() == 98032
assert trainingRDD.filter(lambda t: t == (1, 914, 3.0)).count() == 1
assert trainingRDD.filter(lambda t: t == (1, 2355, 5.0)).count() == 1
assert trainingRDD.filter(lambda t: t == (1, 595, 5.0)).count() == 1
assert validationRDD.filter(lambda t: t == (1, 1287, 5.0)).count() == 1
assert validationRDD.filter(lambda t: t == (1, 594, 4.0)).count() == 1
assert validationRDD.filter(lambda t: t == (1, 1270, 5.0)).count() == 1
assert testRDD.filter(lambda t: t == (1, 1193, 5.0)).count() == 1
assert testRDD.filter(lambda t: t == (1, 2398, 4.0)).count() == 1
assert testRDD.filter(lambda t: t == (1, 1035, 5.0)).count() == 1
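The counts asserted above are close to, but not exactly, 60/20/20 of 487650, which is what you would expect if each rating is assigned to a split at random according to the normalized weights. The sketch below is a plain-Python analogue of that idea under this assumption; `random_split` is a hypothetical helper and does not reproduce Spark's sampler or its seeded output.

```python
import random

def random_split(items, weights, seed):
    """Assign each item to one bucket with probability proportional to its weight."""
    total = float(sum(weights))
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)       # cumulative upper bound of each bucket
    rng = random.Random(seed)
    buckets = [[] for _ in weights]
    for item in items:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            # The last bucket also catches any floating-point rounding gap.
            if draw < bound or index == len(bounds) - 1:
                buckets[index].append(item)
                break
    return buckets

training, validation, test = random_split(range(10000), [6, 2, 2], seed=0)
print('%d %d %d' % (len(training), len(validation), len(test)))
```

Every item lands in exactly one bucket, so the three sizes always add up to the input size even though the individual sizes vary with the seed.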
Root Mean Square Error (RMSE)
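RMSE is an important measure of how good a model is. Spark 1.4 has a RegressionMetrics module for computing RMSE, but our environment runs Spark 1.3, so we need to write the function ourselves.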
# TODO: Replace <FILL IN> with appropriate code
import math
def computeError(predictedRDD, actualRDD):
    """ Compute the root mean squared error between predicted and actual
    Args:
        predictedRDD: predicted ratings for each movie and each user where each entry is in the form
                      (UserID, MovieID, Rating)
        actualRDD: actual ratings where each entry is in the form (UserID, MovieID, Rating)
    Returns:
        RMSE (float): computed RMSE value
    """
    # Transform predictedRDD into the tuples of the form ((UserID, MovieID), Rating)
    predictedReformattedRDD = predictedRDD.map(lambda x :((x[0],x[1]),x[2]))
    # Transform actualRDD into the tuples of the form ((UserID, MovieID), Rating)
    actualReformattedRDD = actualRDD.map(lambda x :((x[0],x[1]),x[2]))
    # Compute the squared error for each matching entry (i.e., the same (User ID, Movie ID) in each
    # RDD) in the reformatted RDDs using RDD transformations - do not use collect()
    squaredErrorsRDD = predictedReformattedRDD.join(actualReformattedRDD).map(lambda x : (x[1][0]-x[1][1])**2)
    # Compute the total squared error - do not use collect()
    totalError = squaredErrorsRDD.reduce(lambda a,b : a+b)
    # Count the number of entries for which you computed the total squared error
    numRatings = squaredErrorsRDD.count()
    # Using the total squared error and the number of entries, compute the RMSE
    return math.sqrt(float(totalError)/numRatings)
# sc.parallelize turns a Python list into a Spark RDD.
testPredicted = sc.parallelize([
(1, 1, 5),
(1, 2, 3),
(1, 3, 4),
(2, 1, 3),
(2, 2, 2),
(2, 3, 4)])
testActual = sc.parallelize([
(1, 2, 3),
(1, 3, 5),
(2, 1, 5),
(2, 2, 1)])
testPredicted2 = sc.parallelize([
(2, 2, 5),
(1, 2, 5)])
testError = computeError(testPredicted, testActual)
print 'Error for test dataset (should be 1.22474487139): %s' % testError
testError2 = computeError(testPredicted2, testActual)
print 'Error for test dataset2 (should be 3.16227766017): %s' % testError2
testError3 = computeError(testActual, testActual)
print 'Error for testActual dataset (should be 0.0): %s' % testError3
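The expected values in the prints above can be checked by hand or with a few lines of plain Python. The sketch below mirrors computeError using a dictionary lookup in place of the RDD join, and assumes each (UserID, MovieID) pair appears at most once per list. For the first test, the four matching pairs have squared errors 0, 1, 4 and 1, so the RMSE is sqrt(6/4).

```python
import math

def rmse(predicted, actual):
    """RMSE over the (user, movie) keys present in both lists."""
    actual_by_key = {(u, m): r for u, m, r in actual}
    squared = [(r - actual_by_key[(u, m)]) ** 2
               for u, m, r in predicted if (u, m) in actual_by_key]
    return math.sqrt(sum(squared) / float(len(squared)))

test_predicted = [(1, 1, 5), (1, 2, 3), (1, 3, 4), (2, 1, 3), (2, 2, 2), (2, 3, 4)]
test_actual = [(1, 2, 3), (1, 3, 5), (2, 1, 5), (2, 2, 1)]
test_predicted2 = [(2, 2, 5), (1, 2, 5)]

print(rmse(test_predicted, test_actual))    # sqrt(1.5)
print(rmse(test_predicted2, test_actual))   # sqrt(10)
print(rmse(test_actual, test_actual))       # 0.0
```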
Using ALS.train()
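In this part we use MLlib's ALS.train(). We train several models with it and pick the best one. The main steps are:
First we choose the model parameters. The most important one is rank, the number of latent factors, that is, the dimension shared by the users matrix and the movies matrix. In general a rank that is too small underfits and a rank that is too large overfits. We try 4, 8 and 12 here. With the rank chosen, the call is ALS.train(trainingRDD, rank, seed=seed, iterations=iterations, lambda_=regularizationParameter); the other parameters are iterations = 5 and regularizationParameter = 0.1.
To predict ratings we use model.predictAll(). The model that performs best on the validation data becomes the final model. This step is slow to run, since it trains three models and computes the RMSE on the validation set for each.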
# TODO: Replace <FILL IN> with appropriate code
from pyspark.mllib.recommendation import ALS
validationForPredictRDD = validationRDD.map(lambda x :(x[0],x[1]))
seed = 5L
iterations = 5
regularizationParameter = 0.1
ranks = [4, 8, 12]
errors = [0, 0, 0]
err = 0
tolerance = 0.03
minError = float('inf')
bestRank = -1
bestIteration = -1
for rank in ranks:
    model = ALS.train(trainingRDD, rank, seed=seed, iterations=iterations,
                      lambda_=regularizationParameter)
    predictedRatingsRDD = model.predictAll(validationForPredictRDD)
    error = computeError(predictedRatingsRDD, validationRDD)
    errors[err] = error
    err += 1
    print 'For rank %s the RMSE is %s' % (rank, error)
    if error < minError:
        minError = error
        bestRank = rank

print 'The best model was trained with rank %s' % bestRank
This code trains the models and is the key part of the lab.
Testing Your Model
Here we compute the RMSE of the model selected above on the test data.
# TODO: Replace <FILL IN> with appropriate code
# Use bestRank here: after the loop above, `rank` is just the last value tried (12).
myModel = ALS.train(trainingRDD, bestRank, seed=seed, iterations=iterations, lambda_=regularizationParameter)
testForPredictingRDD = testRDD.map(lambda x: (x[0],x[1]))
predictedTestRDD = myModel.predictAll(testForPredictingRDD)
testRMSE = computeError(testRDD, predictedTestRDD)
print 'The model had a RMSE on the test set of %s' % testRMSE
Comparing Your Model
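The RMSE of the trained model on the test set tells us how good the model is. As a baseline for comparison, we can also compute the average rating over the training set, use that average as the prediction for every entry in the test set, and compute the RMSE of that RDD against the test RDD.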
# TODO: Replace <FILL IN> with appropriate code
trainingAvgRating = trainingRDD.map(lambda x:x[2]).mean()
print 'The average rating for movies in the training set is %s' % trainingAvgRating
testForAvgRDD = testRDD.map(lambda x: (x[0],x[1],trainingAvgRating))
testAvgRMSE = computeError(testRDD, testForAvgRDD)
print 'The RMSE on the average set is %s' % testAvgRMSE
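The baseline is easy to work through on a small example. The plain-Python sketch below uses made-up ratings: it predicts the training-set mean for every test entry and computes the RMSE of that constant prediction. A trained model is only useful if its test RMSE comes in below this number.

```python
import math

# Made-up (UserID, MovieID, Rating) data for illustration.
training = [(1, 10, 5.0), (1, 20, 3.0), (2, 10, 4.0), (2, 30, 4.0)]
test = [(3, 10, 5.0), (3, 20, 3.0), (4, 30, 4.0)]

# The constant prediction: mean rating over the training set.
training_avg = sum(r for _, _, r in training) / float(len(training))

# RMSE of predicting training_avg for every test entry.
squared_errors = [(r - training_avg) ** 2 for _, _, r in test]
baseline_rmse = math.sqrt(sum(squared_errors) / float(len(squared_errors)))

print(training_avg)
print(baseline_rmse)
```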
Part 3 Predictions for Yourself
Your Movie Ratings
Our ultimate goal is to recommend movies. Suppose we want recommendations for the user with ID 0; below we make up ratings for 10 movies for this user.
# TODO: Replace <FILL IN> with appropriate code
myUserID = 0
# Note that the movie IDs are the *last* number on each line. A common error was to use the number of ratings as the movie ID.
myRatedMovies = [
(0,1,4),(0,2,3),(0,3,2),(0,4,5),(0,5,3),(0,6,2),(0,7,2),(0,8,3),(0,9,3),(0,10,3)
# The format of each line is (myUserID, movie ID, your rating)
# For example, to give the movie "Star Wars: Episode IV - A New Hope (1977)" a five rating, you would add the following line:
# (myUserID, 260, 5),
]
myRatingsRDD = sc.parallelize(myRatedMovies)
print 'My movie ratings: %s' % myRatingsRDD.take(10)
Add Your Movies to Training Dataset
Now we merge this user's ratings into the training set.
# TODO: Replace <FILL IN> with appropriate code
trainingWithMyRatingsRDD = myRatingsRDD.union(trainingRDD)
print ('The training dataset now has %s more entries than the original training dataset' %
(trainingWithMyRatingsRDD.count() - trainingRDD.count()))
assert (trainingWithMyRatingsRDD.count() - trainingRDD.count()) == myRatingsRDD.count()
Train a Model with Your Ratings
# TODO: Replace <FILL IN> with appropriate code
myRatingsModel = ALS.train(trainingWithMyRatingsRDD, bestRank, seed=seed,iterations=iterations,lambda_=regularizationParameter)
Check RMSE for the New Model with Your Ratings
# TODO: Replace <FILL IN> with appropriate code
predictedTestMyRatingsRDD = myRatingsModel.predictAll(testForPredictingRDD)
testRMSEMyRatings = computeError(testRDD,predictedTestMyRatingsRDD)
print 'The model had a RMSE on the test set of %s' % testRMSEMyRatings
Predict Your Ratings
# TODO: Replace <FILL IN> with appropriate code
# Use the Python list myRatedMovies to transform the moviesRDD into an RDD with entries that are pairs of the form (myUserID, Movie ID) and that does not contain any movies that you have rated.
myUnratedMoviesRDD = moviesRDD.map(lambda (x,y):(myUserID,x)).filter(lambda x:x[1] not in [i[1] for i in myRatedMovies])
# Use the input RDD, myUnratedMoviesRDD, with myRatingsModel.predictAll() to predict your ratings for the movies
predictedRatingsRDD = myRatingsModel.predictAll(myUnratedMoviesRDD)
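The filter above rebuilds the list of rated movie IDs for every movie it checks. One possible refinement, sketched here in plain Python with made-up data, is to build a set of rated IDs once and test membership against it; in PySpark the same set could be created before the filter() call and referenced from the lambda.

```python
my_user_id = 0
# Made-up ratings and movies with the same shapes as myRatedMovies and moviesRDD.
my_rated_movies = [(0, 1, 4), (0, 2, 3), (0, 3, 2)]
movies = [(1, u'Movie A'), (2, u'Movie B'), (3, u'Movie C'),
          (4, u'Movie D'), (5, u'Movie E')]

# Build the set of rated movie IDs once; set membership is a constant-time check.
rated_ids = set(movie_id for _, movie_id, _ in my_rated_movies)

# (myUserID, MovieID) pairs for every movie the user has not rated.
my_unrated = [(my_user_id, movie_id)
              for movie_id, _ in movies if movie_id not in rated_ids]
print(my_unrated)
```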
Predict Your Ratings
# TODO: Replace <FILL IN> with appropriate code
# Transform movieIDsWithAvgRatingsRDD from part (1b), which has the form (MovieID, (number of ratings, average rating)), into an RDD of the form (MovieID, number of ratings)
movieCountsRDD = movieIDsWithAvgRatingsRDD.map(lambda x:(x[0],x[1][0]))
# Transform predictedRatingsRDD into an RDD with entries that are pairs of the form (Movie ID, Predicted Rating)
predictedRDD = predictedRatingsRDD.map(lambda x:(x[1],x[2]))
# Use RDD transformations with predictedRDD and movieCountsRDD to yield an RDD with tuples of the form (Movie ID, (Predicted Rating, number of ratings))
predictedWithCountsRDD = (predictedRDD
.join(movieCountsRDD))
# Use RDD transformations with PredictedWithCountsRDD and moviesRDD to yield an RDD with tuples of the form (Predicted Rating, Movie Name, number of ratings), for movies with more than 75 ratings
ratingsWithNamesRDD = (predictedWithCountsRDD
.join(moviesRDD).map(lambda x:(x[1][0][0],x[1][1],x[1][0][1])).filter(lambda x: x[2]>75))
predictedHighestRatedMovies = ratingsWithNamesRDD.takeOrdered(20, key=lambda x: -x[0])
print ('My highest rated movies as predicted (for movies with more than 75 reviews):\n%s' %
'\n'.join(map(str, predictedHighestRatedMovies)))
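The final step, joining predictions with rating counts and titles, filtering on a minimum count, and taking the top entries, can be traced in plain Python. The sketch below uses made-up predictions and placeholder titles; heapq.nlargest plays the role that takeOrdered with a negated key plays above.

```python
import heapq

# Made-up inputs: (MovieID, predicted rating), rating counts, and titles.
predicted = [(1, 4.2), (2, 4.8), (3, 3.9), (4, 4.9)]
counts = {1: 120, 2: 300, 3: 80, 4: 12}
titles = {1: u'Movie A', 2: u'Movie B', 3: u'Movie C', 4: u'Movie D'}
min_ratings = 75

# Join on MovieID and keep movies with more than min_ratings ratings,
# yielding (predicted rating, title, number of ratings).
candidates = [(p, titles[m], counts[m])
              for m, p in predicted
              if m in counts and m in titles and counts[m] > min_ratings]

# Highest predicted rating first; Movie D is excluded despite its 4.9
# because only 12 people rated it.
top = heapq.nlargest(2, candidates, key=lambda x: x[0])
print(top)
```

The count threshold matters because a prediction for a movie with very few ratings rests on little evidence, so it is left out of the recommendations.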