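Event Recommendation Engine Challenge, Step-by-Step Analysis: Step 6
I. Please note
This article builds on:
Event Recommendation Engine Challenge, Step-by-Step Analysis: Step 1
Event Recommendation Engine Challenge, Step-by-Step Analysis: Step 2
Event Recommendation Engine Challenge, Step-by-Step Analysis: Step 3
Event Recommendation Engine Challenge, Step-by-Step Analysis: Step 4
Event Recommendation Engine Challenge, Step-by-Step Analysis: Step 5
Readers should work through those five articles first.
II. Feature construction
In the first five steps we stored the data we need in structured form. This part uses that data to build the features.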
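1) Generating the training data
dr = DataRewriter()
print('Generating training data...\n')
dr.rewriteData(train=True, start=2, header=True)
Let's start with the rewriteData method of the DataRewriter class. It combines the user-based and item-based collaborative-filtering scores with the popularity and influence measures computed earlier into one feature row per record, producing a new training set for the classifier.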
def rewriteData(self, start=1, train=True, header=True):
    """
    Combine the user-based and item-based collaborative-filtering scores
    with the popularity and influence measures into features,
    producing a new train file for the classifier.
    """
    fn = 'train.csv' if train else 'test.csv'
    fin = open(fn)
    fout = open('data_' + fn, 'w')
    # write output header
    if header:
        ocolnames = ['invited', 'user_reco', 'evt_p_reco', 'evt_c_reco', 'user_pop', 'frnd_infl', 'evt_pop']
        if train:
            ocolnames.append('interested')
            ocolnames.append('not_interested')
        fout.write(','.join(ocolnames) + '\n')
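fn: either train.csv or test.csv
fout: the file we write the result to, data_train.csv or data_test.csv
ocolnames: the feature names; for train.csv the labels interested and not_interested are appended as well
The code below is explained using train.csv as the example. Its columns are user, event, invited, timestamp, interested and not_interested.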
def rewriteData(self, start=1, train=True, header=True):
    """
    Combine the user-based and item-based collaborative-filtering scores
    with the popularity and influence measures into features,
    producing a new train file for the classifier.
    """
    fn = 'train.csv' if train else 'test.csv'
    fin = open(fn)
    fout = open('data_' + fn, 'w')
    # write output header
    if header:
        ocolnames = ['invited', 'user_reco', 'evt_p_reco', 'evt_c_reco', 'user_pop', 'frnd_infl', 'evt_pop']
        if train:
            ocolnames.append('interested')
            ocolnames.append('not_interested')
        fout.write(','.join(ocolnames) + '\n')

    ln = 0
    for line in fin:
        ln += 1
        if ln < start:
            # skip the lines before `start` (start=2 skips the CSV header)
            continue
        cols = line.strip().split(',')
        # user,event,invited,timestamp,interested,not_interested
        userId = cols[0]
        eventId = cols[1]
        invited = cols[2]
        if ln % 500 == 0:
            print("%s : %d (userId, eventId) = (%s, %s)" % (fn, ln, userId, eventId))
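The parsing step in the loop above can be tried on its own. The following is a minimal sketch, not the author's code: `parse_row` is a hypothetical helper, and the sample line reuses the userId and eventId discussed in this article while the remaining field values are made up for illustration.

```python
def parse_row(line):
    """Split one train.csv line into the fields that rewriteData uses.

    Assumed column order (from the comment in the loop):
    user,event,invited,timestamp,interested,not_interested
    """
    cols = line.strip().split(',')
    return {
        'userId': cols[0],
        'eventId': cols[1],
        'invited': cols[2],
        'interested': cols[4],
        'not_interested': cols[5],
    }


# Illustrative line: the timestamp and label values are invented.
sample = "3044012,1918771225,0,2012-10-02 15:53:05,0,0\n"
row = parse_row(sample)
print(row['userId'], row['eventId'], row['invited'])
```

All fields stay strings, which matters because userIndex and eventIndex are looked up with string keys.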
a) Read train.csv or test.csv line by line, split each line on commas, and take userId, eventId and invited (the first three columns). Then call self.userReco(userId, eventId) to compute user_reco:
# Feature construction
# import cPickle
# In Python 3, cPickle has been replaced by _pickle
import _pickle as cPickle
import scipy.io as sio

class DataRewriter:
    def __init__(self):
        # load the data produced by the previous steps
        self.userIndex = cPickle.load(open('PE_userIndex.pkl', 'rb'))
        self.eventIndex = cPickle.load(open('PE_eventIndex.pkl', 'rb'))
        self.userEventScores = sio.mmread('PE_userEventScores').todense()
        self.userSimMatrix = sio.mmread('US_userSimMatrix').todense()
        self.eventPropSim = sio.mmread('EV_eventPropSim').todense()
        self.eventContSim = sio.mmread('EV_eventContSim').todense()
        self.numFriends = sio.mmread('UF_numFriends')
        self.userFriends = sio.mmread('UF_userFriends').todense()
        self.eventPopularity = sio.mmread('EA_eventPopularity').todense()

    def userReco(self, userId, eventId):
        """
        Recommendation score of an event from user-based collaborative filtering.
        Basic pseudocode:
        for every item i
            for every other user v that has a preference for i
                compute similarity s between u and v
                incorporate v's preference for i weighted by s into running average
        return top items ranked by weighted average
        """
        i = self.userIndex[userId]
        j = self.eventIndex[eventId]
        vs = self.userEventScores[:, j]   # scores of all users for event j
        sims = self.userSimMatrix[i, :]   # similarity of every user to user i
        prod = sims * vs                  # matrix product, a 1x1 matrix
        try:
            return prod[0, 0] - self.userEventScores[i, j]
        except IndexError:
            return 0
For example, for the first data row of train.csv, userId = 3044012 and eventId = 1918771225:
# import cPickle
# In Python 3, cPickle has been replaced by _pickle
import _pickle as cPickle
import scipy.io as sio

userIndex = cPickle.load(open('PE_userIndex.pkl', 'rb'))
eventIndex = cPickle.load(open('PE_eventIndex.pkl', 'rb'))
userEventScores = sio.mmread('PE_userEventScores').todense()
userSimMatrix = sio.mmread('US_userSimMatrix').todense()

userId = '3044012'
eventId = '1918771225'
i = userIndex[userId]
j = eventIndex[eventId]

print('The first line in train.csv: userIndex of (userId = %s) is (i = %d) ' % (userId, i))
print('The first line in train.csv: eventIndex of (eventId = %s) is (j = %d) ' % (eventId, j))

vs = userEventScores[:, j]   # interest scores of all users for event j, i.e. column j+1 of userEventScores
sims = userSimMatrix[i, :]   # row i+1 of userSimMatrix, i.e. the similarity of every user to this user
prod = sims * vs
try:
    print(prod[0, 0] - userEventScores[i, j])
except IndexError:
    print(0)
Running the snippet prints the two indices followed by the user_reco value for this user.
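To see what the formula does without the pickled files, here is a minimal sketch on made-up matrices. The values of `user_event_scores` and `user_sim` and the function name `user_reco` are assumptions for illustration; it uses plain NumPy arrays and the `@` operator where the article's code multiplies np.matrix objects.

```python
import numpy as np

# Toy data: 3 users (rows) x 2 events (columns); values are invented.
user_event_scores = np.array([[1.0, 0.0],
                              [1.0, 1.0],
                              [-1.0, 1.0]])
# Toy symmetric user-user similarity with 1.0 on the diagonal.
user_sim = np.array([[1.0, 0.5, 0.2],
                     [0.5, 1.0, 0.0],
                     [0.2, 0.0, 1.0]])


def user_reco(i, j):
    """Similarity-weighted sum of all users' scores for event j,
    minus user i's own score for that event."""
    weighted = user_sim[i, :] @ user_event_scores[:, j]
    return weighted - user_event_scores[i, j]


print(user_reco(0, 1))  # contributions come only from users 1 and 2
print(user_reco(0, 0))
```

Subtracting the user's own score removes exactly their own contribution only when the self-similarity on the diagonal is 1, as it is in this toy matrix.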
b) Computing evt_p_reco and evt_c_reco
The process mirrors userReco() above; refer to the structure of eventPropSim and eventContSim.
def eventReco(self, userId, eventId):
    """
    Recommendation score of an event from item-based collaborative filtering.
    Basic pseudocode:
    for item i:
        for every item j that u has a preference for
            compute similarity s between i and j
            add u's preference for j weighted by s to a running average
    return top items, ranked by weighted average
    """
    i = self.userIndex[userId]
    j = self.eventIndex[eventId]
    js = self.userEventScores[i, :]   # user i's interest score for every event
    psim = self.eventPropSim[:, j]    # property-based similarity of every event to event j
    csim = self.eventContSim[:, j]    # content-based similarity of every event to event j
    pprod = js * psim
    cprod = js * csim
    pscore = 0
    cscore = 0
    try:
        pscore = pprod[0, 0] - self.userEventScores[i, j]
    except IndexError:
        pass
    try:
        cscore = cprod[0, 0] - self.userEventScores[i, j]
    except IndexError:
        pass
    return pscore, cscore
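The item-based score can be checked the same way on toy data. This is a sketch under stated assumptions: `scores` and `event_sim` are invented, `event_reco` is a hypothetical stand-in, and a single similarity matrix is used since eventPropSim and eventContSim are handled identically.

```python
import numpy as np

# Toy data: one user's scores for 3 events; values are invented.
scores = np.array([[1.0, 0.0, -1.0]])
# Toy symmetric event-event similarity with 1.0 on the diagonal.
event_sim = np.array([[1.0, 0.8, 0.0],
                      [0.8, 1.0, 0.3],
                      [0.0, 0.3, 1.0]])


def event_reco(i, j):
    """User i's scores for all events, weighted by each event's
    similarity to event j, minus the user's own score for event j."""
    weighted = scores[i, :] @ event_sim[:, j]
    return weighted - scores[i, j]


# Event 1 is similar to a liked event (0) and a disliked event (2).
print(event_reco(0, 1))
```

The liked event pulls the score up by 0.8 and the disliked one pulls it down by 0.3, which is the weighting the pseudocode describes.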
c) Computing user_pop: call self.userPop()
This uses each user's friend count (already expressed as a proportion):
def userPop(self, userId):
    """
    Infer how sociable a user is from the number of friends.
    The idea is that a user with many friends may be more inclined
    to attend social events.
    """
    if userId in self.userIndex:
        i = self.userIndex[userId]
        try:
            return self.numFriends[0, i]
        except IndexError:
            return 0
    else:
        return 0
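d) Computing frnd_infl: call self.friendInfluence(). This is the influence of friends on the user, i.e. how many of the user's friends are keen participants in social events.
It uses the variable self.userFriends.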
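def friendInfluence(self, userId):
    """
    Influence of friends on the user.
    Considers how many of the user's friends are keen participants in events.
    If a user's circle of friends actively attends events, that may
    influence the user as well.
    """
    nusers = np.shape(self.userFriends)[1]
    i = self.userIndex[userId]
    # Is the next line a problem? It should arguably be the sum of the scores
    # of all of this user's friends divided by nusers, i.e. axis=1.
    # userFriends[i, :] is a 1 x nusers matrix, so sum(axis=0) leaves it
    # unchanged and [0, 0] then picks only its first entry.
    # The original line is kept as is.
    return (self.userFriends[i, :].sum(axis=0) / nusers)[0, 0]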
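The question raised in the comment above can be checked on a small matrix. This is a sketch with an invented 3x3 `user_friends` matrix. It uses np.matrix on purpose, because `.todense()` returns that type and its `sum` keeps results two-dimensional.

```python
import numpy as np

# Invented friend-score matrix: entry (i, j) is friend j's score for user i.
user_friends = np.matrix([[0.1, 0.2, 0.3],
                          [0.0, 0.5, 0.0],
                          [0.4, 0.0, 0.0]])
nusers = user_friends.shape[1]
i = 0
row = user_friends[i, :]  # shape (1, 3)

# As written in the article: axis=0 sums each column of a one-row matrix,
# so the row is unchanged and [0, 0] is just the first entry / nusers.
as_written = (row.sum(axis=0) / nusers)[0, 0]

# With axis=1 the scores of all friends of user i are summed first.
row_total = (row.sum(axis=1) / nusers)[0, 0]

print(as_written, row_total)
```

On this toy data the two variants give different values, so at least for np.matrix inputs axis=1 is the one that matches the docstring's description.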
e) Computing evt_pop: call self.eventPop(). This is the popularity of an event, measured mainly by the number of participants.
It uses the variable self.eventPopularity.
def eventPop(self, eventId):
    """
    Popularity of the event itself,
    measured mainly by the number of participants.
    """
    i = self.eventIndex[eventId]
    return self.eventPopularity[i, 0]
f) Finally, write the row to the output file
Each row contains [invited, user_reco, evt_p_reco, evt_c_reco, user_pop, frnd_infl, evt_pop]. When reading train.csv, the labels interested and not_interested are appended as well.
# after reading and processing one line, write the resulting row
fout.write(','.join(map(lambda x: str(x), ocols)) + '\n')
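As an alternative way to emit the same row, the standard-library csv module can do the joining. This is only a sketch, not what the article's code does: `format_row` is a hypothetical helper and the feature values are invented.

```python
import csv
import io


def format_row(ocols):
    """Render one feature row as a CSV line using csv.writer."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator='\n').writerow(ocols)
    return buf.getvalue()


# Invented feature values, followed by the two label columns as strings.
ocols = ['0', 0.3, 0.5, 0.1, 0.02, 0.0, 0.01, '0', '0']
line = format_row(ocols)
print(line, end='')

# For plain numbers and simple strings this matches the manual join.
manual = ','.join(map(str, ocols)) + '\n'
```

For this kind of data the two approaches produce the same text. csv.writer would only differ if a field contained a comma or a quote, which it would then escape.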
g) Complete feature-construction code
# Feature construction
# import cPickle
# In Python 3, cPickle has been replaced by _pickle
import _pickle as cPickle
import scipy.io as sio
import numpy as np

class DataRewriter:
    def __init__(self):
        # load the data produced by the previous steps
        self.userIndex = cPickle.load(open('PE_userIndex.pkl', 'rb'))
        self.eventIndex = cPickle.load(open('PE_eventIndex.pkl', 'rb'))
        self.userEventScores = sio.mmread('PE_userEventScores').todense()
        self.userSimMatrix = sio.mmread('US_userSimMatrix').todense()
        self.eventPropSim = sio.mmread('EV_eventPropSim').todense()
        self.eventContSim = sio.mmread('EV_eventContSim').todense()
        self.numFriends = sio.mmread('UF_numFriends')
        self.userFriends = sio.mmread('UF_userFriends').todense()
        self.eventPopularity = sio.mmread('EA_eventPopularity').todense()

    def userReco(self, userId, eventId):
        """
        Recommendation score of an event from user-based collaborative filtering.
        Basic pseudocode:
        for every item i
            for every other user v that has a preference for i
                compute similarity s between u and v
                incorporate v's preference for i weighted by s into running average
        return top items ranked by weighted average
        """
        i = self.userIndex[userId]
        j = self.eventIndex[eventId]
        vs = self.userEventScores[:, j]
        sims = self.userSimMatrix[i, :]
        prod = sims * vs
        try:
            return prod[0, 0] - self.userEventScores[i, j]
        except IndexError:
            return 0

    def eventReco(self, userId, eventId):
        """
        Recommendation score of an event from item-based collaborative filtering.
        Basic pseudocode:
        for item i:
            for every item j that u has a preference for
                compute similarity s between i and j
                add u's preference for j weighted by s to a running average
        return top items, ranked by weighted average
        """
        i = self.userIndex[userId]
        j = self.eventIndex[eventId]
        js = self.userEventScores[i, :]
        psim = self.eventPropSim[:, j]
        csim = self.eventContSim[:, j]
        pprod = js * psim
        cprod = js * csim
        pscore = 0
        cscore = 0
        try:
            pscore = pprod[0, 0] - self.userEventScores[i, j]
        except IndexError:
            pass
        try:
            cscore = cprod[0, 0] - self.userEventScores[i, j]
        except IndexError:
            pass
        return pscore, cscore

    def userPop(self, userId):
        """
        Infer how sociable a user is from the number of friends.
        The idea is that a user with many friends may be more inclined
        to attend social events.
        """
        if userId in self.userIndex:
            i = self.userIndex[userId]
            try:
                return self.numFriends[0, i]
            except IndexError:
                return 0
        else:
            return 0

    def friendInfluence(self, userId):
        """
        Influence of friends on the user.
        Considers how many of the user's friends are keen participants in events.
        If a user's circle of friends actively attends events, that may
        influence the user as well.
        """
        nusers = np.shape(self.userFriends)[1]
        i = self.userIndex[userId]
        # Is the next line a problem? It should arguably be the sum of the scores
        # of all of this user's friends divided by nusers, i.e. axis=1.
        # userFriends[i, :] is a 1 x nusers matrix, so sum(axis=0) leaves it
        # unchanged and [0, 0] then picks only its first entry.
        # The original line is kept as is.
        return (self.userFriends[i, :].sum(axis=0) / nusers)[0, 0]

    def eventPop(self, eventId):
        """
        Popularity of the event itself,
        measured mainly by the number of participants.
        """
        i = self.eventIndex[eventId]
        return self.eventPopularity[i, 0]

    def rewriteData(self, start=1, train=True, header=True):
        """
        Combine the user-based and item-based collaborative-filtering scores
        with the popularity and influence measures into features,
        producing a new train file for the classifier.
        """
        fn = 'train.csv' if train else 'test.csv'
        fin = open(fn)
        fout = open('data_' + fn, 'w')
        # write output header
        if header:
            ocolnames = ['invited', 'user_reco', 'evt_p_reco', 'evt_c_reco', 'user_pop', 'frnd_infl', 'evt_pop']
            if train:
                ocolnames.append('interested')
                ocolnames.append('not_interested')
            fout.write(','.join(ocolnames) + '\n')

        ln = 0
        for line in fin:
            ln += 1
            if ln < start:
                continue
            cols = line.strip().split(',')
            # user,event,invited,timestamp,interested,not_interested
            userId = cols[0]
            eventId = cols[1]
            invited = cols[2]
            if ln % 500 == 0:
                print("%s : %d (userId, eventId) = (%s, %s)" % (fn, ln, userId, eventId))

            user_reco = self.userReco(userId, eventId)
            evt_p_reco, evt_c_reco = self.eventReco(userId, eventId)
            user_pop = self.userPop(userId)
            frnd_infl = self.friendInfluence(userId)
            evt_pop = self.eventPop(eventId)
            ocols = [invited, user_reco, evt_p_reco, evt_c_reco, user_pop, frnd_infl, evt_pop]

            if train:
                ocols.append(cols[4])  # interested
                ocols.append(cols[5])  # not_interested

            fout.write(','.join(map(lambda x: str(x), ocols)) + '\n')

        fin.close()
        fout.close()

    def rewriteTrainingSet(self):
        # pass train by keyword; the first positional parameter is start
        self.rewriteData(train=True)

    def rewriteTestSet(self):
        # pass train by keyword; the first positional parameter is start
        self.rewriteData(train=False)

dr = DataRewriter()
print('Generating training data...\n')
dr.rewriteData(train=True, start=2, header=True)

print('Generating test data...\n')
dr.rewriteData(train=False, start=2, header=True)
print('done')
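2) Generating the test data: the process mirrors that for the training data.
That completes step 6. Leave a comment if anything is unclear.
With the features built, there are many ways to train a model and make predictions.
Continue with Event Recommendation Engine Challenge, Step-by-Step Analysis: Step 7.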