一 准备实验数据

1.1.下载数据

wget http://snap.stanford.edu/data/amazon/all.txt.gz

  

1.2.数据分析

1.2.1.数据格式

product/productId: B00006HAXW
product/title: Rock Rhythm & Doo Wop: Greatest Early Rock
product/price: unknown
review/userId: A1RSDE90N6RSZF
review/profileName: Joseph M. Kotow
review/helpfulness: /
review/score: 5.0
review/time:
review/summary: Pittsburgh - Home of the OLDIES
review/text: I have all of the doo wop DVD's and this one is as good or better than the
1st ones. Remember once these performers are gone, we'll never get to see them again.
Rhino did an excellent job and if you like or love doo wop and Rock n Roll you'll LOVE
this DVD !!

而,

  • product/productIdasin, e.g. amazon.com/dp/B00006HAXW  #亚马逊标准识别号码(英语:Amazon Standard Identification Number),简称ASIN(productId),是一个由十个字符(字母或数字)组成的唯一识别号码。由亚马逊及其伙伴分配,并用于亚马逊上的产品标识。
  • product/title: title of the product
  • product/price: price of the product
  • review/userId: id of the user, e.g. A1RSDE90N6RSZF
  • review/profileName: name of the user
  • review/helpfulness: fraction of users who found the review helpful
  • review/score: rating of the product
  • review/time: time of the review (unix time)
  • review/summary: review summary
  • review/text: text of the review

1.2.2.数据格式转换

首先,我们需要把原始数据格式转换成dictionary

import pandas as pd
import numpy as np
import datetime
import gzip
import json
from sklearn.decomposition import PCA
from myria import *
import simplejson def parse(filename):
f = gzip.open(filename, 'r')
entry = {}
for l in f:
l = l.strip()
colonPos = l.find(':')
if colonPos == -1:
yield entry
entry = {}
continue
eName = l[:colonPos]
rest = l[colonPos+2:]
entry[eName] = rest
yield entry f = gzip.open('somefile.gz', 'w')
#review_data = parse('kcore_5.json.gz')
for e in parse("kcore_5.json.gz"):
f.write(str(e))
f.close()

py文件执行时报错: string indices must be intergers 

分析原因:

在.py文件中写的data={"a":"123","b":"456"},data类型为dict

而在.py文件中通过data= arcpy.GetParameter(0) 获取在GP中传过来的参数{"a":"123","b":"456"},data类型为字符串!!!

所以在后续的.py中用到的data['a']就会报如上错误!!!

解决方案:

data= arcpy.GetParameter(0)

data=json.loads(data)  //将字符串转成json格式

data=eval(data)  #本程序中我们采用eval()的方式,将字符串转成dict格式 

二.数据预处理

思路:

#import libraries

# Helper functions

# Prepare the review data for training and testing the algorithms

# Preprocess product data for Content-based Recommender System

# Upload the data to the MySQL Database on an Amazon Web Services ( AWS) EC2 instance

2.1创建DataFrame

f parse(path):
f = gzip.open(path, 'r')
for l in f:
yield eval(l) review_data = parse('/kcore_5.json.gz')
productID = []
userID = []
score = []
reviewTime = []
rowCount = 0 while True:
try:
entry = next(review_data)
productID.append(entry['asin'])
userID.append(entry['reviewerID'])
score.append(entry['overall'])
reviewTime.append(entry['reviewTime'])
rowCount += 1
if rowCount % 1000000 == 0:
print 'Already read %s observations' % rowCount
except StopIteration, e:
print 'Read %s observations in total' % rowCount
entry_list = pd.DataFrame({'productID': productID,
'userID': userID,
'score': score,
'reviewTime': reviewTime})
filename = 'review_data.csv'
entry_list.to_csv(filename, index=False)
print 'Save the data in the file %s' % filename
break entry_list = pd.read_csv('review_data.csv')

2.2数据过滤

def filterReviewsByField(reviews, field, minNumReviews):
reviewsCountByField = reviews.groupby(field).size()
fieldIDWithNumReviewsPlus = reviewsCountByField[reviewsCountByField >= minNumReviews].index
#print 'The number of qualified %s: ' % field, fieldIDWithNumReviewsPlus.shape[0]
if len(fieldIDWithNumReviewsPlus) == 0:
print 'The filtered reviews have become empty'
return None
else:
return reviews[reviews[field].isin(fieldIDWithNumReviewsPlus)] def checkField(reviews, field, minNumReviews):
return np.mean(reviews.groupby(field).size() >= minNumReviews) == 1 def filterReviews(reviews, minItemNumReviews, minUserNumReviews):
filteredReviews = filterReviewsByField(reviews, 'productID', minItemNumReviews)
if filteredReviews is None:
return None
if checkField(filteredReviews, 'userID', minUserNumReviews):
return filteredReviews filteredReviews = filterReviewsByField(filteredReviews, 'userID', minUserNumReviews)
if filteredReviews is None:
return None
if checkField(filteredReviews, 'productID', minItemNumReviews):
return filteredReviews
else:
return filterReviews(filteredReviews, minItemNumReviews, minUserNumReviews) def filteredReviewsInfo(reviews, minItemNumReviews, minUserNumReviews):
t1 = datetime.datetime.now()
filteredReviews = filterReviews(reviews, minItemNumReviews, minUserNumReviews)
print 'Mininum num of reviews in each item: ', minItemNumReviews
print 'Mininum num of reviews in each user: ', minUserNumReviews
print 'Dimension of filteredReviews: ', filteredReviews.shape if filteredReviews is not None else '(0, 4)'
print 'Num of unique Users: ', filteredReviews['userID'].unique().shape[0]
print 'Num of unique Product: ', filteredReviews['productID'].unique().shape[0]
t2 = datetime.datetime.now()
print 'Time elapsed: ', t2 - t1
return filteredReviews allReviewData = filteredReviewsInfo(entry_list, 100, 10)
smallReviewData = filteredReviewsInfo(allReviewData, 150, 15)

理论知识

1. Combining predictions for accurate recommender systems

So, for practical applications we recommend to use a neural network in combination with bagging due to the fast prediction speed.

Collaborative ltering(协同过滤,筛选相似的推荐):电子商务推荐系统的主要算法,利用某兴趣相投、拥有共同经验之群体的喜好来推荐用户感兴趣的信息

web-Amazon的更多相关文章

  1. Amazon AWS EC2开启Web服务器配置

    在Amazon AWS EC2申请了一年的免费使用权,安装了CentOS + Mono + Jexus环境做一个Web Server使用. 在上述系统安装好之后,把TCP 80端口开启(iptable ...

  2. Summary of Amazon Marketplace Web Service

    Overview Here I want to summarize Amazon marketplace web service (MWS or AMWS) that can be used for ...

  3. Getting Started with Amazon EC2 (1 year free AWS VPS web hosting)

    from: http://blog.coolaj86.com/articles/getting-started-with-amazon-ec2-1-year-free-aws-vps-web-host ...

  4. 注册 Amazon Web Services(AWS) 账号,助园一臂之力

    感谢大家去年的大力支持,今年园子继续和 Amazon Web Services(AWS) 合作,只要您通过 博客园专属链接 注册一个账号(建议使用手机4G网络注册),亚马逊就会给园子收入,期待您的支持 ...

  5. AWS(0) - Amazon Web Services

    Computer EC2 – Virtual Servers in the Cloud EC2 Container Service – Run and Manage Docker Containers ...

  6. Amazon Web Services (目录)

    一.官方声明 AWS云全球服务基础设施区域列表 AWS产品定价国外区 AWS产品定价中国区 (注意!需要登陆账户才能查看) AWS产品费用预算 AWS区域和终端节点 二.计算 Amazon学习:如何启 ...

  7. Amazon Web Services

  8. 亚马逊记AWS(Amazon Web Services)自由EC2应用

    很长时间,我听到AWS能够应用,但是需要结合信用卡,最近申请了. 说是免费的,我还是扣6.28,后来我上网查了.认为是通过进行验证.像服务期满将返回. 关键是不要让我进入全抵扣信用卡支付passwor ...

  9. 网络爬虫: 从allitebooks.com抓取书籍信息并从amazon.com抓取价格(3): 抓取amazon.com价格

    通过上一篇随笔的处理,我们已经拿到了书的书名和ISBN码.(网络爬虫: 从allitebooks.com抓取书籍信息并从amazon.com抓取价格(2): 抓取allitebooks.com书籍信息 ...

  10. 网络爬虫: 从allitebooks.com抓取书籍信息并从amazon.com抓取价格(1): 基础知识Beautiful Soup

    开始学习网络数据挖掘方面的知识,首先从Beautiful Soup入手(Beautiful Soup是一个Python库,功能是从HTML和XML中解析数据),打算以三篇博文纪录学习Beautiful ...

随机推荐

  1. 如何利用FPGA进行时序分析设计

    FPGA(Field-Programmable Gate Array),即现场可编程门阵列,它是作为专用集成电路(ASIC)领域中的一种半定制电路而出现的,既解决了定制电路的不足,又克服了原有可编程器 ...

  2. Java基本语法之动手动脑

    1.枚举类型 运行EnumTest.java 运行结果:false,false,true,SMALL,MEDIUM,LARGE 结论:枚举类型是引用类型,枚举不属于原始数据类型,它的每个具体值都引用一 ...

  3. day 10 函数名的运用,闭包,迭代器

    函数名的本质 函数名本质上就是函数的内存地址 函数名的五种运用: 1.函数名是一个变量 def func(): print(666) print(func) # 函数的内存地址 <functio ...

  4. 转 移动端-webkit-user-select:none导致input/textarea输入框无法输入

    移动端webview中写页面的时候发现个别Android机型会导致input.textareat输入框无法输入(键盘可以弹起,不是webView.requestFocus(View.FOCUS_DOW ...

  5. Linux移植之子目录下的built-in.o生成过程分析

    在Linux移植之make uImage编译过程分析中罗列出了最后链接生成vmlinux的过程.可以看到在每个子目录下都有一个built-in.o文件.对于此产生了疑问built-in.o文件是根据什 ...

  6. 动态加载JS脚本到HTML

    如果用原生态的js 有2中方法  1.直接document.write  <script language="javascript">      document.wr ...

  7. Pycharm小知识

    1)  重新更改文件名称:(Shift + F6) 2) 设置IDE皮肤主题 File -> Settings ->  Appearance -> Theme -> 选择“Al ...

  8. mybatis入门--mybatis和hibernate比较

    mybatis和hibernate的比较 Mybatis和hibernate不同,它不完全是一个ORM框架,因为MyBatis需要程序员自己编写Sql语句,不过mybatis可以通过XML或注解方式灵 ...

  9. PAT 1005 继续(3n+1)猜想 (25)(代码)

    1005 继续(3n+1)猜想 (25)(25 分) 卡拉兹(Callatz)猜想已经在1001中给出了描述.在这个题目里,情况稍微有些复杂. 当我们验证卡拉兹猜想的时候,为了避免重复计算,可以记录下 ...

  10. Linux CentOS 7 下 Apache Tomcat 7 安装与配置

    前言 记录一下Linux CentOS 7安装Tomcat7的完整步骤. 下载 首先需要下载tomcat7的安装文件,地址如下: http://mirror.bit.edu.cn/apache/tom ...