Importing the CSV-format test data of the Book-Crossing Dataset (book recommendation) into a MySQL database
In this post
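I have recently been reading A Programmer's Guide to Data Mining (《写给程序员的数据挖掘指南》) to study recommendation algorithms. The test data used in the book comes from the Book-Crossing Dataset: real book ratings collected from the BookCrossing community (the cover image URLs in the data point to Amazon). I recommend the book. It is well written, gets you started with recommendation algorithms quickly, and the material can be applied directly in your own projects.
The Book-Crossing Dataset is available in two formats, CSV and SQL dump. The problems are:
If you open the CSV files in UE (UltraEdit), the text is garbled, and no encoding conversion fixes it, because the files were persisted by a program and then exported. You will also find HTML markup in the files, and much of the information, such as user data and book titles, is in German (the domain name gives that away).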
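The book's author does provide Python code for loading the test data, but it cannot be used to import the data into MySQL. The code simply splits each line on semicolons (the recommendation algorithm does not need every field anyway), while the dataset contains character sequences such as "ऩ" or `\"`, so the lines cannot be imported into MySQL as they are.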
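To make the problem concrete, here is a minimal sketch (not the book's code) that compares a plain split on `;` with a quote-aware regular expression. The sample line is made up for illustration; it only mimics the dataset's `"field";"field"` layout, with an HTML entity inside a quoted title.

```python
import re

# Hypothetical sample line in the dataset's "a";"b";"c" layout.
# The HTML entity &amp; carries its own semicolon inside the quoted title.
line = '"0000000000";"Cats &amp; Dogs";"Some Author"'

# Naive approach: split on every semicolon, then strip the quotes.
naive = [field.strip('"') for field in line.split(';')]

# Quote-aware approach: treat each double-quoted run as one field.
quoted = re.compile(r'"[^"]*"')
fields = [field.strip('"') for field in quoted.findall(line)]

print(len(naive), naive)    # the title is cut in two at the entity's semicolon
print(len(fields), fields)  # the title stays in one piece
```

The quote-aware pattern keeps the title intact, but it still mis-splits a field that contains an escaped quote such as `\"`, so lines like that have to be cleaned in the data first.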
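You may ask: if the book's author never imports the data into a database, why should you? Because the recommendation algorithm in the book is an in-memory model, meaning all the data is loaded into memory in one go. Before that can happen, the data still has to be persisted somewhere.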
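As a rough sketch of what "in-memory model" means here: once the ratings are stored in the database, they can be read back as rows and grouped into a nested dictionary keyed by user. The function name `build_ratings` and the sample rows are illustrative assumptions, not taken from the book or the dataset.

```python
def build_ratings(rows):
    """Group (userid, isbn, rating) rows into {userid: {isbn: rating}}."""
    data = {}
    for userid, isbn, rating in rows:
        data.setdefault(userid, {})[isbn] = rating
    return data


# Made-up rows standing in for the result of a SELECT on the ratings table.
rows = [
    ("u1", "isbn-a", 5),
    ("u1", "isbn-b", 0),
    ("u2", "isbn-a", 8),
]

ratings = build_ratings(rows)
print(ratings["u1"])
```

Everything lives in one dictionary, which is why the data must fit in memory and why a persistent copy in MySQL is useful as the source to load from.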
So the only option was to adapt the author's Python code.
GitHub Demo
Modified test dataset
Python
# -*- coding: utf-8 -*-
import codecs
import re
from collections import OrderedDict

import mysql.connector

try:
    import configparser as ConfigParser  # Python 3
except ImportError:
    import ConfigParser  # Python 2
class MysqlPythonFacotry(object):
    """
    Python Class for connecting with MySQL server.
    """
    __instance = None
    __host = None
    __user = None
    __password = None
    __database = None
    __session = None
    __connection = None

    def __init__(self, host='localhost', user='root', password='', database=''):
        self.__host = host
        self.__user = user
        self.__password = password
        self.__database = database
    ## End def __init__

    def open(self):
        try:
            cnx = mysql.connector.connect(host=self.__host,
                                          user=self.__user,
                                          password=self.__password,
                                          database=self.__database)
            self.__connection = cnx
            self.__session = cnx.cursor()
        except mysql.connector.Error as e:
            print('connect fails!{}'.format(e))
            # Re-raise: without a connection every later call would fail anyway
            raise
    ## End def open

    def close(self):
        self.__session.close()
        self.__connection.close()
    ## End def close

    def select(self, table, where=None, *args, **kwargs):
        # Column names come from *args; the kwargs values fill the %s
        # placeholders used in `where`
        query = 'SELECT '
        keys = args
        values = tuple(kwargs.values())
        l = len(keys) - 1
        for i, key in enumerate(keys):
            query += "`" + key + "`"
            if i < l:
                query += ","
        ## End for keys
        query += ' FROM %s' % table
        if where:
            query += " WHERE %s" % where
        ## End if where
        self.__session.execute(query, values)
        result = self.__session.fetchall()
        return result
    ## End def select

    def update(self, table, where=None, *args, **kwargs):
        update_rows = 0
        try:
            query = "UPDATE %s SET " % table
            keys = list(kwargs.keys())
            # SET values first, then the values for the placeholders in `where`
            values = tuple(kwargs.values()) + tuple(args)
            l = len(keys) - 1
            for i, key in enumerate(keys):
                query += "`" + key + "` = %s"
                if i < l:
                    query += ","
                ## End if i less than l
            ## End for keys
            if where:
                query += " WHERE %s" % where
            self.__session.execute(query, values)
            self.__connection.commit()
            # Obtain rows affected
            update_rows = self.__session.rowcount
        except mysql.connector.Error as e:
            print(e)
        return update_rows
    ## End function update

    def insert(self, table, *args, **kwargs):
        values = None
        query = "INSERT INTO %s " % table
        if kwargs:
            # Named columns: INSERT INTO t (`a`,`b`) VALUES (%s,%s)
            keys = list(kwargs.keys())
            values = tuple(kwargs.values())
            query += "(" + ",".join(["`%s`"] * len(keys)) % tuple(keys) + ") VALUES (" + ",".join(["%s"] * len(values)) + ")"
        elif args:
            # Positional values: INSERT INTO t VALUES(%s,%s)
            values = args
            query += " VALUES(" + ",".join(["%s"] * len(values)) + ")"
        self.__session.execute(query, values)
        self.__connection.commit()
        cnt = self.__session.rowcount
        return cnt
    ## End def insert

    def delete(self, table, where=None, *args):
        query = "DELETE FROM %s" % table
        if where:
            query += ' WHERE %s' % where
        values = tuple(args)
        self.__session.execute(query, values)
        self.__connection.commit()
        delete_rows = self.__session.rowcount
        return delete_rows
    ## End def delete

    def select_advanced(self, sql, *args):
        # args are (name, value) pairs; only the values are sent, in order
        od = OrderedDict(args)
        query = sql
        values = tuple(od.values())
        self.__session.execute(query, values)
        result = self.__session.fetchall()
        return result
    ## End def select_advanced
## End class
class ErrorMyProgram(Exception):
    """
    My Exception Error Class
    """
    def __init__(self, value):
        self.value = value
    ##End def __init__

    def __str__(self):
        return repr(self.value)
    ##End def __str__
## End class ErrorMyProgram

class LoadAppConf(object):
    """
    Load app.conf Config File Class
    """
    __configFileName = "app.conf"

    def __init__(self):
        config = ConfigParser.ConfigParser()
        config.read(self.__configFileName)
        self.biz_db_host = config.get("biz_db", "host")
        self.biz_db_user = config.get("biz_db", "user")
        self.biz_db_password = config.get("biz_db", "password")
        self.biz_db_database = config.get("biz_db", "database")
    ## End def __init__
## End class LoadAppConf

class Biz_Base(object):
    """
    biz base class
    """
    def __init__(self, db):
        self.db = db
    ## End def __init__
## End class Biz_Base
class Biz_bx_book_ratings(Biz_Base):
    """
    bx_book_ratings table
    """
    __tableName = "bx_book_ratings"

    def __init__(self, db):
        Biz_Base.__init__(self, db)
    ## End def __init__

    def insert(self, userid, isbn, bookrating):
        cnt = self.db.insert(self.__tableName,
                             userid=userid,
                             isbn=isbn,
                             bookrating=bookrating)
        return cnt > 0
    ## End def insert
## End class Biz_bx_book_ratings

class Biz_bx_books(Biz_Base):
    """
    bx_books table
    """
    __tableName = "bx_books"

    def __init__(self, db):
        Biz_Base.__init__(self, db)
    ## End def __init__

    def insert(self, isbn, booktitle, bookauthor, yearofpublication, publisher, imageurls, imageurlm, imageurll):
        cnt = self.db.insert(self.__tableName,
                             isbn=isbn,
                             booktitle=booktitle,
                             bookauthor=bookauthor,
                             yearofpublication=yearofpublication,
                             publisher=publisher,
                             imageurls=imageurls,
                             imageurlm=imageurlm,
                             imageurll=imageurll)
        return cnt > 0
    ## End def insert
## End class Biz_bx_books

class Biz_bx_users(Biz_Base):
    """
    bx_users table
    """
    __tableName = "bx_users"

    def __init__(self, db):
        Biz_Base.__init__(self, db)
    ## End def __init__

    def insert(self, userid, location, age):
        cnt = self.db.insert(self.__tableName,
                             userid=userid,
                             location=location,
                             age=age)
        return cnt > 0
    ## End def insert
## End class Biz_bx_users
def regx(l):
    """
    split line by regex
    """
    # Each double-quoted run is one field, so semicolons inside quotes are kept
    p = re.compile(r'"[^"]*"')
    return p.findall(l)
## End def regx
class LoadDataset(object):
    """
    Loads the BX CSV files into the MySQL tables
    """
    __loadConf = None
    __users = None
    __books = None
    __book_ratings = None
    __bizDb = None

    def __init__(self):
        self.__loadConf = LoadAppConf()
        self.__bizDb = MysqlPythonFacotry(self.__loadConf.biz_db_host,
                                          self.__loadConf.biz_db_user,
                                          self.__loadConf.biz_db_password,
                                          self.__loadConf.biz_db_database)
        self.__users = Biz_bx_users(self.__bizDb)
        self.__books = Biz_bx_books(self.__bizDb)
        self.__book_ratings = Biz_bx_book_ratings(self.__bizDb)
        self.__bizDb.open()
    ## End def __init__

    def toDB(self, path=''):
        """
        loads the BX book dataset. Path is where the BX files are
        located
        """
        i = 0  # total number of lines processed
        j = 0  # line number within the current file
        try:
            #
            # First load book ratings into the bx_book_ratings table
            #
            f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8')
            for line in f:
                i += 1
                j += 1
                print(j)
                print(line)
                #separate line into fields
                fields = line.split(';')
                user = fields[0].strip('"')
                book = fields[1].strip('"')
                rating = int(fields[2].strip().strip('"'))
                self.__book_ratings.insert(user, book, rating)
            f.close()
            #
            # Now load books into the bx_books table
            # Books contains isbn, title, and author among other fields
            #
            j = 0
            f = codecs.open(path + "BX-Books.csv", 'r', 'utf8')
            for line in f:
                i += 1
                j += 1
                print(j)
                print(line)
                #separate line into fields
                fields = regx(line)
                isbn = fields[0].strip('"')
                title = fields[1].strip('"')
                author = fields[2].strip().strip('"')
                yearOfPublication = fields[3].strip().strip('"')
                publisher = fields[4].strip().strip('"')
                imageUrlS = fields[5].strip().strip('"')
                imageUrlM = fields[6].strip().strip('"')
                imageUrlL = fields[7].strip().strip('"')
                self.__books.insert(isbn, title, author, yearOfPublication, publisher, imageUrlS, imageUrlM, imageUrlL)
            f.close()
            #
            # Now load user info into the bx_users table
            #
            j = 0
            f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')
            for line in f:
                i += 1
                j += 1
                print(j)
                print(line)
                #separate line into fields
                fields = regx(line)
                userid = fields[0].strip('"')
                location = fields[1].strip('"')
                if len(fields) > 2:
                    age = fields[2].strip().strip('"')
                else:
                    # no quoted age field on this line: store 0
                    age = 0
                self.__users.insert(userid, location, age)
            f.close()
        except ErrorMyProgram as e:
            print(e.value)
        finally:
            self.__bizDb.close()
            print(i)
    ## End def toDB
## End class LoadDataset
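The script reads its connection settings from a file named `app.conf` with a `[biz_db]` section, which the post does not show. Below is a minimal sketch of what that file might contain, parsed the same way `LoadAppConf` does. All values are placeholders, and the sketch uses the Python 3 module name `configparser`.

```python
import configparser  # Python 3 name; on Python 2 the module is ConfigParser

# Placeholder values; replace them with your own MySQL settings.
SAMPLE_APP_CONF = """
[biz_db]
host = localhost
user = root
password = secret
database = bookcrossing
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_APP_CONF)

# The four keys that LoadAppConf reads from the [biz_db] section.
settings = {key: config.get("biz_db", key)
            for key in ("host", "user", "password", "database")}
print(settings)
```

With such a file saved as `app.conf` next to the script, and the tables `bx_users`, `bx_books` and `bx_book_ratings` already created, the import could be started with something like `LoadDataset().toDB('/path/to/data/')`, where the path is a placeholder. Note that `toDB` concatenates `path` and the file name directly, so the path needs a trailing separator.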