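Douban Movie Top 250 data analysis with Python and pandas

Data source: the Douban Movie Top 250 list
The crawler code is fairly simple
The data is reasonably realistic, so it supports a first pass of data analysis
It is a chance to practice the data preprocessing methods introduced in the previous posts
Finally, two visualization packages, matplotlib and pyecharts, are used to display part of the data
The data still deserves deeper digging; there is room for improvement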

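# As usual, start by importing the Python data analysis and plotting packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pyecharts import Bar  # pyecharts 0.x import path; 1.x and later use pyecharts.charts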

names=['num','title',"director","role","init_year","area","genre","rating_num","comment_num","comment","url"]
#"num#title#director#role#init_year#area#genre#rating_num#comment_num#comment#url"
df_1 = pd.read_excel("top250_f1.xls",index_col=None,header=None)
df_1.columns=names
df_1.head()

df_1.dtypes

num              int64
title           object
director        object
role            object
init_year       object
area            object
genre           object
rating_num     float64
comment_num      int64
comment         object
url             object
dtype: object

names1=["num","rank","alt_title","title","pubdate","language","writer","director","cast","movie_duration","year","movie_type","tags","image"]
df_2 = pd.read_excel("top250_f2.xlsx",index_col=None,header=None)
df_2.columns=names1
df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 14 columns):
num               250 non-null int64
rank              250 non-null float64
alt_title         250 non-null object
title             250 non-null object
pubdate           250 non-null object
language          250 non-null object
writer            250 non-null object
director          250 non-null object
cast              250 non-null object
movie_duration    250 non-null object
year              250 non-null object
movie_type        250 non-null object
tags              250 non-null object
image             250 non-null object
dtypes: float64(1), int64(1), object(12)
memory usage: 27.4+ KB

df_1_cut = df_1[['num','title','init_year','area','genre','rating_num','comment_num']]
df_2_cut = df_2[['num','language','director','cast','movie_duration','tags']]
df = pd.merge(df_1_cut,df_2_cut,how = 'outer',on = 'num')   # outer join on the key column 'num'
# df.to_excel("all_data_movie.xls",index=False)         # optionally save the merged data
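An outer join keeps rows that exist in only one of the two tables, so it is worth checking that the key really matches on both sides. The sketch below uses two made-up miniature frames (hypothetical stand-ins for df_1_cut and df_2_cut) and the `indicator` and `validate` options of `pd.merge` to make mismatches visible.

```python
import pandas as pd

# Hypothetical miniature stand-ins for df_1_cut and df_2_cut.
left = pd.DataFrame({"num": [1, 2, 3], "title": ["A", "B", "C"]})
right = pd.DataFrame({"num": [1, 2, 4], "language": ["en", "zh", "ja"]})

# validate raises MergeError if 'num' is duplicated on either side;
# indicator adds a '_merge' column saying where each row came from.
merged = pd.merge(left, right, how="outer", on="num",
                  validate="one_to_one", indicator=True)

# Rows present in only one of the two tables.
unmatched = merged[merged["_merge"] != "both"]
print(merged)
print(len(unmatched), "rows have no partner in the other table")
```

If `unmatched` is empty on the real data, an inner join would give the same result as the outer join used here.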

# Check for duplicate rows
df.duplicated()
df.duplicated().value_counts()

False    250
dtype: int64

df.title.unique()

array(['肖申克的救赎', '霸王别姬', '这个杀手不太冷', '阿甘正传', '美丽人生', '泰坦尼克号', '千与千寻',
       '辛德勒的名单', '盗梦空间', '机器人总动员', '三傻大闹宝莱坞', '忠犬八公的故事', '海上钢琴师', '放牛班的春天',
       '大话西游之大圣娶亲', '楚门的世界', '教父', '龙猫', '星际穿越', '熔炉', '触不可及', '无间道',
       '乱世佳人', '当幸福来敲门', '怦然心动', '天堂电影院', '十二怒汉', '鬼子来了', '蝙蝠侠：黑暗骑士',
       '疯狂动物城', '少年派的奇幻漂流', '活着', '搏击俱乐部', '指环王3：王者无敌', '天空之城',
       '大话西游之月光宝盒', '飞屋环游记', '罗马假日', '控方证人', '窃听风暴', '两杆大烟枪', '飞越疯人院',
       '闻香识女人', '哈尔的移动城堡', '辩护人', '海豚湾', 'V字仇杀队', '死亡诗社', '摔跤吧！爸爸', '教父2',
       '指环王2：双塔奇兵', '美丽心灵', '指环王1：魔戒再现', '饮食男女', '情书', '美国往事', '狮子王', '素媛',
       '钢琴家', '小鞋子', '七宗罪', '天使爱美丽', '被嫌弃的松子的一生', '致命魔术', '本杰明·巴顿奇事',
       '音乐之声', '西西里的美丽传说', '勇敢的心', '拯救大兵瑞恩', '黑客帝国', '低俗小说', '剪刀手爱德华',
       '让子弹飞', '看不见的客人', '沉默的羔羊', '蝴蝶效应', '入殓师', '大闹天宫', '春光乍泄', '末代皇帝',
       '心灵捕手', '玛丽和马克思', '阳光灿烂的日子', '哈利·波特与魔法石', '布达佩斯大饭店', '幽灵公主', '第六感',
       '禁闭岛', '重庆森林', '猫鼠游戏', '狩猎', '致命ID', '大鱼', '断背山', '甜蜜蜜',
       '射雕英雄传之东成西就', '告白', '一一', '加勒比海盗', '穿条纹睡衣的男孩', '阳光姐妹淘', '摩登时代',
       '阿凡达', '上帝之城', '爱在黎明破晓前', '消失的爱人', '风之谷', '爱在日落黄昏时', '侧耳倾听', '超脱',
       '倩女幽魂', '恐怖直播', '红辣椒', '小森林 夏秋篇', '喜剧之王', '菊次郎的夏天', '驯龙高手', '幸福终点站',
       '萤火虫之墓', '借东西的小人阿莉埃蒂', '岁月神偷', '神偷奶爸', '七武士', '杀人回忆', '贫民窟的百万富翁',
       '电锯惊魂', '喜宴', '谍影重重3', '真爱至上', '怪兽电力公司', '东邪西毒', '记忆碎片', '海洋',
       '黑天鹅', '雨人', '疯狂原始人', '卢旺达饭店', '小森林 冬春篇', '英雄本色', '哈利·波特与死亡圣器(下)',
       '燃情岁月', '7号房的礼物', '虎口脱险', '心迷宫', '萤火之森', '傲慢与偏见', '荒蛮故事', '海边的曼彻斯特',
       '请以你的名字呼唤我', '教父3', '恋恋笔记本', '完美的世界', '纵横四海', '花样年华', '唐伯虎点秋香',
       '超能陆战队', '玩具总动员3', '蝙蝠侠：黑暗骑士崛起', '时空恋旅人', '魂断蓝桥', '猜火车', '穿越时空的少女',
       '雨中曲', '二十二', '达拉斯买家俱乐部', '我是山姆', '人工智能', '冰川时代', '浪潮', '朗读者',
       '爆裂鼓手', '香水', '罗生门', '未麻的部屋', '阿飞正传', '血战钢锯岭', '一次别离', '被解救的姜戈',
       '可可西里', '追随', '恐怖游轮', '撞车', '战争之王', '头脑特工队', '地球上的星星', '房间', '无人知晓',
       '梦之安魂曲', '牯岭街少年杀人事件', '魔女宅急便', '谍影重重', '谍影重重2', '忠犬八公物语', '模仿游戏',
       '你的名字。', '惊魂记', '青蛇', '一个叫欧维的男人决定去死', '再次出发之纽约遇见你', '哪吒闹海', '完美陌生人',
       '东京物语', '小萝莉的猴神大叔', '黑客帝国3：矩阵革命', '源代码', '新龙门客栈', '终结者2：审判日',
       '末路狂花', '碧海蓝天', '秒速5厘米', '绿里奇迹', '这个男人来自地球', '海盗电台', '勇闯夺命岛',
       '城市之光', '初恋这件小事', '无耻混蛋', '卡萨布兰卡', '变脸', 'E.T. 外星人', '爱在午夜降临前',
       '发条橙', '步履不停', '黄金三镖客', '无敌破坏王', '疯狂的石头', '美国丽人', '荒野生存', '迁徙的鸟',
       '英国病人', '海街日记', '彗星来的那一夜', '国王的演讲', '非常嫌疑犯', '血钻', '燕尾蝶', '聚焦',
       '勇士', '叫我第一名', '穆赫兰道', '遗愿清单', '枪火', '上帝也疯狂', '我爱你', '黑鹰坠落', '荒岛余生',
       '大卫·戈尔的一生', '千钧一发', '蓝色大门', '2001太空漫游'], dtype=object)

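#  Initial cleanup of the data format:
#  strip the leading "['" and the trailing "']" from the list-like strings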
df['genre']=df['genre'].str[2:-2]
df["language"]= df['language'].str[2:-2]
df["director"]= df['director'].str[2:-2]
df["cast"]= df['cast'].str[2:-2]
df["movie_duration"]= df['movie_duration'].str[2:-2]
# df[["genre","language","director","cast","movie_duration"]]=df[["genre","language","director","cast","movie_duration"]].apply(lambda x: x.replace("['","").replace("']",""))

# Clean up the area (country/region) data
area_split = df['area'].str.split(expand=True)
area_split.head()
all_area = area_split.apply(pd.Series.value_counts).fillna(0)   # pd.value_counts is deprecated in recent pandas
all_area.columns = ['area_1','area_2','area_3','area_4','area_5']
all_area = all_area.astype("int")
all_area.dtypes

area_1    int32
area_2    int32
area_3    int32
area_4    int32
area_5    int32
dtype: object

all_area.head()

all_area['Col_sum'] = all_area.sum(axis=1)

all_area.head()
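The split, per-column value_counts, and row sum above can also be written as a single chain with `Series.explode` (available since pandas 0.25). This is only a sketch on a few made-up area strings, not the real column:

```python
import pandas as pd

# Made-up sample in the same space-separated format as df['area'].
area = pd.Series(["美国", "中国大陆 香港", "法国 美国", "美国 英国"], name="area")

# Split each cell into a list, put every list element on its own row,
# then count how often each area occurs overall.
area_counts = area.str.split().explode().value_counts()
print(area_counts)
```

The result is one Series of totals, which corresponds to the Col_sum column. The same chain should work for the genre and language columns once they are in a separator-delimited form.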

categories = df['genre'].str.split(" ",expand=True)
categories = categories.apply(pd.Series.value_counts).fillna(0).astype("int")
categories.head()

categories['count'] = categories.sum(axis=1)
categories = categories.sort_values('count',ascending=False)   # sort_values returns a new DataFrame, so assign it back
categories.head()

# Processing the language column
df['language'].head(10)

0                         英语
1                      汉语普通话
2           英语', '意大利语', '法语
3                         英语
4           意大利语', '德语', '英语
5     英语', '意大利语', '德语', '俄语
6                         日语
7    英语', '希伯来语', '德语', '波兰语
8             英语', '日语', '法语
9                         英语
Name: language, dtype: object

language_all = df['language'].str.replace("\', \'"," ").str.split(" ",expand=True)
language_all.head()

language_all = language_all.apply(pd.Series.value_counts).fillna(0).astype("int")
language_all.head()

language_all['count'] = language_all.sum(axis=1)
language_all = language_all.sort_values('count',ascending=False)   # assign the sorted result back
language_all.head()

df.director.head()

0    弗兰克·德拉邦特 Frank Darabont
1             陈凯歌 Kaige Chen
2           吕克·贝松 Luc Besson
3            Robert Zemeckis
4    罗伯托·贝尼尼 Roberto Benigni
Name: director, dtype: object

director_all = df['director'].str.replace("\', \'","~").str.split("~",expand=True)
director_all.head()

	0	1	2
0	弗兰克·德拉邦特 Frank Darabont	None	None
1	陈凯歌 Kaige Chen	None	None
2	吕克·贝松 Luc Besson	None	None
3	Robert Zemeckis	None	None
4	罗伯托·贝尼尼 Roberto Benigni	None	None

#  Cast
df['cast'].head()

0    蒂姆·罗宾斯 Tim Robbins', '摩根·弗里曼 Morgan Freeman', ...
1    张国荣 Leslie Cheung', '张丰毅 Fengyi Zhang', '巩俐 Li...
2    让·雷诺 Jean Reno', '娜塔莉·波特曼 Natalie Portman', '加...
3    Tom Hanks', 'Robin Wright Penn', 'Gary Sinise'...
4    罗伯托·贝尼尼 Roberto Benigni', '尼可莱塔·布拉斯基 Nicoletta...
Name: cast, dtype: object

cast_all = df['cast'].str.replace("\', \'","~").str.split("~",expand=True)
cast_all.head(2)

main_dr= list(director_all[0])
second_dr= list(director_all[1])
third_dr= list(director_all[2])
directors=pd.Series(main_dr+second_dr+third_dr)
directors.value_counts().head()

宫崎骏 Hayao Miyazaki            7
克里斯托弗·诺兰 Christopher Nolan    7
王家卫 Kar Wai Wong              5
史蒂文·斯皮尔伯格 Steven Spielberg    5
大卫·芬奇 David Fincher           4
dtype: int64
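Concatenating the columns by hand works for three columns but has to be edited whenever the number of co-directors changes. A sketch of a more general variant, assuming a frame shaped like director_all (the names "Director A" and "Director B" are placeholders, not real data), uses `DataFrame.melt` to turn any number of columns into one long column:

```python
import pandas as pd

# Small stand-in for director_all; "Director A/B" are placeholder names.
director_all = pd.DataFrame({
    0: ["宫崎骏 Hayao Miyazaki", "Director A", "宫崎骏 Hayao Miyazaki"],
    1: [None, "Director B", None],
})

# melt stacks every column into a single 'director' column;
# dropna removes the empty co-director slots before counting.
directors = director_all.melt(value_name="director")["director"].dropna()
counts = directors.value_counts()
print(counts.head())
```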

df['init_year'].head()

0    1994
1    1993
2    1994
3    1994
4    1997
Name: init_year, dtype: object

# Keep only the first release year and drop the "(中国大陆)" (mainland China) suffix
year_= df['init_year'].str.split('/').apply(lambda x:x[0].strip()).replace(regex={r'\(中国大陆\)':''})
year_split = pd.to_datetime(year_).dt.year
year_split.head()

0    1994
1    1993
2    1994
3    1994
4    1997
Name: init_year, dtype: int64
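Since only the four-digit year is needed, a regular expression can pull it out directly instead of splitting, stripping the suffix, and going through `pd.to_datetime`. A minimal sketch; the multi-year string below is a hypothetical example of the messy format, modeled on the suffix handled above:

```python
import pandas as pd

# Hypothetical raw values; the second one imitates a cell with several release years.
raw_year = pd.Series([
    "1994",
    "1961(中国大陆) / 1964(中国大陆) / 1978(中国大陆)",
    " 2004 ",
])

# Take the first run of four digits in each cell and convert it to an integer.
year = raw_year.str.extract(r"(\d{4})", expand=False).astype(int)
print(year.tolist())
```

This assumes every cell contains at least one four-digit year; a cell without one would become NaN and make `astype(int)` fail, which is a useful signal that the data needs another look.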

df['movie_duration'].head()

0                       142分钟
1                      171 分钟
2    110分钟(剧场版)', '133分钟(国际版)
3                      142 分钟
4         116分钟', '125分钟(加长版)
Name: movie_duration, dtype: object

# Running time
movie_duration_split = df['movie_duration'].str.replace("\', \'","~").str.split("~",expand=True).fillna(0)
movie_duration_split =movie_duration_split.replace(regex={'分钟.*': ''})
df['movie_duration']=movie_duration_split[0].astype("int")
df['movie_duration'].head()

0    142
1    171
2    110
3    142
4    116
Name: movie_duration, dtype: int32
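The code above keeps only the first listed running time. When a film has several cuts, `Series.str.extractall` can collect every duration so that, for example, the shortest and longest version can be compared. A sketch using three values copied from the output shown earlier:

```python
import pandas as pd

# Three cells in the cleaned format printed above (after the [2:-2] slice).
raw = pd.Series(["142分钟", "171 分钟", "110分钟(剧场版)', '133分钟(国际版)"])

# extractall returns one row per match, indexed by (original row, match number).
minutes = raw.str.extractall(r"(\d+)\s*分钟")[0].astype(int)

# Collapse back to one row per film with the shortest and longest cut.
summary = minutes.groupby(level=0).agg(["min", "max"])
print(summary)
```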

# Tags
# Inspect the tags of the first movie
# pd.DataFrame(eval(df['tags'][0]))
df['tags'][0]

"[{'count': 220591, 'name': '经典'}, {'count': 191014, 'name': '励志'}, {'count': 173587, 'name': '信念'}, {'count': 159939, 'name': '自由'}, {'count': 115024, 'name': '人性'}, {'count': 111430, 'name': '美国'}, {'count': 93721, 'name': '人生'}, {'count': 72602, 'name': '剧情'}]"

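Here eval is used to turn the list that was saved as a string back into a real list, so the field can be processed with ordinary list methods; handling it as a string (regular-expression matching) would be too cumbersome. Keep in mind that eval executes whatever code the string contains, so it should only be used on data you trust.

To interleave the elements of two lists, itertools.chain.from_iterable is used: zip pairs up the elements of the two lists, and chain.from_iterable flattens those pairs, so the elements of the two lists alternate in one new long list.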

import itertools

all_tags = [ pd.DataFrame(eval(i)) for i in df["tags"]]
all_tags=[list(itertools.chain.from_iterable(zip(df_['name'],df_['count']))) for df_ in all_tags]
all_tags_df=pd.DataFrame(all_tags)
all_tags_df.head()
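The two steps described above can be tried in isolation. This sketch swaps eval for `ast.literal_eval` from the standard library, which only accepts Python literals and therefore does not run arbitrary code; the input is the first two tags of the string printed earlier.

```python
import ast
import itertools

# First two entries of the tag string shown above.
raw = "[{'count': 220591, 'name': '经典'}, {'count': 191014, 'name': '励志'}]"

# Parse the string into a list of dicts without executing code.
tags = ast.literal_eval(raw)
names = [t["name"] for t in tags]
counts = [t["count"] for t in tags]

# zip pairs each name with its count; chain.from_iterable flattens the pairs,
# so names and counts alternate in a single list.
interleaved = list(itertools.chain.from_iterable(zip(names, counts)))
print(interleaved)  # ['经典', 220591, '励志', 191014]
```

With this layout the names sit in the even positions and the counts in the odd positions of each row.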

# Data analysis and visualization with matplotlib and pyecharts
import matplotlib.pyplot as plt
import matplotlib
from pyecharts import Bar
import seaborn as sns
matplotlib.rcParams["font.family"]=["simsunb"]
matplotlib.rcParams['font.size'] =15

plt.figure(figsize=(15,6))
plt.style.use('seaborn-whitegrid')  # named 'seaborn-v0_8-whitegrid' in matplotlib 3.6 and later
plt.subplot(1,2,1)
plt.scatter(df['rating_num'],df['num'])
plt.xlabel("rating_num")
plt.ylabel("ranking")
plt.gca().invert_yaxis()
plt.subplot(1,2,2)
plt.hist(df['rating_num'],bins=12)
plt.xlabel("rating_num")
plt.show()

plt.figure(figsize=(15,6))
plt.style.use('seaborn-whitegrid')
plt.subplot(1,2,1)
plt.scatter(df['movie_duration'],df['num'])
plt.xlabel("movie_duration")
plt.ylabel("ranking")
plt.gca().invert_yaxis()
plt.subplot(1,2,2)
plt.hist(df['movie_duration'],bins=15)
plt.xlabel("movie_duration")
plt.show()

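# Correlation between running time and rank. Common sense says the two should be largely unrelated:
# a good film is not necessarily long, and a long film is not necessarily good.
# The coefficient below (about -0.20) is indeed only a weak correlation.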
df['num'].corr(df['movie_duration'])

-0.19979596696001942
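`Series.corr` computes the Pearson coefficient by default, which measures linear association. Because the rank column is ordinal, a rank-based (Spearman) coefficient is a reasonable cross-check. The sketch below computes it as the Pearson correlation of the ranks, on toy numbers chosen only to show the difference between the two measures:

```python
import pandas as pd

# Toy data: y grows with x, but not linearly.
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([1, 4, 9, 16, 100])

pearson = x.corr(y)                  # linear association, below 1 here
spearman = x.rank().corr(y.rank())   # Spearman = Pearson on the ranks

print(round(pearson, 3), round(spearman, 3))
```

On the movie data the equivalent call would be `df['num'].rank().corr(df['movie_duration'].rank())`.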

df['init_year']=year_split
plt.figure(figsize=(15,6))
plt.style.use('seaborn-whitegrid')
plt.subplot(1,2,1)
plt.scatter(df['init_year'],df['num'])
plt.xlabel("init_year")
plt.ylabel("ranking")
plt.gca().invert_yaxis()
plt.subplot(1,2,2)
plt.hist(df['init_year'],bins=30)
plt.xlabel("init_year")
plt.show()

df['num'].corr(df['init_year'])
# Judging by the result, there is even less of a correlation here

0.041157240822869007

# import matplotlib.font_manager as fm
# fpath = 'C:\\Windows\\Fonts\\simsunb.ttf'
# prop=fm.FontProperties(fname=fpath)
# print(prop)
matplotlib.rcParams["font.family"]=["SimHei"]
plt.figure(figsize=(24,6))
all_area_new = all_area['Col_sum'].sort_values(ascending=False)
plt.bar(list(all_area_new.index),list(all_area_new))
plt.xticks(rotation=45)  # rotate the axis tick labels by 45°
plt.legend(labels=["count"],loc='upper center')
plt.show()

language_all['count'].sort_values(ascending=False).head()

英语       170
法语        40
日语        40
汉语普通话     34
德语        24
Name: count, dtype: int64

language_all['count'].sort_values(ascending=False).plot(kind='bar',figsize=(22,6))
plt.legend(labels=["language_count"],loc='upper center')
plt.show()

categories["count"].sort_values(ascending=False).plot(kind='bar',figsize=(22,6))
plt.legend(labels=["category_count"],loc='upper center')
plt.show()

all_tag_name = all_tags_df.loc[:,[0,2,4,6,8,10,12,14]].values.flatten()
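# flatten here is the NumPy method that collapses a multi-dimensional array into a one-dimensional one; it does the same thing as the code posted in the comment section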
all_tag_name = pd.Series(all_tag_name).value_counts()
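The hard-coded column list [0,2,4,...,14] assumes exactly eight tags per film. A sketch of a variant that takes every second column regardless of width, using a tiny stand-in for all_tags_df (the counts in the second row are made up):

```python
import pandas as pd

# Stand-in for all_tags_df: tag names in even columns, counts in odd columns.
all_tags_df = pd.DataFrame([
    ["经典", 220591, "励志", 191014],
    ["经典", 100, "爱情", 90],   # made-up counts
])

# iloc[:, ::2] selects columns 0, 2, 4, ... ; flatten turns the 2-D array into 1-D.
tag_names = all_tags_df.iloc[:, ::2].to_numpy().flatten()
tag_counts = pd.Series(tag_names).value_counts()
print(tag_counts)
```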


from pyecharts import WordCloud
wordcloud = WordCloud(width=1000,height=600)
wordcloud.add("",list(all_tag_name.index),list(all_tag_name.values),word_size_range=[20,100])
# all_tag_name.index and all_tag_name.values can also be passed directly, without converting them to lists
wordcloud

from wordcloud import WordCloud
font = r'C:\Windows\Fonts\simfang.ttf'
wordcloud = WordCloud(font_path=font,max_font_size = 35).generate(str(list(all_tag_name.index)))
plt.figure(figsize=(9,6))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

from pyecharts import  Bar
mybar= Bar("Movie genre analysis")
cate=categories['count'].sort_values(ascending=False)
mybar.add("Movie genre",cate.index,cate.values,mark_line=['max'],mark_point=['average'])
mybar

from pyecharts import  Pie
Top30_rating_num=df[['rating_num','title']].sort_values(['rating_num'],ascending=False).head(30)['rating_num'].value_counts()
Top30_rating_num
pie = Pie('Rating share among the 30 highest-rated movies',title_pos = 'center')
pie.add('',list(Top30_rating_num.index),Top30_rating_num.values,is_label_show = True,legend_orient = 'vertical',legend_pos = 'right')
pie
