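[Python Crawler]: Fetching the Douban Movie Top 250 with a High-Performance Asynchronous Multi-Process Crawler
This post shows how to use a high-performance crawler to fetch and parse page content quickly. When a job needs many page requests and they are issued serially, one after another, most of the running time is spent waiting for responses, which is not worth it. A high-performance crawler instead fetches and parses the data concurrently (with multiple workers or asynchronous requests), so the results arrive in much less time. The post uses the Douban movie list as a worked example.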
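1. Page analysis
First, let's look at the HTML of the Douban movie page. In this example we want to scrape the title and the star rating of every movie in the Douban Top 250.
All of the movie entries sit inside the <ol class="grid_view"> tag, so the XPath expressions can be anchored on that element and applied to every entry under it. The title appears in two places: in the alt text of the poster image, and in span tags with class="title". Because a single movie can have several title spans, we avoid the ambiguity by reading the title from the image's alt attribute, which XPath can locate directly: //ol[@class="grid_view"]//div[@class="pic"]//a/img/@alt
The star rating is even simpler. Each rating is wrapped in a <div class="star"> tag, and the score itself sits in a span inside it, so the XPath expression is: //ol[@class="grid_view"]//div[@class="star"]/span[@class="rating_num"]/text()
This yields every rating on the page. Of course, it only covers the first page. How do we get the data on the second page?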
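The selection logic described above can be tried offline on a small sample. The sketch below is only an illustration: the markup is a hypothetical, heavily trimmed stand-in for the real page, and to stay dependency-free it uses the standard library's xml.etree.ElementTree instead of lxml. ElementTree supports only a subset of XPath (no text() or @alt steps), so the values are read with .text and .get("alt"); the real page is also not well-formed XML, which is one reason the post parses it with lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down markup that mimics the structure described above.
SAMPLE = """
<html><body>
<ol class="grid_view">
  <li><div class="item">
    <div class="pic"><a href="#"><img alt="肖申克的救赎" src="1.jpg"/></a></div>
    <div class="info"><div class="star"><span class="rating_num">9.7</span></div></div>
  </div></li>
  <li><div class="item">
    <div class="pic"><a href="#"><img alt="霸王别姬" src="2.jpg"/></a></div>
    <div class="info"><div class="star"><span class="rating_num">9.6</span></div></div>
  </div></li>
</ol>
</body></html>
"""


def extract(markup):
    """Return (title, rating) pairs found under <ol class="grid_view">."""
    root = ET.fromstring(markup.strip())
    grid = root.find('.//ol[@class="grid_view"]')
    # Title: alt attribute of the poster image inside div.pic.
    titles = [img.get("alt") for img in grid.findall('.//div[@class="pic"]//a/img')]
    # Rating: text of span.rating_num inside div.star.
    stars = [s.text for s in grid.findall('.//div[@class="star"]/span[@class="rating_num"]')]
    return list(zip(titles, stars))


rows = extract(SAMPLE)
for title, star in rows:
    print(title, star)
```

With lxml, the same two lists come straight from tree.xpath(...) using the full expressions given in the text.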
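2. Pagination
Pagination on Douban is fairly simple. The URLs of the first, second, and third pages are:
https://movie.douban.com/top250?start=0&filter=
https://movie.douban.com/top250?start=25&filter=
https://movie.douban.com/top250?start=50&filter=
The query string after the question mark shows that the page is fetched with a plain GET request. There is no POST body and no encrypted parameter to deal with, which keeps things easy.
The start parameter changes from page to page, while filter stays the same (empty). So all we need is the rule behind start.
start grows by 25 with every page, because each page lists exactly 25 movies. With that rule in hand, we can start writing the crawler.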
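As a quick check of that rule, the page URLs can be generated with the standard library alone. This is a small sketch; the names BASE_URL, PAGE_SIZE, TOTAL, and page_urls are illustrative and do not appear in the original script.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25   # movies per page
TOTAL = 250      # size of the list


def page_urls():
    """Build one URL per page: start = 0, 25, ..., 225."""
    urls = []
    for start in range(0, TOTAL, PAGE_SIZE):
        query = urlencode({"start": start, "filter": ""})
        urls.append(f"{BASE_URL}?{query}")
    return urls


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

The last offset is 225, not 250: ten pages of 25 movies cover the whole list.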
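3. Writing the crawler
The code imports a pool class from the multiprocessing library. Using it takes only two extra lines on top of the serial code:
pool = Pool(4)
pool.map(get_information, number_ls)
In the first line, the argument 4 creates a pool of four workers to fetch the data. A larger number lets more pages be in flight at once. Note that the Pool imported in this post comes from multiprocessing.dummy, which is backed by threads rather than processes. Since the workers spend most of their time waiting on the network, the pool size is not capped by the number of CPU cores; matching the worker count to the core count matters mainly for a real process pool doing CPU-bound work. In practice the limit is how many simultaneous requests the site will tolerate.
The second line calls map. Its first argument is the function that does the crawling, and its second is the collection of arguments for that function. map calls the function once per item, spreads the calls across the workers, and returns once all of them have finished.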
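To see why a pool helps for network-bound work, the sketch below replaces the real request with a short sleep and times a serial loop against pool.map. It is a simulation under stated assumptions: fake_fetch and the 0.1 second delay are made up for illustration, and no request is sent.

```python
import time
from multiprocessing.dummy import Pool  # thread-backed pool, as in the post


def fake_fetch(start):
    """Stand-in for one page request: wait as if on the network."""
    time.sleep(0.1)
    return start


offsets = list(range(0, 250, 25))

# Serial: every call waits for the previous one to finish.
t0 = time.perf_counter()
serial = [fake_fetch(s) for s in offsets]
serial_time = time.perf_counter() - t0

# Pooled: up to four calls wait at the same time.
t0 = time.perf_counter()
with Pool(4) as pool:
    pooled = pool.map(fake_fetch, offsets)
pooled_time = time.perf_counter() - t0

print(f"serial: {serial_time:.2f}s, pooled: {pooled_time:.2f}s")
```

Both runs return the same values; only the waiting overlaps, so the pooled run finishes in a fraction of the serial time.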
Remark:
When crawling with a pool, the argument number_ls passed to map must be an iterable such as a list. Inside get_information, each call receives a single element of that list, not the whole list.
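The remark can be checked with a tiny example. The function describe is hypothetical and only reports what it was given; it stands in for get_information.

```python
from multiprocessing.dummy import Pool


def describe(start):
    """Report the argument one call receives: a single int, not the list."""
    return f"start={start} ({type(start).__name__})"


number_ls = [0, 25, 50]
with Pool(2) as pool:
    seen = pool.map(describe, number_ls)

print(seen)
```

map also returns the results in the order of the input list, whichever worker finishes first.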
For that reason, the complete code is as follows:
import requests
from lxml import etree
from multiprocessing.dummy import Pool  # thread-backed pool with the multiprocessing API

# The original script also defined a long browser cookie string here, but it was
# never added to the request headers, so it is left out.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
}
url = 'https://movie.douban.com/top250'

# Offsets for the start parameter. Note that 250 is one step past the last real
# page (start=225); that request returns a page with no movies on it.
number_ls = list(range(0, 251, 25))
print(number_ls)


def get_information(start):
    """Fetch one page of the list and print each movie's title and rating."""
    param = {
        'start': start,
        'filter': '',
    }
    page_content = requests.get(url=url, headers=headers, params=param).text
    # Keep a copy of the raw page for debugging; one file per offset, so that
    # concurrent workers do not overwrite each other's file.
    with open(f'douban_{start}.html', 'w', encoding='utf-8') as fp:
        fp.write(page_content)
    tree = etree.HTML(page_content)
    video_titles = tree.xpath('//ol[@class="grid_view"]//div[@class="pic"]//a/img/@alt')
    stars = tree.xpath('//ol[@class="grid_view"]//div[@class="star"]/span[@class="rating_num"]/text()')
    # Titles and ratings come back in page order, so pair them up directly.
    for title, star in zip(video_titles, stars):
        print("the movie is ", title)
        print("the star is ", star)
        print()


if __name__ == '__main__':
    # Four workers, each handling one offset at a time.
    with Pool(4) as pool:
        pool.map(get_information, number_ls)
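4. Output
The output is as expected: 250 movies in total, as shown below. Because the pages are fetched concurrently, the entries appear in the order the workers finish, not in ranking order.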
[0, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250]
the movie is 搏击俱乐部
the star is 9.0

the movie is 教父2
the star is 9.2

the movie is 狮子王
the star is 9.0

the movie is 指环王2:双塔奇兵
the star is 9.1

the movie is 死亡诗社
the star is 9.1

the movie is 钢琴家
the star is 9.2

the movie is 黑客帝国
the star is 9.0

the movie is 指环王1:魔戒再现
the star is 9.0

the movie is 饮食男女
the star is 9.1

the movie is 窃听风暴
the star is 9.1

the movie is 美丽心灵
the star is 9.0

the movie is 让子弹飞
the star is 8.8

the movie is 绿皮书
the star is 8.9

the movie is 两杆大烟枪
the star is 9.1

the movie is 本杰明·巴顿奇事
the star is 8.9

the movie is 海蒂和爷爷
the star is 9.2

the movie is 飞越疯人院
the star is 9.1

the movie is 看不见的客人
the star is 8.8

the movie is 西西里的美丽传说
the star is 8.9

the movie is 拯救大兵瑞恩
the star is 9.0

the movie is 穿条纹睡衣的男孩
the star is 9.1

the movie is 小鞋子
the star is 9.2

the movie is 音乐之声
the star is 9.0

the movie is 情书
the star is 8.9

the movie is 海豚湾
the star is 9.3

the movie is 美国往事
the star is 9.2

the movie is 致命魔术
the star is 8.9

the movie is 沉默的羔羊
the star is 8.9

the movie is 低俗小说
the star is 8.9

the movie is 禁闭岛
the star is 8.8

the movie is 蝴蝶效应
the star is 8.8

the movie is 七宗罪
the star is 8.8

the movie is 心灵捕手
the star is 8.9

the movie is 布达佩斯大饭店
the star is 8.9

the movie is 春光乍泄
the star is 8.9

the movie is 摩登时代
the star is 9.3

the movie is 被嫌弃的松子的一生
the star is 8.9

the movie is 哈利·波特与死亡圣器(下)
the star is 8.9

the movie is 阿凡达
the star is 8.7

the movie is 喜剧之王
the star is 8.8

the movie is 致命ID
the star is 8.8

the movie is 剪刀手爱德华
the star is 8.7

the movie is 勇敢的心
the star is 8.9

the movie is 加勒比海盗
the star is 8.8

the movie is 杀人回忆
the star is 8.9

the movie is 狩猎
the star is 9.1

the movie is 请以你的名字呼唤我
the star is 8.9

the movie is 天使爱美丽
the star is 8.7

the movie is 断背山
the star is 8.8

the movie is 红辣椒
the star is 9.0

the movie is 触不可及
the star is 9.2

the movie is 蝙蝠侠:黑暗骑士
the star is 9.2

the movie is 末代皇帝
the star is 9.3

the movie is 活着
the star is 9.3

the movie is 寻梦环游记
the star is 9.1

the movie is 乱世佳人
the star is 9.3

the movie is 何以为家
the star is 9.1

the movie is 指环王3:王者无敌
the star is 9.2

the movie is 飞屋环游记
the star is 9.0

the movie is 摔跤吧!爸爸
the star is 9.0

the movie is 哈利·波特与魔法石
the star is 9.1

the movie is 素媛
the star is 9.3

the movie is 少年派的奇幻漂流
the star is 9.1

the movie is 十二怒汉
the star is 9.4

the movie is 哈尔的移动城堡
the star is 9.1

the movie is 鬼子来了
the star is 9.3

the movie is 天空之城
the star is 9.1

the movie is 大话西游之月光宝盒
the star is 9.0

the movie is 我不是药神
the star is 9.0

the movie is 闻香识女人
the star is 9.1

the movie is 罗马假日
the star is 9.0

the movie is 天堂电影院
the star is 9.2

the movie is 辩护人
the star is 9.2

the movie is 猫鼠游戏
the star is 9.0

the movie is 大闹天宫
the star is 9.4

the movie is 肖申克的救赎
the star is 9.7

the movie is 霸王别姬
the star is 9.6

the movie is 阿甘正传
the star is 9.5

the movie is 这个杀手不太冷
the star is 9.4

the movie is 泰坦尼克号
the star is 9.4

the movie is 美丽人生
the star is 9.5

the movie is 千与千寻
the star is 9.4

the movie is 辛德勒的名单
the star is 9.5

the movie is 盗梦空间
the star is 9.3

the movie is 忠犬八公的故事
the star is 9.4

the movie is 星际穿越
the star is 9.3

the movie is 海上钢琴师
the star is 9.3

the movie is 楚门的世界
the star is 9.3

the movie is 三傻大闹宝莱坞
the star is 9.2

the movie is 机器人总动员
the star is 9.3

the movie is 放牛班的春天
the star is 9.3

the movie is 大话西游之大圣娶亲
the star is 9.2

the movie is 疯狂动物城
the star is 9.2

the movie is 无间道
the star is 9.2

the movie is 熔炉
the star is 9.3

the movie is 教父
the star is 9.3

the movie is 当幸福来敲门
the star is 9.1

the movie is 龙猫
the star is 9.2

the movie is 怦然心动
the star is 9.1

the movie is 控方证人
the star is 9.6

the movie is 7号房的礼物
the star is 8.9

the movie is 幽灵公主
the star is 8.9

the movie is 小森林 夏秋篇
the star is 9.0

the movie is 阳光灿烂的日子
the star is 8.8

the movie is 第六感
the star is 8.9

the movie is 重庆森林
the star is 8.8

the movie is 入殓师
the star is 8.9

the movie is 唐伯虎点秋香
the star is 8.7

the movie is 小森林 冬春篇
the star is 9.0

the movie is 爱在黎明破晓前
the star is 8.8

the movie is 超脱
the star is 8.9

the movie is 消失的爱人
the star is 8.7

the movie is 一一
the star is 9.0

the movie is 菊次郎的夏天
the star is 8.8

the movie is 蝙蝠侠:黑暗骑士崛起
the star is 8.8

the movie is 侧耳倾听
the star is 8.9

the movie is 倩女幽魂
the star is 8.7

the movie is 功夫
the star is 8.6

the movie is 超能陆战队
the star is 8.7

the movie is 无人知晓
the star is 9.1

the movie is 人生果实
the star is 9.5

the movie is 萤火之森
the star is 8.9

the movie is 甜蜜蜜
the star is 8.8

the movie is 借东西的小人阿莉埃蒂
the star is 8.8

the movie is 玛丽和马克思
the star is 8.9

the movie is 爱在日落黄昏时
the star is 8.8

the movie is 驯龙高手
the star is 8.7

the movie is 完美的世界
the star is 9.1

the movie is 幸福终点站
the star is 8.8

the movie is 告白
the star is 8.7

the movie is 大鱼
the star is 8.8

the movie is 阳光姐妹淘
the star is 8.8

the movie is 射雕英雄传之东成西就
the star is 8.7

the movie is 哈利·波特与阿兹卡班的囚徒
the star is 8.8

the movie is 恐怖直播
the star is 8.8

the movie is 天书奇谭
the star is 9.2

the movie is 怪兽电力公司
the star is 8.7

the movie is 神偷奶爸
the star is 8.6

the movie is 玩具总动员3
the star is 8.8

the movie is 傲慢与偏见
the star is 8.6

the movie is 时空恋旅人
the star is 8.8

the movie is 哈利·波特与密室
the star is 8.7

the movie is 教父3
the star is 8.9

the movie is 釜山行
the star is 8.6

the movie is 血战钢锯岭
the star is 8.7

the movie is 哪吒闹海
the star is 9.1

the movie is 被解救的姜戈
the star is 8.7

the movie is 七武士
the star is 9.3

the movie is 喜宴
the star is 8.9

the movie is 电锯惊魂
the star is 8.7

the movie is 爆裂鼓手
the star is 8.7

the movie is 贫民窟的百万富翁
the star is 8.6

the movie is 萤火虫之墓
the star is 8.7

the movie is 东邪西毒
the star is 8.6

the movie is 海街日记
the star is 8.8

the movie is 黑天鹅
the star is 8.6

the movie is 惊魂记
the star is 9.0

the movie is 无敌破坏王
the star is 8.7

the movie is 你看起来好像很好吃
the star is 8.9

the movie is 冰川时代
the star is 8.6

the movie is 雨人
the star is 8.7

the movie is 小偷家族
the star is 8.7

the movie is 绿里奇迹
the star is 8.9

the movie is 恋恋笔记本
the star is 8.5

the movie is 爱在午夜降临前
the star is 8.8

the movie is 疯狂的石头
the star is 8.5

the movie is 哈利·波特与火焰杯
the star is 8.6

the movie is 寄生虫
the star is 8.7

the movie is 恐怖游轮
the star is 8.5

the movie is 奇迹男孩
the star is 8.6

the movie is 雨中曲
the star is 9.0

the movie is 魔女宅急便
the star is 8.7

the movie is 二十二
the star is 8.7

the movie is 海边的曼彻斯特
the star is 8.6

the movie is 房间
the star is 8.8

the movie is 风之谷
the star is 8.9

the movie is 一个叫欧维的男人决定去死
the star is 8.9

the movie is 我是山姆
the star is 8.9

the movie is 头号玩家
the star is 8.7

the movie is 英雄本色
the star is 8.7

the movie is 上帝之城
the star is 9.0

the movie is 谍影重重3
the star is 8.8

the movie is 疯狂原始人
the star is 8.7

the movie is 未麻的部屋
the star is 9.0

the movie is 岁月神偷
the star is 8.7

the movie is 卢旺达饭店
the star is 8.9

the movie is 纵横四海
the star is 8.8

the movie is 三块广告牌
the star is 8.7

the movie is 达拉斯买家俱乐部
the star is 8.8

the movie is 花样年华
the star is 8.7

the movie is 心迷宫
the star is 8.7

the movie is 记忆碎片
the star is 8.6

the movie is 模仿游戏
the star is 8.7

the movie is 黑客帝国3:矩阵革命
the star is 8.8

the movie is 新世界
the star is 8.8

the movie is 头脑特工队
the star is 8.7

the movie is 荒蛮故事
the star is 8.8

the movie is 你的名字。
the star is 8.4

the movie is 真爱至上
the star is 8.6

the movie is 忠犬八公物语
the star is 9.2

the movie is 谍影重重2
the star is 8.7

the movie is 阿飞正传
the star is 8.5

the movie is 地球上的星星
the star is 8.9

the movie is 彗星来的那一夜
the star is 8.5

the movie is 完美陌生人
the star is 8.5

the movie is 战争之王
the star is 8.7

the movie is 谍影重重
the star is 8.6

the movie is 香水
the star is 8.5

the movie is 东京教父
the star is 9.0

the movie is 东京物语
the star is 9.2

the movie is 朗读者
the star is 8.6

the movie is 千钧一发
the star is 8.8

the movie is 再次出发之纽约遇见你
the star is 8.6

the movie is 驴得水
the star is 8.3

the movie is 猜火车
the star is 8.5

the movie is 黑客帝国2:重装上阵
the star is 8.6

the movie is 无间道2
the star is 8.6

the movie is 我爱你
the star is 9.1

the movie is 浪潮
the star is 8.7

the movie is 崖上的波妞
the star is 8.5

the movie is 聚焦
the star is 8.8

the movie is 小萝莉的猴神大叔
the star is 8.4

the movie is 追随
the star is 8.9

the movie is 黑鹰坠落
the star is 8.7

the movie is 网络谜踪
the star is 8.6

the movie is 虎口脱险
the star is 8.9

the movie is 人工智能
the star is 8.7

the movie is 九品芝麻官
the star is 8.6

the movie is 2001太空漫游
the star is 8.8

the movie is 可可西里
the star is 8.8

the movie is 罗生门
the star is 8.8

the movie is 色,戒
the star is 8.5

the movie is 终结者2:审判日
the star is 8.7

the movie is 城市之光
the star is 9.3

the movie is 初恋这件小事
the star is 8.4

the movie is 魂断蓝桥
the star is 8.8

the movie is 牯岭街少年杀人事件
the star is 8.9

the movie is 遗愿清单
the star is 8.7

the movie is 大佛普拉斯
the star is 8.7

the movie is 新龙门客栈
the star is 8.6

the movie is 波西米亚狂想曲
the star is 8.7

the movie is 源代码
the star is 8.5

the movie is 青蛇
the star is 8.6

the movie is 海洋
the star is 9.1

the movie is 燃情岁月
the star is 8.8

the movie is 无耻混蛋
the star is 8.6

the movie is 疯狂的麦克斯4:狂暴之路
the star is 8.6

the movie is 血钻
the star is 8.7

the movie is 穿越时空的少女
the star is 8.6

the movie is 步履不停
the star is 8.8
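The listing above is not in ranking order, because each worker prints as soon as its page is ready. One way to get an ordered result, sketched here under stated assumptions, is to have the worker return its rows instead of printing them and to rely on map returning results in input order. fetch_page is a hypothetical stand-in that fabricates placeholder rows after a random delay; it does not contact Douban.

```python
import random
import time
from itertools import chain
from multiprocessing.dummy import Pool


def fetch_page(start):
    """Hypothetical worker: returns 25 (rank, title) rows for one page."""
    time.sleep(random.uniform(0, 0.05))  # pages finish in arbitrary order
    return [(start + i + 1, f"movie-{start + i + 1}") for i in range(25)]


offsets = range(0, 250, 25)
with Pool(4) as pool:
    # map returns one list per offset, in the same order as the offsets.
    pages = pool.map(fetch_page, offsets)

# Flatten the per-page lists into a single ranked list.
rows = list(chain.from_iterable(pages))
print(rows[0], rows[-1], len(rows))
```

In the real crawler, the worker would return the zipped titles and ratings for its page, and the printing would happen once, after map has returned.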