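This post shows how to build a high-performance crawler that fetches and parses pages quickly. When a crawl needs many page requests and they are sent one after another, most of the running time is spent waiting for responses. A faster approach is to issue the requests concurrently, with a pool of workers or asynchronous I/O, so that the same data is collected in less time. The example used throughout is the Douban Movie Top 250 list.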

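1. Page analysis

First, look at the page source of Douban Movie. In this example we want the title and the star rating of every movie in the Top 250.

All of the movie entries sit inside the <ol class="grid_view"> tag, so the XPath expressions can be anchored on it. The title appears in two places: in the alt attribute of the poster image, and in span elements with class="title". A single movie can have several span titles, so to avoid confusion we take the title from the image, using the XPath expression //ol[@class="grid_view"]//div[@class="pic"]//a/img/@alt

The star rating is simpler. Each rating is a span with class="rating_num" inside a <div class="star">, so the XPath expression is: //ol[@class="grid_view"]//div[@class="star"]/span[@class="rating_num"]/text()

This returns every rating on the page. It only covers the first page, though. How do we get the data on the second page?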
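
The two expressions above can be tried on a small hand-written fragment before pointing them at the real site. The sketch below is only an illustration: the markup is a made-up sample that mimics the structure described above, and it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. ElementTree supports only a subset of XPath (no text() or @alt steps), so the element is matched first and the text or attribute is read from it afterwards.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup that imitates the structure of the list page.
SAMPLE = """
<html><body>
<ol class="grid_view">
  <li>
    <div class="pic"><a href="#"><img alt="Movie A" src="a.jpg"/></a></div>
    <div class="star"><span class="rating_num">9.7</span><span>votes</span></div>
  </li>
  <li>
    <div class="pic"><a href="#"><img alt="Movie B" src="b.jpg"/></a></div>
    <div class="star"><span class="rating_num">9.6</span><span>votes</span></div>
  </li>
</ol>
</body></html>
"""


def extract(markup):
    """Return (titles, stars) from a page shaped like SAMPLE."""
    root = ET.fromstring(markup)
    # Match the <img> elements, then read their alt attribute.
    images = root.findall(".//ol[@class='grid_view']//div[@class='pic']//a/img")
    titles = [img.get("alt") for img in images]
    # Match the rating spans, then read their text.
    spans = root.findall(
        ".//ol[@class='grid_view']//div[@class='star']/span[@class='rating_num']"
    )
    stars = [span.text for span in spans]
    return titles, stars


titles, stars = extract(SAMPLE)
print(titles, stars)
```

With lxml, as used later in this post, the full expressions including /@alt and /text() can be passed to tree.xpath() directly.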

2. Pagination

Pagination on Douban Movie is straightforward. The URLs of the first, second and third pages are:

https://movie.douban.com/top250?start=0&filter=
https://movie.douban.com/top250?start=25&filter=
https://movie.douban.com/top250?start=50&filter=

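The part after the question mark is a query string, so each page is fetched with a plain GET request. No POST request or encrypted form data is involved, which keeps things simple.

The start parameter changes from page to page while filter stays empty, so we only need to find the pattern behind start.

start increases by 25 with every page, because each page lists exactly 25 movies. With that pattern in hand, we can start writing the crawler.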
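
The pattern can be written down as a short helper. This is a minimal sketch; the function name page_urls and its parameters are made up for illustration, and it assumes the list holds 250 movies at 25 per page, as described above.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"


def page_urls(total=250, page_size=25):
    """Build one URL per page: start=0, 25, ..., up to but excluding total."""
    urls = []
    for start in range(0, total, page_size):
        query = urlencode({"start": start, "filter": ""})
        urls.append(f"{BASE_URL}?{query}")
    return urls


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

The requests library builds the same query string when the dictionary is passed through its params argument, which is what the full script later in this post does.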

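3. Writing the crawler

The code imports Pool from multiprocessing.dummy. This module exposes the same interface as the multiprocessing pool, but its workers are threads rather than processes. Using it takes only two extra lines on top of the serial code: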

pool=Pool(4)
pool.map(get_information,number_ls)

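In the first line, the argument 4 creates a pool of 4 workers for fetching the data. Because these workers are threads that spend most of their time waiting on the network, the pool size is not limited by the number of CPU cores. A larger pool keeps more requests in flight, up to the point where the site starts to throttle or block the client. The core count matters for CPU-bound work in a process pool (multiprocessing.Pool), where extra processes beyond the number of cores bring little benefit.

The second line calls map. Its first argument is the crawler function and its second argument is the list of arguments for that function. map calls the function once per element, spreads the calls across the workers, and returns once all of them have finished.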
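
To see why a thread pool helps with waiting-dominated work, the sketch below replaces the real download with a hypothetical fake_request that only sleeps for 0.1 seconds, then times a serial loop against a pool of 10 threads. The names and the sleep duration are assumptions made for the illustration; real timings depend on the network and the site.

```python
import time
from multiprocessing.dummy import Pool  # thread-backed pool


def fake_request(start):
    """Hypothetical stand-in for one page download: it only waits."""
    time.sleep(0.1)
    return start


def timed(run):
    """Call run() and return (result, elapsed seconds)."""
    began = time.perf_counter()
    result = run()
    return result, time.perf_counter() - began


def pooled():
    # 10 threads, chosen independently of the CPU core count.
    with Pool(10) as pool:
        return pool.map(fake_request, starts)


starts = list(range(0, 250, 25))  # 10 simulated pages

serial_result, serial_seconds = timed(lambda: [fake_request(s) for s in starts])
pooled_result, pooled_seconds = timed(pooled)

print(f"serial: {serial_seconds:.2f}s, pooled: {pooled_seconds:.2f}s")
```

The serial loop waits for each simulated request in turn, while the pool waits for all of them at the same time.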

Remark:
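The argument number_ls passed to pool.map must be an iterable such as a list. Inside get_information, each call receives a single element of that list, not the whole list. The calls run concurrently, so the order in which they finish is not guaranteed.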
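
The remark can be checked with a few lines. In this sketch the worker function describe is made up for illustration and does no network access; it only shows what each call receives and what pool.map hands back.

```python
from multiprocessing.dummy import Pool  # thread-backed pool


def describe(start):
    # start is one element of the list (for example 25), never the list itself.
    return f"top250?start={start}&filter="


def run(starts, workers=4):
    with Pool(workers) as pool:
        # One call per element; the returned list follows the input order.
        return pool.map(describe, starts)


number_ls = [0, 25, 50]
print(run(number_ls))
```

Although the calls may finish in any order, the list returned by pool.map is arranged in the order of the input list.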

With that in mind, the complete code is as follows:

import requests
from lxml import etree
from multiprocessing.dummy import Pool  # pool API backed by threads

# Cookie copied from a browser session. It is defined here but not sent with the request below.
cookie='bid=N3Zqe_FFUKc; douban-fav-remind=1; viewed="27093751"; _vwo_uuid_v2=D401F17C96234AE149C4E04B78C3C8066|6fcc3cefe576bff2b89cdf28c4c5f597; __gads=ID=21cdec44606b00df-2250ba4d7ac4009b:T=1604034713:RT=1604034713:S=ALNI_Mb6iYJKYfbUjLxlisTQX5HCODTGKg; gr_user_id=fb6ac40c-94c3-400e-b170-47e126a9b78a; _gid=GA1.2.1520341169.1612004212; _ga=GA1.2.645228582.1602221486; ll="108288"; UM_distinctid=17752f076e4530-0b6eef25ebabba-f7b1332-1fa400-17752f076e57f0; Hm_lvt_19fc7b106453f97b6a84d64302f21a04=1612004228; Hm_lpvt_19fc7b106453f97b6a84d64302f21a04=1612004253; ap_v=0,6.0; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1612004299%2C%22https%3A%2F%2Fwww.google.com%2F%22%5D; _pk_ses.100001.4cf6=*; __utma=30149280.645228582.1602221486.1611225800.1612004300.9; __utmb=30149280.0.10.1612004300; __utmc=30149280; __utmz=30149280.1612004300.9.9.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utma=223695111.645228582.1602221486.1612004300.1612004300.1; __utmb=223695111.0.10.1612004300; __utmc=223695111; __utmz=223695111.1612004300.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); _pk_id.100001.4cf6=9a1bb1df4597b334.1612004299.1.1612005471.1612004299.'
headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
}
url='https://movie.douban.com/top250'

# Build the list of start offsets, one per page.
# Note: range(0,251,25) also yields 250, one step past the last page (start=225);
# that request should return a page with no movies.
number_ls=[]
for i in range(0,251,25):
    number_ls.append(i)
print(number_ls)

def get_information(number_ls):
    # Despite its name, number_ls is a single start offset here, not the list.
    param={
        'start':number_ls,
        'filter' :''
    }
    page_content=requests.get(url=url,headers=headers,params=param).text
    # Save the raw HTML. Every worker writes to this same file.
    with open('douban.html','w',encoding='utf-8') as fp:
        fp.write(page_content)
    # Parse the page and extract titles and ratings with XPath.
    tree=etree.HTML(page_content)
    vedio_title=tree.xpath('//ol[@class="grid_view"]//div[@class="pic"]//a/img/@alt')
    star=tree.xpath('//ol[@class="grid_view"]//div[@class="star"]/span[@class="rating_num"]/text()')
    vedio_title_ls=[]
    star_ls=[]
    for i in vedio_title:
        vedio_title_ls.append(i)
    for i in star:
        star_ls.append(i)
    # Print each movie with its rating, followed by a blank line.
    j=0
    while j<len(star_ls):
        print("the movie is ",vedio_title_ls[j])
        print("the star is ",star_ls[j])
        print()
        j+=1

# Run get_information once per start offset on a pool of 4 threads.
pool=Pool(4)
pool.map(get_information,number_ls)

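4. Output

The crawler prints all 250 movies, as shown below. The first line is the list of start values. Because the pages are fetched concurrently, the movies are not printed in rank order: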

[0, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250]
the movie is 搏击俱乐部
the star is 9.0

the movie is 教父2
the star is 9.2

the movie is 狮子王
the star is 9.0

the movie is 指环王2:双塔奇兵
the star is 9.1

the movie is 死亡诗社
the star is 9.1

the movie is 钢琴家
the star is 9.2

the movie is 黑客帝国
the star is 9.0

the movie is 指环王1:魔戒再现
the star is 9.0

the movie is 饮食男女
the star is 9.1

the movie is 窃听风暴
the star is 9.1

the movie is 美丽心灵
the star is 9.0

the movie is 让子弹飞
the star is 8.8

the movie is 绿皮书
the star is 8.9

the movie is 两杆大烟枪
the star is 9.1

the movie is 本杰明·巴顿奇事
the star is 8.9

the movie is 海蒂和爷爷
the star is 9.2

the movie is 飞越疯人院
the star is 9.1

the movie is 看不见的客人
the star is 8.8

the movie is 西西里的美丽传说
the star is 8.9

the movie is 拯救大兵瑞恩
the star is 9.0

the movie is 穿条纹睡衣的男孩
the star is 9.1

the movie is 小鞋子
the star is 9.2

the movie is 音乐之声
the star is 9.0

the movie is 情书
the star is 8.9

the movie is 海豚湾
the star is 9.3

the movie is 美国往事
the star is 9.2

the movie is 致命魔术
the star is 8.9

the movie is 沉默的羔羊
the star is 8.9

the movie is 低俗小说
the star is 8.9

the movie is 禁闭岛
the star is 8.8

the movie is 蝴蝶效应
the star is 8.8

the movie is 七宗罪
the star is 8.8

the movie is 心灵捕手
the star is 8.9

the movie is 布达佩斯大饭店
the star is 8.9

the movie is 春光乍泄
the star is 8.9

the movie is 摩登时代
the star is 9.3

the movie is 被嫌弃的松子的一生
the star is 8.9

the movie is 哈利·波特与死亡圣器(下)
the star is 8.9

the movie is 阿凡达
the star is 8.7

the movie is 喜剧之王
the star is 8.8

the movie is 致命ID
the star is 8.8

the movie is 剪刀手爱德华
the star is 8.7

the movie is 勇敢的心
the star is 8.9

the movie is 加勒比海盗
the star is 8.8

the movie is 杀人回忆
the star is 8.9

the movie is 狩猎
the star is 9.1

the movie is 请以你的名字呼唤我
the star is 8.9

the movie is 天使爱美丽
the star is 8.7

the movie is 断背山
the star is 8.8

the movie is 红辣椒
the star is 9.0

the movie is 触不可及
the star is 9.2

the movie is 蝙蝠侠:黑暗骑士
the star is 9.2

the movie is 末代皇帝
the star is 9.3

the movie is 活着
the star is 9.3

the movie is 寻梦环游记
the star is 9.1

the movie is 乱世佳人
the star is 9.3

the movie is 何以为家
the star is 9.1

the movie is 指环王3:王者无敌
the star is 9.2

the movie is 飞屋环游记
the star is 9.0

the movie is 摔跤吧!爸爸
the star is 9.0

the movie is 哈利·波特与魔法石
the star is 9.1

the movie is 素媛
the star is 9.3

the movie is 少年派的奇幻漂流
the star is 9.1

the movie is 十二怒汉
the star is 9.4

the movie is 哈尔的移动城堡
the star is 9.1

the movie is 鬼子来了
the star is 9.3

the movie is 天空之城
the star is 9.1

the movie is 大话西游之月光宝盒
the star is 9.0

the movie is 我不是药神
the star is 9.0

the movie is 闻香识女人
the star is 9.1

the movie is 罗马假日
the star is 9.0

the movie is 天堂电影院
the star is 9.2

the movie is 辩护人
the star is 9.2

the movie is 猫鼠游戏
the star is 9.0

the movie is 大闹天宫
the star is 9.4

the movie is 肖申克的救赎
the star is 9.7

the movie is 霸王别姬
the star is 9.6

the movie is 阿甘正传
the star is 9.5

the movie is 这个杀手不太冷
the star is 9.4

the movie is 泰坦尼克号
the star is 9.4

the movie is 美丽人生
the star is 9.5

the movie is 千与千寻
the star is 9.4

the movie is 辛德勒的名单
the star is 9.5

the movie is 盗梦空间
the star is 9.3

the movie is 忠犬八公的故事
the star is 9.4

the movie is 星际穿越
the star is 9.3

the movie is 海上钢琴师
the star is 9.3

the movie is 楚门的世界
the star is 9.3

the movie is 三傻大闹宝莱坞
the star is 9.2

the movie is 机器人总动员
the star is 9.3

the movie is 放牛班的春天
the star is 9.3

the movie is 大话西游之大圣娶亲
the star is 9.2

the movie is 疯狂动物城
the star is 9.2

the movie is 无间道
the star is 9.2

the movie is 熔炉
the star is 9.3

the movie is 教父
the star is 9.3

the movie is 当幸福来敲门
the star is 9.1

the movie is 龙猫
the star is 9.2

the movie is 怦然心动
the star is 9.1

the movie is 控方证人
the star is 9.6

the movie is 7号房的礼物
the star is 8.9

the movie is 幽灵公主
the star is 8.9

the movie is 小森林 夏秋篇
the star is 9.0

the movie is 阳光灿烂的日子
the star is 8.8

the movie is 第六感
the star is 8.9

the movie is 重庆森林
the star is 8.8

the movie is 入殓师
the star is 8.9

the movie is 唐伯虎点秋香
the star is 8.7

the movie is 小森林 冬春篇
the star is 9.0

the movie is 爱在黎明破晓前
the star is 8.8

the movie is 超脱
the star is 8.9

the movie is 消失的爱人
the star is 8.7

the movie is 一一
the star is 9.0

the movie is 菊次郎的夏天
the star is 8.8

the movie is 蝙蝠侠:黑暗骑士崛起
the star is 8.8

the movie is 侧耳倾听
the star is 8.9

the movie is 倩女幽魂
the star is 8.7

the movie is 功夫
the star is 8.6

the movie is 超能陆战队
the star is 8.7

the movie is 无人知晓
the star is 9.1

the movie is 人生果实
the star is 9.5

the movie is 萤火之森
the star is 8.9

the movie is 甜蜜蜜
the star is 8.8

the movie is 借东西的小人阿莉埃蒂
the star is 8.8

the movie is 玛丽和马克思
the star is 8.9

the movie is 爱在日落黄昏时
the star is 8.8

the movie is 驯龙高手
the star is 8.7

the movie is 完美的世界
the star is 9.1

the movie is 幸福终点站
the star is 8.8

the movie is 告白
the star is 8.7

the movie is 大鱼
the star is 8.8

the movie is 阳光姐妹淘
the star is 8.8

the movie is 射雕英雄传之东成西就
the star is 8.7

the movie is 哈利·波特与阿兹卡班的囚徒
the star is 8.8

the movie is 恐怖直播
the star is 8.8

the movie is 天书奇谭
the star is 9.2

the movie is 怪兽电力公司
the star is 8.7

the movie is 神偷奶爸
the star is 8.6

the movie is 玩具总动员3
the star is 8.8

the movie is 傲慢与偏见
the star is 8.6

the movie is 时空恋旅人
the star is 8.8

the movie is 哈利·波特与密室
the star is 8.7

the movie is 教父3
the star is 8.9

the movie is 釜山行
the star is 8.6

the movie is 血战钢锯岭
the star is 8.7

the movie is 哪吒闹海
the star is 9.1

the movie is 被解救的姜戈
the star is 8.7

the movie is 七武士
the star is 9.3

the movie is 喜宴
the star is 8.9

the movie is 电锯惊魂
the star is 8.7

the movie is 爆裂鼓手
the star is 8.7

the movie is 贫民窟的百万富翁
the star is 8.6

the movie is 萤火虫之墓
the star is 8.7

the movie is 东邪西毒
the star is 8.6

the movie is 海街日记
the star is 8.8

the movie is 黑天鹅
the star is 8.6

the movie is 惊魂记
the star is 9.0

the movie is 无敌破坏王
the star is 8.7

the movie is 你看起来好像很好吃
the star is 8.9

the movie is 冰川时代
the star is 8.6

the movie is 雨人
the star is 8.7

the movie is 小偷家族
the star is 8.7

the movie is 绿里奇迹
the star is 8.9

the movie is 恋恋笔记本
the star is 8.5

the movie is 爱在午夜降临前
the star is 8.8

the movie is 疯狂的石头
the star is 8.5

the movie is 哈利·波特与火焰杯
the star is 8.6

the movie is 寄生虫
the star is 8.7

the movie is 恐怖游轮
the star is 8.5

the movie is 奇迹男孩
the star is 8.6

the movie is 雨中曲
the star is 9.0

the movie is 魔女宅急便
the star is 8.7

the movie is 二十二
the star is 8.7

the movie is 海边的曼彻斯特
the star is 8.6

the movie is 房间
the star is 8.8

the movie is 风之谷
the star is 8.9

the movie is 一个叫欧维的男人决定去死
the star is 8.9

the movie is 我是山姆
the star is 8.9

the movie is 头号玩家
the star is 8.7

the movie is 英雄本色
the star is 8.7

the movie is 上帝之城
the star is 9.0

the movie is 谍影重重3
the star is 8.8

the movie is 疯狂原始人
the star is 8.7

the movie is 未麻的部屋
the star is 9.0

the movie is 岁月神偷
the star is 8.7

the movie is 卢旺达饭店
the star is 8.9

the movie is 纵横四海
the star is 8.8

the movie is 三块广告牌
the star is 8.7

the movie is 达拉斯买家俱乐部
the star is 8.8

the movie is 花样年华
the star is 8.7

the movie is 心迷宫
the star is 8.7

the movie is 记忆碎片
the star is 8.6

the movie is 模仿游戏
the star is 8.7

the movie is 黑客帝国3:矩阵革命
the star is 8.8

the movie is 新世界
the star is 8.8

the movie is 头脑特工队
the star is 8.7

the movie is 荒蛮故事
the star is 8.8

the movie is 你的名字。
the star is 8.4

the movie is 真爱至上
the star is 8.6

the movie is 忠犬八公物语
the star is 9.2

the movie is 谍影重重2
the star is 8.7

the movie is 阿飞正传
the star is 8.5

the movie is 地球上的星星
the star is 8.9

the movie is 彗星来的那一夜
the star is 8.5

the movie is 完美陌生人
the star is 8.5

the movie is 战争之王
the star is 8.7

the movie is 谍影重重
the star is 8.6

the movie is 香水
the star is 8.5

the movie is 东京教父
the star is 9.0

the movie is 东京物语
the star is 9.2

the movie is 朗读者
the star is 8.6

the movie is 千钧一发
the star is 8.8

the movie is 再次出发之纽约遇见你
the star is 8.6

the movie is 驴得水
the star is 8.3

the movie is 猜火车
the star is 8.5

the movie is 黑客帝国2:重装上阵
the star is 8.6

the movie is 无间道2
the star is 8.6

the movie is 我爱你
the star is 9.1

the movie is 浪潮
the star is 8.7

the movie is 崖上的波妞
the star is 8.5

the movie is 聚焦
the star is 8.8

the movie is 小萝莉的猴神大叔
the star is 8.4

the movie is 追随
the star is 8.9

the movie is 黑鹰坠落
the star is 8.7

the movie is 网络谜踪
the star is 8.6

the movie is 虎口脱险
the star is 8.9

the movie is 人工智能
the star is 8.7

the movie is 九品芝麻官
the star is 8.6

the movie is 2001太空漫游
the star is 8.8

the movie is 可可西里
the star is 8.8

the movie is 罗生门
the star is 8.8

the movie is 色,戒
the star is 8.5

the movie is 终结者2:审判日
the star is 8.7

the movie is 城市之光
the star is 9.3

the movie is 初恋这件小事
the star is 8.4

the movie is 魂断蓝桥
the star is 8.8

the movie is 牯岭街少年杀人事件
the star is 8.9

the movie is 遗愿清单
the star is 8.7

the movie is 大佛普拉斯
the star is 8.7

the movie is 新龙门客栈
the star is 8.6

the movie is 波西米亚狂想曲
the star is 8.7

the movie is 源代码
the star is 8.5

the movie is 青蛇
the star is 8.6

the movie is 海洋
the star is 9.1

the movie is 燃情岁月
the star is 8.8

the movie is 无耻混蛋
the star is 8.6

the movie is 疯狂的麦克斯4:狂暴之路
the star is 8.6

the movie is 血钻
the star is 8.7

the movie is 穿越时空的少女
the star is 8.6

the movie is 步履不停
the star is 8.8
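
The listing above follows the order in which the workers happened to finish, since each worker prints as soon as its page is parsed. One possible way to get rank order, sketched here under assumptions, is to have the worker return its rows and print them only after pool.map has returned, because pool.map arranges its results in the order of the input list. The function parse_page below is a hypothetical stand-in that fabricates placeholder rows instead of downloading anything; in the real crawler it would fetch and parse one page.

```python
from itertools import chain
from multiprocessing.dummy import Pool  # thread-backed pool


def parse_page(start):
    """Hypothetical stand-in for download + parse of one page.

    Returns (rank, title, star) rows; here three placeholder rows per page.
    """
    return [(start + k + 1, f"movie-{start + k + 1}", "9.0") for k in range(3)]


def collect(starts, workers=4):
    """Run parse_page for every start offset and flatten the rows in input order."""
    with Pool(workers) as pool:
        pages = pool.map(parse_page, starts)  # ordered like starts
    return list(chain.from_iterable(pages))


rows = collect([0, 25, 50])
for rank, title, star in rows:
    print(rank, title, star)
```

Returning rows instead of printing inside the worker also keeps output lines from different threads from interleaving.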
