scrapy-redis 分布式哔哩哔哩网站用户爬虫
scrapy里面,对每次请求的url都有一个指纹,这个指纹就是判断url是否被请求过的。默认是开启指纹即一个URL请求一次。如果我们使用分布式在多台机上面爬取数据,为了让爬虫的数据不重复,我们也需要一个指纹。但是scrapy默认的指纹是保持到本地的。所有我们可以使用redis来保持指纹,并且用redis里面的set集合来判断是否重复。
setting.py
# -*- coding: utf-8 -*- # Scrapy settings for bilibili project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'bilibili' SPIDER_MODULES = ['bilibili.spiders']
NEWSPIDER_MODULE = 'bilibili.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'bilibili (+http://www.yourdomain.com)' # Obey robots.txt rules
# ROBOTSTXT_OBEY = True # Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default)
#COOKIES_ENABLED = False # Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False # Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
} # Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'bilibili.middlewares.BilibiliSpiderMiddleware': 543,
#} # Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
'bilibili.middlewares.BilibiliDownloaderMiddleware': 543,
'bilibili.middlewares.randomUserAgentMiddleware':400
} # Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#} # Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'bilibili.pipelines.BilibiliPipeline': 300,
'scrapy_redis.pipelines.RedisPipeline':300
} # Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
REDIS_URL = 'redis://@127.0.0.1:6379'
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
spider.py
# -*- coding: utf-8 -*-
import scrapy
import json,re
from bilibili.items import BilibiliItem class BilibiliappSpider(scrapy.Spider):
name = 'bilibiliapp'
# allowed_domains = ['www.bilibili.com']
# start_urls = ['http://www.bilibili.com/']
def start_requests(self):
for i in range(1, 300): url = 'https://api.bilibili.com/x/relation/stat?vmid={}&jsonp=jsonp&callback=__jp3'.format(i)
url_ajax = 'https://space.bilibili.com/{}/'.format(i)
# get的时候是这个东东, scrapy.Request(url=, callback=)
req = scrapy.Request(url=url,callback=self.parse,meta={'id':i})
req.headers['referer'] = url_ajax yield req def parse(self, response):
# print(response.text)
comm = re.compile(r'({.*})')
text = re.findall(comm,response.text)[0]
data = json.loads(text)
# print(data)
follower = data['data']['follower']
following = data['data']['following']
id = response.meta.get('id')
url = 'https://space.bilibili.com/ajax/member/getSubmitVideos?mid={}&page=1&pagesize=25'.format(id)
yield scrapy.Request(url=url,callback=self.getsubmit,meta={
'id':id,
'follower':follower,
'following':following
}) def getsubmit(self, response):
# print(response.text)
data = json.loads(response.text)
tilst = data['data']['tlist']
tlist_list = []
if tilst != []:
# print(tilst)
for tils in tilst.values():
# print(tils['name'])
tlist_list.append(tils['name'])
else:
tlist_list = ['无爱好']
follower = response.meta.get('follower')
following = response.meta.get('following')
id = response.meta.get('id')
url = 'https://api.bilibili.com/x/space/acc/info?mid={}&jsonp=jsonp'.format(id)
yield scrapy.Request(url=url,callback=self.space,meta={
'id':id,
'follower':follower,
'following':following,
'tlist_list':tlist_list
}) def space(self, respinse):
# print(respinse.text)
data = json.loads(respinse.text)
name = data['data']['name']
sex = data['data']['sex']
level = data['data']['level']
birthday = data['data']['birthday']
tlist_list = respinse.meta.get('tlist_list')
animation = 0
Life = 0
Music = 0
Game = 0
Dance = 0
Documentary = 0
Ghost = 0
science = 0
Opera = 0
entertainment = 0
Movies = 0
National = 0
Digital = 0
fashion = 0
for tlist in tlist_list:
if tlist == '动画':
animation = 1
elif tlist == '生活':
Life = 1
elif tlist == '音乐':
Music = 1
elif tlist == '游戏':
Game = 1
elif tlist == '舞蹈':
Dance = 1
elif tlist == '纪录片':
Documentary = 1
elif tlist == '鬼畜':
Ghost = 1
elif tlist == '科技':
science = 1
elif tlist == '番剧':
Opera =1
elif tlist == '娱乐':
entertainment = 1
elif tlist == '影视':
Movies = 1
elif tlist == '国创':
National = 1
elif tlist == '数码':
Digital = 1
elif tlist == '时尚':
fashion = 1
item = BilibiliItem()
item['name'] = name
item['sex'] = sex
item['level'] = level
item['birthday'] = birthday
item['follower'] = respinse.meta.get('follower')
item['following'] = respinse.meta.get('following')
item['animation'] = animation
item['Life'] = Life
item['Music'] = Music
item['Game'] = Game
item['Dance'] = Dance
item['Documentary'] = Documentary
item['Ghost'] = Ghost
item['science'] = science
item['Opera'] = Opera
item['entertainment'] = entertainment
item['Movies'] = Movies
item['National'] = National
item['Digital'] = Digital
item['fashion'] = fashion
yield item
设置ua池
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random class randomUserAgentMiddleware(UserAgentMiddleware): def __init__(self,user_agent=''):
self.user_agent = user_agent def process_request(self, request, spider):
ua = random.choice(self.user_agent_list)
if ua:
request.headers.setdefault('User-Agent', ua)
user_agent_list = [ \
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1" \
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", \
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", \
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", \
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", \
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", \
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", \
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", \
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", \
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
git地址:https://github.com/18370652038/scrapy-bilibili
scrapy-redis 分布式哔哩哔哩网站用户爬虫的更多相关文章
- 爬虫--scrapy+redis分布式爬取58同城北京全站租房数据
作业需求: 1.基于Spider或者CrawlSpider进行租房信息的爬取 2.本机搭建分布式环境对租房信息进行爬取 3.搭建多台机器的分布式环境,多台机器同时进行租房数据爬取 建议:用Pychar ...
- 如何下载B站哔哩哔哩(bilibili)弹幕网站上的视频呢?小白教你个简单方法
对于90后.00后来说,B站肯定听过吧.小编有一个苦恼的地方,有时候想把哔哩哔哩(bilibili)上看到的视频保存到手机相册,不知道咋操作啊.网上百度了下,都是要下载电脑软件的,有些还得要付费的.前 ...
- scrapy之分布式
分布式爬虫 概念:多台机器上可以执行同一个爬虫程序,实现网站数据的分布爬取. 原生的scrapy是不可以实现分布式爬虫? a) 调度器无法共享 b) 管道无法共享 工具 scrapy-redis组件: ...
- 2019 哔哩哔哩java面试笔试题 (含面试题解析)
本人5年开发经验.18年年底开始跑路找工作,在互联网寒冬下成功拿到阿里巴巴.今日头条.哔哩哔哩等公司offer,岗位是Java后端开发,因为发展原因最终选择去了哔哩哔哩,入职一年时间了,也成为了面 ...
- 最新 哔哩哔哩java校招面经 (含整理过的面试题大全)
从6月到10月,经过4个月努力和坚持,自己有幸拿到了网易雷火.京东.去哪儿.哔哩哔哩等10家互联网公司的校招Offer,因为某些自身原因最终选择了哔哩哔哩.6.7月主要是做系统复习.项目复盘.Leet ...
- j2ee分布式架构 dubbo + springmvc + mybatis + ehcache + redis 分布式架构
介绍 <modules> <!-- jeesz 工具jar --> <module>jeesz-utils</module> ...
- 面试官问我,Redis分布式锁如何续期?懵了。
前言 上一篇[面试官问我,使用Dubbo有没有遇到一些坑?我笑了.]之后,又有一位粉丝和我说在面试过程中被虐了.鉴于这位粉丝是之前肥朝的粉丝,而且周一又要开启新一轮的面试,为了回馈他长期以来的支持,所 ...
- Redis 分布式缓存 Java 框架
为什么要在 Java 分布式应用程序中使用缓存? 在提高应用程序速度和性能上,每一毫秒都很重要.根据谷歌的一项研究,假如一个网站在3秒钟或更短时间内没有加载成功,会有 53% 的手机用户会离开. 缓存 ...
- scrapy简单分布式爬虫
经过一段时间的折腾,终于整明白scrapy分布式是怎么个搞法了,特记录一点心得. 虽然scrapy能做的事情很多,但是要做到大规模的分布式应用则捉襟见肘.有能人改变了scrapy的队列调度,将起始的网 ...
随机推荐
- 写个sleep玩玩
static void sig_when_weakup(int no){ printf("weakup weakup\n"); longjmp(buf, ); } void wea ...
- OpenCV——PS 滤镜算法之极坐标变换到平面坐标
// define head function #ifndef PS_ALGORITHM_H_INCLUDED #define PS_ALGORITHM_H_INCLUDED #include < ...
- 使用django-extension扩展django的manage――runscript命令
摘要:1.下载安装 1)$easy_installdjango-extensions 2)在INSTALLED_APP中添加'django_extensions'[python]INSTALL ...
- myeclipse 2017破解安装教程+开发环境部署(jdk+tomcat)
点击安装包,进入安装界面,点击next 选择接受协议,点击next 选择安装目录,点击next 格局自己电脑的机型选择32bit或64bit,点击next 安装完成后不要运行MyEclipse,将 & ...
- Hive操作笔记
hive库清表,删除数据 insert overwrite table lorry.bigdata select * from lorry.bigdata where 1=0 hive的simple模 ...
- 关于分支和主干Merge时要注意的事项
现在我们同时在主干和分支上进行开发, 当你需要将主干上某一工程代码 Merge到分支上(或者相反)时, 不要用check out 然后全部覆盖的方法, 这样不会关联源上的任何 history, 而且需 ...
- Mysql常用命令行大全(三)
/**操作数据库*/ SHOW DATABASES; CREATE DATABASE db; SHOW DATABASES; DROP DATABASE db; /**操作表*/ USE db; S ...
- 13 vue学习 package.json
一:package.json文件详解 管理你本地安装的npm包 .定义了这个项目所需要的各种模块,以及项目的配置信息(比如名称.版本.许可证等元数据).npm install命令根据这个配置文件,自动 ...
- qpython 读入数据问题: EOF error with input / raw_input
直接使用input会报错 EOF error with input / raw_input 原因是在qpy里console mode 命令行模式不是完全和pc上的命令行一致,所以input和raw_i ...
- vmware中centos6.5无法启动拷贝出里面的资料的方法
先说一下我的环境:windows7-x64位机器下安装的vmware虚拟机,里面安装的是centos6.5-x64位的系统. 系统崩溃的原因:从cenos拖拽一个文件到win7下,结果就卡死了.整个系 ...