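scrapy-redis: a distributed crawler for bilibili user data
In Scrapy, every requested URL gets a fingerprint, which is used to tell whether that URL has already been requested. Fingerprinting is enabled by default, so each URL is requested only once. When the crawl is distributed across several machines, a fingerprint is still needed so that the machines do not collect duplicate data. However, Scrapy stores its fingerprints locally by default. So instead we can store the fingerprints in Redis and use a Redis set to check for duplicates.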
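To make the idea concrete, here is a minimal sketch of set-based deduplication. It is not the scrapy-redis implementation: `fingerprint`, `SharedSeenSet`, and `request_seen` are hypothetical names, the hash is a simplified stand-in for Scrapy's request fingerprint, and an in-memory class stands in for the Redis set so the example runs without a server. The `add` method follows the same contract as Redis `SADD` (1 if the member is new, 0 if it already exists).

```python
import hashlib


def fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Hash the parts of a request that identify it (simplified stand-in)."""
    h = hashlib.sha1()
    h.update(method.upper().encode())
    h.update(url.encode())
    h.update(body)
    return h.hexdigest()


class SharedSeenSet:
    """In-memory stand-in for a Redis set shared by every crawler process."""

    def __init__(self):
        self._members = set()

    def add(self, member: str) -> int:
        # Same contract as SADD: 1 if newly added, 0 if already present.
        if member in self._members:
            return 0
        self._members.add(member)
        return 1


def request_seen(seen: SharedSeenSet, method: str, url: str) -> bool:
    """Return True if another worker has already recorded this request."""
    return seen.add(fingerprint(method, url)) == 0


seen = SharedSeenSet()
url = "https://api.bilibili.com/x/relation/stat?vmid=1"
first = request_seen(seen, "GET", url)   # first worker to reach this URL
second = request_seen(seen, "GET", url)  # any later worker
print(first, second)
```

Because the check and the insert happen in one set operation, two machines asking about the same URL cannot both be told it is new.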
settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for bilibili project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'bilibili'

SPIDER_MODULES = ['bilibili.spiders']
NEWSPIDER_MODULE = 'bilibili.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'bilibili (+http://www.yourdomain.com)'

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'bilibili.middlewares.BilibiliSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'bilibili.middlewares.BilibiliDownloaderMiddleware': 543,
    'bilibili.middlewares.randomUserAgentMiddleware': 400,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'bilibili.pipelines.BilibiliPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# scrapy-redis: keep the scheduler queue and the duplicate filter in Redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
REDIS_URL = 'redis://@127.0.0.1:6379'
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
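The last four settings are what make the crawl distributed: the pending requests live in one queue in Redis, and every machine pulls its next request from there. As a rough illustration of what a shared priority queue means for crawl order, here is a sketch that uses `heapq` as a local stand-in. `SharedPriorityQueue` is a hypothetical class written for this example, not the scrapy-redis one, and the tie-breaking rule (first in, first out within a priority) is an assumption made to keep the output deterministic.

```python
import heapq
import itertools


class SharedPriorityQueue:
    """Local stand-in for a request queue that all workers pop from."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: insertion order

    def push(self, url: str, priority: int = 0) -> None:
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self):
        """Return the next URL, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = SharedPriorityQueue()
queue.push("https://space.bilibili.com/1/", priority=0)
queue.push("https://space.bilibili.com/2/", priority=10)
queue.push("https://space.bilibili.com/3/", priority=0)

# Whichever worker asks next gets the highest-priority pending request.
order = [queue.pop() for _ in range(3)]
print(order)
```

Since a request leaves the queue as soon as one worker pops it, no two workers are handed the same pending request.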
spider.py
# -*- coding: utf-8 -*-
import json
import re

import scrapy

from bilibili.items import BilibiliItem

# Maps each video category name returned by the API to the item field that flags it.
CATEGORY_FIELDS = {
    '动画': 'animation',
    '生活': 'Life',
    '音乐': 'Music',
    '游戏': 'Game',
    '舞蹈': 'Dance',
    '纪录片': 'Documentary',
    '鬼畜': 'Ghost',
    '科技': 'science',
    '番剧': 'Opera',
    '娱乐': 'entertainment',
    '影视': 'Movies',
    '国创': 'National',
    '数码': 'Digital',
    '时尚': 'fashion',
}


class BilibiliappSpider(scrapy.Spider):
    name = 'bilibiliapp'
    # allowed_domains = ['www.bilibili.com']
    # start_urls = ['http://www.bilibili.com/']

    def start_requests(self):
        # Request the follower statistics for user IDs 1 to 299.
        for i in range(1, 300):
            url = 'https://api.bilibili.com/x/relation/stat?vmid={}&jsonp=jsonp&callback=__jp3'.format(i)
            url_ajax = 'https://space.bilibili.com/{}/'.format(i)
            # A GET request takes the form scrapy.Request(url=, callback=)
            req = scrapy.Request(url=url, callback=self.parse, meta={'id': i})
            req.headers['referer'] = url_ajax
            yield req

    def parse(self, response):
        # The response is JSONP (the JSON object is wrapped in a callback call),
        # so pull out the {...} part before decoding it.
        comm = re.compile(r'({.*})')
        text = re.findall(comm, response.text)[0]
        data = json.loads(text)
        follower = data['data']['follower']
        following = data['data']['following']
        user_id = response.meta.get('id')
        url = 'https://space.bilibili.com/ajax/member/getSubmitVideos?mid={}&page=1&pagesize=25'.format(user_id)
        yield scrapy.Request(url=url, callback=self.getsubmit, meta={
            'id': user_id,
            'follower': follower,
            'following': following,
        })

    def getsubmit(self, response):
        data = json.loads(response.text)
        tlist = data['data']['tlist']
        tlist_list = []
        # tlist is compared against an empty list here; otherwise it is read as
        # a dict whose values each carry a category name.
        if tlist != []:
            for category in tlist.values():
                tlist_list.append(category['name'])
        else:
            # Placeholder value meaning "no interests".
            tlist_list = ['无爱好']
        follower = response.meta.get('follower')
        following = response.meta.get('following')
        user_id = response.meta.get('id')
        url = 'https://api.bilibili.com/x/space/acc/info?mid={}&jsonp=jsonp'.format(user_id)
        yield scrapy.Request(url=url, callback=self.space, meta={
            'id': user_id,
            'follower': follower,
            'following': following,
            'tlist_list': tlist_list,
        })

    def space(self, response):
        data = json.loads(response.text)
        item = BilibiliItem()
        item['name'] = data['data']['name']
        item['sex'] = data['data']['sex']
        item['level'] = data['data']['level']
        item['birthday'] = data['data']['birthday']
        item['follower'] = response.meta.get('follower')
        item['following'] = response.meta.get('following')
        # One flag per category: 1 if the user has videos in it, otherwise 0.
        for field in CATEGORY_FIELDS.values():
            item[field] = 0
        for category in response.meta.get('tlist_list'):
            field = CATEGORY_FIELDS.get(category)
            if field is not None:
                item[field] = 1
        yield item
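The first callback has to deal with JSONP: because the URL carries `callback=__jp3`, the body comes back as a function call wrapped around the JSON rather than as bare JSON. Below is a small standalone sketch of that unwrapping step. `strip_jsonp` is a hypothetical helper, and the sample payload is made up for illustration; only the `follower` and `following` keys are taken from the spider above.

```python
import json
import re

# Matches "callbackName( ... )" with an optional trailing semicolon.
JSONP_WRAPPER = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def strip_jsonp(text: str) -> dict:
    """Decode a JSONP body; fall back to plain JSON if there is no wrapper."""
    match = JSONP_WRAPPER.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Made-up sample body, shaped like the statistics response the spider reads.
sample = '__jp3({"data":{"following":5,"follower":42}})'
stats = strip_jsonp(sample)["data"]
print(stats["follower"], stats["following"])
```

Anchoring the pattern on the callback name, rather than grabbing everything between the first `{` and the last `}`, also lets the same helper accept a response that is already plain JSON.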
Setting up a User-Agent pool
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class randomUserAgentMiddleware(UserAgentMiddleware):
    # Pool of User-Agent strings; one is picked at random for each request.
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # setdefault only fills in the header if the request does not already have one.
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)
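The behaviour of the middleware can be tried outside Scrapy with a plain dict in place of the request headers. This is only a sketch under that assumption: `apply_random_user_agent` is a hypothetical helper, the two-entry pool is a shortened copy of the list above, and a seeded `random.Random` is passed in just to make runs repeatable.

```python
import random

# Shortened pool for the example; the real list is the one in the middleware.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


def apply_random_user_agent(headers: dict, rng: random.Random) -> dict:
    """Fill in User-Agent from the pool unless the caller already set one."""
    headers.setdefault("User-Agent", rng.choice(USER_AGENTS))
    return headers


rng = random.Random(0)
fresh = apply_random_user_agent({}, rng)
preset = apply_random_user_agent({"User-Agent": "custom-agent"}, rng)
print(fresh["User-Agent"] in USER_AGENTS, preset["User-Agent"])
```

The second call shows why `setdefault` is used: a request that already carries its own User-Agent keeps it, and only requests without one get a value from the pool.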
Git repository: https://github.com/18370652038/scrapy-bilibili