scrapy-redis 分布式哔哩哔哩网站用户爬虫

scrapy里面，对每次请求的url都有一个指纹，这个指纹就是判断url是否被请求过的。默认是开启指纹即一个URL请求一次。如果我们使用分布式在多台机上面爬取数据，为了让爬虫的数据不重复，我们也需要一个指纹。但是scrapy默认的指纹是保持到本地的。所有我们可以使用redis来保持指纹，并且用redis里面的set集合来判断是否重复。

setting.py

# -*- coding: utf-8 -*-

# Scrapy settings for bilibili project

#

# For simplicity, this file contains only settings considered important or

# commonly used. You can find more settings consulting the documentation:

#

#     https://doc.scrapy.org/en/latest/topics/settings.html

#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'bilibili'

SPIDER_MODULES = ['bilibili.spiders']

NEWSPIDER_MODULE = 'bilibili.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent

#USER_AGENT = 'bilibili (+http://www.yourdomain.com)'

# Obey robots.txt rules

# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)

#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)

# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay

# See also autothrottle settings and docs

DOWNLOAD_DELAY = 1

# The download delay setting will honor only one of:

#CONCURRENT_REQUESTS_PER_DOMAIN = 16

#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)

#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)

#TELNETCONSOLE_ENABLED = False

# Override the default request headers:

DEFAULT_REQUEST_HEADERS = {

  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

  'Accept-Language': 'en',

}

# Enable or disable spider middlewares

# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html

#SPIDER_MIDDLEWARES = {

#    'bilibili.middlewares.BilibiliSpiderMiddleware': 543,

#}

# Enable or disable downloader middlewares

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

DOWNLOADER_MIDDLEWARES = {

    'bilibili.middlewares.BilibiliDownloaderMiddleware': 543,

    'bilibili.middlewares.randomUserAgentMiddleware':400

}

# Enable or disable extensions

# See https://doc.scrapy.org/en/latest/topics/extensions.html

#EXTENSIONS = {

#    'scrapy.extensions.telnet.TelnetConsole': None,

#}

# Configure item pipelines

# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

ITEM_PIPELINES = {

   'bilibili.pipelines.BilibiliPipeline': 300,

    'scrapy_redis.pipelines.RedisPipeline':300

}

# Enable and configure the AutoThrottle extension (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/autothrottle.html

#AUTOTHROTTLE_ENABLED = True

# The initial download delay

#AUTOTHROTTLE_START_DELAY = 5

# The maximum download delay to be set in case of high latencies

#AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should be sending in parallel to

# each remote server

#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:

#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

#HTTPCACHE_ENABLED = True

#HTTPCACHE_EXPIRATION_SECS = 0

#HTTPCACHE_DIR = 'httpcache'

#HTTPCACHE_IGNORE_HTTP_CODES = []

#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

SCHEDULER = 'scrapy_redis.scheduler.Scheduler'

DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

REDIS_URL = 'redis://@127.0.0.1:6379'

SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

　　spider.py

# -*- coding: utf-8 -*-

import scrapy

import json,re

from bilibili.items import BilibiliItem

class BilibiliappSpider(scrapy.Spider):

    name = 'bilibiliapp'

    # allowed_domains = ['www.bilibili.com']

    # start_urls = ['http://www.bilibili.com/']

    def start_requests(self):

            for i in range(1, 300):

                url = 'https://api.bilibili.com/x/relation/stat?vmid={}&jsonp=jsonp&callback=__jp3'.format(i)

                url_ajax = 'https://space.bilibili.com/{}/'.format(i)

                # get的时候是这个东东, scrapy.Request(url=, callback=)

                req = scrapy.Request(url=url,callback=self.parse,meta={'id':i})

                req.headers['referer'] = url_ajax

                yield req

    def parse(self, response):

        # print(response.text)

        comm = re.compile(r'({.*})')

        text = re.findall(comm,response.text)[0]

        data = json.loads(text)

        # print(data)

        follower = data['data']['follower']

        following = data['data']['following']

        id = response.meta.get('id')

        url = 'https://space.bilibili.com/ajax/member/getSubmitVideos?mid={}&page=1&pagesize=25'.format(id)

        yield scrapy.Request(url=url,callback=self.getsubmit,meta={

            'id':id,

            'follower':follower,

            'following':following

        })

    def getsubmit(self, response):

        # print(response.text)

        data = json.loads(response.text)

        tilst = data['data']['tlist']

        tlist_list = []

        if tilst != []:

            # print(tilst)

            for tils in tilst.values():

                # print(tils['name'])

                tlist_list.append(tils['name'])

        else:

            tlist_list = ['无爱好']

        follower = response.meta.get('follower')

        following = response.meta.get('following')

        id = response.meta.get('id')

        url = 'https://api.bilibili.com/x/space/acc/info?mid={}&jsonp=jsonp'.format(id)

        yield scrapy.Request(url=url,callback=self.space,meta={

            'id':id,

            'follower':follower,

            'following':following,

            'tlist_list':tlist_list

        })

    def space(self, respinse):

        # print(respinse.text)

        data = json.loads(respinse.text)

        name = data['data']['name']

        sex = data['data']['sex']

        level = data['data']['level']

        birthday = data['data']['birthday']

        tlist_list = respinse.meta.get('tlist_list')

        animation = 0

        Life = 0

        Music = 0

        Game = 0

        Dance = 0

        Documentary = 0

        Ghost = 0

        science = 0

        Opera = 0

        entertainment = 0

        Movies = 0

        National = 0

        Digital = 0

        fashion = 0

        for tlist in tlist_list:

            if tlist == '动画':

                animation = 1

            elif tlist == '生活':

                Life = 1

            elif tlist == '音乐':

                Music = 1

            elif tlist == '游戏':

                Game = 1

            elif tlist == '舞蹈':

                Dance = 1

            elif tlist == '纪录片':

                Documentary = 1

            elif tlist == '鬼畜':

                Ghost = 1

            elif tlist == '科技':

                science = 1

            elif tlist == '番剧':

                Opera =1

            elif tlist == '娱乐':

                entertainment = 1

            elif tlist == '影视':

                Movies = 1

            elif tlist == '国创':

                National = 1

            elif tlist == '数码':

                Digital = 1

            elif tlist == '时尚':

                fashion = 1

        item = BilibiliItem()

        item['name'] = name

        item['sex'] = sex

        item['level'] = level

        item['birthday'] = birthday

        item['follower'] = respinse.meta.get('follower')

        item['following'] = respinse.meta.get('following')

        item['animation'] = animation

        item['Life'] = Life

        item['Music'] = Music

        item['Game'] = Game

        item['Dance'] = Dance

        item['Documentary'] = Documentary

        item['Ghost'] = Ghost

        item['science'] = science

        item['Opera'] = Opera

        item['entertainment'] = entertainment

        item['Movies'] = Movies

        item['National'] = National

        item['Digital'] = Digital

        item['fashion'] = fashion

        yield item

设置ua池

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

import random

class randomUserAgentMiddleware(UserAgentMiddleware):

    def __init__(self,user_agent=''):

        self.user_agent = user_agent

    def process_request(self, request, spider):

        ua = random.choice(self.user_agent_list)

        if ua:

            request.headers.setdefault('User-Agent', ua)

    user_agent_list = [ \

        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1" \

        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", \

        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", \

        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", \

        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", \

        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", \

        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", \

        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \

        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \

        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \

        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \

        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \

        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \

        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \

        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \

        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", \

        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", \

        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"

    ]

git地址：https://github.com/18370652038/scrapy-bilibili

scrapy-redis 分布式哔哩哔哩网站用户爬虫的更多相关文章

爬虫--scrapy+redis分布式爬取58同城北京全站租房数据
作业需求: 1.基于Spider或者CrawlSpider进行租房信息的爬取 2.本机搭建分布式环境对租房信息进行爬取 3.搭建多台机器的分布式环境,多台机器同时进行租房数据爬取建议:用Pychar ...
如何下载B站哔哩哔哩(bilibili)弹幕网站上的视频呢？小白教你个简单方法
对于90后.00后来说,B站肯定听过吧.小编有一个苦恼的地方,有时候想把哔哩哔哩(bilibili)上看到的视频保存到手机相册,不知道咋操作啊.网上百度了下,都是要下载电脑软件的,有些还得要付费的.前 ...
scrapy之分布式
分布式爬虫概念:多台机器上可以执行同一个爬虫程序,实现网站数据的分布爬取. 原生的scrapy是不可以实现分布式爬虫? a) 调度器无法共享 b) 管道无法共享工具 scrapy-redis组件: ...
2019 哔哩哔哩java面试笔试题（含面试题解析）
本人5年开发经验.18年年底开始跑路找工作,在互联网寒冬下成功拿到阿里巴巴.今日头条.哔哩哔哩等公司offer,岗位是Java后端开发,因为发展原因最终选择去了哔哩哔哩,入职一年时间了,也成为了面 ...
最新哔哩哔哩java校招面经（含整理过的面试题大全）
从6月到10月,经过4个月努力和坚持,自己有幸拿到了网易雷火.京东.去哪儿.哔哩哔哩等10家互联网公司的校招Offer,因为某些自身原因最终选择了哔哩哔哩.6.7月主要是做系统复习.项目复盘.Leet ...
j2ee分布式架构 dubbo + springmvc + mybatis + ehcache + redis 分布式架构
介绍 <modules>  <module>jeesz-utils</module> ...
面试官问我，Redis分布式锁如何续期？懵了。
前言上一篇[面试官问我,使用Dubbo有没有遇到一些坑?我笑了.]之后,又有一位粉丝和我说在面试过程中被虐了.鉴于这位粉丝是之前肥朝的粉丝,而且周一又要开启新一轮的面试,为了回馈他长期以来的支持,所 ...
Redis 分布式缓存 Java 框架
为什么要在 Java 分布式应用程序中使用缓存? 在提高应用程序速度和性能上,每一毫秒都很重要.根据谷歌的一项研究,假如一个网站在3秒钟或更短时间内没有加载成功,会有 53% 的手机用户会离开. 缓存 ...
scrapy简单分布式爬虫
经过一段时间的折腾,终于整明白scrapy分布式是怎么个搞法了,特记录一点心得. 虽然scrapy能做的事情很多,但是要做到大规模的分布式应用则捉襟见肘.有能人改变了scrapy的队列调度,将起始的网 ...

随机推荐

mybatis进行分页，使用limit
这里记录两个思路: 首先是写一个不能执行的代码. <select id="query" parameterType="map" resultType=&q ...
高级Java工程师必备 ----- 深入分析 Java IO （一）BIO
BIO编程最原始BIO 网络编程的基本模型是C/S模型,即两个进程间的通信. 服务端提供IP和监听端口,客户端通过连接操作想服务端监听的地址发起连接请求,通过三次握手连接,如果连接成功建立,双方就可 ...
开发工作之外的修炼Live笔记
“开发工作之外的修炼”这期Live分享了下列话题: [1] 如何发现自己的兴趣 [2] 财富.资源与被动收入 [3] 目标管理 [4] 快速做选择 [5] 时间管理 [6] 如何投资自己 >&g ...
bzoj 1657 [Usaco2006 Mar]Mooo 奶牛的歌声——单调栈水题
题目:https://www.lydsy.com/JudgeOnline/problem.php?id=1657 #include<iostream> #include<cstdio ...
CentOS6.6中安装VNC server（CentOS配置远程桌面）
1.安装服务 yum install tigervnc-server 1 2 名字有点怪哦,CentOS5前叫vnc-server 2.运行并设置密码 vncserver + 回车 1 2 输入密码, ...
如何分析一个QT类
作者:gnuhpc 出处:http://www.cnblogs.com/gnuhpc/ 我们以QLineEdit这个类为例来看看如何学习分析一个QT类. 1.Public Types: 这是一个在这 ...
dede问答汉字变星号
在ask模块里面,question.php中,发现了2行代码 $data['title'] = preg_replace("#{$GLOBALS['cfg_replacestr']}#&qu ...
使用c语言实现的常用函数
/* 为了面试准备的,有些在工作中也可以用用,本人算法方面比较欠缺,如果有更优秀的算法麻烦告诉我啊 */ /* strcat的实现 */ #include <assert.h> char* ...
utunbu下的codeblocks配置openGL环境
真想骂娘阿,刚开始用utunbu,什么也不明白,不明白我装都软件都在哪里,不知道就像windows下的系统文件那样的文件在哪里,也不知道如何配置环境变量.就这样稀里糊涂的,还要抓紧时间装openGL, ...
HDU - 2571 命运 DP倍数跳跃处理
命运穿过幽谷意味着离大魔王lemon已经无限接近了! 可谁能想到,yifenfei在斩杀了一些虾兵蟹将后,却再次面临命运大迷宫的考验,这是魔王lemon设下的又一个机关.要知道,不论何人,若在迷宫中 ...

scrapy-redis 分布式哔哩哔哩网站用户爬虫

scrapy-redis 分布式哔哩哔哩网站用户爬虫的更多相关文章

随机推荐

热门专题