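Scraping all images under a Zhihu question with Scrapy

Preface:

　　1. I only want to download the images. I hold no rights to images that other people uploaded, so I download them to enjoy privately or to use as a phone wallpaper, never commercially.

　　2. Crawlers have a limited shelf life, so note that this code was written on 2019.02.13.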

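1. About crawling Zhihu

　　Anything you can reach on the web can in theory be scraped; the only questions are the site's anti-crawling measures and how complicated the scraping gets. Zhihu content is roughly questions plus answers (I have only just started using it, so this is a provisional picture). The typical flow is either <1> log in --> go to the home page --> click a question in the home page list --> view the question and answers --> view comments, or <2> find a question through Baidu --> view the question and answers. On the web version the second path needs no login, so there are two possible goals and approaches:

　　1.1. Start from the Zhihu home page and crawl all questions (or questions of some category), along with their answers (and comments)

　　　　This requires simulating the login, then visiting questions from the home page and fetching answers and comments from each question URL. The article on simulating the Zhihu login (https://blog.csdn.net/sinat_34200786/article/details/78449499) covers this, but these products (including their anti-crawling measures) keep changing, so you still have to analyse the details yourself.

　　1.2. Crawl the answers under a single question

　　　　At present these can be fetched directly without logging in. I tried the approach above and then found that my goal, downloading images, does not need a login at all; what held me up earlier was just one small mistake.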

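2. The Scrapy project

　　This project is also an excuse to brush up on Scrapy.

　　Watching the browser's network requests shows that what I want to scrape arrives as JSON from the following URL: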

　　　　https://www.zhihu.com/api/v4/questions/309298287/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp,is_labeled;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics&offset=3&limit=1&sort_by=default&platform=desktop

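　　It takes many parameters; the important ones are offset (how many answers to skip) and limit (how many answers to return per request). The request headers the browser sends are fairly elaborate, but it turned out that setting a basic "User-Agent" is about enough, since no login is needed. I use Scrapy's built-in ImagesPipeline; apart from the images I am not interested in any other information. The code follows: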
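　　To make the offset/limit paging concrete, here is a minimal sketch that builds one API URL per page. The helper name page_urls and the shortened include value (data[*].content) are my own assumptions for illustration; the real request in this post uses the much longer include list shown above, and the API may behave differently today.

```python
from urllib.parse import urlencode

# Endpoint pattern taken from the URL above; {qid} is the question id.
API = "https://www.zhihu.com/api/v4/questions/{qid}/answers"


def page_urls(question_id, total, limit=3, include="data[*].content"):
    """Build one URL per page so that every answer is requested exactly once.

    `total` is the number of answers, `limit` the page size. The short
    `include` default is an assumption, not the value used in the post.
    """
    urls = []
    for offset in range(0, total, limit):
        query = urlencode({
            "include": include,
            "offset": offset,
            "limit": limit,
            "sort_by": "default",
            "platform": "desktop",
        })
        urls.append(API.format(qid=question_id) + "?" + query)
    return urls


urls = page_urls(309298287, total=7)
for u in urls:
    print(u)
```

　　With 7 answers and a page size of 3 this yields offsets 0, 3 and 6, so the last page is simply shorter than the others.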

　　2.1. items.py is simple: it only stores the image URLs, as a list of the form ['xxx.jpg','yyy.jpg']

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Img584294770Item(scrapy.Item):
    # List of image URLs; ImagesPipeline reads it via IMAGES_URLS_FIELD
    imgs = scrapy.Field()

items.py

　　2.2. Img584294770.py is the spider file; it defines the crawl itself

# -*- coding: utf-8 -*-
#Author:lwx
#21090111 I want to scrape the images in the answers to Zhihu question id584294770
import json
import re

import requests
import scrapy
from scrapy import Spider

from zhihu.items import Img584294770Item


class Img584294770(Spider):
    name = 'Img584294770'
    start_urls=[
        'https://www.zhihu.com/',
        'https://www.zhihu.com/api/v4/questions/309298287/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp,is_labeled;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics&offset={offset}&limit={limit}&sort_by=default&platform=desktop'
    ]
    #print(start_urls[1].format(limit=1,offset=1));

    # Request headers
    headers = {
        'Accept':'*/*',
        #'Accept-Encoding':'gzip, deflate, br',# do not set this: with it the response came back garbled
        'Accept-Language':'zh-CN,zh;q=0.9',
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36',
        'x-requested-with':'fetch',
    }

    # Issue the requests
    def start_requests(self):
        # One small probe request, only to learn how many answers exist
        url = self.start_urls[1].format(offset=1,limit=1)
        res = requests.get(url,headers=self.headers)
        data = json.loads(res.text)
        if 'data' in data:# got data back
            # First read the total number of answers
            total = data['paging']['totals']
            # Fetch the answers three at a time; stepping the offset by the
            # page size avoids requesting pages past the last answer
            limit = 3
            for offset in range(0, total, limit):
                url = self.start_urls[1].format(offset=offset,limit=limit)
                yield scrapy.Request(url=url, callback=self.parse_imgs, headers = self.headers)

    # Collect the image URLs of every answer in one page of results
    def parse_imgs(self, response):
        res = json.loads(response.text)
        if 'data' in res:# got data back
            for d in res['data']:
                item = Img584294770Item()
                item['imgs'] = self.get_imgs(d['content'])
                yield item

    # Extract image URLs of the form src='xxx.jpg' or src="xxx.png" from a string
    def get_imgs(self,content):
        imgs_url_list = re.findall(r'''\ssrc=["'](.*?)["']''', content)
        # Keep only URLs whose extension is jpg or png
        return [u for u in imgs_url_list if u.split('.')[-1] in ('jpg', 'png')]

Img584294770
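　　The extension check in get_imgs compares whatever follows the last dot, so a URL that carries a query string (a.jpg?source=1) would be dropped. Below is a slightly more tolerant sketch that can be tried without Scrapy. The function name extract_image_urls, the extra .jpeg extension, the de-duplication, and the example.com sample HTML are all my own assumptions, not part of the original spider.

```python
import re
from urllib.parse import urlparse

# Matches src="..." as well as src='...'
IMG_SRC = re.compile(r"""\ssrc=["'](.*?)["']""")
ALLOWED = (".jpg", ".jpeg", ".png")


def extract_image_urls(html):
    """Return unique jpg/png URLs found in src attributes, in document order."""
    seen = set()
    result = []
    for url in IMG_SRC.findall(html):
        # Check the extension on the path only, ignoring any query string
        path = urlparse(url).path.lower()
        if path.endswith(ALLOWED) and url not in seen:
            seen.add(url)
            result.append(url)
    return result


sample = (
    '<img src="https://pic1.example.com/a.jpg?source=1">'
    "<img src='https://pic1.example.com/b.png'>"
    '<img src="https://pic1.example.com/c.gif">'
    '<img src="https://pic1.example.com/a.jpg?source=1">'
)
found = extract_image_urls(sample)
print(found)
```

　　A regular expression is good enough for this API's answer HTML as I saw it; for arbitrary pages a real HTML parser would be the safer choice.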

　　2.3. settings.py sets a few necessary parameters

# -*- coding: utf-8 -*-

# Scrapy settings for zhihu project

#

# For simplicity, this file contains only settings considered important or

# commonly used. You can find more settings consulting the documentation:

#

#     https://doc.scrapy.org/en/latest/topics/settings.html

#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'zhihu'

SPIDER_MODULES = ['zhihu.spiders']

NEWSPIDER_MODULE = 'zhihu.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent

#USER_AGENT = 'zhihu (+http://www.yourdomain.com)'

# Obey robots.txt rules

ROBOTSTXT_OBEY = False  # False means robots.txt is not obeyed, so content the site disallows is crawled too

# Configure maximum concurrent requests performed by Scrapy (default: 16)

#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)

# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay

# See also autothrottle settings and docs

DOWNLOAD_DELAY = 3  # wait a few seconds between requests

# The download delay setting will honor only one of:

#CONCURRENT_REQUESTS_PER_DOMAIN = 16

#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)

COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)

#TELNETCONSOLE_ENABLED = False

# Override the default request headers:

#DEFAULT_REQUEST_HEADERS = {

#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

#   'Accept-Language': 'en',

#}

# Enable or disable spider middlewares

# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html

#SPIDER_MIDDLEWARES = {

#    'zhihu.middlewares.ZhihuSpiderMiddleware': 543,

#}

# Enable or disable downloader middlewares

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

#DOWNLOADER_MIDDLEWARES = {

#    'zhihu.middlewares.ZhihuDownloaderMiddleware': 543,

#}

# Enable or disable extensions

# See https://doc.scrapy.org/en/latest/topics/extensions.html

#EXTENSIONS = {

#    'scrapy.extensions.telnet.TelnetConsole': None,

#}

# Configure item pipelines

# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import os

IMAGES_EXPIRES = 90  # image expiry in days; images downloaded within this period are not downloaded again
IMAGES_URLS_FIELD = "imgs"  # name of the item field that holds the image URLs
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')  # folder where the images are stored

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 200,
    #'zhihu.pipelines.ZhihuPipeline': 300,
    'zhihu.pipelines.Img584294770Pipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/autothrottle.html

#AUTOTHROTTLE_ENABLED = True

# The initial download delay

#AUTOTHROTTLE_START_DELAY = 5

# The maximum download delay to be set in case of high latencies

#AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should be sending in parallel to

# each remote server

#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:

#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

#HTTPCACHE_ENABLED = True

#HTTPCACHE_EXPIRATION_SECS = 0

#HTTPCACHE_DIR = 'httpcache'

#HTTPCACHE_IGNORE_HTTP_CODES = []

#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

settings.py
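　　ITEM_PIPELINES above also names zhihu.pipelines.Img584294770Pipeline, whose code is not shown in this post. Scrapy fails at startup if a listed pipeline class cannot be imported, so the entry either needs a matching class in pipelines.py or has to be removed. The class below is a purely hypothetical stand-in (the name ImageCountPipeline and its counting behaviour are my assumptions, not the original pipeline); it only counts the image URLs that pass through.

```python
class ImageCountPipeline:
    """Hypothetical stand-in for the pipeline that the post does not show."""

    def open_spider(self, spider):
        # Called once when the spider starts
        self.total = 0

    def process_item(self, item, spider):
        # Items are dict-like, so .get works for scrapy.Item and plain dicts
        self.total += len(item.get("imgs") or [])
        return item  # always hand the item on to later pipelines

    def close_spider(self, spider):
        # Called once when the spider finishes
        print("image URLs seen: %d" % self.total)


# Quick check outside Scrapy, using plain dicts in place of items
pipeline = ImageCountPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"imgs": ["a.jpg", "b.png"]}, spider=None)
pipeline.process_item({"imgs": []}, spider=None)
pipeline.close_spider(spider=None)
```

　　open_spider, process_item and close_spider are the standard hook names Scrapy calls on an item pipeline; a plain class with these methods is enough, no base class is required.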

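3. Summary

　　3.1. On 'utf-8' errors: save every file you use (items, spider or pipeline) as UTF-8. This problem makes me sick every time I see it, yet I never manage to remember it.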

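　　3.2. "ROBOTSTXT_OBEY = False": in the settings.py that Scrapy generates for a new project this value is True, i.e. the crawler obeys robots.txt. If you leave it that way, you will find that your spider clearly went in and looked around but, like a perfect gentleman, touched nothing: you get a 200 and none of the data you wanted. True is the mode that search engines typically use; crawlers like ours are exactly what site owners do not welcome, and they go beyond what robots.txt permits.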
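　　What "obeying robots.txt" means can be tried offline with Python's standard urllib.robotparser. The rules below are invented for illustration and are not Zhihu's actual robots.txt; the user agent name my-spider and the example.com URLs are likewise assumptions.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only: everything under /api/ is off limits
rules = """
User-agent: *
Disallow: /api/
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A crawler that obeys these rules must skip the API URL but may fetch the page
api_allowed = parser.can_fetch("my-spider", "https://www.example.com/api/v4/questions/1/answers")
page_allowed = parser.can_fetch("my-spider", "https://www.example.com/question/1")
print(api_allowed, page_allowed)
```

　　With ROBOTSTXT_OBEY = True, Scrapy performs a comparable check before each request and drops the ones that are disallowed, which matches the "went in but touched nothing" behaviour described above.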

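　　3.3. This small project dragged on for three days. Not because I had half forgotten the Scrapy workflow, but mainly because, under the pretext of analysing the page, I was watching Zhihu's big names show off (not really): it was an encoding problem. Even though it later turned out that the fix was simply not to set 'Accept-Encoding':'gzip, deflate, br' in the headers, encoding issues gave me a thorough beating once again, and because I forgot about encoding yet again I have to go back and chew through the book one more time.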
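　　One plausible explanation for the garbled response (an assumption on my part; I did not verify it) is that the server answered with Brotli ("br") compression because the header advertised it, while the client had no Brotli decoder installed and handed back the compressed bytes unchanged. The sketch below reproduces the effect with gzip from the standard library standing in for Brotli; the JSON payload is made up.

```python
import gzip
import json

# A made-up JSON body, similar in shape to one answer from the API
payload = json.dumps({"data": [{"content": '<img src="a.jpg">'}]}).encode("utf-8")

# What the server sends when the client advertises a compressed encoding
compressed = gzip.compress(payload)

# A client that cannot decode the encoding ends up treating these bytes as text
garbled = compressed.decode("utf-8", errors="replace")
print(repr(garbled[:20]))

# A client that can decode it recovers the original JSON
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
print(restored["data"][0]["content"])

# Safer header: advertise only encodings the client is known to decode
headers = {"Accept-Encoding": "gzip, deflate"}
```

　　Leaving Accept-Encoding out entirely, as the spider above does, lets the HTTP library advertise whatever it can actually decode.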

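More than 6000 images by now... Saving each one by right-clicking would be such a hassle, though the ones I actually like number only a hundred or so... Still a hassle; the code cannot pick them out by taste for you...

References: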

https://blog.csdn.net/xwbk12/article/details/79009995

https://blog.csdn.net/sinat_34200786/article/details/78449499
