scrapy框架爬取图片并将图片保存到本地

如果基于scrapy进行图片数据的爬取

在爬虫文件中只需要解析提取出图片地址，然后将地址提交给管道
配置文件中：IMAGES_STORE = './imgsLib'
在管道文件中进行管道类的制定：

from scrapy.pipelines.images import ImagesPipeline
将管道类的父类修改成ImagesPipeline
重写父类的三个方法

# -*- coding: utf-8 -*-

import scrapy

from imgPro.items import ImgproItem

class ImgSpider(scrapy.Spider):

    name = 'img'

    # allowed_domains = ['www.xxx.com']

    start_urls = ['http://www.521609.com/daxuemeinv/']

    url = 'http://www.521609.com/daxuemeinv/list8%d.html'

    pageNum = 1

    def parse(self, response):

        li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')

        for li in li_list:

            img_src = 'http://www.521609.com'+li.xpath('./a[1]/img/@src').extract_first()

            item = ImgproItem()

            item['src'] = img_src

            yield item

        # if self.pageNum < 3:

        #     self.pageNum += 1

        #     new_url = format(self.url%self.pageNum)

        #     yield scrapy.Request(new_url,callback=self.parse)

img.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items

#

# See documentation in:

# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class ImgproItem(scrapy.Item):

    # define the fields for your item here like:

    src = scrapy.Field()

    # pass

items.py

# -*- coding: utf-8 -*-

# Define your item pipelines here

#

# Don't forget to add your pipeline to the ITEM_PIPELINES setting

# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.pipelines.images import  ImagesPipeline

import scrapy

# class ImgproPipeline(object):

#     def process_item(self, item, spider):

#         return item

class ImgproPipeline(ImagesPipeline):

    #对某一个媒体资源进行请求发送

    #item就是接收到的spider提交过来的item

    def get_media_requests(self, item, info):

        yield scrapy.Request(item['src'])

    #制定媒体数据存储的名称

    def file_path(self, request, response=None, info=None):

        name = request.url.split('/')[-1]

        print('正在下载：',name)

        return name

    #将item传递给下一个即将给执行的管道类

    def item_completed(self, results, item, info):

        return item

pipelines.py

# -*- coding: utf-8 -*-

# Scrapy settings for imgPro project

#

# For simplicity, this file contains only settings considered important or

# commonly used. You can find more settings consulting the documentation:

#

#     https://docs.scrapy.org/en/latest/topics/settings.html

#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html

#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'imgPro'

SPIDER_MODULES = ['imgPro.spiders']

NEWSPIDER_MODULE = 'imgPro.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'

# Crawl responsibly by identifying yourself (and your website) on the user-agent

#USER_AGENT = 'imgPro (+http://www.yourdomain.com)'

# Obey robots.txt rules

ROBOTSTXT_OBEY = False

LOG_LEVEL = 'ERROR'

# LOG_FILE = './log.txt'

IMAGES_STORE = './imgsLib'

# Configure maximum concurrent requests performed by Scrapy (default: 16)

#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)

# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay

# See also autothrottle settings and docs

#DOWNLOAD_DELAY = 3

# The download delay setting will honor only one of:

#CONCURRENT_REQUESTS_PER_DOMAIN = 16

#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)

# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)

#TELNETCONSOLE_ENABLED = False

# Override the default request headers:

#DEFAULT_REQUEST_HEADERS = {

#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

#   'Accept-Language': 'en',

#}

# Enable or disable spider middlewares

# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html

#SPIDER_MIDDLEWARES = {

#    'imgPro.middlewares.ImgproSpiderMiddleware': 543,

#}

# Enable or disable downloader middlewares

# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html

#DOWNLOADER_MIDDLEWARES = {

#    'imgPro.middlewares.ImgproDownloaderMiddleware': 543,

#}

# Enable or disable extensions

# See https://docs.scrapy.org/en/latest/topics/extensions.html

#EXTENSIONS = {

#    'scrapy.extensions.telnet.TelnetConsole': None,

#}

# Configure item pipelines

# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html

ITEM_PIPELINES = {

   'imgPro.pipelines.ImgproPipeline': 300,

}

# Enable and configure the AutoThrottle extension (disabled by default)

# See https://docs.scrapy.org/en/latest/topics/autothrottle.html

#AUTOTHROTTLE_ENABLED = True

# The initial download delay

#AUTOTHROTTLE_START_DELAY = 5

# The maximum download delay to be set in case of high latencies

#AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should be sending in parallel to

# each remote server

#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:

#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)

# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

#HTTPCACHE_ENABLED = True

#HTTPCACHE_EXPIRATION_SECS = 0

#HTTPCACHE_DIR = 'httpcache'

#HTTPCACHE_IGNORE_HTTP_CODES = []

#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

settings.py

scrapy框架爬取图片并将图片保存到本地的更多相关文章

使用scrapy框架爬取图片网全站图片(二十多万张)，并打包成exe可执行文件
目标网站:https://www.mn52.com/ 本文代码已上传至git和百度网盘,链接分享在文末网站概览目标,使用scrapy框架抓取全部图片并分类保存到本地. 1.创建scrapy项目 s ...
python爬虫---scrapy框架爬取图片,scrapy手动发送请求,发送post请求,提升爬取效率,请求传参(meta),五大核心组件,中间件
# settings 配置 UA USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, l ...
爬虫 Scrapy框架爬取图虫图片并下载
items.py,根据需求确定自己的数据要求 # -*- coding: utf-8 -*- # Define here the models for your scraped items # # S ...
Python多线程爬图&Scrapy框架爬图
一.背景对于日常Python爬虫由于效率问题,本次测试使用多线程和Scrapy框架来实现抓取斗图啦表情.由于IO操作不使用CPU,对于IO密集(磁盘IO/网络IO/人机交互IO)型适合用多线程,对于 ...
使用scrapy框架爬取自己的博文（2）
之前写了一篇用scrapy框架爬取自己博文的博客,后来发现对于中文的处理一直有问题- - 显示的时候 [u'python\u4e0b\u722c\u67d0\u4e2a\u7f51\u9875\u76 ...
php 获取远程图片保存到本地
php 获取远程图片保存到本地使用两个函数 1.获取远程文件 2.把图片保存到本地 /** * 获取远程图片并把它保存到本地 * $url 是远程图片的完整URL地址,不能为空. */ functi ...
iOS 将图片保存到本地
//将图片保存到本地 + (void)SaveImageToLocal:(UIImage*)image Keys:(NSString*)key { NSUserDefaults* prefer ...
iOS-iOS调用相机调用相册【将图片保存到本地相册】
设置头部代理 <UINavigationControllerDelegate, UIImagePickerControllerDelegate> 1.调用相机检测前置摄像头是否可用 - ...
Android View转为图片保存为本地文件，异步监听回调操作结果；
把手机上的一个View或ViewGroup转为Bitmap,再把Bitmap保存为.png格式的图片: 由于View转Bitmap.和Bitmap转图片都是耗时操作,(生成一个1M的图片大约500ms ...

随机推荐

python中is，== 和 in 的区别
python对象的三个基本要素:id(身份标识),type(数据类型)和value(值). is 运算符:判断的是对象间的唯一身份标识(id). == 运算符:判断的是对象间的value(值)是否相同 ...
WSL2 VS Code远程开发准备
上一节我们在linux中创建了mvc项目,但是要是在linux中用命令行直接开发的话,就有些扯了. 我们可以使用VS Code进行远程开发,简单来说,就是在windows中打开VS Code,打开Li ...
2.使用jenkins自动构建并发布应用到k8s集群
作者微信:tangy8080 电子邮箱:914661180@qq.com 更新时间:2019-06-21 14:39:01 星期五欢迎您订阅和分享我的订阅号,订阅号内会不定期分享一些我自己学习过程 ...
Taro 3.x in Action
Taro 3.x in Action React, 小程序 https://taro-docs.jd.com/taro/docs/README Taro Next 跨端, 跨框架 Taro 是一个开放 ...
CSS3实现垂直居中水平居中的技巧
1 1 1 How To Center Anything With CSS Front End posted by Code My Views 1 Recently, we took a dive i ...
npm published cli package's default install missing the `-g` flag
npm published cli package's default install missing the -g flag https://npm.community/t/npm-publishe ...
puppeteer & screenshot
puppeteer & screenshot http://localhost:9812/screenshot?url=https://cdn.xgqfrms.xyz/
js 创建简单的表单同步验证器
SyncValidate declare const uni: any; export interface SyncValidateOpt { [key: string]: SyncValidateF ...
NGK福利再升级，1万枚VAST限时免费送
NGK在推出持有算力获得SPC空投活动后,福利再升级,于美国加州时间2021年2月8日下午4点推出新人福利活动,注册NGK成为新会员,即可获得0.2枚VAST奖励. VAST免费福利送活动仅送出1万枚 ...
Python算法_斐波那契数列（10）
写一个函数,输入 n ,求斐波那契(Fibonacci)数列的第 n 项.斐波那契数列的定义如下: F(0) = 0, F(1) = 1F(N) = F(N - 1) + F(N - 2), 其中 ...

scrapy框架爬取图片并将图片保存到本地

scrapy框架爬取图片并将图片保存到本地的更多相关文章

随机推荐

热门专题