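To crawl image data with Scrapy:

  1. In the spider file, only parse out the image URLs and submit them to the pipeline.
  2. In the settings file: IMAGES_STORE = './imgsLib'
  3. Define the pipeline class in the pipelines file:
    • from scrapy.pipelines.images import ImagesPipeline
    • Change the parent class of the pipeline class to ImagesPipeline
    • Override three methods of the parent class: get_media_requests, file_path, and item_completed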

# -*- coding: utf-8 -*-
import scrapy
from imgPro.items import ImgproItem


class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/daxuemeinv/']
    # Template for the following list pages; %d is the page number
    url = 'http://www.521609.com/daxuemeinv/list8%d.html'
    pageNum = 1

    def parse(self, response):
        li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
        for li in li_list:
            # The src attribute is site-relative, so prepend the host
            img_src = 'http://www.521609.com' + li.xpath('./a[1]/img/@src').extract_first()
            item = ImgproItem()
            item['src'] = img_src
            # Submit the item (image URL only) to the pipeline
            yield item

        # if self.pageNum < 3:
        #     self.pageNum += 1
        #     new_url = format(self.url % self.pageNum)
        #     yield scrapy.Request(new_url, callback=self.parse)

img.py
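
The commented-out pagination in the spider fills the `url` template with a page number, and the image address is built by string concatenation. As a minimal sketch using only the standard library, the same two steps could look like this. The helper name `page_urls` and the path `/uploads/example.jpg` are hypothetical and only serve the illustration; inside a spider, `response.urljoin(...)` plays the role of `urljoin` here.

```python
from urllib.parse import urljoin

# Template taken from the spider above; %d is replaced by the page number.
URL_TEMPLATE = 'http://www.521609.com/daxuemeinv/list8%d.html'


def page_urls(first_page, last_page):
    """Return the list-page URLs for first_page..last_page (inclusive)."""
    return [URL_TEMPLATE % n for n in range(first_page, last_page + 1)]


def absolute_image_url(page_url, src):
    """Resolve a (possibly relative) src attribute against the page URL."""
    return urljoin(page_url, src)


if __name__ == '__main__':
    # The commented-out code would request pages 2 and 3.
    for u in page_urls(2, 3):
        print(u)
    # '/uploads/example.jpg' is a made-up src value.
    print(absolute_image_url('http://www.521609.com/daxuemeinv/', '/uploads/example.jpg'))
```

Resolving with `urljoin` also handles `src` values that are already absolute, which plain concatenation would break.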

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ImgproItem(scrapy.Item):
    # define the fields for your item here like:
    src = scrapy.Field()
    # pass

items.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.pipelines.images import ImagesPipeline
import scrapy

# class ImgproPipeline(object):
#     def process_item(self, item, spider):
#         return item


class ImgproPipeline(ImagesPipeline):

    # Send a request for each media resource.
    # item is the item submitted by the spider.
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    # Decide the file name under which the media data is stored.
    # Newer Scrapy versions also pass item as a keyword argument.
    def file_path(self, request, response=None, info=None, *, item=None):
        name = request.url.split('/')[-1]
        print('Downloading:', name)
        return name

    # Pass the item on to the next pipeline class to be executed.
    def item_completed(self, results, item, info):
        return item

pipelines.py
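
The `file_path` override above names each file after the last segment of its URL. That is simple, but a query string ends up in the file name, and two images with the same base name overwrite each other. The sketch below, which assumes nothing beyond the standard library, contrasts that rule with one possible alternative that appends a short hash of the full URL. The function names and example URLs are hypothetical. Scrapy's default `file_path` avoids collisions in a similar way by naming files after a SHA-1 hash of the URL.

```python
import hashlib
import posixpath
from urllib.parse import urlparse


def name_from_url(url):
    """Same rule as the pipeline above: last '/'-separated segment."""
    return url.split('/')[-1]


def safe_name_from_url(url):
    """Alternative sketch: drop the query string and add a short URL hash."""
    base = posixpath.basename(urlparse(url).path)
    stem, ext = posixpath.splitext(base)
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()[:8]
    return '%s_%s%s' % (stem, digest, ext)


if __name__ == '__main__':
    # Made-up URLs that share the same base name.
    a = 'http://www.521609.com/uploads/2019/a.jpg?v=2'
    b = 'http://www.521609.com/uploads/2020/a.jpg'
    print(name_from_url(a), name_from_url(b))
    print(safe_name_from_url(a), safe_name_from_url(b))
```

Either function could be called from `file_path` with `request.url` as its argument.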

# -*- coding: utf-8 -*-

# Scrapy settings for imgPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'imgPro'

SPIDER_MODULES = ['imgPro.spiders']
NEWSPIDER_MODULE = 'imgPro.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'imgPro (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

LOG_LEVEL = 'ERROR'
# LOG_FILE = './log.txt'

# Directory in which ImagesPipeline stores the downloaded images
IMAGES_STORE = './imgsLib'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'imgPro.middlewares.ImgproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'imgPro.middlewares.ImgproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'imgPro.pipelines.ImgproPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

settings.py
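
Most of the settings file is generated comments. As a condensed view, the settings that are actually active in this project, and that the image download relies on, reduce to the fragment below. All values are copied from the listing; nothing new is introduced. Note that ImagesPipeline depends on the Pillow library, so it presumably needs to be installed alongside Scrapy.

```python
# Condensed settings.py: only the uncommented values from the listing above.
BOT_NAME = 'imgPro'

SPIDER_MODULES = ['imgPro.spiders']
NEWSPIDER_MODULE = 'imgPro.spiders'

# Browser-like user agent
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36')

# robots.txt is not honored in this project
ROBOTSTXT_OBEY = False

# Only show errors in the console output
LOG_LEVEL = 'ERROR'

# Directory where ImagesPipeline writes the downloaded files
IMAGES_STORE = './imgsLib'

# Enable the custom image pipeline
ITEM_PIPELINES = {
    'imgPro.pipelines.ImgproPipeline': 300,
}
```

The spider from img.py is then run with `scrapy crawl img` from the project directory.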
