Scraping second-level (detail) pages with Scrapy

1. Define the data structure in items.py

# -*- coding: utf-8 -*-
'''
file: items.py
'''
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TupianprojectItem(scrapy.Item):
    # Image title
    title = scrapy.Field()
    # Publish time
    publish_time = scrapy.Field()
    # View count
    look = scrapy.Field()
    # Favorite (collect) count
    collect = scrapy.Field()
    # Download count
    download = scrapy.Field()
    # Image URL
    image_url = scrapy.Field()

2. The spider file

# -*- coding: utf-8 -*-
import scrapy

from tupianproject.items import TupianprojectItem


class ImageSpider(scrapy.Spider):
    name = 'image'
    allowed_domains = ['699pic.com']
    start_urls = ['http://699pic.com/people-1-0-0-0-0-0-0.html']
    # Template for the list pages; {} is filled with the page number
    url = 'http://699pic.com/people-{}-0-0-0-0-0-0.html'
    page = 1

    def parse(self, response):
        # On the first-level (list) page, collect the links to all image detail pages
        image_detail_url_list = response.xpath('//div[@class="list"]/a/@href').extract()
        # Send a request to every detail page; urljoin also resolves relative links
        for image_detail_url in image_detail_url_list:
            yield scrapy.Request(url=response.urljoin(image_detail_url), callback=self.parse_detail)
        # Then request the next list page; starting from page 1 this crawls pages 1 to 4
        if self.page <= 3:
            self.page += 1
            url = self.url.format(self.page)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse_detail(self, response):
        # Create an item object
        item = TupianprojectItem()
        # Extract each piece of information about the image
        # Title
        item['title'] = response.xpath('//div[@class="photo-view"]/h1/text()').extract_first()
        # Publish time
        item['publish_time'] = response.xpath('//div[@class="photo-view"]/div/span[@class="publicityt"]')[0].xpath('string(.)').extract_first()
        # View count
        item['look'] = response.xpath('//div[@class="photo-view"]/div/span[@class="look"]/read/text()').extract_first()
        # Favorite (collect) count
        item['collect'] = response.xpath('//div[@class="photo-view"]/div/span[@class="collect"]')[0].xpath('string(.)').extract_first()
        # Download count
        item['download'] = response.xpath('//div[@class="photo-view"]/div/span[@class="download"]')[0].xpath('string(.)').extract_first().strip('\n\t')
        # Image URL
        item['image_url'] = response.xpath('//div[@class="huabu"]//img/@src').extract_first()
        # Hand the item over to the pipeline
        yield item
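The paging logic in parse can be checked without running a crawl. The following is a minimal sketch, assuming the same URL template and the same `page <= 3` condition as the spider above; `list_page_urls` and its parameters are hypothetical names used only for illustration, not part of Scrapy or of the original project.

```python
URL_TEMPLATE = 'http://699pic.com/people-{}-0-0-0-0-0-0.html'


def list_page_urls(start_page=1, last_checked_page=3):
    """Return the list-page URLs the spider ends up requesting.

    Mirrors the spider: the start page is fetched first, and each time a
    list page is parsed while page <= last_checked_page, the page counter
    is incremented and the next list page is requested.
    """
    page = start_page
    urls = [URL_TEMPLATE.format(page)]
    while page <= last_checked_page:
        page += 1
        urls.append(URL_TEMPLATE.format(page))
    return urls


if __name__ == '__main__':
    for url in list_page_urls():
        print(url)
```

Because the condition is checked before the counter is incremented, `page <= 3` results in four list pages (1 through 4), not three.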

3. The pipeline file
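A pipeline only runs once it is registered in the project's settings.py. The fragment below is a minimal sketch, assuming the project package is named tupianproject (as in the spider's import) and that the pipeline class of this section lives in pipelines.py; the priority 300 is an arbitrary choice within the conventional 0 to 1000 range.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # Dotted path to the pipeline class, mapped to its priority
    # (pipelines with lower numbers run first). The path is an
    # assumption based on the project and class names in this post.
    'tupianproject.pipelines.TupianprojectPipeline': 300,
}
```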

# -*- coding: utf-8 -*-
'''
file: pipelines.py
'''
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
import os
import urllib.request


class TupianprojectPipeline(object):
    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.fp = open('tupian.json', 'w', encoding='utf8')

    def process_item(self, item, spider):
        # Write each item as one JSON object per line
        d = dict(item)
        string = json.dumps(d, ensure_ascii=False)
        self.fp.write(string + '\n')
        # Download the image
        self.download(item)
        return item

    def download(self, item):
        dirname = './people'
        # Create the target directory if it does not exist yet
        os.makedirs(dirname, exist_ok=True)
        suffix = item['image_url'].split('.')[-1]
        filename = item['title'] + '.' + suffix
        filepath = os.path.join(dirname, filename)
        urllib.request.urlretrieve(item['image_url'], filepath)

    def close_spider(self, spider):
        # Close the output file when the spider finishes
        self.fp.close()
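The download method builds the file name directly from the title and the image URL, which can fail when the title contains characters such as `/` or when the URL carries a query string. The following is a minimal sketch of a more defensive variant that uses only the standard library; `build_filepath`, the `untitled` fallback, and the default extension `jpg` are hypothetical choices, not part of the original pipeline.

```python
import os
import re
from urllib.parse import urlparse


def build_filepath(title, image_url, dirname='./people', default_suffix='jpg'):
    """Build a local file path for an image from its title and URL."""
    # Take the extension from the URL path only, ignoring ?query and #fragment
    path = urlparse(image_url).path
    suffix = os.path.splitext(path)[1].lstrip('.') or default_suffix
    # Replace whitespace and characters that are not allowed in file names
    # on common platforms with a single underscore
    safe_title = re.sub(r'[\\/:*?"<>|\s]+', '_', title or '').strip('_')
    # Fall back to a placeholder name when no usable title is left
    safe_title = safe_title or 'untitled'
    return os.path.join(dirname, safe_title + '.' + suffix)


if __name__ == '__main__':
    # Hypothetical title and URL, for illustration only
    print(build_filepath('a/b: test', 'http://example.com/img/1.jpg?x=1'))
```

Inside the pipeline, the result of such a helper could be passed to `urllib.request.urlretrieve` in place of the hand-built `filepath`.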
