scrapy: multi-page data and image download from 电影天堂 (dytt8.net)

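Nested crawling

First, get the titles from the first (list) page.

Then follow each title link to the second (detail) page and get the image URL.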

1. Create the project

> scrapy startproject scrapy_movie_099

2. Create the spider file

spiders> scrapy genspider mv https://www.dytt8.net/html/gndy/china/index.html

3. Test

4. Run

spiders> scrapy crawl mv

① Define the data structure

② Analyze the XPath

Run

spiders> scrapy crawl mv

Analyze the URL of the second page
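
The href taken from the list page is usually a path rather than a full URL, so it has to be combined with the site address before it can be requested. A minimal sketch of that step using only the standard library; the href values below are made up for illustration and are not taken from the real site:

```python
from urllib.parse import urljoin

# URL of the list page (from start_urls)
page_url = 'https://www.dytt8.net/html/gndy/china/index.html'

# Hypothetical href values, for illustration only
root_relative = '/html/gndy/dyzz/20210101/12345.html'
already_absolute = 'https://www.dytt8.net/html/gndy/dyzz/20210101/12345.html'

# urljoin resolves a root-relative path against the site root
detail_url = urljoin(page_url, root_relative)

# and leaves an href that is already absolute unchanged
same_url = urljoin(page_url, already_absolute)

print(detail_url)
print(same_url)
```

Inside a Scrapy callback, response.urljoin(href) does the same resolution relative to the page that was just downloaded.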

Run

spiders> scrapy crawl mv

Test

Analyze the URL of the second page

Test the XPath: some of the span tags are not matched. After debugging, change it to //div[@id="Zoom"]//img/@src
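
Why the shorter expression is more robust can be shown on a small sample. The sketch below is only an illustration: the HTML is invented, and it uses the standard library's xml.etree.ElementTree, which supports just a subset of XPath (it cannot select @src directly, so the attribute is read with .get). In the spider itself the full expression is evaluated by response.xpath.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample that mimics a detail page where the image
# sits under extra wrapper tags (span, p) inside div#Zoom
SAMPLE = """
<html><body>
  <div id="Zoom">
    <span><p><img src="https://example.com/poster.jpg"/></p></span>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# '//' skips whatever tags lie between div#Zoom and img, so the path does
# not depend on the span being present
img = root.find(".//div[@id='Zoom']//img")
src = img.get('src') if img is not None else None

print(src)
```

A path that spells out every intermediate tag breaks as soon as one of them differs from what the browser showed, which is the situation described above.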

Test

Use "go to definition" on Request to look at its parameters

meta is a dict

Import the item (data structure) class

Yield movie so that it is handed to the pipeline

Enable the pipeline in settings

Implement the pipeline

Run

spiders> scrapy crawl mv

Project files

Core spider file: mv.py

import scrapy

from scrapy_movie_099.items import ScrapyMovie099Item


class MvSpider(scrapy.Spider):
    name = 'mv'
    allowed_domains = ['www.dytt8.net']
    start_urls = ['https://www.dytt8.net/html/gndy/china/index.html']

    def parse(self, response):
        # We want the name from the first page and the image from the second page
        a_list = response.xpath('//div[@class="co_content8"]//td[2]//a[2]')

        for a in a_list:
            # Get the name on the first page and the link to follow
            name = a.xpath('./text()').extract_first()
            href = a.xpath('./@href').extract_first()

            # Skip anchors without an href instead of failing on None
            if not href:
                continue

            # The second-page URL is the site address plus href;
            # response.urljoin builds it and also copes with absolute hrefs
            url = response.urljoin(href)

            # Request the second page and carry name along in meta
            yield scrapy.Request(url=url, callback=self.parse_second, meta={'name': name})

    def parse_second(self, response):
        # Note: if no data comes back, always check that your XPath is correct
        src = response.xpath('//div[@id="Zoom"]//img/@src').extract_first()

        # Read the value passed in through the request's meta
        name = response.meta['name']

        movie = ScrapyMovie099Item(src=src, name=name)
        yield movie

items.py: the custom item class

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyMovie099Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    src = scrapy.Field()

settings.py: basic configuration (robots.txt setting and pipeline setting)

# Scrapy settings for scrapy_movie_099 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scrapy_movie_099'

SPIDER_MODULES = ['scrapy_movie_099.spiders']
NEWSPIDER_MODULE = 'scrapy_movie_099.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_movie_099 (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scrapy_movie_099.middlewares.ScrapyMovie099SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'scrapy_movie_099.middlewares.ScrapyMovie099DownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'scrapy_movie_099.pipelines.ScrapyMovie099Pipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

pipelines.py: the pipeline

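# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import json

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ScrapyMovie099Pipeline:
    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.fp = open('movie.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # str(item) is not valid JSON, so write each item as one JSON object per line
        line = json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False)
        self.fp.write(line + '\n')
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.fp.close()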
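
The pipeline above only records the image URL; it does not save the image itself. One possible way to add the download is a second pipeline that fetches src with urllib. This is a sketch under assumptions, not part of the original project: MovieImagePipeline, image_filename and the images folder are hypothetical names, and src is assumed to be an absolute URL.

```python
import os
import re
import urllib.request
from urllib.parse import urlparse


def image_filename(name, src, folder='images'):
    """Build a filesystem-safe path such as images/<title>.jpg (hypothetical helper)."""
    # Keep the extension from the URL path, falling back to .jpg
    ext = os.path.splitext(urlparse(src).path)[1] or '.jpg'
    # Replace whitespace and characters that are unsafe in file names
    safe = re.sub(r'[\\/:*?"<>|\s]+', '_', name).strip('_') or 'untitled'
    return os.path.join(folder, safe + ext)


class MovieImagePipeline:
    """Hypothetical second pipeline that saves the poster next to the JSON data."""

    def process_item(self, item, spider):
        src = item.get('src')
        if src:
            path = image_filename(item.get('name') or '', src)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            # Blocking download; acceptable for a small tutorial crawl
            urllib.request.urlretrieve(src, path)
        return item
```

To try it, the class would be added to ITEM_PIPELINES under its own key, for example 'scrapy_movie_099.pipelines.MovieImagePipeline': 301. Scrapy also ships a built-in ImagesPipeline (it requires Pillow), which downloads through Scrapy's own downloader and is probably the better fit for larger crawls.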
