scrapy框架的另一种分页处理以及mongodb的持久化储存以及from_crawler类方法的使用
一.scrapy框架处理
1.分页处理
以爬取亚马逊为例
爬虫文件.py
# -*- coding: utf-8 -*-
import scrapy
from Amazon.items import AmazonItem class AmazonSpider(scrapy.Spider):
name = 'amazon'
allowed_domains = ['www.amazon.cn']
start_urls = ['www.amazon.cn'] def start_requests(self):
# 重写父类方法,拿到商品搜索页
url = 'https://www.amazon.cn/s/ref=nb_sb_noss?__mk_zh_CN=亚马逊网站&url=search-alias%3Daps&field-keywords=iphone+-xs&rh=i%3Aaps%2Ck%3Aiphone+-xs&ajr=0'
yield scrapy.Request(url=url, callback=self.parse) def parse(self, response):
# 解析每一个商品的url
links = response.xpath('//*[contains(@id,"result_")]/div/div[3]/div[1]/a/@href').extract()
# 同时拿到下一页的连接
next_page_url = response.xpath('//a[@id="pagnNextLink"]/@href').extract_first()
print('>>>>>>>>>>>>>', next_page_url)
# 再对这些每一个商品的url进行请求
for link in links:
yield scrapy.Request(url=link, callback=self.parse_detail) #分页处理
# 把所有的商品详情遍历完了之后,再判断是否有下一页,有下一页就继续对下一页发起请求
if next_page_url:
scrapy.Request(url=next_page_url, callback=self.parse) def parse_detail(self, response):
#每个商品的详情页解析出我们要的数据
title = response.xpath('//*[@id="productTitle"]/text()').extract_first().strip()
price = (response.xpath("//*[@id='priceblock_ourprice']/text()") or response.xpath(
"//*[@id='priceblock_saleprice']/text()")).extract_first().strip()
deliver = response.xpath('//*[@id="ddmMerchantMessage"]/*[1]/text()').extract_first().strip() #把数据装到容器里面
item=AmazonItem()
item['title']=title
item['price']=price
item['deliver']=deliver
#记得返回,否则管道接不到
yield item
2.mongodb持久化储存以及from_crawl的使用
pipelines.py
# -*- coding: utf-8 -*- # Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html import pymongo
class AmazonPipeline(object): @classmethod
def from_crawler(cls, crawler):
"""
Scrapy会先通过getattr判断我们是否自定义了from_crawler,有则调它来完
成实例化,早于__init__方法执行
自己要的参数要去settings.py文件配置
"""
HOST = crawler.settings.get('HOST')
PORT = crawler.settings.get('PORT')
USER = crawler.settings.get('USER')
PWD = crawler.settings.get('PWD')
DB = crawler.settings.get('DB')
TABLE = crawler.settings.get('TABLE')
return cls(HOST, PORT, USER, PWD, DB, TABLE) def __init__(self,host,port,user,pwd,db,table):
self.host=host
self.port=port
self.user=user
self.pwd=pwd
self.db=db
self.table=table def open_spider(self,spider):
#程序运行时执行一次
self.client=pymongo.MongoClient(host=self.host,port=self.port) def process_item(self, item, spider):
dic_item=dict(item)
if dic_item:
self.client[self.db][self.table].save(dic_item)
return item
def close_spider(self,spider):
#程序关闭时候执行一次
self.client.close()
settings.py
# -*- coding: utf-8 -*- # Scrapy settings for Amazon project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'Amazon' SPIDER_MODULES = ['Amazon.spiders']
NEWSPIDER_MODULE = 'Amazon.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36' # Obey robots.txt rules
ROBOTSTXT_OBEY = False # Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default)
#COOKIES_ENABLED = False # Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False # Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#} # Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Amazon.middlewares.AmazonSpiderMiddleware': 543,
#} # Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
'Amazon.middlewares.AmazonDownloaderMiddleware': 543,
} # Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#} # Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'Amazon.pipelines.AmazonPipeline': 300,
} # Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' ###MONGODB的配置
HOST='127.0.0.1'
PORT=27017
USER='root'
PWD=''
DB='amazon'
TABLE='goods'
二.补充一个小技巧
一直在命令行启动爬虫文件就很累了,可以这么做
在爬虫项目的根目录直接写一个.py文件,加入如下内容
#第一个,第二不变,第三个是爬虫文件名称,也可以加第四个,--nolog不达意你日志
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'amazon'])
scrapy框架的另一种分页处理以及mongodb的持久化储存以及from_crawler类方法的使用的更多相关文章
- scrapy框架的持久化存储
一 . 基于终端指令的持久化存储 保证爬虫文件的parse方法中有可迭代类型对象(通常为列表or字典)的返回,该返回值可以通过终端指令的形式写入指定格式的文件中进行持久化操作. 执行输出指定格式进行存 ...
- Scrapy框架1——简单使用
一.设置与编写 打开cmd,选择好路径 1.创建项目scrapy startproject projectname d:\爬虫\11.scrapy>scrapy startproject tes ...
- DjangoRestFramework框架三种分页功能的实现 - 在DjangoStarter项目模板中封装
前言 继续Django后端开发系列文章.刚好遇到一个分页的需求,就记录一下. Django作为一个"全家桶"型的框架,本身啥都有,分页组件也是有的,但默认的分页组件没有对API开发 ...
- 爬虫写法进阶:普通函数--->函数类--->Scrapy框架
本文转载自以下网站: 从 Class 类到 Scrapy https://www.makcyun.top/web_scraping_withpython12.html 普通函数爬虫: https:// ...
- 09 Scrapy框架在爬虫中的使用
一.简介 Scrapy是一个为了爬取网站数据,提取结构性数据而编写的应用框架.它集成高性能异步下载,队列,分布式,解析,持久化等. Scrapy 是基于twisted框架开发而来,twisted是一个 ...
- day96_11_28 mongoDB与scrapy框架
一.mongodb mongodb是一个面向文档的数据库,而不是关系型数据库.不采用关系型是为了获得更好的扩展性. 它与mysql的区别在于它没有表连接,但是可以通过其他办法实现. 安装数据库. 上官 ...
- Python逆向爬虫之scrapy框架,非常详细
爬虫系列目录 目录 Python逆向爬虫之scrapy框架,非常详细 一.爬虫入门 1.1 定义需求 1.2 需求分析 1.2.1 下载某个页面上所有的图片 1.2.2 分页 1.2.3 进行下载图片 ...
- 关于使用scrapy框架编写爬虫以及Ajax动态加载问题、反爬问题解决方案
Python爬虫总结 总的来说,Python爬虫所做的事情分为两个部分,1:将网页的内容全部抓取下来,2:对抓取到的内容和进行解析,得到我们需要的信息. 目前公认比较好用的爬虫框架为Scrapy,而且 ...
- 爬虫基础(五)-----scrapy框架简介
---------------------------------------------------摆脱穷人思维 <五> :拓展自己的视野,适当做一些眼前''无用''的事情,防止进入只关 ...
随机推荐
- ensemble 的2篇入门 文章
python 篇: http://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn ...
- Luogu 4284 [SHOI2014]概率充电器
BZOJ 3566 树形$dp$ + 概率期望. 每一个点的贡献都是$1$,在本题中期望就等于概率. 发现每一个点要通电会在下面三件事中至少发生一件: 1.它自己通电了. 2.它的父亲给它通电了. 3 ...
- Django框架 之 querySet详解
Django框架 之 querySet详解 浏览目录 可切片 可迭代 惰性查询 缓存机制 exists()与iterator()方法 QuerySet 可切片 使用Python 的切片语法来限制查询集 ...
- hdu 4269 Defend Jian Ge
#include <cctype> #include <algorithm> #include <vector> #include <string> # ...
- Aircrack使用
Aircrack Aircrack-ng 组件功能之一就是采集WEP及WPA-PSK字典并应用无线端口扫描进行破解,具体组件说明如下: aircrack-ng 功能主要是WEP及WPA-PSK密码的恢 ...
- ubuntu14.04LTS下制作安装启动U盘
ubuntu自带的启动U盘制作工具在我的非UEFI电脑上无法启动,找到一个国产的好用东西:深度deepin-boot-maker. 下载地址(官方百度盘):点击下载 用起来也很简单,只需要选择下载好的 ...
- poj1840 Eqs(hash+折半枚举)
Description Consider equations having the following form: a1x13+ a2x23+ a3x33+ a4x43+ a5x53=0 The co ...
- Tomcat之——内存溢出设置JAVA_OPTS
答案1设置Tomcat启动的初始内存其初始空间(即-Xms)是物理内存的1/64,最大空间(-Xmx)是物理内存的1/4.可以利用JVM提供的-Xmn -Xms -Xmx等选项可进行设置三.实例,以下 ...
- StackOverflow: 你没见过的七个最好的Java答案
StackOverflow发展到目前,已经成为了全球开发者的金矿.它能够帮助我们找到在各个领域遇到的问题的最有用的解决方案,同时我们也会从中学习到很多新的东西.这篇文章是在我们审阅了StackOver ...
- [SinGuLaRiTy] 树链问题
[SinGuLaRiTy-1035] Copyright (c) SinGuLaRiTy 2017. All Rights Reserved. 关于树链 树链是什么?这个乍一看似乎很陌生的词汇表达的其 ...