Putting the Scrapy framework to work: scraping weather forecasts, plus a scheduled task

Target site:

http://www.weather.com.cn/

Weather page for a specific region:

http://www.weather.com.cn/weather1d/101280601.shtml (Shenzhen)

Getting started:

scrapy startproject weather

Write items.py:

import scrapy

class WeatherItem(scrapy.Item):

    # define the fields for your item here like:

    # name = scrapy.Field()

    date  = scrapy.Field()

    temperature  = scrapy.Field()

    weather  = scrapy.Field()

    wind  = scrapy.Field()

Write the spider:

# -*- coding: utf-8 -*-

# @Time    : 2019/8/1 15:40

# @Author  : wujf

# @Email   : 1028540310@qq.com

# @File    : weather.py

# @Software: PyCharm

import scrapy

from weather.items import WeatherItem

class WeatherSpider(scrapy.Spider):

    name = 'weather'

    # allowed_domains takes domain names only, not full URLs or paths
    allowed_domains = ['www.weather.com.cn']

    start_urls = [

        'http://www.weather.com.cn/weather/101280601.shtml'

    ]

    def parse(self, response):

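        '''

        Extract the forecast fields.

        date = the date

        temperature = the day's temperature

        weather = the day's weather

        wind = the day's wind direction

        :param response: the downloaded page

        :return: a list of WeatherItem objects

        '''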

        items = []

        day = response.xpath('//ul[@class="t clearfix"]')

        for i in range(7):

            item = WeatherItem()

            # default='' keeps a missing node from yielding None, which would break the string concatenation in the text pipeline

            item['date'] = day.xpath('./li[%d]/h1//text()' % (i + 1)).extract_first(default='')

            item['temperature'] = day.xpath('./li[%d]/p[@class="tem"]/i//text()' % (i + 1)).extract_first(default='')

            item['weather'] = day.xpath('./li[%d]/p[@class="wea"]//text()' % (i + 1)).extract_first(default='')

            item['wind'] = day.xpath('./li[%d]/p[@class="win"]/i//text()' % (i + 1)).extract_first(default='')

            #print(item)

            items.append(item)

        return items
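The spider above addresses each day by building an index into the XPath string. An alternative is to select the `li` nodes once and then query each one relatively. The sketch below illustrates that pattern with the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors, so it runs without Scrapy installed. The `SAMPLE` markup and the helper names `first_text` and `parse_days` are hypothetical; the real page is larger and may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that mimics the structure the spider's
# XPath expressions assume. The real page is larger and may differ.
SAMPLE = """
<div>
  <ul class="t clearfix">
    <li><h1>Day 1</h1><p class="wea">Sunny</p>
        <p class="tem"><i>26C</i></p><p class="win"><i>Level 3</i></p></li>
    <li><h1>Day 2</h1><p class="wea">Cloudy</p>
        <p class="tem"><i>25C</i></p><p class="win"><i>Level 2</i></p></li>
  </ul>
</div>
"""


def first_text(node, path):
    """Return the first non-empty text under node/path, or '' if nothing matches."""
    found = node.find(path)
    if found is None:
        return ''
    for text in found.itertext():
        if text.strip():
            return text.strip()
    return ''


def parse_days(markup):
    """Collect one dict per <li> under the forecast list."""
    root = ET.fromstring(markup)
    days = []
    # Select every day node once, then query relative to each node.
    for li in root.findall(".//ul[@class='t clearfix']/li"):
        days.append({
            'date': first_text(li, 'h1'),
            'temperature': first_text(li, "p[@class='tem']/i"),
            'weather': first_text(li, "p[@class='wea']"),
            'wind': first_text(li, "p[@class='win']/i"),
        })
    return days


forecast = parse_days(SAMPLE)
```

In a Scrapy spider the same shape would be a loop over `response.xpath('//ul[@class="t clearfix"]/li')`, which also stops depending on the page listing exactly seven days.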

Write the item pipelines:

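pipelines.py handles the final processing of the data the spider has scraped. Usually we save the data locally, in one of three forms:

1. Plain text: the most basic way to store it

2. JSON: convenient for other programs to consume

3. Database: the choice when the volume of data is fairly large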
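To make the JSON option concrete, here is a minimal standalone sketch of the "one JSON object per line" layout, written with the standard library only. The helper names `append_json_line` and `read_json_lines`, the temporary file, and the sample records are assumptions for illustration and are not part of the project.

```python
import json
import os
import tempfile


def append_json_line(path, record):
    """Append one record as a single JSON line, keeping non-ASCII text readable."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')


def read_json_lines(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical sample data written to a throwaway temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'weather_demo.json')
append_json_line(demo_path, {'date': '1日（今天）', 'weather': '晴'})
append_json_line(demo_path, {'date': '2日（明天）', 'weather': '多云'})
records = read_json_lines(demo_path)
```

Because every line is a complete JSON document, a consumer can process the file line by line without loading it all at once. The author's full pipelines.py follows.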

import os

import json

import codecs

import pymysql

'''Plain-text storage'''

class WeatherPipeline(object):

    def process_item(self, item, spider):

        #print(item)

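        # Get the current working directory

        base_dir = os.getcwd()

        #filename = base_dir+'\\data\\test.txt'

        filename = r'E:\Python\weather\weather\data\test.txt'

        # Set the encoding explicitly so the Chinese text does not depend on the system default

        with open(filename, 'a', encoding='utf-8') as f: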

            f.write(item['date'] + '\n')

            f.write(item['temperature'] + '\n')

            f.write(item['weather'] + '\n')

            f.write(item['wind'] + '\n\n')

        return item

'''JSON storage'''
class W2json(object):

    def process_item(self, item, spider):

        '''

        Save the scraped data as JSON

        so that other programs can consume it easily

        '''

        base_dir = os.getcwd()

        #filename = base_dir + '/data/weather.json'

        filename = r'E:\Python\weather\weather\data\weather.json'

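        # Open the JSON file and write the data into it with json.dumps

        # Note: ensure_ascii=False is required; otherwise non-ASCII text is written as \uXXXX escape sequences, for example "\u6674"

        with codecs.open(filename, 'a', encoding='utf-8') as f: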

            line = json.dumps(dict(item), ensure_ascii=False) + '\n'

            f.write(line)

        return item

class W2mysql(object):

    def process_item(self, item, spider):

        '''

        Save the scraped data to MySQL

        '''

        date        = item['date']

        temperature = item['temperature']

        weather     = item['weather']

        wind        = item['wind']

        connection = pymysql.connect(

            host = '127.0.0.1',

            user = 'root',

            passwd='root',

            db = 'scrapy',

            charset = 'utf8mb4',  # PyMySQL expects a MySQL charset name here, not 'utf-8'

            cursorclass = pymysql.cursors.DictCursor

        )

        try:

            with connection.cursor() as  cursor:

                # Build the SQL INSERT statement

                sql = """INSERT INTO `weather` (date, temperature, weather, wind) VALUES (%s, %s, %s, %s) """

                cursor.execute(

                    sql,(date,temperature,weather,wind)

                )

                connection.commit()

        finally:

            connection.close()

        return item
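The MySQL pipeline above opens and closes a connection for every item. Scrapy also calls a pipeline's `open_spider` and `close_spider` methods once per crawl, which lets the connection be shared across items. The sketch below shows that shape using the standard library's `sqlite3` in place of MySQL so it can run without a database server. The class name `W2sqlite`, the in-memory database, and the sample item are assumptions for illustration, not the author's code.

```python
import sqlite3


class W2sqlite:
    """Hypothetical pipeline: one connection per crawl instead of one per item."""

    def __init__(self, db_path=':memory:'):
        self.db_path = db_path
        self.connection = None

    def open_spider(self, spider):
        # Called once when the spider starts: connect and make sure the table exists.
        self.connection = sqlite3.connect(self.db_path)
        self.connection.execute(
            'CREATE TABLE IF NOT EXISTS weather '
            '(date TEXT, temperature TEXT, weather TEXT, wind TEXT)'
        )

    def process_item(self, item, spider):
        # Parameterized query, same idea as the %s placeholders used with pymysql.
        self.connection.execute(
            'INSERT INTO weather (date, temperature, weather, wind) '
            'VALUES (?, ?, ?, ?)',
            (item['date'], item['temperature'], item['weather'], item['wind']),
        )
        self.connection.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.connection.close()


# Drive the pipeline by hand with a made-up item; Scrapy would make these calls itself.
pipeline = W2sqlite()
pipeline.open_spider(spider=None)
pipeline.process_item(
    {'date': 'Day 1', 'temperature': '26C', 'weather': 'Sunny', 'wind': 'Level 3'},
    spider=None,
)
rows = pipeline.connection.execute('SELECT date, wind FROM weather').fetchall()
pipeline.close_spider(spider=None)
```

With pymysql the structure would be the same: `pymysql.connect(...)` in `open_spider`, the insert in `process_item`, and `connection.close()` in `close_spider`.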

Then configure settings.py:

'''
Set the log level
    ERROR   : general errors

    WARNING : warnings

    INFO    : general information

    DEBUG   : debugging information

    The default level is DEBUG

'''

LOG_LEVEL = 'INFO'

ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 300,
    'weather.pipelines.W2json': 400,
    'weather.pipelines.W2mysql': 500,
}
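The integers in ITEM_PIPELINES decide the order in which items pass through the pipelines: lower values run first, and values are customarily kept in the 0 to 1000 range. The sketch below only illustrates that ordering rule on a made-up mapping; `execution_order` is a hypothetical helper, not Scrapy's internal code.

```python
# Hypothetical mapping in the same shape as ITEM_PIPELINES.
pipelines = {
    'weather.pipelines.W2json': 400,
    'weather.pipelines.WeatherPipeline': 300,
    'weather.pipelines.W2mysql': 500,
}


def execution_order(mapping):
    """Return the pipeline paths sorted by ascending priority value."""
    return [path for path, _ in sorted(mapping.items(), key=lambda kv: kv[1])]


order = execution_order(pipelines)
```

Giving each pipeline a distinct value keeps the order explicit, which matters once one pipeline modifies or drops items before the next one sees them.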

The three classes above demonstrate three ways of storing the data.

Finally, run scrapy crawl weather to produce all three kinds of output.

Last, write a scheduled crawl task:

# -*- coding: utf-8 -*-

# @Time    : 2019/8/3 15:38

# @Author  : wujf

# @Email   : 1028540310@qq.com

# @File    : 定时爬虫.py

# @Software: PyCharm

'''

Method 1: use sleep

'''

# import time

# import os

# while True:

#     os.system('scrapy crawl weather')

#     time.sleep(3)

# Method 2

from scrapy import cmdline

import os

#retal = os.getcwd()  # get the current directory

#print(retal)

os.chdir(r'E:\Python\weather\weather')  # change directory, because scrapy crawl weather only works from inside the Scrapy project

cmdline.execute(['scrapy', 'crawl', 'weather'])
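As far as I know, `cmdline.execute` exits the Python process when the crawl finishes, so Method 2 on its own runs the spider once rather than on a schedule. One way to repeat the crawl is to start each run as a separate process and sleep in between. The sketch below assumes that approach; `run_periodically` and its parameters are hypothetical names, and the `runner` and `sleep` arguments exist only so the loop can be tried without actually launching Scrapy or waiting.

```python
import subprocess
import time


def run_periodically(command, interval_seconds, max_runs, cwd=None,
                     runner=subprocess.run, sleep=time.sleep):
    """Run command max_runs times, pausing interval_seconds between runs.

    Returns the list of exit codes, one per run.
    """
    exit_codes = []
    for run_number in range(max_runs):
        # Each crawl gets a fresh process, started from the project directory.
        result = runner(command, cwd=cwd)
        exit_codes.append(result.returncode)
        # No need to wait after the final run.
        if run_number < max_runs - 1:
            sleep(interval_seconds)
    return exit_codes


# Example (not executed here): crawl once an hour, three times in total.
# run_periodically(['scrapy', 'crawl', 'weather'], 3600, 3,
#                  cwd=r'E:\Python\weather\weather')
```

For a long-running deployment, an operating-system scheduler such as cron or Windows Task Scheduler calling `scrapy crawl weather` is a common alternative to keeping a Python loop alive.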

There is also the middleware, but I don't have any proxy IPs on hand, so I can't try it out for now.

OK, that's all!
