Scrapy Crawler Framework: Case Study 1

Sunshine Hotline Government Inquiry Platform (阳光热线问政平台)

http://wz.sun0769.com/index.php/question/questionType?type=4

Goal: scrape each complaint post's number, URL, title, and body content.

items.py

import scrapy


class SunwzItem(scrapy.Item):
    number = scrapy.Field()   # post number
    url = scrapy.Field()      # post URL
    title = scrapy.Field()    # post title
    content = scrapy.Field()  # post body text

spiders/sunwz.py



# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from Sunwz.items import SunwzItem


class SunwzSpider(CrawlSpider):
    name = 'sunwz'
    num = 0
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4']

    # Rules without a callback follow the matched links by default
    rules = (
        # pagination links
        Rule(LinkExtractor(allow=r'page')),
        # the listing page itself
        Rule(LinkExtractor(allow=r'/index\.php/question/questionType\?type=4$')),
        # post detail pages
        Rule(LinkExtractor(allow=r'/html/question/\d+/\d+\.shtml$'), follow=True, callback='parse_content'),
    )

    xpathDict = {
        'title': '//div[contains(@class, "pagecenter p3")]/div/div/div[contains(@class,"cleft")]/strong/text()',
        'content': '//div[contains(@class, "c1 text14_2")]/text()',
        'content_first': '//div[contains(@class, "contentext")]/text()'
    }

    def parse_content(self, response):
        item = SunwzItem()

        # Prefer the "contentext" block; fall back to the "c1 text14_2" block
        content = response.xpath(self.xpathDict['content_first']).extract_first()
        if content is None:
            content = response.xpath(self.xpathDict['content']).extract_first()

        # The post number is the last space-separated token of the title, after the colon
        title = response.xpath(self.xpathDict['title']).extract_first(default='')
        number = title.split(' ')[-1].split(':')[-1]

        item['url'] = response.url
        item['number'] = number
        item['title'] = title
        item['content'] = content
        yield item
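The two pure-Python parts of the spider, the link patterns in the rules and the way the post number is cut out of the title, can be checked without touching the network. The following is a minimal sketch: the sample URL and sample title are hypothetical values made up for illustration, and `extract_number` is a helper name introduced here, not part of the original project. It assumes the title ends with a token of the form `label:number`, which is what the `split` calls in `parse_content` imply.

```python
import re

# Patterns copied from the spider's rules.
LIST_PAGE = re.compile(r'/index\.php/question/questionType\?type=4$')
DETAIL_PAGE = re.compile(r'/html/question/\d+/\d+\.shtml$')


def extract_number(title):
    """Return the text after the last colon of the last space-separated token."""
    last_token = title.split(' ')[-1]
    return last_token.split(':')[-1]


# Hypothetical inputs, for illustration only.
sample_url = 'http://wz.sun0769.com/html/question/201706/336975.shtml'
sample_title = 'Sample complaint title No:191166'

print(bool(DETAIL_PAGE.search(sample_url)))  # detail pattern vs. a detail-style URL
print(bool(LIST_PAGE.search(sample_url)))    # listing pattern vs. the same URL
print(extract_number(sample_title))
```

Both patterns end in `$`, so a URL with anything appended after `.shtml` or `type=4` would not match them.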

pipelines.py

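import json
import codecs


class JsonWriterPipeline(object):
    def __init__(self):
        # codecs.open writes UTF-8 text on both Python 2 and Python 3
        self.file = codecs.open('sunwz.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # one JSON object per line; ensure_ascii=False keeps Chinese text unescaped
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider on each pipeline when the spider finishes
        self.file.close()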
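To see what the pipeline's output file looks like, the write logic can be exercised outside Scrapy. This is a rough sketch only: `JsonLinesWriter` is a stand-in class invented here that mirrors the pipeline's three steps (open, write one JSON object per line, close), and the item values and temporary path are hypothetical.

```python
import codecs
import json
import os
import tempfile


class JsonLinesWriter(object):
    """Stand-in for the pipeline: writes one JSON object per line."""

    def __init__(self, path):
        self.file = codecs.open(path, 'w', encoding='utf-8')

    def process_item(self, item):
        # ensure_ascii=False writes non-ASCII text as-is instead of \uXXXX escapes
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close(self):
        self.file.close()


# Hypothetical item values, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'sunwz.json')
writer = JsonLinesWriter(path)
writer.process_item({'number': '191166', 'title': u'标题'})
writer.close()

# Each line of the file parses back into one item.
with codecs.open(path, 'r', encoding='utf-8') as f:
    raw = f.read()
rows = [json.loads(line) for line in raw.splitlines()]
print(rows[0]['number'])
```

Because each line is a complete JSON document, the file can be read back line by line without loading it all into memory.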

settings.py

ITEM_PIPELINES = {
    'Sunwz.pipelines.JsonWriterPipeline': 300,
}

Create a main.py file in the project root directory for debugging:

from scrapy import cmdline

cmdline.execute('scrapy crawl sunwz'.split())

Run the program

python main.py
