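Scrapy in Practice: Scraping Newspaper Names and Addresses

Goal: scrape the names and web addresses (URLs) of newspapers across China

Link: http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm

Purpose: practice scraping data with Scrapy

Now that we have covered the basics of Scrapy, let's write a very simple spider.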

1. Create the Scrapy project

$ cd ~/code/crawler/scrapyProject
$ scrapy startproject newSpapers

2. Generate the spider

$ cd newSpapers/
$ scrapy genspider nationalNewspaper news.xinhuanet.com

3. Define the item fields

$ cat items.py
# -*- coding: utf-8 -*-
 
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
 
import scrapy
 
class NewspapersItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    addr = scrapy.Field()

4. Write the spider

$ cat spiders/nationalNewspaper.py
# -*- coding: utf-8 -*-
import scrapy
from newSpapers.items import NewspapersItem
 
class NationalnewspaperSpider(scrapy.Spider):
    name = "nationalNewspaper"
    allowed_domains = ["news.xinhuanet.com"]
    start_urls = ['http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm']
 
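    def parse(self, response):
        # tr[2] of the outer table: the row used here for the national newspapers.
        sub_country = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[2]')
        # tr[4]: the local newspapers; selected but not used in this example.
        sub2_local = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[4]')
        # Every <a> tag inside the nested table of the national row.
        tags_a_country = sub_country.xpath('./td/table/tbody/tr/td/p/a')
        for each in tags_a_country:
            item = NewspapersItem()
            # extract() returns a list of matches; the pipeline reads element [0].
            item['name'] = each.xpath('./strong/text()').extract()
            item['addr'] = each.xpath('./@href').extract()
            yield item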
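
The relative XPath logic in parse can be tried offline before running a real crawl. The following is a minimal sketch that uses the standard library's ElementTree instead of Scrapy selectors; the ROW fragment and the extract_links helper are hypothetical stand-ins, not the real page markup. Note that browsers add tbody to tables in their DOM view, so an XPath copied from developer tools only matches in Scrapy if tbody is also present in the raw HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for one row of the target table.
ROW = """
<tr>
  <td><table><tbody><tr><td>
    <p><a href="http://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm"><strong>人民日报</strong></a></p>
    <p><a href="http://www.economicdaily.com.cn/no1/"><strong>经济日报</strong></a></p>
  </td></tr></tbody></table></td>
</tr>
"""


def extract_links(row_xml):
    """Collect name/addr pairs using the same relative path as the spider."""
    row = ET.fromstring(row_xml.strip())
    results = []
    for a in row.findall('./td/table/tbody/tr/td/p/a'):
        strong = a.find('./strong')
        name = strong.text if strong is not None else None
        results.append({'name': name, 'addr': a.get('href')})
    return results


if __name__ == '__main__':
    for link in extract_links(ROW):
        print(link['name'], link['addr'])
```

If the path returns nothing against the real page, dropping the tbody steps from the XPath is usually the first thing worth trying.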

5. Configure which pipeline handles the scraped items

$ cat settings.py
……
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
ITEM_PIPELINES = {'newSpapers.pipelines.NewspapersPipeline':100}
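
The integer in ITEM_PIPELINES is the pipeline's order: items pass through lower-valued pipelines first, and values are customarily chosen from the 0-1000 range. A small sketch of that ordering follows; CleanupPipeline and the execution_order helper are hypothetical and exist only for this illustration.

```python
# Hypothetical settings fragment: CleanupPipeline is not part of the project above.
ITEM_PIPELINES = {
    'newSpapers.pipelines.NewspapersPipeline': 100,
    'newSpapers.pipelines.CleanupPipeline': 50,
}


def execution_order(pipelines):
    """Return pipeline paths in ascending order of their configured value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


if __name__ == '__main__':
    for path in execution_order(ITEM_PIPELINES):
        print(path)
```

With only one pipeline, as in this project, the exact value does not matter.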

6. Write the item pipeline

$ cat pipelines.py
# -*- coding: utf-8 -*-
 
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
 
class NewspapersPipeline(object):
    def process_item(self, item, spider):
        # Python 3 syntax: print() calls and a text-mode file opened as UTF-8.
        filename = 'newspaper.txt'
        print('=================')
        print(item)
        print('================')
        # Each field is a list (from extract()), so take the first match.
        # Append one "name<TAB>address" line per item.
        with open(filename, 'a', encoding='utf-8') as fp:
            fp.write(item['name'][0] + '\t' + item['addr'][0] + '\n')
        return item
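
The pipeline above reopens newspaper.txt for every item. One possible variant, sketched here under the assumption that a single output file per crawl is enough, opens the file once in open_spider and closes it in close_spider, the two optional hooks Scrapy calls on a pipeline when a spider starts and finishes. The class name NewspapersFilePipeline, the filename parameter, and the demo function are hypothetical.

```python
import os
import tempfile


class NewspapersFilePipeline:
    """Hypothetical variant of NewspapersPipeline that opens the file once."""

    def __init__(self, filename='newspaper.txt'):
        self.filename = filename
        self.fp = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.fp = open(self.filename, 'a', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.fp.close()

    def process_item(self, item, spider):
        # Fields are lists, as produced by extract() in the spider.
        self.fp.write(item['name'][0] + '\t' + item['addr'][0] + '\n')
        return item


def demo():
    """Drive the pipeline by hand with a plain dict, without Scrapy."""
    path = os.path.join(tempfile.mkdtemp(), 'newspaper.txt')
    pipeline = NewspapersFilePipeline(path)
    pipeline.open_spider(spider=None)
    pipeline.process_item(
        {'name': ['经济日报'], 'addr': ['http://www.economicdaily.com.cn/no1/']},
        spider=None,
    )
    pipeline.close_spider(spider=None)
    with open(path, encoding='utf-8') as fp:
        return fp.read()


if __name__ == '__main__':
    print(demo(), end='')
```

To use such a variant, its dotted path would replace the existing entry in ITEM_PIPELINES.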

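7. View the results

Run the spider with scrapy crawl nationalNewspaper. The pipeline opens newspaper.txt with a relative path, so the file is created in the directory the command is run from.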

$ cat spiders/newspaper.txt
人民日报	http://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm
海外版	http://paper.people.com.cn/rmrbhwb/html/2007-09/20/node_34.htm
光明日报	http://www.gmw.cn/01gmrb/2007-09/20/default.htm
经济日报	http://www.economicdaily.com.cn/no1/
解放军报	http://www.gmw.cn/01gmrb/2007-09/20/default.htm
中国日报	http://pub1.chinadaily.com.cn/cdpdf/cndy/
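
A quick sanity check on the output can catch suspicious rows. The sketch below parses the tab-separated lines and reports addresses shared by more than one newspaper; SAMPLE repeats the six lines shown above, and the helper names are hypothetical. Whether a shared address is a mistake on the source page or in the scrape cannot be determined from the output alone.

```python
from collections import defaultdict

# The six lines of output shown above, tab-separated.
SAMPLE = (
    "人民日报\thttp://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm\n"
    "海外版\thttp://paper.people.com.cn/rmrbhwb/html/2007-09/20/node_34.htm\n"
    "光明日报\thttp://www.gmw.cn/01gmrb/2007-09/20/default.htm\n"
    "经济日报\thttp://www.economicdaily.com.cn/no1/\n"
    "解放军报\thttp://www.gmw.cn/01gmrb/2007-09/20/default.htm\n"
    "中国日报\thttp://pub1.chinadaily.com.cn/cdpdf/cndy/\n"
)


def parse_lines(text):
    """Split each non-empty line into a (name, addr) tuple."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, addr = line.split('\t', 1)
        rows.append((name, addr))
    return rows


def duplicate_addresses(rows):
    """Map each address used more than once to the names that share it."""
    by_addr = defaultdict(list)
    for name, addr in rows:
        by_addr[addr].append(name)
    return {addr: names for addr, names in by_addr.items() if len(names) > 1}


if __name__ == '__main__':
    for addr, names in duplicate_addresses(parse_lines(SAMPLE)).items():
        print(addr, names)
```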
