Scrapy Framework: Crawling a Job-Listing Website with CrawlSpider

CrawlSpider

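Scrapy offers several spider classes; this article is concerned with two of them, Spider and CrawlSpider.

CrawlSpider is a subclass of Spider. The Spider class is designed to crawl only the pages listed in start_urls, whereas CrawlSpider defines a set of rules (Rule objects) that provide a convenient mechanism for following links. That makes it the better fit when the job is to extract links from crawled pages and keep crawling them.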

Command to create the project:

scrapy startproject tenCent

Command to create the CrawlSpider:

scrapy genspider -t crawl crawl_tencent "hr.tencent.com"

CrawlSpider inherits from Spider. Besides the inherited attributes (name, allowed_domains), it provides new attributes and methods:

LinkExtractor

from scrapy.linkextractors import LinkExtractor

LinkExtractor(allow=r'start=\d+')

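Links are extracted by instantiating LinkExtractor.

Main parameters

allow: URLs that match the regular expression (or list of regular expressions) are extracted. If it is empty, every link matches.

deny: URLs that match this regular expression (or list of regular expressions) are never extracted. It takes precedence over allow.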
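The allow/deny behavior described above can be modeled with the standard re module. This is a minimal sketch and not Scrapy's actual implementation: filter_links and the sample URLs are hypothetical, and the function only illustrates that a URL is kept when it matches some allow pattern (or allow is empty) and matches no deny pattern.

```python
import re


def filter_links(urls, allow=(), deny=()):
    """Simplified model of LinkExtractor's allow/deny filtering (illustrative only)."""
    allow_res = [re.compile(p) for p in allow]
    deny_res = [re.compile(p) for p in deny]
    kept = []
    for url in urls:
        # An empty allow list accepts every URL.
        allowed = not allow_res or any(r.search(url) for r in allow_res)
        # deny takes precedence: any match excludes the URL.
        denied = any(r.search(url) for r in deny_res)
        if allowed and not denied:
            kept.append(url)
    return kept


# Hypothetical URLs, shaped like the ones used in this article.
urls = [
    "https://hr.tencent.com/position.php?start=10",
    "https://hr.tencent.com/position.php?start=20",
    "https://hr.tencent.com/position_detail.php?id=1",
    "https://hr.tencent.com/about.php",
]

print(filter_links(urls, allow=[r"start=\d+"]))  # the two paginated list pages
print(filter_links(urls, deny=[r"start=\d+"]))   # everything except the list pages
```

With allow set, only the two paginated list URLs survive; with only deny set, exactly those two are dropped.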

rules

Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_tencent', follow=True),

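rules contains one or more Rule objects. Each Rule defines a specific behavior for crawling the site.

If several rules match the same link, the first one, in the order in which they are defined in rules, is used.

Main parameters

link_extractor: a Link Extractor object that defines which links to extract.

callback: the callback to invoke for every link obtained from link_extractor. It receives a response as its first argument.

Note: when writing crawl rules, do not use parse as the callback. CrawlSpider uses the parse method to implement its own logic, so if you override parse, the crawl spider will no longer work.

follow: a boolean that specifies whether the links extracted from a response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.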
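Two of the points above, "the first matching rule wins" and the default value of follow, can be illustrated with a small model. This is a sketch under simplifying assumptions, not how Scrapy is implemented: rules are represented as plain tuples, and first_matching_rule and effective_follow are hypothetical helper names.

```python
import re

# Each rule is modeled as (allow pattern, callback name, follow).
RULES = [
    (r"start=\d+", "parse_tencent", True),
    (r"position_detail\.php\?id=\d+", "parse_detail", False),
]


def effective_follow(callback, follow=None):
    """When follow is not given, it is True if there is no callback, else False."""
    return (callback is None) if follow is None else follow


def first_matching_rule(url, rules=RULES):
    """Return (callback, follow) of the first rule whose pattern matches url, else None."""
    for pattern, callback, follow in rules:
        if re.search(pattern, url):
            return callback, effective_follow(callback, follow)
    return None


print(first_matching_rule("https://hr.tencent.com/position.php?start=10"))
print(first_matching_rule("https://hr.tencent.com/position_detail.php?id=42"))
print(effective_follow(None), effective_follow("parse_detail"))
```

A URL that matches both patterns is handled by the rule defined first, which is why the order of the entries in rules matters.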

Crawling the data with CrawlSpider

1. Write the items file

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    """One row of the job-list page."""
    # Position name
    position_name = scrapy.Field()
    # Link to the detail page
    position_link = scrapy.Field()
    # Position category
    position_type = scrapy.Field()
    # Number of openings
    position_number = scrapy.Field()
    # Work location
    work_location = scrapy.Field()
    # Publication date
    publish_times = scrapy.Field()
    # Job duties (declared here, but filled in only on DetailItem by the spider below)
    position_duty = scrapy.Field()
    # Job requirements (declared here, but filled in only on DetailItem by the spider below)
    position_require = scrapy.Field()


class DetailItem(scrapy.Item):
    """Content of a job-detail page."""
    # Job duties
    position_duty = scrapy.Field()
    # Job requirements
    position_require = scrapy.Field()

2. Write the CrawlSpider file

# -*- coding: utf-8 -*-

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from tenCent.items import DetailItem, TencentItem


class CrawlTencentSpider(CrawlSpider):
    name = 'crawl_tencent'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']

    # Rule / LinkExtractor settings:
    #   allow:    regular expression that a link must match
    #   callback: name of the callback method
    #   follow:   whether links found on the matched pages are followed in turn
    rules = (
        # Paginated list pages
        Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_tencent', follow=True),
        # Detail pages linked from the list pages reached through the rule above
        Rule(LinkExtractor(allow=r'position_detail\.php\?id=\d+'), callback='parse_detail', follow=False),
    )

    def parse_tencent(self, response):
        print('start...')
        # Select every tr element whose class attribute is "even" or "odd"
        node_list = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for node in node_list:
            item = TencentItem()
            item['position_name'] = node.xpath('./td[1]/a/text()').extract_first()  # text of the a tag in the 1st td
            item['position_link'] = node.xpath('./td[1]/a/@href').extract_first()  # href of the a tag in the 1st td
            item['position_type'] = node.xpath('./td[2]/text()').extract_first()  # text of the 2nd td
            item['position_number'] = node.xpath('./td[3]/text()').extract_first()  # text of the 3rd td
            item['work_location'] = node.xpath('./td[4]/text()').extract_first()  # text of the 4th td
            item['publish_times'] = node.xpath('./td[5]/text()').extract_first()  # text of the 5th td
            yield item

    def parse_detail(self, response):
        # The page is expected to contain two ul.squareli lists: duties first, requirements second
        lists = response.xpath('//ul[@class="squareli"]')
        item = DetailItem()
        # Join the li texts of each list into a single string
        item['position_duty'] = ''.join(lists[0].xpath('./li/text()').extract())
        item['position_require'] = ''.join(lists[1].xpath('./li/text()').extract())
        yield item
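The href stored in position_link may be a relative URL; this is an assumption about the page markup, and the sample value below is made up. A minimal sketch of resolving such a value against the page URL with the standard library (inside a Scrapy callback, response.urljoin(href) resolves against the response URL in the same way):

```python
from urllib.parse import urljoin

page_url = "https://hr.tencent.com/position.php?start=10"
relative_href = "position_detail.php?id=123"  # hypothetical href value

# Resolve the relative link against the URL of the page it was found on.
absolute = urljoin(page_url, relative_href)
print(absolute)  # https://hr.tencent.com/position_detail.php?id=123
```

Storing the absolute URL makes each exported record usable on its own, without knowing which list page it came from.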

3. Write the pipelines file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

from .items import DetailItem, TencentItem


class TencentPipeline(object):

    def open_spider(self, spider):
        """Optional hook, called when the spider is opened.

        :param spider: the Spider object that was opened
        """
        self.file = open('tencent.json', 'w', encoding='utf-8')
        self.count = 0
        self.file.write('{ "tencent_info":[')  # JSON header

    def close_spider(self, spider):
        """Optional hook, called when the spider is closed.

        :param spider: the Spider object that was closed
        """
        self.file.write('] }')  # JSON tail
        self.file.close()

    def process_item(self, item, spider):
        """Required method; every item pipeline component is called through it.

        It must return the item; a dropped item is not processed by the
        pipeline components that come after this one.

        :param item: the scraped item
        :param spider: the spider that scraped the item
        :return: the item, so that it reaches the next pipeline
        """
        if isinstance(item, TencentItem):
            print('--' * 20)
            if self.count:
                # Write the separator before every item except the first, so there is
                # no trailing comma to strip and the file stays valid JSON even when
                # no item was scraped
                self.file.write(',')
            content = json.dumps(dict(item), ensure_ascii=False, indent=2)  # dict to JSON string
            self.count += 1
            print('content', self.count)
            self.file.write(content)  # save to file

        # After "return item" the item is passed, by priority, to the next pipeline
        # (DetailPipeline). An item that is not a TencentItem is not stored here and
        # goes straight on. The return must stay outside the if: inside it, any item
        # that is not a TencentItem would stop here and the detail data would be lost.
        return item

class DetailPipeline(object):

    def open_spider(self, spider):
        """Optional hook, called when the spider is opened.

        :param spider: the Spider object that was opened
        """
        self.file = open('detail.json', 'w', encoding='utf-8')
        self.count = 0
        self.file.write('{ "detail_info":[')  # JSON header

    def close_spider(self, spider):
        """Optional hook, called when the spider is closed.

        :param spider: the Spider object that was closed
        """
        self.file.write('] }')  # JSON tail
        self.file.close()

    def process_item(self, item, spider):
        """Required method; every item pipeline component is called through it.

        It must return the item; a dropped item is not processed by the
        pipeline components that come after this one.

        :param item: the scraped item
        :param spider: the spider that scraped the item
        :return: the item, so that it reaches the next pipeline
        """
        # Store the item only if it is a DetailItem; any other item is simply returned
        if isinstance(item, DetailItem):
            print('**' * 30)
            if self.count:
                # Separator before every item except the first: no trailing comma to strip
                self.file.write(',')
            content = json.dumps(dict(item), ensure_ascii=False, indent=2)  # dict to JSON string
            self.count += 1
            print('content', self.count)
            self.file.write(content)  # save to file
        return item
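Both pipelines build a JSON document by hand: a header, the items separated by commas, and a tail. One way to keep the result valid, including when no item was written at all, is to emit the comma before every item except the first. The following is a minimal sketch using an in-memory stream; JsonArrayWriter is a hypothetical helper and the sample records are made up.

```python
import io
import json


class JsonArrayWriter:
    """Incrementally writes {"<key>": [item, item, ...]} to a text stream."""

    def __init__(self, stream, key):
        self.stream = stream
        self.count = 0
        self.stream.write('{ "%s":[' % key)  # header

    def write_item(self, item):
        if self.count:
            self.stream.write(",")  # separator only between items
        # ensure_ascii=False keeps non-ASCII text readable in the output
        self.stream.write(json.dumps(item, ensure_ascii=False, indent=2))
        self.count += 1

    def close(self):
        self.stream.write("] }")  # tail


buf = io.StringIO()
writer = JsonArrayWriter(buf, "tencent_info")
writer.write_item({"position_name": "Backend Engineer", "work_location": "深圳"})
writer.write_item({"position_name": "Data Analyst", "work_location": "Shanghai"})
writer.close()

data = json.loads(buf.getvalue())  # parses, so the output is valid JSON
print(len(data["tencent_info"]))
```

Because the separator is written lazily, closing the writer right after opening it yields an empty but well-formed array.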

4. Configure settings

# 1. Project name; the default User-Agent is built from it, and it is also used as the logger name
BOT_NAME = 'tenCent'

# 2. Location of the spider modules
SPIDER_MODULES = ['tenCent.spiders']
NEWSPIDER_MODULE = 'tenCent.spiders'

# Browser-like User-Agent, to get past simple anti-crawler checks.
# The setting holds the header value only, without a "User-Agent": prefix.
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Logging
LOG_FILE = 'tencent.log'
LOG_LEVEL = 'DEBUG'
LOG_ENCODING = 'utf-8'
LOG_DATEFORMAT = '%m/%d/%Y %H:%M:%S %p'

# Item pipelines; lower values run first
ITEM_PIPELINES = {
    'tenCent.pipelines.TencentPipeline': 300,
    'tenCent.pipelines.DetailPipeline': 400,
}
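The numbers in ITEM_PIPELINES decide the order in which an item passes through the pipelines: lower values run first. The sketch below models that chain with plain Python objects. It is not Scrapy's engine: TypedPipeline, run_chain and the dict-based item classes are simplified, hypothetical stand-ins.

```python
ITEM_PIPELINES = {
    "tenCent.pipelines.TencentPipeline": 300,
    "tenCent.pipelines.DetailPipeline": 400,
}


class TencentItem(dict):
    """Stand-in for the Scrapy item of the same name."""


class DetailItem(dict):
    """Stand-in for the Scrapy item of the same name."""


class TypedPipeline:
    """Stores items of one type and always passes the item on."""

    def __init__(self, item_type):
        self.item_type = item_type
        self.stored = []

    def process_item(self, item):
        if isinstance(item, self.item_type):
            self.stored.append(item)
        # Returned outside the if, so other item types still reach later pipelines.
        return item


# Lower values run first.
order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
pipelines = {
    "tenCent.pipelines.TencentPipeline": TypedPipeline(TencentItem),
    "tenCent.pipelines.DetailPipeline": TypedPipeline(DetailItem),
}


def run_chain(item):
    """Pass one item through every pipeline in priority order."""
    for name in order:
        item = pipelines[name].process_item(item)
    return item


run_chain(TencentItem(position_name="Engineer"))
run_chain(DetailItem(position_duty="Build services"))
print(order)
```

Each stand-in pipeline ends up holding exactly one item: the DetailItem passes through the first pipeline untouched and is stored by the second.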

5. Run the spider

scrapy crawl crawl_tencent

The run produces three files: tencent.log (the log), tencent.json (the list-page items) and detail.json (the detail-page items).
