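Scraping Tencent job postings with Scrapy

This post uses the Scrapy framework to scrape Tencent's job postings. The start URL is: https://hr.tencent.com/position.php

The scraped fields are: position name, number of openings, work location, publish date, plus the detailed job requirements and job duties.

The results are saved to two files: one holds the first four fields from the listing page, the other holds the detail-page content.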

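1. Page analysis

Comparing the raw page source with the markup shown in the browser developer tools (F12) shows that the page is static: the job data is already present in the HTML returned by the server.

The source can therefore be parsed with XPath, reading the content under the tr tags. See the code sections for details.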
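The row extraction described above can be tried offline before writing the spider. The following is a minimal sketch, not the article's code: it uses the limited XPath subset in Python's standard xml.etree.ElementTree as a stand-in for Scrapy's response.xpath, and the sample markup is a hypothetical, simplified version of the listing table.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified listing markup (assumption); the real page has more columns.
SAMPLE = """
<table>
  <tr class="h"><td>Position</td><td>Type</td><td>Openings</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=1">Backend Engineer</a></td><td>Tech</td><td>2</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Product Manager</a></td><td>Product</td><td>1</td></tr>
</table>
"""


def extract_rows(markup):
    """Collect one dict per data row, skipping the header row."""
    root = ET.fromstring(markup)
    rows = []
    for tr in root.iter('tr'):
        # Only rows with class "even" or "odd" hold job postings
        if tr.get('class') not in ('even', 'odd'):
            continue
        link = tr.find('./td[1]/a')
        rows.append({
            'position_name': link.text,
            'position_link': link.get('href'),
            'position_type': tr.find('./td[2]').text,
            'wanted_number': tr.find('./td[3]').text,
        })
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

ElementTree only accepts well-formed markup, so this is useful for checking the row and column logic, not for parsing the live page.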

2. Editing items.py

After generating the project with scrapy startproject followed by the project name, open items.py and define the fields to scrape.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    """Fields taken from the listing page."""
    # Position name
    position_name = scrapy.Field()
    # Position category
    position_type = scrapy.Field()
    # Number of openings
    wanted_number = scrapy.Field()
    # Work location
    work_location = scrapy.Field()
    # Publish date
    publish_time = scrapy.Field()
    # Link to the detail page
    position_link = scrapy.Field()


class DetailsItem(scrapy.Item):
    """Fields taken from the detail page; these are saved to a separate file."""
    # Job duties
    work_duties = scrapy.Field()
    # Job requirements
    work_skills = scrapy.Field()

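3. Writing the spider

Generate the spider with scrapy genspider followed by the spider name and the domain, then open the generated file in the spiders folder and write the crawl logic. The code is as follows: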

# -*- coding: utf-8 -*-
import scrapy

# Import the item classes that define the fields to scrape
from tencent.items import TencentItem, DetailsItem


class TencentWantedSpider(scrapy.Spider):
    name = 'tencent_wanted'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']

    def parse(self, response):
        # Select the table rows that hold job postings
        node_list = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        # Read the href of the "next page" link
        next_page = response.xpath('//a[@id="next"]/@href').extract_first()

        # Fill one item per row, then follow the row's detail page
        for node in node_list:
            item = TencentItem()
            item['position_name'] = node.xpath('./td[1]/a/text()').extract_first()
            item['position_link'] = node.xpath('./td[1]/a/@href').extract_first()
            item['position_type'] = node.xpath('./td[2]/text()').extract_first()
            item['wanted_number'] = node.xpath('./td[3]/text()').extract_first()
            item['work_location'] = node.xpath('./td[4]/text()').extract_first()
            item['publish_time'] = node.xpath('./td[5]/text()').extract_first()
            yield item
            if item['position_link']:
                yield scrapy.Request(url=response.urljoin(item['position_link']),
                                     callback=self.details)

        # Follow the next page; skip a missing or non-navigational (javascript:) link,
        # which would otherwise raise an error or produce a bogus request on the last page
        if next_page and not next_page.startswith('javascript'):
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)

    def details(self, response):
        """Extract and parse the detail page."""
        # The first list holds the job duties, the second the job requirements
        sections = response.xpath('//ul[@class="squareli"]')
        if len(sections) < 2:
            return
        item = DetailsItem()
        item['work_duties'] = ''.join(sections[0].xpath('./li/text()').extract())
        item['work_skills'] = ''.join(sections[1].xpath('./li/text()').extract())
        yield item
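Turning the relative href values into absolute URLs is the step most likely to fail on the last page, where the next link may be missing or not a real URL. The following is a minimal sketch of that step using the standard library's urllib.parse.urljoin. The helper name resolve_link and the sample href values are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = 'https://hr.tencent.com/position.php'


def resolve_link(page_url, href):
    """Return an absolute URL for href, or None if it cannot be followed."""
    # A missing link or a javascript: pseudo-link is not a page to request
    if not href or href.lower().startswith('javascript:'):
        return None
    return urljoin(page_url, href)


# Hypothetical href values (assumptions, not taken from the live site)
next_url = resolve_link(PAGE_URL, 'position.php?start=10')
detail_url = resolve_link(PAGE_URL, 'position_detail.php?id=1')
last_page = resolve_link(PAGE_URL, 'javascript:;')
missing = resolve_link(PAGE_URL, None)

print(next_url)
print(detail_url)
print(last_page, missing)
```

Inside a spider callback, response.urljoin(href) performs the same resolution against the URL of the current response.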

4. Writing pipelines.py to save the scraped data

To save the scraped data, the pipeline must first be registered in settings.py through the ITEM_PIPELINES setting.
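A minimal sketch of that registration is shown here. The module path assumes the project is named tencent (as the imports in this post suggest) and that the pipeline class lives in tencent/pipelines.py. The value 300 is an arbitrary priority chosen for illustration; Scrapy runs pipelines in ascending order of this number.

```python
# settings.py (fragment)
# Assumption: the pipeline class is tencent.pipelines.TencentPipeline.
# The integer is the pipeline's order; lower values run first.
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}
```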

The pipeline code is as follows:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

from tencent.items import TencentItem, DetailsItem


class TencentPipeline(object):

    def open_spider(self, spider):
        """Called once when the spider is opened: open both output files."""
        self.file = open('tenc_wanted_2.json', 'w', encoding='utf-8')
        self.file_detail = open('tenc_wanted_detail.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False)
        # Check which item class this is and write it to the matching file
        if isinstance(item, TencentItem):
            self.file.write(content + '\n')
        elif isinstance(item, DetailsItem):
            self.file_detail.write(content + '\n')
        return item

    def close_spider(self, spider):
        """Called once when the spider is closed: close both output files."""
        self.file.close()
        self.file_detail.close()
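Because the pipeline writes one JSON object per line, each output file is a sequence of JSON lines, not a single JSON document, so calling json.load on the whole file fails once it holds more than one record. The following is a minimal sketch for reading such a file back. The helper name read_json_lines and the sample records are hypothetical, and a temporary file stands in for the real output.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Parse a file that holds one JSON object per line."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records, written the same way the pipeline writes them
sample = [
    {'position_name': '后台开发工程师', 'work_location': '深圳'},
    {'position_name': '产品经理', 'work_location': '北京'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.json')
    with open(path, 'w', encoding='utf-8') as f:
        for record in sample:
            f.write(json.dumps(record, ensure_ascii=False) + '\n')
    loaded = read_json_lines(path)

print(len(loaded), 'records read back')
```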

5. Results

6. Complete code

See: https://github.com/zInPython/Tencent_wanted
