Python Crawler Series 03 -- Job Listing Crawler

Job Listing Crawler

import requests
from lxml import etree

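# Raw Cookie string copied from a browser session on lagou.com.
# Session cookies expire, so replace this value with your own before running.
cookie = {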
    'Cookie':'user_trace_token=20181015184304-692c4bf4-4e71-4cfd-8906-6219253e0ae8; _ga=GA1.2.1135099826.1539600208; LGUID=20181015184305-18c8e815-d067-11e8-bc15-5254005c3644; index_location_city=%E5%85%A8%E5%9B%BD; _gid=GA1.2.73712408.1539738633; sajssdk_2015_cross_new_user=1; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221667fc4129f205-01a02c2a87905b-51422e1f-2073600-1667fc412a0a16%22%2C%22%24device_id%22%3A%221667fc4129f205-01a02c2a87905b-51422e1f-2073600-1667fc412a0a16%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; JSESSIONID=ABAAABAAADEAAFI1F6DEB9C84C5A5AADBE0CCBE43481EB7; _gat=1; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1539600208,1539738633,1539769054; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1539769054; LGSID=20181017173710-3879f572-d1f0-11e8-bb7a-525400f775ce; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_%25E8%25BF%2590%25E7%25BB%25B4%25E5%25B7%25A5%25E7%25A8%258B%25E5%25B8%2588%3Fcity%3D%25E4%25B8%258A%25E6%25B5%25B7%26cl%3Dfalse%26fromSearch%3Dtrue%26labelWords%3D%26suginput%3D; LGRID=20181017173710-3879f6d3-d1f0-11e8-bb7a-525400f775ce; SEARCH_ID=47902a4acdc34c47977e8eeb46c523f2'

}

# Request headers that mimic the browser's AJAX call from the search results page.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_%E8%BF%90%E7%BB%B4%E5%B7%A5%E7%A8%8B%E5%B8%88?city=%E4%B8%8A%E6%B5%B7&cl=false&fromSearch=true&labelWords=&suginput=',
}

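# Form fields sent with the search POST request.
data = {
    'first': False,
    'pn': 1,  # page number, overwritten for each page in url()
    'kd': '运维工程师',  # search keyword ("operations engineer")
}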

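def get_job(data):
    """Fetch one page of search results and print each job with its description."""
    # Alternative search URL for Shenzhen (city=%E6%B7%B1%E5%9C%B3):
    # url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E6%B7%B1%E5%9C%B3&needAddtionalResult=false&isSchoolJob=0'
    # city=%E4%B8%8A%E6%B5%B7 is the URL-encoded form of 上海 (Shanghai).
    url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false'
    page = requests.post(url=url, cookies=cookie, headers=headers, data=data)
    page.encoding = 'utf-8'
    result = page.json()
    jobs = result['content']['positionResult']['result']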

    for job in jobs:
        companyShortName = job['companyShortName']
        positionId = job['positionId']  # position ID, used in the detail page URL
        companyFullName = job['companyFullName']  # full company name
        companyLabelList = job['companyLabelList']  # company benefits
        companySize = job['companySize']  # company size
        industryField = job['industryField']
        createTime = job['createTime']  # posting time
        district = job['district']  # district
        education = job['education']  # education requirement
        financeStage = job['financeStage']  # financing stage (e.g. whether listed)
        firstType = job['firstType']  # top-level job category
        secondType = job['secondType']  # second-level job category
        formatCreateTime = job['formatCreateTime']
        publisherId = job['publisherId']  # publisher ID
        salary = job['salary']  # salary
        workYear = job['workYear']  # required years of experience
        positionName = job['positionName']  # position title
        jobNature = job['jobNature']  # job nature, e.g. full-time
        positionAdvantage = job['positionAdvantage']  # perks advertised for the position
        positionLables = job['positionLables']  # skill labels

        # Fetch the job detail page and extract the description paragraphs.
        detail_url = 'https://www.lagou.com/jobs/{}.html'.format(positionId)
        response = requests.get(url=detail_url, headers=headers, cookies=cookie)
        response.encoding = 'utf-8'
        tree = etree.HTML(response.text)
        desc = tree.xpath('//*[@id="job_detail"]/dd[2]/div/p/text()')

        print(companyFullName)
        print('%s Lagou link: -> %s' % (companyShortName, detail_url))
        print('Position: %s' % positionName)
        print('Job category: %s' % firstType)
        print('Salary: %s' % salary)
        print('Perks: %s' % positionAdvantage)
        print('District: %s' % district)
        print('Job nature: %s' % jobNature)
        print('Experience: %s' % workYear)
        print('Education: %s' % education)
        print('Posted: %s' % createTime)
        # Join the skill labels with commas instead of concatenating in a loop.
        print('Skill labels: %s' % ','.join(positionLables))
        print('Industry: %s' % industryField)
        for des in desc:
            print(des)

def url(data):
    """Crawl result pages 1 through 49, one search request per page."""
    for x in range(1, 50):
        data['pn'] = x
        get_job(data)

if __name__ == '__main__':
    url(data)
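
The `cookie` dictionary in the script stores the entire browser Cookie header under a single 'Cookie' key. As a possible alternative, the sketch below splits such a raw header string into individual name/value pairs, which is the dictionary shape the `cookies=` argument of `requests` is designed for. `parse_cookie_header` is a hypothetical helper name and the sample string is made up for illustration; neither is part of the original script.

```python
def parse_cookie_header(raw):
    """Split a raw 'name=value; name2=value2' Cookie header into a dict.

    Hypothetical helper, not part of the original script.
    """
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if not pair or '=' not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split('=', 1)  # a value may itself contain '='
        cookies[name.strip()] = value.strip()
    return cookies


# Made-up sample in the same format as the header used in the script.
sample = 'index_location_city=%E5%85%A8%E5%9B%BD; _gat=1; PRE_UTM=; token=a=b'
parsed = parse_cookie_header(sample)
print(parsed)
```

Under these assumptions, the resulting dictionary could be passed as `cookies=parse_cookie_header(raw)` in place of `cookie`.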

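`get_job` indexes `result['content']['positionResult']['result']` directly, so any response that lacks this nesting raises a `KeyError`. A minimal sketch of a more defensive lookup follows. `extract_jobs` is a hypothetical helper and both sample payloads are invented for illustration, so the real response shape may differ.

```python
def extract_jobs(result):
    """Return the job list from a parsed response, or [] if any level is missing.

    Hypothetical helper; the key path is the one used in get_job.
    """
    node = result
    for key in ('content', 'positionResult', 'result'):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    return node if isinstance(node, list) else []


# Invented payloads for illustration only.
ok_payload = {'content': {'positionResult': {'result': [{'positionId': 1}]}}}
bad_payload = {'success': False, 'msg': 'blocked'}

print(extract_jobs(ok_payload))
print(extract_jobs(bad_payload))
```

With a helper like this, the loop in `get_job` could iterate over `extract_jobs(result)` and simply do nothing for a page that returns no usable data.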