Scraping BOSS Zhipin static pages with Python and bs4

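Approach:

  1. Convert the cities to be queried into their city codes through the city API.

  2. Loop over the cities and job keywords to build the list URLs.

  3. Fetch each list page by URL and iterate over the jobs it lists.

  4. For each job, follow its job_link to the detail page and store the fields needed as a dict, data, appended to the list datas.

  5. Check whether the list page has a next page; if it does, repeat steps 3 and 4, passing the same datas list along.

  6. When one city/keyword URL has been crawled completely, extend the list datas_list with datas, then repeat steps 3, 4 and 5 for the next URL.

  7. Finally, iterate over datas_list and write the data to an Excel file.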
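The control flow of steps 2 to 6 can be tried offline before touching the real site. The sketch below is only an illustration under stated assumptions: `build_url`, `crawl`, `crawl_all` and the `fetch_page` callback are hypothetical names, the fake site is an in-memory dict, the empty `industry` and `position` parameters of the real URL are left out, and a `while` loop stands in for the recursion used in the full script further down. The city code is the `lastCity` value from the cookie in that script.

```python
from urllib.parse import urlencode

BASE = "https://www.zhipin.com/job_detail/"


def build_url(city_code, query):
    """Step 2: one list URL per (city, query) pair."""
    return BASE + "?" + urlencode({"query": query, "city": city_code})


def crawl(start_url, fetch_page):
    """Steps 3-5: follow "next" links, carrying one list through all pages.

    fetch_page(url) is a hypothetical callback returning
    (jobs_on_page, next_url_or_None).
    """
    datas = []
    url = start_url
    while url is not None:
        jobs, url = fetch_page(url)
        datas.extend(jobs)
    return datas


def crawl_all(city_codes, queries, fetch_page):
    """Step 6: append the result of every (city, query) pair to one list."""
    datas_list = []
    for city_code in city_codes:
        for query in queries:
            datas_list.extend(crawl(build_url(city_code, query), fetch_page))
    return datas_list


if __name__ == "__main__":
    first = build_url("101210100", "qa")
    # Fake two-page site: URL -> (jobs on that page, next URL or None).
    fake_site = {
        first: ([{"job_id": "1"}, {"job_id": "2"}], first + "&page=2"),
        first + "&page=2": ([{"job_id": "3"}], None),
    }
    result = crawl_all(["101210100"], ["qa"], fake_site.__getitem__)
    print([d["job_id"] for d in result])
```

Passing the fetch function in as a parameter keeps the paging logic separate from the network and parsing code, so it can be checked without sending any requests.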

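Knowledge points:

  1. Reading the response body as JSON, parsing it and extracting values

  2. Using the soup methods select(), find_all() and find()

  3. Using Exception handling

  4. Creating and editing an Excel file with xlwt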
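As a quick illustration of points 2 and 3, the sketch below runs the three lookup methods against a small invented HTML snippet. The markup only imitates the class names used in this post and is not the real page structure, and `salary_text` is a hypothetical helper added for the example.

```python
from bs4 import BeautifulSoup

# Invented HTML that only imitates the class names used in this post.
HTML = """
<div class="job-primary">
  <div class="info-primary"><a href="/job_detail/1.html" data-jobid="1">Tester</a></div>
</div>
<div class="job-primary">
  <div class="info-primary"><a href="/job_detail/2.html" data-jobid="2">QA engineer</a></div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# select() takes a CSS selector and returns a list of every match.
cards = soup.select(".job-primary")

# find_all() filters by tag name and attributes and also returns a list.
blocks = soup.find_all("div", attrs={"class": "info-primary"})

# find() returns only the first match, or None when nothing matches.
first_link = soup.find("a")
missing = soup.find("span", attrs={"class": "red"})


def salary_text(card):
    """Return the salary text, or None when the card has no salary element."""
    try:
        # find() gives None here, so .get_text() raises AttributeError.
        return card.find("span", attrs={"class": "red"}).get_text()
    except AttributeError:
        return None


print(len(cards), len(blocks))
print(first_link.get("href"), first_link.get("data-jobid"))
print(missing, salary_text(cards[0]))
```

Because find() returns None for a missing element, chained calls such as `.find(...).get_text()` fail with AttributeError as soon as the page layout changes. Catching that narrow exception per field is one way to keep a single missing element from aborting the whole crawl.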

import requests, time, xlwt

from bs4 import BeautifulSoup

class MyJob():

    def __init__(self, mycity, myquery):

        self.city = mycity

        self.query = myquery

        self.list_url = "https://www.zhipin.com/job_detail/?query=%s&city=%s&industry=&position="%(self.query, self.city)

        self.datas = []

        self.header = {

            'authority': 'www.zhipin.com',

            'method': 'GET',

            'scheme': 'https',

            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',

            'accept-encoding': 'gzip, deflate, br',

            'accept-language': 'zh-CN,zh;q=0.9',

            'cache-control': 'max-age=0',

            'cookie': 'lastCity=101210100;uab_collina=154408714637849548916323;toUrl=/;c=1558272251;g=-;l=l=%2Fwww.zhipin.com%2Fuser%2Flogin.html&r=; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1555852331,1556985726,1558169427,1558272251; __a=40505844.1544087205.1558169426.1558272251.41.14.4.31; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1558272385',

            'referer': 'https://www.zhipin.com/?ka=header-logo',

            'upgrade-insecure-requests': '',

            'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'

        }

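    # Convert city names to their city codes

    def get_city(self, city_list):

        city_url = "https://www.zhipin.com/wapi/zpCommon/data/city.json"  # city data endpoint

        city_json = requests.get(city_url).json()  # named city_json so the json module name is not shadowed

        zpData = city_json["zpData"]["cityList"]

        codes = []  # named codes so the built-in list is not shadowed

        for city in city_list:

            for data_sf in zpData:  # province level

                for data_dq in data_sf["subLevelModelList"]:  # city level

                    if city == data_dq["name"]:

                        codes.append(data_dq["code"])

        return codes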

    # Fetch every list page, following the next-page link

    def get_job_list(self, url, datas):

        print(url)

        html = requests.get(url, headers=self.header).text

        soup = BeautifulSoup(html, 'html.parser')

        jobs = soup.select(".job-primary")

        for job in jobs:

            data = {}

            # Look up each block of the job card once instead of repeating the search

            primary = job.find_all("div", attrs={"class": "info-primary"})[0]

            company = job.find_all("div", attrs={"class": "info-company"})[0]

            publis = job.find_all("div", attrs={"class": "info-publis"})[0]

            # Job ID

            data["job_id"] = primary.find("a").get("data-jobid")

            # Job link

            data["job_link"] = "https://www.zhipin.com" + primary.find("a").get("href")

            # Job title

            data["job_name"] = primary.find("div", attrs={"class": "job-title"}).get_text()

            # Salary

            data["job_red"] = primary.find("span", attrs={"class": "red"}).get_text()

            # Location, years of experience, education

            data["job_address"] = primary.find("p").get_text().split(" ")

            # Company link

            data["job_company_link"] = company.find("a").get("href")

            # Company info

            data["job_company"] = company.find("p").get_text().split(" ")

            # Boss link (the src of the img element)

            data["job_publis_link"] = publis.find("img").get("src")

            # Boss info

            data["job_publis"] = publis.find("h3").get_text().split(" ")

            time.sleep(5)  # pause between requests

            self.get_job_detail(data)  # fetch the detail page for this job

            print(data)

            datas.append(data)  # add this job to datas, until the whole page has been added

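        try:

            next_url = soup.find("div", attrs={"class": "page"}).find("a", attrs={"class": "next"}).get("href")

            # if next_url[-1] == "3":  # debugging: stop after the second page

            # On the last page the "next" link is javascript:; so raise to stop paging

            if not next_url or next_url.startswith("javascript"):

                raise Exception("no next page")

        except Exception as e:

            print("Reached the last page: %s" % e)

            return datas  # return the jobs collected from all pages

        else:

            time.sleep(5)

            next_url = "https://www.zhipin.com" + next_url

            self.get_job_list(next_url, datas)

            return datas  # return the jobs collected from all pages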

    # Fetch the detail page

    def get_job_detail(self, data):

        print(data["job_link"])

        html = requests.get(data["job_link"], headers=self.header).text

        soup = BeautifulSoup(html, 'html.parser')

        # Look up each block of the detail page once instead of repeating the search

        primary = soup.find_all("div", attrs={"class": "info-primary"})[0]

        content = soup.find_all("div", attrs={"class": "detail-content"})[0]

        op = soup.find_all("div", attrs={"class": "detail-op"})[1]

        # Hiring company

        data["detail_content_name"] = content.find("div", attrs={"class": "name"}).get_text()

        # Benefits

        data["detail_primary_tags"] = primary.find("div", attrs={"class": "job-tags"}).get_text().strip()

        # Job title

        data["detail_primary_name"] = primary.find("h1").get_text()

        # Hiring status

        data["detail_primary_status"] = primary.find("div", attrs={"class": "job-status"}).get_text()

        # Salary

        data["detail_primary_salary"] = primary.find("span", attrs={"class": "salary"}).get_text()

        # Location, years of experience, education

        data["detail_primary_address"] = primary.find("p").get_text()

        # Work address

        data["detail_content_address"] = content.find("div", attrs={"class": "location-address"}).get_text()

        # Job description, with a line break after each full-width semicolon

        data["detail_content_text"] = content.find("div", attrs={"class": "text"}).get_text().strip().replace("；", "\n")

        # Boss name

        data["detail_op_name"] = op.find("h2", attrs={"class": "name"}).get_text()

        # The gray line holds the boss's job title and status, separated by "·"

        gray_parts = op.find("p", attrs={"class": "gray"}).get_text().split("·")

        # Boss job title

        data["detail_op_job"] = gray_parts[0]

        # Boss status

        data["detail_op_status"] = gray_parts[1]

    # Write the collected data to Excel

    def setExcel(self, datas_list):

        book = xlwt.Workbook(encoding='utf-8')

        table = book.add_sheet("boss软件测试")

        table.write(0, 0, "ID")

        table.write(0, 1, "Job link")

        table.write(0, 2, "Job title")

        table.write(0, 3, "Salary")

        table.write(0, 4, "Location")

        table.write(0, 5, "Company link")

        table.write(0, 6, "Company info")

        table.write(0, 7, "Boss link")

        table.write(0, 8, "Boss info")

        table.write(0, 9, "Details")  # header only; the detail-page fields are written to columns 10-20

        i = 1

        for data in datas_list:

            table.write(i, 0, data["job_id"])

            table.write(i, 1, data["job_link"])

            table.write(i, 2, data["job_name"])

            table.write(i, 3, data["job_red"])

            table.write(i, 4, data["job_address"])

            table.write(i, 5, data["job_company_link"])

            table.write(i, 6, data["job_company"])

            table.write(i, 7, data["job_publis_link"])

            table.write(i, 8, data["job_publis"])

            table.write(i, 10, data["detail_content_name"])

            table.write(i, 11, data["detail_primary_name"])

            table.write(i, 12, data["detail_primary_status"])

            table.write(i, 13, data["detail_primary_salary"])

            table.write(i, 14, data["detail_primary_address"])

            table.write(i, 15, data["detail_content_text"])

            table.write(i, 16, data["detail_op_name"])

            table.write(i, 17, data["detail_op_job"])

            table.write(i, 18, data["detail_op_status"])

            table.write(i, 19, data["detail_primary_tags"])

            table.write(i, 20, data["detail_content_address"])

            i += 1

        book.save(r'C:\%s_boss软件测试.xls' % time.strftime('%Y%m%d%H%M%S'))

        print("Excel file saved")

if __name__ == '__main__':

    city_list = MyJob("","").get_city(["杭州"])  # 杭州 = Hangzhou

    query_list = ["软件测试", "测试工程师"]  # "software testing", "test engineer"

    datas_list = []

    for city in city_list:

        for query in query_list:

            myjob = MyJob(city, query)

            datas = myjob.get_job_list(myjob.list_url, myjob.datas)

            datas_list.extend(datas)

    myjob.setExcel(datas_list)
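
get_city needs a live request to the city endpoint, but the lookup itself can be tried offline. The sketch below is an illustration only: the payload is invented and models just the keys that get_city reads (`zpData`, `cityList`, `subLevelModelList`, `name`, `code`), and `city_codes` is a hypothetical helper. It builds a name-to-code dict once instead of using the triple loop, so if a city name occurred in more than one province only the last code would be kept.

```python
import json

# Invented payload: only the keys read by get_city are modelled.
SAMPLE = """
{"zpData": {"cityList": [
  {"name": "Province A", "subLevelModelList": [
    {"name": "City X", "code": 1},
    {"name": "City Y", "code": 2}]},
  {"name": "Province B", "subLevelModelList": [
    {"name": "City Z", "code": 3}]}
]}}
"""


def city_codes(payload, wanted):
    """Return the codes of the wanted city names, in the order requested."""
    by_name = {
        city["name"]: city["code"]
        for province in payload["zpData"]["cityList"]
        for city in province["subLevelModelList"]
    }
    # Names that are not in the payload are skipped silently.
    return [by_name[name] for name in wanted if name in by_name]


payload = json.loads(SAMPLE)
print(city_codes(payload, ["City Z", "City X", "Nowhere"]))
```

With the real endpoint, `payload` would be the value returned by `requests.get(city_url).json()`.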
