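Preface

After scraping job postings from Zhaopin (智联招聘), the next question is how to analyze the content and extract useful information. This article builds on the data collected in the previous post, “5分钟掌握智联招聘网站爬取并保存到MongoDB数据库” (scraping Zhaopin and saving the results to MongoDB in five minutes), and analyzes the postings returned for the keyword “python”. The results include the Top 10 cities in China by number of Python job postings, along with other related information.

I. Main analysis steps

Read the data
Clean the data
Analyze how the number of postings is distributed across major cities in China
Analyze monthly salaries nationwide
Build a word cloud from the job requirement descriptions to find the most frequent keywords
Pick two cities and, for each, analyze the monthly salary distribution and build a word cloud of the job requirements

II. Detailed analysis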

import pymongo
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.style.use('ggplot')

# Make matplotlib render Chinese text correctly
plt.rcParams['font.sans-serif'] = ['SimHei']  # set the default font
plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign '-' from rendering as a box in saved figures

1 Reading the data

client = pymongo.MongoClient('localhost')
db = client['zhilian']
table = db['python']
columns = ['zwmc',
           'gsmc',
           'zwyx',
           'gbsj',
           'gzdd',
           'fkl',
           'brief',
           'zw_link',
           '_id',
           'save_date']
# url_set =  set([records['zw_link'] for records in table.find()])
# print(url_set)
df = pd.DataFrame(list(table.find()), columns=columns)
# columns_update = ['job title',
#                   'company name',
#                   'monthly salary',
#                   'date posted',
#                   'work location',
#                   'response rate',
#                   'job description',
#                   'page link',
#                   '_id',
#                   'date saved']
# df.columns = columns_update
print('Total rows: {}'.format(df.shape[0]))
df.head(2)

The result is shown in Figure 1:
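
For readers without a MongoDB instance at hand, the following minimal sketch shows what the DataFrame construction does with documents of this shape. The two records are made up for illustration, and `df_demo` is a hypothetical name, not part of the original notebook.

```python
import pandas as pd

# Hypothetical documents shaped like the ones returned by table.find()
records = [
    {'_id': 1, 'zwmc': 'Python开发工程师', 'gsmc': 'A公司',
     'zwyx': '8000-10000', 'extra': 'ignored'},
    {'_id': 2, 'zwmc': 'Python爬虫工程师', 'gsmc': 'B公司', 'zwyx': '面议'},
]
columns = ['zwmc', 'gsmc', 'zwyx', 'gzdd']

# Passing columns= keeps only the listed keys, in that order;
# keys missing from the documents become all-NaN columns
df_demo = pd.DataFrame(records, columns=columns)
print(df_demo.shape)
print(df_demo['gzdd'].isna().all())
```

Keys that are not listed, such as `extra` here, are silently dropped, so the column list doubles as a simple schema for the analysis.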

2 Data cleaning

2.1 Convert the string dates to datetime

df['save_date'] = pd.to_datetime(df['save_date'])
print(df['save_date'].dtype)
# df['save_date']

datetime64[ns]
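
As a small illustration of the conversion, here is a sketch on made-up date strings. The `errors='coerce'` argument is an addition for the demo (the notebook call does not use it); it turns unparseable values into NaT instead of raising.

```python
import pandas as pd

# Made-up sample values; the last one is deliberately not a date
raw_dates = pd.Series(['2017-05-17', '2017-05-18', 'not a date'])

# errors='coerce' converts unparseable strings to NaT rather than raising
parsed = pd.to_datetime(raw_dates, errors='coerce')

print(parsed.dtype)
print(parsed.isna().sum())
```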

2.2 Keep only records whose monthly salary has the form “XXXX-XXXX”

df_clean = df[['zwmc',
           'gsmc',
           'zwyx',
           'gbsj',
           'gzdd',
           'fkl',
           'brief',
           'zw_link',
           'save_date']]
# Filter the monthly salary column, keeping only values of the form "XXXX-XXXX" to simplify later analysis
df_clean = df_clean[df_clean['zwyx'].str.contains(r'\d+-\d+', regex=True)]
print('Total rows: {}'.format(df_clean.shape[0]))
# df_clean.head()

Total rows: 22605
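
The pattern `\d+-\d+` is unanchored, so it also accepts values that merely contain a salary range. The sketch below compares it with an anchored variant on made-up salary strings; whether the scraped data actually contains values with trailing text is an assumption we have not checked.

```python
import pandas as pd

# Made-up salary strings for illustration
salaries = pd.Series(['8000-10000', '面议', '1000元以下',
                      '15000-20000元/月', None])

# Unanchored: True for any value that contains digits-digits somewhere
loose = salaries.str.contains(r'\d+-\d+', regex=True, na=False)
# Anchored: True only when the whole value is digits-digits
strict = salaries.str.contains(r'^\d+-\d+$', regex=True, na=False)

print(list(loose))
print(list(strict))
```

If values with trailing text do occur, the anchored pattern keeps them out of the later numeric conversion, which would otherwise fail on them.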

2.3 Split the monthly salary field into its lower and upper bounds

# http://stackoverflow.com/questions/14745022/pandas-dataframe-how-do-i-split-a-column-into-two
# http://stackoverflow.com/questions/20602947/append-column-to-pandas-dataframe
# Split each "min-max" string once on '-' and expand the two parts into separate columns
df_salary = df_clean['zwyx'].str.split('-', n=1, expand=True)
df_salary.columns = ['zwyx_min', 'zwyx_max']
df_clean_concat = pd.concat([df_clean, df_salary], axis=1)
# Convert both bounds from str to numbers
df_clean_concat['zwyx_min'] = pd.to_numeric(df_clean_concat['zwyx_min'])
df_clean_concat['zwyx_max'] = pd.to_numeric(df_clean_concat['zwyx_max'])
print(df_clean_concat.dtypes)
df_clean_concat.head(2)

The result is shown in Figure 2:
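
To make the effect of this step concrete, here is a toy version on two made-up salary strings. It uses `str.split(..., expand=True)`, and the names `df_demo` and `bounds` are hypothetical.

```python
import pandas as pd

# Two made-up salary ranges
df_demo = pd.DataFrame({'zwyx': ['8000-10000', '15000-20000']})

# Split once on '-' into two columns, then convert each column to numbers
bounds = df_demo['zwyx'].str.split('-', n=1, expand=True)
bounds = bounds.apply(pd.to_numeric)
bounds.columns = ['zwyx_min', 'zwyx_max']

# Attach the new columns by index
df_demo = df_demo.join(bounds)
print(df_demo)
```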

Sort the records by monthly salary

df_clean_concat.sort_values('zwyx_min',inplace=True)
# df_clean_concat.tail()

Check whether the scraped data contains duplicates

# Check for duplicate records, using the posting URL as the key
print(df_clean_concat[df_clean_concat.duplicated('zw_link')])

Empty DataFrame
Columns: [zwmc, gsmc, zwyx, gbsj, gzdd, fkl, brief, zw_link, save_date, zwyx_min, zwyx_max]
Index: []

The output above shows that the data contains no duplicates.
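
Had duplicates turned up, they could be dropped with `drop_duplicates`. A minimal sketch on made-up rows (the example.com URLs are placeholders, not real posting links):

```python
import pandas as pd

# Made-up rows; the first and third share the same link
df_demo = pd.DataFrame({
    'zw_link': ['http://example.com/a', 'http://example.com/b',
                'http://example.com/a'],
    'zwmc': ['Python开发', 'Python测试', 'Python开发'],
})

# duplicated() marks every repeat after the first occurrence
dup_mask = df_demo.duplicated('zw_link')
# drop_duplicates() keeps the first occurrence of each link
deduped = df_demo.drop_duplicates('zw_link', keep='first')

print(dup_mask.tolist())
print(len(deduped))
```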

3 Nationwide analysis of job postings

3.1 Distribution of job postings across major cities

# from IPython.core.display import display, HTML
ADDRESS = [ '北京', '上海', '广州', '深圳',
           '天津', '武汉', '西安', '成都', '大连',
           '长春', '沈阳', '南京', '济南', '青岛',
           '杭州', '苏州', '无锡', '宁波', '重庆',
           '郑州', '长沙', '福州', '厦门', '哈尔滨',
           '石家庄', '合肥', '惠州', '太原', '昆明',
           '烟台', '佛山', '南昌', '贵阳', '南宁']
df_city = df_clean_concat.copy()
# Work locations are often written with a district suffix, e.g. 北京-朝阳区 for Beijing.
# Normalize them to the bare city name with pandas' replace() method
for city in ADDRESS:
    df_city['gzdd'] = df_city['gzdd'].replace([(city+'.*')],[city],regex=True)
# Restrict the analysis to the major cities
df_city_main = df_city[df_city['gzdd'].isin(ADDRESS)]
# Select the two columns with a list; tuple-style ['zwmc','gsmc'] fails on recent pandas
df_city_main_count = df_city_main.groupby('gzdd')[['zwmc','gsmc']].count()
df_city_main_count['gsmc'] = df_city_main_count['gsmc']/(df_city_main_count['gsmc'].sum())
df_city_main_count.columns = ['number', 'percentage']
# Sort by number of postings
df_city_main_count.sort_values(by='number', ascending=False, inplace=True)
# Add a helper column combining city and percentage, for use as plot labels later
df_city_main_count['label']=df_city_main_count.index+ ' '+  ((df_city_main_count['percentage']*100).round()).astype('int').astype('str')+'%'
print(type(df_city_main_count))
# The Top 10 cities by number of postings
print(df_city_main_count.head(10))

<class 'pandas.core.frame.DataFrame'>
      number  percentage   label
gzdd
北京      6936    0.315948  北京 32%
上海      3213    0.146358  上海 15%
深圳      1908    0.086913   深圳 9%
成都      1290    0.058762   成都 6%
杭州      1174    0.053478   杭州 5%
广州      1167    0.053159   广州 5%
南京       826    0.037626   南京 4%
郑州       741    0.033754   郑州 3%
武汉       552    0.025145   武汉 3%
西安       473    0.021546   西安 2%
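
The normalize-then-count logic can be checked on a handful of made-up locations. The sketch below uses `value_counts()` as a shorter alternative to the groupby in the notebook; the sample rows and the two-city list are assumptions for illustration.

```python
import pandas as pd

cities = ['北京', '上海']
# Made-up work locations, some with district suffixes
df_demo = pd.DataFrame({'gzdd': ['北京-朝阳区', '北京', '上海-浦东新区', '三亚']})

# Collapse "city-district" style values to the bare city name
for city in cities:
    df_demo['gzdd'] = df_demo['gzdd'].replace(city + '.*', city, regex=True)

# Keep only the listed cities, then count postings and compute shares
main = df_demo[df_demo['gzdd'].isin(cities)]
counts = main['gzdd'].value_counts()
share = counts / counts.sum()

print(counts)
print(share.round(2))
```

Note that the shares are relative to the listed cities only, as in the table above: postings from unlisted cities (三亚 here) are excluded from the denominator.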

Plot the result:

from  matplotlib import cm
label = df_city_main_count['label']
sizes = df_city_main_count['number']
# Set the figure size
fig, axes = plt.subplots(figsize=(10,6),ncols=2)
ax1, ax2 = axes.ravel()
colors = cm.PiYG(np.arange(len(sizes))/len(sizes)) # other colormaps: Paired, autumn, rainbow, gray, spring, Dark2
# There are too many cities, so the pie chart shows neither labels nor percentages
patches, texts = ax1.pie(sizes,labels=None, shadow=False, startangle=0, colors=colors)
ax1.axis('equal')  
ax1.set_title('Distribution of job postings', loc='center')
# ax2 shows only the legend
ax2.axis('off')
ax2.legend(patches, label, loc='center left', fontsize=9)
plt.savefig('job_distribute.jpg')
plt.show()

The result is shown in the pie chart below:

3.2 Monthly salary distribution (nationwide)

from matplotlib.ticker import FormatStrFormatter
fig, (ax1, ax2) = plt.subplots(figsize=(10,8), nrows=2)
x_pos = list(range(df_clean_concat.shape[0]))
y1 = df_clean_concat['zwyx_min']
ax1.plot(x_pos, y1)
ax1.set_title('Trend of min monthly salary in China', size=14)
ax1.set_xticklabels('')
ax1.set_ylabel('min monthly salary(RMB)')
bins = [3000,6000, 9000, 12000, 15000, 18000, 21000, 24000, 100000]
# density=True replaces the normed=1 argument, which recent matplotlib versions no longer accept
counts, bins, patches = ax2.hist(y1, bins, density=True, histtype='bar', facecolor='g', rwidth=0.8)
ax2.set_title('Hist of min monthly salary in China', size=14)
ax2.set_yticklabels('')
# ax2.set_xlabel('min monthly salary(RMB)')
# http://stackoverflow.com/questions/6352740/matplotlib-label-each-bin
ax2.set_xticks(bins) # use the bin edges as xticks
ax2.set_xticklabels(bins, rotation=-90) # rotate the xticklabels
# Label the percentages below the x-axis...
bin_widths = np.diff(bins)
bin_centers = 0.5 * bin_widths + bins[:-1]
for count, width, x in zip(counts, bin_widths, bin_centers):
#     # Label the raw counts
#     ax2.annotate(str(count), xy=(x, 0), xycoords=('data', 'axes fraction'),
#         xytext=(0, -70), textcoords='offset points', va='top', ha='center', rotation=-90)
    # Label the percentages: with density=True, density * bin width is the bin's share.
    # The bins are not equally wide, so the densities alone would understate the last bin
    percent = '%0.0f%%' % (100 * count * width)
    ax2.annotate(percent, xy=(x, 0), xycoords=('data', 'axes fraction'),
        xytext=(0, -40), textcoords='offset points', va='top', ha='center', rotation=-90, color='b', size=14)
fig.savefig('salary_quanguo_min.jpg')

The result is shown in the figure below:
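
One caveat when the bins have unequal widths, as with the last bin in the list above (24000 to 100000): a normalized histogram returns densities, not shares, so a density has to be multiplied by its bin width before it can be read as a percentage. The sketch below demonstrates this with `np.histogram` on made-up salaries and a shortened, hypothetical bin list.

```python
import numpy as np

# Made-up minimum salaries and a shortened bin list with a wide last bin
salaries = np.array([3500, 4000, 7000, 8000, 8500, 30000, 50000, 90000])
bins = [3000, 6000, 9000, 100000]

raw_counts, edges = np.histogram(salaries, bins=bins)
density, _ = np.histogram(salaries, bins=bins, density=True)

# Correct share per bin: raw counts over the total ...
share_from_counts = raw_counts / raw_counts.sum()
# ... which equals density times bin width
share_from_density = density * np.diff(edges)
# Naive normalization of the densities understates the wide bin
naive = density / density.sum()

print(share_from_counts)
print(share_from_density)
print(naive)
```

With equal-width bins the naive version happens to give the same answer, which is why the problem only shows up once a catch-all bin is added.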

Monthly salary distribution after excluding some extreme values

df_zwyx_adjust = df_clean_concat[df_clean_concat['zwyx_min']<=20000]
fig, (ax1, ax2) = plt.subplots(figsize=(10,8), nrows=2)
x_pos = list(range(df_zwyx_adjust.shape[0]))
y1 = df_zwyx_adjust['zwyx_min']
ax1.plot(x_pos, y1)
ax1.set_title('Trend of min monthly salary in China (adjust)', size=14)
ax1.set_xticklabels('')
ax1.set_ylabel('min monthly salary(RMB)')
bins = [3000,6000, 9000, 12000, 15000, 18000, 21000]
# density=True replaces the normed=1 argument, which recent matplotlib versions no longer accept
counts, bins, patches = ax2.hist(y1, bins, density=True, histtype='bar', facecolor='g', rwidth=0.8)
ax2.set_title('Hist of min monthly salary in China (adjust)', size=14)
ax2.set_yticklabels('')
# ax2.set_xlabel('min monthly salary(RMB)')
# http://stackoverflow.com/questions/6352740/matplotlib-label-each-bin
ax2.set_xticks(bins) # use the bin edges as xticks
ax2.set_xticklabels(bins, rotation=-90) # rotate the xticklabels
# Label the percentages below the x-axis (all bins here have the same width)
bin_centers = 0.5 * np.diff(bins) + bins[:-1]
for count, x in zip(counts, bin_centers):
#     # Label the raw counts
#     ax2.annotate(str(count), xy=(x, 0), xycoords=('data', 'axes fraction'),
#         xytext=(0, -70), textcoords='offset points', va='top', ha='center', rotation=-90)
    # Label the percentages
    percent = '%0.0f%%' % (100 * float(count) / counts.sum())
    ax2.annotate(percent, xy=(x, 0), xycoords=('data', 'axes fraction'),
        xytext=(0, -40), textcoords='offset points', va='top', ha='center', rotation=-90, color='b', size=14)
fig.savefig('salary_quanguo_min_adjust.jpg')

The result is shown in the figure below:

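3.3 Required skills

brief_list = list(df_clean_concat['brief'])
brief_str = ''.join(brief_list)
print(type(brief_str))
# print(brief_str)
# Uncomment the next two lines to save the text to brief_quanguo.txt, which the word cloud script reads
# with open('brief_quanguo.txt', 'w', encoding='utf-8') as f:
#     f.write(brief_str)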

<class 'str'>

Generate a word cloud from the collected job requirements, using the following code:

# -*- coding: utf-8 -*-
"""
Created on Wed May 17 2017
@author: lemon
"""
import jieba
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import os
import PIL.Image as Image
import numpy as np
# Read the job requirement text; the with block closes the file automatically
with open('brief_quanguo.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# Segment the text with the jieba Chinese tokenizer first
wordlist = jieba.cut(text, cut_all=False)
# cut_all: True for full mode, False for accurate mode
wordlist_space_split = ' '.join(wordlist)
d = os.path.dirname(__file__)
alice_coloring = np.array(Image.open(os.path.join(d,'colors.png')))
my_wordcloud = WordCloud(background_color='#F0F8FF', max_words=100, mask=alice_coloring,
                         max_font_size=300, random_state=42).generate(wordlist_space_split)
image_colors = ImageColorGenerator(alice_coloring)
my_wordcloud.recolor(color_func=image_colors)  # recolor in place with the colors of the mask image
plt.imshow(my_wordcloud)            # display the word cloud as an image
plt.axis('off')                     # hide the axes
plt.show()
my_wordcloud.to_file(os.path.join(d, 'brief_quanguo_colors_cloud.png'))

The result is as follows:
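
A word cloud sizes words by how often they occur, so the same ranking can also be inspected as plain counts. The sketch below uses only the standard library and assumes the text has already been segmented into space-separated tokens (like `wordlist_space_split`). The function `top_keywords`, the sample string, and the stopword set are all made up for illustration.

```python
from collections import Counter

def top_keywords(segmented_text, stopwords=frozenset(), n=3):
    """Return the n most frequent tokens of a space-separated string."""
    # Lowercase so 'Python' and 'python' count together; drop 1-character tokens
    tokens = [t.lower() for t in segmented_text.split() if len(t) > 1]
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)

# Made-up, already segmented sample text
sample = 'Python 开发 经验 熟悉 Python Linux 熟悉 MySQL 的 Python'
print(top_keywords(sample, stopwords={'熟悉'}))
```

Filtering generic words such as 熟悉 ("familiar with") tends to matter for job descriptions, since they would otherwise crowd out the actual skill names.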

4 Beijing

4.1 Monthly salary distribution

df_beijing = df_clean_concat[df_clean_concat['gzdd'].str.contains('北京.*', regex=True)]
df_beijing.to_excel('zhilian_kw_python_bj.xlsx')
print('Total rows: {}'.format(df_beijing.shape[0]))
# df_beijing.head()

Total rows: 6936

Reusing the code from the nationwide analysis, the monthly salary distribution is shown below:

4.2 Required skills

brief_list_bj = list(df_beijing['brief'])
brief_str_bj = ''.join(brief_list_bj)
print(type(brief_str_bj))
# print(brief_str_bj)
# with open('brief_beijing.txt', 'w', encoding='utf-8') as f:
#     f.write(brief_str_bj)

<class 'str'>

The word cloud is as follows:

5 Changsha

5.1 Monthly salary distribution

df_changsha = df_clean_concat[df_clean_concat['gzdd'].str.contains('长沙.*', regex=True)]
# df_changsha = pd.DataFrame(df_changsha, ignore_index=True)
df_changsha.to_excel('zhilian_kw_python_cs.xlsx')
print('Total rows: {}'.format(df_changsha.shape[0]))
# df_changsha.tail()

Total rows: 280

Reusing the code from the nationwide analysis, the monthly salary distribution is shown below:

5.2 Required skills
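
To compare the two cities numerically as well as by eye, a small summary table can help. This is a sketch under stated assumptions: `city_salary_summary` is a hypothetical helper, the rows are made up, and the column names `gzdd` and `zwyx_min` follow the cleaned DataFrame used above.

```python
import pandas as pd

# Made-up rows in the shape of the cleaned data
df_demo = pd.DataFrame({
    'gzdd': ['北京', '北京-海淀区', '长沙', '长沙-岳麓区', '上海'],
    'zwyx_min': [15000, 20000, 8000, 10000, 18000],
})

def city_salary_summary(df, cities):
    """Count postings and summarize the lower salary bound per city."""
    rows = {}
    for city in cities:
        # Plain substring match, so district suffixes are included
        subset = df[df['gzdd'].str.contains(city, regex=False)]
        rows[city] = {
            'count': len(subset),
            'median_min': subset['zwyx_min'].median(),
            'mean_min': subset['zwyx_min'].mean(),
        }
    return pd.DataFrame.from_dict(rows, orient='index')

summary = city_salary_summary(df_demo, ['北京', '长沙'])
print(summary)
```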

brief_list_cs = list(df_changsha['brief'])
brief_str_cs = ''.join(brief_list_cs)
print(type(brief_str_cs))
# print(brief_str_cs)
# with open('brief_changsha.txt', 'w', encoding='utf-8') as f:
#     f.write(brief_str_cs)

<class 'str'>

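The word cloud is as follows:

zhilian_data_analysis.ipynb contains the main analysis code

word_cloud.py contains the word cloud code

Run the main analysis code first, then the word cloud code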

