pandas intensive practice
This article covers the same material better: http://wittyfans.com/coding/%E5%88%A9%E7%94%A8Pandas%E5%88%86%E6%9E%90%E7%BE%8E%E5%9B%BD%E4%BA%A4%E8%AD%A6%E5%BC%80%E6%94%BE%E7%9A%84%E6%90%9C%E6%9F%A5%E6%95%B0%E6%8D%AE.html
import pandas as pd
import matplotlib.pyplot as plt
# This magic must be declared before plots can be drawn inline in the notebook
%matplotlib inline
# The downloaded police traffic-stop data; ri stands for the Rhode Island police data
ri=pd.read_csv('police.csv')
ri.head()
| | stop_date | stop_time | county_name | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-02 | 01:55 | NaN | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
1 | 2005-01-18 | 08:15 | NaN | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
2 | 2005-01-23 | 23:15 | NaN | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
3 | 2005-02-20 | 17:15 | NaN | M | 1986.0 | 19.0 | White | Call for Service | Other | False | NaN | Arrest Driver | True | 16-30 Min | False |
4 | 2005-03-14 | 10:00 | NaN | F | 1984.0 | 21.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
ri.shape
(91741, 15)
ri.isnull().sum()
stop_date 0
stop_time 0
county_name 91741
driver_gender 5335
driver_age_raw 5327
driver_age 5621
driver_race 5333
violation_raw 5333
violation 5333
search_conducted 0
search_type 88545
stop_outcome 5333
is_arrested 5333
stop_duration 5333
drugs_related_stop 0
dtype: int64
Dropping a column
ri.head()
| | stop_date | stop_time | county_name | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-02 | 01:55 | NaN | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
1 | 2005-01-18 | 08:15 | NaN | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
2 | 2005-01-23 | 23:15 | NaN | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
3 | 2005-02-20 | 17:15 | NaN | M | 1986.0 | 19.0 | White | Call for Service | Other | False | NaN | Arrest Driver | True | 16-30 Min | False |
4 | 2005-03-14 | 10:00 | NaN | F | 1984.0 | 21.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
# Equivalent to ri.drop('county_name', axis=1, inplace=True)
# Drop the column that contains only missing values
ri.drop('county_name', axis='columns', inplace=True)
ri.shape
(91741, 14)
ri.columns
Index(['stop_date', 'stop_time', 'driver_gender', 'driver_age_raw',
'driver_age', 'driver_race', 'violation_raw', 'violation',
'search_conducted', 'search_type', 'stop_outcome', 'is_arrested',
'stop_duration', 'drugs_related_stop'],
dtype='object')
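# Drop every column whose values are all missing (county_name is already gone, so the shape does not change)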
ri.dropna(axis='columns',how='all').shape
(91741, 14)
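The difference between dropping a column by name and dropping all-missing columns is easier to see on a tiny frame. The following is a minimal sketch: the DataFrame `toy` and its values are made up for illustration and are not taken from police.csv.

```python
import numpy as np
import pandas as pd

# Made-up data: 'county' is entirely missing, 'age' is only partly missing
toy = pd.DataFrame({
    'county': [np.nan, np.nan, np.nan],
    'gender': ['M', 'F', 'M'],
    'age': [20.0, np.nan, 33.0],
})

# Drop one column by name (returns a new frame; toy itself is unchanged)
by_name = toy.drop('county', axis='columns')

# Drop every column in which all values are missing
all_missing = toy.dropna(axis='columns', how='all')

# Drop every remaining row that has at least one missing value
complete_rows = all_missing.dropna(axis='index', how='any')

print(by_name.columns.tolist())      # ['gender', 'age']
print(all_missing.columns.tolist())  # ['gender', 'age']
print(complete_rows.shape)           # (2, 2)
```

Without `inplace=True` both methods return a new DataFrame, so the result has to be assigned to a name to be kept.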
Filtering with pandas
Keep the rows for which a boolean condition is True; here we keep the rows whose violation is 'Speeding'
ri[ri.violation=='Speeding'].head()
| | stop_date | stop_time | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-02 | 01:55 | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
1 | 2005-01-18 | 08:15 | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
2 | 2005-01-23 | 23:15 | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
4 | 2005-03-14 | 10:00 | F | 1984.0 | 21.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
6 | 2005-04-01 | 17:30 | M | 1969.0 | 36.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
value_counts
# How many men and women were stopped for speeding
print(ri[ri.violation=='Speeding'].driver_gender.value_counts())
M 32979
F 15482
Name: driver_gender, dtype: int64
# Share of men and women among speeding stops; normalize=True returns proportions instead of counts
print(ri[ri.violation=='Speeding'].driver_gender.value_counts(normalize=True))
M 0.680527
F 0.319473
Name: driver_gender, dtype: float64
ri.loc[ri.violation=='Speeding','driver_gender'].value_counts(normalize=True)
M 0.680527
F 0.319473
Name: driver_gender, dtype: float64
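The filtering and counting steps above can be sketched on a few made-up rows. The frame `toy` below is hypothetical and is not taken from police.csv; it only shows that a boolean mask, `value_counts()` and the `.loc` form fit together as described.

```python
import pandas as pd

# Made-up stops, for illustration only
toy = pd.DataFrame({
    'violation': ['Speeding', 'Speeding', 'Equipment', 'Speeding', 'Other'],
    'driver_gender': ['M', 'F', 'M', 'M', 'F'],
})

# A comparison produces a boolean Series with one value per row
mask = toy.violation == 'Speeding'

# Counts per gender among the rows where the mask is True
counts = toy[mask].driver_gender.value_counts()

# Same selection written with .loc, returning proportions instead of counts
shares = toy.loc[mask, 'driver_gender'].value_counts(normalize=True)

print(counts['M'], counts['F'])   # 2 1
print(round(shares['M'], 3))      # 0.667
```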
# Share of each violation type among male drivers
ri[ri.driver_gender == 'M'].violation.value_counts(normalize=True)
Speeding 0.524350
Moving violation 0.207012
Equipment 0.135671
Other 0.057668
Registration/plates 0.038461
Seat belt 0.036839
Name: violation, dtype: float64
# Share of each violation type among female drivers
ri[ri.driver_gender=='F'].violation.value_counts(normalize=True)
Speeding 0.658500
Moving violation 0.136277
Equipment 0.105780
Registration/plates 0.043086
Other 0.029348
Seat belt 0.027009
Name: violation, dtype: float64
The groupby method
Show the share of each violation value within each driver_gender
# Compare with the two results above
ri.groupby('driver_gender').violation.value_counts(normalize=True)
driver_gender violation
F Speeding 0.658500
Moving violation 0.136277
Equipment 0.105780
Registration/plates 0.043086
Other 0.029348
Seat belt 0.027009
M Speeding 0.524350
Moving violation 0.207012
Equipment 0.135671
Other 0.057668
Registration/plates 0.038461
Seat belt 0.036839
Name: violation, dtype: float64
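As a small check that the grouped result agrees with filtering one gender at a time, here is a sketch on made-up rows (the frame `toy` is hypothetical, not part of the dataset).

```python
import pandas as pd

# Made-up stops, for illustration only
toy = pd.DataFrame({
    'driver_gender': ['M', 'M', 'M', 'M', 'F', 'F'],
    'violation': ['Speeding', 'Speeding', 'Equipment', 'Other',
                  'Speeding', 'Equipment'],
})

# Proportions of each violation, computed separately inside each gender group
grouped = toy.groupby('driver_gender').violation.value_counts(normalize=True)

# The same number obtained by filtering the male drivers first
male_only = toy[toy.driver_gender == 'M'].violation.value_counts(normalize=True)

print(grouped.loc[('M', 'Speeding')])  # 0.5
print(male_only['Speeding'])           # 0.5
```

The grouped result carries a two-level index (gender, violation), so the proportions sum to 1 within each gender rather than over the whole frame.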
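The mean method
The mean of a boolean column is the proportion of True values
# True means a search was conducted; False means no search was conducted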
print(ri.search_conducted.value_counts(normalize=True))
False 0.965163
True 0.034837
Name: search_conducted, dtype: float64
# Here mean() gives the proportion of True values
print(ri.search_conducted.mean())
0.03483720473942948
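Why `mean()` returns a proportion for a boolean column: True counts as 1 and False as 0, so the mean is the number of True values divided by the number of rows. A minimal sketch with made-up flags:

```python
import pandas as pd

# Made-up flags, for illustration only
search_conducted = pd.Series([False, False, True, False, False])

n_true = search_conducted.sum()       # True is counted as 1
share_true = search_conducted.mean()  # sum divided by the number of rows

print(n_true, share_true)  # 1 0.2
```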
Group by gender and compare the search rates
ri.groupby('driver_gender').search_conducted.mean()
driver_gender
F 0.020033
M 0.043326
Name: search_conducted, dtype: float64
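Men are searched at a higher rate than women
Now group by more than one column and compare the search rates of men and women within each violation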
ri.groupby(['violation','driver_gender']).search_conducted.mean()
violation driver_gender
Equipment F 0.042622
M 0.070081
Moving violation F 0.036205
M 0.059831
Other F 0.056522
M 0.047146
Registration/plates F 0.066140
M 0.110376
Seat belt F 0.012598
M 0.037980
Speeding F 0.008720
M 0.024925
Name: search_conducted, dtype: float64
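Grouping by one column and by a list of columns can be compared on a handful of made-up rows. The frame `toy` below is hypothetical and only illustrates the shape of the two results.

```python
import pandas as pd

# Made-up stops, for illustration only
toy = pd.DataFrame({
    'violation': ['Speeding', 'Speeding', 'Speeding',
                  'Equipment', 'Equipment', 'Equipment'],
    'driver_gender': ['M', 'M', 'F', 'M', 'F', 'F'],
    'search_conducted': [True, False, False, True, True, False],
})

# Search rate per gender
by_gender = toy.groupby('driver_gender').search_conducted.mean()

# Search rate per (violation, gender) pair
by_both = toy.groupby(['violation', 'driver_gender']).search_conducted.mean()

print(by_both.loc[('Speeding', 'M')])   # 0.5
print(by_both.loc[('Equipment', 'F')])  # 0.5
```

An overall difference between two groups can shrink, grow, or reverse inside individual subgroups, which is why the multi-column view is worth checking.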
ri.isnull().sum()
stop_date 0
stop_time 0
driver_gender 5335
driver_age_raw 5327
driver_age 5621
driver_race 5333
violation_raw 5333
violation 5333
search_conducted 0
search_type 88545
stop_outcome 5333
is_arrested 5333
stop_duration 5333
drugs_related_stop 0
dtype: int64
# Is search_type missing whenever search_conducted is False?
ri.search_conducted.value_counts()
False 88545
True 3196
Name: search_conducted, dtype: int64
The False count (88545) is the same as the number of missing search_type values above
Verify it once more
ri[ri.search_conducted==False].search_type.value_counts()
Series([], Name: search_type, dtype: int64)
# value_counts() ignores missing values (NaN) by default
ri[ri.search_conducted==False].search_type.value_counts(dropna=False)
NaN 88545
Name: search_type, dtype: int64
# When search_conducted is True, search_type is never missing
ri[ri.search_conducted==True].search_type.value_counts(dropna=False)
Incident to Arrest 1219
Probable Cause 891
Inventory 220
Reasonable Suspicion 197
Protective Frisk 161
Incident to Arrest,Inventory 129
Incident to Arrest,Probable Cause 106
Probable Cause,Reasonable Suspicion 75
Incident to Arrest,Inventory,Probable Cause 34
Incident to Arrest,Protective Frisk 33
Probable Cause,Protective Frisk 33
Inventory,Probable Cause 22
Incident to Arrest,Reasonable Suspicion 13
Incident to Arrest,Inventory,Protective Frisk 11
Protective Frisk,Reasonable Suspicion 11
Inventory,Protective Frisk 11
Incident to Arrest,Probable Cause,Protective Frisk 10
Incident to Arrest,Probable Cause,Reasonable Suspicion 6
Incident to Arrest,Inventory,Reasonable Suspicion 4
Inventory,Reasonable Suspicion 4
Inventory,Probable Cause,Protective Frisk 2
Inventory,Probable Cause,Reasonable Suspicion 2
Incident to Arrest,Protective Frisk,Reasonable Suspicion 1
Probable Cause,Protective Frisk,Reasonable Suspicion 1
Name: search_type, dtype: int64
ri[ri.search_conducted==True].search_type.isnull().sum()
0
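The effect of `dropna=False` can be reproduced on a short made-up Series. The values below are hypothetical and only mimic the search_type column.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only: three of five entries are missing
search_type = pd.Series(['Inventory', np.nan, np.nan, 'Probable Cause', np.nan])

default_counts = search_type.value_counts()            # NaN is left out
with_missing = search_type.value_counts(dropna=False)  # NaN gets its own row

print(default_counts.sum())        # 2
print(with_missing.sum())          # 5
print(search_type.isnull().sum())  # 3
```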
Look at the search types
ri.search_type.value_counts(dropna=False)
NaN 88545
Incident to Arrest 1219
Probable Cause 891
Inventory 220
Reasonable Suspicion 197
Protective Frisk 161
Incident to Arrest,Inventory 129
Incident to Arrest,Probable Cause 106
Probable Cause,Reasonable Suspicion 75
Incident to Arrest,Inventory,Probable Cause 34
Incident to Arrest,Protective Frisk 33
Probable Cause,Protective Frisk 33
Inventory,Probable Cause 22
Incident to Arrest,Reasonable Suspicion 13
Inventory,Protective Frisk 11
Incident to Arrest,Inventory,Protective Frisk 11
Protective Frisk,Reasonable Suspicion 11
Incident to Arrest,Probable Cause,Protective Frisk 10
Incident to Arrest,Probable Cause,Reasonable Suspicion 6
Incident to Arrest,Inventory,Reasonable Suspicion 4
Inventory,Reasonable Suspicion 4
Inventory,Probable Cause,Reasonable Suspicion 2
Inventory,Probable Cause,Protective Frisk 2
Incident to Arrest,Protective Frisk,Reasonable Suspicion 1
Probable Cause,Protective Frisk,Reasonable Suspicion 1
Name: search_type, dtype: int64
ri['frisk']=ri.search_type=='Protective Frisk'
ri.frisk.dtype
dtype('bool')
ri.frisk.sum()
161
ri.frisk.mean()
0.0017549405391264537
ri.frisk.value_counts()
False 91580
True 161
Name: frisk, dtype: int64
161/(91580+161)
0.0017549405391264537
String operations
# The code above assigned the result of ri.search_type=='Protective Frisk' to the ri['frisk'] column
# Now use a substring (contains) match instead
ri['frisk']=ri.search_type.str.contains('Protective Frisk')
ri.frisk.sum()
274
ri.frisk.mean()
0.08573216520650813
# mean() above gave the share of rows that match; value_counts() shows the counts behind it
ri.frisk.value_counts()
False 2922
True 274
Name: frisk, dtype: int64
# Check that the manual calculation matches the result of mean()
274/(2922+274)
0.08573216520650813
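The section above used string methods to find partial matches
Compute each proportion against the right denominator: after str.contains, frisk is NaN wherever search_type is missing, so mean() above is 274/3196 (stops with a search), not 274/91741 (all stops)
pandas calculations ignore missing values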
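One way to make the denominator explicit is to choose it in code instead of relying on how missing values are handled. The sketch below uses a made-up Series; `searched_only` and `all_stops` are hypothetical names, and the `na=False` argument of `str.contains` is used to count missing entries as non-matches.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only: two of five stops have no search
search_type = pd.Series([
    'Protective Frisk',
    'Incident to Arrest,Protective Frisk',
    'Inventory',
    np.nan,
    np.nan,
])

# Exact match finds only the entries that are exactly 'Protective Frisk'
exact = search_type == 'Protective Frisk'

# Partial match among stops that had a search (missing entries removed first)
searched_only = search_type.dropna().str.contains('Protective Frisk')

# Partial match among all stops (missing entries treated as False)
all_stops = search_type.str.contains('Protective Frisk', na=False)

print(exact.sum())           # 1
print(searched_only.sum())   # 2
print(all_stops.mean())      # 0.4
```

Here `searched_only.mean()` is 2 out of 3 searched stops, while `all_stops.mean()` is 2 out of 5 stops overall; which one is right depends on the question being asked.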
# Which year has the fewest stops?
ri.stop_date.str.slice(0,4).value_counts()
2012 10970
2006 10639
2007 9476
2014 9228
2008 8752
2015 8599
2011 8126
2013 7924
2009 7908
2010 7561
2005 2558
Name: stop_date, dtype: int64
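# Convert ri.stop_date to datetime and store the result in a new column, stop_datetime
ri['stop_datetime'] = pd.to_datetime(ri.stop_date)
# Note the .dt accessor used below, which works like the .str accessor used above
# After .dt you can use attributes such as year and month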
ri.stop_datetime.dt.year.value_counts()
2012 10970
2006 10639
2007 9476
2014 9228
2008 8752
2015 8599
2011 8126
2013 7924
2009 7908
2010 7561
2005 2558
Name: stop_datetime, dtype: int64
ri.stop_datetime.dt.month.value_counts()
1 8479
5 7935
11 7877
10 7745
3 7742
6 7630
8 7615
7 7568
4 7529
9 7427
12 7152
2 7042
Name: stop_datetime, dtype: int64
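The two ways of getting the year, slicing the text and converting to datetime, can be compared on a few made-up dates. This is only a sketch; the Series below is hypothetical.

```python
import pandas as pd

# Made-up dates stored as text, for illustration only
stop_date = pd.Series(['2005-01-02', '2005-03-14', '2006-07-30'])

# String slicing keeps the year as text
as_text_year = stop_date.str.slice(0, 4)

# Converting to datetime gives access to numeric parts through .dt
stop_datetime = pd.to_datetime(stop_date)

print(as_text_year.value_counts().loc['2005'])         # 2
print(stop_datetime.dt.year.value_counts().loc[2005])  # 2
print(stop_datetime.dt.month.tolist())                 # [1, 3, 7]
```

The counts agree, but the sliced year is a string ('2005') while `.dt.year` is an integer (2005), which matters for sorting and arithmetic.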
# Drug-related stops
ri.drugs_related_stop.dtype
dtype('bool')
# Baseline rate
ri.drugs_related_stop.mean()
0.008883705213590434
# You cannot group by hour until you have an hour value to group on
# Parse the stop_time column as datetime, then group by its hour
ri['stop_time_datetime']=pd.to_datetime(ri.stop_time)
ri.groupby(ri.stop_time_datetime.dt.hour).drugs_related_stop.mean()
stop_time_datetime
0 0.019728
1 0.013507
2 0.015462
3 0.017065
4 0.011811
5 0.004762
6 0.003040
7 0.003281
8 0.002687
9 0.006288
10 0.005714
11 0.006976
12 0.004467
13 0.010326
14 0.007810
15 0.006416
16 0.005723
17 0.005517
18 0.010148
19 0.011596
20 0.008084
21 0.013342
22 0.013533
23 0.016344
Name: drugs_related_stop, dtype: float64
# Plot of the drug-related stop rate by hour
ri.groupby(ri.stop_time_datetime.dt.hour).drugs_related_stop.mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x9d72d30>
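# Plot of the number of stops per hour (all stops, not only drug-related ones); value_counts() orders by frequency, not by hour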
ri.stop_time_datetime.dt.hour.value_counts().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x5460710>
# Plot of the number of stops per hour, sorted by hour
ri.stop_time_datetime.dt.hour.value_counts().sort_index().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x5420860>
ri.groupby(ri.stop_time_datetime.dt.hour).stop_date.count().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x557c2e8>
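The hour-based grouping can be sketched without plotting. The frame `toy` is made up, and passing `format='%H:%M'` to `pd.to_datetime` is an assumption of this sketch rather than what the notebook does (the notebook lets pandas infer the format, which attaches the current date to each time).

```python
import pandas as pd

# Made-up stops, for illustration only
toy = pd.DataFrame({
    'stop_time': ['01:55', '01:10', '08:15', '23:15', '08:40'],
    'drugs_related_stop': [True, False, False, True, False],
})

# Parse the time text and keep only the hour of the day
hour = pd.to_datetime(toy.stop_time, format='%H:%M').dt.hour

# Rate of drug-related stops within each hour
rate_by_hour = toy.groupby(hour).drugs_related_stop.mean()

# Number of stops per hour, ordered by hour rather than by frequency
stops_by_hour = hour.value_counts().sort_index()

print(rate_by_hour.loc[1])           # 0.5
print(stops_by_hour.index.tolist())  # [1, 8, 23]
```

Calling `.plot()` on either result would draw the corresponding line chart; sorting by index first keeps the x-axis in hour order.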
# Mark bad data as missing
ri.stop_duration.value_counts()
0-15 Min 69543
16-30 Min 13635
30+ Min 3228
1 1
2 1
Name: stop_duration, dtype: int64
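# Chained indexing assigns to a temporary copy, so this line does not change ri (see the warning below)
ri[(ri.stop_duration=='1')|(ri.stop_duration=='2')].stop_duration='NaN'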
C:\Anaconda3\lib\site-packages\pandas\core\generic.py:4401: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self[name] = value
ri.stop_duration.value_counts()
0-15 Min 69543
16-30 Min 13635
30+ Min 3228
1 1
2 1
Name: stop_duration, dtype: int64
# .loc assigns in place, but 'NaN' here is still a string, not a real missing value
ri.loc[(ri.stop_duration=='1')|(ri.stop_duration=='2'),'stop_duration']='NaN'
ri.stop_duration.value_counts(dropna=False)
0-15 Min 69543
16-30 Min 13635
NaN 5333
30+ Min 3228
NaN 2
Name: stop_duration, dtype: int64
# Replace the string 'NaN' with a real missing value, np.nan
import numpy as np
ri.loc[ri.stop_duration == 'NaN', 'stop_duration'] = np.nan
ri.stop_duration.value_counts(dropna=False)
0-15 Min 69543
16-30 Min 13635
NaN 5335
30+ Min 3228
Name: stop_duration, dtype: int64
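# Alternative: replace does the same in one step; assigning the result back avoids inplace=True on a column selection
ri['stop_duration'] = ri.stop_duration.replace(['1', '2'], value=np.nan)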
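The cleanup above boils down to one `.loc` assignment with a real missing value. A minimal sketch on made-up data (the frame `toy` and the name `bad` are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up durations, for illustration only; '1' and '2' are the bad entries
toy = pd.DataFrame({
    'stop_duration': ['0-15 Min', '1', '16-30 Min', '2', '30+ Min'],
})

# Boolean mask of the rows to fix
bad = toy.stop_duration.isin(['1', '2'])

# One .loc call selects rows and column together, so the assignment reaches toy itself
toy.loc[bad, 'stop_duration'] = np.nan

print(toy.stop_duration.isnull().sum())       # 2
print(toy.stop_duration.dropna().tolist())    # ['0-15 Min', '16-30 Min', '30+ Min']
```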
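# Estimate the stop duration in minutes for each kind of violation
# Series.map accepts a function or a dict-like object that holds a mapping.
# It applies an operation to every value in a column; here it replaces values in bulk
mapping={'0-15 Min':8,'16-30 Min':23,'30+ Min':45}
# Remember that map does not modify the original data in place; store the mapped result in a new column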
ri['stop_minutes'] = ri.stop_duration.map(mapping)
# Check the values produced by the mapping
ri.stop_minutes.value_counts()
8.0 69543
23.0 13635
45.0 3228
Name: stop_minutes, dtype: int64
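A short sketch of what `map` does with a dictionary, using the same mapping as above on a made-up Series. Values that have no key in the dictionary, including missing ones, come out as NaN.

```python
import pandas as pd

# Made-up durations, for illustration only; the last entry is missing
stop_duration = pd.Series(['0-15 Min', '16-30 Min', '30+ Min', '0-15 Min', None])

# Same mapping as in the notebook: each range is replaced by a representative number of minutes
mapping = {'0-15 Min': 8, '16-30 Min': 23, '30+ Min': 45}

# map returns a new Series; stop_duration itself is unchanged
stop_minutes = stop_duration.map(mapping)

print(stop_minutes.tolist())  # [8.0, 23.0, 45.0, 8.0, nan]
print(stop_minutes.mean())    # 21.0
```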
ri.groupby('violation_raw').stop_minutes.mean()
violation_raw
APB 20.987342
Call for Service 22.034669
Equipment/Inspection Violation 11.460345
Motorist Assist/Courtesy 16.916256
Other Traffic Violation 13.900265
Registration Violation 13.745629
Seatbelt Violation 9.741531
Special Detail/Directed Patrol 15.061100
Speeding 10.577690
Suspicious Person 18.750000
Violation of City/Town Ordinance 13.388626
Warrant 21.400000
Name: stop_minutes, dtype: float64
# agg applies one or more functions, such as mean and count, to the grouped data.
# agg used to work only on groupby objects; it now also works on DataFrame and Series objects.
ri.groupby('violation_raw').stop_minutes.agg(['mean','count'])
violation_raw | mean | count |
---|---|---|
APB | 20.987342 | 79 |
Call for Service | 22.034669 | 1298 |
Equipment/Inspection Violation | 11.460345 | 11020 |
Motorist Assist/Courtesy | 16.916256 | 203 |
Other Traffic Violation | 13.900265 | 16223 |
Registration Violation | 13.745629 | 3432 |
Seatbelt Violation | 9.741531 | 2952 |
Special Detail/Directed Patrol | 15.061100 | 2455 |
Speeding | 10.577690 | 48462 |
Suspicious Person | 18.750000 | 56 |
Violation of City/Town Ordinance | 13.388626 | 211 |
Warrant | 21.400000 | 15 |
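The same `agg` call works on a grouped column and on a plain Series. A minimal sketch on made-up rows (the frame `toy` is hypothetical):

```python
import pandas as pd

# Made-up stops, for illustration only; one duration is missing
toy = pd.DataFrame({
    'violation_raw': ['Speeding', 'Speeding', 'Warrant', 'Warrant', 'Warrant'],
    'stop_minutes': [8.0, 23.0, 45.0, 23.0, None],
})

# One row per group, one column per aggregation function
summary = toy.groupby('violation_raw').stop_minutes.agg(['mean', 'count'])

# agg on an ungrouped Series returns one value per function
overall = toy.stop_minutes.agg(['mean', 'count'])

print(summary.loc['Speeding', 'mean'])  # 15.5
print(summary.loc['Warrant', 'count'])  # 2
print(overall['mean'])                  # 24.75
```

Showing the count next to the mean helps judge how far a group mean can be trusted; in the table above, Warrant has only 15 stops.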
plot draws a line chart by default
ri.groupby('violation_raw').stop_minutes.mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x10873ef0>
# Switch to a bar chart
ri.groupby('violation_raw').stop_minutes.mean().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x1092eb38>
ri.groupby('violation_raw').stop_minutes.mean().plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x10a4a5f8>
ri.groupby('violation').driver_age.describe()
violation | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
Equipment | 11007.0 | 31.781503 | 11.400900 | 16.0 | 23.0 | 28.0 | 38.0 | 89.0 |
Moving violation | 16164.0 | 36.120020 | 13.185805 | 15.0 | 25.0 | 33.0 | 46.0 | 99.0 |
Other | 4204.0 | 39.536870 | 13.034639 | 16.0 | 28.0 | 39.0 | 49.0 | 87.0 |
Registration/plates | 3427.0 | 32.803035 | 11.033675 | 16.0 | 24.0 | 30.0 | 40.0 | 74.0 |
Seat belt | 2952.0 | 32.206301 | 11.213122 | 17.0 | 24.0 | 29.0 | 38.0 | 77.0 |
Speeding | 48361.0 | 33.530097 | 12.821847 | 15.0 | 23.0 | 30.0 | 42.0 | 90.0 |
ri.driver_age.plot(kind='hist')
<matplotlib.axes._subplots.AxesSubplot at 0x1003a518>
ri.driver_age.value_counts().sort_index().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x10088080>
ri.hist('driver_age', by='violation')
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000100D8438>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010111208>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x000000001013B898>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010163F28>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000101945F8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010194630>]],
dtype=object)
ri.hist('driver_age',by='violation',sharex=True)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000102C6C50>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000000103243C8>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010346908>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010370E80>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000103A1438>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000000103A1470>]],
dtype=object)
ri.hist('driver_age',by='violation',sharex=True,sharey=True)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000104C4F98>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000000001059D358>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000105C0748>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000000105E9B38>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010613F28>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010613F60>]],
dtype=object)
ri.head()
| | stop_date | stop_time | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | frisk | stop_datetime | stop_time_datetime | stop_minutes | new_age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-02 | 01:55 | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-01-02 | 2019-04-05 01:55:00 | 8.0 | 20.0 |
1 | 2005-01-18 | 08:15 | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-01-18 | 2019-04-05 08:15:00 | 8.0 | 40.0 |
2 | 2005-01-23 | 23:15 | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-01-23 | 2019-04-05 23:15:00 | 8.0 | 33.0 |
3 | 2005-02-20 | 17:15 | M | 1986.0 | 19.0 | White | Call for Service | Other | False | NaN | Arrest Driver | True | 16-30 Min | False | NaN | 2005-02-20 | 2019-04-05 17:15:00 | 23.0 | 19.0 |
4 | 2005-03-14 | 10:00 | F | 1984.0 | 21.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-03-14 | 2019-04-05 10:00:00 | 8.0 | 21.0 |
ri.tail()
| | stop_date | stop_time | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | frisk | stop_datetime | stop_time_datetime | stop_minutes | new_age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
91736 | 2015-12-31 | 20:27 | M | 1986.0 | 29.0 | White | Speeding | Speeding | False | NaN | Warning | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 20:27:00 | 8.0 | 29.0 |
91737 | 2015-12-31 | 20:35 | F | 1982.0 | 33.0 | White | Equipment/Inspection Violation | Equipment | False | NaN | Warning | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 20:35:00 | 8.0 | 33.0 |
91738 | 2015-12-31 | 20:45 | M | 1992.0 | 23.0 | White | Other Traffic Violation | Moving violation | False | NaN | Warning | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 20:45:00 | 8.0 | 23.0 |
91739 | 2015-12-31 | 21:42 | M | 1993.0 | 22.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 21:42:00 | 8.0 | 22.0 |
91740 | 2015-12-31 | 22:46 | M | 1959.0 | 56.0 | Hispanic | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 22:46:00 | 8.0 | 56.0 |
ri['new_age']=ri.stop_datetime.dt.year-ri.driver_age_raw
ri[['driver_age','new_age']].hist()
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000107FE7F0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000000001083C2E8>]],
dtype=object)
ri[['driver_age','new_age']].describe()
| | driver_age | new_age |
---|---|---|
count | 86120.000000 | 86414.000000 |
mean | 34.011333 | 39.784294 |
std | 12.738564 | 110.822145 |
min | 15.000000 | -6794.000000 |
25% | 23.000000 | 24.000000 |
50% | 31.000000 | 31.000000 |
75% | 43.000000 | 43.000000 |
max | 99.000000 | 2015.000000 |
ri[(ri.new_age<15)|(ri.new_age>99)].shape
(294, 19)
ri.driver_age_raw.isnull().sum()
5327
ri.driver_age.isnull().sum()
5621
5621-5327
294
ri[(ri.driver_age_raw.notnull())&(ri.driver_age.isnull())].head()
| | stop_date | stop_time | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | frisk | stop_datetime | stop_time_datetime | stop_minutes | new_age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
146 | 2005-10-05 | 08:50 | M | 0.0 | NaN | White | Other Traffic Violation | Moving violation | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-10-05 | 2019-04-05 08:50:00 | 8.0 | 2005.0 |
281 | 2005-10-10 | 12:05 | F | 0.0 | NaN | White | Other Traffic Violation | Moving violation | False | NaN | Warning | False | 0-15 Min | False | NaN | 2005-10-10 | 2019-04-05 12:05:00 | 8.0 | 2005.0 |
331 | 2005-10-12 | 07:50 | M | 0.0 | NaN | White | Motorist Assist/Courtesy | Other | False | NaN | No Action | False | 0-15 Min | False | NaN | 2005-10-12 | 2019-04-05 07:50:00 | 8.0 | 2005.0 |
414 | 2005-10-17 | 08:32 | M | 2005.0 | NaN | White | Other Traffic Violation | Moving violation | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-10-17 | 2019-04-05 08:32:00 | 8.0 | 0.0 |
455 | 2005-10-18 | 18:30 | F | 0.0 | NaN | White | Speeding | Speeding | False | NaN | Warning | False | 0-15 Min | False | NaN | 2005-10-18 | 2019-04-05 18:30:00 | 8.0 | 2005.0 |
ri.loc[(ri.new_age<15)|(ri.new_age>99),'new_age']=np.nan
ri.new_age.equals(ri.driver_age)
True
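The age check above can be sketched end to end on a few made-up rows: recompute the age from the stop year and the recorded birth year, blank out implausible values, and compare with the cleaned column. The frame `toy` and the name `implausible` are hypothetical; the 15 to 99 range is the one used in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up stops, for illustration only
toy = pd.DataFrame({
    'stop_date': ['2005-01-02', '2005-10-05', '2005-10-17', '2015-12-31'],
    'driver_age_raw': [1985.0, 0.0, 2005.0, 1959.0],  # birth year as recorded
    'driver_age': [20.0, np.nan, np.nan, 56.0],       # cleaned age column
})

# Age recomputed as stop year minus recorded birth year
stop_year = pd.to_datetime(toy.stop_date).dt.year
toy['new_age'] = stop_year - toy.driver_age_raw

# Ages outside 15..99 are treated as data-entry errors and set to missing
implausible = (toy.new_age < 15) | (toy.new_age > 99)
toy.loc[implausible, 'new_age'] = np.nan

print(toy.new_age.tolist())                # [20.0, nan, nan, 56.0]
print(toy.new_age.equals(toy.driver_age))  # True
```

`equals` treats missing values in the same positions as equal, which is what makes it suitable for this kind of column-to-column check.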
相关软件 JAVA JDKAndroid StudioHAXM JDK的安装和Java环境变量的设置 1.JDK下载地址: http://www.oracle.com/technetwork/j ...