pandas intensive practice
This article covers the same material more thoroughly: http://wittyfans.com/coding/%E5%88%A9%E7%94%A8Pandas%E5%88%86%E6%9E%90%E7%BE%8E%E5%9B%BD%E4%BA%A4%E8%AD%A6%E5%BC%80%E6%94%BE%E7%9A%84%E6%90%9C%E6%9F%A5%E6%95%B0%E6%8D%AE.html
import pandas as pd
import matplotlib.pyplot as plt
# This magic is required to draw plots inline in the notebook
%matplotlib inline
# The downloaded Rhode Island police stop data; ri stands for Rhode Island
ri = pd.read_csv('police.csv')
ri.head()
| | stop_date | stop_time | county_name | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-02 | 01:55 | NaN | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
1 | 2005-01-18 | 08:15 | NaN | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
2 | 2005-01-23 | 23:15 | NaN | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
3 | 2005-02-20 | 17:15 | NaN | M | 1986.0 | 19.0 | White | Call for Service | Other | False | NaN | Arrest Driver | True | 16-30 Min | False |
4 | 2005-03-14 | 10:00 | NaN | F | 1984.0 | 21.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
ri.shape
(91741, 15)
ri.isnull().sum()
stop_date 0
stop_time 0
county_name 91741
driver_gender 5335
driver_age_raw 5327
driver_age 5621
driver_race 5333
violation_raw 5333
violation 5333
search_conducted 0
search_type 88545
stop_outcome 5333
is_arrested 5333
stop_duration 5333
drugs_related_stop 0
dtype: int64
Removing a column
ri.head()
| | stop_date | stop_time | county_name | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-02 | 01:55 | NaN | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
1 | 2005-01-18 | 08:15 | NaN | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
2 | 2005-01-23 | 23:15 | NaN | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
3 | 2005-02-20 | 17:15 | NaN | M | 1986.0 | 19.0 | White | Call for Service | Other | False | NaN | Arrest Driver | True | 16-30 Min | False |
4 | 2005-03-14 | 10:00 | NaN | F | 1984.0 | 21.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
# Equivalent to ri.drop('county_name', axis=1, inplace=True)
# Drop the county_name column, which contains only missing values
ri.drop('county_name', axis='columns', inplace=True)
ri.shape
(91741, 14)
ri.columns
Index(['stop_date', 'stop_time', 'driver_gender', 'driver_age_raw',
'driver_age', 'driver_race', 'violation_raw', 'violation',
'search_conducted', 'search_type', 'stop_outcome', 'is_arrested',
'stop_duration', 'drugs_related_stop'],
dtype='object')
# Drop every column whose values are all missing; this returns a new DataFrame and leaves ri unchanged
ri.dropna(axis='columns',how='all').shape
(91741, 14)
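The difference between `drop` and `dropna` can be seen on a small frame. This is a minimal sketch on made-up data (the frame `df` and its column names are hypothetical, not part of the police dataset): `drop` removes a column by label, while `dropna(axis='columns', how=...)` removes columns based on their missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'empty' is entirely missing, 'partial' only partly
df = pd.DataFrame({
    "a": [1, 2, 3],
    "empty": [np.nan, np.nan, np.nan],
    "partial": [1.0, np.nan, 3.0],
})

by_name = df.drop("empty", axis="columns")        # remove a column by its label
all_null = df.dropna(axis="columns", how="all")   # remove columns that are entirely NaN
any_null = df.dropna(axis="columns", how="any")   # remove columns with at least one NaN

print(list(by_name.columns))
print(list(all_null.columns))
print(list(any_null.columns))
```

None of these calls modifies `df`; each returns a new DataFrame unless `inplace=True` is passed.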
Filtering in pandas
Keep only the rows where a boolean condition is True; here we keep the rows whose violation is 'Speeding'
ri[ri.violation=='Speeding'].head()
| | stop_date | stop_time | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-02 | 01:55 | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
1 | 2005-01-18 | 08:15 | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
2 | 2005-01-23 | 23:15 | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
4 | 2005-03-14 | 10:00 | F | 1984.0 | 21.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
6 | 2005-04-01 | 17:30 | M | 1969.0 | 36.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
value_counts
## How many male and female drivers were stopped for speeding
print(ri[ri.violation=='Speeding'].driver_gender.value_counts())
M 32979
F 15482
Name: driver_gender, dtype: int64
# Share of men and women among speeding stops; normalize=True returns proportions instead of counts
print(ri[ri.violation=='Speeding'].driver_gender.value_counts(normalize=True))
M 0.680527
F 0.319473
Name: driver_gender, dtype: float64
ri.loc[ri.violation=='Speeding','driver_gender'].value_counts(normalize=True)
M 0.680527
F 0.319473
Name: driver_gender, dtype: float64
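The two spellings above, `ri[mask].col` and `ri.loc[mask, 'col']`, select the same values. A small sketch on invented data (the `stops` frame below is hypothetical) shows the boolean mask, the counts, and the normalized shares side by side.

```python
import pandas as pd

# Hypothetical miniature of the stops table
stops = pd.DataFrame({
    "violation": ["Speeding", "Speeding", "Equipment", "Speeding", "Other"],
    "driver_gender": ["M", "F", "M", "M", "F"],
})

mask = stops["violation"] == "Speeding"          # boolean Series, one value per row
counts = stops.loc[mask, "driver_gender"].value_counts()
shares = stops.loc[mask, "driver_gender"].value_counts(normalize=True)
counts_alt = stops[mask].driver_gender.value_counts()  # same selection, other spelling

print(counts)
print(shares)
```

The `.loc` form is generally the safer habit because the same syntax also works for assignment.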
# Among male drivers, the share of each violation type
ri[ri.driver_gender == 'M'].violation.value_counts(normalize=True)
Speeding 0.524350
Moving violation 0.207012
Equipment 0.135671
Other 0.057668
Registration/plates 0.038461
Seat belt 0.036839
Name: violation, dtype: float64
# Among female drivers, the share of each violation type
ri[ri.driver_gender=='F'].violation.value_counts(normalize=True)
Speeding 0.658500
Moving violation 0.136277
Equipment 0.105780
Registration/plates 0.043086
Other 0.029348
Seat belt 0.027009
Name: violation, dtype: float64
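The groupby method
Look at the share of each violation value within each driver_gender
# This produces both of the results above in one call, which makes them easy to compare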
ri.groupby('driver_gender').violation.value_counts(normalize=True)
driver_gender violation
F Speeding 0.658500
Moving violation 0.136277
Equipment 0.105780
Registration/plates 0.043086
Other 0.029348
Seat belt 0.027009
M Speeding 0.524350
Moving violation 0.207012
Equipment 0.135671
Other 0.057668
Registration/plates 0.038461
Seat belt 0.036839
Name: violation, dtype: float64
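As a rough illustration of what `groupby(...).value_counts(normalize=True)` returns, the sketch below uses a tiny invented frame (`stops` is hypothetical). The shares are normalized within each group, and `unstack` turns the result into a gender-by-violation table that is easier to compare.

```python
import pandas as pd

# Hypothetical miniature of the stops table
stops = pd.DataFrame({
    "driver_gender": ["F", "F", "F", "F", "M", "M"],
    "violation": ["Speeding", "Speeding", "Equipment", "Other", "Speeding", "Equipment"],
})

# Shares are computed within each gender, so every group sums to 1
shares = stops.groupby("driver_gender")["violation"].value_counts(normalize=True)
print(shares)

# Pivot the inner index level into columns; combinations that never occur become 0
table = shares.unstack(fill_value=0)
print(table)
```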
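The mean method
For a boolean column, mean() directly gives the proportion of True values
# True means a search was conducted, False means no search was conducted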
print(ri.search_conducted.value_counts(normalize=True))
False 0.965163
True 0.034837
Name: search_conducted, dtype: float64
# Here mean() gives the proportion of True values
print(ri.search_conducted.mean())
0.03483720473942948
Group by gender and compare the search rates
ri.groupby('driver_gender').search_conducted.mean()
driver_gender
F 0.020033
M 0.043326
Name: search_conducted, dtype: float64
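Men are searched at a higher rate than women
Now group on two keys to compare male and female search rates within each violation type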
ri.groupby(['violation','driver_gender']).search_conducted.mean()
violation driver_gender
Equipment F 0.042622
M 0.070081
Moving violation F 0.036205
M 0.059831
Other F 0.056522
M 0.047146
Registration/plates F 0.066140
M 0.110376
Seat belt F 0.012598
M 0.037980
Speeding F 0.008720
M 0.024925
Name: search_conducted, dtype: float64
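The reason `mean()` works as a proportion is that `True` counts as 1 and `False` as 0. A minimal sketch on made-up data (the `stops` frame is hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the stops table
stops = pd.DataFrame({
    "driver_gender": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "search_conducted": [True, False, True, False, False, False, False, True],
})

# mean() of a boolean column = number of True values / number of rows
overall = stops["search_conducted"].mean()
by_hand = stops["search_conducted"].sum() / len(stops)

# The same idea per group: the search rate for each gender
by_gender = stops.groupby("driver_gender")["search_conducted"].mean()

print(overall, by_hand)
print(by_gender)
```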
ri.isnull().sum()
stop_date 0
stop_time 0
driver_gender 5335
driver_age_raw 5327
driver_age 5621
driver_race 5333
violation_raw 5333
violation 5333
search_conducted 0
search_type 88545
stop_outcome 5333
is_arrested 5333
stop_duration 5333
drugs_related_stop 0
dtype: int64
# Is search_type missing exactly when search_conducted is False?
ri.search_conducted.value_counts()
False 88545
True 3196
Name: search_conducted, dtype: int64
The False count (88545) equals the number of missing search_type values shown above
Verify it another way
ri[ri.search_conducted==False].search_type.value_counts()
Series([], Name: search_type, dtype: int64)
# value_counts() ignores missing values (NaN) by default
ri[ri.search_conducted==False].search_type.value_counts(dropna=False)
NaN 88545
Name: search_type, dtype: int64
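The effect of `dropna=False` is easy to check on a short Series. This sketch uses invented values (the Series `s` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical column with three missing entries
s = pd.Series(["Frisk", np.nan, np.nan, "Inventory", np.nan])

default = s.value_counts()               # NaN is left out of the tally
with_nan = s.value_counts(dropna=False)  # NaN gets its own row

print(default)
print(with_nan)
print(s.isnull().sum())                  # number of missing entries
```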
# When search_conducted is True, search_type is never missing
ri[ri.search_conducted==True].search_type.value_counts(dropna=False)
Incident to Arrest 1219
Probable Cause 891
Inventory 220
Reasonable Suspicion 197
Protective Frisk 161
Incident to Arrest,Inventory 129
Incident to Arrest,Probable Cause 106
Probable Cause,Reasonable Suspicion 75
Incident to Arrest,Inventory,Probable Cause 34
Incident to Arrest,Protective Frisk 33
Probable Cause,Protective Frisk 33
Inventory,Probable Cause 22
Incident to Arrest,Reasonable Suspicion 13
Incident to Arrest,Inventory,Protective Frisk 11
Protective Frisk,Reasonable Suspicion 11
Inventory,Protective Frisk 11
Incident to Arrest,Probable Cause,Protective Frisk 10
Incident to Arrest,Probable Cause,Reasonable Suspicion 6
Incident to Arrest,Inventory,Reasonable Suspicion 4
Inventory,Reasonable Suspicion 4
Inventory,Probable Cause,Protective Frisk 2
Inventory,Probable Cause,Reasonable Suspicion 2
Incident to Arrest,Protective Frisk,Reasonable Suspicion 1
Probable Cause,Protective Frisk,Reasonable Suspicion 1
Name: search_type, dtype: int64
ri[ri.search_conducted==True].search_type.isnull().sum()
0
Look at the search types
ri.search_type.value_counts(dropna=False)
NaN 88545
Incident to Arrest 1219
Probable Cause 891
Inventory 220
Reasonable Suspicion 197
Protective Frisk 161
Incident to Arrest,Inventory 129
Incident to Arrest,Probable Cause 106
Probable Cause,Reasonable Suspicion 75
Incident to Arrest,Inventory,Probable Cause 34
Incident to Arrest,Protective Frisk 33
Probable Cause,Protective Frisk 33
Inventory,Probable Cause 22
Incident to Arrest,Reasonable Suspicion 13
Inventory,Protective Frisk 11
Incident to Arrest,Inventory,Protective Frisk 11
Protective Frisk,Reasonable Suspicion 11
Incident to Arrest,Probable Cause,Protective Frisk 10
Incident to Arrest,Probable Cause,Reasonable Suspicion 6
Incident to Arrest,Inventory,Reasonable Suspicion 4
Inventory,Reasonable Suspicion 4
Inventory,Probable Cause,Reasonable Suspicion 2
Inventory,Probable Cause,Protective Frisk 2
Incident to Arrest,Protective Frisk,Reasonable Suspicion 1
Probable Cause,Protective Frisk,Reasonable Suspicion 1
Name: search_type, dtype: int64
ri['frisk']=ri.search_type=='Protective Frisk'
ri.frisk.dtype
dtype('bool')
ri.frisk.sum()
161
ri.frisk.mean()
0.0017549405391264537
ri.frisk.value_counts()
False 91580
True 161
Name: frisk, dtype: int64
161/(91580+161)
0.0017549405391264537
String operations
# The step above assigned the result of ri.search_type=='Protective Frisk' (an exact match) to the ri['frisk'] column
# Now use a substring match instead
ri['frisk']=ri.search_type.str.contains('Protective Frisk')
ri.frisk.sum()
274
ri.frisk.mean()
0.08573216520650813
# mean() gave the share of matching rows; value_counts() shows the matching and non-matching counts
ri.frisk.value_counts()
False 2922
True 274
Name: frisk, dtype: int64
# Check that the manual calculation agrees with the result of mean()
274/(2922+274)
0.08573216520650813
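The section above covers string matching
Use the right matching method to compute the proportion: an exact comparison misses the searches that list several types
pandas ignores missing values in these calculations, so frisk.mean() here is the share among the 3196 stops where a search was conducted, not among all 91741 stops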
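The jump from 161 to 274 matches, and the change in the denominator, can be reproduced on a toy Series. This is a sketch under stated assumptions: the values are invented, and the Series is built with `dtype=object` so that `str.contains` leaves missing entries as NaN; the `na=False` argument is the documented way to count them as non-matches instead.

```python
import numpy as np
import pandas as pd

# Hypothetical search_type values; NaN means no search was conducted
search_type = pd.Series(
    ["Protective Frisk", "Inventory,Protective Frisk", "Probable Cause", np.nan, np.nan],
    dtype=object,
)

exact = search_type == "Protective Frisk"                 # NaN compares as False
partial = search_type.str.contains("Protective Frisk")    # NaN stays NaN
partial_all = search_type.str.contains("Protective Frisk", na=False)  # NaN -> False

# Keep only the rows with a real answer, as NaN-skipping statistics do
searched_only = partial.dropna().astype(bool)

print(exact.sum(), exact.mean())                  # exact matches over all rows
print(searched_only.sum(), searched_only.mean())  # substring matches over searched rows
print(partial_all.mean())                         # substring matches over all rows
```

Which denominator is right depends on the question: "share of searches that included a frisk" uses the searched rows only, while "share of all stops that included a frisk" needs `na=False`.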
# Which year has the fewest records?
ri.stop_date.str.slice(0,4).value_counts()
2012 10970
2006 10639
2007 9476
2014 9228
2008 8752
2015 8599
2011 8126
2013 7924
2009 7908
2010 7561
2005 2558
Name: stop_date, dtype: int64
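# Convert ri.stop_date to a datetime Series and store it in a new stop_datetime column
ri['stop_datetime'] = pd.to_datetime(ri.stop_date)
# Datetime Series have a .dt accessor, similar to the .str accessor used above
# After .dt you can use attributes such as year and month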
ri.stop_datetime.dt.year.value_counts()
2012 10970
2006 10639
2007 9476
2014 9228
2008 8752
2015 8599
2011 8126
2013 7924
2009 7908
2010 7561
2005 2558
Name: stop_datetime, dtype: int64
ri.stop_datetime.dt.month.value_counts()
1 8479
5 7935
11 7877
10 7745
3 7742
6 7630
8 7615
7 7568
4 7529
9 7427
12 7152
2 7042
Name: stop_datetime, dtype: int64
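Both routes to the year, string slicing and the `.dt` accessor, can be compared on a few invented dates (the Series `stop_date` below is hypothetical). The main practical difference is the result type: slicing yields strings, `.dt.year` yields integers.

```python
import pandas as pd

# Hypothetical date strings in the same YYYY-MM-DD layout as the dataset
stop_date = pd.Series(["2005-01-02", "2005-03-14", "2006-07-01"])

year_as_text = stop_date.str.slice(0, 4)    # first four characters, still strings
stop_datetime = pd.to_datetime(stop_date)   # real datetime values
year = stop_datetime.dt.year                # integers
month = stop_datetime.dt.month

print(year_as_text.tolist())
print(year.tolist())
print(month.tolist())
```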
# Drug-related stops
ri.drugs_related_stop.dtype
dtype('bool')
# Baseline rate
ri.drugs_related_stop.mean()
0.008883705213590434
# You cannot group by hour until the hour is available as a value
# Convert the stop_time column to datetime, then group by its hour component
ri['stop_time_datetime']=pd.to_datetime(ri.stop_time)
ri.groupby(ri.stop_time_datetime.dt.hour).drugs_related_stop.mean()
stop_time_datetime
0 0.019728
1 0.013507
2 0.015462
3 0.017065
4 0.011811
5 0.004762
6 0.003040
7 0.003281
8 0.002687
9 0.006288
10 0.005714
11 0.006976
12 0.004467
13 0.010326
14 0.007810
15 0.006416
16 0.005723
17 0.005517
18 0.010148
19 0.011596
20 0.008084
21 0.013342
22 0.013533
23 0.016344
Name: drugs_related_stop, dtype: float64
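Grouping by an hour derived from a time string can be sketched as follows. The data is invented (`stops` is hypothetical), and the explicit `format="%H:%M"` is an assumption about the time layout; with a format given, pandas fills in a placeholder date instead of today's date, which does not matter because only the hour is used.

```python
import pandas as pd

# Hypothetical miniature of the stops table
stops = pd.DataFrame({
    "stop_time": ["00:10", "00:45", "08:15", "08:50", "23:15"],
    "drugs_related_stop": [True, False, False, False, True],
})

# Parse HH:MM strings and keep only the hour component
hour = pd.to_datetime(stops["stop_time"], format="%H:%M").dt.hour

# A Series can be passed to groupby directly; no extra column is needed
rate_by_hour = stops.groupby(hour)["drugs_related_stop"].mean()
print(rate_by_hour)
```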
# Plot of the drug-related stop rate by hour
ri.groupby(ri.stop_time_datetime.dt.hour).drugs_related_stop.mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x9d72d30>
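# Plot of the number of stops per hour (all stops, not only drug-related ones); value_counts() sorts by count, so the x-axis is not in hour order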
ri.stop_time_datetime.dt.hour.value_counts().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x5460710>
# The same counts sorted by hour, so the line runs from hour 0 to hour 23
ri.stop_time_datetime.dt.hour.value_counts().sort_index().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x5420860>
ri.groupby(ri.stop_time_datetime.dt.hour).stop_date.count().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x557c2e8>
# Mark useless values as missing
ri.stop_duration.value_counts()
0-15 Min 69543
16-30 Min 13635
30+ Min 3228
1 1
2 1
Name: stop_duration, dtype: int64
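# This chained assignment does not work: it sets the value on a temporary copy, so ri is left unchanged (see the warning below)
ri[(ri.stop_duration=='1')|(ri.stop_duration=='2')].stop_duration='NaN'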
C:\Anaconda3\lib\site-packages\pandas\core\generic.py:4401: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self[name] = value
ri.stop_duration.value_counts()
0-15 Min 69543
16-30 Min 13635
30+ Min 3228
1 1
2 1
Name: stop_duration, dtype: int64
# Assigning through .loc does modify ri, but 'NaN' here is an ordinary string, not a missing value
ri.loc[(ri.stop_duration=='1')|(ri.stop_duration=='2'),'stop_duration']='NaN'
ri.stop_duration.value_counts(dropna=False)
0-15 Min 69543
16-30 Min 13635
NaN 5333
30+ Min 3228
NaN 2
Name: stop_duration, dtype: int64
# Replace the string 'NaN' with a real missing value, np.nan
import numpy as np
ri.loc[ri.stop_duration == 'NaN', 'stop_duration'] = np.nan
ri.stop_duration.value_counts(dropna=False)
0-15 Min 69543
16-30 Min 13635
NaN 5335
30+ Min 3228
Name: stop_duration, dtype: int64
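The steps above can be condensed into one `.loc` assignment. The sketch below uses invented data (`df` is hypothetical) and also shows why the string `'NaN'` had to be replaced: pandas does not treat it as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical duration column with two junk codes and one real gap
df = pd.DataFrame(
    {"stop_duration": ["0-15 Min", "1", "16-30 Min", "2", None]},
    dtype=object,
)

# Select rows and column in a single .loc call so the assignment hits df itself
bad = df["stop_duration"].isin(["1", "2"])
df.loc[bad, "stop_duration"] = np.nan

print(df["stop_duration"].value_counts(dropna=False))

# The text 'NaN' is just a string; isnull() does not flag it
text_nan = pd.Series(["NaN", "0-15 Min"], dtype=object)
print(text_nan.isnull().sum())
```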
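# The same cleanup in a single step with replace(); assign the result back instead of relying on inplace=True
ri['stop_duration'] = ri.stop_duration.replace(['1', '2'], np.nan)
# Estimate a duration in minutes for each stop_duration category
# Series.map accepts a function or a dict-like object that defines the mapping
# It applies the mapping to every value in a column; here it is used for bulk replacement
mapping = {'0-15 Min': 8, '16-30 Min': 23, '30+ Min': 45}
# map() does not modify the original data in place, so store the result in a new column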
ri['stop_minutes'] = ri.stop_duration.map(mapping)
# Check the mapped value for each category
ri.stop_minutes.value_counts()
8.0 69543
23.0 13635
45.0 3228
Name: stop_minutes, dtype: int64
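A short sketch of `Series.map` with a dict, on invented values (the Series `duration` is hypothetical). It shows two details worth knowing: the original Series is untouched, and any value without a key in the dict, including NaN, comes out as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical duration categories with one missing entry
duration = pd.Series(["0-15 Min", "16-30 Min", "30+ Min", np.nan, "0-15 Min"])
mapping = {"0-15 Min": 8, "16-30 Min": 23, "30+ Min": 45}

minutes = duration.map(mapping)   # returns a new Series; duration is unchanged

print(minutes.tolist())
print(minutes.mean())             # NaN is skipped when averaging
```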
ri.groupby('violation_raw').stop_minutes.mean()
violation_raw
APB 20.987342
Call for Service 22.034669
Equipment/Inspection Violation 11.460345
Motorist Assist/Courtesy 16.916256
Other Traffic Violation 13.900265
Registration Violation 13.745629
Seatbelt Violation 9.741531
Special Detail/Directed Patrol 15.061100
Speeding 10.577690
Suspicious Person 18.750000
Violation of City/Town Ordinance 13.388626
Warrant 21.400000
Name: stop_minutes, dtype: float64
# agg applies one or more aggregations, such as mean and count, to each group
# agg used to work only on groupby results; it now also works on DataFrame and Series objects
ri.groupby('violation_raw').stop_minutes.agg(['mean','count'])
violation_raw | mean | count |
---|---|---|
APB | 20.987342 | 79 |
Call for Service | 22.034669 | 1298 |
Equipment/Inspection Violation | 11.460345 | 11020 |
Motorist Assist/Courtesy | 16.916256 | 203 |
Other Traffic Violation | 13.900265 | 16223 |
Registration Violation | 13.745629 | 3432 |
Seatbelt Violation | 9.741531 | 2952 |
Special Detail/Directed Patrol | 15.061100 | 2455 |
Speeding | 10.577690 | 48462 |
Suspicious Person | 18.750000 | 56 |
Violation of City/Town Ordinance | 13.388626 | 211 |
Warrant | 21.400000 | 15 |
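Reporting the count next to the mean matters because some groups are tiny (Warrant has only 15 rows above). A minimal sketch of `agg` on invented data (`df` is hypothetical), both on a groupby and on a plain Series:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature with one missing duration
df = pd.DataFrame({
    "violation_raw": ["Speeding", "Speeding", "Warrant", "Speeding", "Warrant"],
    "stop_minutes": [8.0, 23.0, 45.0, np.nan, 23.0],
})

# Several aggregations at once: one column per function, one row per group
summary = df.groupby("violation_raw")["stop_minutes"].agg(["mean", "count"])
print(summary)

# agg also works without groupby, directly on a Series
overall = df["stop_minutes"].agg(["mean", "count"])
print(overall)
```

Note that `count` only counts non-missing values, so it is also the denominator that `mean` actually used.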
plot() draws a line chart by default
ri.groupby('violation_raw').stop_minutes.mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x10873ef0>
# Switch to a bar chart
ri.groupby('violation_raw').stop_minutes.mean().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x1092eb38>
ri.groupby('violation_raw').stop_minutes.mean().plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x10a4a5f8>
ri.groupby('violation').driver_age.describe()
violation | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
Equipment | 11007.0 | 31.781503 | 11.400900 | 16.0 | 23.0 | 28.0 | 38.0 | 89.0 |
Moving violation | 16164.0 | 36.120020 | 13.185805 | 15.0 | 25.0 | 33.0 | 46.0 | 99.0 |
Other | 4204.0 | 39.536870 | 13.034639 | 16.0 | 28.0 | 39.0 | 49.0 | 87.0 |
Registration/plates | 3427.0 | 32.803035 | 11.033675 | 16.0 | 24.0 | 30.0 | 40.0 | 74.0 |
Seat belt | 2952.0 | 32.206301 | 11.213122 | 17.0 | 24.0 | 29.0 | 38.0 | 77.0 |
Speeding | 48361.0 | 33.530097 | 12.821847 | 15.0 | 23.0 | 30.0 | 42.0 | 90.0 |
ri.driver_age.plot(kind='hist')
<matplotlib.axes._subplots.AxesSubplot at 0x1003a518>
ri.driver_age.value_counts().sort_index().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x10088080>
ri.hist('driver_age', by='violation')
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000100D8438>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010111208>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x000000001013B898>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010163F28>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000101945F8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010194630>]],
dtype=object)
ri.hist('driver_age',by='violation',sharex=True)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000102C6C50>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000000103243C8>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010346908>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010370E80>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000103A1438>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000000103A1470>]],
dtype=object)
ri.hist('driver_age',by='violation',sharex=True,sharey=True)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000104C4F98>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000000001059D358>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000105C0748>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000000105E9B38>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010613F28>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010613F60>]],
dtype=object)
ri.head()
| | stop_date | stop_time | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | frisk | stop_datetime | stop_time_datetime | stop_minutes | new_age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-02 | 01:55 | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-01-02 | 2019-04-05 01:55:00 | 8.0 | 20.0 |
1 | 2005-01-18 | 08:15 | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-01-18 | 2019-04-05 08:15:00 | 8.0 | 40.0 |
2 | 2005-01-23 | 23:15 | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-01-23 | 2019-04-05 23:15:00 | 8.0 | 33.0 |
3 | 2005-02-20 | 17:15 | M | 1986.0 | 19.0 | White | Call for Service | Other | False | NaN | Arrest Driver | True | 16-30 Min | False | NaN | 2005-02-20 | 2019-04-05 17:15:00 | 23.0 | 19.0 |
4 | 2005-03-14 | 10:00 | F | 1984.0 | 21.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-03-14 | 2019-04-05 10:00:00 | 8.0 | 21.0 |
ri.tail()
| | stop_date | stop_time | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | frisk | stop_datetime | stop_time_datetime | stop_minutes | new_age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
91736 | 2015-12-31 | 20:27 | M | 1986.0 | 29.0 | White | Speeding | Speeding | False | NaN | Warning | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 20:27:00 | 8.0 | 29.0 |
91737 | 2015-12-31 | 20:35 | F | 1982.0 | 33.0 | White | Equipment/Inspection Violation | Equipment | False | NaN | Warning | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 20:35:00 | 8.0 | 33.0 |
91738 | 2015-12-31 | 20:45 | M | 1992.0 | 23.0 | White | Other Traffic Violation | Moving violation | False | NaN | Warning | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 20:45:00 | 8.0 | 23.0 |
91739 | 2015-12-31 | 21:42 | M | 1993.0 | 22.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 21:42:00 | 8.0 | 22.0 |
91740 | 2015-12-31 | 22:46 | M | 1959.0 | 56.0 | Hispanic | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 22:46:00 | 8.0 | 56.0 |
ri['new_age']=ri.stop_datetime.dt.year-ri.driver_age_raw
ri[['driver_age','new_age']].hist()
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000107FE7F0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000000001083C2E8>]],
dtype=object)
ri[['driver_age','new_age']].describe()
| | driver_age | new_age |
---|---|---|
count | 86120.000000 | 86414.000000 |
mean | 34.011333 | 39.784294 |
std | 12.738564 | 110.822145 |
min | 15.000000 | -6794.000000 |
25% | 23.000000 | 24.000000 |
50% | 31.000000 | 31.000000 |
75% | 43.000000 | 43.000000 |
max | 99.000000 | 2015.000000 |
ri[(ri.new_age<15)|(ri.new_age>99)].shape
(294, 19)
ri.driver_age_raw.isnull().sum()
5327
ri.driver_age.isnull().sum()
5621
5621-5327
294
ri[(ri.driver_age_raw.notnull())&(ri.driver_age.isnull())].head()
| | stop_date | stop_time | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | frisk | stop_datetime | stop_time_datetime | stop_minutes | new_age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
146 | 2005-10-05 | 08:50 | M | 0.0 | NaN | White | Other Traffic Violation | Moving violation | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-10-05 | 2019-04-05 08:50:00 | 8.0 | 2005.0 |
281 | 2005-10-10 | 12:05 | F | 0.0 | NaN | White | Other Traffic Violation | Moving violation | False | NaN | Warning | False | 0-15 Min | False | NaN | 2005-10-10 | 2019-04-05 12:05:00 | 8.0 | 2005.0 |
331 | 2005-10-12 | 07:50 | M | 0.0 | NaN | White | Motorist Assist/Courtesy | Other | False | NaN | No Action | False | 0-15 Min | False | NaN | 2005-10-12 | 2019-04-05 07:50:00 | 8.0 | 2005.0 |
414 | 2005-10-17 | 08:32 | M | 2005.0 | NaN | White | Other Traffic Violation | Moving violation | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-10-17 | 2019-04-05 08:32:00 | 8.0 | 0.0 |
455 | 2005-10-18 | 18:30 | F | 0.0 | NaN | White | Speeding | Speeding | False | NaN | Warning | False | 0-15 Min | False | NaN | 2005-10-18 | 2019-04-05 18:30:00 | 8.0 | 2005.0 |
ri.loc[(ri.new_age<15)|(ri.new_age>99),'new_age']=np.nan
ri.new_age.equals(ri.driver_age)
True
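The age reconstruction can be replayed on a few invented rows (`df` and `expected` are hypothetical). It assumes, as the output above suggests, that `driver_age_raw` holds a birth year and that ages outside 15 to 99 were treated as invalid. The final `equals` call returns True even though both Series contain NaN, because `equals` treats NaN in the same position as equal.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two implausible birth years (0 and 2005) and one valid one
df = pd.DataFrame({
    "stop_date": ["2005-10-05", "2005-10-17", "2010-03-01"],
    "driver_age_raw": [0.0, 2005.0, 1980.0],
})

# Age = year of the stop minus the recorded birth year
stop_year = pd.to_datetime(df["stop_date"]).dt.year
df["new_age"] = stop_year - df["driver_age_raw"]

# Blank out ages outside the plausible range
df.loc[(df["new_age"] < 15) | (df["new_age"] > 99), "new_age"] = np.nan
print(df["new_age"].tolist())

# equals() considers NaN == NaN when positions match, unlike the == operator
expected = pd.Series([np.nan, np.nan, 30.0], name="new_age")
print(df["new_age"].equals(expected))
```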