pandas intensive practice
This article covers the same material better: http://wittyfans.com/coding/%E5%88%A9%E7%94%A8Pandas%E5%88%86%E6%9E%90%E7%BE%8E%E5%9B%BD%E4%BA%A4%E8%AD%A6%E5%BC%80%E6%94%BE%E7%9A%84%E6%90%9C%E6%9F%A5%E6%95%B0%E6%8D%AE.html
import pandas as pd
import matplotlib.pyplot as plt
# This magic command is needed to draw plots inside the notebook
%matplotlib inline
# Police stop data downloaded for Rhode Island; ri stands for Rhode Island
ri = pd.read_csv('police.csv')
ri.head()
stop_date | stop_time | county_name | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-02 | 01:55 | NaN | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
1 | 2005-01-18 | 08:15 | NaN | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
2 | 2005-01-23 | 23:15 | NaN | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
3 | 2005-02-20 | 17:15 | NaN | M | 1986.0 | 19.0 | White | Call for Service | Other | False | NaN | Arrest Driver | True | 16-30 Min | False |
4 | 2005-03-14 | 10:00 | NaN | F | 1984.0 | 21.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
ri.shape
(91741, 15)
ri.isnull().sum()
stop_date 0
stop_time 0
county_name 91741
driver_gender 5335
driver_age_raw 5327
driver_age 5621
driver_race 5333
violation_raw 5333
violation 5333
search_conducted 0
search_type 88545
stop_outcome 5333
is_arrested 5333
stop_duration 5333
drugs_related_stop 0
dtype: int64
Removing a column
ri.head()
stop_date | stop_time | county_name | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-02 | 01:55 | NaN | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
1 | 2005-01-18 | 08:15 | NaN | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
2 | 2005-01-23 | 23:15 | NaN | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
3 | 2005-02-20 | 17:15 | NaN | M | 1986.0 | 19.0 | White | Call for Service | Other | False | NaN | Arrest Driver | True | 16-30 Min | False |
4 | 2005-03-14 | 10:00 | NaN | F | 1984.0 | 21.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
# Equivalent to ri.drop('county_name', axis=1, inplace=True)
# Drop county_name, the column that contains only missing values
ri.drop('county_name', axis='columns', inplace=True)
ri.shape
(91741, 14)
ri.columns
Index(['stop_date', 'stop_time', 'driver_gender', 'driver_age_raw',
'driver_age', 'driver_race', 'violation_raw', 'violation',
'search_conducted', 'search_type', 'stop_outcome', 'is_arrested',
'stop_duration', 'drugs_related_stop'],
dtype='object')
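# Drop every column whose values are all missing (county_name is already gone, so the shape does not change)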
ri.dropna(axis='columns',how='all').shape
(91741, 14)
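The difference between `drop` and the `dropna` options used above can be sketched on a toy frame. The frame `df` and its column names are made up for illustration and are not part of the police data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "b" is entirely missing, like county_name above.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],
    "b": [np.nan, np.nan, np.nan],
    "c": ["x", "y", "z"],
})

by_name = df.drop("b", axis="columns")               # remove a column by name
all_missing = df.dropna(axis="columns", how="all")   # remove columns that are all NaN
any_missing = df.dropna(axis="columns", how="any")   # remove columns with at least one NaN
rows_any = df.dropna(axis="index", how="any")        # remove rows with at least one NaN

print(list(by_name.columns))
print(list(all_missing.columns))
print(list(any_missing.columns))
print(len(rows_any))
```

Because every row has a missing value in `b`, dropping rows with `how="any"` leaves nothing, which is why the all-missing column should be removed first.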
Filtering in pandas
Keep the rows where a Boolean condition is True; here we keep the rows whose violation is 'Speeding'
ri[ri.violation=='Speeding'].head()
stop_date | stop_time | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-02 | 01:55 | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
1 | 2005-01-18 | 08:15 | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
2 | 2005-01-23 | 23:15 | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
4 | 2005-03-14 | 10:00 | F | 1984.0 | 21.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
6 | 2005-04-01 | 17:30 | M | 1969.0 | 36.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
value_counts
# How many men and how many women were stopped for speeding
print(ri[ri.violation=='Speeding'].driver_gender.value_counts())
M 32979
F 15482
Name: driver_gender, dtype: int64
# Share of men and women among speeding stops; normalize=True returns proportions instead of counts
print(ri[ri.violation=='Speeding'].driver_gender.value_counts(normalize=True))
M 0.680527
F 0.319473
Name: driver_gender, dtype: float64
ri.loc[ri.violation=='Speeding','driver_gender'].value_counts(normalize=True)
M 0.680527
F 0.319473
Name: driver_gender, dtype: float64
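The two filtering styles above (bracket filtering followed by attribute access, and `.loc` with a row mask plus a column name) give the same result. A minimal sketch on made-up data, where `df` and `mask` are hypothetical names:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the police data.
df = pd.DataFrame({
    "violation": ["Speeding", "Speeding", "Speeding", "Equipment", "Speeding"],
    "driver_gender": ["M", "F", "M", "M", "M"],
})

mask = df.violation == "Speeding"          # Boolean Series, True for speeding stops

counts = df[mask].driver_gender.value_counts()                       # raw counts
shares = df.loc[mask, "driver_gender"].value_counts(normalize=True)  # proportions

print(counts)
print(shares)
```

The `.loc` form selects rows and the column in a single step, which is also the form that is safe to use when assigning values.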
# Among male drivers, the share of each type of violation
ri[ri.driver_gender == 'M'].violation.value_counts(normalize=True)
Speeding 0.524350
Moving violation 0.207012
Equipment 0.135671
Other 0.057668
Registration/plates 0.038461
Seat belt 0.036839
Name: violation, dtype: float64
# Among female drivers, the share of each type of violation
ri[ri.driver_gender=='F'].violation.value_counts(normalize=True)
Speeding 0.658500
Moving violation 0.136277
Equipment 0.105780
Registration/plates 0.043086
Other 0.029348
Seat belt 0.027009
Name: violation, dtype: float64
The groupby method
Look at the share of each violation value within each driver_gender
# Compare with the two results above
ri.groupby('driver_gender').violation.value_counts(normalize=True)
driver_gender violation
F Speeding 0.658500
Moving violation 0.136277
Equipment 0.105780
Registration/plates 0.043086
Other 0.029348
Seat belt 0.027009
M Speeding 0.524350
Moving violation 0.207012
Equipment 0.135671
Other 0.057668
Registration/plates 0.038461
Seat belt 0.036839
Name: violation, dtype: float64
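To compare the two genders side by side, the grouped result can be pivoted with `unstack`. This is a sketch on made-up data; `df`, `shares` and `table` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "driver_gender": ["F", "F", "M", "M", "M", "M"],
    "violation": ["Speeding", "Equipment", "Speeding", "Speeding", "Equipment", "Other"],
})

# Proportions are computed within each gender, so each group sums to 1.
shares = df.groupby("driver_gender").violation.value_counts(normalize=True)

# Move the violation level into columns; combinations that never occur become 0.
table = shares.unstack(fill_value=0)
print(table)
```

Each row of `table` is one gender and each column one violation, so differences are easier to read than in the long two-level Series.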
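The mean method
The mean of a Boolean column is the proportion of True values
# True means a search was conducted, False means no search was conducted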
print(ri.search_conducted.value_counts(normalize=True))
False 0.965163
True 0.034837
Name: search_conducted, dtype: float64
# Here mean() gives the proportion of True values directly
print(ri.search_conducted.mean())
0.03483720473942948
Group by gender and compare the search rates
ri.groupby('driver_gender').search_conducted.mean()
driver_gender
F 0.020033
M 0.043326
Name: search_conducted, dtype: float64
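Men are searched at a higher rate than women
Now look at the search rate by gender within each violation, using a multi-level grouping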
ri.groupby(['violation','driver_gender']).search_conducted.mean()
violation driver_gender
Equipment F 0.042622
M 0.070081
Moving violation F 0.036205
M 0.059831
Other F 0.056522
M 0.047146
Registration/plates F 0.066140
M 0.110376
Seat belt F 0.012598
M 0.037980
Speeding F 0.008720
M 0.024925
Name: search_conducted, dtype: float64
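Why `mean()` yields a rate: True counts as 1 and False as 0, so the mean is the number of True values divided by the number of rows. A small sketch with made-up data (`df` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: 3 searches in 8 stops.
df = pd.DataFrame({
    "driver_gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "search_conducted": [False, False, False, True, True, False, True, False],
})

overall = df.search_conducted.mean()                              # 3 / 8
by_gender = df.groupby("driver_gender").search_conducted.mean()   # rate within each gender

print(overall)
print(by_gender)
```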
ri.isnull().sum()
stop_date 0
stop_time 0
driver_gender 5335
driver_age_raw 5327
driver_age 5621
driver_race 5333
violation_raw 5333
violation 5333
search_conducted 0
search_type 88545
stop_outcome 5333
is_arrested 5333
stop_duration 5333
drugs_related_stop 0
dtype: int64
# Is search_type missing whenever search_conducted is False?
ri.search_conducted.value_counts()
False 88545
True 3196
Name: search_conducted, dtype: int64
The count of False (88545) equals the number of missing search_type values shown above
Verify it once more
ri[ri.search_conducted==False].search_type.value_counts()
Series([], Name: search_type, dtype: int64)
# value_counts() ignores missing values by default
ri[ri.search_conducted==False].search_type.value_counts(dropna=False)
NaN 88545
Name: search_type, dtype: int64
# When search_conducted is True, search_type is never missing
ri[ri.search_conducted==True].search_type.value_counts(dropna=False)
Incident to Arrest 1219
Probable Cause 891
Inventory 220
Reasonable Suspicion 197
Protective Frisk 161
Incident to Arrest,Inventory 129
Incident to Arrest,Probable Cause 106
Probable Cause,Reasonable Suspicion 75
Incident to Arrest,Inventory,Probable Cause 34
Incident to Arrest,Protective Frisk 33
Probable Cause,Protective Frisk 33
Inventory,Probable Cause 22
Incident to Arrest,Reasonable Suspicion 13
Incident to Arrest,Inventory,Protective Frisk 11
Protective Frisk,Reasonable Suspicion 11
Inventory,Protective Frisk 11
Incident to Arrest,Probable Cause,Protective Frisk 10
Incident to Arrest,Probable Cause,Reasonable Suspicion 6
Incident to Arrest,Inventory,Reasonable Suspicion 4
Inventory,Reasonable Suspicion 4
Inventory,Probable Cause,Protective Frisk 2
Inventory,Probable Cause,Reasonable Suspicion 2
Incident to Arrest,Protective Frisk,Reasonable Suspicion 1
Probable Cause,Protective Frisk,Reasonable Suspicion 1
Name: search_type, dtype: int64
ri[ri.search_conducted==True].search_type.isnull().sum()
0
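The effect of `dropna` in `value_counts` can be seen on a short Series. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with three missing values.
s = pd.Series(["Frisk", np.nan, np.nan, "Inventory", np.nan])

default_counts = s.value_counts()             # missing values are skipped
with_missing = s.value_counts(dropna=False)   # missing values get their own row

print(default_counts)
print(with_missing)
print(s.isna().sum())
```

When a column looks empty after `value_counts()`, rerunning it with `dropna=False` shows whether the rows are really absent or only missing.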
Inspect the search types
ri.search_type.value_counts(dropna=False)
NaN 88545
Incident to Arrest 1219
Probable Cause 891
Inventory 220
Reasonable Suspicion 197
Protective Frisk 161
Incident to Arrest,Inventory 129
Incident to Arrest,Probable Cause 106
Probable Cause,Reasonable Suspicion 75
Incident to Arrest,Inventory,Probable Cause 34
Incident to Arrest,Protective Frisk 33
Probable Cause,Protective Frisk 33
Inventory,Probable Cause 22
Incident to Arrest,Reasonable Suspicion 13
Inventory,Protective Frisk 11
Incident to Arrest,Inventory,Protective Frisk 11
Protective Frisk,Reasonable Suspicion 11
Incident to Arrest,Probable Cause,Protective Frisk 10
Incident to Arrest,Probable Cause,Reasonable Suspicion 6
Incident to Arrest,Inventory,Reasonable Suspicion 4
Inventory,Reasonable Suspicion 4
Inventory,Probable Cause,Reasonable Suspicion 2
Inventory,Probable Cause,Protective Frisk 2
Incident to Arrest,Protective Frisk,Reasonable Suspicion 1
Probable Cause,Protective Frisk,Reasonable Suspicion 1
Name: search_type, dtype: int64
ri['frisk']=ri.search_type=='Protective Frisk'
ri.frisk.dtype
dtype('bool')
ri.frisk.sum()
161
ri.frisk.mean()
0.0017549405391264537
ri.frisk.value_counts()
False 91580
True 161
Name: frisk, dtype: int64
161/(91580+161)
0.0017549405391264537
String operations
# The step above assigned the result of ri.search_type=='Protective Frisk' to the ri['frisk'] column
# Now use a substring match instead
ri['frisk']=ri.search_type.str.contains('Protective Frisk')
ri.frisk.sum()
274
ri.frisk.mean()
0.08573216520650813
# mean() above gave the share of matching rows; value_counts() gives the counts of matching and non-matching rows
ri.frisk.value_counts()
False 2922
True 274
Name: frisk, dtype: int64
# Check that the manual calculation matches the result of mean()
274/(2922+274)
0.08573216520650813
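The section above computes a proportion from a string match
Make sure the proportion is computed against the right denominator
pandas ignores missing values in these calculations: str.contains returns NaN where search_type is missing, so the mean of 0.0857 is 274 / 3196 (stops with a search), not 274 / 91741 (all stops)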
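The denominator issue can be reproduced on a few rows. This sketch uses made-up values; `matched` and `contains_all` are hypothetical names, and `na=False` is an optional argument of `str.contains` that treats missing values as non-matches.

```python
import numpy as np
import pandas as pd

# Hypothetical search_type column: 3 searched stops, 3 stops without a search.
search_type = pd.Series(
    ["Protective Frisk", "Inventory,Protective Frisk", "Probable Cause",
     np.nan, np.nan, np.nan],
    dtype=object,
)

exact = search_type == "Protective Frisk"                   # NaN compares as False
contains = search_type.str.contains("Protective Frisk")     # NaN stays NaN

# Rate among searched stops only: drop the NaN rows before averaging.
matched = contains.dropna().astype(bool)
rate_among_searches = matched.mean()        # 2 of 3

# Rate among all stops: count missing values as "no frisk".
contains_all = search_type.str.contains("Protective Frisk", na=False)
rate_among_all = contains_all.mean()        # 2 of 6

print(exact.sum(), rate_among_searches, rate_among_all)
```

Both rates are valid; which one to report depends on whether the question is about searches or about all stops.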
# Which year has the fewest stops
ri.stop_date.str.slice(0,4).value_counts()
2012 10970
2006 10639
2007 9476
2014 9228
2008 8752
2015 8599
2011 8126
2013 7924
2009 7908
2010 7561
2005 2558
Name: stop_date, dtype: int64
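# Convert ri.stop_date to a datetime Series and store it in a new stop_datetime column
ri['stop_datetime'] = pd.to_datetime(ri.stop_date)
# Note the .dt accessor, which is similar to the .str accessor used above
# After .dt you can use attributes such as year and month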
ri.stop_datetime.dt.year.value_counts()
2012 10970
2006 10639
2007 9476
2014 9228
2008 8752
2015 8599
2011 8126
2013 7924
2009 7908
2010 7561
2005 2558
Name: stop_datetime, dtype: int64
ri.stop_datetime.dt.month.value_counts()
1 8479
5 7935
11 7877
10 7745
3 7742
6 7630
8 7615
7 7568
4 7529
9 7427
12 7152
2 7042
Name: stop_datetime, dtype: int64
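The string slice and the `.dt` accessor reach the same year by different routes: one yields text, the other integers. A short sketch on made-up dates:

```python
import pandas as pd

# Hypothetical date strings in the same format as stop_date.
stop_date = pd.Series(["2005-01-02", "2005-03-14", "2015-12-31"])

year_as_text = stop_date.str.slice(0, 4)     # strings such as "2005"
stop_datetime = pd.to_datetime(stop_date)    # datetime64 values
years = stop_datetime.dt.year                # integers such as 2005
months = stop_datetime.dt.month

print(year_as_text.tolist())
print(years.tolist())
print(months.tolist())
```

The datetime version is usually preferable because it also gives access to month, weekday, and date arithmetic without further parsing.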
# Drug-related stops
ri.drugs_related_stop.dtype
dtype('bool')
# Baseline rate
ri.drugs_related_stop.mean()
0.008883705213590434
# There is no hour column to group by yet
# Convert stop_time to datetime, then group by the hour taken from it
ri['stop_time_datetime']=pd.to_datetime(ri.stop_time)
ri.groupby(ri.stop_time_datetime.dt.hour).drugs_related_stop.mean()
stop_time_datetime
0 0.019728
1 0.013507
2 0.015462
3 0.017065
4 0.011811
5 0.004762
6 0.003040
7 0.003281
8 0.002687
9 0.006288
10 0.005714
11 0.006976
12 0.004467
13 0.010326
14 0.007810
15 0.006416
16 0.005723
17 0.005517
18 0.010148
19 0.011596
20 0.008084
21 0.013342
22 0.013533
23 0.016344
Name: drugs_related_stop, dtype: float64
# Plot of the drug-related stop rate by hour
ri.groupby(ri.stop_time_datetime.dt.hour).drugs_related_stop.mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x9d72d30>
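# Number of stops by hour (all stops, not only drug-related); value_counts sorts by count, so this line plot is out of order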
ri.stop_time_datetime.dt.hour.value_counts().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x5460710>
# The same counts sorted by hour with sort_index, which gives a meaningful line plot
ri.stop_time_datetime.dt.hour.value_counts().sort_index().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x5420860>
ri.groupby(ri.stop_time_datetime.dt.hour).stop_date.count().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x557c2e8>
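Grouping by hour does not require a new column: `groupby` accepts a Series of keys. The sketch below uses made-up data and passes an explicit `format`, which is an assumption about how the time strings look ("HH:MM"); without it, `pd.to_datetime` fills in the current date for time-only strings.

```python
import pandas as pd

# Hypothetical toy data in the same shape as the police data.
df = pd.DataFrame({
    "stop_time": ["01:55", "01:10", "08:15", "23:15"],
    "drugs_related_stop": [True, False, False, True],
})

# Parse the time strings and keep only the hour.
hours = pd.to_datetime(df.stop_time, format="%H:%M").dt.hour

rate_by_hour = df.groupby(hours).drugs_related_stop.mean()   # share of drug-related stops
stops_by_hour = hours.value_counts().sort_index()            # number of stops, ordered by hour

print(rate_by_hour)
print(stops_by_hour)
```

Calling `.plot()` on either result would draw the corresponding line chart; `sort_index()` keeps the x axis in hour order.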
# Mark invalid values as missing
ri.stop_duration.value_counts()
0-15 Min 69543
16-30 Min 13635
30+ Min 3228
1 1
2 1
Name: stop_duration, dtype: int64
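# This chained assignment does not work: it sets the value on a copy and triggers SettingWithCopyWarning
ri[(ri.stop_duration=='1')|(ri.stop_duration=='2')].stop_duration='NaN'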
C:\Anaconda3\lib\site-packages\pandas\core\generic.py:4401: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self[name] = value
ri.stop_duration.value_counts()
0-15 Min 69543
16-30 Min 13635
30+ Min 3228
1 1
2 1
Name: stop_duration, dtype: int64
# .loc modifies ri itself, but 'NaN' here is a string, not a real missing value
ri.loc[(ri.stop_duration=='1')|(ri.stop_duration=='2'),'stop_duration']='NaN'
ri.stop_duration.value_counts(dropna=False)
0-15 Min 69543
16-30 Min 13635
NaN 5333
30+ Min 3228
NaN 2
Name: stop_duration, dtype: int64
# Replace the 'NaN' strings with the real missing value np.nan
import numpy as np
ri.loc[ri.stop_duration == 'NaN', 'stop_duration'] = np.nan
ri.stop_duration.value_counts(dropna=False)
0-15 Min 69543
16-30 Min 13635
NaN 5335
30+ Min 3228
Name: stop_duration, dtype: int64
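The two-step fix above can be done in one step by assigning `np.nan` through `.loc` directly. A minimal sketch on made-up data; `df` and `bad` are hypothetical names:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two invalid entries, "1" and "2".
df = pd.DataFrame({"stop_duration": ["0-15 Min", "1", "16-30 Min", "2", "30+ Min"]})

# Row mask for the invalid values.
bad = df.stop_duration.isin(["1", "2"])

# A single .loc call selects rows and column together, so the assignment
# reaches the original frame instead of a temporary copy.
df.loc[bad, "stop_duration"] = np.nan

print(df.stop_duration.value_counts(dropna=False))
```

The chained form `df[bad].stop_duration = ...` first builds a filtered copy and then assigns to it, which is why the original data stayed unchanged earlier.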
# Alternative: replace() marks '1' and '2' as missing in one step
ri['stop_duration'] = ri.stop_duration.replace(['1', '2'], np.nan)
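# The categories in stop_duration
# Series.map accepts a function or a dict-like object that holds a mapping.
# It applies the mapping to every value in a column; here it replaces the categories in bulk
# map is not an in-place operation, so store the result in a new column
mapping={'0-15 Min':8,'16-30 Min':23,'30+ Min':45}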
ri['stop_minutes'] = ri.stop_duration.map(mapping)
# Check the mapped value for each category
ri.stop_minutes.value_counts()
8.0 69543
23.0 13635
45.0 3228
Name: stop_minutes, dtype: int64
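How `map` treats values without a mapping can be checked on a short Series. The mapping values 8, 23 and 45 are the ones chosen in this post; the Series itself is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical duration column with one missing value.
duration = pd.Series(["0-15 Min", "16-30 Min", "30+ Min", np.nan, "0-15 Min"])

mapping = {"0-15 Min": 8, "16-30 Min": 23, "30+ Min": 45}

# map returns a new Series; values with no entry in the dict become NaN.
minutes = duration.map(mapping)

print(minutes.tolist())
print(minutes.mean())   # NaN is ignored: (8 + 23 + 45 + 8) / 4
```

Because of the NaN, the result is a float Series, which matches the 8.0, 23.0 and 45.0 shown in the output above.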
ri.groupby('violation_raw').stop_minutes.mean()
violation_raw
APB 20.987342
Call for Service 22.034669
Equipment/Inspection Violation 11.460345
Motorist Assist/Courtesy 16.916256
Other Traffic Violation 13.900265
Registration Violation 13.745629
Seatbelt Violation 9.741531
Special Detail/Directed Patrol 15.061100
Speeding 10.577690
Suspicious Person 18.750000
Violation of City/Town Ordinance 13.388626
Warrant 21.400000
Name: stop_minutes, dtype: float64
# agg applies one or more methods such as mean and count to the grouped data.
# agg used to work only on groupby results; it now also works on DataFrame and Series objects.
ri.groupby('violation_raw').stop_minutes.agg(['mean','count'])
violation_raw | mean | count |
---|---|---|
APB | 20.987342 | 79 |
Call for Service | 22.034669 | 1298 |
Equipment/Inspection Violation | 11.460345 | 11020 |
Motorist Assist/Courtesy | 16.916256 | 203 |
Other Traffic Violation | 13.900265 | 16223 |
Registration Violation | 13.745629 | 3432 |
Seatbelt Violation | 9.741531 | 2952 |
Special Detail/Directed Patrol | 15.061100 | 2455 |
Speeding | 10.577690 | 48462 |
Suspicious Person | 18.750000 | 56 |
Violation of City/Town Ordinance | 13.388626 | 211 |
Warrant | 21.400000 | 15 |
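Reporting the count next to the mean shows how many rows each average rests on, which matters for small groups such as Warrant above. A sketch on made-up numbers (`df` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing duration.
df = pd.DataFrame({
    "violation_raw": ["Speeding", "Speeding", "Warrant", "Warrant", "Warrant"],
    "stop_minutes": [8.0, 23.0, 45.0, 23.0, np.nan],
})

# One row per group, one column per aggregation; count skips NaN.
summary = df.groupby("violation_raw").stop_minutes.agg(["mean", "count"])

# agg also works on a plain Series, without groupby.
overall = df.stop_minutes.agg(["mean", "count"])

print(summary)
print(overall)
```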
plot draws a line chart by default
ri.groupby('violation_raw').stop_minutes.mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x10873ef0>
# Switch to a bar chart
ri.groupby('violation_raw').stop_minutes.mean().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x1092eb38>
ri.groupby('violation_raw').stop_minutes.mean().plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x10a4a5f8>
ri.groupby('violation').driver_age.describe()
violation | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
Equipment | 11007.0 | 31.781503 | 11.400900 | 16.0 | 23.0 | 28.0 | 38.0 | 89.0 |
Moving violation | 16164.0 | 36.120020 | 13.185805 | 15.0 | 25.0 | 33.0 | 46.0 | 99.0 |
Other | 4204.0 | 39.536870 | 13.034639 | 16.0 | 28.0 | 39.0 | 49.0 | 87.0 |
Registration/plates | 3427.0 | 32.803035 | 11.033675 | 16.0 | 24.0 | 30.0 | 40.0 | 74.0 |
Seat belt | 2952.0 | 32.206301 | 11.213122 | 17.0 | 24.0 | 29.0 | 38.0 | 77.0 |
Speeding | 48361.0 | 33.530097 | 12.821847 | 15.0 | 23.0 | 30.0 | 42.0 | 90.0 |
ri.driver_age.plot(kind='hist')
<matplotlib.axes._subplots.AxesSubplot at 0x1003a518>
ri.driver_age.value_counts().sort_index().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x10088080>
ri.hist('driver_age', by='violation')
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000100D8438>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010111208>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x000000001013B898>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010163F28>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000101945F8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010194630>]],
dtype=object)
ri.hist('driver_age',by='violation',sharex=True)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000102C6C50>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000000103243C8>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010346908>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010370E80>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000103A1438>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000000103A1470>]],
dtype=object)
ri.hist('driver_age',by='violation',sharex=True,sharey=True)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000104C4F98>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000000001059D358>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000105C0748>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000000105E9B38>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010613F28>,
<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010613F60>]],
dtype=object)
ri.head()
stop_date | stop_time | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | frisk | stop_datetime | stop_time_datetime | stop_minutes | new_age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005-01-02 | 01:55 | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-01-02 | 2019-04-05 01:55:00 | 8.0 | 20.0 |
1 | 2005-01-18 | 08:15 | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-01-18 | 2019-04-05 08:15:00 | 8.0 | 40.0 |
2 | 2005-01-23 | 23:15 | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-01-23 | 2019-04-05 23:15:00 | 8.0 | 33.0 |
3 | 2005-02-20 | 17:15 | M | 1986.0 | 19.0 | White | Call for Service | Other | False | NaN | Arrest Driver | True | 16-30 Min | False | NaN | 2005-02-20 | 2019-04-05 17:15:00 | 23.0 | 19.0 |
4 | 2005-03-14 | 10:00 | F | 1984.0 | 21.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-03-14 | 2019-04-05 10:00:00 | 8.0 | 21.0 |
ri.tail()
stop_date | stop_time | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | frisk | stop_datetime | stop_time_datetime | stop_minutes | new_age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
91736 | 2015-12-31 | 20:27 | M | 1986.0 | 29.0 | White | Speeding | Speeding | False | NaN | Warning | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 20:27:00 | 8.0 | 29.0 |
91737 | 2015-12-31 | 20:35 | F | 1982.0 | 33.0 | White | Equipment/Inspection Violation | Equipment | False | NaN | Warning | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 20:35:00 | 8.0 | 33.0 |
91738 | 2015-12-31 | 20:45 | M | 1992.0 | 23.0 | White | Other Traffic Violation | Moving violation | False | NaN | Warning | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 20:45:00 | 8.0 | 23.0 |
91739 | 2015-12-31 | 21:42 | M | 1993.0 | 22.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 21:42:00 | 8.0 | 22.0 |
91740 | 2015-12-31 | 22:46 | M | 1959.0 | 56.0 | Hispanic | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | NaN | 2015-12-31 | 2019-04-05 22:46:00 | 8.0 | 56.0 |
ri['new_age']=ri.stop_datetime.dt.year-ri.driver_age_raw
ri[['driver_age','new_age']].hist()
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000107FE7F0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000000001083C2E8>]],
dtype=object)
ri[['driver_age','new_age']].describe()
driver_age | new_age | |
---|---|---|
count | 86120.000000 | 86414.000000 |
mean | 34.011333 | 39.784294 |
std | 12.738564 | 110.822145 |
min | 15.000000 | -6794.000000 |
25% | 23.000000 | 24.000000 |
50% | 31.000000 | 31.000000 |
75% | 43.000000 | 43.000000 |
max | 99.000000 | 2015.000000 |
ri[(ri.new_age<15)|(ri.new_age>99)].shape
(294, 19)
ri.driver_age_raw.isnull().sum()
5327
ri.driver_age.isnull().sum()
5621
5621-5327
294
ri[(ri.driver_age_raw.notnull())&(ri.driver_age.isnull())].head()
stop_date | stop_time | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | frisk | stop_datetime | stop_time_datetime | stop_minutes | new_age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
146 | 2005-10-05 | 08:50 | M | 0.0 | NaN | White | Other Traffic Violation | Moving violation | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-10-05 | 2019-04-05 08:50:00 | 8.0 | 2005.0 |
281 | 2005-10-10 | 12:05 | F | 0.0 | NaN | White | Other Traffic Violation | Moving violation | False | NaN | Warning | False | 0-15 Min | False | NaN | 2005-10-10 | 2019-04-05 12:05:00 | 8.0 | 2005.0 |
331 | 2005-10-12 | 07:50 | M | 0.0 | NaN | White | Motorist Assist/Courtesy | Other | False | NaN | No Action | False | 0-15 Min | False | NaN | 2005-10-12 | 2019-04-05 07:50:00 | 8.0 | 2005.0 |
414 | 2005-10-17 | 08:32 | M | 2005.0 | NaN | White | Other Traffic Violation | Moving violation | False | NaN | Citation | False | 0-15 Min | False | NaN | 2005-10-17 | 2019-04-05 08:32:00 | 8.0 | 0.0 |
455 | 2005-10-18 | 18:30 | F | 0.0 | NaN | White | Speeding | Speeding | False | NaN | Warning | False | 0-15 Min | False | NaN | 2005-10-18 | 2019-04-05 18:30:00 | 8.0 | 2005.0 |
ri.loc[(ri.new_age<15)|(ri.new_age>99),'new_age']=np.nan
ri.new_age.equals(ri.driver_age)
True
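The last steps recompute the age from the stop year and the birth year, blank out implausible results, and compare with `equals`. A sketch on made-up rows; the bounds 15 and 99 are the ones used above, while `df` and `expected` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a valid birth year, two invalid ones (0 and 2005), one missing.
df = pd.DataFrame({
    "stop_date": ["2005-01-02", "2005-10-05", "2005-10-17", "2005-11-01"],
    "driver_age_raw": [1985.0, 0.0, 2005.0, np.nan],
})

year = pd.to_datetime(df.stop_date).dt.year
df["new_age"] = year - df.driver_age_raw      # 20, 2005, 0, NaN

# Ages outside the plausible range are treated as missing.
df.loc[(df.new_age < 15) | (df.new_age > 99), "new_age"] = np.nan

expected = pd.Series([20.0, np.nan, np.nan, np.nan])

# equals treats NaN in the same position as equal; == does not.
print(df.new_age.equals(expected))
print((df.new_age == expected).all())
```

This is why `equals` is the right check here: an element-wise `==` would report a mismatch on every missing row even when the two columns agree.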