Feature Engineering with pandas
import numpy as np
import pandas as pd
So far in this chapter we've been concerned with rearranging data. Filtering, cleaning, and other transformations are another class of important operations.
Removing Duplicate Data
Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:
data = pd.DataFrame({
'k1': ['one', 'two']*3 + ['two'],
'k2': [1, 1, 2, 3, 3, 4, 4]
})
data
 | k1 | k2 |
---|---|---|
0 | one | 1 |
1 | two | 1 |
2 | one | 2 |
3 | two | 3 |
4 | one | 3 |
5 | two | 4 |
6 | two | 4 |
The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not:
# df.duplicated() flags each row that repeats an earlier row
data.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
Relatedly, drop_duplicates returns a DataFrame where the duplicated array is False:
# df.drop_duplicates() drops the flagged duplicate rows
data.drop_duplicates()
 | k1 | k2 |
---|---|---|
0 | one | 1 |
1 | two | 1 |
2 | one | 2 |
3 | two | 3 |
4 | one | 3 |
5 | two | 4 |
Both of these methods by default consider all of the columns; alternatively, you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates based only on the 'k1' column:
data['v1'] = range(7)
# judge duplicates on a subset of columns; whole rows are still dropped
data.drop_duplicates(['k1'])
 | k1 | k2 | v1 |
---|---|---|---|
0 | one | 1 | 0 |
1 | two | 1 | 1 |
duplicated and drop_duplicates by default keep the first observed value combination. Passing keep='last' will return the last one:
data
 | k1 | k2 | v1 |
---|---|---|---|
0 | one | 1 | 0 |
1 | two | 1 | 1 |
2 | one | 2 | 2 |
3 | two | 3 | 3 |
4 | one | 3 | 4 |
5 | two | 4 | 5 |
6 | two | 4 | 6 |
# on ['k1', 'k2'] the duplicates are rows 5 and 6 ([two, 4]); keep='last' keeps row 6
data.drop_duplicates(['k1', 'k2'], keep='last')
 | k1 | k2 | v1 |
---|---|---|---|
0 | one | 1 | 0 |
1 | two | 1 | 1 |
2 | one | 2 | 2 |
3 | two | 3 | 3 |
4 | one | 3 | 4 |
6 | two | 4 | 6 |
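A related option worth knowing: passing keep=False marks every member of a duplicate group, sparing neither the first nor the last occurrence:
data.duplicated(['k1', 'k2'], keep=False)  # rows 5 and 6 both come back True here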
Transforming Data with the map Function
For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
 | food | ounces |
---|---|---|
0 | bacon | 4.0 |
1 | pulled pork | 3.0 |
2 | bacon | 12.0 |
3 | Pastrami | 6.0 |
4 | corned beef | 7.5 |
5 | Bacon | 8.0 |
6 | pastrami | 3.0 |
7 | honey ham | 5.0 |
8 | nova lox | 6.0 |
Suppose you wanted to add a column indicating the type of animal that each food came from. Let's write down a mapping of each distinct meat type to the type of animal:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
The map method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the meats are capitalized and others are not. Thus, we need to convert each value to lowercase using the str.lower Series method:
lowercased = data['food'].str.lower()
lowercased
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
# map() looks each value up in the mapping
data['animal'] = lowercased.map(meat_to_animal)
data
 | food | ounces | animal |
---|---|---|---|
0 | bacon | 4.0 | pig |
1 | pulled pork | 3.0 | pig |
2 | bacon | 12.0 | pig |
3 | Pastrami | 6.0 | cow |
4 | corned beef | 7.5 | cow |
5 | Bacon | 8.0 | pig |
6 | pastrami | 3.0 | cow |
7 | honey ham | 5.0 | pig |
8 | nova lox | 6.0 | salmon |
# We could also have passed a function that does all the work:
data['food'].map(lambda x: meat_to_animal[x.lower()])
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
Using map is a convenient way to perform element-wise transformations and other data cleaning-related operations.
# cj test
cj_df = pd.DataFrame({
'a':[1,3],
'b':[2,4]
})
cj_df
# df[new_col] = df[key_col].map(mapping_dict)
cj_df['c'] = cj_df['a'].map({1:'cj', 3:'youge'})
cj_df
 | a | b |
---|---|---|
0 | 1 | 2 |
1 | 3 | 4 |
 | a | b | c |
---|---|---|---|
0 | 1 | 2 | cj |
1 | 3 | 4 | youge |
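Note that keys absent from the dict map to NaN rather than raising an error; a quick check with the toy cj_df above:
cj_df['a'].map({1: 'cj'})  # 3 is not in the dict, so its row becomes NaN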
Replacing Values
Filling in missing data with the fillna method is a special case of more general value replacement. As you've already seen, map can be used to modify a subset of values in an object, but replace provides a simpler and more flexible way to do so. Let's consider this Series:
(fillna is really just a special case of replacement; map applies a mapping over a dataset, while replace is the more general and flexible tool.)
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
The -999 values might be sentinel values for missing data. To replace these with NA values that pandas understands, we can use replace, producing a new Series (unless you pass inplace=True):
# first replace the sentinel with np.nan so pandas recognizes it as missing
data.replace(-999, np.nan, inplace=False)
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
data
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
If you want to replace multiple values at once, you instead pass a list and then the substitute value:
data.replace([-999, -1000], np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
# To use a different replacement for each value, pass a list of substitutes:
data.replace([-999, -1000], [np.nan, 0])
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
# The argument passed can also be a dict:
data.replace({-999:np.nan, -1000:0})
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
The data.replace method is distinct from data.str.replace, which performs string substitution element-wise. We'll look at these string methods on Series later in the chapter.
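A minimal sketch of that distinction, using a throwaway Series made up for illustration:
s = pd.Series(['cat', 'catalog'])
s.replace('cat', 'dog')      # whole-value match: ['dog', 'catalog']
s.str.replace('cat', 'dog')  # element-wise substring: ['dog', 'dogalog']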
Renaming Axis Indexes
Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in place without creating a new data structure. Here's a simple example:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['Ohio', 'Colorado', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
 | one | two | three | four |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
New York | 8 | 9 | 10 | 11 |
Like a Series, the axis indexes have a map method:
# the index's map applies the given function to every label
transform = lambda x: x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
# You can assign to index, modifying the DataFrame in place
# (Index.map has no inplace option, so assign the result back)
data.index = data.index.map(transform)
data
 | one | two | three | four |
---|---|---|---|---|
OHIO | 0 | 1 | 2 | 3 |
COLO | 4 | 5 | 6 | 7 |
NEW | 8 | 9 | 10 | 11 |
If you want to create a transformed version of a dataset without modifying the original, a useful method is rename:
data.rename(index=str.title, columns=str.upper)
 | ONE | TWO | THREE | FOUR |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colo | 4 | 5 | 6 | 7 |
New | 8 | 9 | 10 | 11 |
Notably, rename can be used in conjunction with a dict-like object providing new values for a subset of the axis labels:
data.rename(
index={'OHIO': 'INDIANA'},
columns={'three': 'peekaboo'})
 | one | two | peekaboo | four |
---|---|---|---|---|
INDIANA | 0 | 1 | 2 | 3 |
COLO | 4 | 5 | 6 | 7 |
NEW | 8 | 9 | 10 | 11 |
rename saves you from the chore of copying the DataFrame manually and assigning to its index and columns attributes. Should you wish to modify a dataset in-place, pass inplace=True:
data.rename(index={'OHIO':'cj'}, inplace=True)
data
 | one | two | three | four |
---|---|---|---|---|
cj | 0 | 1 | 2 | 3 |
COLO | 4 | 5 | 6 | 7 |
NEW | 8 | 9 | 10 | 11 |
Discretization and Binning
Continuous data is often discretized or otherwise separated into "bins" for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
Let's divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do so, you have to use cut, a function in pandas:
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
# cut pairs naturally with pd.value_counts() for per-bin counts
pd.value_counts(cats)
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
The object pandas returns is a special Categorical object. The output you see describes the bins computed by pandas.cut. You can treat it like an array of strings indicating the bin name; internally it contains a categories array specifying the distinct category names along with a labeling for the ages data in the codes attribute:
(The discretized result behaves like an array whose values are intervals.)
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
closed='right',
dtype='interval[int64]')
pd.value_counts(cats) # tally per bin; this could feed a bar chart
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
# plot not shown here
pd.value_counts(cats).plot(kind='bar')
Note that pd.value_counts(cats) gives the bin counts for the result of pandas.cut.
Consistent with mathematical notation for intervals, a parenthesis means that the side is open, while the square bracket means it is closed (inclusive). You can change which side is closed by passing right=False:
pd.cut(ages, [18,26,36,61,100], right=False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
You can also pass your own bin names by passing a list or array to the labels option:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
tmp = pd.cut(ages, bins, labels=group_names)
pd.value_counts(tmp)
Youth 5
MiddleAged 3
YoungAdult 3
Senior 1
dtype: int64
If you pass an integer number of bins to cut instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data. Consider the case of some uniformly distributed data chopped into fourths:
data = np.random.rand(20)
tmp = pd.cut(data, 4, precision=2)
tmp
pd.value_counts(tmp)
[(0.55, 0.76], (0.55, 0.76], (0.55, 0.76], (0.34, 0.55], (0.55, 0.76], ..., (0.14, 0.34], (0.76, 0.97], (0.55, 0.76], (0.34, 0.55], (0.14, 0.34]]
Length: 20
Categories (4, interval[float64]): [(0.14, 0.34] < (0.34, 0.55] < (0.55, 0.76] < (0.76, 0.97]]
(0.55, 0.76] 6
(0.34, 0.55] 6
(0.14, 0.34] 5
(0.76, 0.97] 3
dtype: int64
The precision=2 option limits the decimal precision to two digits.
A closely related function, qcut, bins the data based on sample quantiles. Depending on the distribution of the data, using cut will not usually result in each bin having the same number of data points. Since qcut uses sample quantiles instead, by definition you will obtain roughly equal-size bins:
data = np.random.randn(1000) # Normally distributed
cats = pd.qcut(data, 4) # Cut into quartiles
cats
pd.value_counts(cats)
[(-2.855, -0.667], (0.0288, 0.663], (0.0288, 0.663], (0.663, 3.065], (-0.667, 0.0288], ..., (-0.667, 0.0288], (0.0288, 0.663], (0.0288, 0.663], (0.663, 3.065], (0.663, 3.065]]
Length: 1000
Categories (4, interval[float64]): [(-2.855, -0.667] < (-0.667, 0.0288] < (0.0288, 0.663] < (0.663, 3.065]]
(0.663, 3.065] 250
(0.0288, 0.663] 250
(-0.667, 0.0288] 250
(-2.855, -0.667] 250
dtype: int64
Similar to cut, you can pass your own quantiles (numbers between 0 and 1, inclusive):
tmp = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1])
tmp
pd.value_counts(tmp)
[(-2.855, -1.311], (0.0288, 1.256], (0.0288, 1.256], (0.0288, 1.256], (-1.311, 0.0288], ..., (-1.311, 0.0288], (0.0288, 1.256], (0.0288, 1.256], (0.0288, 1.256], (1.256, 3.065]]
Length: 1000
Categories (4, interval[float64]): [(-2.855, -1.311] < (-1.311, 0.0288] < (0.0288, 1.256] < (1.256, 3.065]]
(0.0288, 1.256] 400
(-1.311, 0.0288] 400
(1.256, 3.065] 100
(-2.855, -1.311] 100
dtype: int64
We'll return to cut and qcut later in the chapter during our discussion of aggregation and group operations, as these discretization functions are especially useful for quantile and group analysis.
Detecting and Filtering Outliers
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | -0.004497 | 0.059209 | -0.041191 | 0.038118 |
std | 0.993640 | 1.011040 | 1.020444 | 0.999947 |
min | -3.147704 | -2.843186 | -4.352625 | -3.129246 |
25% | -0.695843 | -0.660866 | -0.727212 | -0.640414 |
50% | 0.022815 | -0.010976 | -0.024072 | 0.059942 |
75% | 0.699713 | 0.774702 | 0.674284 | 0.712486 |
max | 3.005114 | 3.279071 | 3.033638 | 3.515276 |
Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:
col = data[2] # the column labeled 2
col[np.abs(col) > 3]
26 -3.495486
324 -3.130600
364 3.007297
380 3.033638
566 -4.352625
791 3.010874
833 -3.226842
Name: 2, dtype: float64
To select all rows having a value exceeding 3 or -3, you can use the any method on a boolean DataFrame:
data[(np.abs(data) > 3).any(axis=1)] # rows where any column exceeds 3 in absolute value
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
25 | 0.756734 | -0.109773 | 0.737890 | -3.025528 |
26 | -0.766688 | 0.791026 | -3.495486 | 0.689195 |
158 | -0.403342 | 0.347707 | 1.014756 | 3.279205 |
324 | -0.070037 | -0.240627 | -3.130600 | 0.104002 |
339 | -1.450547 | 1.596675 | 0.609930 | 3.317804 |
364 | -1.005375 | -0.176153 | 3.007297 | -0.488084 |
380 | -0.520558 | -1.530794 | 3.033638 | -0.437202 |
391 | 0.991961 | -0.441668 | -1.225294 | 3.515276 |
394 | 0.201471 | 3.006423 | 0.052278 | 0.329850 |
404 | 1.129113 | 3.279071 | -0.251223 | -0.479738 |
439 | 0.535682 | 3.241431 | 1.109137 | -0.726348 |
491 | -3.147704 | 0.563707 | 0.017993 | -1.139543 |
566 | -1.652402 | -1.073253 | -4.352625 | -0.036447 |
584 | 3.005114 | -1.024521 | -0.213738 | 1.480222 |
630 | -3.023823 | -0.623671 | -1.414060 | -0.996899 |
674 | 0.045482 | -0.189843 | 0.817160 | 3.027287 |
760 | 2.101176 | 3.251508 | -1.328292 | -0.406980 |
791 | 1.190759 | 0.527168 | 3.010874 | -1.135035 |
833 | 0.321461 | -0.049159 | -3.226842 | 0.264828 |
885 | -3.031641 | 1.004770 | -0.674609 | 0.702201 |
933 | 0.603761 | -0.154076 | 0.061579 | -3.129246 |
Values can be set based on these criteria. Here is code to cap values outside the interval -3 to 3.
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | -0.004298 | 0.058431 | -0.039037 | 0.037133 |
std | 0.993000 | 1.008671 | 1.012805 | 0.995857 |
min | -3.000000 | -2.843186 | -3.000000 | -3.000000 |
25% | -0.695843 | -0.660866 | -0.727212 | -0.640414 |
50% | 0.022815 | -0.010976 | -0.024072 | 0.059942 |
75% | 0.699713 | 0.774702 | 0.674284 | 0.712486 |
max | 3.000000 | 3.000000 | 3.000000 | 3.000000 |
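As an aside, the built-in clip method caps values in one call and should give the same result as the np.sign approach above:
data.clip(lower=-3, upper=3).describe()  # bounds every value to [-3, 3]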
The statement np.sign(data) produces 1 and -1 based on whether the values in data are positive or negative:
np.sign(data).head() # np.sign: 1 for positive, 0 for zero, -1 for negative
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 1.0 | -1.0 | 1.0 | -1.0 |
1 | -1.0 | -1.0 | 1.0 | -1.0 |
2 | -1.0 | -1.0 | 1.0 | 1.0 |
3 | -1.0 | -1.0 | -1.0 | -1.0 |
4 | -1.0 | 1.0 | 1.0 | -1.0 |
# cj test
np.sign(100)
np.sign(0)
np.sign(-888)
1
0
-1
Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:
df = pd.DataFrame(np.arange(5*4).reshape((5,4)))
# a random permutation of the integers 0-4
sampler = np.random.permutation(5)
sampler
array([2, 1, 0, 3, 4])
That array can then be used in iloc-based indexing or the equivalent take function:
df
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
df.take(sampler) # rows reordered to [2, 1, 0, 3, 4]
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
2 | 8 | 9 | 10 | 11 |
1 | 4 | 5 | 6 | 7 |
0 | 0 | 1 | 2 | 3 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
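The iloc-based spelling mentioned above gives the same reordering:
df.iloc[sampler]  # identical to df.take(sampler)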
To select a random subset without replacement, you can use the sample method on Series and DataFrame.
# randomly pick 3 rows, without replacement
df.sample(n=3)
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
4 | 16 | 17 | 18 | 19 |
To generate a sample with replacement (to allow repeat choices), pass replace=True to sample:
choices = pd.Series([5, 7, -1, 6, 4])
# sampling with replacement (bootstrap-style)
draws = choices.sample(n=10, replace=True)
draws
1 7
1 7
2 -1
4 4
0 5
3 6
2 -1
4 4
2 -1
4 4
dtype: int64
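sample can also take a fraction of rows instead of a count:
df.sample(frac=0.5)  # half of the rows, without replacement by default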
Dummy Variables
- indicator (dummy) variables
- one-hot encoding
Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a "dummy" or "indicator" matrix: a column with k distinct values becomes a matrix or DataFrame with k columns of 1s and 0s (one-hot encoding). pandas has a get_dummies function for doing this, though devising one yourself is not difficult. Let's return to an earlier example DataFrame:
df = pd.DataFrame({'key':"b,b,a,c,a,b".split(','),
'data1': range(6)})
df
 | key | data1 |
---|---|---|
0 | b | 0 |
1 | b | 1 |
2 | a | 2 |
3 | c | 3 |
4 | a | 4 |
5 | b | 5 |
# one-hot encoding, in effect
pd.get_dummies(df['key'])
 | a | b | c |
---|---|---|---|
0 | 0 | 1 | 0 |
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 0 | 0 | 1 |
4 | 1 | 0 | 0 |
5 | 0 | 1 | 0 |
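Since the text claims devising one yourself is not difficult, here is a minimal hand-rolled sketch (not how get_dummies is implemented internally, just the same idea):
manual = pd.DataFrame({cat: (df['key'] == cat).astype(int)
                       for cat in sorted(df['key'].unique())})
manual  # matches pd.get_dummies(df['key'])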
In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. get_dummies has a prefix argument for doing this:
dummies = pd.get_dummies(df['key'], prefix='key')
# df[['data1']] keeps the DataFrame shape, so join attaches the dummy columns
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
 | data1 | key_a | key_b | key_c |
---|---|---|---|---|
0 | 0 | 0 | 1 | 0 |
1 | 1 | 0 | 1 | 0 |
2 | 2 | 1 | 0 | 0 |
3 | 3 | 0 | 0 | 1 |
4 | 4 | 1 | 0 | 0 |
5 | 5 | 0 | 1 | 0 |
If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Let's look at the MovieLens 1M dataset, which is investigated in more detail in Chapter 14.
movies = pd.read_table("../datasets/movielens/movies.dat", sep="::",
                       header=None, engine="python")
# pandas reads almost anything; engine="python" is needed because the default
# 'c' engine does not support multi-character separators (it raised a ParserWarning)
movies.info()
movies.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
0 3883 non-null int64
1 3883 non-null object
2 3883 non-null object
dtypes: int64(1), object(2)
memory usage: 91.1+ KB
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1 | Toy Story (1995) | Animation|Children's|Comedy |
1 | 2 | Jumanji (1995) | Adventure|Children's|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
# assign new column names via the columns attribute (modifies movies in place)
movies.columns = ['movie_id', "title", 'genres']
movies.head()
 | movie_id | title | genres |
---|---|---|---|
0 | 1 | Toy Story (1995) | Animation|Children's|Comedy |
1 | 2 | Jumanji (1995) | Adventure|Children's|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
Adding indicator variables for each genre requires a little bit of wrangling. First, we extract the list of unique genres in the dataset:
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
# the unique genres, after deduplication
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
'Western'], dtype=object)
One way to construct the indicator DataFrame is to start with a DataFrame of all zeros:
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
Now, iterate through each movie and set entries in each row of dummies to 1. To do this, we use dummies.columns to compute the column indices for each genre:
gen = movies.genres[0] # the first movie's genre string
"Animation|Children's|Comedy"
gen.split('|')
['Animation', "Children's", 'Comedy']
dummies.columns.get_indexer(gen.split("|"))
array([0, 1, 2], dtype=int64)
Then, we can use .iloc to set values based on these indices:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
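Then, as in the earlier prefix example, the indicator columns can be joined back onto movies (the Genre_ prefix here is just illustrative):
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]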
For much larger data, this method of constructing indicator variables with multiple memberships is not especially speedy. It would be better to write a lower-level function that writes directly to a NumPy array and then wrap the result in a DataFrame, as sketched below.
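One possible shape for such a helper, sketched under the assumption that each row's memberships arrive as a delimited string (encode_memberships is a made-up name, not a pandas API):
def encode_memberships(values, categories, sep='|'):
    # Pre-compute each category's column position, then fill a raw
    # NumPy array directly instead of indexing a DataFrame row by row.
    col = {c: j for j, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)), dtype=np.int8)
    for i, v in enumerate(values):
        for c in v.split(sep):
            out[i, col[c]] = 1
    return pd.DataFrame(out, columns=categories)
# encode_memberships(movies.genres, genres) would rebuild dummies much faster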
np.random.seed(123456) # cj: the values still change unless the seed is reset right before each draw; note they all fall in [0, 1)
values = np.random.rand(10)
values
array([0.37301223, 0.44799682, 0.12944068, 0.85987871, 0.82038836,
0.35205354, 0.2288873 , 0.77678375, 0.59478359, 0.13755356])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))
 | (0.0, 0.2] | (0.2, 0.4] | (0.4, 0.6] | (0.6, 0.8] | (0.8, 1.0] |
---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 1 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 1 |
4 | 0 | 1 | 0 | 0 | 0 |
5 | 0 | 1 | 0 | 0 | 0 |
6 | 0 | 0 | 1 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 1 |
8 | 1 | 0 | 0 | 0 | 0 |
9 | 0 | 0 | 1 | 0 | 0 |
We set the random seed with numpy.random.seed to make the example deterministic. We will look again at pandas.get_dummies later in the book. (cj: setting the seed is really for testing — it guarantees that repeated runs draw the same sample.)
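To see what the seed buys you, reset it before each draw; the draws then repeat exactly:
np.random.seed(123456)
first = np.random.rand(3)
np.random.seed(123456)
second = np.random.rand(3)
(first == second).all()  # True: same seed, same draws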