Data processing with pandas

1. Removing duplicate rows

import numpy as np

import pandas  as pd

from pandas import Series,DataFrame

df = DataFrame({"color":["red","white","red","green"], "size":[10,20,10,30]})

df

	color	size
0	red	10
1	white	20
2	red	10
3	green	30

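Use duplicated() to detect duplicate rows. It returns a boolean Series with one element per row; an element is True when that row repeats an earlier one (that is, it is not the first occurrence).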

df.duplicated()

0    False

1    False

2     True

3    False

dtype: bool

Use drop_duplicates() to remove the duplicate rows

df.drop_duplicates()

	color	size
0	red	10
1	white	20
3	green	30
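Both functions take optional arguments that change what counts as a duplicate. The following is a minimal sketch, assuming the same four-row table as above; the variable names (all_dupes, kept_last, by_color) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "white", "red", "green"],
                   "size": [10, 20, 10, 30]})

# keep=False marks every member of a duplicate group, not just the repeats
all_dupes = df.duplicated(keep=False)

# keep="last" retains the final occurrence instead of the first
kept_last = df.drop_duplicates(keep="last")

# subset restricts the comparison to the listed columns
by_color = df.drop_duplicates(subset=["color"])

print(all_dupes.tolist())
print(kept_last.index.tolist())
print(by_color.index.tolist())
```

With the default keep="first", only the later copies are flagged, which is the behaviour shown in the outputs above.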

If pd.concat([df1, df2], axis=1) produces a DataFrame with repeated column names, duplicated() and drop_duplicates() still run without error, as the outputs below show.

df2 = pd.concat((df, df), axis=1)

df2

	color	size	color	size
0	red	10	red	10
1	white	20	white	20
2	red	10	red	10
3	green	30	green	30

df2.duplicated()

0    False

1    False

2     True

3    False

dtype: bool

df2.drop_duplicates()

	color	size	color	size
0	red	10	red	10
1	white	20	white	20
3	green	30	green	30

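2. Mapping

What mapping means: build a table of correspondences that binds each element in the values to a specific label or string.

This requires a dictionary:

mapping = {'label1': 'value1', 'label2': 'value2', ...}

Three operations are covered:

replace(): replace elements
map(): create a new column (the most important of the three)
rename(): replace index labels

1) replace(): replacing elements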

df

	color	size
0	red	10
1	white	20
2	red	10
3	green	30

Use replace() to substitute values

First define a dictionary

color = {"red": 10, "green": 20}

Then call .replace()

df.replace(color, inplace=True)

df

	color	size
0	10	10
1	white	20
2	10	10
3	20	30

replace() is also commonly used to replace NaN elements

df.loc[1] = np.nan

df

	color	size
0	10	10.0
1	NaN	NaN
2	10	10.0
3	20	30.0

v = {np.nan:0.1}

df.replace(v)

	color	size
0	10.0	10.0
1	0.1	0.1
2	10.0	10.0
3	20.0	30.0
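A flat dictionary passed to replace() is applied to every column. As a hedged sketch of two common variations, the example below limits a replacement to one column with a nested dictionary and fills missing values with fillna(); the small table and the replacement codes "R" and "G" are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"color": ["red", "white", "red", "green"],
                   "size": [10.0, np.nan, 10.0, 30.0]})

# Nested dict: the outer key names the column, the inner dict is the mapping
recoded = df.replace({"color": {"red": "R", "green": "G"}})

# fillna() targets missing values directly; the dict fills per column
filled = df.fillna({"size": 0.1})

print(recoded)
print(filled)
```

Neither call modifies df, because inplace=True is not passed.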

============================================

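Exercise 19:

Suppose the score table for 张三 and 李四 contains perfect scores. The teacher treats these as cheating and wants every perfect score (both 150 and 300) recorded as 0. How can this be done?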

============================================

2) map(): creating a new column

Use map() to derive a new column from an existing one

It is suited to processing a single column.

df = DataFrame(np.random.randint(0,150,size  =(4,4)),columns = ["Python","Java","PHP","HTML"],

               index = ["张三","旭日","阳刚","木兰"])

df

	Python	Java	PHP	HTML
张三	72	33	12	62
旭日	128	12	92	127
阳刚	133	54	89	31
木兰	90	144	136	118

As before, start with a dictionary

v = {72:90,128:100,133:134,90:43}

df["Go"] = df["Python"].map(v)

df

	Python	Java	PHP	HTML	Go
张三	72	33	12	62	90
旭日	128	12	92	127	100
阳刚	133	54	89	31	134
木兰	90	144	136	118	43
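The dictionary above happens to contain every value in the Python column. When a value has no entry, map() yields NaN for it. A small sketch of that behaviour, using an intentionally incomplete dictionary and hypothetical index labels:

```python
import pandas as pd

scores = pd.Series([72, 128, 133, 90], index=["a", "b", "c", "d"])
lookup = {72: 90, 128: 100}  # deliberately incomplete

# Values without an entry in lookup become NaN
mapped = scores.map(lookup)

# One way to fall back to the original value where no entry exists
safe = scores.map(lookup).fillna(scores)

print(mapped)
print(safe)
```

Because NaN is a float, both results come back as float64 even though the inputs are integers.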

map() also accepts a lambda function

df["C"] = df["Go"].map(lambda x : x - 40)

df

	Python	Java	PHP	HTML	Go	C
张三	72	33	12	62	90	50
旭日	128	12	92	127	100	60
阳刚	133	54	89	31	134	94
木兰	90	144	136	118	43	3

def mp(x):

    if x < 50:

        return "不及格"

    else:

        return "优秀"

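# Pass a function to map(); this allows more complex logic than a dictionary or a lambda ("不及格" means fail, "优秀" means excellent)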

df["score"] = df['C'].map(mp)

df

	Python	Java	PHP	HTML	Go	C	score
张三	72	33	12	62	90	50	优秀
旭日	128	12	92	127	100	60	优秀
阳刚	133	54	89	31	134	94	优秀
木兰	90	144	136	118	43	3	不及格

transform() works much like map()

df["score2"] = df["C"].transform(mp)

df

	Python	Java	PHP	HTML	Go	C	score	score2
张三	72	33	12	62	90	360	优秀	优秀
旭日	128	12	92	127	100	400	优秀	优秀
阳刚	133	54	89	31	134	536	优秀	优秀
木兰	90	144	136	118	43	172	不及格	不及格

Note: the C values in this output and the ones that follow (360, 400, ...) do not match the C = Go - 40 column shown earlier. The outputs were evidently captured after C had been recomputed in the notebook session, so read the specific C figures as illustrative.

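Use map() to create a new column

# Can an existing column be modified as well? Yes: assign the result of map() back to the same column.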

df["C"] = df["C"].map(lambda x : x*2)

df

	Python	Java	PHP	HTML	Go	C	score	score2
张三	72	33	12	62	90	360	优秀	优秀
旭日	128	12	92	127	100	400	优秀	优秀
阳刚	133	54	89	31	134	536	优秀	优秀
木兰	90	144	136	118	43	172	不及格	不及格

============================================

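Exercise 20:

Add two columns holding the score status of 张三 and 李四: "failed" if the score is below 90, "excellent" if it is above 120, and "pass" otherwise.

[Hint] Pass a function as the argument to map()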

============================================

3) rename(): replacing index labels

df

	Python	Java	PHP	HTML	Go	C	score	score2
张三	72	33	12	62	90	720	优秀	优秀
旭日	128	12	92	127	100	800	优秀	优秀
阳刚	133	54	89	31	134	1072	优秀	优秀
木兰	90	144	136	118	43	344	不及格	不及格

def cols(x):

    if x == "Python":

        return "大蟒蛇"

    if x == "PHP":

        return "php"

    else:

        return x

inds = {'张三':"Zhang sir", '木兰':"Miss hua"}

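As before, the mapping can be a dictionary (inds); a function such as cols works too

Use rename() to replace the row index labels (index=) and the column labels (columns=)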

df.rename(index = inds, columns = cols)

	大蟒蛇	Java	php	HTML	Go	C	score	score2
Zhang sir	72	33	12	62	90	720	优秀	优秀
旭日	128	12	92	127	100	800	优秀	优秀
阳刚	133	54	89	31	134	1072	优秀	优秀
Miss hua	90	144	136	118	43	344	不及格	不及格
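Two details of rename() are easy to miss: it returns a new DataFrame unless inplace=True is passed, and labels that the mapping does not mention are left unchanged. A minimal sketch with a made-up two-row table:

```python
import pandas as pd

df = pd.DataFrame({"Python": [72, 128], "PHP": [12, 92]}, index=["a", "b"])

# A dict renames only the labels it lists; a callable is applied to every label
renamed = df.rename(index={"a": "A"}, columns=str.lower)

print(renamed)
print(df.columns.tolist())  # the original frame keeps its labels
```

Here str.lower plays the same role as the cols function above, just with a simpler rule.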

3. Detecting and filtering outliers

Use describe() to view descriptive statistics for each column

df.describe()

	Python	Java	PHP	HTML	Go	C
count	4.00000	4.000000	4.000000	4.000000	4.000000	4.000000
mean	105.75000	60.750000	82.250000	84.500000	91.750000	734.000000
std	29.57899	58.088295	51.525883	45.814845	37.562171	300.497365
min	72.00000	12.000000	12.000000	31.000000	43.000000	344.000000
25%	85.50000	27.750000	69.750000	54.250000	78.250000	626.000000
50%	109.00000	43.500000	90.500000	90.000000	95.000000	760.000000
75%	129.25000	76.500000	103.000000	120.250000	108.500000	868.000000
max	133.00000	144.000000	136.000000	127.000000	134.000000	1072.000000

Use std() to get the standard deviation of each column of a DataFrame

df

	Python	Java	PHP	HTML	Go	C	score	score2
张三	72	33	12	62	90	720	优秀	优秀
旭日	128	12	92	127	100	800	优秀	优秀
阳刚	133	54	89	31	134	1072	优秀	优秀
木兰	90	144	136	118	43	344	不及格	不及格

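df.std(numeric_only=True)  # skip the string columns score and score2; newer pandas versions no longer drop them silently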

Python     29.578990

Java       58.088295

PHP        51.525883

HTML       45.814845

Go         37.562171

C         300.497365

dtype: float64

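Filter the DataFrame's elements according to the standard deviation of each column.

Use any() to test for True values: it returns True if at least one element is True, and False otherwise.

Apply the filter condition to every column to remove values that lie too many standard deviations out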

df.drop(["score","score2"], axis = 1, inplace=True)

df

	Python	Java	PHP	HTML	Go	C
张三	72	33	12	62	90	720
旭日	128	12	92	127	100	800
阳刚	133	54	89	31	134	1072
木兰	90	144	136	118	43	344

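# stack() followed by unstack(level=0) swaps rows and columns; for this table the result matches df.T
df2 = df.stack().unstack(level = 0)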

df2

	张三	旭日	阳刚	木兰
Python	72	128	133	90
Java	33	12	54	144
PHP	12	92	89	136
HTML	62	127	31	118
Go	90	100	134	43
C	720	800	1072	344

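# df2.std() holds one standard deviation per column (per person);
# each value is compared with twice the standard deviation of its own column
cond = np.abs(df2) < df2.std()*2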

cond

	张三	旭日	阳刚	木兰
Python	True	True	True	True
Java	True	True	True	True
PHP	True	True	True	True
HTML	True	True	True	True
Go	True	True	True	True
C	False	False	False	False

df2.std()

张三    273.401109

旭日    292.212537

阳刚    403.757064

木兰    103.765922

dtype: float64

df.std(axis = 1)

张三    273.401109

旭日    292.212537

阳刚    403.757064

木兰    103.765922

dtype: float64

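# df2[cond] turns every value that fails the test into NaN; dropna() then drops the rows containing NaN (here the C row)
df2[cond].dropna()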

	张三	旭日	阳刚	木兰
Python	72.0	128.0	133.0	90.0
Java	33.0	12.0	54.0	144.0
PHP	12.0	92.0	89.0	136.0
HTML	62.0	127.0	31.0	118.0
Go	90.0	100.0	134.0	43.0

To delete rows by index label, use df.drop(labels, inplace=True)

============================================

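Exercise 21:

Create a DataFrame of shape 10000*3 drawn from the standard normal distribution (np.random.randn), then remove every row in which any element has an absolute value greater than 3 standard deviations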

============================================

n = np.random.randn(10000,3)

df = DataFrame(n)

df

	0	1	2
0	2.881592	0.536820	1.216572
1	0.456766	1.395878	1.768264
2	1.708867	1.622249	0.335690
3	0.194154	-0.487591	-0.015412
4	-2.255826	-0.237842	1.419884
5	1.049099	-0.366917	-0.042190
6	0.191674	-2.372953	1.019347
7	-0.838643	-0.399063	-1.339320
8	1.517263	-0.761005	-1.950791
9	0.251293	0.691856	-0.434976
10	-0.393337	-0.840542	1.051823
11	0.519547	-0.960125	0.693721
12	0.675356	0.742952	-1.987214
13	-1.073620	-1.786886	0.286581
14	-1.137472	-1.294179	-1.650784
15	-0.211439	-0.398124	-0.564845
16	0.150546	-0.094917	0.389879
17	0.202585	0.345154	-0.579804
18	-0.591010	-0.963711	1.492271
19	1.184359	-0.888860	-0.377440
20	-1.122213	0.263416	1.482226
21	-0.181044	-0.890953	1.385926
22	-1.860743	1.028910	0.016576
23	1.289668	0.079026	0.391087
24	-1.501513	1.269525	0.026053
25	-0.845240	0.744394	-1.451082
26	-1.094819	-0.503675	0.403650
27	1.037809	-0.475193	-1.582079
28	-1.655881	-0.532378	0.668746
29	2.176592	-1.564236	1.892409
...	...	...	...
9970	0.163431	-0.453160	-0.551507
9971	-1.818862	-0.315904	0.254854
9972	-0.284665	0.446465	-0.095406
9973	1.951441	0.167062	1.005489
9974	-0.139046	0.300747	-0.964243
9975	-0.292296	0.733086	1.749265
9976	0.565221	0.365676	0.724422
9977	0.554723	1.523374	-2.181834
9978	-1.321702	-2.075783	0.570540
9979	0.619274	-0.393143	-0.809066
9980	0.879297	-0.476391	-0.004182
9981	1.230847	0.951403	2.314687
9982	0.645433	0.313307	0.831975
9983	-0.317260	-0.246456	0.704056
9984	-0.698464	0.002091	0.498848
9985	0.593881	1.192555	-0.025465
9986	-1.343395	-1.148288	0.153664
9987	1.442074	-1.500158	-0.105832
9988	0.767976	0.209889	-0.486307
9989	0.832209	-0.969938	-0.664690
9990	-0.872977	0.166470	-0.534711
9991	-1.368020	-0.477498	0.157921
9992	0.449316	-0.021680	0.109007
9993	-0.967712	1.411765	0.529959
9994	-0.007388	0.807077	0.295686
9995	-0.241627	0.256662	0.890862
9996	-0.082404	1.090093	0.180587
9997	-1.086674	0.879875	-1.547565
9998	-0.639018	0.176242	-0.230805
9999	0.487361	-0.096955	0.262908

10000 rows × 3 columns

cond = np.abs(df) > df.std()*3

cond

	0	1	2
0	False	False	False
1	False	False	False
2	False	False	False
3	False	False	False
4	False	False	False
5	False	False	False
6	False	False	False
7	False	False	False
8	False	False	False
9	False	False	False
10	False	False	False
11	False	False	False
12	False	False	False
13	False	False	False
14	False	False	False
15	False	False	False
16	False	False	False
17	False	False	False
18	False	False	False
19	False	False	False
20	False	False	False
21	False	False	False
22	False	False	False
23	False	False	False
24	False	False	False
25	False	False	False
26	False	False	False
27	False	False	False
28	False	False	False
29	False	False	False
...	...	...	...
9970	False	False	False
9971	False	False	False
9972	False	False	False
9973	False	False	False
9974	False	False	False
9975	False	False	False
9976	False	False	False
9977	False	False	False
9978	False	False	False
9979	False	False	False
9980	False	False	False
9981	False	False	False
9982	False	False	False
9983	False	False	False
9984	False	False	False
9985	False	False	False
9986	False	False	False
9987	False	False	False
9988	False	False	False
9989	False	False	False
9990	False	False	False
9991	False	False	False
9992	False	False	False
9993	False	False	False
9994	False	False	False
9995	False	False	False
9996	False	False	False
9997	False	False	False
9998	False	False	False
9999	False	False	False

10000 rows × 3 columns

drop_index = df[cond.any(axis = 1)].index

df2 = df.drop(drop_index)

df2.shape

(9927, 3)
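The drop() call above can also be written as a boolean row filter. The sketch below is one possible variant, not the only way to do it; it assumes a fixed random seed (chosen arbitrarily) so the run is repeatable, and checks that both routes keep the same rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
df = pd.DataFrame(rng.standard_normal((10000, 3)))

# True where a value is more than 3 column standard deviations away from zero
cond = df.abs() > 3 * df.std()

# Route 1: keep the rows in which no value is flagged
kept = df[~cond.any(axis=1)]

# Route 2: collect the flagged row labels and drop them, as in the text above
dropped = df.drop(df[cond.any(axis=1)].index)

print(kept.shape, kept.equals(dropped))
```

The comparison df.abs() > 3 * df.std() lines up the Series of standard deviations with the columns of df, so each column is tested against its own threshold.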

4. Sorting

Use take() to reorder rows by position

np.random.permutation() can supply a random order

df = DataFrame(np.random.randint(0,150,size = (4,4)), columns = ["python","java","php","html"],

               index = ["张三","旭日","阳刚","木兰"])

df

	python	java	php	html
张三	20	109	25	43
旭日	83	98	22	39
阳刚	144	19	139	131
木兰	142	72	11	103

df.take([3,2,0])

	python	java	php	html
木兰	142	72	11	103
阳刚	144	19	139	131
张三	20	109	25	43

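# permutation(3) only shuffles positions 0 to 2, so the fourth row (木兰) is left out; np.random.permutation(len(df)) would shuffle every row
indices = np.random.permutation(3)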

indices

array([2, 0, 1])

df.take(indices)

	python	java	php	html
阳刚	144	19	139	131
张三	20	109	25	43
旭日	83	98	22	39
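To shuffle every row rather than a subset, the permutation has to cover all positions. A minimal sketch, assuming a made-up four-row table; sample(frac=1) is shown as the built-in alternative, with an arbitrary random_state.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"python": [20, 83, 144, 142]},
                  index=["a", "b", "c", "d"])

# permutation(len(df)) contains each position 0..len(df)-1 exactly once
order = np.random.permutation(len(df))
shuffled = df.take(order)

# sample(frac=1) returns all rows in random order
shuffled2 = df.sample(frac=1, random_state=0)

print(shuffled)
print(shuffled2)
```

Both results contain the same four rows as df, only in a different order.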

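Random sampling

When the DataFrame is large enough, np.random.randint() combined with take() gives a simple random sample. randint() can return the same position more than once, so this samples with replacement.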

df2 = DataFrame(np.random.randn(10000,3))

df2

	0	1	2
0	-0.025224	-0.565157	0.844056
1	-0.035657	-0.402001	0.472989
2	0.699627	0.803896	0.501812
3	0.245390	1.377563	1.070162
4	-0.748127	-1.863395	-1.011189
5	1.064520	0.941913	1.040098
6	0.342517	-0.420390	0.105190
7	-1.337581	1.902223	-1.237730
8	0.960661	0.510905	-0.702202
9	0.228292	-1.237225	-0.725750
10	0.894908	0.255933	1.285206
11	-0.112649	0.073029	0.226987
12	0.847398	1.278539	-0.316305
13	0.709176	-0.054754	-0.626551
14	-1.492717	-0.270664	1.138691
15	-1.050701	0.731788	1.515430
16	0.033859	-0.481181	0.449713
17	1.908899	0.013049	1.168205
18	2.003074	-0.694794	-2.443718
19	-0.305153	1.659422	-2.338938
20	-0.595257	0.649238	-0.782337
21	-0.143291	-0.661235	-1.292414
22	-0.451794	0.380953	1.187246
23	0.258405	0.352720	-0.671535
24	-1.710904	-1.020995	1.160462
25	-0.790192	0.688780	0.088410
26	-0.174850	-1.112568	-1.633942
27	0.213165	1.020418	0.533577
28	-0.853166	0.192139	2.363981
29	-0.197083	0.637195	-1.911048
...	...	...	...
9970	-1.317070	0.660991	0.393611
9971	-0.947604	-1.415052	1.662456
9972	0.517894	0.179165	0.489423
9973	-0.189215	0.657269	0.047626
9974	1.126898	-0.085763	-1.709755
9975	0.359945	0.411918	0.668606
9976	-0.491320	-1.247942	0.887130
9977	0.736900	0.136471	0.079652
9978	1.469600	0.852718	-0.141616
9979	1.110100	-0.394567	0.997196
9980	-0.581172	-1.658739	1.657382
9981	-1.173605	1.491162	-0.760518
9982	0.097367	0.252979	-1.697217
9983	0.079267	-1.369900	-0.870134
9984	-0.376669	-0.583582	0.250551
9985	0.419189	-0.367227	-0.496057
9986	-0.140032	0.202857	-0.476418
9987	-0.227373	-0.463283	0.559428
9988	-1.595745	0.392217	-0.160671
9989	0.007461	0.840525	0.841650
9990	1.266712	-1.190441	-0.983106
9991	-1.641171	-0.463228	-0.572552
9992	-1.494818	-0.851275	-0.443659
9993	-0.106178	-0.199535	1.542675
9994	-0.433710	-0.561674	-2.116589
9995	0.776234	-1.814600	0.539298
9996	0.099580	-0.133758	1.239752
9997	0.165359	1.558473	0.135779
9998	-0.870957	1.140052	0.056586
9999	-0.390214	-0.152384	-1.184713

10000 rows × 3 columns

indices = np.random.randint(0,10000,size  =10)

df2.take(indices)

	0	1	2
7761	-0.571778	0.099853	0.026818
1380	0.131348	0.017664	0.983131
733	0.986849	0.262630	-0.551597
2094	-0.514843	0.735007	-1.217740
3761	-1.863511	0.421299	-0.082948
4777	2.176549	1.485876	-0.087476
5176	1.748004	0.498117	1.088707
9606	-1.106140	-1.356788	-1.098564
4736	-1.377561	-0.461284	-1.994532
7965	0.588313	0.024674	0.059207
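If each row should be picked at most once, the positions can be drawn without replacement instead. This is a sketch under that assumption, using NumPy's Generator.choice with an arbitrary seed; it is an alternative to the randint() approach above, not what the original notebook ran.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.randn(10000, 3))

rng = np.random.default_rng(42)  # arbitrary seed
# replace=False guarantees that no position is drawn twice
positions = rng.choice(len(df2), size=10, replace=False)

picked = df2.take(positions)
print(picked)
```

DataFrame.sample(n=10) gives the same kind of result in a single call.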

============================================

Exercise 22:

Suppose ddd2 holds the midterm exam scores of 张三, 李四 and 王老五. Put these three students in random order

============================================

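5. Data aggregation

Data aggregation is the final step of data processing; it usually reduces each array to a single value.

Processing data by category:

Split: first divide the data into groups
Apply: apply a function to each group to transform its data
Combine: merge the results from the different groups

The core of processing data by category:

groupby()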

df.std()

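To compute the mean (or another statistic) of price for each value of color, call groupby() with the color column as the key, then select the price column and apply the aggregation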

df = DataFrame({'color':["red","white","red","cyan","cyan","green","white","cyan"],

                "price":np.random.randint(0,8,size  =8),

               "weight":np.random.randint(50,55,size = 8)}

              )

df

	color	price	weight
0	red	2	53
1	white	5	51
2	red	5	54
3	cyan	5	53
4	cyan	7	52
5	green	5	54
6	white	6	50
7	cyan	0	54

The .groups attribute of a groupby object shows which rows belong to each group. Here price is summed per color:

df_sum_price = df.groupby(["color"])[["price"]].sum()

df_sum_price

	price
color
cyan	12
green	5
red	7
white	11
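The .groups attribute itself is not shown in the outputs above. A minimal sketch of what it contains, using the color and price values from the table above (weight is omitted for brevity):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "white", "red", "cyan", "cyan", "green", "white", "cyan"],
                   "price": [2, 5, 5, 5, 7, 5, 6, 0]})

grouped = df.groupby("color")

# .groups maps each key to the index labels of the rows in that group
for key, labels in grouped.groups.items():
    print(key, list(labels))

# Aggregating the selected column then gives one value per key
price_sum = grouped["price"].sum()
print(price_sum)
```

Selecting with ["price"] returns a Series; the double brackets [["price"]] used above return a one-column DataFrame instead.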

df_mean_weight  = df.groupby(["color"])[["weight"]].mean()

df_mean_weight

	weight
color
cyan	53.0
green	54.0
red	53.5
white	50.5

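# Concatenation: concat aligns on index labels, and df (0 to 7) shares none with df_sum_price (the colors), so the result is padded with NaN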

pd.concat([df,df_sum_price],axis = 1)

	color	price	weight	price
0	red	2.0	53.0	NaN
1	white	5.0	51.0	NaN
2	red	5.0	54.0	NaN
3	cyan	5.0	53.0	NaN
4	cyan	7.0	52.0	NaN
5	green	5.0	54.0	NaN
6	white	6.0	50.0	NaN
7	cyan	0.0	54.0	NaN
cyan	NaN	NaN	NaN	12.0
green	NaN	NaN	NaN	5.0
red	NaN	NaN	NaN	7.0
white	NaN	NaN	NaN	11.0

df_mean_weight

	weight
color
cyan	53.0
green	54.0
red	53.5
white	50.5

df

	color	price	weight
0	red	2	53
1	white	5	51
2	red	5	54
3	cyan	5	53
4	cyan	7	52
5	green	5	54
6	white	6	50
7	cyan	0	54

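# Caution: color is the index of df_mean_weight, so the only column the two frames share is weight.
# merge() therefore joins on weight rather than color, and the means 53.5 and 50.5 end up in unmatched rows.
df_sum = df.merge(df_mean_weight, how = "outer")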

df_sum

	color	price	weight
0	red	2.0	53.0
1	cyan	5.0	53.0
2	white	5.0	51.0
3	red	5.0	54.0
4	green	5.0	54.0
5	cyan	0.0	54.0
6	cyan	7.0	52.0
7	white	6.0	50.0
8	NaN	NaN	53.5
9	NaN	NaN	50.5
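The merge above matches rows on weight, which is probably not what was intended. One way to attach each color's mean weight to its rows is to join the color column against the color index of the aggregate. This is a sketch of that approach using the values from the table above; the "_mean" suffix is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "white", "red", "cyan", "cyan", "green", "white", "cyan"],
                   "price": [2, 5, 5, 5, 7, 5, 6, 0],
                   "weight": [53, 51, 54, 53, 52, 54, 50, 54]})

df_mean_weight = df.groupby("color")[["weight"]].mean()

# left_on names the key column in df; right_index uses the color index of the aggregate.
# The suffixes keep the original weight column apart from the per-color mean.
merged = df.merge(df_mean_weight, how="left", left_on="color", right_index=True,
                  suffixes=("", "_mean"))

print(merged)
```

With how="left" every row of df is kept, and no unmatched NaN rows appear.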

============================================

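Exercise 23:

Suppose 张大妈 sells vegetables at the market, with the following attributes:

Item (item): radish, cabbage, chili pepper, winter melon

Color (color): white, green, red

Weight (weight)

Price (price)

Create a DataFrame ddd that uses the attributes as column labels
Aggregate ddd to get the total price of the white items
Aggregate ddd to get the total weight of all radishes (white radish, carrot and green radish) and their average price
Use merge to combine the total weight and the average price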

============================================

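6. Advanced data aggregation

pd.merge() can attach the result of an aggregation to every row of df

After grouping with groupby and applying a function such as sum, add_prefix() can be called at the end to rename the resulting columns

transform and apply can achieve the same thing

Just pass a function to transform or apply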
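The steps described above can be sketched as follows. The four-row table and the "total_" prefix are assumptions made for the example; the calls themselves (groupby, sum, add_prefix, pd.merge, transform) are the ones named in the text.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "white", "red", "cyan"],
                   "price": [2, 5, 5, 5],
                   "weight": [53, 51, 54, 53]})

# Aggregate per color, then prefix the result columns so they do not clash with df
totals = df.groupby("color")[["price", "weight"]].sum().add_prefix("total_")

# right_index=True matches df["color"] against the color index of totals
with_totals = pd.merge(df, totals, how="left", left_on="color", right_index=True)

# transform("sum") produces the same per-row totals in one step
same = df.groupby("color")[["price", "weight"]].transform("sum")

print(with_totals)
print(same)
```

The merge route keeps the aggregate as a separate table that can be reused, while transform is shorter when the totals are only needed next to the original rows.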

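Series.map() does not work for this: it calls the function once per element, so sum() receives a single string such as "red" and raises a TypeError:

df["columns"] = df["color"].map(sum)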

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-132-6a8b5973654d> in <module>()

----> 1 df["columns"] = df["color"].map(sum)

      2 df

C:\anaconda\lib\site-packages\pandas\core\series.py in map(self, arg, na_action)

   2352         else:

   2353             # arg is a function

-> 2354             new_values = map_f(values, arg)

   2355

   2356         return self._constructor(new_values,

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()

TypeError: unsupported operand type(s) for +: 'int' and 'str'

sum([10,2])

df

	color	price	weight
0	red	2	53
1	white	5	51
2	red	5	54
3	cyan	5	53
4	cyan	7	52
5	green	5	54
6	white	6	50
7	cyan	0	54

# "sum" is passed by name; newer pandas versions may warn when the builtin sum is passed instead
df.groupby("color")[["price","weight"]].transform("sum")

	price	weight
0	7	107
1	11	101
2	7	107
3	12	159
4	12	159
5	5	54
6	11	101
7	12	159

df

	color	price	weight
0	red	2	53
1	white	5	51
2	red	5	54
3	cyan	5	53
4	cyan	7	52
5	green	5	54
6	white	6	50
7	cyan	0	54

# each group arrives as a DataFrame; g.sum() adds up its columns
df.groupby("color")[["price","weight"]].apply(lambda g: g.sum())

	price	weight
color
cyan	12	159
green	5	54
red	7	107
white	11	101

transform() and apply() also accept a user-defined function or a lambda

df = DataFrame({'color':['white','black','white','white','black','black'], 'status':['up','up','down','down','down','up'], 'value1':[12.33,14.55,22.34,27.84,23.40,18.33], 'value2':[11.23,31.80,29.99,31.18,18.25,22.44]})

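The object apply() operates on, that is, the argument passed to the lambda, is the whole array of a column's values, not a single element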
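No call is shown for this last DataFrame, so the following is only a sketch of how a lambda might be used with it. The two statistics (the range of value1 per color, and value1 centred on its group mean) are illustrative choices, not taken from the original.

```python
import pandas as pd

df = pd.DataFrame({"color": ["white", "black", "white", "white", "black", "black"],
                   "status": ["up", "up", "down", "down", "down", "up"],
                   "value1": [12.33, 14.55, 22.34, 27.84, 23.40, 18.33],
                   "value2": [11.23, 31.80, 29.99, 31.18, 18.25, 22.44]})

g = df.groupby("color")["value1"]

# apply(): the lambda receives all value1 entries of one group as a Series
# and here returns a single number per group
spread = g.apply(lambda s: s.max() - s.min())

# transform(): the lambda receives the same Series, but the result is
# aligned back to the original rows
centered = g.transform(lambda s: s - s.mean())

print(spread)
print(centered)
```

spread has one entry per color, while centered has one entry per row of df.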

============================================

Exercise 24:

Use transform and apply to reproduce the results of Exercise 23

============================================
