1.Pandas读取数据

一般错误

import pandas as pd
pd.read_csv(r'D:\数据分析\02_Pandas\pandas\food_info.csv')

out:

---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-15-cc3e7efb5b57> in <module>()
1 import pandas as pd
----> 2 pd.read_csv(r'D:\数据分析\02_Pandas\pandas\food_info.csv')
...
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source() OSError: Initializing from file failed

使用read_csv()函数时,可能会报错如上。

显示以上错误,网上解答说可能是路径含有中文名,或者只能读取当前文件夹下的文件;

设置:engine=“python”解决问题!! 默认engine=“C”,使用C时速度较快,但是包含中文时出错。

数据读取 

In:

# read_csv函数,读取数据

import pandas as pd

food_info = pd.read_csv(r'D:\数据分析\02_Pandas\pandas\food_info.csv', engine='python')
# print(food_info) # pandas读取文件后,数据类型为dataframe格式,
# numpy读取文件后的数据类型为array个数
print(type(food_info)) # dataframe.dtypes 返回每列数据的类型;object是pandas中的字符型数据
# numpy中的dtype返回整个文件的数据类型,因此要求数据类型一致;不一致时自动类型转换
print(food_info.dtypes)

Out:

<class 'pandas.core.frame.DataFrame'>
NDB_No int64
Shrt_Desc object
...
FA_Mono_(g) float64
FA_Poly_(g) float64
Cholestrl_(mg) float64
dtype: object

一般属性

In:

# head(),tail()方法的使用

# print(food_info.head())  # 默认显示前5行
print(food_info.head(3)) # head()函数:读取数据的前3行
# food_info.tail(4) # tail()函数:显示末尾4行
# print(food_info.columns) # columns属性:返回所有列名
print(food_info.shape) # shape属性:显示行列数

Out:

   NDB_No                 Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g)  \
0 1001 BUTTER WITH SALT 15.87 717 0.85
1 1002 BUTTER WHIPPED WITH SALT 15.87 717 0.85
2 1003 BUTTER OIL ANHYDROUS 0.24 876 0.28 Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) \
0 81.11 2.11 0.06 0.0 0.06
1 81.11 2.11 0.06 0.0 0.06
2 99.48 0.00 0.00 0.0 0.00 ... [3 rows x 36 columns]
(8618, 36)

2.Pandas索引与计算

loc与iloc

In:

# loc与iloc索引取值

# print(food_info.loc[0])  # 通过loc方法,获取第一行数据的值
print(food_info.iloc[0]) # 通过iloc方法,同样获取结果
# print(food_info.loc[:3]) # 传入切片,获取若干行
# print(food_info.loc[[2, 3, 5]]) # 以列表的形式传入索引,获取索引对应的数据

Out:

NDB_No                         1001
Shrt_Desc BUTTER WITH SALT
Water_(g) 15.87
Energ_Kcal 717
Protein_(g) 0.85
...
Vit_K_(mcg) 7
FA_Sat_(g) 51.368
FA_Mono_(g) 21.021
FA_Poly_(g) 3.043
Cholestrl_(mg) 215
Name: 0, dtype: object

切片操作

In:

# 切片操作

# food_info.loc[3:6]  # 取出3,4,5,6,共四个样本
# food.info.iloc[3:6] # 只能取到3,4,5共三个样本 index = [2, 3, 5]
# food_info.loc[index]
food_info.iloc[index] # 此时loc和iloc无区别

Out:

    NDB_No    Shrt_Desc    Water_(g)    Energ_Kcal    Protein_(g)    Lipid_Tot_(g)    Ash_(g)    Carbohydrt_(g)    Fiber_TD_(g)    Sugar_Tot_(g)    ...    Vit_A_IU    Vit_A_RAE    Vit_E_(mg)    Vit_D_mcg    Vit_D_IU    Vit_K_(mcg)    FA_Sat_(g)    FA_Mono_(g)    FA_Poly_(g)    Cholestrl_(mg)
2 1003 BUTTER OIL ANHYDROUS 0.24 876 0.28 99.48 0.00 0.00 0.0 0.00 ... 3069.0 840.0 2.80 1.8 73.0 8.6 61.924 28.732 3.694 256.0
3 1004 CHEESE BLUE 42.41 353 21.40 28.74 5.11 2.34 0.0 0.50 ... 721.0 198.0 0.25 0.5 21.0 2.4 18.669 7.778 0.800 75.0
5 1006 CHEESE BRIE 48.42 334 20.75 27.68 2.70 0.45 0.0 0.45 ... 592.0 174.0 0.24 0.5 20.0 2.3 17.410 8.013 0.826 100.0
3 rows × 36 columns

说明:
loc是根据索引取值,[start: end]表示的范围从start到end
iloc是根据行号取值,[start: end]表示的范围从start到end-1(即不包括end)

In:

# 取某一列数据的三种方法

# 方法1:传入列名,获取该列的值
column = food_info['NDB_No']
print(column.head(3))
print('---' * 10) # 同样以列表形式传入多列的名称,获取多列数据
print(food_info[['NDB_No', 'Water_(g)']].head(3))
print('---' * 10) # 方法2:通过loc方法,逗号左边传入行数(切片),右边传入列名
water_loc = food_info.loc[:, 'Water_(g)']
print(water_loc.head(3))
print('---' * 10) # 方法3:通过iloc方法,逗号左边传入行数(切片),右边传入列号索引
water_iloc = food_info.iloc[:, 2]
print(water_iloc.head(3))

Out:

0    1001
1 1002
2 1003
Name: NDB_No, dtype: int64
------------------------------
NDB_No Water_(g)
0 1001 15.87
1 1002 15.87
2 1003 0.24
------------------------------
0 15.87
1 15.87
2 0.24
Name: Water_(g), dtype: float64
------------------------------
0 15.87
1 15.87
2 0.24
Name: Water_(g), dtype: float64

In:

# 获取两列或多列数据的三种方法

cols = ['Protein_(g)', 'Water_(g)']

# 方法一: 直接传入列名(由列名构成的list结构)
cols_value = food_info[cols]
print(cols_value.head(3)) # 显示三行
print('---' * 10) # 方法二: 通过loc方法,逗号左边传入要取的行数,右边传入列名
cols_loc = food_info.loc[:, cols]
print(cols_loc.head(3))
print('---'*10) # 方法三:通过iloc方法,逗号左边传入要取的行数,右边传入列名索引的list
cols_iloc = food_info.iloc[:, [2, 4]]
print(cols_iloc.head(3))

Out:

Protein_(g)  Water_(g)
0 0.85 15.87
1 0.85 15.87
2 0.28 0.24
------------------------------
Protein_(g) Water_(g)
0 0.85 15.87
1 0.85 15.87
2 0.28 0.24
------------------------------
Water_(g) Protein_(g)
0 15.87 0.85
1 15.87 0.85
2 0.24 0.28

In:

# 获取所有列中以g为结尾的列

# 获取所有列名,并使用tolist()方法转化为列表
cols = food_info.columns.tolist()
new_col = [] # 对列名进行遍历
for col in cols:
if col.endswith('(g)'): # 判断列名是否以g为结尾
new_col.append(col) # 获取所有以g为结尾的列名 new_data = food_info[new_col] # 以列名为索引,取对应的值
new_data.head(3) #显示前3行

Out:

    Water_(g)    Protein_(g)    Lipid_Tot_(g)    Ash_(g)    Carbohydrt_(g)    Fiber_TD_(g)    Sugar_Tot_(g)    FA_Sat_(g)    FA_Mono_(g)    FA_Poly_(g)
0 15.87 0.85 81.11 2.11 0.06 0.0 0.06 51.368 21.021 3.043
1 15.87 0.85 81.11 2.11 0.06 0.0 0.06 50.489 23.426 3.012
2 0.24 0.28 99.48 0.00 0.00 0.0 0.00 61.924 28.732 3.694

3.Pandas数据预处理

基础运算

In:

# 加减乘除操作

# print(food_info['Iron_(mg)'])

 #对应相除,对每个数据操作
div_1000 = food_info['Iron_(mg)']/1000
print(div_1000.head(3))
print('---'*10) # 对应相加,其他操作类似,依次对应元素操作
add_1000 = food_info['Iron_(mg)'] + 1000
print(add_1000.head(3))
print('---'*10) # 取出数据中的两列,然后相乘。也是对应元素相乘
water_energy = food_info['Iron_(mg)'] * food_info['Water_(g)']
print(water_energy.head(3))
print('---'*10) # 将water_energy列做为新元素列,添加到food_info中是以字典的形式更新
print(food_info.shape)
food_info['water_energy'] = water_energy
print(food_info.shape)

Out:

0    0.00002
1 0.00016
2 0.00000
Name: Iron_(mg), dtype: float64
------------------------------
0 1000.02
1 1000.16
2 1000.00
Name: Iron_(mg), dtype: float64
------------------------------
0 0.3174
1 2.5392
2 0.0000
dtype: float64
------------------------------
(8618, 36)
(8618, 37)

归一化操作

In:

# 归一化操作

# 用.max()函数,获取列最大值
max_calories = food_info['Energ_Kcal'].max()
print(max_calories)
print('---'*10) norm_calories = food_info['Energ_Kcal'] / max_calories
print(norm_calories.head(3))

Out:

902
------------------------------
0 0.794900
1 0.794900
2 0.971175
Name: Energ_Kcal, dtype: float64

排序操作

In:

# 排序操作

# 参数1:对该列进行排序;
# 参数二:是否覆盖该列;默认升序
# 缺失值显示NaN,默认缺失值排在最后
food_info.sort_values('Sodium_(mg)', inplace=True)
print(food_info['Sodium_(mg)'].head(3))
print('---'*10) # 指定参数ascending=False,改为降序
food_info.sort_values('Sodium_(mg)', inplace=True, ascending=False)
print(food_info['Sodium_(mg)'].head(3))
print('---'*10)

Out:

760     0.0
787 0.0
2270 0.0
Name: Sodium_(mg), dtype: float64
------------------------------
276 38758.0
5814 27360.0
6192 26050.0
Name: Sodium_(mg), dtype: float64

In:

import pandas as pd
import numpy as np # 先读取泰坦尼克号收据
titantic_survival = pd.read_csv(r'D:\数据分析\02_Pandas\pandas\titanic_train.csv', engine='python')
titantic_survival.head() # 显示前5行

Out:

PassengerId    Survived    Pclass    Name    Sex    Age    SibSp    Parch    Ticket    Fare    Cabin    Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

缺失值判断

In:

# 读取年龄列
age = titantic_survival['Age'] # 判断age列中的值是否缺失
# 缺省值Nan,返回布尔值;
# 缺失值返回True;否则返回False
age_is_null = pd.isnull(age) # 以age_is_null 为索引值,获取所有缺失值
age[age_is_null].head
# 利用len函数,得到缺失值的数量。
print(len(age[age_is_null]))

Out:

177

In:

# 传入没有缺失值的索引,获取所有没有缺失值的行

good_age = titantic_survival['Age'][age_is_null == False]
good_age.head()

Out:

0    22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64

4.Pandas的常用函数

平均值函数

In:

# mean函数获取平均值

mean_age = titantic_survival['Age'].mean()
print(mean_age) # 求平均值要求所有值不能有缺失
# mean函数会自动过滤缺失值

Out:

29.69911764705882

缺失值处理函数

In:

# fillna函数对缺失值进行处理

age_series = titantic_survival['Age']

#用平均值替换缺失值
# age_series.fillna(mean_age, inplace=True)
# print(age_series.head(3))

In:

# dropna函数对缺失值处理

# 按列删除,删除有缺失值的列
# axis=1按照列删除
drop_nan_cols = titantic_survival.dropna(axis=1) # 按行删除,删除age和sex列中有缺失的行
# axis=0表示按行删除
# subset表示对哪个列进行操作
dropped_nan = titantic_survival.dropna(axis=0, subset=['Age', 'Sex'])
dropped_nan.head()

Out:

PassengerId    Survived    Pclass    Name    Sex    Age    SibSp    Parch    Ticket    Fare    Cabin    Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

透视表操作

In:

# 统计不同等级船舱的平均费用

passenget_classes = [1, 2, 3]  # 船舱等级
fares_by_class = {} for this_class in passenget_classes:
# 以船舱等级为this_class为索引,得到this_class的所有行
class_rows = titantic_survival[titantic_survival["Pclass"] == this_class]
# 从得到数据中,取Fare列对应的所有值
class_fare = class_rows['Fare']
# 取平均
fare_mean = class_fare.mean()
# 统计
fares_by_class[this_class] = fare_mean print(fares_by_class)

Out:

{1: 84.15468749999992, 2: 20.66218315217391, 3: 13.675550101832997}

In:

# 利用pivot_table函数,实现上述功能。

passenger_survival = titantic_survival.pivot_table(index='Pclass', values='Survived', aggfunc=np.mean)
print(passenger_survival) # index参数:设置索引,以index给定值为基准进行统计
# values参数:设置值, 要统计的列
# aggfunc参数:设置对values的运算函数,按照哪种方式进行统计

Out:

        Survived
Pclass
1 0.629630
2 0.472826
3 0.242363

In:

# pivot_table函数,统计船舱中乘客的平均年龄

passenger_age = titantic_survival.pivot_table(index='Pclass', values='Age')
print(passenger_age)
# 默认aggfunc参数为np.mean

Out:

              Age
Pclass
1 38.233441
2 29.877630
3 25.140620

In:

# pivot_table函数,统计登船地点的价格与获救人数多个指标

port_stats = titantic_survival.pivot_table(index='Embarked', values=['Fare', 'Survived'], aggfunc=np.sum)
print(port_stats)

Out:

                Fare  Survived
Embarked
C 10072.2962 93
Q 1022.2543 30
S 17439.3988 217

In:

# loc函数,获取具体列中的元素

row_index_age = titantic_survival.loc[83, 'Age']  # 获取83号样本的年龄
row_index_cls = titantic_survival.loc[83, 'Pclass'] # 获取83号样本的船舱等级

设置索引

In:

# reset_index函数,设置索引

# 首先利用sort_values按照年龄降序排序
new_titantic = titantic_survival.sort_values('Age', ascending=False)
# 利用reset_index函数重新设置索引,并删除缺失值;drop=True删掉之前的索引值
new_titantic_reindex = new_titantic.reset_index(drop=True) print(new_titantic_reindex.loc[:3])

Out:

   PassengerId  Survived  Pclass                                  Name   Sex  \
0 631 1 1 Barkworth, Mr. Algernon Henry Wilson male
1 852 0 3 Svensson, Mr. Johan male
2 494 0 1 Artagaveytia, Mr. Ramon male
3 97 0 1 Goldschmidt, Mr. George B male

apply函数

In:

# apply函数,自定义函数实现特定功能

def hundred_row(columns):

    # 获取每列中的第100号样本
item = columns.loc[99]
return item # 使用apply函数,传入函数对象,执行函数中的功能
hundred_row = titantic_survival.apply(hundred_row)
print(hundred_row) # dataframe.apply(function, axis) 对一行或一列做一些操作
# axis=0时,对dataframe的每一列进行操作
# apply函数对dataframe的每一列传入到function中,并获取返回值
# 和线程池中的apply函数原理相同

Out:

PassengerId                  100
Survived 0
Pclass 2
Name Kantor, Mr. Sinai
Sex male
Age 34
SibSp 1
Parch 0
Ticket 244367
Fare 26
Cabin NaN
Embarked S
dtype: object

In:

# apply函数,定义函数,重新定义船舱等级

def which_class(row):
pclass = row['Pclass']
if pd.isnull(pclass):
return "Unknow"
elif pclass == 1:
return "First class"
elif pclass == 2:
return "Second class"
else:
return "Third class" classes = titantic_survival.apply(which_class, axis=1)
print(classes.head())

Out:

0    Third class
1 First class
2 Third class
3 First class
4 Third class
dtype: object

In:

# apply函数,统计年龄是否成年

def gen_age_label(row):
age = row['Age'] if pd.isnull(age):
return "Unknow"
elif age < 18:
return "Minor"
else:
return "Adult" age_lab = titantic_survival.apply(gen_age_label, axis=1)
print(age_lab.head())

Out:

0    Adult
1 Adult
2 Adult
3 Adult
4 Adult
dtype: object

In:

# pivot_table函数,统计成年人获救概率

titantic_survival['age_lab'] = age_lab
age_suvival = titantic_survival.pivot_table(index='age_lab', values='Survived')
print(age_suvival)

Out:

         Survived
age_lab
Adult 0.361183
Minor 0.539823

In:

# apply函数, 统计每一列中缺失值个数

def not_null(columns):

    # 判断每列中的值是否为缺失值
col_null_index = pd.isnull(columns)
# 获取所有缺失值
null = columns[col_null_index] return len(null) # 返回缺失值数量 null_count = titantic_survival.apply(not_null)
print(null_count)

Out:

PassengerId      0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

5.Pandas的series结构

series

In:

# dataframe中的一行或一列称之为series
# series就是dataframe的一个子集 import pandas as pd
import numpy as np fandango = pd.read_csv(r"D:\数据分析\02_Pandas\pandas\fandango_score_comparison.csv", engine='python')
series_film = fandango['FILM'] print(type(series_film))
print(series_film.head())
print(series_film[:5]) series_rt = fandango['RottenTomatoes']
series_rt[:4]

Out:

<class 'pandas.core.series.Series'>
0 Avengers: Age of Ultron (2015)
1 Cinderella (2015)
2 Ant-Man (2015)
3 Do You Believe? (2015)
4 Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object
0 Avengers: Age of Ultron (2015)
1 Cinderella (2015)
2 Ant-Man (2015)
3 Do You Believe? (2015)
4 Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object

构造series

In:

# 构造一个series结构

from pandas import Series

# 拿到电影名, values返回该列的所有值,返回ndarray
film_name = series_film.values
# 拿到评分值,返回ndarray
film_rt = fandango['RottenTomatoes'].values # 构造新series,设置索引为电影名
series_new = Series(film_rt, index=film_name)
print(series_new.head()) # 设置索引后,即可按照索引取值,相当于额外增加一条索引
series_new[['Cinderella (2015)', 'Ant-Man (2015)']]
# 也可以按照原来的数字索引取值
series_new[:2]

Out:

Avengers: Age of Ultron (2015)    74
Cinderella (2015) 85
Ant-Man (2015) 80
Do You Believe? (2015) 18
Hot Tub Time Machine 2 (2015) 14
dtype: int64
Avengers: Age of Ultron (2015) 74
Cinderella (2015) 85
dtype: int64

注:

利用series_new.loc[:3]会报错!!
因为loc方法的索引可以为整型,字符型,但是这里创建的series结构的时候,指定的index是字符串,不是整型,所以会报错。

series索引

In:

original__index = series_new.index.tolist()
# 排序
sorted_index = sorted(original__index) # reindex(或set_index)函数重新设置索引
sorted_by_reindex = series_new.reindex(sorted_index)
print(sorted_by_reindex.head())

Out:

'71 (2015)                    97
5 Flights Up (2015) 52
A Little Chaos (2015) 40
A Most Violent Year (2014) 90
About Elly (2015) 97
dtype: int64

series基本运算

In:

# series结构基本操作

# 相加,对应位置相加,其他运算类似
print(np.add(series_new, series_new).head()) # 判断(得到True和False做索引)
print((series_new > 50).head())

Out:

Avengers: Age of Ultron (2015)    148
Cinderella (2015) 170
Ant-Man (2015) 160
Do You Believe? (2015) 36
Hot Tub Time Machine 2 (2015) 28
dtype: int64
Avengers: Age of Ultron (2015) True
Cinderella (2015) True
Ant-Man (2015) True
Do You Believe? (2015) False
Hot Tub Time Machine 2 (2015) False
dtype: bool

02_Pandas基本使用的更多相关文章

随机推荐

  1. (二)django--带APP的网站

    1.打开终端,进入到django项目,创建APP应用:python manage.py startapp news 2.在settings.py中进行注册 3.在news下新建views.py,和ur ...

  2. react框架安装和使用

    react 其实react跟vue差不多, 区别:vue-  双向数据绑定, react  单向数据绑定. 中文文档:https://react.docschina.org/ 第一步:安装方式,不能直 ...

  3. 【阿里云IoT+YF3300】8.物联网设备用户脚本开发

    除了我们必须熟悉的网页脚本,比如JavaScript.其实在工业自动化中,组态软件是必备脚本的,只是有的脚本语言风格类似C或类似Basic而已.比如昆仑通泰的组态屏中的组态软件.通过安装组态软件可以简 ...

  4. 在虚拟机中使用DHCP动态管理主机地址

    小知识 DHCP协议服务能够自动化的管理局域网内的主机IP地址,有效的提升IP地址使用率,提高配置效率,减少管理与维护成本.简而言之,就是ip地址分配. *****五星重点 所需要的服务:dhcp 下 ...

  5. 使用 HTML5 WebSocket 构建实时 Web 应用

    原文地址:http://www.ibm.com/developerworks/cn/web/1112_huangxa_websocket/ HTML5 WebSocket 简介和实战演练 本文主要介绍 ...

  6. 基础练习1——ls的实现与递归

    学习贵在坚持,兜兜转转,发现还是从基础做起吧,打好基础,才会长期的坚持下去... 第一个练习:shell命令 “ls"的实现与递归 1.简介:ls 的作用是列举当前目录下所有的目录和文件. ...

  7. CSPS模拟 89

  8. CSPS模拟 64

    觉悟试炼场 暴力没打满有点遗憾 T2莫队没想到有点遗憾 T1 Trade 反悔贪心? 赛时猜了个解法,结果过样例过对拍就交了. 贪心依据:如果目前买入a有机会在b卖出赚钱,则a在任何最优方案中都被购买 ...

  9. LeetCode刷题总结-数组篇(番外)

    本期共7道题,三道简单题,四道中等题. 此部分题目是作者认为有价值去做的一些题,但是其考察的知识点不在前三篇总结系列里面. 例1解法:采用数组索引位置排序的思想. 例2解法:考察了组合数学的组合公式应 ...

  10. 通俗易懂了解Vue组件的通信方式

    1.前言 Vue框架倡导组件化开发,力求将一个大的项目拆分成若干个小的组件,就如同我们小时玩堆积木一样,一个大房子是由若干个小积木组成.组件化开发最大问题就是组件之间数据能够流通,即组件之间能够通信. ...