Data Preprocessing for Machine Learning: Reading Excel Data with Pandas

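Python has many libraries for reading and writing Excel files, the best known being xlrd, xlwt, xlutils and openpyxl. xlrd and xlwt are usually used together: xlrd reads Excel files and xlwt writes them. xlutils combined with xlrd makes it possible to modify an existing Excel file. openpyxl can both read and write Excel files.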
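As a rough illustration of the last point, the sketch below uses openpyxl alone to write a small workbook and read it back. The file name demo_openpyxl.xlsx and the sample rows are made up for this example, and openpyxl is assumed to be installed.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Hypothetical file name, placed in a temporary directory for the demo
path = os.path.join(tempfile.mkdtemp(), "demo_openpyxl.xlsx")

# Write: create a workbook and append rows to its active sheet
wb = Workbook()
ws = wb.active
ws.append(["Name", "Value"])
ws.append(["string1", 1])
ws.append(["string2", 2])
wb.save(path)

# Read: load the same file and collect each row as a tuple of cell values
wb_in = load_workbook(path)
rows = list(wb_in.active.iter_rows(values_only=True))
print(rows)
```

This works cell by cell and row by row, which is why pandas is usually more convenient once the goal is tabular preprocessing.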

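When it comes to data preprocessing, pandas shows its strength. It can read and write many file formats, Excel among them. This article focuses on using pandas to preprocess datasets stored in Excel files.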

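Commonly used machine learning models place requirements on their input data. For most algorithms the most basic requirement is that the training data be converted to numeric form. There are exceptions, such as decision tree algorithms, which in principle do not need numeric input, but they are not discussed here.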
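One quick way to see which columns still need converting is to ask pandas for the non-numeric ones. The DataFrame below is invented for illustration; select_dtypes is a standard pandas method.

```python
import pandas as pd

# Made-up sample data: two numeric features and one text label
df = pd.DataFrame({
    "height": [1.62, 1.75, 1.80],
    "weight": [55, 70, 82],
    "label": ["A", "B", "A"],
})

# Columns whose dtype is not numeric must be encoded before most models can use them
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```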

The pandas function for reading Excel files is pandas.read_excel(). Its main parameters include:

io : the location of the Excel document to read

string, path object (pathlib.Path or py._path.local.LocalPath),

file-like object, pandas ExcelFile, or xlrd workbook. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/workbook.xlsx

sheet_name : the sheet (or sheets) of the workbook to read

string, int, mixed list of strings/ints, or None, default 0

Strings are used for sheet names, Integers are used in zero-indexed sheet positions.

Lists of strings/integers are used to request multiple sheets.

Specify None to get all sheets.

str|int -> DataFrame is returned. list|None -> Dict of DataFrames is returned, with keys representing sheets.

Available Cases

Defaults to 0 -> 1st sheet as a DataFrame

1 -> 2nd sheet as a DataFrame

"Sheet1" -> 1st sheet as a DataFrame

[0, 1, "Sheet5"] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames

None -> All sheets as a dictionary of DataFrames

header : which row of the sheet, if any, to use as the column names

int, list of ints, default 0

Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a MultiIndex. Use None if there is no header.

names : the names to give the columns, passed as an array-like

array-like, default None

List of column names to use. If file contains no header row, then you should explicitly pass header=None

index_col : which column of the sheet, if any, to use as the row labels

int, list of ints, default None

Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex. If a subset of data is selected with usecols, index_col is based on the subset.

usecols : the columns to read, since a loaded Excel file often contains columns that are not needed

int or list, default None

If None then parse all columns,

If int then indicates last column to be parsed (deprecated since pandas 0.24; pass a list of ints instead)

If list of ints then indicates list of column numbers to be parsed

If string then indicates comma separated list of Excel column letters and column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of both sides.
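To make the parameters above more concrete, here is a minimal sketch that writes a two-sheet workbook and reads it back in several ways. The file name params_demo.xlsx, the sheet names and the sample data are all assumptions for this example, and an Excel engine such as openpyxl is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical workbook with two sheets, written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), "params_demo.xlsx")
scores = pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "score": [90, 85]})
notes = pd.DataFrame({"id": [1, 2], "note": ["x", "y"]})
with pd.ExcelWriter(path) as writer:
    scores.to_excel(writer, sheet_name="scores", index=False)
    notes.to_excel(writer, sheet_name="notes", index=False)

# sheet_name=None returns a dict that maps sheet names to DataFrames
all_sheets = pd.read_excel(path, sheet_name=None)
print(list(all_sheets))

# usecols="A,C" keeps Excel columns A and C; index_col=0 then refers to
# the first column of that subset, so "id" becomes the row labels
subset = pd.read_excel(path, sheet_name="scores", usecols="A,C", index_col=0)
print(subset)

# header=None treats the first row as data; names supplies the column labels
raw = pd.read_excel(path, sheet_name=1, header=None, names=["c0", "c1"])
print(raw)
```

Note that in the last call the original header row ("id", "note") shows up as the first data row, because header=None tells pandas not to consume it.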

Below are some examples of reading Excel data with pandas.

Write a dataset to an Excel file:

>>> import pandas as pd
>>> df_out = pd.DataFrame([('string1', 1),
...                        ('string2', 2),
...                        ('string3', 3)],
...                       columns=['Name', 'Value'])
>>> df_out
      Name  Value
0  string1      1
1  string2      2
2  string3      3
>>> df_out.to_excel('tmp.xlsx')

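Read the Excel file. Because to_excel also wrote the DataFrame index as the first column, index_col=0 is passed so that this column is used as the index again:

>>> pd.read_excel('tmp.xlsx', index_col=0)
      Name  Value
0  string1      1
1  string2      2
2  string3      3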

Setting both index_col and header to None means that neither the first row nor the first column of the sheet is used as the column header or the index:

>>> pd.read_excel('tmp.xlsx', index_col=None, header=None)
     0        1      2
0  NaN     Name  Value
1  0.0  string1      1
2  1.0  string2      2
3  2.0  string3      3

You can even specify the data type of each column:

>>> pd.read_excel('tmp.xlsx', index_col=0,
...               dtype={'Name': str, 'Value': float})
      Name  Value
0  string1    1.0
1  string2    2.0
2  string3    3.0

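Below is a combined example: read the sheet named sheet1 from text.xlsx, loading only columns D:F. Column F here holds the class label (the 类别, or "category", column), and its values 类别1 and 类别2 need to be converted to numbers so the data can be used as input to a machine learning model.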

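import pandas as pd

def reader(path, sheet):
    # Load only columns D through F of the given sheet
    return pd.read_excel(path, sheet_name=sheet, usecols='D:F')

trainrd = reader('text.xlsx', 'sheet1')
trainrd.head(5)  # inspect the first 5 rows
trainrd['x'] = 0  # create a new column x
# Convert the text labels in the 类别 (category) column to numbers
trainrd.loc[trainrd['类别'] == '类别1', 'x'] = 0
trainrd.loc[trainrd['类别'] == '类别2', 'x'] = 1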
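The same conversion can also be written with Series.map, which tends to stay readable when there are more than two classes. This is only a sketch on an in-memory DataFrame that stands in for the spreadsheet: the feature columns f1 and f2, their values, and the label_to_id dictionary are assumptions for illustration.

```python
import pandas as pd

# Stand-in for the data loaded from columns D:F of the spreadsheet
trainrd = pd.DataFrame({
    "f1": [0.1, 0.4, 0.3],
    "f2": [1.2, 0.7, 0.9],
    "类别": ["类别1", "类别2", "类别1"],
})

# Hypothetical mapping from text label to integer code
label_to_id = {"类别1": 0, "类别2": 1}

# map looks each label up in the dictionary; labels missing from it become NaN
trainrd["x"] = trainrd["类别"].map(label_to_id)
print(trainrd)
```

Unlike initialising x to 0 first, this approach leaves unexpected labels as NaN, which makes them easier to spot before training.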
