[Machine Learning with Python] How to get your data?
Using Pandas Library
The simplest way is to read data from .csv
files and store it as a data frame object:
import pandas as pd
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
You can also read .xsl
files and directly select the rows and columns you are interested in by setting parameters skiprows
, usecols
. Also, you can indicate index column by parameter index_col
.
energy=pd.read_excel('Energy Indicators.xls', sheet_name='Energy',skiprows=8,usecols='E,G', index_col=None, na_values=['NA'])
For .txt
files, you can also use read_csv
function by defining the separation symbol:
university_towns=pd.read_csv('university_towns.txt',sep='\n',header=None)
See more about pandas io operations in http://pandas.pydata.org/pandas-docs/stable/io.html
Using os Module
Read .csv
files:
import os
import csv
for file in os.listdir("objective_folder"):
with open('objective_folder/'+file, newline='') as csvfile:
rows = csv.reader(csvfile) # read csc file
for row in rows: # print each line in the file
print(row)
Read .xsl
files:
import os
import xlrd
for file in os.listdir("objective_folder/"):
data = xlrd.open_workbook('objective_folder/'+file)
table = sheel_1 = data.sheet_by_index(0)#the first sheet in Excel
nrows = table.nrows #row number
for i in range(nrows):
if i == 0: # skip the first row if it defines variable names
continue
row_values = table.row_values(i) #read each row value
print(row_values)
Download from Website Automatically
We can also try to read data directly from url link. This time, the .csv
file is compressed as housing.tgz
. We need to download the file and then decompress it. So you can write a small function as below to realize it. It is a worthy effort because you can get the most recent data every time you run the function.
import os
import tarfile
from six.moves import urllib
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
if not os.path.isdir(housing_path):
os.makedirs(housing_path)
tgz_path = os.path.join(housing_path, "housing.tgz")
urllib.request.urlretrieve(housing_url, tgz_path)
housing_tgz = tarfile.open(tgz_path)
housing_tgz.extractall(path=housing_path)
housing_tgz.close()
when you call fetch_housing_data()
, it creates a datasets/housing
directory in your workspace, downloads the housing.tgz
file, and extracts the housing.csv
from it in this directory.
Now let’s load the data using Pandas. Once again you should write a small function to load the data:
import pandas as pd
def load_housing_data(housing_path=HOUSING_PATH):
csv_path = os.path.join(housing_path, "housing.csv")
return pd.read_csv(csv_path)
What’s more?
These methods are what I have met so far. In typical environments your data would be available in a relational database (or some other common datastore) and spread across multiple tables/documents/files. To access it, you would first need to get your credentials and access authorizations, and familiarize yourself with the data schema. I will supplement more methods if I encounter in the future.
[Machine Learning with Python] How to get your data?的更多相关文章
- 【Machine Learning】Python开发工具:Anaconda+Sublime
Python开发工具:Anaconda+Sublime 作者:白宁超 2016年12月23日21:24:51 摘要:随着机器学习和深度学习的热潮,各种图书层出不穷.然而多数是基础理论知识介绍,缺乏实现 ...
- Python (1) - 7 Steps to Mastering Machine Learning With Python
Step 1: Basic Python Skills install Anacondaincluding numpy, scikit-learn, and matplotlib Step 2: Fo ...
- Getting started with machine learning in Python
Getting started with machine learning in Python Machine learning is a field that uses algorithms to ...
- 《Learning scikit-learn Machine Learning in Python》chapter1
前言 由于实验原因,准备入坑 python 机器学习,而 python 机器学习常用的包就是 scikit-learn ,准备先了解一下这个工具.在这里搜了有 scikit-learn 关键字的书,找 ...
- Machine Learning的Python环境设置
Machine Learning目前经常使用的语言有Python.R和MATLAB.如果采用Python,需要安装大量的数学相关和Machine Learning的包.一般安装Anaconda,可以把 ...
- [Machine Learning with Python] Familiar with Your Data
Here I list some useful functions in Python to get familiar with your data. As an example, we load a ...
- [Machine Learning with Python] My First Data Preprocessing Pipeline with Titanic Dataset
The Dataset was acquired from https://www.kaggle.com/c/titanic For data preprocessing, I firstly def ...
- [Machine Learning with Python] Data Preparation through Transformation Pipeline
In the former article "Data Preparation by Pandas and Scikit-Learn", we discussed about a ...
- [Machine Learning with Python] Data Preparation by Pandas and Scikit-Learn
In this article, we dicuss some main steps in data preparation. Drop Labels Firstly, we drop labels ...
随机推荐
- 第7课 Thinkphp 5 模板输出变量使用函数 Thinkphp5商城第四季
目录 1. 手册地址: 2. 如果前面输出的变量在后面定义的函数的第一个参数,则可以直接使用 3. 还可以支持多个函数过滤,多个函数之间用"|"分割即可,例如: 4. 变量输出使用 ...
- Python基础——判断和循环
判断 缩进代替大括号. 冒号(:)后换号缩进. if test=100 if test>50: print('OK') print('test') if-elif-else test=50 if ...
- 如何编写自己的C语言头文件
一些初学C语言的人,不知道头文件(*.h文件)原来还可以自己写的.只知道调用系统库函数时,要使用#include语句将某些头文件包含进去.其实,头文件跟.C文件一样,是可以自己写的.头文件是一种文本文 ...
- Manjaro 添加国内源和安装搜狗输入法
Manjaro 系统虽然比 Ubuntu 用着稳定,但有些小地方没有 Ubuntu 人性化,比如默认安装完的系统貌似没有中国的,Ubuntu 估计是用的人多,所以安装完后会根据所在地给你配置更新的源. ...
- bin、hex、elf、axf文件的区别
1.bin Bin文件是最纯粹的二进制机器代码, 或者说是"顺序格式".按照assembly code顺序翻译成binary machine code,内部没有地址标记.Bin是直 ...
- HDU - 1496 Equations (hash)
题意: 多组测试数据. 每组数据有一个方程 a*x1^2 + b*x2^2 + c*x3^2 + d*x4^2 = 0,方程中四个未知数 x1, x2, x3, x4 ∈ [-100, 100], 且 ...
- hiho 1050 树的直径
#1050 : 树中的最长路 时间限制:10000ms 单点时限:1000ms 内存限制:256MB 描述 上回说到,小Ho得到了一棵二叉树玩具,这个玩具是由小球和木棍连接起来的,而在拆拼它的过程中, ...
- Linux学习-函式库管理
动态与静态函式库 首先我们要知道的是,函式库的类型有哪些?依据函式库被使用的类型而分为两大类,分别是静态 (Static) 与动态 (Dynamic) 函式库两类. 静态函式库的特色: 扩展名:(扩展 ...
- SPOJ QTREE4 - Query on a tree IV 树分治
题意: 给出一棵边带权的树,初始树上所有节点都是白色. 有两种操作: C x,改变节点x的颜色,即白变黑,黑变白 A,询问树中最远的两个白色节点的距离,这两个白色节点可以重合(此时距离为0). 分析: ...
- [git 学习篇] 创建公钥
http://riny.net/2014/git-ssh-key/ 1 安装 windows gitbash msysgit是Windows版的Git,从https://git-for-wind ...