Using Pandas Library

The simplest way is to read data from a .csv file and store it in a DataFrame object:

import pandas as pd
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
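
To see what skiprows and index_col do without needing olympics.csv, here is a minimal sketch that reads from an in-memory string. The sample content and the name demo_df are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a real file: the first line is a
# title row we want to skip, the second line holds the column names.
raw = io.StringIO(
    "Medal table (illustrative data)\n"
    "country,gold,silver\n"
    "A,3,1\n"
    "B,0,2\n"
)

# skiprows=1 drops the title line; index_col=0 uses the first column as the index.
demo_df = pd.read_csv(raw, index_col=0, skiprows=1)
print(demo_df)
```

After the call, the country column serves as the row index and the remaining columns hold the values.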

You can also read .xls files and select only the rows and columns you are interested in by setting the skiprows and usecols parameters. The index_col parameter specifies which column to use as the index.

energy = pd.read_excel('Energy Indicators.xls', sheet_name='Energy', skiprows=8, usecols='E,G', index_col=None, na_values=['NA'])
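
The usecols and na_values keywords also exist on read_csv, so their effect can be tried without an Excel file. The sketch below is an assumption-laden stand-in: the data is invented, and read_csv takes column names in usecols rather than Excel column letters such as 'E,G'.

```python
import io
import pandas as pd

# Invented sample data; "unknown" plays the role of a custom missing-value marker.
raw = io.StringIO(
    "country,code,supply,note\n"
    "A,AA,10,x\n"
    "B,BB,unknown,y\n"
)

# Keep only two columns and treat the marker "unknown" as NaN.
subset = pd.read_csv(raw, usecols=["country", "supply"], na_values=["unknown"])
print(subset)
```

Because the marker is converted to NaN, the supply column is parsed as numeric instead of text.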

For .txt files, you can also use the read_csv function by specifying the separator:

university_towns = pd.read_csv('university_towns.txt', sep='\n', header=None)
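
Depending on the pandas version, '\n' may be rejected as a separator. A possible fallback, sketched here with a hypothetical helper named read_lines_as_frame, is to read the lines with plain Python and wrap them in a one-column DataFrame:

```python
import pandas as pd

def read_lines_as_frame(path):
    """Return a one-column DataFrame with one row per line of the text file."""
    with open(path, encoding="utf-8") as fh:
        # Strip only the trailing newline so the rest of each line is preserved.
        lines = [line.rstrip("\n") for line in fh]
    return pd.DataFrame({0: lines})
```

Calling read_lines_as_frame('university_towns.txt') should give the same shape of result as the read_csv call above: a single column labeled 0.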

See more about pandas I/O operations at http://pandas.pydata.org/pandas-docs/stable/io.html

Using os Module

Read .csv files:

import os
import csv

for file in os.listdir("objective_folder"):
    # newline='' lets the csv module handle line endings itself
    with open(os.path.join("objective_folder", file), newline='') as csvfile:
        rows = csv.reader(csvfile)  # read the csv file
        for row in rows:  # print each line in the file
            print(row)
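
The loop above opens every entry in the folder, whatever its extension. A small variation, sketched here with a hypothetical helper named read_csv_folder, keeps only *.csv files and uses csv.DictReader so that each row is keyed by the header names:

```python
import csv
from pathlib import Path

def read_csv_folder(folder):
    """Map each .csv file name in `folder` to its rows as a list of dicts."""
    tables = {}
    # glob("*.csv") skips files with other extensions and subdirectories.
    for path in sorted(Path(folder).glob("*.csv")):
        with open(path, newline="", encoding="utf-8") as csvfile:
            # DictReader treats the first line as the header row.
            tables[path.name] = list(csv.DictReader(csvfile))
    return tables
```

All values come back as strings, so numeric columns still need converting afterwards.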

Read .xls files:

import os
import xlrd  # note: xlrd 2.x reads only legacy .xls files

for file in os.listdir("objective_folder"):
    data = xlrd.open_workbook(os.path.join("objective_folder", file))
    table = data.sheet_by_index(0)  # the first sheet in the workbook
    nrows = table.nrows  # number of rows
    for i in range(nrows):
        if i == 0:  # skip the first row if it defines variable names
            continue
        row_values = table.row_values(i)  # read the values of row i
        print(row_values)

Download from Website Automatically

We can also read data directly from a URL. In this example, the .csv file is compressed as housing.tgz, so we need to download the file and then decompress it. A small function like the one below does both. It is worth the effort because you get the most recent data every time you run the function.

import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # create the target directory if it does not exist yet
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)  # download the archive
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)  # unpack housing.csv
    housing_tgz.close()

When you call fetch_housing_data(), it creates a datasets/housing directory in your workspace, downloads the housing.tgz file, and extracts housing.csv from it into this directory.
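
To try the extraction step without a network connection, the following sketch builds a tiny throwaway archive locally and unpacks it. The helper name extract_tgz and the file sample.csv are hypothetical and only serve the illustration.

```python
import os
import tarfile
import tempfile

def extract_tgz(tgz_path, target_dir):
    """Extract a gzip-compressed tar archive into target_dir and list its files."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(tgz_path, "r:gz") as archive:
        archive.extractall(path=target_dir)
    return sorted(os.listdir(target_dir))

# Build a small archive in a temporary directory so the sketch runs offline.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    with open(csv_path, "w", encoding="utf-8") as fh:
        fh.write("a,b\n1,2\n")
    tgz_path = os.path.join(tmp, "sample.tgz")
    with tarfile.open(tgz_path, "w:gz") as archive:
        archive.add(csv_path, arcname="sample.csv")
    extracted = extract_tgz(tgz_path, os.path.join(tmp, "out"))
    print(extracted)
```

Opening the archive in a with block closes it even if extraction fails partway through.
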
Now let’s load the data using pandas. Once again, a small function does the job:

import pandas as pd

# relies on os and HOUSING_PATH from the download code above
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
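
If the download step has not been run yet, the load fails with a generic file error. One possible refinement, sketched with a hypothetical function named load_csv_checked, is to check for the file first and raise a more helpful message:

```python
import os
import pandas as pd

def load_csv_checked(folder, filename="housing.csv"):
    """Load folder/filename as a DataFrame, failing early if the file is absent."""
    csv_path = os.path.join(folder, filename)
    if not os.path.isfile(csv_path):
        raise FileNotFoundError(f"{csv_path} not found; run the download step first")
    return pd.read_csv(csv_path)
```

With the defaults used earlier, load_csv_checked("datasets/housing") would load the extracted housing.csv.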

What’s more?

These are the methods I have come across so far. In typical environments, your data would be available in a relational database (or some other common datastore) and spread across multiple tables, documents, or files. To access it, you would first need to get your credentials and access authorizations and familiarize yourself with the data schema. I will add more methods as I encounter them.
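
As a rough illustration of the relational-database case, the sketch below uses an in-memory SQLite database in place of a real server, so no credentials are needed. The table names, columns, and values are invented; a production database would need its own driver and connection details.

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database stands in for a real datastore.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (id INTEGER PRIMARY KEY, district TEXT)")
conn.execute("CREATE TABLE prices (house_id INTEGER, price REAL)")
conn.executemany("INSERT INTO houses VALUES (?, ?)", [(1, "north"), (2, "south")])
conn.executemany("INSERT INTO prices VALUES (?, ?)", [(1, 100.0), (2, 150.0)])

# The data is spread across two tables, so join them in the query.
query = """
    SELECT h.id, h.district, p.price
    FROM houses AS h
    JOIN prices AS p ON p.house_id = h.id
    ORDER BY h.id
"""
joined = pd.read_sql_query(query, conn)
conn.close()
print(joined)
```

The join happens inside the database, and pandas receives a single table that is ready for analysis.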
