Using Pandas Library

The simplest way is to read data from .csv files and store it as a data frame object:

import pandas as pd
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

You can also read .xsl files and directly select the rows and columns you are interested in by setting parameters skiprows, usecols. Also, you can indicate index column by parameter index_col.

energy=pd.read_excel('Energy Indicators.xls', sheet_name='Energy',skiprows=8,usecols='E,G', index_col=None, na_values=['NA'])

For .txt files, you can also use read_csv function by defining the separation symbol:

university_towns=pd.read_csv('university_towns.txt',sep='\n',header=None)

See more about pandas io operations in http://pandas.pydata.org/pandas-docs/stable/io.html

Using os Module

Read .csv files:

import os
import csv
for file in os.listdir("objective_folder"):
with open('objective_folder/'+file, newline='') as csvfile:
rows = csv.reader(csvfile) # read csc file
for row in rows: # print each line in the file
print(row)

Read .xsl files:

import os
import xlrd
for file in os.listdir("objective_folder/"):
data = xlrd.open_workbook('objective_folder/'+file)
table = sheel_1 = data.sheet_by_index(0)#the first sheet in Excel
nrows = table.nrows #row number
for i in range(nrows):
if i == 0: # skip the first row if it defines variable names
continue
row_values = table.row_values(i) #read each row value
print(row_values)

Download from Website Automatically

We can also try to read data directly from url link. This time, the .csv file is compressed as housing.tgz. We need to download the file and then decompress it. So you can write a small function as below to realize it. It is a worthy effort because you can get the most recent data every time you run the function.

 import os
import tarfile
from six.moves import urllib
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
if not os.path.isdir(housing_path):
os.makedirs(housing_path)
tgz_path = os.path.join(housing_path, "housing.tgz")
urllib.request.urlretrieve(housing_url, tgz_path)
housing_tgz = tarfile.open(tgz_path)
housing_tgz.extractall(path=housing_path)
housing_tgz.close()

when you call fetch_housing_data(), it creates a datasets/housing directory in your workspace, downloads the housing.tgz file, and extracts the housing.csv from it in this directory.
Now let’s load the data using Pandas. Once again you should write a small function to load the data:

import pandas as pd
def load_housing_data(housing_path=HOUSING_PATH):
csv_path = os.path.join(housing_path, "housing.csv")
return pd.read_csv(csv_path)

What’s more?

These methods are what I have met so far. In typical environments your data would be available in a relational database (or some other common datastore) and spread across multiple tables/documents/files. To access it, you would first need to get your credentials and access authorizations, and familiarize yourself with the data schema. I will supplement more methods if I encounter in the future.

[Machine Learning with Python] How to get your data?的更多相关文章

  1. 【Machine Learning】Python开发工具:Anaconda+Sublime

    Python开发工具:Anaconda+Sublime 作者:白宁超 2016年12月23日21:24:51 摘要:随着机器学习和深度学习的热潮,各种图书层出不穷.然而多数是基础理论知识介绍,缺乏实现 ...

  2. Python (1) - 7 Steps to Mastering Machine Learning With Python

    Step 1: Basic Python Skills install Anacondaincluding numpy, scikit-learn, and matplotlib Step 2: Fo ...

  3. Getting started with machine learning in Python

    Getting started with machine learning in Python Machine learning is a field that uses algorithms to ...

  4. 《Learning scikit-learn Machine Learning in Python》chapter1

    前言 由于实验原因,准备入坑 python 机器学习,而 python 机器学习常用的包就是 scikit-learn ,准备先了解一下这个工具.在这里搜了有 scikit-learn 关键字的书,找 ...

  5. Machine Learning的Python环境设置

    Machine Learning目前经常使用的语言有Python.R和MATLAB.如果采用Python,需要安装大量的数学相关和Machine Learning的包.一般安装Anaconda,可以把 ...

  6. [Machine Learning with Python] Familiar with Your Data

    Here I list some useful functions in Python to get familiar with your data. As an example, we load a ...

  7. [Machine Learning with Python] My First Data Preprocessing Pipeline with Titanic Dataset

    The Dataset was acquired from https://www.kaggle.com/c/titanic For data preprocessing, I firstly def ...

  8. [Machine Learning with Python] Data Preparation through Transformation Pipeline

    In the former article "Data Preparation by Pandas and Scikit-Learn", we discussed about a ...

  9. [Machine Learning with Python] Data Preparation by Pandas and Scikit-Learn

    In this article, we dicuss some main steps in data preparation. Drop Labels Firstly, we drop labels ...

随机推荐

  1. 地理位置编码geohash学习笔记

    1.geohash及其性质 一种空间索引技术. (1)将二维的经纬度位置数据转换为一维的字符串(基本上hash族的算法都是这样): 其优点在于hash编码后的字符串,可以方便查找和索引,从而减少相似计 ...

  2. vue组件:canvas实现图片涂鸦功能

    方案背景 需求 需要对图片进行标注,导出图片. 需要标注N多图片最后同时保存. 需要根据多边形区域数据(区域.颜色.名称)标注. 对应方案 用canvas实现涂鸦.圆形.矩形的绘制,最终生成图片bas ...

  3. vscode 实时预览 编辑markdown 插件 Markdown Preview Enhanced

    说明地址: https://shd101wyy.github.io/markdown-preview-enhanced/#/zh-cn/?id=markdown-preview-enhanced

  4. CentOS 7.0 使用 yum 安装 MariaDB 及 简单配置

    1.安装MariaDB 安装命令 yum -y install MariaDB-server MariaDB-client 安装完成MariaDB,首先启动MariaDB 设置开机启动 接下来进行Ma ...

  5. CentOS7.2下Hadoop2.7.2的集群搭建

    1.基本环境: 操作系统: Centos 7.2.1511 三台虚机: 192.168.163.224  master 192.168.163.225  node1 192.168.163.226   ...

  6. Aizu - 1378 Secret of Chocolate Poles (DP)

    你有三种盘子,黑薄,白薄,黑厚. 薄的盘子占1,厚的盘子占k. 有一个高度为L的桶,盘子总高度不能超出桶的总高度(可以小于等于).相同颜色的盘子不能挨着放. 问桶内装盘子的方案数. 如 L = 5,k ...

  7. 二叉苹果树——树形Dp(由根到左右子树的转移)

    题意:给出一个二叉树,每条边上有一定的边权,并且剪掉一些树枝,求留下 Q 条树枝的最大边权和. ( 节点数 n ≤100,留下的枝条树 Q ≤ n ,所有边权和 ∑w[i] ≤30000 ) 细节:对 ...

  8. 64位程序调用32DLL解决方案

    最近做一个.NETCore项目,需要调用以前用VB6写的老程序,原本想重写,但由于其调用了大量32DLL,重写后还需要编译为32位才能运行,于是干脆把老代码整个封装为32DLL,然后准备在64位程序中 ...

  9. Netcore 基础之TagHelper知识

    饮水思源,来自:http://www.cnblogs.com/liontone 的BLOG中关于taghelper中的内容 概要 TagHelper是ASP.NET 5的一个新特性.也许在你还没有听说 ...

  10. UTV - URL Tag Validation

    What`s UTV 1.URL Tag Validation 2.Special format of URL for preventing unauthorized usage and access ...