Using Pandas Library

The simplest way is to read the data from a .csv file and store it as a DataFrame object:

    import pandas as pd

    # Skip the first line of the file and use the first column as the index.
    df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

You can also read .xls files and directly select the rows and columns you are interested in by setting the skiprows and usecols parameters. The index_col parameter specifies which column to use as the index.

    energy = pd.read_excel('Energy Indicators.xls', sheet_name='Energy', skiprows=8,
                           usecols='E,G', index_col=None, na_values=['NA'])
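
To see what skiprows, usecols and index_col do without needing a file on disk, here is a minimal sketch that applies them to a small in-memory table. It uses read_csv, which accepts these three parameters as well; the table contents and the variable name medals are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents: one junk line, a header, then three records.
raw = io.StringIO(
    "exported 2019-01-01\n"
    "country,gold,silver,bronze\n"
    "Norway,14,14,11\n"
    "Germany,14,10,7\n"
    "Canada,11,8,10\n"
)

medals = pd.read_csv(
    raw,
    skiprows=1,                   # drop the junk line above the header
    usecols=["country", "gold"],  # keep only these two columns
    index_col="country",          # use the country name as the row index
)
print(medals)
```

The result has one data column (gold) and is indexed by country, so a single value can be looked up with medals.loc["Norway", "gold"].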

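For .txt files, you can also use read_csv by specifying the separator. To load each line as one row of a single column, choose a separator that never occurs in the file, since recent pandas versions raise an error for sep='\n'. The example assumes the file contains no tab characters:

    university_towns = pd.read_csv('university_towns.txt', sep='\t', header=None)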
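
As a self-contained illustration of the separator argument, the sketch below parses a small pipe-delimited text held in memory. The text, the delimiter and the column names are assumptions chosen for the example, not part of any real data file.

```python
import io
import pandas as pd

# Made-up pipe-delimited text standing in for a .txt file on disk.
raw = io.StringIO(
    "Alabama|Auburn|Auburn University\n"
    "Alaska|Fairbanks|University of Alaska Fairbanks\n"
)

towns = pd.read_csv(
    raw,
    sep="|",                                # the delimiter used in the text
    header=None,                            # the text has no header row
    names=["state", "town", "university"],  # hypothetical column names
)
print(towns)
```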

For more on pandas I/O operations, see http://pandas.pydata.org/pandas-docs/stable/io.html

Using os Module

Read .csv files:

    import os
    import csv

    for file in os.listdir('objective_folder'):
        with open(os.path.join('objective_folder', file), newline='') as csvfile:
            rows = csv.reader(csvfile)  # read the csv file
            for row in rows:  # print each line in the file
                print(row)
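
A loop over os.listdir visits every entry in the folder, including files that are not CSV files. The sketch below is one possible way to filter by extension and collect the rows instead of printing them. The function name read_csv_folder and the throwaway folder it is tried on are assumptions made for the example.

```python
import csv
import os
import tempfile

def read_csv_folder(folder):
    """Return {file name: list of rows} for every .csv file in `folder`."""
    tables = {}
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".csv"):
            continue  # ignore anything that is not a CSV file
        with open(os.path.join(folder, name), newline="") as csvfile:
            tables[name] = list(csv.reader(csvfile))
    return tables

# Build a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "a.csv"), "w", newline="") as f:
        csv.writer(f).writerows([["x", "y"], ["1", "2"]])
    with open(os.path.join(folder, "notes.txt"), "w") as f:
        f.write("not a table\n")
    tables = read_csv_folder(folder)

print(tables)
```

Note that csv.reader returns every field as a string, so numeric columns still need to be converted afterwards.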

Read .xls files:

    import os
    import xlrd

    for file in os.listdir('objective_folder'):
        data = xlrd.open_workbook(os.path.join('objective_folder', file))
        table = data.sheet_by_index(0)  # the first sheet in the workbook
        nrows = table.nrows  # number of rows
        for i in range(nrows):
            if i == 0:  # skip the first row if it holds the variable names
                continue
            row_values = table.row_values(i)  # read the values of each row
            print(row_values)

Download from Website Automatically

We can also read data directly from a URL. In this example, the .csv file is compressed as housing.tgz, so we need to download the archive and then decompress it. A small function like the one below does both. It is worth the effort because you get the most recent data every time you run the function.

    import os
    import tarfile
    import urllib.request

    DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
    HOUSING_PATH = "datasets/housing"
    HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

    def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
        # Create the target directory if it does not exist yet.
        if not os.path.isdir(housing_path):
            os.makedirs(housing_path)
        tgz_path = os.path.join(housing_path, "housing.tgz")
        # Download the archive, then extract it into the same directory.
        urllib.request.urlretrieve(housing_url, tgz_path)
        housing_tgz = tarfile.open(tgz_path)
        housing_tgz.extractall(path=housing_path)
        housing_tgz.close()

When you call fetch_housing_data(), it creates a datasets/housing directory in your workspace, downloads the housing.tgz file, and extracts housing.csv from it into this directory.
Now let’s load the data using pandas. Once again, you should write a small function to load the data:

    import pandas as pd

    def load_housing_data(housing_path=HOUSING_PATH):
        csv_path = os.path.join(housing_path, "housing.csv")
        return pd.read_csv(csv_path)
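
The extract-then-load steps can be tried without a network connection. The sketch below builds a tiny stand-in archive in a temporary directory, extracts it, and loads the CSV with pandas. The helper name extract_and_load and the two data rows are assumptions for illustration only; the real archive is the one downloaded by fetch_housing_data().

```python
import io
import os
import tarfile
import tempfile

import pandas as pd

def extract_and_load(tgz_path, target_dir, csv_name="housing.csv"):
    """Extract `tgz_path` into `target_dir` and load `csv_name` from it."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(tgz_path) as archive:
        archive.extractall(path=target_dir)
    return pd.read_csv(os.path.join(target_dir, csv_name))

with tempfile.TemporaryDirectory() as workdir:
    # Build a small stand-in archive; the rows are illustrative values.
    payload = (
        b"longitude,latitude,median_house_value\n"
        b"-122.23,37.88,452600\n"
        b"-122.22,37.86,358500\n"
    )
    tgz_path = os.path.join(workdir, "housing.tgz")
    with tarfile.open(tgz_path, "w:gz") as archive:
        info = tarfile.TarInfo(name="housing.csv")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

    target = os.path.join(workdir, "datasets", "housing")
    housing = extract_and_load(tgz_path, target)

print(housing.shape)
```

Opening the archive in a with block closes it even if extraction fails, which is the main difference from calling close() by hand.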

What’s more?

These are the methods I have come across so far. In typical environments, your data would be available in a relational database (or some other common datastore) and spread across multiple tables, documents, or files. To access it, you would first need to get your credentials and access authorizations and familiarize yourself with the data schema. I will add more methods as I encounter them.
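
As a rough sketch of the relational-database case, the example below uses an in-memory SQLite database as a stand-in for a real server, so no credentials are needed. The table names, columns and values are all made up. It joins two tables and loads the result into a DataFrame with pandas' read_sql_query.

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database stands in for a real datastore.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE districts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE prices (district_id INTEGER, median_value REAL);
    INSERT INTO districts VALUES (1, 'North'), (2, 'South');
    INSERT INTO prices VALUES (1, 250000.0), (2, 180000.0);
""")

# The data is spread across two tables, so join them in the query.
query = """
    SELECT d.name, p.median_value
    FROM districts AS d
    JOIN prices AS p ON p.district_id = d.id
    ORDER BY d.id
"""
df = pd.read_sql_query(query, conn)
conn.close()
print(df)
```

With a production database, the connection object would come from that database's own driver, and the schema would dictate the join.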
