Using Pandas Library

The simplest way is to read data from .csv files and store it as a data frame object:

  1. import pandas as pd
  2. df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

You can also read .xsl files and directly select the rows and columns you are interested in by setting parameters skiprows, usecols. Also, you can indicate index column by parameter index_col.

  1. energy=pd.read_excel('Energy Indicators.xls', sheet_name='Energy',skiprows=8,usecols='E,G', index_col=None, na_values=['NA'])

For .txt files, you can also use read_csv function by defining the separation symbol:

  1. university_towns=pd.read_csv('university_towns.txt',sep='\n',header=None)

See more about pandas io operations in http://pandas.pydata.org/pandas-docs/stable/io.html

Using os Module

Read .csv files:

  1. import os
  2. import csv
  3. for file in os.listdir("objective_folder"):
  4. with open('objective_folder/'+file, newline='') as csvfile:
  5. rows = csv.reader(csvfile) # read csc file
  6. for row in rows: # print each line in the file
  7. print(row)

Read .xsl files:

  1. import os
  2. import xlrd
  3. for file in os.listdir("objective_folder/"):
  4. data = xlrd.open_workbook('objective_folder/'+file)
  5. table = sheel_1 = data.sheet_by_index(0)#the first sheet in Excel
  6. nrows = table.nrows #row number
  7. for i in range(nrows):
  8. if i == 0: # skip the first row if it defines variable names
  9. continue
  10. row_values = table.row_values(i) #read each row value
  11. print(row_values)

Download from Website Automatically

We can also try to read data directly from url link. This time, the .csv file is compressed as housing.tgz. We need to download the file and then decompress it. So you can write a small function as below to realize it. It is a worthy effort because you can get the most recent data every time you run the function.

  1. import os
  2. import tarfile
  3. from six.moves import urllib
  4. DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
  5. HOUSING_PATH = "datasets/housing"
  6. HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"
  7. def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
  8. if not os.path.isdir(housing_path):
  9. os.makedirs(housing_path)
  10. tgz_path = os.path.join(housing_path, "housing.tgz")
  11. urllib.request.urlretrieve(housing_url, tgz_path)
  12. housing_tgz = tarfile.open(tgz_path)
  13. housing_tgz.extractall(path=housing_path)
  14. housing_tgz.close()

when you call fetch_housing_data(), it creates a datasets/housing directory in your workspace, downloads the housing.tgz file, and extracts the housing.csv from it in this directory.
Now let’s load the data using Pandas. Once again you should write a small function to load the data:

  1. import pandas as pd
  2. def load_housing_data(housing_path=HOUSING_PATH):
  3. csv_path = os.path.join(housing_path, "housing.csv")
  4. return pd.read_csv(csv_path)

What’s more?

These methods are what I have met so far. In typical environments your data would be available in a relational database (or some other common datastore) and spread across multiple tables/documents/files. To access it, you would first need to get your credentials and access authorizations, and familiarize yourself with the data schema. I will supplement more methods if I encounter in the future.

[Machine Learning with Python] How to get your data?的更多相关文章

  1. 【Machine Learning】Python开发工具:Anaconda+Sublime

    Python开发工具:Anaconda+Sublime 作者:白宁超 2016年12月23日21:24:51 摘要:随着机器学习和深度学习的热潮,各种图书层出不穷.然而多数是基础理论知识介绍,缺乏实现 ...

  2. Python (1) - 7 Steps to Mastering Machine Learning With Python

    Step 1: Basic Python Skills install Anacondaincluding numpy, scikit-learn, and matplotlib Step 2: Fo ...

  3. Getting started with machine learning in Python

    Getting started with machine learning in Python Machine learning is a field that uses algorithms to ...

  4. 《Learning scikit-learn Machine Learning in Python》chapter1

    前言 由于实验原因,准备入坑 python 机器学习,而 python 机器学习常用的包就是 scikit-learn ,准备先了解一下这个工具.在这里搜了有 scikit-learn 关键字的书,找 ...

  5. Machine Learning的Python环境设置

    Machine Learning目前经常使用的语言有Python.R和MATLAB.如果采用Python,需要安装大量的数学相关和Machine Learning的包.一般安装Anaconda,可以把 ...

  6. [Machine Learning with Python] Familiar with Your Data

    Here I list some useful functions in Python to get familiar with your data. As an example, we load a ...

  7. [Machine Learning with Python] My First Data Preprocessing Pipeline with Titanic Dataset

    The Dataset was acquired from https://www.kaggle.com/c/titanic For data preprocessing, I firstly def ...

  8. [Machine Learning with Python] Data Preparation through Transformation Pipeline

    In the former article "Data Preparation by Pandas and Scikit-Learn", we discussed about a ...

  9. [Machine Learning with Python] Data Preparation by Pandas and Scikit-Learn

    In this article, we dicuss some main steps in data preparation. Drop Labels Firstly, we drop labels ...

随机推荐

  1. 【全面】Linux基础知识和基本操作语句大全(一)

    接触Linux已经有一段时间了,由于实际需要,三三两两地掌握了一些基本语法和实用语句,主要都是在日常开发中用得比较多的,条理不是特别清晰,请见谅!下面开始上硬货!! 基本操作: 关闭Linux系统的命 ...

  2. Python模块(二)(序列化)

    1. namedtuple 命名元组->类似创建了一个类 from collections import namedtuple p = namedtuple("Point", ...

  3. Mysql数据库查询语法详解

    数据库的完整查询语法 在平常的工作中经常需要与数据库打交道 , 虽然大多时间都是简单的查询抑或使用框架封装好的ORM的查询方法 , 但是还是要对数据库的完整查询语法做一个加深理解 数据库完整查询语法框 ...

  4. list_add_tail()双向链表实现分析

    struct list_head { struct list_head *next, *prev; }; list_add_tail(&buf->vb.queue, &vid-& ...

  5. python基础学习笔记——运算符

    计算机可以进行的运算有很多种,可不只加减乘除这么简单,运算按种类可分为算数运算.比较运算.逻辑运算.赋值运算.成员运算.身份运算.位运算,今天我们暂只学习算数运算.比较运算.逻辑运算.赋值运算 算数运 ...

  6. PYday16&17-设计模式\选课系统习题

    1.设计模式:对程序做整体得规划设计,这样做是为了更好的实现功能,使代码的可扩展性更好有27种常见的设计模式.流行的设计模式参考书:GoF设计模式.大话设计模式设计模式是为了更好的实现模块间的解耦,便 ...

  7. luogu3369 【模板】普通平衡树(Treap/SBT) treap splay

    treap做法,参考hzwer的博客 #include <iostream> #include <cstdlib> #include <cstdio> using ...

  8. Dijkstra算法_北京地铁换乘_android实现-附带源码.apk

    Dijkstra算法_北京地铁换乘_android实现   android 2.2+ 源码下载    apk下载 直接上图片 如下: Dijkstra(迪杰斯特拉)算法是典型的最短路径路由算法,用于计 ...

  9. Python学习-day6 面向对象概念

    开始学习面向对象,可以说之前的学习和编程思路都是面向过程的,从上到下,一步一步走完. 如果说一个简单的需求,用面向过程实现起来相对容易,但是如果在日常生产,面向对象就可以发挥出他的优势了. 程序的可扩 ...

  10. 在Notepad++里配置python环境

    首先在语言里选择Python 然后点击运行,在弹出的对话框里输入: cmd /k cd /d "$(CURRENT_DIRECTORY)" &  python " ...