[Machine Learning with Python] How to get your data?
Using Pandas Library
The simplest way is to read data from .csv
files and store it as a data frame object:
import pandas as pd
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
You can also read .xsl
files and directly select the rows and columns you are interested in by setting parameters skiprows
, usecols
. Also, you can indicate index column by parameter index_col
.
energy=pd.read_excel('Energy Indicators.xls', sheet_name='Energy',skiprows=8,usecols='E,G', index_col=None, na_values=['NA'])
For .txt
files, you can also use read_csv
function by defining the separation symbol:
university_towns=pd.read_csv('university_towns.txt',sep='\n',header=None)
See more about pandas io operations in http://pandas.pydata.org/pandas-docs/stable/io.html
Using os Module
Read .csv
files:
import os
import csv
for file in os.listdir("objective_folder"):
with open('objective_folder/'+file, newline='') as csvfile:
rows = csv.reader(csvfile) # read csc file
for row in rows: # print each line in the file
print(row)
Read .xsl
files:
import os
import xlrd
for file in os.listdir("objective_folder/"):
data = xlrd.open_workbook('objective_folder/'+file)
table = sheel_1 = data.sheet_by_index(0)#the first sheet in Excel
nrows = table.nrows #row number
for i in range(nrows):
if i == 0: # skip the first row if it defines variable names
continue
row_values = table.row_values(i) #read each row value
print(row_values)
Download from Website Automatically
We can also try to read data directly from url link. This time, the .csv
file is compressed as housing.tgz
. We need to download the file and then decompress it. So you can write a small function as below to realize it. It is a worthy effort because you can get the most recent data every time you run the function.
- import os
- import tarfile
- from six.moves import urllib
- DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
- HOUSING_PATH = "datasets/housing"
- HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"
- def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
- if not os.path.isdir(housing_path):
- os.makedirs(housing_path)
- tgz_path = os.path.join(housing_path, "housing.tgz")
- urllib.request.urlretrieve(housing_url, tgz_path)
- housing_tgz = tarfile.open(tgz_path)
- housing_tgz.extractall(path=housing_path)
- housing_tgz.close()
when you call fetch_housing_data()
, it creates a datasets/housing
directory in your workspace, downloads the housing.tgz
file, and extracts the housing.csv
from it in this directory.
Now let’s load the data using Pandas. Once again you should write a small function to load the data:
- import pandas as pd
- def load_housing_data(housing_path=HOUSING_PATH):
- csv_path = os.path.join(housing_path, "housing.csv")
- return pd.read_csv(csv_path)
What’s more?
These methods are what I have met so far. In typical environments your data would be available in a relational database (or some other common datastore) and spread across multiple tables/documents/files. To access it, you would first need to get your credentials and access authorizations, and familiarize yourself with the data schema. I will supplement more methods if I encounter in the future.
[Machine Learning with Python] How to get your data?的更多相关文章
- 【Machine Learning】Python开发工具:Anaconda+Sublime
Python开发工具:Anaconda+Sublime 作者:白宁超 2016年12月23日21:24:51 摘要:随着机器学习和深度学习的热潮,各种图书层出不穷.然而多数是基础理论知识介绍,缺乏实现 ...
- Python (1) - 7 Steps to Mastering Machine Learning With Python
Step 1: Basic Python Skills install Anacondaincluding numpy, scikit-learn, and matplotlib Step 2: Fo ...
- Getting started with machine learning in Python
Getting started with machine learning in Python Machine learning is a field that uses algorithms to ...
- 《Learning scikit-learn Machine Learning in Python》chapter1
前言 由于实验原因,准备入坑 python 机器学习,而 python 机器学习常用的包就是 scikit-learn ,准备先了解一下这个工具.在这里搜了有 scikit-learn 关键字的书,找 ...
- Machine Learning的Python环境设置
Machine Learning目前经常使用的语言有Python.R和MATLAB.如果采用Python,需要安装大量的数学相关和Machine Learning的包.一般安装Anaconda,可以把 ...
- [Machine Learning with Python] Familiar with Your Data
Here I list some useful functions in Python to get familiar with your data. As an example, we load a ...
- [Machine Learning with Python] My First Data Preprocessing Pipeline with Titanic Dataset
The Dataset was acquired from https://www.kaggle.com/c/titanic For data preprocessing, I firstly def ...
- [Machine Learning with Python] Data Preparation through Transformation Pipeline
In the former article "Data Preparation by Pandas and Scikit-Learn", we discussed about a ...
- [Machine Learning with Python] Data Preparation by Pandas and Scikit-Learn
In this article, we dicuss some main steps in data preparation. Drop Labels Firstly, we drop labels ...
随机推荐
- 【全面】Linux基础知识和基本操作语句大全(一)
接触Linux已经有一段时间了,由于实际需要,三三两两地掌握了一些基本语法和实用语句,主要都是在日常开发中用得比较多的,条理不是特别清晰,请见谅!下面开始上硬货!! 基本操作: 关闭Linux系统的命 ...
- Python模块(二)(序列化)
1. namedtuple 命名元组->类似创建了一个类 from collections import namedtuple p = namedtuple("Point", ...
- Mysql数据库查询语法详解
数据库的完整查询语法 在平常的工作中经常需要与数据库打交道 , 虽然大多时间都是简单的查询抑或使用框架封装好的ORM的查询方法 , 但是还是要对数据库的完整查询语法做一个加深理解 数据库完整查询语法框 ...
- list_add_tail()双向链表实现分析
struct list_head { struct list_head *next, *prev; }; list_add_tail(&buf->vb.queue, &vid-& ...
- python基础学习笔记——运算符
计算机可以进行的运算有很多种,可不只加减乘除这么简单,运算按种类可分为算数运算.比较运算.逻辑运算.赋值运算.成员运算.身份运算.位运算,今天我们暂只学习算数运算.比较运算.逻辑运算.赋值运算 算数运 ...
- PYday16&17-设计模式\选课系统习题
1.设计模式:对程序做整体得规划设计,这样做是为了更好的实现功能,使代码的可扩展性更好有27种常见的设计模式.流行的设计模式参考书:GoF设计模式.大话设计模式设计模式是为了更好的实现模块间的解耦,便 ...
- luogu3369 【模板】普通平衡树(Treap/SBT) treap splay
treap做法,参考hzwer的博客 #include <iostream> #include <cstdlib> #include <cstdio> using ...
- Dijkstra算法_北京地铁换乘_android实现-附带源码.apk
Dijkstra算法_北京地铁换乘_android实现 android 2.2+ 源码下载 apk下载 直接上图片 如下: Dijkstra(迪杰斯特拉)算法是典型的最短路径路由算法,用于计 ...
- Python学习-day6 面向对象概念
开始学习面向对象,可以说之前的学习和编程思路都是面向过程的,从上到下,一步一步走完. 如果说一个简单的需求,用面向过程实现起来相对容易,但是如果在日常生产,面向对象就可以发挥出他的优势了. 程序的可扩 ...
- 在Notepad++里配置python环境
首先在语言里选择Python 然后点击运行,在弹出的对话框里输入: cmd /k cd /d "$(CURRENT_DIRECTORY)" & python " ...