In this article, we dicuss some main steps in data preparation.

Drop Labels

Firstly, we drop labels for train set. Here we use drop() method in Pandas library.

housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

Here are some tips:

  • The drop funtion deletes rows by default. If you want to delete columns, don't forget to set the parameter axis=1.
  • The drop function doesn't change the DataFrame by default.  And instead, returns to you a copy of the DataFrame with the given rows/columns removed. Or you can set inplace = True.
  • Note the function copy() here. It creates a copy that will not affect the original DataFrame

Impute Missing Values

Firstly, let's check the missing values:

sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()

Here give three methods to impute missing values:

Option 1: drop the rows

sample_incomplete_rows.dropna(subset=["total_bedrooms"])

Option 2: drop the columns

sample_incomplete_rows.drop("total_bedrooms", axis=1) 

Option 3: impute with the median value

median = housing["total_bedrooms"].median()
sample_incomplete_rows["total_bedrooms"].fillna(median, inplace=True)

Alternatively, we can import sklearn.impute.SimpleImputer class in Scikit-Learn 0.20.

 try:
from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+
except ImportError:
from sklearn.preprocessing import Imputer as SimpleImputer imputer = SimpleImputer(strategy="median")
# Remove the text attribute because median can only be calculated on numerical attributes
housing_num = housing.drop('ocean_proximity', axis=1)
# alternatively: housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(housing_num)

We can check the statistcs by imputer.statistics_ and the strategy by imputer.strategy

Finally, transform the train set:

 X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
index = list(housing.index.values))

Encode Categorical Attributes

We need to convert text labels to numbers. There are two methods.

Option 1: Label Encoding

Conver a categorical attribute into an interger attribute.

 try:
from sklearn.preprocessing import OrdinalEncoder
except ImportError:
from future_encoders import OrdinalEncoder # Scikit-Learn < 0.20 ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

Option2: One-Hot Encoding

Convert a categorical attribute into a series of binary intergers.

 try:
from sklearn.preprocessing import OrdinalEncoder # just to raise an ImportError if Scikit-Learn < 0.20
from sklearn.preprocessing import OneHotEncoder
except ImportError:
from future_encoders import OneHotEncoder # Scikit-Learn < 0.20 cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

By default, the OneHotEncoder class returns a sparse array, but we can convert it to a dense array if needed by calling the toarray()method:

housing_cat_1hot.toarray()

Alternatively, you can set sparse=False when creating the OneHotEncoder:

cat_encoder = OneHotEncoder(sparse=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

Feature Engineering

Sometimes, we need to add some features to better describe the variation of the target variable. Let's create a custom transformer to add extra attributes and implement three methods: fit()(returning self), transform(), and fit_transform(). You can get the last one for free by simply adding TransformerMixin as a base class. Also, if you add BaseEstima tor as a base class (and avoid *args and **kargs in your constructor) you will get two extra methods (get_params() and set_params()) that will be useful for auto‐ matic hyperparameter tuning.

 from sklearn.base import BaseEstimator, TransformerMixin

 # column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6 class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
self.add_bedrooms_per_room = add_bedrooms_per_room
def fit(self, X, y=None):
return self # nothing else to do
def transform(self, X, y=None):
rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
population_per_household = X[:, population_ix] / X[:, household_ix]
if self.add_bedrooms_per_room:
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
return np.c_[X, rooms_per_household, population_per_household,
bedrooms_per_room]
else:
return np.c_[X, rooms_per_household, population_per_household] attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

[Machine Learning with Python] Data Preparation by Pandas and Scikit-Learn的更多相关文章

  1. [Machine Learning with Python] Data Preparation through Transformation Pipeline

    In the former article "Data Preparation by Pandas and Scikit-Learn", we discussed about a ...

  2. [Machine Learning with Python] Data Visualization by Matplotlib Library

    Before you can plot anything, you need to specify which backend Matplotlib should use. The simplest ...

  3. Python (1) - 7 Steps to Mastering Machine Learning With Python

    Step 1: Basic Python Skills install Anacondaincluding numpy, scikit-learn, and matplotlib Step 2: Fo ...

  4. Getting started with machine learning in Python

    Getting started with machine learning in Python Machine learning is a field that uses algorithms to ...

  5. 《Learning scikit-learn Machine Learning in Python》chapter1

    前言 由于实验原因,准备入坑 python 机器学习,而 python 机器学习常用的包就是 scikit-learn ,准备先了解一下这个工具.在这里搜了有 scikit-learn 关键字的书,找 ...

  6. 【Machine Learning】Python开发工具:Anaconda+Sublime

    Python开发工具:Anaconda+Sublime 作者:白宁超 2016年12月23日21:24:51 摘要:随着机器学习和深度学习的热潮,各种图书层出不穷.然而多数是基础理论知识介绍,缺乏实现 ...

  7. In machine learning, is more data always better than better algorithms?

    In machine learning, is more data always better than better algorithms? No. There are times when mor ...

  8. Coursera, Big Data 4, Machine Learning With Big Data (week 1/2)

    Week 1 Machine Learning with Big Data KNime - GUI based Spark MLlib - inside Spark CRISP-DM Week 2, ...

  9. Machine Learning的Python环境设置

    Machine Learning目前经常使用的语言有Python.R和MATLAB.如果采用Python,需要安装大量的数学相关和Machine Learning的包.一般安装Anaconda,可以把 ...

随机推荐

  1. 51nod_1445 变色DNA 最短路模板 奇妙思维

    这是一道最短路模板题,但是在理解题意和提出模型的阶段比较考验思维,很容易想到并且深深进入暴力拆解题目的无底洞当中. 题意是说:给出一个邻接矩阵,在每个点时,走且仅走向,合法路径中编号最小的点.问题是是 ...

  2. VBA连接MySQL数据库以及ODBC的配置(ODBC版本和MySQL版本如果不匹配会出现驱动和应用程序的错误)

    db_connected = False '获取数据库连接设置dsn_name = Trim(Worksheets("加载策略").Cells(2, 5).Value)  ---- ...

  3. 几条 ffmpeg 的命令

    1,获取视频的信息   ffmpeg -i video.avi 2,将图片序列合成视频   ffmpeg -f image2 -i image%d.jpg video.mpg   上面的命令会把当前目 ...

  4. DEDE调用指定文章ID来调用特定文档

    http://www.jb51.net/cms/137423.html 代码如下: {dede:arclist row=1 idlist='6'} <li><a href=" ...

  5. 第一次接触php

    一.什么是PHP PHP的中文意思:超文本预处理器,英文名字: HyperText Preprocessor. PHP通常有两层含义: (1)PHP是一个编程语言. (2)PHP是处理PHP编程语言的 ...

  6. 对于xss等有关的html,url,unicode编码做的一个小总结。

    参考:http://bobao.360.cn/learning/detail/292.html,算是对前部分作一个总结性的学习. 1<a href="%6a%61%76%61%73%6 ...

  7. linux/mac下的配置自定义命令alias

    linux/mac下的自定义命令alias,并保存别名使其永久生效(重启不会失效) 在做开发每次提交代码的命令都是一长串参数,不想去记,于是可以使用alias命令来解决这个问题:alias aComm ...

  8. HDU5726 GCD

    Give you a sequence of N(N≤100,000)N(N≤100,000) integers : a1,...,an(0<ai≤1000,000,000)a1,...,an( ...

  9. c++ 中的slipt实现

    来自 http://www.cnblogs.com/dfcao/p/cpp-FAQ-split.html http://blog.diveinedu.com/%E4%B8%89%E7%A7%8D%E5 ...

  10. csapp 深入理解计算机系统 csapp.h csapp.c文件配置

    转载自   http://condor.depaul.edu/glancast/374class/docs/csapp_compile_guide.html Compiling with the CS ...