http://kukuruku.co/hub/python/introduction-to-machine-learning-with-python-andscikit-learn

Hello, %username%!

My name is Alex. I work on machine learning and web graph analysis (mostly the theory), and I also develop Big Data products for one of the mobile operators in Russia. This is my first post, so please don’t judge it too harshly.

Nowadays, a lot of people want to develop efficient algorithms and take part in machine learning competitions, so they come to me and ask, “Where do I start?” Some time ago, I led the development of Big Data tools for the analysis of media and social networks in one of the institutions of the Government of the Russian Federation. I still have some of the documentation my team used, and I’d like to share it with you. It assumes that the reader has a good knowledge of mathematics and machine learning (my team consisted mostly of graduates of MIPT (the Moscow Institute of Physics and Technology) and the School of Data Analysis).

In effect, it is an introduction to Data Science, a field that has become quite popular recently. Machine learning competitions are held more and more often (for example, on Kaggle and TunedIT), and their prize budgets are often considerable.

The most common tools for a Data Scientist today are R and Python. Each has its pros and cons, but lately Python has been winning in all respects (this is just my opinion, and I use both). This came about with the appearance of Scikit-Learn, a very well documented library that contains a great number of machine learning algorithms.

Please note that this article focuses on machine learning algorithms. Primary data analysis is usually better done with the Pandas package, which is simple enough to pick up on your own. So, let’s focus on implementation. For definiteness, we assume that the input is an object-feature matrix stored in a *.csv file.

Data Loading

First of all, the data should be loaded into memory so that we can work with it. The Scikit-Learn library uses NumPy arrays in its implementation, so we will use NumPy to load *.csv files. Let’s download one of the datasets from the UCI Machine Learning Repository.

import numpy as np
from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen
# url with dataset
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# download the file
raw_data = urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
# separate the data from the target attributes:
# columns 0-7 are the eight features, column 8 is the class label
X = dataset[:,0:8]
y = dataset[:,8]

We will work with this dataset in all the examples, namely with the object-feature matrix X and the values of the target variable y.
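
If the URL is unreachable, the same loading and slicing steps can be tried on an in-memory CSV. This is only a sketch: the three rows below are invented values arranged in the same nine-column layout (eight features plus a label), not records from the real dataset, and the `demo_` names are made up for the illustration.

```python
import io
import numpy as np

# Three invented rows: 8 feature columns followed by 1 target column.
csv_text = (
    "1,100,70,20,80,25.0,0.5,30,0\n"
    "2,150,80,30,120,32.5,0.7,45,1\n"
    "0,90,60,15,60,21.0,0.2,22,0\n"
)

# np.loadtxt accepts any file-like object, so StringIO stands in for the download.
demo = np.loadtxt(io.StringIO(csv_text), delimiter=",")
demo_X = demo[:, 0:8]   # columns 0..7: features
demo_y = demo[:, 8]     # column 8: class label
print(demo_X.shape)
print(demo_y)
```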

Data Normalization

Many machine learning algorithms, gradient-based methods in particular, are highly sensitive to feature scaling. Therefore, before running an algorithm, we should usually perform either normalization or so-called standardization. Normalization rescales the numeric features so that each of them lies in the range from 0 to 1. Note that preprocessing.normalize, used below, does something different: it rescales each sample (each row) to unit norm, whereas scaling every feature to the range from 0 to 1 is the job of preprocessing.MinMaxScaler. Standardization transforms the data so that each feature has zero mean and unit variance. The Scikit-Learn library provides ready-made functions for this:

from sklearn import preprocessing
# rescale each sample (row) to unit norm
normalized_X = preprocessing.normalize(X)
# standardize the data attributes (zero mean, unit variance per feature)
standardized_X = preprocessing.scale(X)
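
To make the difference between these transformations concrete, here is a small sketch on a made-up 3x2 array (the array `toy` and the variable names are assumptions for illustration only). It compares row-wise normalization, per-feature min-max scaling and standardization.

```python
import numpy as np
from sklearn import preprocessing

# Made-up data: 3 samples, 2 features on very different scales.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

unit_rows = preprocessing.normalize(toy)                    # each ROW scaled to L2 norm 1
min_max = preprocessing.MinMaxScaler().fit_transform(toy)   # each COLUMN mapped to [0, 1]
standardized = preprocessing.scale(toy)                     # each COLUMN: mean 0, variance 1

print(np.linalg.norm(unit_rows, axis=1))
print(min_max.min(axis=0), min_max.max(axis=0))
print(standardized.mean(axis=0), standardized.std(axis=0))
```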

Feature Selection

It’s no secret that often the most important part of solving a task is the ability to properly choose, or even create, features. These are called Feature Selection and Feature Engineering. While Feature Engineering is quite a creative process that relies more on intuition and expert knowledge, there are plenty of ready-made algorithms for Feature Selection. Tree-based algorithms, for example, make it possible to compute how informative each feature is.

from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X, y)
# display the relative importance of each attribute
print(model.feature_importances_)
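
The raw importance array is easier to read once it is sorted. The sketch below uses synthetic data from make_classification as a stand-in (the dataset, its sizes and the names `demo_X`, `demo_y`, `forest` are assumptions, not part of the original example) and prints the features from most to least important.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data: 6 features, only 2 of which carry signal.
demo_X, demo_y = make_classification(n_samples=300, n_features=6,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(demo_X, demo_y)

importances = forest.feature_importances_   # one non-negative value per feature
ranking = np.argsort(importances)[::-1]     # feature indices, most important first
for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```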

Other methods are based on an efficient search over subsets of features to find the subset on which the model achieves the best quality. One such search algorithm is Recursive Feature Elimination, which is also available in the Scikit-Learn library.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(X, y)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)
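
Once the selector is fitted, it can also be used to cut the matrix down to the chosen columns. A minimal sketch on synthetic data (the dataset and the names `demo_X`, `selector`, `reduced_X` are assumptions for illustration; max_iter is raised only so that the solver has room to converge):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with 8 features.
demo_X, demo_y = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(demo_X, demo_y)

reduced_X = selector.transform(demo_X)   # keep only the selected columns
print(selector.support_)    # boolean mask: True for the selected features
print(selector.ranking_)    # 1 for selected features, larger = eliminated earlier
print(reduced_X.shape)
```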

Algorithm Development

As I have said, Scikit-Learn implements all the basic machine learning algorithms. Let’s take a look at some of them.

Logistic Regression

Logistic regression is most often used for binary classification, but multiclass classification (via the so-called one-vs-all method) is also supported. Its advantage is that, for each object, the output includes the probability of belonging to each class.

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
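
The class probabilities mentioned above are available through predict_proba. A short sketch on synthetic data (the dataset and the names `demo_X`, `clf`, `proba` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data as a stand-in.
demo_X, demo_y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(demo_X, demo_y)

proba = clf.predict_proba(demo_X[:5])   # shape (5, 2): one column per class
print(clf.classes_)                     # column order of the probabilities
print(proba)
print(clf.predict(demo_X[:5]))          # the class with the highest probability
```

Each row of the probability matrix sums to 1, and the columns follow the order of `clf.classes_`.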

Naive Bayes

Naive Bayes is another of the best-known machine learning algorithms. Its main task is to estimate the probability density of the training data. The method often gives good quality in multiclass classification problems.

from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
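
To see what "estimating the density" means for GaussianNB, the sketch below fits it on two made-up Gaussian clusters and prints the parameters it learned. The data, the cluster centres and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two made-up classes drawn from Gaussians centred at 0 and at 3.
class0 = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
class1 = rng.normal(loc=3.0, scale=1.0, size=(100, 2))
demo_X = np.vstack([class0, class1])
demo_y = np.array([0] * 100 + [1] * 100)

nb = GaussianNB().fit(demo_X, demo_y)
print(nb.class_prior_)   # estimated class frequencies
print(nb.theta_)         # estimated per-class mean of every feature
nb_pred = nb.predict([[0.0, 0.0], [3.0, 3.0]])
print(nb_pred)
```

The fitted means in `theta_` are simply the per-class sample means, which is why the model needs very little data to train.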

k-Nearest Neighbours

The kNN (k-Nearest Neighbours) method is often used as a component of a more complex classification algorithm; for instance, its estimate can serve as a feature of an object. Sometimes a plain kNN gives great quality on well-chosen features. When its parameters (mainly the distance metric) are set well, the algorithm often gives good quality in regression problems too.

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
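
For the regression case mentioned above, Scikit-Learn has KNeighborsRegressor. A minimal sketch on made-up one-dimensional data (the data and the parameter values are assumptions chosen to keep the arithmetic easy to follow):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up regression data: y = 2x for x = 0, 1, ..., 9.
demo_X = np.arange(10, dtype=float).reshape(-1, 1)
demo_y = 2.0 * demo_X.ravel()

# n_neighbors, weights and the distance metric are the main knobs to tune.
knn = KNeighborsRegressor(n_neighbors=2, weights="uniform", metric="minkowski", p=2)
knn.fit(demo_X, demo_y)

# The two nearest neighbours of 4.5 are x=4 and x=5; their targets are averaged.
knn_pred = knn.predict([[4.5]])
print(knn_pred)
```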

Decision Trees

Classification and Regression Trees (CART) are often used in problems where objects have categorical features, and they handle both regression and classification. Trees are very well suited to multiclass classification.

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
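
A fitted tree can be printed as a set of readable rules with export_text. The sketch below uses a tiny made-up dataset in which the label depends only on the first feature; the data and the feature names "f0" and "f1" are assumptions for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: the label is 1 for the rows whose first feature is 6 or more.
demo_X = [[1, 0], [2, 1], [3, 0], [6, 1], [7, 0], [8, 1]]
demo_y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(demo_X, demo_y)

# Text dump of the learned splits, one line per node.
print(export_text(tree, feature_names=["f0", "f1"]))
tree_pred = tree.predict([[4, 1], [9, 0]])
print(tree_pred)
```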

Support Vector Machines

SVM (Support Vector Machines) is one of the most popular machine learning algorithms, used mainly for classification. Like logistic regression, SVM handles multiclass classification with the help of the one-vs-all method.

from sklearn import metrics
from sklearn.svm import SVC
# fit a SVM model to the data
model = SVC()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
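
The examples in this section score each model on the same data it was trained on, which tends to give an optimistic picture. A sketch of a more cautious setup is shown here, under assumptions: synthetic data stands in for the real dataset, and the 75/25 split and SVC settings are illustrative choices. It also ties back to the scaling step, because SVMs are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data.
demo_X, demo_y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=0)

# The scaler is fitted on the training part only, then reused on the test part.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)

train_acc = svm.score(X_train, y_train)
test_acc = svm.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```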

In addition to classification and regression algorithms, Scikit-Learn offers a large number of more complex ones, including clustering, as well as techniques for building compositions of algorithms, such as Bagging and Boosting.
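
As a rough illustration of such compositions, the sketch below compares a bagged ensemble of decision trees with gradient boosting using 5-fold cross-validation. The synthetic data, the ensemble sizes and the variable names are assumptions; the scores it prints say nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
demo_X, demo_y = make_classification(n_samples=300, n_features=8, random_state=0)

ensembles = {
    # Bagging: many trees, each trained on a bootstrap sample, then a vote.
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0),
    # Boosting: shallow trees added one by one, each correcting the previous ones.
    "boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, ensemble in ensembles.items():
    # Mean accuracy over 5 cross-validation folds.
    results[name] = cross_val_score(ensemble, demo_X, demo_y, cv=5).mean()
    print(name, results[name])
```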

How to Optimize Algorithm Parameters

One of the most difficult stages in building really efficient algorithms is choosing the right parameters. It usually gets easier with experience, but one way or another we have to search. Fortunately, Scikit-Learn provides many ready-made functions for this purpose.

As an example, let’s look at selecting the regularization parameter by trying several values in turn:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
# prepare a range of alpha values to test
alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)
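
The same search works for classifiers. Since the target in this article is a class label, here is a sketch that swaps in LogisticRegression and searches over its inverse regularization strength C. The synthetic data, the grid values and the accuracy scoring are assumptions for illustration; this is a variation on the example, not the original setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data.
demo_X, demo_y = make_classification(n_samples=300, n_features=8, random_state=0)

# C is the inverse regularization strength: smaller C means stronger regularization.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring="accuracy")
search.fit(demo_X, demo_y)

print(search.best_params_, search.best_score_)
# Mean cross-validated score of every candidate, in grid order.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(score, 3))
```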

Sometimes it is more efficient to sample parameter values at random from a given range many times, estimate the algorithm quality for each sampled value, and choose the best one.

import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)
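
Called without arguments, sp_rand() samples alpha uniformly from the interval between 0 and 1. Regularization strengths often span several orders of magnitude, so a log-uniform distribution is a common alternative. The sketch below shows it under assumptions: the regression data is synthetic, and the bounds 1e-4 and 1e2 are arbitrary illustrative choices.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data as a stand-in.
demo_X, demo_y = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)

# Sample alpha log-uniformly between 1e-4 and 1e2 (illustrative bounds).
param_distributions = {"alpha": loguniform(1e-4, 1e2)}
demo_search = RandomizedSearchCV(Ridge(), param_distributions,
                                 n_iter=20, cv=5, random_state=0)
demo_search.fit(demo_X, demo_y)

print(demo_search.best_params_, demo_search.best_score_)
```

Fixing random_state makes the sampled candidates reproducible between runs.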

We have reviewed the entire process of working with the Scikit-Learn library, except for writing the results back to a file. I leave that to you as an exercise, since one of the advantages of Python (and of Scikit-Learn) over R is its excellent documentation.
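
One possible take on that exercise, as a sketch only: the predictions, the column names and the file name predictions.csv are all made up here, and in practice the array would come from model.predict(X).

```python
import os
import tempfile
import numpy as np

# Hypothetical predictions; in practice: demo_predicted = model.predict(X)
demo_predicted = np.array([0, 1, 1, 0])
ids = np.arange(len(demo_predicted))

# Assumed output location: a file in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "predictions.csv")
np.savetxt(out_path, np.column_stack([ids, demo_predicted]),
           fmt="%d", delimiter=",", header="id,prediction", comments="")

# Read the file back to check what was written (skip the header row).
restored = np.loadtxt(out_path, delimiter=",", skiprows=1)
print(restored)
```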

In the next articles, we will consider other problems in detail. In particular, we will touch on such an important thing as Feature Engineering.

I really hope that this material will help novice Data Scientists to get down to solving machine learning problems in practice as soon as possible.

In conclusion, I’d like to wish success and patience to those who are just beginning to take part in machine learning competitions!
