这是一篇翻译的博客,原文链接在这里。这是我看的为数不多的介绍scikit-learn简介而全面的文章,特别适合入门。我这里把这篇文章翻译一下,英语好的同学可以直接看原文。

大部分喜欢用Python来学习数据科学的人,应该听过scikit-learn,这个开源的Python库帮我们实现了一系列有关机器学习,数据处理,交叉验证和可视化的算法。其提供的接口非常好用。

这就是为什么DataCamp(原网站)要为那些已经开始学习Python库却没有一个简明且方便的总结的人提供这个总结。(原文是cheat sheet,翻译过来就是小抄,我这里翻译成总结,感觉意思上更积极点)。或者你压根都不知道scikit-learn如何使用,那这份总结将会帮助你快速的了解其相关的基本知识,让你快速上手。

你会发现,当你处理机器学习问题时,scikit-learn简直就是神器。

这份scikit-learn总结将会介绍一些基本步骤让你快速实现机器学习算法,主要包括:读取数据,数据预处理,如何创建模型来拟合数据,如何验证你的模型以及如何调参让模型变得更好。

总的来说,这份总结将会通过示例代码让你开始你的数据科学项目,你能立刻创建模型,验证模型,调试模型。(原文提供了pdf版的下载,内容和原文差不多)

A Basic Example

>>> from sklearn import neighbors, datasets, preprocessing
>>> from sklearn.cross_validation import train_test_split
>>> from sklearn.metrics import accuracy_score
>>> iris = datasets.load_iris()
>>> X, y = iris.data[:, :2], iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train = scaler.transform(X_train)
>>> X_test = scaler.transform(X_test)
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred)

(补充,这里看不懂不要紧,其实就是个小例子,后面会详细解答)

Loading The Data

你的数据需要是numeric类型,然后存储成numpy数组或者scipy稀疏矩阵。我们也接受其他能转换成numeric数组的类型,比如Pandas的DataFrame。

>>> import numpy as np
>>> X = np.random.random((10,5))
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F','F'])
>>> X[X < 0.7] = 0

Preprocessing The Data

Standardization

>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(X_train)
>>> standardized_X = scaler.transform(X_train)
>>> standardized_X_test = scaler.transform(X_test)

Normalization

>>> from sklearn.preprocessing import Normalizer
>>> scaler = Normalizer().fit(X_train)
>>> normalized_X = scaler.transform(X_train)
>>> normalized_X_test = scaler.transform(X_test)

Binarization

>>> from sklearn.preprocessing import Binarizer
>>> binarizer = Binarizer(threshold=0.0).fit(X)
>>> binary_X = binarizer.transform(X)

Encoding Categorical Features

>>> from sklearn.preprocessing import LabelEncoder
>>> enc = LabelEncoder()
>>> y = enc.fit_transform(y)

Imputing Missing Values

>>>from sklearn.preprocessing import Imputer
>>>imp = Imputer(missing_values=0, strategy='mean', axis=0)
>>>imp.fit_transform(X_train)

Generating Polynomial Features

>>> from sklearn.preprocessing import PolynomialFeatures)
>>> poly = PolynomialFeatures(5))
>>> oly.fit_transform(X))

Training And Test Data

>>> from sklearn.cross_validation import train_test_split)
>>> X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0))

Create Your Model

Supervised Learning Estimators

Linear Regression

>>> from sklearn.linear_model import LinearRegression)
>>> lr = LinearRegression(normalize=True))

Support Vector Machines (SVM)

>>> from sklearn.svm import SVC)
>>> svc = SVC(kernel='linear'))

Naive Bayes

>>> from sklearn.naive_bayes import GaussianNB)
>>> gnb = GaussianNB())

KNN

>>> from sklearn import neighbors)
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5))

Unsupervised Learning Estimators

Principal Component Analysis (PCA)

>>> from sklearn.decomposition import PCA)
>>> pca = PCA(n_components=0.95))

K Means

>>> from sklearn.cluster import KMeans)
>>> k_means = KMeans(n_clusters=3, random_state=0))

Model Fitting

Supervised learning

>>> lr.fit(X, y))
>>> knn.fit(X_train, y_train))
>>> svc.fit(X_train, y_train))

Unsupervised Learning

>>> k_means.fit(X_train))
>>> pca_model = pca.fit_transform(X_train))

Prediction

Supervised Estimators

>>> y_pred = svc.predict(np.random.random((2,5))))
>>> y_pred = lr.predict(X_test))
>>> y_pred = knn.predict_proba(X_test))

Unsupervised Estimators

>>> y_pred = k_means.predict(X_test))

Evaluate Your Model's Performance

Classification Metrics

Accuracy Score

>>> knn.score(X_test, y_test))
>>> from sklearn.metrics import accuracy_score)
>>> accuracy_score(y_test, y_pred))

Classification Report

>>> from sklearn.metrics import classification_report)
>>> print(classification_report(y_test, y_pred)))

Confusion Matrix

>>> from sklearn.metrics import confusion_matrix)
>>> print(confusion_matrix(y_test, y_pred)))

Regression Metrics

Mean Absolute Error

>>> from sklearn.metrics import mean_absolute_error)
>>> y_true = [3, -0.5, 2])
>>> mean_absolute_error(y_true, y_pred))

Mean Squared Error

>>> from sklearn.metrics import mean_squared_error)
>>> mean_squared_error(y_test, y_pred))

R2 Score

>>> from sklearn.metrics import r2_score)
>>> r2_score(y_true, y_pred))

Clustering Metrics

Adjusted Rand Index

>>> from sklearn.metrics import adjusted_rand_score)
>>> adjusted_rand_score(y_true, y_pred))

Homogeneity

>>> from sklearn.metrics import homogeneity_score)
>>> homogeneity_score(y_true, y_pred))

V-measure

>>> from sklearn.metrics import v_measure_score)
>>> metrics.v_measure_score(y_true, y_pred))

Cross-Validation

>>> print(cross_val_score(knn, X_train, y_train, cv=4))
>>> print(cross_val_score(lr, X, y, cv=2))

Tune Your Model

Grid Search

>>> from sklearn.grid_search import GridSearchCV
>>> params = {"n_neighbors": np.arange(1,3), "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn,param_grid=params)
>>> grid.fit(X_train, y_train)
>>> print(grid.best_score_)
>>> print(grid.best_estimator_.n_neighbors)

Randomized Parameter Optimization

>>> from sklearn.grid_search import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5), "weights": ["uniform", "distance"]}
>>> rsearch = RandomizedSearchCV(estimator=knn,
param_distributions=params,
cv=4,
n_iter=8,
random_state=5)
>>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)

Going Further

学习完上面的例子后,你可以通过our scikit-learn tutorial for beginners来学习更多的例子。另外你可以学习matplotlib来可视化数据。

不要错过后续教程 Bokeh cheat sheet, the Pandas cheat sheet or the Python cheat sheet for data science.

Python Machine Learning: Scikit-Learn Tutorial的更多相关文章

  1. Python机器学习 (Python Machine Learning 中文版 PDF)

    Python机器学习介绍(Python Machine Learning 中文版) 机器学习,如今最令人振奋的计算机领域之一.看看那些大公司,Google.Facebook.Apple.Amazon早 ...

  2. [Python & Machine Learning] 学习笔记之scikit-learn机器学习库

    1. scikit-learn介绍 scikit-learn是Python的一个开源机器学习模块,它建立在NumPy,SciPy和matplotlib模块之上.值得一提的是,scikit-learn最 ...

  3. Python -- machine learning, neural network -- PyBrain 机器学习 神经网络

    I am using pybrain on my Linuxmint 13 x86_64 PC. As what it is described: PyBrain is a modular Machi ...

  4. Python机器学习介绍(Python Machine Learning 中文版)

    Python机器学习 机器学习,如今最令人振奋的计算机领域之一.看看那些大公司,Google.Facebook.Apple.Amazon早已展开了一场关于机器学习的军备竞赛.从手机上的语音助手.垃圾邮 ...

  5. 《Python Machine Learning》索引

    目录部分: 第一章:赋予计算机从数据中学习的能力 第二章:训练简单的机器学习算法——分类 第三章:使用sklearn训练机器学习分类器 第四章:建立好的训练集——数据预处理 第五章:通过降维压缩数据 ...

  6. How do I learn machine learning?

    https://www.quora.com/How-do-I-learn-machine-learning-1?redirected_qid=6578644   How Can I Learn X? ...

  7. Getting started with machine learning in Python

    Getting started with machine learning in Python Machine learning is a field that uses algorithms to ...

  8. 【机器学习Machine Learning】资料大全

    昨天总结了深度学习的资料,今天把机器学习的资料也总结一下(友情提示:有些网站需要"科学上网"^_^) 推荐几本好书: 1.Pattern Recognition and Machi ...

  9. 机器学习算法之旅A Tour of Machine Learning Algorithms

    In this post we take a tour of the most popular machine learning algorithms. It is useful to tour th ...

随机推荐

  1. December 04th 2016 Week 50th Sunday

    Learn wisdom by the follies of others. 前车之鉴,后人之师. Maybe my personal state is that others can learn w ...

  2. if 里面嵌套一个if&else (我自己又细分了别的条件,加了elif)

    场景: 一个陌生人敲门..... gender = input("你是男的是女的?") if gender == "女": print("请进&quo ...

  3. 项目属性的target platform和target platform version到底是什么(vs2015开发windows驱动小记)

    根据官方对属性页的介绍(General Property Page (Project))可了解: target platform是build后的结果会跑在哪个平台,例如windows,android, ...

  4. css画图那些事

    上一篇css3写了一些基本的图形,想到是不是能用css3画个动物,便在网上找图片.于是选中一只大鹏鸟 也不难,一步步的写出身体部位,再定位上去就好了.来一张效果图,后面给两个加了动画,稍微难看一点,后 ...

  5. Linux下七牛云存储qrsync命令行上传同步工具

    原址:https://m.aliyun.com/yunqi/ziliao/54370 VPS数据备份是一个重要的工作,之前在文章:使用七牛云存储自动备份VPS数据分享过使用七牛云存储提供的工具QRSB ...

  6. BZOJ2435:[NOI2011]道路修建 (差分)

    Description 在 W 星球上有 n 个国家.为了各自国家的经济发展,他们决定在各个国家 之间建设双向道路使得国家之间连通.但是每个国家的国王都很吝啬,他们只愿 意修建恰好 n – 1条双向道 ...

  7. Lambda表达式和For循环使用需要注意的一个地方

    一个需要注意的地方看下面的代码: using System; using System.Collections.Generic; using System.Linq; namespace MyCsSt ...

  8. ansible-playbook快速入门

    一.yaml语法: 1. yaml语法编写 1.1 同层级的字段通过相同缩进表示 1.2 map结构里面key/value用‘:’来分隔 1.3 key/value可以同行写,也可以换行写,换行写必须 ...

  9. Redis基本数据类型命令汇总

    前言   前阶段写Redis客户端作为学习和了解Redis Protocol,基本上把Strintg,List,Hash,Set,SortedSet五种基础类型的命令都写完了,本篇进行总结,也相当于复 ...

  10. Python:numpy.newaxis

    x1[:,np.newaxis]:增维,转置 从字面上是插入新的维度的意思 demo1: 针对一维的情况 >>> b = np.array([1, 2, 3, 4, 5, 6]) & ...