[scikit-learn] Cross-validation and examples of using it for parameter tuning, model selection, and feature selection

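Agenda

Drawbacks of using a train/test split for model evaluation
How K-fold cross-validation overcomes these drawbacks
How cross-validation can be used to tune parameters, select models, and select features
Possible improvements to cross-validation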

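1. Review of model evaluation

One important goal of model evaluation is to choose the most suitable model. In supervised learning we want a model that generalizes well to unseen data, so we need an evaluation procedure that estimates how well different models are likely to perform on such data.

We started with training accuracy, that is, training and testing on all of the data, but this rewards overly complex models that overfit. To address this, we split the data into a training set and a testing set: the model is trained on the training set, and its predictive performance is then measured on the testing set. This measure is called testing accuracy, and because the model is scored on data it has not seen, it penalizes overfitting.

One drawback of testing accuracy is that it is a high variance estimate of out-of-sample accuracy: its value depends on which observations happen to land in the testing set, so different splits can give noticeably different results.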

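Example of a high variance estimate

Below we use the iris data to show that testing accuracy, as a measure of model performance, has high variance.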

In [1]:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

In [2]:

# read in the iris data
iris = load_iris()
X = iris.data
y = iris.target

In [3]:

for i in range(1, 5):
    print("random_state is ", i, ", and accuracy score is:")
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=i)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print(metrics.accuracy_score(y_test, y_pred))

random_state is  1 , and accuracy score is:

1.0

random_state is  2 , and accuracy score is:

1.0

random_state is  3 , and accuracy score is:

0.947368421053

random_state is  4 , and accuracy score is:

0.973684210526

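The testing accuracies above show that different train/test splits lead to different accuracy values. The basic idea of cross-validation is to split the dataset several times to produce a set of different train/test splits, train the model and compute the testing accuracy on each split, and then average the results. This effectively reduces the variance of the testing accuracy estimate.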
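To put a number on how much a single split can vary, here is a minimal sketch that repeats the train/test split with many seeds and summarizes the spread. The choice of 30 seeds and the name `split_scores` are illustrative assumptions, not part of the original notebook, and the imports use the current `sklearn.model_selection` module.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Repeat the train/test split with 30 different seeds (an arbitrary choice)
split_scores = []
for seed in range(30):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    # score() returns the accuracy on the held-out testing set
    split_scores.append(knn.score(X_test, y_test))

split_scores = np.array(split_scores)
print("min: %.3f, max: %.3f" % (split_scores.min(), split_scores.max()))
print("mean: %.3f, std: %.3f" % (split_scores.mean(), split_scores.std()))
```

The gap between the minimum and the maximum is the variability that a single split hides. The mean is the kind of averaged estimate that cross-validation aims for.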

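2. K-fold cross-validation

Split the dataset into K equal partitions (folds)
Use one fold as the testing set and the union of the other folds as the training set
Calculate the testing accuracy
Repeat the previous two steps K times, using a different fold as the testing set each time
Use the average testing accuracy as the estimate of out-of-sample accuracy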
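The steps above can be sketched as an explicit loop. This is only an illustration under a few assumptions: it uses the current `sklearn.model_selection.KFold` API, picks K=5, and enables shuffling with a fixed seed because the iris rows are ordered by class. None of these choices come from the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Split the data into K folds (K=5 and the seed are illustrative choices)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Use one fold for testing and the remaining folds for training
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[train_idx], y[train_idx])
    # Calculate the testing accuracy for this fold
    fold_scores.append(knn.score(X[test_idx], y[test_idx]))

# Average the per-fold accuracies to estimate out-of-sample accuracy
print(fold_scores)
print(np.mean(fold_scores))
```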

In [4]:

# the code below shows how K-fold cross-validation splits the data
# simulate splitting a dataset of 25 observations into 5 folds
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=False)
# print the contents of each training and testing set
print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for iteration, data in enumerate(kf.split(list(range(25))), start=1):
    # convert the index arrays to strings so that the alignment spec applies
    print('{:^9} {} {:^25}'.format(iteration, str(data[0]), str(data[1])))

Iteration                   Training set observations                   Testing set observations

    1     [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]        [0 1 2 3 4]

    2     [ 0  1  2  3  4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]        [5 6 7 8 9]

    3     [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19 20 21 22 23 24]     [10 11 12 13 14]

    4     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 20 21 22 23 24]     [15 16 17 18 19]

    5     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]     [20 21 22 23 24]

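3. Recommendations for cross-validation

K=10 is the generally recommended value
For classification problems, use stratified sampling to create the folds, so that each class is represented in the same proportion in the training and testing sets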
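As a rough illustration of stratified sampling, the sketch below counts the class labels in each testing fold produced by `StratifiedKFold`. The variable name `fold_class_counts` is an assumption made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

# Stratified folds preserve the class proportions of y in every fold
skf = StratifiedKFold(n_splits=10)

fold_class_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Count how many observations of each of the 3 classes are in the testing fold
    counts = np.bincount(y[test_idx], minlength=3)
    fold_class_counts.append(counts.tolist())
    print(counts)
```

The iris data has 50 observations per class, so each of the 10 testing folds holds 5 observations of every class. When `cross_val_score` is given an integer `cv` and a classifier, it uses stratified folds by default.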

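4. Cross-validation examples

4.1 Parameter tuning

Cross-validation can help us tune parameters and arrive at a set of best model parameters. In the example below we again use the iris data and a KNN model, and tune the parameter to find the value that gives the best cross-validated accuracy, and therefore the best expected generalization.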

In [6]:

from sklearn.model_selection import cross_val_score

In [7]:

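knn = KNeighborsClassifier(n_neighbors=5)
# cross_val_score runs the whole cross-validation procedure, so there is no need to split the data by hand
# the cv parameter sets the number of folds the data is split into
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)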

[ 1.          0.93333333  1.          1.          0.86666667  0.93333333

  0.93333333  1.          1.          1.        ]

In [8]:

# use average accuracy as an estimate of out-of-sample accuracy
# (the mean testing accuracy over the ten iterations)
print(scores.mean())

0.966666666667

In [11]:

# search for an optimal value of K for KNN model
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())
print(k_scores)

[0.95999999999999996, 0.95333333333333337, 0.96666666666666656, 0.96666666666666656, 0.96666666666666679, 0.96666666666666679, 0.96666666666666679, 0.96666666666666679, 0.97333333333333338, 0.96666666666666679, 0.96666666666666679, 0.97333333333333338, 0.98000000000000009, 0.97333333333333338, 0.97333333333333338, 0.97333333333333338, 0.97333333333333338, 0.98000000000000009, 0.97333333333333338, 0.98000000000000009, 0.96666666666666656, 0.96666666666666656, 0.97333333333333338, 0.95999999999999996, 0.96666666666666656, 0.95999999999999996, 0.96666666666666656, 0.95333333333333337, 0.95333333333333337, 0.95333333333333337]

In [10]:

import matplotlib.pyplot as plt

%matplotlib inline

In [12]:

plt.plot(k_range, k_scores)

plt.xlabel("Value of K for KNN")

plt.ylabel("Cross validated accuracy")

Out[12]:

<matplotlib.text.Text at 0x6dd0fb0>

[Plot: cross-validated accuracy versus the value of K for KNN]

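The example above illustrates the bias-variance trade-off: small values of K give low bias but high variance, while large values of K give high bias but low variance. The best parameter value lies somewhere in the middle, where bias and variance are balanced and the model generalizes best to out-of-sample data.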
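The search over K can also be delegated to `GridSearchCV`, which runs the same cross-validated loop and records the best setting. This is a sketch of an alternative that the original notebook does not use. The grid of 1 to 30 simply mirrors the range searched above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values of n_neighbors, mirroring the manual search over 1..30
param_grid = {"n_neighbors": list(range(1, 31))}

# 10-fold cross-validation for every candidate value
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy")
grid.fit(X, y)

# Best parameter value found and its mean cross-validated accuracy
print(grid.best_params_)
print(grid.best_score_)
```

When several values of K tie for the best score, `best_params_` reports only one of them, so it is still worth looking at the full curve before settling on a value.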

4.2 Model selection

Cross-validation can also help us choose between models. The example below uses the iris data to compare a KNN model with a logistic regression model.

In [13]:

# 10-fold cross-validation with the best KNN model
knn = KNeighborsClassifier(n_neighbors=20)
print(cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())

0.98

In [14]:

# 10-fold cross-validation with logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
print(cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())

0.953333333333
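When comparing models it can help to score every candidate on exactly the same folds and to look at the spread as well as the mean. The sketch below is one way to do that. The shared `StratifiedKFold` object, the `results` dictionary, and `max_iter=1000` (set only to avoid convergence warnings with recent scikit-learn defaults) are assumptions of this example, so its numbers need not match the outputs shown above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# One fold definition shared by all candidate models
cv = StratifiedKFold(n_splits=10)

models = {
    "knn": KNeighborsClassifier(n_neighbors=20),
    "logreg": LogisticRegression(max_iter=1000),  # max_iter is an assumption
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    # Keep both the mean and the standard deviation across folds
    results[name] = (scores.mean(), scores.std())
    print("%s: mean=%.3f, std=%.3f" % (name, scores.mean(), scores.std()))
```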

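4.3 Feature selection

Below we use the advertising data and cross-validation to perform feature selection, comparing how well different combinations of features predict the response.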

In [15]:

import pandas as pd

import numpy as np

from sklearn.linear_model import LinearRegression

In [16]:

# read in the advertising dataset

data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)

In [17]:

# create a Python list of three feature names

feature_cols = ['TV', 'Radio', 'Newspaper']

# use the list to select a subset of the DataFrame (X)

X = data[feature_cols]

# select the Sales column as the response (y)

y = data.Sales

In [18]:

# 10-fold cv with all features
lm = LinearRegression()
scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')
print(scores)

[-3.56038438 -3.29767522 -2.08943356 -2.82474283 -1.3027754  -1.74163618

 -8.17338214 -2.11409746 -3.04273109 -2.45281793]

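Note that all of the scores above are negative. Why would mean squared error be negative? MSE is a loss function, which we want to minimize, whereas classification accuracy is a reward function, which we want to maximize. scikit-learn's cross-validation tools treat a higher score as better, so they report the negated MSE.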
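The sign convention can be checked directly. The following sketch uses synthetic data from `make_regression` (an assumption made so that the example is self-contained; it is not the advertising data) and the current scorer name `neg_mean_squared_error`. It compares the scorer's output with MSE values computed by hand on the same folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, used here only for illustration
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# Unshuffled folds, so both computations below see identical splits
cv = KFold(n_splits=10)

# Scores reported by cross_val_score: negated MSE, one value per fold
neg_mse = cross_val_score(LinearRegression(), X, y, cv=cv,
                          scoring="neg_mean_squared_error")

# The same quantity computed by hand, without the sign flip
manual_mse = []
for train_idx, test_idx in cv.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(neg_mse)
print(np.allclose(-neg_mse, manual_mse))
```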

In [19]:

# fix the sign of MSE scores
mse_scores = -scores
print(mse_scores)

[ 3.56038438  3.29767522  2.08943356  2.82474283  1.3027754   1.74163618

  8.17338214  2.11409746  3.04273109  2.45281793]

In [20]:

# convert from MSE to RMSE
rmse_scores = np.sqrt(mse_scores)
print(rmse_scores)

[ 1.88689808  1.81595022  1.44548731  1.68069713  1.14139187  1.31971064

  2.85891276  1.45399362  1.7443426   1.56614748]

In [21]:

# calculate the average RMSE
print(rmse_scores.mean())

1.69135317081

In [22]:

# 10-fold cross-validation with two features (excluding Newspaper)
feature_cols = ['TV', 'Radio']
X = data[feature_cols]
print(np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')).mean())

1.67967484191

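Excluding the Newspaper feature gives a lower RMSE (1.68 < 1.69), and RMSE is an error that we want to minimize. The model without Newspaper is therefore the better model.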
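The same comparison can be wrapped in a small helper and run over several candidate feature subsets. The sketch below is illustrative only: it runs on synthetic data generated in the code, and the column names `informative_1`, `informative_2`, `noise` and `target`, as well as the helper `cv_rmse`, are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: two columns drive the target, one column is pure noise
rng = np.random.RandomState(0)
n = 200
data = pd.DataFrame({
    "informative_1": rng.uniform(0, 300, n),
    "informative_2": rng.uniform(0, 50, n),
    "noise": rng.uniform(0, 100, n),
})
data["target"] = (3 + 0.05 * data["informative_1"]
                  + 0.2 * data["informative_2"]
                  + rng.normal(0, 1.5, n))


def cv_rmse(feature_cols):
    """Mean 10-fold cross-validated RMSE of a linear model on the given columns."""
    scores = cross_val_score(LinearRegression(), data[feature_cols], data["target"],
                             cv=10, scoring="neg_mean_squared_error")
    return np.sqrt(-scores).mean()


candidates = [
    ["informative_1", "informative_2", "noise"],
    ["informative_1", "informative_2"],
    ["noise"],
]
rmse_by_subset = {tuple(cols): cv_rmse(cols) for cols in candidates}
for cols, rmse in rmse_by_subset.items():
    print(cols, round(rmse, 3))
```

The subset with the lowest average RMSE is the preferred one. When two subsets score nearly the same, as in the advertising example, the smaller one is usually the more attractive choice.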

Resources

scikit-learn documentation: Cross-validation, Model evaluation
scikit-learn issue on GitHub: MSE is negative when returned by cross_val_score
Scott Fortmann-Roe: Accurately Measuring Model Prediction Error
Harvard CS109: Cross-Validation: The Right and Wrong Way
Journal of Cheminformatics: Cross-validation pitfalls when selecting and assessing regression and classification models