http://www.dataguru.cn/portal.php?mod=view&aid=3514

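I have been picking up Python on and off lately. As is my habit, I start from the application level to get going quickly and fill in the technical details later. After working through some material and getting the basic syntax and data structures down, my attention turned to the scikit-learn package. It is a statistical learning package built on scipy, and the range of algorithm interfaces it covers is very comprehensive. Even better, its user guide is very well written. However, it is blocked here (or maybe not, or sometimes blocked and sometimes fine). I don't know how to get around the firewall (go ahead and laugh), so I have to hunt for proxies and put up with all sorts of pop-ups. I therefore decided to keep a record and copy at least some of the user guide's content to my blog for easy lookup. And so I got started.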

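Note: how to install Python, its IDEs, and the related modules is outside the scope of this blog series. This blog only tries to record code examples that may come in handy.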

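1. Generalized linear models

Here, "generalized linear models" means linear models and their simple extensions, including ridge regression, lasso, LAR, logistic regression, the perceptron, and so on. Below I introduce the basic ideas behind these models and how to use them in Python.

1.1. Ordinary least squares

Implemented by the LinearRegression class. A drawback of least squares is that it relies on the predictors not being strongly correlated: under multicollinearity the design matrix becomes close to singular, so the least-squares estimate is very sensitive, and small fluctuations in the random error can produce large changes in the estimate. Multicollinearity is especially likely when the data come from observational studies rather than designed experiments.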

from sklearn import linear_model

clf = linear_model.LinearRegression()

clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])  # fit the model

clf.coef_  # get the fitted coefficients

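The code above shows the two main pieces needed for a linear regression fit: the fit method and the coef_ attribute that holds the fitted coefficients. The script below gives a more complete example.

Script:

print(__doc__)

import matplotlib.pyplot as pl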

import numpy as np

from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()  # load the data

diabetes_x = diabetes.data[:, np.newaxis]

diabetes_x_temp = diabetes_x[:, :, 2]

diabetes_x_train = diabetes_x_temp[:-20]  # training samples

diabetes_x_test = diabetes_x_temp[-20:]  # test samples

diabetes_y_train = diabetes.target[:-20]

diabetes_y_test = diabetes.target[-20:]

regr = linear_model.LinearRegression()

regr.fit(diabetes_x_train, diabetes_y_train)

print('Coefficients:\n', regr.coef_)

print("Mean squared error: %.2f" % np.mean((regr.predict(diabetes_x_test) - diabetes_y_test) ** 2))

print ("variance score: %.2f" % regr.score(diabetes_x_test, diabetes_y_test))

pl.scatter(diabetes_x_test,diabetes_y_test, color = 'black')

pl.plot(diabetes_x_test, regr.predict(diabetes_x_test),color='blue',linewidth = 3)

pl.xticks(())

pl.yticks(())

pl.show()
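
The sensitivity to multicollinearity described at the start of this section can be seen in a small experiment. The following is only a minimal sketch on synthetic data: the variable names, the noise levels, and the nearly duplicated second column are assumptions made for illustration, not part of the user guide.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n = 50
x1 = rng.randn(n)
# x2 is almost an exact copy of x1, so the design matrix is nearly singular
x2 = x1 + 1e-4 * rng.randn(n)
X = np.column_stack([x1, x2])
signal = x1 + x2  # the true coefficients are (1, 1)

# Fit the same model on two responses that differ only in their small noise
coefs = []
for seed in (1, 2):
    noise = 0.1 * np.random.RandomState(seed).randn(n)
    coefs.append(LinearRegression().fit(X, signal + noise).coef_)

print("fit 1:", coefs[0])
print("fit 2:", coefs[1])
# The sum of the two coefficients is well determined even when each one is not
print("sums :", coefs[0].sum(), coefs[1].sum())
print("condition number of X:", np.linalg.cond(X))
```

The individual coefficients are free to move far from (1, 1) because only their sum is pinned down by the data, which is the instability that the regularized methods in the following sections are meant to reduce.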

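1.2. Ridge regression

Ridge regression is a regularization method: it adds an L2-norm penalty to the loss function to control the complexity of the linear model, which makes the model more robust. It is used as follows: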

from sklearn import linear_model

clf = linear_model.Ridge (alpha = .5)

clf.fit([[0,0],[0,0],[1,1]],[0,.1,1])

clf.coef_
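
To make the L2 penalty concrete, the sketch below compares Ridge with the textbook closed-form solution w = (X^T X + alpha I)^{-1} X^T y. It assumes no intercept (fit_intercept=False) and uses made-up synthetic data; it is meant as a sanity check, not as a description of how the library solves the problem internally.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(20)
alpha = 0.5

# Closed-form ridge solution without an intercept: (X^T X + alpha I)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# The same fit through scikit-learn
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

print(w_closed)
print(w_sklearn)
```

The alpha * I term is what keeps the system solvable even when X^T X is close to singular.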

The script below plots the ridge coefficient estimates as a function of the penalty coefficient.

print(__doc__)

import numpy as np

import matplotlib.pyplot as pl

from sklearn import linear_model

# X is the 10x10 Hilbert matrix

X = 1. / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])

y = np.ones(10)

# Compute paths

n_alphas = 200

alphas = np.logspace(-10, -2, n_alphas)

clf = linear_model.Ridge(fit_intercept=False)  # create a ridge regression object

coefs = []

# loop over the alphas, fitting the model once for each

for a in alphas:

    clf.set_params(alpha=a)

    clf.fit(X, y)

    coefs.append(clf.coef_)  # collect the coefficients in coefs

# Display results

ax = pl.gca()

ax.set_prop_cycle(color=['b', 'r', 'g', 'c', 'k', 'y', 'm'])  # set_color_cycle was removed from recent matplotlib

ax.plot(alphas, coefs)

ax.set_xscale('log')  # note this step: alpha is shown on a log scale

ax.set_xlim(ax.get_xlim()[::-1]) # reverse axis

pl.xlabel('alpha')

pl.ylabel('weights')

pl.title('Ridge coefficients as a function of the regularization')

pl.axis('tight')

pl.show()

The regularization coefficient can be chosen by generalized cross-validation (GCV) as follows:

clf = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])

clf.fit([[0,0],[0,0],[1,1]],[0,.1,1])

clf.alpha_

1.3. Lasso

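The lasso differs from ridge regression in that its penalty is based on the L1 norm. As a result, it can shrink coefficients exactly to 0, which amounts to variable selection, and it is a very popular variable selection method. There are two main algorithms for computing the lasso estimate. One is coordinate descent, which the Lasso class introduced below uses. The other is least angle regression, covered later (one of the most impressive papers I read as a student is Efron's LARS paper; it was a revelation, and I recommend that everyone read it).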

clf = linear_model.Lasso(alpha = 0.1)

clf.fit([[0,0],[1,1]],[0,1])

clf.predict([[1,1]])
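
The claim that the L1 penalty drives coefficients exactly to 0, while the L2 penalty only shrinks them, can be checked directly. This is a minimal sketch on synthetic data; the sizes, the three "true" features, and alpha=0.1 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
true_coef = np.zeros(20)
true_coef[:3] = [3.0, -2.0, 1.5]  # only the first three features matter
y = X @ true_coef + 0.1 * rng.randn(100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# Count coefficients that are exactly zero under each penalty
print("zero coefficients, lasso:", np.sum(lasso.coef_ == 0))
print("zero coefficients, ridge:", np.sum(ridge.coef_ == 0))
```

Inspecting lasso.coef_ shows which features survive the selection.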

The script below compares Lasso and Elastic Net on a sparse signal.

print(__doc__)

import numpy as np

import matplotlib.pyplot as pl

from sklearn.metrics import r2_score

# generate some sparse data to play with

np.random.seed(42)

n_samples, n_features = 50, 200

X = np.random.randn(n_samples, n_features)

coef = 3 * np.random.randn(n_features)

inds = np.arange(n_features)

np.random.shuffle(inds)  # shuffle the feature indices

coef[inds[10:]] = 0 # sparsify coef

y = np.dot(X, coef)

# add noise

y += 0.01 * np.random.normal(size=n_samples)  # one noise draw per sample

# Split data in train set and test set

n_samples = X.shape[0]

X_train, y_train = X[:n_samples // 2], y[:n_samples // 2]

X_test, y_test = X[n_samples // 2:], y[n_samples // 2:]

# Lasso

from sklearn.linear_model import Lasso

alpha = 0.1

lasso = Lasso(alpha=alpha)  # Lasso object

y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)  # fit, then predict

r2_score_lasso = r2_score(y_test, y_pred_lasso)

print(lasso)

print("r^2 on test data : %f" % r2_score_lasso)

# ElasticNet

from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=alpha, l1_ratio=0.7)

y_pred_enet = enet.fit(X_train, y_train).predict(X_test)

r2_score_enet = r2_score(y_test, y_pred_enet)

print(enet)

print("r^2 on test data : %f" % r2_score_enet)

pl.plot(enet.coef_, label='Elastic net coefficients')

pl.plot(lasso.coef_, label='Lasso coefficients')

pl.plot(coef, '--', label='original coefficients')

pl.legend(loc='best')

pl.title("Lasso R^2: %f, Elastic Net R^2: %f"

% (r2_score_lasso, r2_score_enet))

pl.show()

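1.3.1. How to set the regularization coefficient

1.3.1.1. Using cross-validation

Two classes implement cross-validation: LassoCV and LassoLarsCV. For high-dimensional data, LassoCV is usually the most suitable choice. LassoLarsCV, however, can be faster, particularly when the number of samples is very small compared with the number of features.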
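
As a quick illustration of the two cross-validated estimators, the sketch below fits both on the diabetes data and prints the alpha each one selects. The choice of cv=5 is an assumption made to keep the run short; it is not a recommendation.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV, LassoLarsCV

X, y = load_diabetes(return_X_y=True)

# Coordinate descent over an automatically chosen grid of alphas
lasso_cv = LassoCV(cv=5).fit(X, y)
# LARS-based path, cross-validated with the same number of folds
lars_cv = LassoLarsCV(cv=5).fit(X, y)

print("LassoCV alpha:    ", lasso_cv.alpha_)
print("LassoLarsCV alpha:", lars_cv.alpha_)
print("nonzero coefficients:",
      np.sum(lasso_cv.coef_ != 0), np.sum(lars_cv.coef_ != 0))
```

After fitting, both objects behave like an ordinary Lasso model refit at the selected alpha, so predict and coef_ can be used directly.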

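1.3.1.2. Using information criteria

AIC and BIC. These criteria are cheaper to compute than cross-validation. However, they require a proper estimate of the model's degrees of freedom and assume that the probability model is correct. In practice we run into this problem often: we would still rather compute something directly from the data than start from assumptions about a probability model.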
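
For reference, the standard textbook definitions of the two criteria are written out below, where \hat{L} is the maximized likelihood, d the degrees of freedom of the model, and n the number of samples. These are the general formulas, not a transcription of the library's internals; for the lasso, d is commonly estimated by the number of nonzero coefficients, which I state here as an assumption about how such an estimate would typically be made.

```latex
\mathrm{AIC} = -2 \log \hat{L} + 2d,
\qquad
\mathrm{BIC} = -2 \log \hat{L} + d \log n
```

Both criteria depend on d and on the likelihood, which is why a poor estimate of the degrees of freedom or a wrong probability model undermines them. Because \log n exceeds 2 once n is larger than about 7, BIC penalizes extra parameters more heavily than AIC and tends to pick sparser models.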

The script below compares several ways of setting the regularization parameter: AIC, BIC, and cross-validation.

import time

import numpy as np

import matplotlib.pyplot as pl

from sklearn.linear_model import LassoCV, LassoLarsCV, LassoLarsIC

from sklearn import datasets

diabetes = datasets.load_diabetes()

X = diabetes.data

y = diabetes.target

rng = np.random.RandomState(42)

X = np.c_[X, rng.randn(X.shape[0], 14)] # add some bad features

# normalize data as done by Lars to allow for comparison

X /= np.sqrt(np.sum(X ** 2, axis=0))

# LassoLarsIC: least angle regression with BIC/AIC criterion

model_bic = LassoLarsIC(criterion='bic')  # BIC criterion

t1 = time.time()

model_bic.fit(X, y)

t_bic = time.time() - t1

alpha_bic_ = model_bic.alpha_

model_aic = LassoLarsIC(criterion='aic')  # AIC criterion

model_aic.fit(X, y)

alpha_aic_ = model_aic.alpha_

def plot_ic_criterion(model, name, color):

    alpha_ = model.alpha_

    alphas_ = model.alphas_

    criterion_ = model.criterion_

    pl.plot(-np.log10(alphas_), criterion_, '--', color=color,

        linewidth=3, label='%s criterion' % name)

    pl.axvline(-np.log10(alpha_), color=color, linewidth=3,

        label='alpha: %s estimate' % name)

    pl.xlabel('-log(alpha)')

    pl.ylabel('criterion')

pl.figure()

plot_ic_criterion(model_aic, 'AIC', 'b')

plot_ic_criterion(model_bic, 'BIC', 'r')

pl.legend()

pl.title('Information-criterion for model selection (training time %.3fs)'

% t_bic)

# LassoCV: coordinate descent

# Compute paths

print("Computing regularization path using the coordinate descent lasso...")

t1 = time.time()

model = LassoCV(cv=20).fit(X, y)  # create the object and fit it

t_lasso_cv = time.time() - t1

# Display results

m_log_alphas = -np.log10(model.alphas_)

pl.figure()

ymin, ymax = 2300, 3800

pl.plot(m_log_alphas, model.mse_path_, ':')

pl.plot(m_log_alphas, model.mse_path_.mean(axis=-1), 'k',

label='Average across the folds', linewidth=2)

pl.axvline(-np.log10(model.alpha_), linestyle='--', color='k',

label='alpha: CV estimate')

pl.legend()

pl.xlabel('-log(alpha)')

pl.ylabel('Mean square error')

pl.title('Mean square error on each fold: coordinate descent '

'(train time: %.2fs)' % t_lasso_cv)

pl.axis('tight')

pl.ylim(ymin, ymax)

# LassoLarsCV: least angle regression

# Compute paths

print("Computing regularization path using the Lars lasso...")

t1 = time.time()

model = LassoLarsCV(cv=20).fit(X, y)

t_lasso_lars_cv = time.time() - t1

# Display results

m_log_alphas = -np.log10(model.cv_alphas_)

pl.figure()

pl.plot(m_log_alphas, model.mse_path_, ':')  # cv_mse_path_ was renamed mse_path_ in later scikit-learn releases

pl.plot(m_log_alphas, model.mse_path_.mean(axis=-1), 'k',

label='Average across the folds', linewidth=2)

pl.axvline(-np.log10(model.alpha_), linestyle='--', color='k',

label='alpha CV')

pl.legend()

pl.xlabel('-log(alpha)')

pl.ylabel('Mean square error')

pl.title('Mean square error on each fold: Lars (train time: %.2fs)'

% t_lasso_lars_cv)

pl.axis('tight')

pl.ylim(ymin, ymax)

pl.show()

1.4. Elastic Net

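ElasticNet blends the lasso and ridge regression: its penalty is a trade-off between the L1 norm and the L2 norm. The script below compares the regularization paths of Lasso and Elastic Net and plots them.

print(__doc__)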

# Author: Alexandre Gramfort 

# License: BSD Style.

import numpy as np

import matplotlib.pyplot as pl

from sklearn.linear_model import lasso_path, enet_path

from sklearn import datasets

diabetes = datasets.load_diabetes()

X = diabetes.data

y = diabetes.target

X /= X.std(0)  # Standardize data (easier to set the l1_ratio parameter)

# Compute paths

eps = 5e-3  # the smaller it is the longer is the path

print("Computing regularization path using the lasso...")

# In current scikit-learn, lasso_path and enet_path return (alphas, coefs, dual_gaps),

# with coefs of shape (n_features, n_alphas), so each result is transposed for plotting.

alphas_lasso, coefs_lasso, _ = lasso_path(X, y, eps=eps)

coefs_lasso = coefs_lasso.T  # one row per alpha

print("Computing regularization path using the positive lasso...")

alphas_positive_lasso, coefs_positive_lasso, _ = lasso_path(X, y, eps=eps, positive=True)  # lasso path

coefs_positive_lasso = coefs_positive_lasso.T

print("Computing regularization path using the elastic net...")

alphas_enet, coefs_enet, _ = enet_path(X, y, eps=eps, l1_ratio=0.8)

coefs_enet = coefs_enet.T

print("Computing regularization path using the positive elastic net...")

alphas_positive_enet, coefs_positive_enet, _ = enet_path(X, y, eps=eps, l1_ratio=0.8, positive=True)

coefs_positive_enet = coefs_positive_enet.T

# Display results

pl.figure(1)

ax = pl.gca()

ax.set_prop_cycle(color=2 * ['b', 'r', 'g', 'c', 'k'])

l1 = pl.plot(coefs_lasso)

l2 = pl.plot(coefs_enet, linestyle='--')

pl.xlabel('-Log(lambda)')

pl.ylabel('weights')

pl.title('Lasso and Elastic-Net Paths')

pl.legend((l1[-1], l2[-1]), ('Lasso', 'Elastic-Net'), loc='lower left')

pl.axis('tight')

pl.figure(2)

ax = pl.gca()

ax.set_prop_cycle(color=2 * ['b', 'r', 'g', 'c', 'k'])

l1 = pl.plot(coefs_lasso)

l2 = pl.plot(coefs_positive_lasso, linestyle='--')

pl.xlabel('-Log(lambda)')

pl.ylabel('weights')

pl.title('Lasso and positive Lasso')

pl.legend((l1[-1], l2[-1]), ('Lasso', 'positive Lasso'), loc='lower left')

pl.axis('tight')

pl.figure(3)

ax = pl.gca()

ax.set_prop_cycle(color=2 * ['b', 'r', 'g', 'c', 'k'])

l1 = pl.plot(coefs_enet)

l2 = pl.plot(coefs_positive_enet, linestyle='--')

pl.xlabel('-Log(lambda)')

pl.ylabel('weights')

pl.title('Elastic-Net and positive Elastic-Net')

pl.legend((l1[-1], l2[-1]), ('Elastic-Net', 'positive Elastic-Net'),

          loc='lower left')

pl.axis('tight')

pl.show()
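
The trade-off between the L1 and L2 penalties mentioned at the start of this section is controlled by the l1_ratio parameter. The sketch below is a minimal illustration on the diabetes data; alpha=0.5 and the l1_ratio values are arbitrary assumptions. The objective quoted in the comment follows the scikit-learn documentation as I understand it.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso

# Objective minimized by ElasticNet (per the scikit-learn docs):
#   1 / (2 * n_samples) * ||y - Xw||^2_2
#   + alpha * l1_ratio * ||w||_1
#   + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2
X, y = load_diabetes(return_X_y=True)
alpha = 0.5

# With l1_ratio=1.0 the L2 term vanishes, so the fit matches Lasso
enet_l1 = ElasticNet(alpha=alpha, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=alpha).fit(X, y)
print(np.allclose(enet_l1.coef_, lasso.coef_))

# Lower l1_ratio moves weight from the L1 term to the L2 term
for l1_ratio in (1.0, 0.5, 0.1):
    enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X, y)
    print(l1_ratio, "nonzero coefficients:", np.sum(enet.coef_ != 0))
```

Setting l1_ratio close to 0 makes the model behave more like ridge regression, although the documentation recommends Ridge itself for a pure L2 penalty.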

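1.5. Multi-task Lasso

Multi-task Lasso is essentially the lasso for multi-response (multivariate) regression. The tricky part of extending the lasso to multi-response regression is how to set up the penalty term. Details are omitted here.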
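
For readers who want to try it anyway, here is a minimal sketch using the MultiTaskLasso class on synthetic data. The data shapes, the shared set of three relevant features, and alpha=0.5 are all assumptions for illustration. The penalty in this class acts on whole groups of coefficients (one group per feature, across tasks), so a feature is selected or dropped for all responses together.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.RandomState(0)
n_samples, n_features, n_tasks = 60, 15, 3
X = rng.randn(n_samples, n_features)
# All tasks depend on the same three features, with different weights
W = np.zeros((n_tasks, n_features))
W[:, :3] = rng.randn(n_tasks, 3) + 2.0
Y = X @ W.T + 0.1 * rng.randn(n_samples, n_tasks)

model = MultiTaskLasso(alpha=0.5).fit(X, Y)
print(model.coef_.shape)  # (n_tasks, n_features)

# For each feature: is it used by any task, and is it used by every task?
kept_any = np.any(model.coef_ != 0, axis=0)
kept_all = np.all(model.coef_ != 0, axis=0)
print(kept_any)
```

Comparing kept_any with kept_all shows the joint selection: the two masks agree when every feature is either used by all tasks or by none.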

1.6. Least angle regression

LassoLars provides an interface that solves the lasso with the LARS algorithm. The script below shows how to plot the lasso path with LARS.

print(__doc__)

# Author: Fabian Pedregosa 

#         Alexandre Gramfort 

# License: BSD Style.

import numpy as np

import matplotlib.pyplot as pl

from sklearn import linear_model

from sklearn import datasets

diabetes = datasets.load_diabetes()

X = diabetes.data

y = diabetes.target

print("Computing regularization path using the LARS ...")

alphas, _, coefs = linear_model.lars_path(X, y, method='lasso', verbose=True)  # solution path of the LARS algorithm

xx = np.sum(np.abs(coefs.T), axis=1)

xx /= xx[-1]

pl.plot(xx, coefs.T)

ymin, ymax = pl.ylim()

pl.vlines(xx, ymin, ymax, linestyle='dashed')

pl.xlabel('|coef| / max|coef|')

pl.ylabel('Coefficients')

pl.title('LASSO Path')

pl.axis('tight')

pl.show()
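
The script above uses the lars_path function, while the text names the LassoLars estimator. A minimal sketch of the estimator interface follows; the alpha values are arbitrary assumptions chosen to show how the number of selected features changes.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoLars

X, y = load_diabetes(return_X_y=True)

# alpha plays the same role as in Lasso: larger values tend to give sparser models
for alpha in (0.01, 0.1, 1.0):
    model = LassoLars(alpha=alpha).fit(X, y)
    print(alpha, "nonzero coefficients:", np.sum(model.coef_ != 0))
```

LassoLars follows the usual fit/predict interface, so it can stand in wherever Lasso is used.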

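Logistic regression

Logistic regression is a linear classifier. The LogisticRegression class implements it, with support for both L1-norm and L2-norm penalties. The script below is an example that splits 8x8-pixel digit images into two classes, with digits 0-4 in one class and 5-9 in the other, and compares the L1 and L2 penalties at different values of C.
The script is as follows:

print(__doc__)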

# Authors: Alexandre Gramfort

#          Mathieu Blondel

#          Andreas Mueller

# License: BSD Style.

import numpy as np

import matplotlib.pyplot as pl

from sklearn.linear_model import LogisticRegression

from sklearn import datasets

from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()

X, y = digits.data, digits.target

X = StandardScaler().fit_transform(X)

# classify small against large digits

y = (y > 4).astype(int)  # np.int was removed from recent NumPy

# Set regularization parameter

for i, C in enumerate(10. ** np.arange(1, 4)):

    # turn down tolerance for short training time

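    # the L1 penalty needs a solver that supports it, such as liblinear

    clf_l1_LR = LogisticRegression(C=C, penalty='l1', tol=0.01, solver='liblinear')

    clf_l2_LR = LogisticRegression(C=C, penalty='l2', tol=0.01, solver='liblinear')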

    clf_l1_LR.fit(X, y)

    clf_l2_LR.fit(X, y)

    coef_l1_LR = clf_l1_LR.coef_.ravel()

    coef_l2_LR = clf_l2_LR.coef_.ravel()

    # coef_l1_LR contains zeros due to the

    # L1 sparsity inducing norm

    sparsity_l1_LR = np.mean(coef_l1_LR == 0) * 100

    sparsity_l2_LR = np.mean(coef_l2_LR == 0) * 100

    print("C=%d" % C)

    print("Sparsity with L1 penalty: %.2f%%" % sparsity_l1_LR)

    print("score with L1 penalty: %.4f" % clf_l1_LR.score(X, y))

    print("Sparsity with L2 penalty: %.2f%%" % sparsity_l2_LR)

    print("score with L2 penalty: %.4f" % clf_l2_LR.score(X, y))

    l1_plot = pl.subplot(3, 2, 2 * i + 1)

    l2_plot = pl.subplot(3, 2, 2 * (i + 1))

    if i == 0:

        l1_plot.set_title("L1 penalty")

        l2_plot.set_title("L2 penalty")

    l1_plot.imshow(np.abs(coef_l1_LR.reshape(8, 8)), interpolation='nearest',

                   cmap='binary', vmax=1, vmin=0)

    l2_plot.imshow(np.abs(coef_l2_LR.reshape(8, 8)), interpolation='nearest',

                   cmap='binary', vmax=1, vmin=0)

    pl.text(-8, 3, "C = %d" % C)

    l1_plot.set_xticks(())

    l1_plot.set_yticks(())

    l2_plot.set_xticks(())

    l2_plot.set_yticks(())

pl.show()
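
To see in what sense logistic regression is a linear classifier, the sketch below recomputes the model's output by hand on the same binary digits task: a linear score w.x + b passed through the sigmoid function. The settings C=1.0 and max_iter=1000 are assumptions made so that the default solver converges; they are not taken from the script above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)
y = (y > 4).astype(int)  # digits 0-4 versus 5-9

clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)

# Linear score w.x + b for each sample
scores = X @ clf.coef_.ravel() + clf.intercept_[0]
# Passing the score through the sigmoid gives the probability of class 1
proba_manual = 1.0 / (1.0 + np.exp(-scores))

print(np.allclose(scores, clf.decision_function(X)))
print(np.allclose(proba_manual, clf.predict_proba(X)[:, 1]))
```

The decision boundary is the set of points where the score is 0, which is a hyperplane in the 64-dimensional pixel space.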

Other models

The official user guide also covers the perceptron, the Passive Aggressive algorithms, and more. They are omitted here.

Python machine learning: linear models