Ridge Regression and Ridge Regression Kernel
Reference:
1. scikit-learn linear_model documentation: Ridge regression
2. Machine Learning for Quantum Mechanics in a Nutshell
3. scikit-learn example plot_ridge_path.py, by Fabian Pedregosa
Ridge regression
Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of coefficients. The ridge coefficients minimize a penalized residual sum of squares:
\[\min_{w} \left\| Xw - y \right\|_2^{2} + \lambda \left\| w \right\|_2^2\]
Here, \(\lambda \ge 0\) is a complexity parameter that controls the amount of shrinkage: the larger the value of \(\lambda\), the greater the amount of shrinkage, and thus the coefficients become more robust to collinearity. The figure below shows the relationship between \(\lambda\) and the oscillations of the weights.
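As a small illustration of this shrinkage effect, here is a sketch in the spirit of the scikit-learn ridge-path example in reference 3; the Hilbert-matrix data and the \(\lambda\) grid are illustrative choices, not prescribed values:

```python
# Sketch: ridge coefficient paths on an ill-conditioned design matrix
# (in the spirit of scikit-learn's plot_ridge_path example).
import numpy as np
from sklearn.linear_model import Ridge

# Hilbert matrix: its columns are nearly collinear, so unregularized
# least-squares coefficients oscillate wildly.
X = 1.0 / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])
y = np.ones(10)

lambdas = np.logspace(-10, -2, 200)   # scikit-learn calls this parameter `alpha`
coefs = []
for lam in lambdas:
    model = Ridge(alpha=lam, fit_intercept=False)
    model.fit(X, y)
    coefs.append(model.coef_)

# Larger lambda -> stronger shrinkage -> smaller coefficient norm.
print(np.linalg.norm(coefs[0]), np.linalg.norm(coefs[-1]))
```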
Kernel Ridge Regression Theory
Here \(\tilde{x}\) denotes a test input and \(x_i\) denotes the \(i\)-th training sample:
\[f(\tilde{x}) = \sum_{i=1}^n \alpha_i k(\tilde{x}, x_i)\]
Although the dimensionality of the Hilbert space can be high, the solution lives in the finite span of the projected training data, enabling a finite representation. The corresponding convex optimization problem is:
\[\underset{\alpha \in \mathbb{R}^{n}}{\mathrm{argmin}} \sum_{i=1}^n \left( f(x_i) - y_i \right)^2 + \lambda \left\| f \right\|_{H}^{2}\]
\[\Leftrightarrow \underset{\alpha \in \mathbb{R}^{n}}{\mathrm{argmin}} \left\langle K\alpha - y, K\alpha - y \right\rangle + \lambda \alpha^{T} K \alpha\]
Here \(\left\| f \right\|_{H}^{2}\) is the squared norm of \(f\) in the Hilbert space, which measures the complexity of the linear ridge regression model in feature space, and \(K \in \mathbb{R}^{n\times n}\), \(K_{i,j}=k(x_i, x_j)\), is the kernel matrix between training samples. As before, setting the gradient to zero yields an analytic solution for the regression coefficients:
\[\nabla_{\alpha}\left(\alpha^{T}K^{2}\alpha - 2\alpha^{T}Ky + y^{T}y + \lambda\alpha^{T}K\alpha\right) = 0 \Leftrightarrow K^{2}\alpha + \lambda K\alpha = Ky\]
\[\Leftrightarrow \alpha=(K+\lambda I)^{-1}y\]
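A minimal NumPy sketch of this closed-form solution, assuming a Gaussian kernel \(k(x, x') = \exp(-\left\|x - x'\right\|^2 / (2\sigma^2))\); the toy data and the values of \(\sigma\) and \(\lambda\) are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Pairwise Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

# Toy problem (illustrative): learn y = cos(x) from a handful of samples.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 2.0 * np.pi, size=(20, 1))
y_train = np.cos(x_train).ravel()

sigma, lam = 1.0, 1e-8                                # assumed hyperparameters
K = gaussian_kernel(x_train, x_train, sigma)          # n x n kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)   # alpha = (K + lambda I)^{-1} y
```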
where \(\lambda\) is a hyperparameter that determines the strength of regularization. A small norm \(\left\| f \right\|_{H}\) (equivalently, a small \(\alpha^{T}K\alpha\)) corresponds to a smoother, simpler model.
The figure below gives an example of a KRR model with a Gaussian kernel and demonstrates the role of the length-scale hyperparameter \(\sigma\). Although \(\sigma\) does not appear in the regularization term directly, it controls the smoothness of the predictor and therefore effectively regularizes as well.
The figure shows kernel ridge regression with a Gaussian kernel at different length scales when learning \(\cos(x)\); the KRR models are drawn as dashed lines. With the regularization constant \(\lambda\) set to \(10^{-14}\), a very small \(\sigma\) fits the training set well but gives very large errors between training points, while a too-large \(\sigma\) yields a nearly linear model with high training and prediction error.
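A compact way to reproduce this qualitative behaviour (a sketch, not the original figure code; the sampled data and the length-scale grid are assumptions) is scikit-learn's `KernelRidge` with an RBF kernel, where `gamma` corresponds to \(1/(2\sigma^2)\):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0.0, 10.0, size=(15, 1)), axis=0)
y_train = np.cos(x_train).ravel()
x_between = np.linspace(0.0, 10.0, 200).reshape(-1, 1)   # points between training samples

for sigma in (0.1, 1.0, 10.0):                            # too small / moderate / too large
    model = KernelRidge(alpha=1e-14, kernel="rbf", gamma=1.0 / (2.0 * sigma**2))
    model.fit(x_train, y_train)
    train_mse = np.mean((model.predict(x_train) - y_train) ** 2)
    between_mse = np.mean((model.predict(x_between) - np.cos(x_between).ravel()) ** 2)
    print(f"sigma={sigma:5.1f}  train MSE={train_mse:.2e}  in-between MSE={between_mse:.2e}")
```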
From the description above we can see that all the information about a KRR model is contained in the matrix \(\mathbf K\) of kernel evaluations between training data. Similarly, all the information required to predict new inputs \(\tilde{x}\) is contained in the kernel matrix of training versus prediction data.
Kernel Ridge Regression
The regression coefficients \(\alpha\) are obtained by solving the linear system of equations \((\mathbf K + \lambda \mathbf I)\alpha=\mathbf y\), where \((\mathbf K + \lambda \mathbf I)\) is symmetric and strictly positive definite. To solve this system we can use the Cholesky decomposition \(\mathbf K + \lambda \mathbf I = \mathbf U^{T}\mathbf U\), where \(\mathbf U\) is upper triangular. One then breaks the equation \(\mathbf{U^{T}U\alpha=y}\) into two equations: first \(\mathbf{U^{T}\beta=y}\), and then \(\mathbf{U\alpha=\beta}\). Since \(\mathbf{U^{T}}\) is lower triangular and \(\mathbf{U}\) is upper triangular, this requires only two straightforward passes over the data, called forward and backward substitution, respectively. For \(\mathbf{U^{T}\beta=y}\) the forward pass looks like this:
\[(U^{T})_{1,1}\,\beta_1 = y_1 \Leftrightarrow \beta_1 = y_1 / u_{1,1}\]
\[(U^{T})_{2,1}\,\beta_1 + (U^{T})_{2,2}\,\beta_2 = y_2 \Leftrightarrow \beta_2 = (y_2 - u_{1,2}\beta_1)/u_{2,2}\]
\[\vdots\]
\[\sum_{j=1}^{i} (U^{T})_{i,j}\,\beta_j = y_i \Leftrightarrow \beta_i = \left(y_i - \sum_{j=1}^{i-1} u_{j,i}\beta_j\right)/u_{i,i}\]
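A sketch of the whole two-pass solve, assuming SciPy is available; the toy kernel matrix and \(\lambda\) below are illustrative:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

# Illustrative training data and Gaussian kernel matrix (sigma = 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
lam = 1e-3

sq_dists = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
K = np.exp(-sq_dists / 2.0)

U = cholesky(K + lam * np.eye(len(K)))          # K + lambda*I = U^T U, U upper triangular
beta = solve_triangular(U.T, y, lower=True)     # forward substitution:  U^T beta = y
alpha = solve_triangular(U, beta, lower=False)  # backward substitution: U alpha = beta

# Same result as a direct solve of (K + lambda*I) alpha = y.
assert np.allclose(alpha, np.linalg.solve(K + lam * np.eye(len(K)), y))
```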
Once the model is trained, predictions can be made: the prediction for a new input \(\mathbf{\tilde{x}}\) is the inner product between the vector of coefficients and the vector of corresponding kernel evaluations.
For a test dataset \(\tilde{\mathbf X} \in \mathbb{R}^{m \times d}\) with rows \(\tilde{x}_1, \dots, \tilde{x}_m\), let \(\mathbf L \in \mathbb{R}^{n \times m}\) be the kernel matrix of training versus prediction inputs, \(L_{i,j}=k(x_i, \tilde{x}_j)\); the predictions are then \(\tilde{\mathbf y} = \mathbf L^{T}\alpha\).
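A minimal end-to-end sketch of this prediction step (kernel, data, and hyperparameters are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Pairwise Gaussian kernel between the rows of A and the rows of B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 2.0 * np.pi, size=(30, 1))      # n training inputs
y_train = np.cos(x_train).ravel()
x_test = np.linspace(0.0, 2.0 * np.pi, 5).reshape(-1, 1)   # m prediction inputs
lam = 1e-6

K = gaussian_kernel(x_train, x_train)                        # n x n training kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)   # training step
L = gaussian_kernel(x_train, x_test)                         # n x m, L[i, j] = k(x_i, x_tilde_j)
y_pred = L.T @ alpha                                         # each prediction is an inner product
print(np.c_[y_pred, np.cos(x_test).ravel()])                 # predictions vs. true cos(x_tilde)
```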
This method has the same order of complexity as Ordinary Least Squares. In the next note I will go into the details of implementing KRR.