Classification

It's not a good idea to use linear regression for classification problems.

We can use the logistic regression algorithm, which is a classification algorithm.

To keep \(0\le h_{\theta}(x) \le 1\), we just need the sigmoid function (also called the logistic function):

\[\large h_\theta(x) = g(\theta^Tx), \quad\text{where}\;g(z) =\frac{1}{1+e^{-z}}
\]
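A minimal Octave sketch of this function (the function name sigmoid is an illustrative choice, not code from the course):

function g = sigmoid(z)
  % element-wise logistic function g(z) = 1 / (1 + e^(-z))
  g = 1 ./ (1 + exp(-z));   % works for scalars, vectors and matrices
end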

The meaning of \(h_\theta(x)\): \(h_\theta(x)\) = estimated probability that \(y = 1\) on input \(x\).

Note: when \(z=0\), \(g(z)\) equals exactly 0.5.

Decision Boundary

\(h_\theta(x) = P(y=1 \mid x;\theta)\) (where \(P\) denotes the predicted probability)

In the example from the lecture: if \(h_\theta(x) \ge 0.5\), predict \(y=1\); else predict \(y=0\).

Suppose \(\theta = \begin{bmatrix}-3\\ 1\\ 1 \end{bmatrix}\); then \(h_\theta(x)=g(-3+x_1+x_2)\).

Since "\(y=1\)" corresponds to \(h_\theta(x) \ge 0.5\), i.e. \(\theta^Tx \ge 0\), i.e. \(-3+x_1+x_2 \ge 0\),

we get: predict "\(y=1\)" exactly when \(x_1+x_2 \ge 3\).

Whether \(x_1+x_2\) is above or below \(3\) determines the value of \(y\); the line \(x_1+x_2 = 3\) is the decision boundary.
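As a quick numerical check of this boundary, a hedged Octave snippet (the sample point \((x_1,x_2)=(2,2)\) is made up for illustration):

g = @(z) 1 ./ (1 + exp(-z));   % sigmoid as an anonymous function
theta = [-3; 1; 1];
x = [1; 2; 2];                 % x0 = 1, and x1 + x2 = 4 >= 3
h = g(theta' * x);             % h = g(1), roughly 0.73, which is >= 0.5
prediction = (h >= 0.5);       % so predict y = 1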

Extending to non-linear decision boundaries:

We can also have: Predict "\(y=1\)" if \(-1+x_1^2+x_2^2 \ge 0\) (with \(\theta = \begin{bmatrix}-1\\ 0\\ 0 \\ 1\\ 1 \end{bmatrix},\;x = \begin{bmatrix}x_0\\ x_1\\ x_2\\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix}1\\ x_1\\ x_2\\ x_1^2 \\ x_2^2 \end{bmatrix}\))

Different choices of \(\theta\) and different constructions of \(x\) produce decision boundaries of various shapes.

The decision boundary is determined by the choice of the parameters \(\theta\), not directly by the training set.

We use the training set to fit the parameters \(\theta\).

Cost Function

\[\begin{align} &J(\theta) =\frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)})\end{align}
\]

In the earlier linear regression, the cost term used was \(Cost(h_\theta(x),y) = \frac{1}{2}(h_\theta(x)-y)^2\).

But that choice does not carry over: when the hypothesis function \(h_\theta(x)\) is no longer linear, using \(Cost(h_\theta(x),y) = \frac{1}{2}(h_\theta(x)-y)^2\) makes \(J(\theta)\) have many local optima instead of being the convex function we want.

Logistic Regression Cost Function

\[Cost(h_\theta(x),y) = \begin{cases}
\begin{align}
{-log(h_\theta(x))} &\quad\text{ if $y$ = 1} \\
{-log(1-h_\theta(x))} &\quad \text{ if $y$ = 0}
\end{align}
\end{cases}
\]

When \(h_\theta(x)=y\), \(Cost(h_\theta(x),y)=0\);

when \(y=1\) and \(h_\theta(x)\rightarrow0\), \(Cost \rightarrow \infty\) (at that point \(\theta^Tx \rightarrow -\infty\));

when \(y=0\) and \(h_\theta(x)\rightarrow1\), \(Cost \rightarrow \infty\) (at that point \(\theta^Tx \rightarrow \infty\)).

This guarantees that adjusting \(\theta\) pushes \(h_\theta(x)\) toward \(y\), i.e. the predictions agree better with the data.

The \(Cost\) function above can also be written as:

\[Cost(h_\theta(x),y) = -y\cdot log(h_\theta(x))-(1-y)\cdot log(1-h_\theta(x))
\]

This is equivalent to the case-by-case form above, since \(y\) is always either 0 or 1.

Therefore:

\[\begin{align} J(\theta) &=\frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)})\\
&= -\frac{1}{m}[\sum_{i=1}^{m}y^{(i)}\cdot log(h_\theta(x^{(i)}))+(1-y^{(i)})\cdot log(1-h_\theta(x^{(i)}))]
\end{align}
\]
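A vectorized Octave sketch of this \(J(\theta)\), assuming X is the m×(n+1) design matrix (with a leading column of ones), y is the m×1 label vector, and theta is (n+1)×1:

m = length(y);                                        % number of training examples
h = 1 ./ (1 + exp(-X * theta));                       % h_theta(x^(i)) for every example at once
J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));   % the cost J(theta)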

The general form of the Gradient Descent algorithm is still the same as for linear regression (of course, once \(h_\theta(x)\) is expanded it is no longer the same expression):

\[\begin{align}&\text{Repeat\{} \\ &\qquad\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\\ &\} \end{align}
\]
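One iteration of this update, vectorized in Octave (alpha, X, y and theta are assumed to be defined as above):

m = length(y);
h = 1 ./ (1 + exp(-X * theta));        % hypothesis for all examples
grad = (1/m) * X' * (h - y);           % (n+1) x 1 vector of partial derivatives
theta = theta - alpha * grad;          % simultaneous update of every theta_j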

Other Optimization Algorithms

  • Conjugate gradient method
  • BFGS(Broyden–Fletcher–Goldfarb–Shanno algorithm)
  • L-BFGS( Limited-memory BFGS)

Advantages:

  • No need to manually pick \(\alpha\)
  • Often faster than gradient descent

Disadvantages:

  • More complex

Writing these yourself is not recommended, but... you can simply call a library (for example Octave's fminunc, used below).

% Template: define a function that returns the cost in 'jVal' and the partial derivatives in 'gradient'
function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute J(theta)];
  gradient = zeros(n+1, 1);
  gradient(1) = [code to compute ∂J(theta)/∂theta_0];   % Octave indexing starts at 1, so theta_0 lives in gradient(1)
  gradient(2) = [code to compute ∂J(theta)/∂theta_1];
  ...
  gradient(n+1) = [code to compute ∂J(theta)/∂theta_n];
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
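For concreteness, a sketch of what the cost function could look like for (unregularized) logistic regression; the name logisticCost and the anonymous-function wrapper that passes in X and y are assumptions about how the script is organized, not code from the course:

function [jVal, gradient] = logisticCost(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                          % sigmoid hypothesis
  jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));   % J(theta)
  gradient = (1/m) * X' * (h - y);                         % all partial derivatives at once
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(size(X, 2), 1);
[optTheta, costVal] = fminunc(@(t) logisticCost(t, X, y), initialTheta, options);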

Multiclass Classification

Use the one-vs-all (one-vs-rest) approach.

For each class, split the data into two groups, "this class" versus "all the remaining classes combined", then fit a classifier for this class using the classification method from the earlier lectures.

(a classifier is just a hypothesis)

In the end we get \(n\) classifiers, where \(n\) is the total number of classes and \(y\) is the class label:

\[h_\theta^{(i)}(x) = P(y=i|x;\theta)\qquad (i=1,2,3,\dots,n)
\]

That is, given \(x\) and \(\theta\), \(h_\theta^{(i)}(x)\) computes the probability that the class is \(i\).

Then, when a new input \(x\) arrives, the prediction is the class \(i\) that maximizes \(h_\theta^{(i)}(x)\), i.e. \(\max_i\, h_\theta^{(i)}(x)\).
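A hedged Octave sketch of this prediction step, assuming all_theta is a matrix whose i-th row holds the fitted \(\theta\) for class \(i\), and x is an (n+1)×1 feature vector with x(1) = 1 (both names are illustrative):

probs = 1 ./ (1 + exp(-all_theta * x));    % h_theta^(i)(x) for every class i
[maxProb, prediction] = max(probs);        % prediction = the class with the highest probability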

Regularization

Regularization addresses the problem of overfitting; another term for this problem is high variance.

It is caused by having too many features plus too little training data.

If we have too many features, the learned hypothesis may fit the training set very well (\(J(\theta) \approx 0\)) but fail to generalize to new examples.

Generalize: how well a hypothesis applies even to new examples

Options to address overfitting:

  • Reduce number of features:

    • Manually select which features to keep
    • Model selection algorithm
  • Regularization:
    • Keep all the features, but reduce the magnitude/values of the parameters \(\theta_j\)
    • Works well when we have a lot of features, each of which contributes a bit to predicting \(y\)

Regularized Linear Regression

The idea behind regularization:

Small values for parameters \(\theta_0, \theta_1,\dots,\theta_n\):

  • "Simpler" hypothesis
  • Less prone to overfitting

That is, make the overly influential \(\theta_j\) very small, e.g. \(\theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4 \approx \theta_0 + \theta_1x + \theta_2x^2\) when \(\theta_3, \theta_4 \approx 0\).

Gradient Descent

However, this regularization is not applied inside \(h_\theta(x)\); it is applied in the cost function \(J(\theta)\):

\[\large J(\theta) =\frac{1}{2m} [\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2 + \lambda\sum_{j=1}^{n}\theta_j^2 ]
\]

Note that the added term (called the regularization term) starts from \(j=1\), so it shrinks every parameter except \(\theta_0\). \(\lambda\) is called the regularization parameter; it controls the trade-off between two different goals.

The two \(\sum\) terms in this cost function represent those two goals:

  • Fit the training data well
  • Keep the parameter values small

Smaller parameter values give a simpler hypothesis and are thus less prone to overfitting.

Note: \(\lambda\) must not be too large, otherwise \(\theta_1,\dots,\theta_n \approx 0\) and the hypothesis fails to fit even the training set (too high bias, i.e. underfitting).

\[\begin{align}
&\text{repeat until convergence}\{\qquad\qquad\qquad\qquad\qquad\\
&\qquad \theta_{0}\; \text{:= } \theta_{0} - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\
&\qquad \theta_{j}\; \text{:= } \theta_{j} - \alpha[\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j] \qquad (j = 1,2...,n)\\
&\}
\end{align}
\]

Equivalently:

\[\begin{align}
&\text{repeat until convergence}\{\qquad\qquad\qquad\qquad\qquad\\
&\qquad \theta_{0}\; \text{:= } \theta_{0} - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\
&\qquad \theta_{j}\; \text{:= } \theta_{j}(1-\alpha\frac{\lambda}{m}) - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}\qquad (j = 1,2...,n)\\
&\}
\end{align}
\]
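One regularized update written out in Octave, matching the update rule above (X, y, theta, alpha and lambda assumed given; note that theta_0 lives in theta(1) because Octave indexing starts at 1):

m = length(y);
h = X * theta;                             % linear regression hypothesis
grad = (1/m) * X' * (h - y);               % unregularized gradient
reg = (lambda/m) * theta;
reg(1) = 0;                                % do not shrink theta_0
theta = theta - alpha * (grad + reg);      % simultaneous update of every theta_j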

Normal Equation

Review: the normal equation from earlier was \(\theta = (X^TX)^{-1}X^Ty\).

With regularization it becomes \(\theta = (X^TX+\lambda \small{\begin{bmatrix}0 \\&1 \\ &&1\\&&&\ddots\\&&&&1 \end{bmatrix}})^{-1}X^Ty,\quad \text{if }\lambda \gt 0\), where the matrix is \((n+1)\times(n+1)\).

For the non-invertible/degenerate matrix issue, Octave's pinv() can still be used to take the pseudo-inverse.

But as long as \(\lambda\) is strictly greater than 0, the matrix sum inside the parentheses can be proven to be invertible.
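In Octave this might look like the following sketch (X, y and lambda assumed given):

n1 = size(X, 2);                           % n + 1 columns, including the bias column of ones
L = eye(n1);
L(1, 1) = 0;                               % zero out the (1,1) entry so theta_0 is not regularized
theta = pinv(X' * X + lambda * L) * X' * y;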

Regularized Logistic Regression

Review: \(J(\theta) = -\frac{1}{m}[\sum_{i=1}^{m}y^{(i)}\, log\,h_\theta(x^{(i)})+(1-y^{(i)})\, log\,(1-h_\theta(x^{(i)}))]\)

The treatment is the same as for linear regression: add the regularization term \(\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2\) at the end of the expression.

\[J(\theta) = -\frac{1}{m}[\sum_{i=1}^{m}y^{(i)}\, log\,h_\theta(x^{(i)})+(1-y^{(i)})\, log\,(1-h_\theta(x^{(i)}))] + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2
\]

Gradient descent (the general form is the same as for linear regression; the only difference, again, is in \(h_\theta(x^{(i)})\)):

\[\begin{align}
&\text{repeat until convergence}\{\qquad\qquad\qquad\qquad\qquad\\
&\qquad \theta_{0}\; \text{:= } \theta_{0} - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\
&\qquad \theta_{j}\; \text{:= } \theta_{j} - \alpha[\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j] \qquad (j = 1,2...,n)\\
&\}
\end{align}
\]

In Octave the earlier code template still works; just remember to include the partial derivative of the regularization term when computing \(\frac{\partial J(\theta)}{\partial \theta_j}\;(\small j=1,2,\dots,n)\) (a concrete sketch follows the template below).

% Template again: each partial derivative for j >= 1 must now include the regularization term (lambda/m)*theta_j
function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute the regularized J(theta)];
  gradient = zeros(n+1, 1);
  gradient(1) = [code to compute ∂J(theta)/∂theta_0];   % theta_0 gets no regularization term
  gradient(2) = [code to compute ∂J(theta)/∂theta_1];
  ...
  gradient(n+1) = [code to compute ∂J(theta)/∂theta_n];
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
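For concreteness, a hedged sketch of the regularized cost function itself (the name logisticCostReg is illustrative; the lambda terms are the only change from the unregularized version):

function [jVal, gradient] = logisticCostReg(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                           % sigmoid hypothesis
  regTheta = theta;
  regTheta(1) = 0;                                          % theta_0 is not regularized
  jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...
         + (lambda/(2*m)) * sum(regTheta .^ 2);             % add (lambda/2m) * sum_j theta_j^2
  gradient = (1/m) * X' * (h - y) + (lambda/m) * regTheta;  % regularized partial derivatives
end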
