Classification

It's not a good idea to use linear regression for classification problems.

We can use the logistic regression algorithm, which is a classification algorithm.

To keep \(0\le h_{\theta}(x) \le 1\), we just need the sigmoid function (also called the logistic function):

\[\large h_\theta(x) = g(\theta^Tx), \quad\text{where}\;g(z) =\frac{1}{1+e^{-z}}
\]
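A minimal Octave sketch of this function (the function name sigmoid is an illustrative choice, not code from the course):

function g = sigmoid(z)
  % element-wise logistic function g(z) = 1 / (1 + e^(-z))
  g = 1 ./ (1 + exp(-z));   % works for scalars, vectors and matrices
end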

The meaning of \(h_\theta(x)\): \(h_\theta(x)\) = estimated probability that \(y = 1\) on input \(x\).

Note: when \(z=0\), \(g(z)\) equals exactly 0.5.

Decision Boundary

\(h_\theta(x) = P(y=1 \mid x;\theta)\) (where \(P\) denotes the predicted probability)

In the example from the lecture: if \(h_\theta(x) \ge 0.5\), predict \(y=1\); else predict \(y=0\).

Suppose \(\theta = \begin{bmatrix}-3\\ 1\\ 1 \end{bmatrix}\); then \(h_\theta(x)=g(-3+x_1+x_2)\).

Since "\(y=1\)" corresponds to \(h_\theta(x) \ge 0.5\), i.e. \(\theta^Tx \ge 0\), i.e. \(-3+x_1+x_2 \ge 0\),

we get: predict "\(y=1\)" exactly when \(x_1+x_2 \ge 3\).

Whether \(x_1+x_2\) is above or below \(3\) determines the value of \(y\); the line \(x_1+x_2 = 3\) is the decision boundary.
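As a quick numerical check of this boundary, a hedged Octave snippet (the sample point \((x_1,x_2)=(2,2)\) is made up for illustration):

g = @(z) 1 ./ (1 + exp(-z));   % sigmoid as an anonymous function
theta = [-3; 1; 1];
x = [1; 2; 2];                 % x0 = 1, and x1 + x2 = 4 >= 3
h = g(theta' * x);             % h = g(1), roughly 0.73, which is >= 0.5
prediction = (h >= 0.5);       % so predict y = 1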

Extending to non-linear decision boundaries:

We can also have: Predict "\(y=1\)" if \(-1+x_1^2+x_2^2 \ge 0\) (with \(\theta = \begin{bmatrix}-1\\ 0\\ 0 \\ 1\\ 1 \end{bmatrix},\;x = \begin{bmatrix}x_0\\ x_1\\ x_2\\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix}1\\ x_1\\ x_2\\ x_1^2 \\ x_2^2 \end{bmatrix}\))

Different choices of \(\theta\) and different constructions of \(x\) produce decision boundaries of various shapes.

The decision boundary is determined by the choice of the parameters \(\theta\), not directly by the training set.

We use the training set to fit the parameters \(\theta\).

Cost Function

\[\begin{align} &J(\theta) =\frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)})\end{align}
\]

In the earlier linear regression, the cost term used was \(Cost(h_\theta(x),y) = \frac{1}{2}(h_\theta(x)-y)^2\).

But that choice does not carry over: when the hypothesis function \(h_\theta(x)\) is no longer linear, using \(Cost(h_\theta(x),y) = \frac{1}{2}(h_\theta(x)-y)^2\) makes \(J(\theta)\) have many local optima instead of being the convex function we want.

Logistic Regression Cost Function

\[Cost(h_\theta(x),y) = \begin{cases}
\begin{align}
{-log(h_\theta(x))} &\quad\text{ if $y$ = 1} \\
{-log(1-h_\theta(x))} &\quad \text{ if $y$ = 0}
\end{align}
\end{cases}
\]

When \(h_\theta(x)=y\), \(Cost(h_\theta(x),y)=0\);

when \(y=1\) and \(h_\theta(x)\rightarrow0\), \(Cost \rightarrow \infty\) (at that point \(\theta^Tx \rightarrow -\infty\));

when \(y=0\) and \(h_\theta(x)\rightarrow1\), \(Cost \rightarrow \infty\) (at that point \(\theta^Tx \rightarrow \infty\)).

This guarantees that adjusting \(\theta\) pushes \(h_\theta(x)\) toward \(y\), i.e. the predictions agree better with the data.

The \(Cost\) function above can also be written as:

\[Cost(h_\theta(x),y) = -y\cdot log(h_\theta(x))-(1-y)\cdot log(1-h_\theta(x))
\]

This is equivalent to the case-by-case form above, since \(y\) is always either 0 or 1.

Therefore:

\[\begin{align} J(\theta) &=\frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)})\\
&= -\frac{1}{m}[\sum_{i=1}^{m}y^{(i)}\cdot log(h_\theta(x^{(i)}))+(1-y^{(i)})\cdot log(1-h_\theta(x^{(i)}))]
\end{align}
\]
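A vectorized Octave sketch of this \(J(\theta)\), assuming X is the m×(n+1) design matrix (with a leading column of ones), y is the m×1 label vector, and theta is (n+1)×1:

m = length(y);                                        % number of training examples
h = 1 ./ (1 + exp(-X * theta));                       % h_theta(x^(i)) for every example at once
J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));   % the cost J(theta)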

The general form of the Gradient Descent algorithm is still the same as for linear regression (of course, once \(h_\theta(x)\) is expanded it is no longer the same expression):

\[\begin{align}&\text{Repeat\{} \\ &\qquad\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\\ &\} \end{align}
\]
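One iteration of this update, vectorized in Octave (alpha, X, y and theta are assumed to be defined as above):

m = length(y);
h = 1 ./ (1 + exp(-X * theta));        % hypothesis for all examples
grad = (1/m) * X' * (h - y);           % (n+1) x 1 vector of partial derivatives
theta = theta - alpha * grad;          % simultaneous update of every theta_j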

Other Optimization Algorithms

  • Conjugate gradient method
  • BFGS(Broyden–Fletcher–Goldfarb–Shanno algorithm)
  • L-BFGS( Limited-memory BFGS)

Advantages:

  • No need to manually pick \(\alpha\)
  • Often faster than gradient descent

Disadvantages:

  • More complex

Writing these yourself is not recommended, but... you can simply call a library (for example Octave's fminunc, used below).

% Template: define a function that returns the cost in 'jVal' and the partial derivatives in 'gradient'
function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute J(theta)];
  gradient = zeros(n+1, 1);
  gradient(1) = [code to compute ∂J(theta)/∂theta_0];   % Octave indexing starts at 1, so theta_0 lives in gradient(1)
  gradient(2) = [code to compute ∂J(theta)/∂theta_1];
  ...
  gradient(n+1) = [code to compute ∂J(theta)/∂theta_n];
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
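For concreteness, a sketch of what the cost function could look like for (unregularized) logistic regression; the name logisticCost and the anonymous-function wrapper that passes in X and y are assumptions about how the script is organized, not code from the course:

function [jVal, gradient] = logisticCost(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                          % sigmoid hypothesis
  jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));   % J(theta)
  gradient = (1/m) * X' * (h - y);                         % all partial derivatives at once
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(size(X, 2), 1);
[optTheta, costVal] = fminunc(@(t) logisticCost(t, X, y), initialTheta, options);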

Multiclass Classification

Use the one-vs-all (one-vs-rest) approach.

For each class, split the data into two groups, "this class" versus "all the remaining classes combined", then fit a classifier for this class using the classification method from the earlier lectures.

(a classifier is just a hypothesis)

In the end we get \(n\) classifiers, where \(n\) is the total number of classes and \(y\) is the class label:

\[h_\theta^{(i)}(x) = P(y=i|x;\theta)\qquad (i=1,2,3,\dots,n)
\]

That is, given \(x\) and \(\theta\), \(h_\theta^{(i)}(x)\) computes the probability that the class is \(i\).

Then, when a new input \(x\) arrives, the prediction is the class \(i\) that maximizes \(h_\theta^{(i)}(x)\), i.e. \(\max_i\, h_\theta^{(i)}(x)\).
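A hedged Octave sketch of this prediction step, assuming all_theta is a matrix whose i-th row holds the fitted \(\theta\) for class \(i\), and x is an (n+1)×1 feature vector with x(1) = 1 (both names are illustrative):

probs = 1 ./ (1 + exp(-all_theta * x));    % h_theta^(i)(x) for every class i
[maxProb, prediction] = max(probs);        % prediction = the class with the highest probability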

Regularization

Regularization addresses the problem of overfitting; another term for this problem is high variance.

It is caused by having too many features plus too little training data.

If we have too many features, the learned hypothesis may fit the training set very well (\(J(\theta) \approx 0\)) but fail to generalize to new examples.

Generalize: how well a hypothesis applies even to new examples

Options to address overfitting:

  • Reduce number of features:

    • Manually select which features to keep
    • Model selection algorithm
  • Regularization:
    • Keep all the features, but reduce the magnitude/values of the parameters \(\theta_j\)
    • Works well when we have a lot of features, each of which contributes a bit to predicting \(y\)

Regularized Linear Regression

The idea behind regularization:

Small values for parameters \(\theta_0, \theta_1,\dots,\theta_n\):

  • "Simpler" hypothesis
  • Less prone to overfitting

That is, make the overly influential \(\theta_j\) very small, e.g. \(\theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4 \approx \theta_0 + \theta_1x + \theta_2x^2\) when \(\theta_3, \theta_4 \approx 0\).

Gradient Descent

However, this regularization is not applied inside \(h_\theta(x)\); it is applied in the cost function \(J(\theta)\):

\[\large J(\theta) =\frac{1}{2m} [\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2 + \lambda\sum_{j=1}^{n}\theta_j^2 ]
\]

Note that the added term (called the regularization term) starts from \(j=1\), so it shrinks every parameter except \(\theta_0\). \(\lambda\) is called the regularization parameter; it controls the trade-off between two different goals.

The two \(\sum\) terms in this cost function represent those two goals:

  • Fit the training data well
  • Keep the parameter values small

Smaller parameter values give a simpler hypothesis and are thus less prone to overfitting.

Note: \(\lambda\) must not be too large, otherwise \(\theta_1,\dots,\theta_n \approx 0\) and the hypothesis fails to fit even the training set (too high bias, i.e. underfitting).

\[\begin{align}
&\text{repeat until convergence}\{\qquad\qquad\qquad\qquad\qquad\\
&\qquad \theta_{0}\; \text{:= } \theta_{0} - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\
&\qquad \theta_{j}\; \text{:= } \theta_{j} - \alpha[\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j] \qquad (j = 1,2...,n)\\
&\}
\end{align}
\]

Equivalently:

\[\begin{align}
&\text{repeat until convergence}\{\qquad\qquad\qquad\qquad\qquad\\
&\qquad \theta_{0}\; \text{:= } \theta_{0} - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\
&\qquad \theta_{j}\; \text{:= } \theta_{j}(1-\alpha\frac{\lambda}{m}) - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}\qquad (j = 1,2...,n)\\
&\}
\end{align}
\]
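One regularized update written out in Octave, matching the update rule above (X, y, theta, alpha and lambda assumed given; note that theta_0 lives in theta(1) because Octave indexing starts at 1):

m = length(y);
h = X * theta;                             % linear regression hypothesis
grad = (1/m) * X' * (h - y);               % unregularized gradient
reg = (lambda/m) * theta;
reg(1) = 0;                                % do not shrink theta_0
theta = theta - alpha * (grad + reg);      % simultaneous update of every theta_j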

Normal Equation

Review: the normal equation from earlier was \(\theta = (X^TX)^{-1}X^Ty\).

With regularization it becomes \(\theta = (X^TX+\lambda \small{\begin{bmatrix}0 \\&1 \\ &&1\\&&&\ddots\\&&&&1 \end{bmatrix}})^{-1}X^Ty,\quad \text{if }\lambda \gt 0\), where the matrix is \((n+1)\times(n+1)\).

For the non-invertible/degenerate matrix issue, Octave's pinv() can still be used to take the pseudo-inverse.

But as long as \(\lambda\) is strictly greater than 0, the matrix sum inside the parentheses can be proven to be invertible.
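In Octave this might look like the following sketch (X, y and lambda assumed given):

n1 = size(X, 2);                           % n + 1 columns, including the bias column of ones
L = eye(n1);
L(1, 1) = 0;                               % zero out the (1,1) entry so theta_0 is not regularized
theta = pinv(X' * X + lambda * L) * X' * y;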

Regularized Logistic Regression

Review: \(J(\theta) = -\frac{1}{m}[\sum_{i=1}^{m}y^{(i)}\, log\,h_\theta(x^{(i)})+(1-y^{(i)})\, log\,(1-h_\theta(x^{(i)}))]\)

The treatment is the same as for linear regression: add the regularization term \(\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2\) at the end of the expression.

\[J(\theta) = -\frac{1}{m}[\sum_{i=1}^{m}y^{(i)}\, log\,h_\theta(x^{(i)})+(1-y^{(i)})\, log\,(1-h_\theta(x^{(i)}))] + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2
\]

Gradient descent (the general form is the same as for linear regression; the only difference, again, is in \(h_\theta(x^{(i)})\)):

\[\begin{align}
&\text{repeat until convergence}\{\qquad\qquad\qquad\qquad\qquad\\
&\qquad \theta_{0}\; \text{:= } \theta_{0} - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\
&\qquad \theta_{j}\; \text{:= } \theta_{j} - \alpha[\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j] \qquad (j = 1,2...,n)\\
&\}
\end{align}
\]

In Octave the earlier code template still works; just remember to include the partial derivative of the regularization term when computing \(\frac{\partial J(\theta)}{\partial \theta_j}\;(\small j=1,2,\dots,n)\) (a concrete sketch follows the template below).

% Template again: each partial derivative for j >= 1 must now include the regularization term (lambda/m)*theta_j
function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute the regularized J(theta)];
  gradient = zeros(n+1, 1);
  gradient(1) = [code to compute ∂J(theta)/∂theta_0];   % theta_0 gets no regularization term
  gradient(2) = [code to compute ∂J(theta)/∂theta_1];
  ...
  gradient(n+1) = [code to compute ∂J(theta)/∂theta_n];
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
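For concreteness, a hedged sketch of the regularized cost function itself (the name logisticCostReg is illustrative; the lambda terms are the only change from the unregularized version):

function [jVal, gradient] = logisticCostReg(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                           % sigmoid hypothesis
  regTheta = theta;
  regTheta(1) = 0;                                          % theta_0 is not regularized
  jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...
         + (lambda/(2*m)) * sum(regTheta .^ 2);             % add (lambda/2m) * sum_j theta_j^2
  gradient = (1/m) * X' * (h - y) + (lambda/m) * regTheta;  % regularized partial derivatives
end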
