大叔学ML第二：线性回归

目录

[TOC]

基本形式

求解参数$\vec\theta$
梯度下降法

正规方程导法

调用函数库

基本形式

线性回归非常直观简洁，是一种常用的回归模型，大叔总结如下：

设有样本$X$形如：

\[\begin{pmatrix}
x_1^{(1)} & x_2^{(1)} & \cdots &x_n^{(1)}\\
x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)}\\
\vdots & \vdots & \vdots & \vdots\\
x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)}\\
\end{pmatrix} \]

对应的标记$\vec{y}$形如：

\[\begin{pmatrix}
y^{(1)} \\
y^{(2)} \\
\vdots \\
y^{(m)} \\
\end{pmatrix}\]

其中，矩阵$X$的每一行表示一个样本，一共有m个样本；每列表示样本的一个属性，共有n个属性。设假设函数

\[h(x_1,x_2 \dots x_n)= \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n \tag{1}
\]

设$x_0=1$，则(1)式重新写为

\[h(x_1,x_2 \dots x_n)= \theta_0x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n \tag{2}
\]

定义代价函数(均方误差)

\[j(\theta_0,\theta_1\dots \theta_n)=\frac{1}{2m}\sum_{k=1}^m (h(x_1^{(k)},x_2^{(k)} \dots x_n^{(k)}) - y^{(k)})^2
\]

即：

\[j(\theta_0,\theta_1\dots \theta_n)=\frac{1}{2m}\sum_{k=1}^m (\theta_0x_0^{(k)} + \theta_1 x_1^{(k)} + \theta_2 x_2^{(k)} + \dots + \theta_n x_n^{(k)} - y^{(k)})^2 \tag{3}
\]

这里的分母乘以2并没有意义，只是为了求导后正好约掉。另外，其实求绝对值之和更直观，但是计算不方便，求平方后再求和效果是一样的，而且计算非常容易。我们的目标是根据样本数据求出使得代价函数取值最小的参数$\vec\theta$，均方误差越小，说明以$\vec\theta$为参数的线性函数拟合样本的能力越强

求解参数$\vec\theta$

梯度下降法

关于梯度下降法可参考大叔学ML第一：梯度下降

由于代价函数是一个凸函数，可以用梯度下降法找到最小值。由于用到梯度，首先对$\theta_0$、$\theta_1$、$\theta_2$直到$\theta_n$求偏导：

$\frac{\partial}{\partial\theta_0}j(\theta_0,\theta_1\dots \theta_n) = \frac{1}{m}\sum_{k=1}^m(\theta_0x_0^{(k)} + \theta_1x_1^{(k)} + \dots+ \theta_nx_n^{(k)} - y^{(k)})x_0^{(k)}$
$\frac{\partial}{\partial\theta_1}j(\theta_0,\theta_1\dots \theta_n) = \frac{1}{m}\sum_{k=1}^m(\theta_0x_0^{(k)} + \theta_1x_1^{(k)} + \dots+ \theta_nx_n^{(k)}- y^{(k)})x_1^{(k)}$
$\dots$
$\frac{\partial}{\partial\theta_n}j(\theta_0,\theta_1\dots \theta_n) = \frac{1}{m}\sum_{k=1}^m(\theta_0x_0^{(k)} + \theta_1x_1^{(k)} + \dots+ \theta_nx_n^{(k)}- y^{(k)})x_n^{(k)}$

可归纳为：$\frac{\partial}{\partial\theta_n}j(\theta_0,\theta_1\dots \theta_n) = \frac{1}{m}\sum_{k=1}^m(\theta_0x_0^{(k)} + \theta_1x_1^{(k)} + \dots+ \theta_nx_n^{(k)}- y^{(k)})x_n^{(k)}\tag{4}$

万事俱备，现在可以编程了。创建一组测试数据，每组数据包括3个属性，我们来编码拟合出一个线性函数：

import numpy as np

def gradient(X, Y, m, theta):

    ''' 求theta位置的梯度.

    Args:

        X: 样本

        Y: 样本标记

        m: 样本数

        theta: 欲求梯度的位置

    Returns:

        gi: theta处函数的梯度值

    '''

    theta_size = np.size(theta)

    g = np.zeros(theta_size)

    for i in range(theta_size):

        gi = 0 #第i个theta分量对应的偏导

        for j in range(m):

            gi += ((np.dot(X[j], theta) - Y[j]) * X[j, i])

        gi = gi / m

        g[i] = gi

    return g

def gradient_descent(X, Y, step = 0.02, threshold = 0.01):

    ''' 梯度下降法求使代价函数最小的 theta

    Args:

        X: 样本

        Y: 样本标记

        step:步长

        threshold:梯度模长阈值，低于此值时停止迭代

    Returns:

        theta: 使代价函数取最小值的theta

    '''

    theta = np.random.rand(4)

    grad = gradient(X, Y, np.size(X, 0), theta)

    norm = np.linalg.norm(grad)

    while(norm > threshold):

        theta -= step * grad

        grad = gradient(X, Y, np.size(X, 0), theta)

        norm = np.linalg.norm(grad)

    return theta

''' 以下是测试数据 '''

# 测试用线性函数

def linear_function(x1, x2, x3):

    result = 1 + 2 * x1 + 3 * x2 + 4 * x3

    result = result + np.random.rand() # 噪音

    return result

# 计算函数值

def calculate(X):

    rowsnumber = np.size(X, axis = 0)

    Y = [linear_function (X[i, 0], X[i, 1], X[i, 2]) for i in range(0, rowsnumber)]

    return Y

if __name__ == "__main__":

    row_count = 500

    X = np.random.randint(0, 10, (row_count, 3)) # 随机产生row_count个样本

    Y = calculate(X) # 计算标记

    X0 = np.ones((row_count, 1))

    X = np.hstack((X0, X)) # 补充一列1

    theta = gradient_descent(X, Y)

    print('theta is ', theta)

运行结果：theta is [1.41206515 2.00558441 3.0013728 4.00684577]

上面的迭代方法被称为批量梯度下降法，参考式(4)，计算梯度时用到了所有的样本。梯度下降法还有个简化的版本，叫做随机梯度下降法，每次计算梯度时只随机使用一个样本，而不是所有样本，这样可以加快计算速度。将式(4)修改为：

\[\frac{\partial}{\partial\theta_n}j(\theta_0,\theta_1\dots \theta_n) = (\theta_0x_0^{(k)} + \theta_1x_1^{(k)} + \dots+ \theta_nx_n^{(k)}- y^{(k)})x_n^{(k)} \tag{5}
\]

其中：$1 \leq k \leq m$

将上面Python代码中的方法gradient替换一下：

def gradient_sgd(X, Y, m, theta):

    ''' 求theta位置的梯度.

    Args:

        X: 样本

        Y: 样本标记

        m: 样本数

        theta: 欲求梯度的位置

    Returns:

        gi: theta处函数的梯度值

    '''

    theta_size = np.size(theta)

    g = np.zeros(theta_size)

    for i in range(theta_size):

        random_Index = np.random.randint(1, m + 1)

        gi = ((np.dot(X[random_Index], theta) - Y[random_Index]) * X[random_Index, i])

        g[i] = gi

    return g

运行结果：

theta is [1.43718942 2.00043557 3.00620849 4.00674728]

感觉像是飞起来。随机梯度下降法肯定没有批量梯度下降法准确，所有还有第三种下降法，叫做小批量梯度下降法，介于批量梯度下降法和随机梯度下降法之间，每次计算梯度使用随机的一小批样本，此处不再code说明。

正规方程导法

因为代价函数是个凸函数，那么我们可以对代价函数求导，让其导数等于0的点即为最小值点。

为方便计算，我们在前面增加了一个值恒等于1的$x_0$，这样就把线性函数的偏置项去掉了，参考式(2)，重新定义矩阵$X$为：

\[\begin{pmatrix}
x_0^{(1)} & x_1^{(1)} & x_2^{(1)} & \cdots &x_n^{(1)}\\
x_0^{(2)} &x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)}\\
\vdots & \vdots & \vdots & \vdots & \vdots\\
x_0^{(m)} & x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)}\\
\end{pmatrix}\]

代价函数式(3)等价于：

\[J(\vec\theta)=\frac{1}{2m}||X\vec\theta - \vec{y}||^2 \tag{6}
\]

化简式(6):

\[\begin{align}
J(\vec\theta)&=\frac{1}{2m}||X\vec\theta - \vec{y}||^2 \\
&=\frac{1}{2m}(X\vec\theta - \vec{y})^T(X\vec\theta - \vec{y}) \\
&=\frac{1}{2m}(\vec\theta^TX^T - \vec{y}^T)(X\vec\theta - \vec{y}) \\
&=\frac{1}{2m}(\vec\theta^TX^TX\vec\theta - \vec\theta^TX^T\vec{y}- \vec{y}^TX\vec\theta + \vec{y}^T\vec{y})\\
&=\frac{1}{2m}(\vec\theta^TX^TX\vec\theta - 2\vec{y}^TX\vec\theta + \vec{y}^T\vec{y})\\
\end{align}\]

对$\vec\theta$求导：

\[\frac{d}{d\vec\theta}J(\vec\theta)=\frac{1}{m}(X^TX\vec\theta-X^T\vec{y})
\]

令其等于0，得：$$\vec\theta=(X^TX){-1}X^T\vec{y}\tag{7}$$

将上面的Python代码改为：

# 测试用线性函数

def linear_function(x1, x2, x3):

    result = 1 + 2 * x1 + 3 * x2 + 4 * x3

    result = result + np.random.rand() # 噪音

    return result

# 计算函数值

def calculate(X):

    rowsnumber = np.size(X, axis = 0)

    Y = [linear_function (X[i, 0], X[i, 1], X[i, 2]) for i in range(0, rowsnumber)]

    return Y

if __name__ == "__main__":

    row_count = 500

    X = np.random.randint(0, 10, (row_count, 3)) # 随机产生row_count个样本

    Y = calculate(X) # 计算标记

    X0 = np.ones((row_count, 1))

    X = np.hstack((X0, X)) # 补充一列1

    theta = np.dot(np.dot(np.linalg.pinv(np.dot(X.T, X)), X.T), np.array(Y).T)

    print('theta is ', theta)

运行结果：theta is [1.49522638 1.99801209 2.99704438 4.00427252]

和梯度下降法比较，光速的感觉，那为什么还要用梯度下降法呢？这是因为求矩阵的逆算法复杂度较高，达爷的建议是：如果样本的属性超过一万个，考虑使用梯度下降法。

调用函数库

其实我们也可以直接调用类库的，有很多类库可以做回归算法，比如：

import numpy as np

from sklearn import linear_model

# 测试用线性函数

def linear_function(x1, x2, x3):

    result = 1 + 2 * x1 + 3 * x2 + 4 * x3

    result = result + np.random.rand() # 噪音

    return result

# 计算函数值

def calculate(X):

    rowsnumber = np.size(X, axis = 0)

    Y = [linear_function (X[i, 0], X[i, 1], X[i, 2]) for i in range(0, rowsnumber)]

    return Y

if __name__ == "__main__":

    row_count = 500

    X = np.random.randint(0, 10, (row_count, 3)) # 随机产生row_count个样本

    Y = calculate(X) # 计算标记

    regr = linear_model.LinearRegression()

    regr.fit(X, np.array(Y).T)

    a, b = regr.coef_, regr.intercept_

    print(a)

    print(b)

运行结果：

[2.00384674 2.99234723 3.99603084]

1.5344826581936104

和我们自己算的差不多吧。还有很多其他的类库可以调用，大叔没有一一去找。可能通常只要调用类库就足够了，不需要我们自己写，不过还是知道原理比较好，遇到问题才好对症下药。

我是这样理解的：我们能够调用到的常见的（广义）线性回归库，其实内部都是用直接求导法实现的（没有看过源码，猜测是直接求导，如果是梯度下降，不太可能自动算出步长），如果样本的属性比较少，比如少于一万个，调用类库就好，类库肯定比我们大部分人自己写的强，但是当样本属性非常多时，用直接求导法求解速度太慢，这时才需要我们自己写梯度下降代码。