DL Fundamentals Catch-Up Plan (3): Model Selection, Underfitting, and Overfitting

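PS: Please credit the source when reposting. All rights reserved.

PS: This post reflects only my own understanding.

If it conflicts with your principles or views, please bear with me and keep the criticism civil.

Preliminary notes

This post is a backup of my main blog on CSDN. (BlogID=107)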

Environment

Windows 10
VSCode
Python 3.8.10
Pytorch 1.8.1
Cuda 10.2

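Introduction

In the previous posts we met two regression models and a number of common deep learning concepts. One interesting finding from "DL Fundamentals Catch-Up Plan (2): Softmax Regression and Example (PyTorch, Cross-Entropy Loss)" is that a softmax regression built on a single linear layer already reaches decent accuracy on its dataset. "Decent" here is relative to random guessing: guessing gives about 10%, while our model gives about 80%. That at least shows the classifier works. Later we will swap the linear layer for other kinds of layers, such as CNNs and RNNs. Those are topics for later posts, so I will not go into them here.

Suppose we have already designed many models with different hidden layers. An important question is then which of them to actually use in practice. This post introduces some ways to approach that model selection problem.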

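A brief overview of some basic concepts

The basic concepts are:

Training error: the error between the model's predictions and the true values, measured on the training set after the parameters have been updated.
Generalization error: the error between the model's predictions and the true values under the real data distribution. Real data is effectively unlimited, so we can only collect a sample of it and estimate the generalization error. The usual approach is to build a test set and measure the error on it.
Underfitting: the model has too little fitting capacity. The training error and the generalization error are close to each other, but both are large. Generally speaking, the model has barely learned the patterns and features we want it to learn.
Overfitting: the training error is small while the generalization error is large. Generally speaking, the model has learned the training set too thoroughly, as if it had memorized every pattern and feature in it, so it generalizes poorly.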
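To make these definitions concrete, here is a minimal NumPy sketch. It is not the code used later in this post: the cubic ground truth, the sample sizes, and the helper names (cubic, make_split, mse) are all assumptions chosen for illustration. It fits polynomials of degree 1, 3, and 9 to noisy cubic data and reports the training error next to an estimate of the generalization error taken from a held-out test set.

```python
import numpy as np

rng = np.random.default_rng(0)

def cubic(x):
    # Hypothetical ground truth, chosen only for this illustration.
    return 0.5 * x**3 - 1.0 * x**2 + 2.0 * x + 1.0

def make_split(n):
    # Features from N(0, 1), labels with N(0, 0.1^2) noise.
    x = rng.normal(0.0, 1.0, n)
    y = cubic(x) + rng.normal(0.0, 0.1, n)
    return x, y

def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on (x, y).
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_split(30)     # small training set
x_test, y_test = make_split(1000)     # held-out set to estimate generalization error

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree}: train={errors[degree][0]:.4f}  test={errors[degree][1]:.4f}")
```

The degree-1 fit should show both errors large (underfitting), and the degree-3 fit should show both small. For degree 9 the training error cannot be higher than for degree 3, but the test error may be worse; how much worse depends on the random draw.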

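Underfitting is usually handled by switching to a different network or by making the network deeper and wider. It is easy to spot, and the direction of the fix is fairly clear.

In practice we run into overfitting far more often. Models keep getting deeper and wider while the data we can collect stays limited, so a model may end up "rote-memorizing" the training set, and its ability to generalize becomes a concern. Later posts will introduce several ways to mitigate this.

Below we work through an example to get a feel for normal fitting, underfitting, and overfitting.

An example of normal fitting, overfitting, and underfitting

Here we use PyTorch's high-level API to build a polynomial regression example.

First, the following code generates the features and labels for \(y = w_1\frac{x^3}{3!} + w_2\frac{x^2}{2!} + w_3\frac{x}{1!} + b + \epsilon,\ \epsilon \sim N(0, 0.1^2)\).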

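def synthetic_data(w, num_examples): #@save
    """Generate y = w[0]*X^3/3! + w[1]*X^2/2! + w[2]*X/1! + w[3] + noise."""
    # One scalar feature per example, drawn from N(0, 1).
    X = np.random.normal(0, 1, (num_examples, 1))
    # NOTE: np.math is deprecated in recent NumPy releases and removed in NumPy 2.0;
    # on those versions use math.factorial from the standard library instead.
    y = np.dot(X**3/np.math.factorial(3), w[0]) + np.dot(X**2/np.math.factorial(2), w[1]) + np.dot(X/np.math.factorial(1), w[2]) + w[3]
    # Noise: epsilon ~ N(0, 0.1^2)
    y += np.random.normal(0, 0.1, y.shape)
    return X, y.reshape((-1, 1))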

Next, a custom PyTorch layer takes a parameter n and evaluates an n-term polynomial.

class TestLayer(nn.Module):
    def __init__(self, n, **kwargs):
        super(TestLayer, self).__init__(**kwargs)
        self.n = n
        self.w_array = nn.Parameter(torch.tensor( np.random.normal(0, 0.1, (1, n))).reshape(-1, 1))
        self.b = nn.Parameter(torch.tensor(np.random.normal(0, 0.1, 1)))
    def cal(self, X, n):
        # Use -1 rather than the global batch_size so the layer works for any batch size.
        X = X.reshape(-1, 1, 1)
        Y = self.b
        for i in range(n):
            # print(X.shape)
            # print(self.w_array.shape)
            # print(Y.shape)
            Y  = Y + torch.matmul(X**(i + 1)/torch.tensor(np.math.factorial(i + 1)), self.w_array[i])
        return Y
    def forward(self, x):
        return self.cal(x, self.n)
class TestNet(nn.Module):
    def __init__(self, n):
        super(TestNet, self).__init__()
        self.test_net = nn.Sequential(
            TestLayer(n)
        )   
    def forward(self, x):
        return self.test_net(x)
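As a cross-check on what TestLayer computes, the same n-term polynomial can also be written as "build the features x^k/k!, then apply one linear layer". The sketch below is an alternative formulation, not the code used in this post. The names poly_features and PolyNet are hypothetical, and nn.Linear uses its own default initialization rather than the N(0, 0.1^2) draws above.

```python
import math
import torch
from torch import nn

def poly_features(x, n):
    """Return a (batch, n) tensor whose k-th column is x**k / k!, for k = 1..n."""
    x = x.reshape(-1, 1)
    return torch.cat([x**k / math.factorial(k) for k in range(1, n + 1)], dim=1)

class PolyNet(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, n):
        super().__init__()
        self.n = n
        # One weight per polynomial term plus a bias, like w_array and b in TestLayer.
        self.linear = nn.Linear(n, 1)

    def forward(self, x):
        return self.linear(poly_features(x, self.n))

x = torch.tensor([[0.5], [-1.0], [2.0]], dtype=torch.float64)
feats = poly_features(x, 3)
net = PolyNet(3).double()   # float64, matching the NumPy-generated data
out = net(x)
print(feats)
print(out.shape)
```

Writing it this way makes it explicit that the model is linear in its parameters: changing n only changes how many feature columns the linear layer sees.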

The complete code is as follows:

import torch
from torch import nn
import numpy as np
import matplotlib.pyplot as plt
from torch.utils import data
from matplotlib.pyplot import MultipleLocator
fig, ax = plt.subplots()
xdata0, ydata0 = [], []
xdata1, ydata1 = [], []
line0, = ax.plot([], [], 'r-', label='TrainError')
line1, = ax.plot([], [], 'b-', label='TestError')
def init_and_show():
    ax.set_xlabel('epoch')
    ax.set_ylabel('loss')
    ax.set_title('Train/Test Loss')
    ax.set_xlim(0, epochs)
    ax.set_ylim(0.05, 100)
    ax.set_yscale('log')
    # y_locator = MultipleLocator(0.1)
    # ax.yaxis.set_major_locator(y_locator)
    ax.legend([line0, line1], ('TrainError', 'TestError'))
    # ax.legend([line1], ('TestError', ))
    line0.set_data(xdata0, ydata0)
    line1.set_data(xdata1, ydata1)
    plt.show()
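def synthetic_data(w, num_examples): #@save
    """Generate y = w[0]*X^3/3! + w[1]*X^2/2! + w[2]*X/1! + w[3] + noise."""
    # One scalar feature per example, drawn from N(0, 1).
    X = np.random.normal(0, 1, (num_examples, 1))
    # NOTE: np.math is deprecated in recent NumPy releases and removed in NumPy 2.0;
    # on those versions use math.factorial from the standard library instead.
    y = np.dot(X**3/np.math.factorial(3), w[0]) + np.dot(X**2/np.math.factorial(2), w[1]) + np.dot(X/np.math.factorial(1), w[2]) + w[3]
    # Noise: epsilon ~ N(0, 0.1^2)
    y += np.random.normal(0, 0.1, y.shape)
    return X, y.reshape((-1, 1))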
class TestLayer(nn.Module):
    def __init__(self, n, **kwargs):
        super(TestLayer, self).__init__(**kwargs)
        self.n = n
        self.w_array = nn.Parameter(torch.tensor( np.random.normal(0, 0.1, (1, n))).reshape(-1, 1))
        self.b = nn.Parameter(torch.tensor(np.random.normal(0, 0.1, 1)))
    def cal(self, X, n):
        # Use -1 rather than the global batch_size so the layer works for any batch size.
        X = X.reshape(-1, 1, 1)
        Y = self.b
        for i in range(n):
            # print(X.shape)
            # print(self.w_array.shape)
            # print(Y.shape)
            Y  = Y + torch.matmul(X**(i + 1)/torch.tensor(np.math.factorial(i + 1)), self.w_array[i])
        return Y
    def forward(self, x):
        return self.cal(x, self.n)
class TestNet(nn.Module):
    def __init__(self, n):
        super(TestNet, self).__init__()
        self.test_net = nn.Sequential(
            TestLayer(n)
        )   
    def forward(self, x):
        return self.test_net(x)
# copy from d2l/torch.py
def load_array(data_arrays, batch_size, is_train=True):
    """Construct a PyTorch data iterator."""
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)
# def data_loader(batch_size, features, labels):
#     num_examples = len(features)
#     indices = list(range(num_examples))
#     np.random.shuffle(indices) # read the samples in random order
#     for i in range(0, num_examples, batch_size):
#         j = np.array(indices[i: min(i + batch_size, num_examples)])
#         yield torch.tensor(features.take(j, 0)), torch.tensor(labels.take(j)) # take() returns the elements at the given indices
def train(dataloader, model, loss_fn, optimizer):
    size = train_examples
    num_batches = train_examples / batch_size
    train_loss_sum = 0
    for batch, (X, y) in enumerate(dataloader):
        # move X, y to gpu
        if torch.cuda.is_available():
            X = X.to('cuda')
            y = y.to('cuda')
        # Compute prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss_sum += loss.item()
        if batch % 5 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
    print(f"Train Error: \n Avg loss: {train_loss_sum/num_batches:>8f} \n")
    return train_loss_sum/num_batches
def test(dataloader, model, loss_fn):
    num_batches = test_examples / batch_size
    test_loss = 0
    with torch.no_grad():
        for X, y in dataloader:
            # move X, y to gpu
            if torch.cuda.is_available():
                X = X.to('cuda')
                y = y.to('cuda')
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
    test_loss /= num_batches
    print(f"Test Error: \n Avg loss: {test_loss:>8f} \n")
    return test_loss
if __name__ == '__main__':
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print('Using {} device'.format(device))
    true_w1 = [1.65]
    true_w2 = [-2.46]
    true_w3 = [3.54]
    true_b = 0.78    
    test_examples = 100
    train_examples = 100
    num_examples = test_examples + train_examples
    f1, labels = synthetic_data([true_w1, true_w2, true_w3, true_b], num_examples)
    print(f1.shape)
    print(labels.shape)
    num_weight = 3
    # Despite the variable name, this is mean squared error (L2), not an L1 loss.
    l1_loss_fn = torch.nn.MSELoss()
    learning_rate = 0.01
    model = TestNet(num_weight)
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    model = model.to(device)
    print(model)
    epochs = 1500
    model.train()
    batch_size = 10
    train_data = (torch.tensor(f1[:train_examples,]), torch.tensor(labels[:train_examples,]))
    test_data = (torch.tensor(f1[train_examples:,]), torch.tensor(labels[train_examples:,]))
    train_dataloader = load_array(train_data ,batch_size, True)
    test_dataloader = load_array(test_data ,batch_size, True)
    # verify dataloader
    # for x,y in train_dataloader:
    #     print(x.shape)
    #     print(y.shape)
    #     print(torch.matmul(x**3, torch.tensor(true_w1, dtype=torch.double)) + torch.matmul(x**2, torch.tensor(true_w2, dtype=torch.double)) + torch.matmul(x, torch.tensor(true_w3, dtype=torch.double)) + true_b)
    #     print(y)
    #     break
    model.train()
    for t in range(epochs):
        print(f"Epoch {t+1}\n-------------------------------")
        train_l = train(train_dataloader, model, l1_loss_fn, optimizer)
        test_l = test(test_dataloader, model, l1_loss_fn)
        # The plotted values are 10x the average per-batch loss.
        ydata0.append(train_l*10)
        ydata1.append(test_l*10)
        xdata0.append(t)
        xdata1.append(t)
    print("Done!")
    init_and_show()
    param_iter = model.parameters()
    print('W = ')
    print(next(param_iter)[: num_weight, :])
    print('b = ')
    print(next(param_iter))

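Note that this final code first generates 100 training samples and 100 test samples. num_weight controls how many polynomial terms take part in training. In other words, it controls the number of parameters being fitted. The three cases below show how TrainErr and TestErr relate to the number of epochs and to the number of fitted parameters for different values of num_weight.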
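Since the topic is model selection, one common way to choose num_weight is to hold out a validation set and pick the value with the lowest validation error. The sketch below rests on several assumptions that are not in the script above: it adds a separate validation split, swaps SGD for closed-form least squares (np.linalg.lstsq) to keep it short, and uses hypothetical helper names. The true coefficients are the ones from the script, listed from the lowest power up.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
# Coefficients of x/1!, x^2/2!, x^3/3! (true_w3, true_w2, true_w1 in the script) and the bias.
TRUE_W = (3.54, -2.46, 1.65)
TRUE_B = 0.78

def features(x, n):
    # Columns: 1, x/1!, x^2/2!, ..., x^n/n!
    return np.stack([x**k / math.factorial(k) for k in range(n + 1)], axis=1)

def make_data(m):
    x = rng.normal(0, 1, m)
    y = features(x, 3) @ np.array([TRUE_B, *TRUE_W]) + rng.normal(0, 0.1, m)
    return x, y

def fit(x, y, n):
    # Closed-form least squares; a stand-in for the SGD training loop.
    w, *_ = np.linalg.lstsq(features(x, n), y, rcond=None)
    return w

def mse(w, x, y):
    return float(np.mean((features(x, len(w) - 1) @ w - y) ** 2))

x_tr, y_tr = make_data(100)     # used to fit the parameters
x_val, y_val = make_data(100)   # used only to compare candidate models

val_err = {n: mse(fit(x_tr, y_tr, n), x_val, y_val) for n in range(1, 11)}
best_n = min(val_err, key=val_err.get)
for n, err in val_err.items():
    print(f"num_weight={n:2d}  validation MSE={err:.4f}")
print("selected num_weight:", best_n)
```

Keeping the validation split separate from the test set means the test error stays an honest estimate for the model that is finally chosen.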

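Normal fitting (num_weight = 3)

With num_weight = 3, running the training script clearly shows that the fitted parameters are almost identical to the true ones. TrainErr and TestErr both converge quickly toward 0, and the gap between them is small. Note that the printed W is ordered from the lowest power upward, so it should be compared against true_w3, true_w2, true_w1 in that order.

Underfitting (num_weight = 1)

With num_weight = 1, running the training script clearly shows that the loss curve stops decreasing after a certain point and plateaus at a high value instead of converging toward 0.

Overfitting (num_weight = 20)

With num_weight = 20, we would expect the model to overfit.

A typical overfitting run: note that the first three entries of w and the value of b in the final output differ from the true w and b.

Across my repeated experiments, besides runs that really do overfit as above, there are also runs in which no overfitting appears, as in the figure below. Again, look at the first three entries of w and the value of b in the final output and compare them with the true w and b.

Looking at the output, entries 4 through 20 of w are close to 0, while the first three entries of w and the value of b are fairly close to the true values. My guess is that overfitting does not appear in these runs because terms 4 through 20 carry very little weight in the overall expression. You can simply think of those weights as being 0.

Note that the overfitting example has to be run several times before overfitting shows up, and the outcome fluctuates from run to run. The reason is that the randomly initialized parameters make convergence harder. The underfitting and normal-fitting examples, by contrast, give stable results no matter how many times you run them.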
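The guess above can be checked with a little arithmetic. The sketch below uses made-up weights, not values from any actual run: the first three are near the true values and the remaining seventeen are a small constant. It computes the magnitude of each term w_k * x^k / k! at a single input. The helper name term_magnitudes and the choice x = 2.0 are assumptions for illustration.

```python
import math
import numpy as np

def term_magnitudes(weights, x):
    """Return |w_k * x**k / k!| for k = 1..len(weights) at one input x."""
    return np.array([abs(w) * abs(x) ** k / math.factorial(k)
                     for k, w in enumerate(weights, start=1)])

# Hypothetical learned weights, lowest power first (not taken from a real run).
weights = [3.5, -2.5, 1.6] + [0.05] * 17
mags = term_magnitudes(weights, x=2.0)

low = mags[:3].sum()    # combined size of terms 1..3
high = mags[3:].sum()   # combined size of terms 4..20
print(f"terms 1-3: {low:.4f}   terms 4-20: {high:.4f}")
```

With weights like these, the higher-order terms add up to well under 1% of the first three, partly because the weights are small and partly because the 1/k! factor shrinks high powers quickly. This is consistent with treating those weights as effectively 0.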
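Because the outcome depends on random initialization, fixing the random seeds makes a particular run repeatable, which helps when comparing an overfitting run with a non-overfitting one. A minimal sketch follows. The helper names set_seed and init_weights are hypothetical, and the script above does not set any seed.

```python
import numpy as np
import torch

def set_seed(seed):  # hypothetical helper, not part of the original script
    # synthetic_data and TestLayer draw from NumPy's global RNG.
    np.random.seed(seed)
    # DataLoader shuffling relies on PyTorch's RNG.
    torch.manual_seed(seed)

def init_weights(n):
    # Same initialization scheme as TestLayer: N(0, 0.1^2) draws via NumPy.
    return torch.tensor(np.random.normal(0, 0.1, (n, 1)))

set_seed(42)
w_a = init_weights(20)
perm_a = torch.randperm(10)

set_seed(42)
w_b = init_weights(20)
perm_b = torch.randperm(10)

# Both comparisons should print True: same seed, same draws.
print(torch.equal(w_a, w_b), torch.equal(perm_a, perm_b))
```

Trying several different seeds with num_weight = 20 is then a controlled way to see how often overfitting shows up.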

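Afterword

Starting from the perspective of model selection, we found three phenomena that can occur during training: underfitting, normal fitting, and overfitting. The normally fitted model is the one we want.

Underfitting means that too few parameters take part in training. In other words, the model is too simple to represent the features we want to learn, so the loss cannot converge to a low value.

Overfitting is far less simple and clear-cut than what we saw here. In this post we only saw one major cause of large fluctuations in training, namely too many parameters, which can lead to overfitting. Since later models have hundreds or thousands of parameters, we cannot try every setting one by one, so later posts will cover techniques for suppressing overfitting.

This also raises a question: we know that model complexity (the number of parameters) can cause overfitting on a given dataset. Besides controlling model complexity, are there other options?

References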

https://github.com/d2l-ai/d2l-zh/releases (V1.0.0)
https://github.com/d2l-ai/d2l-zh/releases (V2.0.0 alpha1)
