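Study Notes on Loss Functions

I have never fully understood the loss functions used in machine learning, so I am writing some notes here.

The notation mostly follows Andrew Ng's course: there are \(m\) samples, the \(i\)-th sample is \(\vec x^{(i)}\), and its ground-truth label is \(y^{(i)}\).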

Linear regression

Hypothesis function:

\[\begin{align}
h_{\vec \theta}(\vec x^{(i)})
& = \vec \theta^T \vec x^{(i)} \\
\end{align}
\]

Loss function:

MSE (mean squared error) is used as the loss function:

\[\begin{align}
J(\vec \theta)
& = \frac{1}{2m} \sum\limits_{i=1}^{m}(h_{\vec \theta}(\vec x^{(i)})-y^{(i)})^2 \\
\end{align}
\]

It is optimized with mini-batch SGD (stochastic gradient descent).
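To make the MSE formula concrete, here is a minimal sketch in plain Python. The toy data, the learning rate, and the helper names (`predict`, `mse_loss`, `gradient_step`) are made up for illustration, and for brevity the step uses the full batch rather than a mini-batch.

```python
# Sketch: J(theta) = 1/(2m) * sum_i (theta^T x_i - y_i)^2 and one gradient step.
# All data and names here are illustrative assumptions, not from any library.

def predict(theta, x):
    """Hypothesis h_theta(x) = theta^T x."""
    return sum(t * xi for t, xi in zip(theta, x))

def mse_loss(theta, xs, ys):
    """MSE loss with the 1/(2m) factor used in these notes."""
    m = len(xs)
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradient_step(theta, xs, ys, lr):
    """One full-batch gradient-descent update: theta <- theta - lr * dJ/dtheta."""
    m = len(xs)
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        err = predict(theta, x) - y
        for k, xk in enumerate(x):
            grad[k] += err * xk / m
    return [t - lr * g for t, g in zip(theta, grad)]

xs = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # first component acts as a bias input
ys = [2.0, 4.0, 6.0]
theta = [0.0, 0.0]

loss_before = mse_loss(theta, xs, ys)
theta = gradient_step(theta, xs, ys, lr=0.1)
loss_after = mse_loss(theta, xs, ys)
print(loss_before, loss_after)  # the loss should go down after the step
```

A mini-batch variant would call `gradient_step` on a random subset of `xs`/`ys` at each iteration instead of on all samples.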

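Logistic regression

Logistic regression is used for binary classification, so it is also called logistic classification.

First, generalize linear regression to a generalized linear model: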

\[\begin{align}
h_{\vec \theta}(\vec x)
& = g^{-1}(\vec \theta^T \vec x) \\
\end{align}
\]

Since the task is binary classification, the target to predict is \(y^{(i)} \in \{0,1\}\), so the Sigmoid function is used:

\[\begin{align}
g^{-1}(z)
& = \frac{1}{1+e^{-z}} \\
\end{align}
\]

This gives the hypothesis function of logistic regression:

\[\begin{align}
h_{\vec \theta}(\vec x)
& = \frac{1}{1+e^{-\vec \theta^T \vec x}} \\
\end{align}
\]

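The loss function of logistic regression is obtained differently from the linear model: it starts from maximum likelihood estimation. By the principle of empirical risk minimization, we search for the parameters \(\vec \theta\) that minimize the loss, which is mathematically equivalent to maximizing the likelihood. So first write down the probability function (called p.d.f. in these notes; strictly speaking it is a probability mass function, since the label is discrete). The likelihood is the product of this probability over all samples; taking the logarithm and multiplying by \(-1\) gives the logistic regression loss.

In logistic regression, the class label \(y^{(i)}\) of each sample \(\vec x^{(i)}\) follows a two-point (Bernoulli) distribution, whose p.d.f. is: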

\[\begin{align}
f(y^{(i)}|\phi)
& = \phi^{y^{(i)}}(1-\phi)^{1-y^{(i)}} \\
\phi
& = h_{\vec \theta}(x^{(i)}) \\
\end{align}
\]

Here \(\phi\) denotes \(p(y^{(i)}=1)\), that is, the probability that \(x^{(i)}\) is predicted as a positive sample (class "1").

The corresponding likelihood function is:

\[\begin{align}
l(\vec \theta)
& = \prod\limits_{i=1}^{m}f(y^{(i)}|h_{\vec \theta}(\vec x^{(i)})) \\
& = \prod\limits_{i=1}^{m}(h_{\vec \theta}(\vec x^{(i)}))^{y^{(i)}}(1-h_{\vec \theta}(\vec x^{(i)}))^{1-y^{(i)}} \\
\end{align}
\]

The loss function of logistic regression (i.e. the negative log-likelihood) is then:

\[\begin{align}
J(\vec \theta)
& = -\ln l(\vec \theta) \\
& = -\sum\limits_{i=1}^{m}[y^{(i)}\ln(h_{\vec \theta}(\vec x^{(i)}))+(1-y^{(i)})\ln(1-h_{\vec \theta}(\vec x^{(i)}))] \\
\end{align}
\]

The corresponding optimization is usually also done with gradient descent.
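As a quick sanity check of the negative log-likelihood above, the following sketch evaluates it on two hand-picked samples. The parameter values, the samples, and the function names are assumptions chosen so the result can be verified by hand.

```python
import math

def sigmoid(z):
    """g^{-1}(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_nll(theta, xs, ys):
    """Negative log-likelihood: -sum_i [y ln(h) + (1 - y) ln(1 - h)]."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total

# Illustrative values: theta^T x is 0 for the first sample and 1 for the second.
theta = [0.5, -0.25]
xs = [[1.0, 2.0], [1.0, -2.0]]
ys = [1, 0]

loss = logistic_nll(theta, xs, ys)
# Sample 1 contributes -ln(0.5) = ln 2; sample 2 contributes -ln(1 - sigmoid(1)) = ln(1 + e).
print(loss, math.log(2) + math.log(1 + math.e))
```

Both printed numbers should agree, which confirms that each sample contributes only the log-probability of its own true class.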

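Softmax regression

Consider the multi-class classification problem, \(y^{(i)} \in \{1,2,...,K\}\). Extending logistic regression a little yields the hypothesis function and loss function we want.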

Softmax

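Sigmoid is the mapping \(\sigma: \mathbb{R} \rightarrow (0,1)\)

Softmax is the mapping \(\sigma: \mathbb{R}^{K} \rightarrow (0,1)^{K}\)

That is, Softmax takes a \(K\)-dimensional score vector and maps every dimension into \((0,1)\), so that the result can be compared with the one-hot class label vector \(\vec t^{(i)}\) (which contains a single 1). It does not apply sigmoid to each dimension independently, though; instead: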

\[\begin{align}
Softmax(\vec x^{(i)})
& = [\frac{e^{x_1^{(i)}}}{\sum\limits_{j=1}^Ke^{x_j^{(i)}}};\frac{e^{x_2^{(i)}}}{\sum\limits_{j=1}^Ke^{x_j^{(i)}}}; ...; \frac{e^{x_K^{(i)}}}{\sum\limits_{j=1}^Ke^{x_j^{(i)}}}] \\
\end{align}
\]

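So when many online sources describe softmax as a generalization of sigmoid, I find that somewhat misleading: judging from the formulas alone it does not look like one, and the two are merely similar in effect. (The precise link only shows up for \(K=2\), where the second softmax output equals the sigmoid of the difference of the two scores.)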
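The difference between the two mappings is easy to see numerically. The sketch below (function names and the score vector are illustrative assumptions) applies sigmoid per dimension and softmax jointly to the same scores, and also checks the two-class relationship between them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Normalize exp(z_k) by the sum over all K dimensions."""
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [3.0, 5.0]
per_dim = [sigmoid(z) for z in scores]  # each dimension handled on its own
joint = softmax(scores)                 # every output depends on all inputs

print(per_dim, sum(per_dim))  # the per-dimension outputs do not sum to 1
print(joint, sum(joint))      # the softmax outputs do
# For K = 2, the second softmax output equals sigmoid(z_2 - z_1).
print(joint[1], sigmoid(scores[1] - scores[0]))
```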

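Hypothesis function of Softmax regression

This amounts to wrapping a Softmax around the vector of per-class scores produced by linear regression, which turns the scores into a vector of predicted class probabilities: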

\[\begin{align}
h_{\vec \theta}(\vec x^{(i)})
& = Softmax(\vec \theta_1^T\vec x^{(i)}; \vec \theta_2^T\vec x^{(i)}; ...;\vec \theta_K^T\vec x^{(i)}) \\
& = (p(y^{(i)}=1|\vec x^{(i)};\vec \theta); p(y^{(i)}=2|\vec x^{(i)};\vec \theta); ...; p(y^{(i)}=K|\vec x^{(i)};\vec \theta)) \\
& = \frac{1}{\sum\limits_{j=1}^Ke^{\vec \theta_j^T\vec x^{(i)}}}(e^{\vec \theta_1^T\vec x^{(i)}}; e^{\vec \theta_2^T\vec x^{(i)}}; ...; e^{\vec \theta_K^T\vec x^{(i)}}) \\
\end{align}
\]

Loss function of Softmax regression

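The negative log-likelihood is again used as the loss function, except that the p.d.f. now follows a multi-point (categorical) distribution:

\[\begin{align}
f(y^{(i)}|h_{\vec\theta}(\vec x^{(i)}))
& = \prod\limits_{j=1}^Kp(y^{(i)}=j|\vec x^{(i)};\vec \theta)^{I(y^{(i)}=j)} \\
& = \prod\limits_{j=1}^K(h_{\vec \theta_j}(\vec x^{(i)}))^{I(y^{(i)}=j)} \\
\end{align}
\]

Here \(I(\cdot)\) is the indicator function (1 when its argument is true, 0 otherwise) and \(h_{\vec \theta_j}(\vec x^{(i)})\) is the \(j\)-th component of \(h_{\vec \theta}(\vec x^{(i)})\), so only the factor of the true class survives. The likelihood function is: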

\[\begin{align}
l(\vec \theta)
& = \prod\limits_{i=1}^mf(y^{(i)}|h_{\vec\theta}(\vec x^{(i)})) \\
\end{align}
\]

Using the negative log-likelihood as the loss function:

\[\begin{align}
J(\vec \theta)
& = -\ln l(\vec \theta) \\
& = -\sum\limits_{i=1}^m \ln f(y^{(i)}|h_{\vec \theta}(\vec x^{(i)})) \\
& = -\sum\limits_{i=1}^m \sum\limits_{j=1}^K t_j^{(i)}\ln(h_{\vec\theta_j}(\vec x^{(i)})) \\
& = -\sum\limits_{i=1}^m \sum\limits_{j=1}^K (I(y^{(i)}=j)\ln(h_{\vec\theta_j}(\vec x^{(i)}))) \\
& = -\sum\limits_{i=1}^m \ln(\frac{e^{\vec\theta_{y^{(i)}}^T\vec x^{(i)}}}{\sum\limits_{l=1}^Ke^{\vec\theta_l^T \vec x^{(i)}}}) \\
\end{align}
\]
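The last two lines of this derivation say that the double sum with the indicator collapses to the log-probability of the true class. The sketch below checks that numerically; the parameter vectors, samples, and the helper name `softmax_nll` are assumptions for illustration, with class labels numbered from 1 to K as in the text.

```python
import math

def softmax(zs):
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_nll(thetas, xs, ys):
    """Return the loss computed two ways: indicator double sum and direct lookup.

    thetas: list of K parameter vectors; ys: class labels in 1..K.
    """
    loss_indicator = 0.0
    loss_direct = 0.0
    for x, y in zip(xs, ys):
        scores = [sum(t * xi for t, xi in zip(theta_j, x)) for theta_j in thetas]
        probs = softmax(scores)
        for j, p in enumerate(probs, start=1):
            loss_indicator -= (1 if y == j else 0) * math.log(p)
        loss_direct -= math.log(probs[y - 1])
    return loss_indicator, loss_direct

# Illustrative setup: K = 3 classes, 2 features, 2 samples.
thetas = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
xs = [[3.0, 5.0], [2.0, 1.0]]
ys = [2, 1]

by_indicator, by_lookup = softmax_nll(thetas, xs, ys)
print(by_indicator, by_lookup)  # the two forms should agree
```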

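Cross-Entropy Loss Function

Logistic regression performs binary classification; its loss function is the Cross-Entropy Loss for the 2-class case.

Softmax regression performs multi-class classification; its loss function is the Cross-Entropy Loss for the K-class case: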

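\[\begin{align}
-\sum\limits_{c=1}^K{y_{gt,c}\log(y_{pred,c})}
\end{align}
\]

Here \(y_{gt,c}\) is the \(c\)-th component of the ground-truth (one-hot) label, and \(y_{pred,c}\) is the probability the classifier predicts for class \(c\).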

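Are Softmax Loss and Cross-Entropy Loss the same?

Strictly speaking, Cross-Entropy Loss is the standard term, whereas Softmax Loss is not. The Softmax classifier is a linear classifier that uses the Cross-Entropy Loss function. In other words, the gradient of the cross-entropy loss tells the Softmax classifier how to update the parameters \(\vec \theta\) in the SGD update rule.

By convention, however, when people say Softmax Loss they mean Cross-Entropy Loss.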

(ref: https://www.quora.com/Is-the-softmax-loss-the-same-as-the-cross-entropy-loss)

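Also note that Softmax regression and Logistic regression both use the cross-entropy loss as their loss function.

Loss functions for linear regression, logistic regression, and softmax regression in Caffe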

EuclideanLoss

EuclideanLoss serves as the loss function for linear regression.

SigmoidCrossEntropyLoss

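SigmoidCrossEntropyLoss computes the cross-entropy (logistic) loss. It targets the multi-label case in which the labels are mutually independent, for example tags such as "folk song, female vocal, elegant". It can also be used for mutually exclusive labels, i.e. multi-class classification expanded into one-hot encoding, but the result then differs from that of SoftmaxWithLoss.

For numerical stability, the actual implementation of this function takes some extra care. See: http://www.caffecn.cn/?/question/25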
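To illustrate what such a stabilization can look like, here is a sketch of one commonly used rewrite of the sigmoid cross-entropy, \(\max(x,0) - x t + \ln(1+e^{-|x|})\), next to the textbook form. This is an assumption about the general idea, not a transcription of Caffe's source; the function names are made up.

```python
import math

def sigmoid_ce_naive(x, t):
    """Textbook form: -[t ln(p) + (1 - t) ln(1 - p)] with p = sigmoid(x)."""
    p = 1.0 / (1.0 + math.exp(-x))
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

def sigmoid_ce_stable(x, t):
    """Algebraically equivalent form that never exponentiates a positive number."""
    return max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))

logits = [3.0, 5.0]
targets = [0.0, 1.0]  # one-hot label for class 1

loss = sum(sigmoid_ce_stable(x, t) for x, t in zip(logits, targets))
print(loss)  # about 3.0553 for this input

# For a large logit the naive form breaks down (p rounds to exactly 1.0, so
# ln(1 - p) is undefined), while the stable form still returns a finite value.
big = sigmoid_ce_stable(1000.0, 0.0)
print(big)
```

The two forms agree wherever the naive one is well defined; the stable one simply avoids computing `exp` of a large positive number and `log` of a value that has rounded to zero.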

SoftmaxWithLoss

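SoftmaxWithLoss computes the multinomial logistic loss, i.e. the multi-point (categorical) distribution case: a single label whose one-hot encoding contains exactly one 1. Its result must not be confused with that of SigmoidCrossEntropyLoss.

In the actual computation, the maximum value over all dimensions is subtracted from every dimension before the softmax is computed (its output serves as the predicted probabilities), and the result is then plugged into the negative log-likelihood loss. See shuzfan's blog: https://blog.csdn.net/shuzfan/article/details/51460895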
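The max-subtraction trick can be sketched in a few lines. This is a plain-Python illustration of the idea described above, not Caffe's code; `softmax_with_loss` is a hypothetical helper name and the label is a 0-based class index.

```python
import math

def softmax_with_loss(scores, label):
    """Negative log of the softmax probability of `label`, computed stably.

    Subtracting max(scores) leaves the softmax unchanged (numerator and
    denominator are scaled by the same factor) but keeps every exponent <= 0.
    """
    m = max(scores)
    shifted = [s - m for s in scores]
    log_sum = math.log(sum(math.exp(s) for s in shifted))
    return -(shifted[label] - log_sum)

loss = softmax_with_loss([3.0, 5.0], 1)
print(loss)  # about 0.1269 for this input

# Shifting all scores by a constant gives the same loss, even for scores
# where exp(score) itself would overflow a float.
huge = softmax_with_loss([1000.0, 1002.0], 1)
print(huge)
```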

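Similarities and differences between SigmoidCrossEntropyLoss and SoftmaxWithLoss in binary classification

What they have in common: the loss function has the overall form of cross-entropy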

\[-\frac{1}{m}\sum\limits_{i=1}^{m}[y^{(i)}\ln (\hat{y^{(i)}}) + (1-y^{(i)}) \ln (1-\hat{y^{(i)}})]
\]

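where \(\hat{y^{(i)}}\) is \(h_{\vec \theta}(\vec x^{(i)})\)

Where they differ: the former applies sigmoid to each dimension of the feature separately and uses the result as the predicted probability \(\hat y\); the latter applies softmax to all dimensions of the feature jointly. Note that sigmoid is computed for each dimension independently, whereas softmax must know the values of all dimensions before it can compute any one of them.

Verifying with actual code:

Take x=[3,5] as the input to the classifier/regressor/loss function. The corresponding ground-truth class label is 1, which is [0,1] after one-hot encoding.

The loss is computed with EuclideanLoss, SigmoidCrossEntropyLoss, and SoftmaxWithLoss in turn (this is only a demonstration; in practice classification tasks do not use EuclideanLoss).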

test.py:

#!/usr/bin/env python
# coding:utf-8
from __future__ import print_function
import sys

# Local pycaffe checkout; adjust to your own path
pycaffe_dir = '/home/chris/work/caffe-BVLC/python'
sys.path.insert(0, pycaffe_dir)

import numpy as np
import caffe

# Input scores, reshaped to (1, 2): one sample, two dimensions
x = np.array([3,5], dtype=np.float32)
x = x[np.newaxis, :]

# single_label: the class index is used as the label, starting from 0
y1 = np.array([1], dtype=np.float32)
y1 = y1[np.newaxis, :]

# full_label: one-hot encoded class label vector, a single 1 and 0 elsewhere
y2 = np.array([0, 1], dtype=np.float32)
y2 = y2[np.newaxis, :]

print('x.shape:', x.shape)
print('y1.shape:', y1.shape)
print('y2.shape:', y2.shape)

caffe.set_mode_cpu()
solver = caffe.SGDSolver('solver.pt')

# Fill the Input layer's blobs by hand, then run one iteration
solver.net.blobs['data'].data[...] = x
solver.net.blobs['single_label'].data[...] = y1
solver.net.blobs['full_label'].data[...] = y2
solver.step(1)

print('===========================')
print('x: [3,5], y:1，i.e. [0,1]')

# 0.12692806
# i.e. -math.log(math.exp(0)/(math.exp(-2)+math.exp(0)))
softmax_loss = solver.net.blobs['softmax_loss'].data
print('softmax_loss:', softmax_loss)

# 3.0553026
# i.e. -(-3-math.log(1+math.exp(-3))-math.log(1+math.exp(-5)))
sigmoid_cross_entropy_loss = solver.net.blobs['sigmoid_cross_entropy_loss'].data
print('sigmoid_cross_entropy_loss:', sigmoid_cross_entropy_loss)

# 12.5
euclidean_loss = solver.net.blobs['euclidean_loss'].data
print('euclidean_loss:', euclidean_loss)

solver.pt:

train_net: "train.pt"
base_lr: 0.1
display: 10
max_iter: 300
lr_policy: "step"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0005
stepsize: 200
snapshot: 300
snapshot_prefix: "test"
solver_mode: CPU
device_id: 0

train.pt:

layer {
  name: "data"
  type: "Input"
  top: "data"
  top: "single_label"
  top: "full_label"
  input_param {
    shape {
      dim: 1
      dim: 2
    }
    shape {
      dim: 1
      dim: 1
    }
    shape {
      dim: 1
      dim: 2
    }
  }
}
layer {
  name: "euclidean_loss"
  type: "EuclideanLoss"
  bottom: "data"
  bottom: "full_label"
  top: "euclidean_loss"
}
layer {
  name: "sigmoid_cross_entropy_loss"
  type: "SigmoidCrossEntropyLoss"
  bottom: "data"
  bottom: "full_label"
  top: "sigmoid_cross_entropy_loss"
}
layer {
  name: "softmax_loss"
  type: "SoftmaxWithLoss"
  bottom: "data"
  bottom: "single_label"
  top: "softmax_loss"
}

Output:

x: [3,5], y:1，i.e. [0,1]
softmax_loss: 0.12692806
sigmoid_cross_entropy_loss: 3.0553026
euclidean_loss: 12.5
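These three numbers can be cross-checked by hand without Caffe. The sketch below uses the closed-form expressions from the comments in test.py; for EuclideanLoss it assumes the usual definition \(\frac{1}{2N}\sum \lVert x - y \rVert^2\) with \(N=1\) sample, which is consistent with the printed 12.5.

```python
import math

x = [3.0, 5.0]
one_hot = [0.0, 1.0]  # ground-truth class 1

# SoftmaxWithLoss: subtract the max (5) from both scores, then take
# -ln of the softmax probability of the true class.
softmax_loss = -math.log(math.exp(0) / (math.exp(-2) + math.exp(0)))

# SigmoidCrossEntropyLoss: each dimension is treated independently.
#   dimension 0 (target 0): ln(1 - sigmoid(3)) = -3 - ln(1 + e^-3)
#   dimension 1 (target 1): ln(sigmoid(5))     = -ln(1 + e^-5)
sigmoid_ce = -(-3 - math.log(1 + math.exp(-3)) - math.log(1 + math.exp(-5)))

# EuclideanLoss, assuming sum of squared differences divided by 2N with N = 1.
euclidean = sum((a - b) ** 2 for a, b in zip(x, one_hot)) / (2 * 1)

print(softmax_loss, sigmoid_ce, euclidean)
```

The large gap between the softmax and sigmoid results comes from dimension 0: sigmoid scores the 3 on its own and penalizes it heavily against target 0, while softmax only looks at it relative to the 5.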