Foundations of Machine Learning: The Margin Explanation for Boosting's Effectiveness

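In this section we ask: what kind of classifier gives predictions on unseen data that we can trust? To answer this, we first need to quantify "trust". The quantity we use is the margin: the larger the margin, the more confidence we can place in the prediction.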

1. Support Vector Machines

An SVM classifies data with a hyperplane $w\cdot x+ b=0$. The hyperplane is chosen so that the smallest distance from any sample point to it is as large as possible.

Figure 1: SVM example

As shown in Figure 1, the distance from a point $x$ to the hyperplane is $\frac{|w\cdot x+b|}{\| w\|_2}$, so the smallest distance over the sample is $\rho=\min_{x\in S}\frac{|w\cdot x+b|}{\|w\|_2}$.

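Since scaling $(w,b)$ by a common positive factor does not move the hyperplane, we can normalize so that $\min_{x\in S}|w\cdot x+b|=1$, which gives $\rho=\min_{x\in S}\frac{|w\cdot x+b|}{\|w\|_2}=\frac{1}{\|w\|_2}$. This $\rho$ is the margin. Maximizing $\rho$ is therefore equivalent to minimizing $\|w\|_2$.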

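We must also ensure that every point lies at distance at least $\rho$ from the hyperplane: for every $x_i$, $\frac{|w\cdot x_i+b|}{\|w\|_2}\geq \rho = \frac{1}{\|w\|_2}$, i.e. $|w\cdot x_i+b|\geq 1$. Requiring in addition that each point be classified correctly, so that $y_i$ and $w\cdot x_i+b$ have the same sign, turns this into $y_i(w\cdot x_i+b)\geq 1$. The resulting optimization problem is: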

\begin{eqnarray*}\min & \frac{1}{2}\|w\|_{2}^{2} \\ \text{s.t.} & y_i(w\cdot x_i+b)\geq 1\end{eqnarray*}
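The normalization step can be checked numerically. The following is a minimal sketch on hypothetical toy data (the points, `w`, and `b` are made up for illustration): it computes the smallest point-to-hyperplane distance, rescales $(w,b)$ to canonical form, and confirms that the margin equals $1/\|w\|_2$ afterwards.

```python
import math

def geometric_margin(w, b, points):
    """Smallest distance |w.x + b| / ||w||_2 from any point to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in points) / norm

def canonical_form(w, b, points):
    """Rescale (w, b) so that min_x |w.x + b| = 1; the hyperplane is unchanged."""
    scale = min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in points)
    return [wi / scale for wi in w], b / scale

# Hypothetical toy data: four points in the plane, separated by x1 + x2 = 3.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 2.0), (4.0, 4.0)]
w, b = [2.0, 2.0], -6.0

rho = geometric_margin(w, b, points)           # margin before rescaling
w_c, b_c = canonical_form(w, b, points)        # canonical representation
norm_c = math.sqrt(sum(wi * wi for wi in w_c))
```

After rescaling, `1 / norm_c` equals `rho`, so maximizing the margin and minimizing the norm of the canonical weight vector are the same problem.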

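This covers the linearly separable case. For data that are not linearly separable we introduce slack variables $\xi_i$, and the problem becomes: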

\begin{align*} \min_{w,b,\xi}\ \ \ & \frac{1}{2}\|w\|_{2}^{2} + C\sum_{i=1}^m\xi_i \\ \text{s.t.}\ \ \ & y_i(w\cdot x_i + b) \geq 1- \xi_i \\ \ \ \ & \xi_i \geq 0,\ \ \ \ i\in 1,..., m\end{align*}
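For a fixed $(w,b)$, the smallest feasible slack is $\xi_i=\max(0,\,1-y_i(w\cdot x_i+b))$, so the objective can be evaluated directly. The sketch below does this on hypothetical data; the function name and the sample values are illustrative assumptions, and no optimization is performed.

```python
def soft_margin_objective(w, b, C, xs, ys):
    """Return (objective, slacks) using xi_i = max(0, 1 - y_i (w.x_i + b))."""
    slacks = [max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
              for x, y in zip(xs, ys)]
    objective = 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)
    return objective, slacks

# Hypothetical sample: the second point is inside the margin,
# the fourth is on the wrong side of the hyperplane x1 = 0.
xs = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (0.5, 1.0)]
ys = [1, 1, -1, -1]
objective, slacks = soft_margin_objective([1.0, 0.0], 0.0, 1.0, xs, ys)
```

Points with $0<\xi_i\leq 1$ are correctly classified but inside the margin; points with $\xi_i>1$ are misclassified.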

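The models above are stated in the primal space. Using the Lagrangian we can convert them to their dual form, in which kernels can be used to obtain nonlinear classifiers.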

2. Margin Theory

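Definition 1: The geometric margin $\rho(x)$ of a point $x$ with label $y$ with respect to a linear classifier $h:\ x\rightarrow w\cdot x+b$ is its signed distance to the hyperplane $w\cdot x + b=0$ (positive when $x$ is classified correctly):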

$$\rho(x) = \frac{y(w\cdot x + b)}{\|w\|_2}$$

For a sample $S=(x_1,x_2,...,x_m)$, the margin of the linear classifier $h$ is the smallest margin over the points in the sample:

$$\rho=\min _{1\leq i\leq m}\frac{y_i(w\cdot x_i + b)}{\|w\|_2}$$

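The VC-dimension of hyperplanes in $\mathbb{R}^N$ is $N+1$, so Corollary 2.4 gives:

$$\mathcal{R}(h) \leq \widehat{\mathcal{R}}(h) + \sqrt{\frac{2(N+1)\log\frac{em}{N+1}}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$$

This bound depends on the dimension $N$. With kernel methods $N$ can be very large or even infinite, in which case the bound is uninformative.

Next, we derive a bound from the point of view of the margin.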

Theorem 4.1: Let $S \subseteq\{x:\|x\|_2\leq \gamma\}$. Then the VC-dimension $d$ of the set of canonical hyperplanes $\{x\rightarrow sgn(w\cdot x):\min _{x\in S}|w\cdot x|=1\ \wedge\ \|w\|_2\leq \Lambda\}$ satisfies:

$$d\leq \gamma^2\Lambda^2.$$

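Proof: Suppose $\{x_1,x_2,...,x_d\}$ can be shattered by canonical hyperplanes. Then for every $y=(y_1,y_2,...,y_d)\in\{-1,+1\}^d$ there exists $w$ such that $1 \leq y_i(w\cdot x_i)$ for all $i \in [1,d]$. Summing these inequalities over $i$ and applying the Cauchy-Schwarz inequality gives

$$d\leq w\cdot\sum _{i=1}^d y_ix_i \leq \|w\|_2\Big\|\sum _{i=1}^d y_ix_i\Big\|_2\leq\Lambda\Big\|\sum _{i=1}^d y_ix_i\Big\|_2$$

Since this holds for every $y$, it also holds in expectation when the $y_i$ are drawn independently and uniformly from $\{-1,+1\}$. Using Jensen's inequality and $E_y[y_iy_j]=\mathbb{I}(i=j)$:

\begin{align*} d &\leq \Lambda \mathop{E} _{y}\Big[\Big\|\sum _{i=1}^d y_ix_i\Big\|_2\Big] \\ & \leq \Lambda \Big[\mathop{E} _{y}\Big[\Big\|\sum _{i=1}^d y_ix_i\Big\|_2^2\Big]\Big]^{1/2} \\ & = \Lambda \Big[\sum _{i,j=1}^d\mathop{E} _y[y_iy_j](x_i\cdot x_j)\Big]^{1/2} \\ & = \Lambda \Big[\sum _{i=1}^d(x_i\cdot x_i)\Big]^{1/2} \\ & \leq \Lambda[d\gamma^2]^{1/2} = \Lambda\gamma\sqrt{d} \end{align*}

That is, $\sqrt{d} \leq \Lambda\gamma$, or equivalently $d\leq\gamma^2\Lambda^2$.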
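The key identity in the proof, $E_y\big[\|\sum_i y_ix_i\|_2^2\big]=\sum_i\|x_i\|_2^2$, can be verified by brute force on a small example. This is only a sanity check under assumed toy points; it enumerates all $2^d$ sign vectors, so it is practical only for small $d$.

```python
import itertools

def mean_squared_norm(points):
    """Exact E_y ||sum_i y_i x_i||_2^2 over all sign vectors y in {-1,+1}^d."""
    d = len(points)
    dim = len(points[0])
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=d):
        s = [sum(y * x[k] for y, x in zip(signs, points)) for k in range(dim)]
        total += sum(v * v for v in s)
    return total / 2 ** d

# Hypothetical points in the plane.
points = [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
lhs = mean_squared_norm(points)                       # averaged over all signs
rhs = sum(sum(v * v for v in x) for x in points)      # sum of squared norms
```

The cross terms $y_iy_j(x_i\cdot x_j)$ with $i\neq j$ cancel in the average, which is why only the diagonal terms survive.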

We can also bound the empirical Rademacher complexity in terms of $\gamma$ and $\Lambda$.

Theorem 4.2: Let $S \subseteq\{x:\|x\|_2\leq \gamma\}$ be a sample of size $m$, and let $H=\{x\rightarrow w\cdot x:\|w\|_2\leq\Lambda\}$. Then the empirical Rademacher complexity of $H$ is bounded as follows:

$$\widehat{\mathfrak{R}}_S(H)\leq \sqrt{\frac{\gamma^2\Lambda^2}{m}}$$

Proof:

\begin{align*} \widehat{\mathfrak{R}}_S(H) &= \frac{1}{m}\mathop{E} _{\sigma}[\sup _{\|w\|_2\leq\Lambda}\sum _{i=1}^m\sigma_iwx_i] \\ &= \frac{1}{m}\mathop{E} _{\sigma}[\sup _{\|w\|_2\leq\Lambda}w\sum _{i=1}^m\sigma_ix_i] \\ &\leq \frac{\Lambda}{m}\mathop{E} _\sigma[\|\sum _{i=1}^m\sigma_ix_i\|_2] \\ &\leq \frac{\Lambda}{m}[\mathop{E} _\sigma[\|\sum _{i=1}^m\sigma_ix_i\|_2^2]]^{1/2} \\ &= \frac{\Lambda}{m}[\mathop{E} _\sigma[\sum _{i,j=1}^m\sigma_i\sigma_j(x_ix_j)]]^{1/2} \\ &\leq \frac{\Lambda}{m}[\sum _{i=1}^m\|x_i\|_2^2]^{1/2}\\ &\leq \frac{\Lambda\sqrt{m\gamma^2}}{m}=\sqrt{\frac{\gamma^2\Lambda^2}{m}}\end{align*}
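For a small sample the empirical Rademacher complexity of this linear class can be computed exactly, because for fixed $\sigma$ the supremum over $\|w\|_2\leq\Lambda$ equals $\Lambda\|\sum_i\sigma_ix_i\|_2$. The sketch below enumerates all $\sigma$ for an assumed four-point sample and compares the result with the bound $\sqrt{\gamma^2\Lambda^2/m}$; the sample and the value of `lam` are hypothetical.

```python
import itertools
import math

def empirical_rademacher_linear(points, lam):
    """Exact empirical Rademacher complexity of {x -> w.x : ||w||_2 <= lam}.

    For fixed sigma, sup_w w.v = lam * ||v||_2 with v = sum_i sigma_i x_i.
    """
    m = len(points)
    dim = len(points[0])
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=m):
        v = [sum(s * x[k] for s, x in zip(sigma, points)) for k in range(dim)]
        total += lam * math.sqrt(sum(c * c for c in v))
    return total / (2 ** m) / m

# Hypothetical sample of unit-norm points, so gamma is (numerically) 1.
points = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8), (-0.8, 0.6)]
lam = 2.0
gamma = max(math.sqrt(sum(c * c for c in x)) for x in points)
complexity = empirical_rademacher_linear(points, lam)
bound = math.sqrt(gamma ** 2 * lam ** 2 / len(points))
```

The gap between `complexity` and `bound` comes entirely from the Jensen step in the proof.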

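To state a bound on the generalization error, we first define some loss functions.

Definition 2: Margin loss function. For any $\rho>0$, the $\rho$-margin loss is the function $L_\rho:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}_+$ defined for all $y,y'\in\mathbb{R}$ by $L_\rho(y,y')=\Phi_\rho(yy')$, where:

\begin{equation*} \Phi_\rho(x)= \begin{cases} 0 & \text{if } \rho\leq x\\ 1-x/\rho & \text{if } 0\leq x\leq\rho\\ 1 & \text{if } x\leq 0 \end{cases} \end{equation*}

Definition 3: Empirical margin loss. Given a sample $S=(x_1,x_2,...,x_m)$ and a hypothesis $h$, the empirical margin loss is defined as: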

$$\widehat{\mathcal{R}}_\rho(h) = \frac{1}{m}\sum _{i=1}^m\Phi_\rho(y_ih(x_i))$$

Note that $\Phi_\rho(y_ih(x_i))\leq\mathbb{I}(y_ih(x_i)\leq\rho)$ for every $i\in[1,m]$. The empirical margin loss is therefore bounded above as follows:

$$\widehat{\mathcal{R}}_\rho(h)\leq\frac{1}{m}\sum _{i=1}^m\mathbb{I}(y_ih(x_i)\leq \rho)$$

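Any result that uses the empirical margin loss as an upper bound therefore remains valid when the empirical margin loss is replaced by this upper bound, which has a simple interpretation: it is the fraction of sample points that are either misclassified or classified with confidence at most $\rho$.
The function $\Phi_\rho$ is $1/\rho$-Lipschitz.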
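The two quantities just defined are easy to compute side by side. The following sketch assumes a hypothetical list of real-valued scores $h(x_i)$ and labels $y_i$; the function names are illustrative, not taken from any library.

```python
def margin_loss(u, rho):
    """Phi_rho(u): 1 for u <= 0, linear on [0, rho], 0 for u >= rho."""
    if u <= 0:
        return 1.0
    if u >= rho:
        return 0.0
    return 1.0 - u / rho

def empirical_margin_loss(scores, labels, rho):
    """Average of Phi_rho(y_i h(x_i)) over the sample."""
    return sum(margin_loss(y * s, rho) for s, y in zip(scores, labels)) / len(scores)

def margin_error_fraction(scores, labels, rho):
    """Fraction of points with y_i h(x_i) <= rho (the indicator upper bound)."""
    return sum(1 for s, y in zip(scores, labels) if y * s <= rho) / len(scores)

# Hypothetical scores h(x_i) and labels y_i.
scores = [2.0, 0.5, -1.0, 0.25]
labels = [1, 1, 1, -1]
rho = 1.0
loss = empirical_margin_loss(scores, labels, rho)
upper = margin_error_fraction(scores, labels, rho)
```

Here the second point is correctly classified but has margin $0.5<\rho$, so it contributes $0.5$ to the margin loss and a full $1$ to the indicator bound.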

Lemma 4.1 (Talagrand's lemma): Let $\Phi:\mathbb{R}\rightarrow\mathbb{R}$ be $l$-Lipschitz. Then for any hypothesis set $H$ of real-valued functions, the following inequality holds:

$$\widehat{\mathfrak{R}}_S(\Phi\circ H)\leq l\,\widehat{\mathfrak{R}}_S(H)$$

Proof: Fix a sample $S=(x_1,x_2,...,x_m)$. By definition:
\begin{align*} \widehat{\mathfrak{R}}_S(\Phi\circ H)&=\frac{1}{m}\mathop{E}_\sigma[\sup_{h\in H}\sum_{i=1}^m\sigma_i(\Phi\circ h)(x_i)]\\ &=\frac{1}{m}\mathop{E}_{\sigma_1,...,\sigma_{m-1}}[\mathop{E}_{\sigma_m}[\sup_{h\in H}U_{m-1}(h)+\sigma_m(\Phi\circ h)(x_m)]] \end{align*}

where $U_{m-1}(h)=\sum_{i=1}^{m-1}\sigma_i(\Phi\circ h)(x_i)$.

By the definition of the supremum (the least upper bound), for any $\epsilon > 0$ there exist $h_1,h_2\in H$ such that

$$U_{m-1}(h_1) + (\Phi\circ h_1)(x_m)\geq(1-\epsilon)[\sup_{h\in H}U_{m-1}(h)+(\Phi\circ h)(x_m)]$$

$$U_{m-1}(h_2) - (\Phi\circ h_2)(x_m)\geq(1-\epsilon)[\sup_{h\in H}U_{m-1}(h)-(\Phi\circ h)(x_m)]$$

Hence, for any $\epsilon > 0$, by the definition of $E_{\sigma_m}$ we have

\begin{eqnarray*} & &(1-\epsilon)\mathop{E}_{\sigma_m}[\sup_{h\in H}U_{m-1}(h)+\sigma_m(\Phi\circ h)(x_m)] \\ &=&(1-\epsilon)\Big[\frac{1}{2}\sup_{h\in H}[U_{m-1}(h)+(\Phi\circ h)(x_m)]+\frac{1}{2}\sup_{h\in H}[U_{m-1}(h)-(\Phi\circ h)(x_m)]\Big]\\ &\leq&\frac{1}{2}[U_{m-1}(h_1)+(\Phi\circ h_1)(x_m)]+\frac{1}{2}[U_{m-1}(h_2)-(\Phi\circ h_2)(x_m)]\end{eqnarray*}

Let $s=sgn(h_1(x_m)-h_2(x_m))$. Since $\Phi$ is $l$-Lipschitz:

\begin{align*} |(\Phi\circ h_1)(x_m)-(\Phi\circ h_2)(x_m)|&\leq l|h_1(x_m)-h_2(x_m)| \\ &=sl(h_1(x_m)-h_2(x_m))\end{align*}

The previous inequality can therefore be bounded further:

\begin{eqnarray*} & &(1-\epsilon)\mathop{E}_{\sigma_m}[\sup_{h\in H}U_{m-1}(h)+\sigma_m(\Phi\circ h)(x_m)] \\ &\leq&\frac{1}{2}[U_{m-1}(h_1)+U_{m-1}(h_2)+sl(h_1(x_m)-h_2(x_m))]\\ &=&\frac{1}{2}[U_{m-1}(h_1)+slh_1(x_m)+U_{m-1}(h_2)-slh_2(x_m)]\\ &\leq&\frac{1}{2}\sup_{h\in H}[U_{m-1}(h)+slh(x_m)]+\frac{1}{2}\sup_{h\in H}[U_{m-1}(h)-slh(x_m)]\\ &=&\mathop{E}_{\sigma_m}[\sup_{h\in H}U_{m-1}(h)+\sigma_mlh(x_m)]\end{eqnarray*}

Since this inequality holds for every $\epsilon>0$, we must have

$$\mathop{E}_{\sigma_m}[\sup_{h\in H}U_{m-1}(h)+\sigma_m(\Phi\circ h)(x_m)]\leq \mathop{E}_{\sigma_m}[\sup_{h\in H}U_{m-1}(h)+\sigma_mlh(x_m)]$$

Applying the same argument in turn to each of the remaining indices $i=1,...,m-1$ gives:

\begin{eqnarray*} &\ &\frac{1}{m}\mathop{E}_{\sigma_1,...,\sigma_m}[\sup_{h\in H}\sum_{i=1}^m\sigma_i(\Phi\circ h)(x_i)]\\ &\leq&\frac{1}{m}\mathop{E}_{\sigma_1,...,\sigma_{m-1}}[\mathop{E}_{\sigma_m}[\sup_{h\in H}U_{m-1}(h)+\sigma_mlh(x_m)]]\\ &\leq&\frac{1}{m}\mathop{E}_{\sigma_1,...,\sigma_{m-2}}[\mathop{E}_{\sigma_{m-1}\sigma_m}[\sup_{h\in H}U_{m-2}(h)+\sigma_{m-1}lh(x_{m-1})+\sigma_mlh(x_m)]]\\ &\ & ...\\ &\leq&\frac{1}{m}\mathop{E}_{\sigma_1,...,\sigma_m}[\sup_{h\in H}\sigma_1lh(x_1)+\sigma_2lh(x_2)+...+\sigma_mlh(x_m)]\\ &=&l\widehat{\mathfrak{R}}_S(H)\end{eqnarray*}
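For a finite hypothesis set the lemma can be checked by exact enumeration. The sketch below uses a hypothetical table `H_values` of three hypotheses evaluated on four points, takes $\Phi=\Phi_\rho$ with $\rho=0.5$ (so $l=1/\rho=2$), and compares both sides. It illustrates the statement on one assumed example and is not a substitute for the proof.

```python
import itertools

def empirical_rademacher(values):
    """Exact empirical Rademacher complexity of a finite class.

    values[j][i] is the output of hypothesis j on sample point i.
    """
    m = len(values[0])
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=m):
        total += max(sum(s * v for s, v in zip(sigma, row)) for row in values)
    return total / (2 ** m) / m

def margin_loss(u, rho):
    """Phi_rho(u), written as a clipped linear function; it is (1/rho)-Lipschitz."""
    return min(1.0, max(0.0, 1.0 - u / rho))

rho = 0.5
lipschitz = 1.0 / rho
# Hypothetical outputs of three hypotheses on four sample points.
H_values = [[0.9, -0.3, 0.4, 0.1],
            [-0.2, 0.7, 0.3, -0.6],
            [0.5, 0.5, -0.8, 0.2]]
composed = [[margin_loss(v, rho) for v in row] for row in H_values]
lhs = empirical_rademacher(composed)               # complexity of Phi o H
rhs = lipschitz * empirical_rademacher(H_values)   # l times complexity of H
```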

Theorem 4.3 (Margin bound for binary classification): Let $H$ be a set of real-valued functions. Fix $\rho>0$. Then for any $\delta>0$, with probability at least $1-\delta$, each of the following inequalities holds for all $h\in H$:

$$\mathcal{R}(h)\leq\widehat{\mathcal{R}}_\rho(h)+\frac{2}{\rho}\mathfrak{R}_m(H) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$$

$$\mathcal{R}(h)\leq\widehat{\mathcal{R}}_\rho(h)+\frac{2}{\rho}\widehat{\mathfrak{R}}_S(H) +3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}$$

Proof: Let $\widetilde{H}=\{z=(x,y)\rightarrow yh(x):h\in H\}$ and consider the family of functions taking values in $[0,1]$, $\widetilde{\mathcal{H}}=\{\Phi_\rho\circ f:f\in\widetilde{H}\}$. By Theorem 2.1, with probability at least $1-\delta$ the following holds for all $g\in\widetilde{\mathcal{H}}$:

$$E[g(z)]\leq\frac{1}{m}\sum_{i=1}^mg(z_i)+2\mathfrak{R}_m(\widetilde{\mathcal{H}})+\sqrt{\frac{log\frac{1}{\delta}}{2m}}$$

That is, for all $h\in H$:

$$E[\Phi_\rho(yh(x))]\leq\widehat{\mathcal{R}}_\rho(h)+2\mathfrak{R}_m(\Phi_\rho\circ \widetilde{H})+\sqrt{\frac{log\frac{1}{\delta}}{2m}}$$

Moreover, since $\mathbb{I}(u\leq 0)\leq\Phi_\rho(u)$ for all $u$,

$$\mathcal{R}(h)=E[\mathbb{I}(yh(x)\leq 0)]\leq E[\Phi_\rho(yh(x))]$$

Therefore

$$\mathcal{R}(h)\leq\widehat{\mathcal{R}}_\rho(h)+2\mathfrak{R}_m(\Phi_\rho\circ \widetilde{H})+\sqrt{\frac{\log\frac{1}{\delta}}{2m}}$$

Since $\Phi_\rho$ is $\frac{1}{\rho}$-Lipschitz, Lemma 4.1 gives $\widehat{\mathfrak{R}}_S(\Phi_\rho\circ \widetilde{H})\leq\frac{1}{\rho}\widehat{\mathfrak{R}}_S(\widetilde{H})$ for every sample $S$.

Hence

$$\mathfrak{R}_m(\Phi_\rho\circ \widetilde{H})=\mathop{E}_S[\widehat{\mathfrak{R}}_S(\Phi_\rho\circ \widetilde{H})]\leq\frac{1}{\rho}\mathop{E}_S[\widehat{\mathfrak{R}}_S(\widetilde{H})]=\frac{1}{\rho}\mathfrak{R}_m(\widetilde{H})$$

Furthermore, since $\sigma_iy_i$ has the same distribution as $\sigma_i$ when $y_i\in\{-1,+1\}$,

$$\mathfrak{R}_m(\widetilde{H})=\frac{1}{m}\mathop{E}_{S,\sigma}[\sup_{h\in H}\sum_{i=1}^m\sigma_iy_ih(x_i)]=\frac{1}{m}\mathop{E}_{S,\sigma}[\sup_{h\in H}\sum_{i=1}^m\sigma_ih(x_i)]=\mathfrak{R}_m(H)$$

Therefore

$$\mathcal{R}(h)\leq \widehat{\mathcal{R}}_\rho(h) + \frac{2}{\rho}\mathfrak{R}_m(H) +\sqrt{\frac{log\frac{1}{\delta}}{2m}}$$

The second inequality is obtained in the same way, starting from the second inequality of Lemma 2.1.
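As a worked instance of Theorem 4.3, one can plug in the linear class $H=\{x\rightarrow w\cdot x:\|w\|_2\leq\Lambda\}$ of Theorem 4.2. This is a sketch under an added assumption that is not stated above: the input distribution is supported on $\{x:\|x\|_2\leq\gamma\}$, so that Theorem 4.2 applies to every sample $S$ and therefore also in expectation.

$$\mathfrak{R}_m(H)=\mathop{E}_S[\widehat{\mathfrak{R}}_S(H)]\leq\sqrt{\frac{\gamma^2\Lambda^2}{m}}\quad\Longrightarrow\quad\mathcal{R}(h)\leq\widehat{\mathcal{R}}_\rho(h)+2\sqrt{\frac{\gamma^2\Lambda^2/\rho^2}{m}}+\sqrt{\frac{\log\frac{1}{\delta}}{2m}}$$

Under that assumption the bound depends on the data only through the ratio $\gamma\Lambda/\rho$ and not on the dimension $N$, which is what makes it usable with kernels.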

3. Margin-Based Analysis

First, we write the final combined classifier produced by boosting as $g=\sum_{t=1}^T\alpha_th_t\triangleq\alpha\cdot h(x)$, where $\alpha=(\alpha_1,\alpha_2,...,\alpha_T)^T$ and $h(x)=(h_1(x),h_2(x),...,h_T(x))^T$. We then define the margin for boosting.

Definition 4: $L_1$-margin. The $L_1$-margin $\rho(x)$ of a point $x\in \mathcal{X}$ with label $y\in\{-1,+1\}$ for a linear combination of base classifiers $g=\sum_{t=1}^T\alpha_th_t$ with $\alpha\neq 0$ and $h_t\in H$ for all $t\in[1,T]$ is defined as
$$\rho(x)=\frac{yg(x)}{\sum_{t=1}^T|\alpha_t|}=y\frac{\alpha\cdot h(x)}{\|\alpha\|_1}$$
The $L_1$-margin of a linear combination classifier g with respect to a sample $S=(x_1,x_2,...,x_m)$ is the minimum margin of the points within the sample:
$$ \rho=\min_{i\in[1,m]}y_i\frac{\alpha\cdot h(x_i)}{\|\alpha\|_1}$$

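When $\alpha_t\geq0$ (as is the case in AdaBoost), $\rho(x)$ is a convex combination of $yh_1(x),yh_2(x),...,yh_T(x)$. If the $h_t$ take values in $[-1,+1]$, then so does $\rho(x)$, and $|\rho(x)|$ can be read as the confidence with which the classifier $g(x)=\sum_{t=1}^T\alpha_th_t(x)$ assigns the label $y$ to $x$.
Compare this margin with the SVM margin: the SVM margin is defined with the $l_2$-norm, whereas the margin here is defined with the $l_1$-norm.
$$\rho_1(x)=\frac{|\alpha\cdot h(x)|}{\|\alpha\|_1}\ \ \ \ \ \ \rho_2(x)=\frac{|\alpha\cdot h(x)|}{\|\alpha\|_2}$$
When $p,q\geq 1$ and $1/p+1/q=1$, $p$ and $q$ are conjugate, and the $L_q$ distance from a point $x$ to the hyperplane $\alpha\cdot x=0$ is $|\alpha\cdot x|/\|\alpha\|_p$. Thus $\rho_2(x)$ is the $l_2$ (Euclidean) distance from the point $h(x)$ to the hyperplane $\alpha\cdot x=0$, and $\rho_1(x)$ is its $l_\infty$ distance to the same hyperplane. (The $l_2$ distance is the straight-line distance from the point to its orthogonal projection on the hyperplane; the $l_\infty$ distance is the smallest value, over points of the hyperplane, of the largest coordinate-wise difference from the point.)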
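The two normalizations can be compared on a concrete vector. This sketch uses hypothetical weights `alpha` and base-classifier outputs `h_values` for a single point with label $+1$; `lp_margin` is an illustrative helper, not a library function.

```python
import math

def lp_margin(alpha, h_values, y, p):
    """Signed margin y * (alpha . h(x)) / ||alpha||_p for p = 1 or p = 2."""
    dot = sum(a * h for a, h in zip(alpha, h_values))
    if p == 1:
        norm = sum(abs(a) for a in alpha)
    else:
        norm = math.sqrt(sum(a * a for a in alpha))
    return y * dot / norm

alpha = [0.5, 1.5, 2.0]     # hypothetical non-negative weights
h_values = [1, -1, 1]       # base classifier outputs h_t(x) in {-1, +1}
rho1 = lp_margin(alpha, h_values, +1, 1)
rho2 = lp_margin(alpha, h_values, +1, 2)
```

Because $\|\alpha\|_2\leq\|\alpha\|_1$, the $L_1$-margin is never larger in absolute value than the $L_2$-margin, and with outputs in $[-1,+1]$ it always lies in $[-1,+1]$.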

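We proceed in two steps: first we analyze the Rademacher complexity of convex combinations of hypotheses, then we use margin theory to analyze boosting.

For any hypothesis set $H$, define its convex hull $conv(H)$ as

$$conv(H)=\{\sum_{k=1}^p\mu_kh_k:p\geq1,\forall k\in[1,p],\mu_k\geq0,h_k\in H,\sum_{k=1}^p\mu_k=1\}$$

Theorem 4.4: Let $H$ be a set of functions from $\mathcal{X}$ to $\mathbb{R}$. Then for any sample $S$ we have: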

\begin{equation}\label{equ:10}\widehat{\mathfrak{R}}_S(conv(H))=\widehat{\mathfrak{R}}_S(H)\end{equation}

Proof:

\begin{align}\widehat{\mathfrak{R}}_S(conv(H))&=\frac{1}{m}\mathop{E}_\sigma[\sup_{h_1,...,h_p\in H,\mu\geq0,\|\mu\|_1=1}\sum_{i=1}^m\sigma_i\sum_{k=1}^p\mu_kh_k(x_i)]\nonumber\\ &=\frac{1}{m}\mathop{E}_\sigma[\sup_{h_1,...,h_p\in H,\mu\geq0}\sup_{\|\mu\|_1=1}\sum_{k=1}^p\mu_k\sum_{i=1}^m\sigma_ih_k(x_i)]\nonumber\\ &=\frac{1}{m}\mathop{E}_\sigma[\sup_{h_1,...,h_p\in H}\max_{k\in[1,p]}(\sum_{i=1}^m\sigma_ih_k(x_i))] \label{equ:11}\\ &=\frac{1}{m}\mathop{E}_\sigma[\sup_{h\in H}\sum_{i=1}^m\sigma_ih(x_i)]=\widehat{\mathfrak{R}}_S(H) \end{align}
Equality \ref{equ:11} holds because a convex combination is maximized by putting all of the weight on the term with the largest value.
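The step "a convex combination cannot beat its best component" can be illustrated numerically. The sketch below fixes one hypothetical sign vector `sigma` and three hypothetical base hypotheses, samples random convex weights with a seeded generator, and records the best correlation any mixture achieves. Random sampling only illustrates the inequality; the proof is what establishes it.

```python
import random

def correlation(sigma, row):
    """sum_i sigma_i * h(x_i) for one hypothesis evaluated on the sample."""
    return sum(s * v for s, v in zip(sigma, row))

# Hypothetical outputs of three base hypotheses on four points, and one sigma.
values = [[1, -1, 1, 1], [-1, -1, 1, -1], [1, 1, -1, 1]]
sigma = [1, -1, -1, 1]
best_single = max(correlation(sigma, row) for row in values)

rng = random.Random(0)
best_mixture = float("-inf")
for _ in range(1000):
    raw = [rng.random() for _ in values]
    mu = [r / sum(raw) for r in raw]                  # random convex weights
    mixed = [sum(m * row[i] for m, row in zip(mu, values))
             for i in range(len(sigma))]
    best_mixture = max(best_mixture, correlation(sigma, mixed))
```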

Equality \ref{equ:10} implies:
$$ \mathfrak{R}_m(conv(H))=\mathop{E}_{S}[\widehat{\mathfrak{R}}_S(conv(H))]=\mathop{E}_S[\widehat{\mathfrak{R}}_S(H)]=\mathfrak{R}_m(H)$$
Applying Theorem 4.3 to $conv(H)$ then yields Corollary 4.1.

Corollary 4.1 (Ensemble Rademacher margin bound): Let $H$ be a set of real-valued functions. Fix $\rho>0$. Then for any $\delta>0$, with probability at least $1-\delta$, each of the following inequalities holds for all $h\in conv(H)$:

$$\mathcal{R}(h)\leq \widehat{\mathcal{R}}_\rho(h)+\frac{2}{\rho}\mathfrak{R}_m(H)+\sqrt{\frac{log\frac{1}{\delta}}{2m}}$$

$$\mathcal{R}(h)\leq \widehat{\mathcal{R}}_\rho(h)+\frac{2}{\rho}\widehat{\mathfrak{R}}_S(H)+3\sqrt{\frac{log\frac{2}{\delta}}{2m}}$$

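Combining Corollary 2.1, Corollary 2.3 and Theorem 4.3 gives the following corollary:

Corollary 4.2 (Ensemble VC-dimension margin bound): Let $H$ be a family of functions taking values in $\{+1,-1\}$ with VC-dimension $d$. Fix $\rho>0$. Then for any $\delta>0$, with probability at least $1-\delta$, the following inequality holds for all $h\in conv(H)$: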

$$\mathcal{R}(h)\leq \widehat{\mathcal{R}}_\rho(h)+\frac{2}{\rho}\sqrt{\frac{2dlog\frac{em}{d}}{m}}+\sqrt{\frac{log\frac{1}{\delta}}{2m}}$$

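A convex combination requires the coefficients to sum to 1. The coefficients $\alpha_t$ produced by AdaBoost are guaranteed to be positive, but $\sum_{t=1}^T\alpha_t$ need not equal 1, so we must normalize them. Let $g=\sum_{t=1}^T\alpha_th_t$ be the classifier returned by AdaBoost after $T$ rounds, and normalize it as $\frac{g}{\|\alpha\|_1}=\sum_{t=1}^T\frac{\alpha_t}{\|\alpha\|_1}h_t\in conv(H)$. Since $sgn(g)=sgn(g/\|\alpha\|_1)$, we have $\mathcal{R}(g)=\mathcal{R}(g/\|\alpha\|_1)$, but in general $\widehat{\mathcal{R}}_\rho(g)\neq \widehat{\mathcal{R}}_\rho(g/\|\alpha\|_1)$.

Therefore, by Corollary 4.1 and Corollary 4.2: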
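The effect of dividing by $\|\alpha\|_1$ can be seen on a small example: the predicted signs do not change, but the empirical margin loss does. The weights, base-classifier outputs, and labels below are hypothetical values chosen for illustration.

```python
def margin_loss(u, rho):
    """Phi_rho(u), written as a clipped linear function."""
    return min(1.0, max(0.0, 1.0 - u / rho))

alpha = [0.5, 1.25, 0.25]                    # hypothetical positive weights
h_on_sample = [[1, 1, -1], [1, -1, 1], [-1, 1, 1], [1, 1, 1]]  # h_t(x_i), one row per point
labels = [1, -1, 1, 1]
rho = 0.4

norm1 = sum(alpha)
g = [sum(a * h for a, h in zip(alpha, row)) for row in h_on_sample]
g_normalized = [v / norm1 for v in g]

# Empirical margin loss before and after normalization.
loss_g = sum(margin_loss(y * v, rho) for v, y in zip(g, labels)) / len(g)
loss_normalized = sum(margin_loss(y * v, rho)
                      for v, y in zip(g_normalized, labels)) / len(g)
same_sign = all((a > 0) == (b > 0) for a, b in zip(g, g_normalized))
```

In this example every point has unnormalized margin at least $\rho$, so `loss_g` is zero, while after normalization one point falls inside the margin and `loss_normalized` becomes positive.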

\begin{align*} \mathcal{R}(g)=\mathcal{R}(g/\|\alpha\|_1) &\leq \widehat{\mathcal{R}}_\rho(g/\|\alpha\|_1)+\frac{2}{\rho}\mathfrak{R}_m(H)+\sqrt{\frac{log\frac{1}{\delta}}{2m}} \\ \mathcal{R}(g)=\mathcal{R}(g/\|\alpha\|_1) &\leq \widehat{\mathcal{R}}_\rho(g/\|\alpha\|_1)+\frac{2}{\rho}\widehat{\mathfrak{R}}_S(H)+3\sqrt{\frac{log\frac{2}{\delta}}{2m}} \\ \mathcal{R}(g)=\mathcal{R}(g/\|\alpha\|_1) &\leq \widehat{\mathcal{R}}_\rho(g/\|\alpha\|_1)+\frac{2}{\rho}\sqrt{\frac{2dlog\frac{em}{d}}{m}}+\sqrt{\frac{log\frac{1}{\delta}}{2m}}\end{align*}

From the definition of $\widehat{\mathcal{R}}_\rho(h)$ we know that:

$$\widehat{\mathcal{R}}_\rho(g/\|\alpha\|_1)\leq\frac{1}{m}\sum_{i=1}^m\mathbb{I}(y_ig(x_i)/\|\alpha\|_1\leq\rho)$$

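This upper bound can itself be bounded, as the following theorem shows:

Theorem 4.5: Let $g=\sum_{t=1}^T\alpha_th_t$ denote the classifier returned by AdaBoost after $T$ rounds, and suppose that $\epsilon_t<\frac{1}{2}$ for all $t\in[1,T]$, which implies $\alpha_t>0$. Then for any $\rho>0$ the following inequality holds:

$$\widehat{\mathcal{R}}_\rho(\frac{g}{\| \alpha \|_1})\leq 2^T \prod_{t=1}^T\sqrt{\epsilon_t^{1-\rho}(1-\epsilon_t)^{1+\rho}}$$

Proof: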

\begin{align*} \widehat{\mathcal{R}}_\rho(\frac{g}{\| \alpha \|_1}) &\leq \frac{1}{m}\sum_{i=1}^m\mathbb{I}(y_ig(x_i)-\rho\| \alpha \|_1\leq 0) \\ &\leq \frac{1}{m}\sum_{i=1}^m \exp(-y_ig(x_i)+\rho\| \alpha \|_1) \\ &= \frac{1}{m}\sum_{i=1}^m \exp(\rho \| \alpha \|_1)[m\prod_{t=1}^TZ_t]D_{T+1}(i) \\ &= e^{\rho \| \alpha \|_1}\prod_{t=1}^TZ_t = e^{\rho\sum_{t=1}^T\alpha_t}\prod_{t=1}^TZ_t \\ &= e^{\rho\sum_{t=1}^T\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}}\prod_{t=1}^T2\sqrt{\epsilon_t(1-\epsilon_t)} \\ &= 2^T\prod_{t=1}^T[\sqrt{\frac{1-\epsilon_t}{\epsilon_t}}]^\rho \sqrt{\epsilon_t(1-\epsilon_t)} \\ &= 2^T\prod_{t=1}^T\sqrt{\epsilon_t^{1-\rho}(1-\epsilon_t)^{1+\rho}} \\ &= \prod_{t=1}^T\sqrt{4\epsilon_t^{1-\rho}(1-\epsilon_t)^{1+\rho}}\end{align*}
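The chain of inequalities can be checked end to end with a small AdaBoost run. The following is a minimal sketch, not a reference implementation: it assumes 1-D data with unit spacing, threshold stumps as base classifiers, the standard update $\alpha_t=\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$, and a hypothetical six-point sample that no single stump separates. It then compares the fraction of points with normalized margin at most $\rho$ against the product bound.

```python
import math

def all_stumps(xs):
    """Predictions of every threshold stump (both orientations) on 1-D points."""
    thresholds = [min(xs) - 1.0] + [x + 0.5 for x in xs]  # assumes unit spacing
    return [[s if x > th else -s for x in xs] for th in thresholds for s in (1, -1)]

def weighted_error(pred, ys, dist):
    return sum(d for p, y, d in zip(pred, ys, dist) if p != y)

def adaboost(xs, ys, rounds):
    """Return (alphas, chosen stump predictions, weighted errors eps_t)."""
    m = len(xs)
    dist = [1.0 / m] * m
    stumps = all_stumps(xs)
    alphas, chosen, errors = [], [], []
    for _ in range(rounds):
        best = min(stumps, key=lambda p: weighted_error(p, ys, dist))
        eps = weighted_error(best, ys, dist)
        if not 0.0 < eps < 0.5:          # the theorem assumes eps_t < 1/2
            break
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        dist = [d * math.exp(-alpha * y * p) for d, p, y in zip(dist, best, ys)]
        z = sum(dist)                    # normalization factor Z_t
        dist = [d / z for d in dist]
        alphas.append(alpha)
        chosen.append(best)
        errors.append(eps)
    return alphas, chosen, errors

# Hypothetical sample with labels + + - - + +.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1, 1, -1, -1, 1, 1]
rho = 0.1
alphas, chosen, errors = adaboost(xs, ys, rounds=5)

norm1 = sum(alphas)
g = [sum(a * h[i] for a, h in zip(alphas, chosen)) for i in range(len(xs))]
# Fraction of points whose normalized margin is at most rho.
fraction = sum(1 for gi, y in zip(g, ys) if y * gi / norm1 <= rho) / len(xs)

bound = 1.0
for eps in errors:
    bound *= 2.0 * math.sqrt(eps ** (1 - rho) * (1 - eps) ** (1 + rho))
```

Since the empirical margin loss is itself at most `fraction`, the comparison `fraction <= bound` covers the statement of the theorem.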

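Remarks:

(1) Suppose that $\gamma \leq \frac{1}{2}-\epsilon_t$ for all $t\in[1,T]$ and that $\rho\leq 2\gamma$. The function $f(\epsilon_t)=4\epsilon_t^{1-\rho}(1-\epsilon_t)^{1+\rho}$ is increasing for $\epsilon_t\leq\frac{1-\rho}{2}$, and $\frac{1}{2}-\gamma\leq\frac{1-\rho}{2}$, so over the admissible range $\epsilon_t\leq\frac{1}{2}-\gamma$ it attains its maximum at $\epsilon_t=\frac{1}{2}-\gamma$. Hence $$ \widehat{\mathcal{R}}_\rho(\frac{g}{\| \alpha \|_1})\leq [(1-2\gamma)^{1-\rho}(1+2\gamma)^{1+\rho}]^{T/2}$$

When $\sqrt{(1-2\gamma)^{1-\rho}(1+2\gamma)^{1+\rho}}<1$, that is, when $\rho <\theta(\gamma)\triangleq \frac{-\ln(1-4\gamma^2)}{\ln(\frac{1+2\gamma}{1-2\gamma})}$, the bound on $\widehat{\mathcal{R}}_\rho(\frac{g}{\| \alpha \|_1})$ decreases exponentially in $T$. Moreover, in
$$ \widehat{\mathcal{R}}_\rho(\frac{g}{\| \alpha \|_1}) \leq \frac{1}{m}\sum_{i=1}^m\mathbb{I}(y_ig(x_i)-\rho\| \alpha \|_1\leq 0)$$
the right-hand side is always a multiple of $\frac{1}{m}$, and by the proof of Theorem 4.5 it obeys the same exponential bound, so it must equal 0 once $T$ is large enough. At that point every sample point has margin greater than $\rho$.

Since $\rho$ can be taken arbitrarily close to $\theta(\gamma)$, for sufficiently large $T$ the quantity $\theta(\gamma)$ acts as a lower bound on the minimum margin over the training set.

(2) From the expression for $\theta(\gamma)$, a larger $\gamma$ gives a larger $\theta(\gamma)$, i.e. a larger minimum margin. More generally, the larger the edge $\gamma_t$ at each round, the larger the minimum margin. This ties the edge to the margin.

(3) This also helps explain why AdaBoost tends not to overfit: even after the training error reaches 0, additional rounds can still reduce the prediction error, because the margin, and hence the confidence, keeps increasing.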
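The threshold $\theta(\gamma)$ from remark (1) and its monotonicity from remark (2) can be evaluated directly. The sketch below assumes $0<\gamma<\frac{1}{2}$; the function names are illustrative.

```python
import math

def theta(gamma):
    """theta(gamma) = -ln(1 - 4 gamma^2) / ln((1 + 2 gamma) / (1 - 2 gamma))."""
    return -math.log(1 - 4 * gamma ** 2) / math.log((1 + 2 * gamma) / (1 - 2 * gamma))

def per_round_factor(gamma, rho):
    """sqrt((1 - 2 gamma)^(1 - rho) (1 + 2 gamma)^(1 + rho)), the base of the T-th power."""
    return math.sqrt((1 - 2 * gamma) ** (1 - rho) * (1 + 2 * gamma) ** (1 + rho))

thresholds = [theta(g) for g in (0.1, 0.2, 0.4)]      # increasing in gamma
factor_below = per_round_factor(0.2, 0.1)             # rho < theta(0.2): factor < 1
factor_at = per_round_factor(0.2, theta(0.2))         # rho = theta(0.2): factor = 1
```

For any $\rho$ below the threshold the per-round factor is strictly less than 1, which is what drives the exponential decrease in $T$.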
