
  Aggregation models mix or combine hypotheses for better performance, and they form a rich family. Aggregation can do better with many (possibly weaker) hypotheses.

  Suppose we have $T$ hypotheses, denoted by $g_1$, $g_2$, ..., $g_T$. There are four different approaches to building an aggregation model:

1. Select the best one, $g_{t_*}$, by validation error: $$G(x)=g_{t_*}(x) \quad \text{with} \quad t_*=\operatorname*{argmin}_{t \in \{1,2,\ldots,T\}}E_{val}(g^-_t)$$

2. Mix all hypotheses uniformly: $$G(x)=\operatorname{sign}\Big(\sum_{t=1}^T 1\cdot g_t(x)\Big)$$

3. Mix all hypotheses non-uniformly: $$G(x)=\operatorname{sign}\Big(\sum_{t=1}^T\alpha_t\cdot g_t(x)\Big) \quad \text{with} \quad \alpha_t \geq 0$$

  NOTE: this includes selection ($\alpha_t=1$ for $t=t_*$ and $0$ otherwise) and uniform mixing ($\alpha_t=1$ for all $t$) as special cases.

4. Combine all hypotheses conditionally: $$G(x)=\operatorname{sign}\Big(\sum_{t=1}^Tq_t(x)\cdot g_t(x)\Big) \quad \text{with} \quad q_t(x)\geq 0$$

  NOTE: this includes non-uniform mixing as a special case ($q_t(x)=\alpha_t$, a constant).
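The four approaches above can be sketched in a few lines of Python. This is only an illustrative sketch for binary classification: the function names, the tie-breaking rule in `sign`, and the three 1-D stumps are assumptions made for the example, not part of the notes.

```python
from typing import Callable, Sequence

Hypothesis = Callable[[float], int]  # maps an input x to a label in {-1, +1}


def sign(v: float) -> int:
    """Sign with ties (v == 0) broken toward +1; the tie rule is an assumption."""
    return 1 if v >= 0 else -1


def select_best(gs: Sequence[Hypothesis], val_errors: Sequence[float]) -> Hypothesis:
    """Approach 1: return the hypothesis with the smallest validation error."""
    t_star = min(range(len(gs)), key=lambda t: val_errors[t])
    return gs[t_star]


def mix_uniform(gs: Sequence[Hypothesis]) -> Hypothesis:
    """Approach 2: one vote per hypothesis."""
    return lambda x: sign(sum(g(x) for g in gs))


def mix_nonuniform(gs: Sequence[Hypothesis], alphas: Sequence[float]) -> Hypothesis:
    """Approach 3: hypothesis t gets alpha_t >= 0 votes."""
    return lambda x: sign(sum(a * g(x) for a, g in zip(alphas, gs)))


def combine_conditional(gs: Sequence[Hypothesis],
                        qs: Sequence[Callable[[float], float]]) -> Hypothesis:
    """Approach 4: hypothesis t gets q_t(x) >= 0 votes, depending on the input x."""
    return lambda x: sign(sum(q(x) * g(x) for q, g in zip(qs, gs)))


# Three hypothetical 1-D decision stumps, used only for illustration.
stumps = [
    lambda x: sign(x - 1.0),
    lambda x: sign(3.0 - x),
    lambda x: sign(x - 2.0),
]
```

With these definitions, `mix_nonuniform(stumps, [0, 1, 0])` behaves like selecting `stumps[1]`, and `mix_nonuniform(stumps, [1, 1, 1])` behaves like `mix_uniform(stumps)`, which mirrors the two NOTE remarks.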

Why does aggregation work?

In the left graph, we get a strong $G(x)$ by mixing different weak hypotheses uniformly. In this sense, aggregation acts like a feature transform.

In the right graph, we get a moderate $G(x)$ by mixing different weak hypotheses uniformly. In this sense, aggregation acts like regularization.

    aggregation type    blending            learning
    uniform             voting/averaging    Bagging
    non-uniform         linear              AdaBoost
    conditional         stacking            Decision Tree

Uniform Blending

Classification: $G(x)=\operatorname{sign}\Big(\sum_{t=1}^T 1\cdot g_t(x)\Big)$

Regression: $G(x)=\frac{1}{T}\sum_{t=1}^T g_t(x)$


Uniform blending can also reduce variance, which gives more stable performance (the mathematical derivation is in the course handout 207_handout.pdf).
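The variance-reduction claim can be sketched for squared-error regression. This is the standard argument, written under assumptions not stated in these notes: a fixed target function $f$, a fixed input $x$ (dropped from the notation), the uniform blend $G=\frac{1}{T}\sum_{t=1}^T g_t$, and $\operatorname{avg}$ denoting the average over $t$:

$$\begin{aligned}
\operatorname{avg}\big((g_t-f)^2\big) &= \operatorname{avg}(g_t^2)-2Gf+f^2 \\
&= \operatorname{avg}(g_t^2)-G^2+(G-f)^2 \\
&= \operatorname{avg}\big(g_t^2-2g_tG+G^2\big)+(G-f)^2 \\
&= \operatorname{avg}\big((g_t-G)^2\big)+(G-f)^2
\end{aligned}$$

Since $\operatorname{avg}\big((g_t-G)^2\big)\geq 0$, taking the expectation over $x$ gives $\operatorname{avg}\big(E_{out}(g_t)\big)\geq E_{out}(G)$. The blend is at least as good as the average individual hypothesis, and the gap is exactly the spread of the $g_t$ around their consensus $G$.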

Linear Blending

Classification: $G(x)=\operatorname{sign}\Big(\sum_{t=1}^T\alpha_t\cdot g_t(x)\Big) \quad \text{with} \quad \alpha_t \geq 0$

Regression: $G(x)=\sum_{t=1}^T\alpha_t\cdot g_t(x) \quad \text{with} \quad \alpha_t \geq 0$

How to choose $\alpha$? Pick the $\alpha$ that minimizes $E_{in}$: $$\min_{\alpha_t\geq0}\frac{1}{N}\sum_{n=1}^N\operatorname{err}\Big(y_n,\sum_{t=1}^T\alpha_tg_t(x_n)\Big)$$

So linear blending = linear model + hypotheses as transform + constraints.

  Given $g_1^-$, $g_2^-$, ..., $g_T^-$ from $D_{train}$, transform each $(x_n, y_n)$ in $D_{val}$ to $(z_n=\Phi^-(x_n),y_n)$, where $\Phi^-(x)=(g_1^-(x),\ldots,g_T^-(x))$. Then:

  1. compute $\alpha = \text{LinearModel}\Big(\{(z_n,y_n)\}\Big)$
  2. return $G_{LINB}(x)=\text{LinearHypothesis}_\alpha(\Phi(x))$, where $\Phi(x)=(g_1(x),\ldots,g_T(x))$
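The two steps above can be sketched for squared-error regression as follows. This is a minimal illustration, not the course's implementation: `fit_alpha` stands in for LinearModel and uses projected gradient descent to respect $\alpha_t \geq 0$. The function names, the step size, the iteration count, and the toy hypotheses are all assumptions chosen for the demo.

```python
from typing import Callable, List, Sequence, Tuple

Hypothesis = Callable[[float], float]


def fit_alpha(g_minus: Sequence[Hypothesis],
              val_data: Sequence[Tuple[float, float]],
              steps: int = 500, lr: float = 0.05) -> List[float]:
    """Minimize mean squared error over alpha >= 0 by projected gradient descent."""
    # Transform each validation input: z_n = Phi^-(x_n) = (g_1^-(x_n), ..., g_T^-(x_n)).
    z = [[g(x) for g in g_minus] for x, _ in val_data]
    y = [y_n for _, y_n in val_data]
    n, t_count = len(y), len(g_minus)
    alpha = [0.0] * t_count
    for _ in range(steps):
        residuals = [sum(a * z_nt for a, z_nt in zip(alpha, z_n)) - y_n
                     for z_n, y_n in zip(z, y)]
        grad = [2.0 / n * sum(r * z_n[t] for r, z_n in zip(residuals, z))
                for t in range(t_count)]
        # Gradient step, then projection onto the constraint alpha_t >= 0.
        alpha = [max(0.0, a - lr * g) for a, g in zip(alpha, grad)]
    return alpha


def linear_blend(g_minus: Sequence[Hypothesis],
                 g_full: Sequence[Hypothesis],
                 val_data: Sequence[Tuple[float, float]]):
    """Fit alpha on the validation set with g^-, then blend the hypotheses in g_full."""
    alpha = fit_alpha(g_minus, val_data)

    def G(x: float) -> float:
        return sum(a * g(x) for a, g in zip(alpha, g_full))

    return alpha, G


# Toy demo: the same hypothetical hypotheses stand in for both g^- and g.
hypotheses = [lambda x: x, lambda x: 1.0, lambda x: -x]
val_data = [(float(x), 2.0 * x + 1.0) for x in range(-2, 3)]
alpha, G_linb = linear_blend(hypotheses, hypotheses, val_data)
```

In this toy setup the validation targets follow $y = 2x + 1$, so the fit puts weight on the first two hypotheses and the constraint keeps the weight of the third ($-x$) at zero. The step size only suits this small example; a real problem would more likely use a dedicated least-squares or non-negative least-squares solver.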

Bootstrap Aggregation (Bagging)

Bootstrap sample $\widetilde{D}_t$: resample $N$ examples from $D$ uniformly with replacement; an arbitrary $N'$ can also be used instead of $N$.

Bootstrap aggregation:

  Consider a physical iterative process that, for $t=1,2,\ldots,T$:

  1. requests size-$N'$ data $\widetilde{D}_t$ by bootstrapping;
  2. obtains $g_t$ by $\mathcal{A}(\widetilde{D}_t)$.

  Then $G=\text{Uniform}(\{g_t\})$.
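The bagging loop above can be sketched as follows. The base algorithm $\mathcal{A}$ is not fixed by the notes, so `train_stump` (a 1-D decision stump learner) is a hypothetical stand-in, and the function names, seed handling, and toy data are likewise assumptions for the example.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Example = Tuple[float, int]          # (x, y) with y in {-1, +1}
Hypothesis = Callable[[float], int]


def bootstrap_sample(data: Sequence[Example], n_prime: int,
                     rng: random.Random) -> List[Example]:
    """Draw n_prime examples from data uniformly with replacement."""
    return [rng.choice(data) for _ in range(n_prime)]


def train_stump(data: Sequence[Example]) -> Hypothesis:
    """Hypothetical base algorithm A: the 1-D stump with the fewest training errors."""
    xs = sorted({x for x, _ in data})
    thresholds = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for theta in thresholds:
        for s in (1, -1):
            errors = sum(1 for x, y in data if (s if x > theta else -s) != y)
            if best is None or errors < best[0]:
                best = (errors, s, theta)
    _, s, theta = best
    return lambda x: s if x > theta else -s


def bagging(data: Sequence[Example],
            algorithm: Callable[[Sequence[Example]], Hypothesis],
            num_rounds: int, n_prime: Optional[int] = None,
            seed: int = 0) -> Hypothesis:
    """Train one hypothesis per bootstrap sample, then blend them by uniform voting."""
    rng = random.Random(seed)
    n_prime = len(data) if n_prime is None else n_prime
    gs = [algorithm(bootstrap_sample(data, n_prime, rng)) for _ in range(num_rounds)]

    def G(x: float) -> int:
        # Uniform vote; a tie (possible when num_rounds is even) goes to +1.
        return 1 if sum(g(x) for g in gs) >= 0 else -1

    return G


# Toy demo: labels follow the sign of x on ten 1-D points.
data = [(float(x), 1 if x > 0 else -1) for x in range(-5, 6) if x != 0]
G_bag = bagging(data, train_stump, num_rounds=25)
```

Using an odd `num_rounds` avoids ties in the vote. Passing a smaller `n_prime` gives each round a smaller bootstrap sample, as the notes allow.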

Adaptive Boosting (AdaBoost) Algorithm

Decision Tree

Random Forest

$$\text{RF} = \text{bagging} + \text{random-subspace C}\&\text{RT}$$

