Comparing Differently Trained Models

At the end of the previous post, we mentioned that the solution found by L-BFGS made different errors than the model we trained with SGD and momentum. This raises two questions: which solution is better in terms of generalization, and if the models focus on different aspects of the data, how do those aspects differ between the methods?

To keep the analysis simple but still somewhat realistic, we train a linear SVM classifier (W, bias) for the "werewolf" theme. In other words, all movies with that theme are labeled +1, and we sample random movies for the "rest", labeled -1. For the features, we use the 1,500 most frequent keywords. All random seeds were fixed, which means both models start from the same point.
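As a sketch, the data setup might look like the following; the sample sizes and keyword sparsity are hypothetical, since the post does not state them:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed: both models start from the same point

n_keywords = 1500        # the 1,500 most frequent keywords, used as binary features
n_pos, n_neg = 200, 200  # hypothetical sample sizes

# Hypothetical binary keyword-presence features for each movie.
X_pos = (rng.random((n_pos, n_keywords)) < 0.05).astype(float)  # "werewolf" movies
X_neg = (rng.random((n_neg, n_keywords)) < 0.05).astype(float)  # random "rest" movies

X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])  # +1 = werewolf, -1 = rest
```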

In our first experiment, we only care about minimizing the training error. The SGD method (I) uses standard momentum and an L1 penalty of 0.0005 in combination with mini-batches; the learning rate and momentum were kept fixed. The L-BFGS method (II) minimizes the same loss function. Both methods reached an accuracy of 100% on the training data, and training was stopped as soon as the error reached zero.
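A minimal sketch of the two procedures, assuming the loss is the standard hinge loss plus the L1 penalty (the post does not spell out the exact formulation); for brevity, `sgd_momentum` uses full batches instead of mini-batches, and the L-BFGS step delegates to SciPy:

```python
import numpy as np
from scipy.optimize import minimize

def hinge_loss(params, X, y, l1=0.0005):
    """Mean hinge loss plus an L1 penalty on the weights (not the bias)."""
    w, b = params[:-1], params[-1]
    margins = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return margins.mean() + l1 * np.abs(w).sum()

def hinge_grad(params, X, y, l1=0.0005):
    """Subgradient of hinge_loss with respect to (w, b)."""
    w, b = params[:-1], params[-1]
    active = (y * (X @ w + b)) < 1.0  # samples violating the margin
    gw = -(X[active].T @ y[active]) / len(y) + l1 * np.sign(w)
    gb = -y[active].sum() / len(y)
    return np.concatenate([gw, [gb]])

# (I) Gradient descent with a fixed learning rate and standard momentum.
def sgd_momentum(X, y, lr=0.1, mu=0.9, steps=500):
    params = np.zeros(X.shape[1] + 1)  # same starting point as L-BFGS
    v = np.zeros_like(params)
    for _ in range(steps):
        v = mu * v - lr * hinge_grad(params, X, y)
        params = params + v
    return params

# (II) L-BFGS on the same loss, starting from the same point.
def lbfgs(X, y):
    p0 = np.zeros(X.shape[1] + 1)
    res = minimize(hinge_loss, p0, jac=hinge_grad, args=(X, y), method="L-BFGS-B")
    return res.x
```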

(I) loss=1.64000, ||W||=3.56, bias=-0.60811 (SGD)
(II) loss=0.04711, ||W||=3.75, bias=-0.58073 (L-BFGS)

As we can see, the L2 norms of the final weight vectors are similar, and so are the biases. But we care less about absolute norms than about the correlation between the two solutions. For that reason, we converted both weight vectors W to unit norm and computed the cosine similarity: correlation = W_sgd.T * W_lbfgs = 0.977.
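The unit-norm comparison boils down to a cosine similarity, which can be sketched as:

```python
import numpy as np

def cosine_similarity(w1, w2):
    # Normalize both weight vectors to unit norm, then take the dot product.
    w1 = w1 / np.linalg.norm(w1)
    w2 = w2 / np.linalg.norm(w2)
    return float(w1 @ w2)
```

A value close to 1 means the two weight vectors point in almost the same direction, as with the 0.977 above.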

Since we do not have any empirical data for such correlations, we analyzed the magnitude of the features in the weight vectors. More precisely, we looked at the top-5 most important features:

(I) werewolf=0.6652, vampire=0.2394, creature=0.1886, forbidden-love=0.1392, teenagers=0.1372
(II) werewolf=0.6698, vampire=0.2119, monster=0.1531, creature=0.1511, teenagers=0.1279
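Rankings like the one above can be produced by sorting features by weight magnitude; `top_k_features` is a hypothetical helper, not code from the post:

```python
import numpy as np

def top_k_features(w, feature_names, k=12):
    """Return the k (name, weight) pairs with the largest absolute weights."""
    idx = np.argsort(-np.abs(w))[:k]  # indices sorted by |weight|, descending
    return [(feature_names[i], round(float(w[i]), 4)) for i in idx]
```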

If we also consider the top-12 features of both models, which are pretty similar,

(I) werewolf, vampire, creature, forbidden-love, teenagers, monster, pregnancy, undead, curse, supernatural, mansion, bloodsucker
(II) werewolf, vampire, monster, creature, teenagers, curse, forbidden-love, supernatural, pregnancy, hunting, undead, beast

we can see some patterns: First, many movies in the dataset combine the theme with love stories that often involve teenagers; this makes sense, because it is a very popular pattern these days. Second, vampires and werewolves are very likely to co-occur in the same movie.

Both models learned those patterns regardless of the actual optimization method, with minor differences that can be seen in the magnitudes of the individual weights in W. However, as the correlation of the parameter vectors confirmed, both solutions are pretty close together.

Bottom line: we should be careful with interpretations since the data at hand was limited, but the results nevertheless confirmed that with proper initialization and hyper-parameters, good solutions can be achieved with both 1st- and 2nd-order methods. Next, we will study the ability of the models to generalize to unseen data.

