Comparing Differently Trained Models
At the end of the previous post, we mentioned that the solution found by L-BFGS made different errors than the model we trained with SGD and momentum. This raises two questions: which solution is better in terms of generalization, and if the two methods focus on different aspects of the data, how do their solutions differ?
To keep the analysis simple, but still at least somewhat realistic, we train a linear SVM classifier (W, bias) for a "werewolf" theme. In other words, all movies with that theme are labeled +1, and we sample random movies for the "rest", which are labeled -1. For the features, we use the 1,500 most frequent keywords. All random seeds were fixed, which means both models start at the same point.
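To make the setup concrete, here is a minimal sketch of how such a dataset could be built; the toy movies, labels, and variable names are illustrative, not the original corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the real corpus: movies with the theme get +1,
# randomly sampled "rest" movies get -1.
movies = [
    "werewolf curse full-moon teenagers",   # theme movie -> +1
    "vampire werewolf forbidden-love",      # theme movie -> +1
    "heist bank getaway driver",            # "rest" movie -> -1
    "space alien crew rescue",              # "rest" movie -> -1
]
labels = np.array([1.0, 1.0, -1.0, -1.0])

# Binary indicators over the most frequent keywords (capped at 1,500 in the post).
vec = CountVectorizer(max_features=1500, binary=True,
                      token_pattern=r"[\w-]+")  # keep hyphenated keywords intact
X = vec.fit_transform(movies).toarray()
```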
In our first experiment, we only care about minimizing the training error. The SGD method (I) uses standard momentum and an L1 penalty of 0.0005, in combination with mini-batches; the learning rate and momentum were kept fixed. The L-BFGS method (II) minimizes the same loss function. Both methods reached an accuracy of 100% on the training data, and training was stopped as soon as the training error was zero.
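Below is a rough sketch of the two procedures on the same L1-regularized hinge loss. Only the L1 weight of 0.0005 comes from the post; the learning rate, momentum value, iteration count, and the full-batch SGD (the original used mini-batches) are assumptions, and feeding L-BFGS the subgradient sign(W) for the L1 term is a common shortcut rather than a proper treatment:

```python
import numpy as np
from scipy.optimize import minimize

def loss_grad(theta, X, y, l1=0.0005):
    # Hinge loss with an L1 penalty on W; returns (loss, gradient).
    W, b = theta[:-1], theta[-1]
    margins = 1.0 - y * (X @ W + b)
    active = margins > 0                        # samples inside the margin
    loss = margins[active].sum() + l1 * np.abs(W).sum()
    gW = -(y[active, None] * X[active]).sum(axis=0) + l1 * np.sign(W)
    gb = -y[active].sum()
    return loss, np.append(gW, gb)

rng = np.random.RandomState(0)
theta0 = 0.01 * rng.randn(X.shape[1] + 1)       # same starting point for both

# (I) SGD with fixed learning rate and momentum (assumed values)
theta, velocity = theta0.copy(), np.zeros_like(theta0)
for _ in range(2000):
    _, g = loss_grad(theta, X, labels)
    velocity = 0.9 * velocity - 0.01 * g
    theta += velocity

# (II) L-BFGS on the identical objective
res = minimize(loss_grad, theta0, args=(X, labels), jac=True, method="L-BFGS-B")
```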
(I)  SGD:    loss=1.64000, ||W||=3.56, bias=-0.60811
(II) L-BFGS: loss=0.04711, ||W||=3.75, bias=-0.58073
As we can see, the L2 norms of the final weight vectors are similar, as are the biases. Of course, we do not care about absolute norms but rather about the correlation of the two solutions. For that reason, we converted both weight vectors W to unit norm and computed the cosine similarity: correlation = W_sgd.T * W_lbfgs = 0.977.
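In code, this boils down to normalizing both weight vectors and taking their dot product; a sketch, reusing theta and res from the training snippet above:

```python
def cosine(a, b):
    # Cosine similarity: dot product of the two vectors scaled to unit norm.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

W_sgd, W_lbfgs = theta[:-1], res.x[:-1]   # drop the bias term
print(cosine(W_sgd, W_lbfgs))             # the post reports 0.977 on its data
```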
Since we do not have any empirical reference values for such correlations, we also analyzed the magnitudes of individual features in the weight vectors. More precisely, the top-5 most important features:
(I)  werewolf=0.6652, vampire=0.2394, creature=0.1886, forbidden-love=0.1392, teenagers=0.1372
(II) werewolf=0.6698, vampire=0.2119, monster=0.1531, creature=0.1511, teenagers=0.1279
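Reading off these features is just a matter of sorting the weights by magnitude and mapping the indices back to keyword names; a sketch, assuming the vectorizer and weights from the snippets above:

```python
names = np.array(vec.get_feature_names_out())   # keyword for each column of X
w = W_sgd / np.linalg.norm(W_sgd)               # unit-norm weights, as in the post
top = np.argsort(-np.abs(w))[:5]                # indices of the 5 largest magnitudes
for name, weight in zip(names[top], w[top]):
    print(f"{name}={weight:.4f}")
```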
If we also consider the top-12 features of both models, they turn out to be quite similar as well:
(I)  werewolf, vampire, creature, forbidden-love, teenagers, monster, pregnancy, undead, curse, supernatural, mansion, bloodsucker
(II) werewolf, vampire, monster, creature, teenagers, curse, forbidden-love, supernatural, pregnancy, hunting, undead, beast
Two patterns stand out here. First, a lot of the movies in the dataset seem to combine the theme with love stories that may involve teenagers, which makes sense because this is a very popular pattern these days. Second, vampires and werewolves are very likely to co-occur in the same movie.
Both models learned these patterns regardless of the actual optimization method, with minor differences that show up in the magnitudes of the individual weights in W. However, as the correlation of the parameter vectors confirmed, the two solutions are pretty close together.
Bottom line: we should be careful with interpretations since the data at hand was limited, but the results nevertheless confirmed that, with proper initialization and hyper-parameters, good solutions can be achieved with both first-order and second-order methods. Next, we will study the ability of the models to generalize to unseen data.