Comparing Differently Trained Models
At the end of the previous post, we mentioned that the solution found by L-BFGS made different errors than the model we trained with SGD and momentum. This raises two questions: which solution generalizes better, and if the two methods focus on different aspects of the data, how do their solutions differ?
To keep the analysis simple but still at least somewhat realistic, we train a linear SVM classifier (W, bias) for a "werewolf" theme. In other words, all movies with that theme are labeled +1, and we sample random movies for the "rest", which are labeled -1. As features, we use the 1,500 most frequent keywords. All random seeds were fixed, which means both models start at the same point.
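As a rough illustration, the data setup could look like the following minimal sketch. The containers `movies` (a mapping from movie title to its keyword list) and `theme_movies` (the set of werewolf titles) are hypothetical placeholders, not part of the original post:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

rng = np.random.default_rng(0)                      # fixed seed, as in the post

pos = sorted(theme_movies)                          # all "werewolf" movies -> +1
rest = [m for m in movies if m not in theme_movies]
neg = list(rng.choice(rest, size=len(pos), replace=False))  # random movies -> -1

docs = [" ".join(movies[m]) for m in pos + neg]     # one "document" per movie
# tokenizer=str.split keeps hyphenated keywords like "forbidden-love" intact
vec = CountVectorizer(tokenizer=str.split, token_pattern=None, max_features=1500)
X = vec.fit_transform(docs).toarray().astype(float) # 1,500 most frequent keywords
y = np.array([1.0] * len(pos) + [-1.0] * len(neg))
vocab = vec.get_feature_names_out()
```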
In our first experiment, we only care about minimizing the training error. The SGD method (I) uses standard momentum and an L1 penalty of 0.0005 in combination with mini-batches; the learning rate and momentum were kept fixed. The L-BFGS method (II) minimizes the same loss function. Both methods reached an accuracy of 100% on the training data, and training was stopped as soon as the error was zero.
(I) SGD: loss=1.64000, ||W||=3.56, bias=-0.60811
(II) L-BFGS: loss=0.04711, ||W||=3.75, bias=-0.58073
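For reference, here is a minimal sketch of this training setup, continuing from the data sketch above. The learning rate, momentum value, batch count, and epoch count are assumptions, since the post does not state them:

```python
from scipy.optimize import minimize

n, d = X.shape
l1 = 0.0005                                     # L1 penalty from the post

def loss_grad(params, idx):
    """Hinge loss + L1 penalty and a (sub)gradient, on the rows in idx."""
    w, b = params[:-1], params[-1]
    Xb, yb = X[idx], y[idx]
    margins = yb * (Xb @ w + b)
    active = margins < 1.0                      # samples violating the margin
    loss = np.maximum(0.0, 1.0 - margins).sum() + l1 * np.abs(w).sum()
    gw = -(Xb[active].T @ yb[active]) + l1 * np.sign(w)
    gb = -yb[active].sum()
    return loss, np.append(gw, gb)

# (I) mini-batch SGD with classical momentum, fixed learning rate
params, v = np.zeros(d + 1), np.zeros(d + 1)
lr, mom = 0.01, 0.9                             # assumed values
for epoch in range(200):
    for batch in np.array_split(rng.permutation(n), 10):
        _, g = loss_grad(params, batch)
        v = mom * v - lr * g
        params = params + v
w_sgd, b_sgd = params[:-1], params[-1]

# (II) L-BFGS on the full batch; the L1 term is non-smooth, so this is
# only an approximation, but scipy's implementation usually copes in practice
res = minimize(lambda p: loss_grad(p, np.arange(n)), np.zeros(d + 1),
               jac=True, method="L-BFGS-B")
w_lbfgs, b_lbfgs = res.x[:-1], res.x[-1]
```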
As the numbers above show, the L2 norms of the final weight vectors are similar, as are the biases. But of course we do not care about absolute norms; what matters is the correlation of the two solutions. We therefore converted both weight vectors W to unit norm and computed their cosine similarity: correlation = W_sgd^T · W_lbfgs = 0.977.
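With the weight vectors from the training sketch above, this check is only a few lines:

```python
# Cosine similarity of the two unit-normalized weight vectors
w1 = w_sgd / np.linalg.norm(w_sgd)
w2 = w_lbfgs / np.linalg.norm(w_lbfgs)
print("correlation =", float(w1 @ w2))   # the post reports 0.977
```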
Since we have no empirical baseline for such correlations, we also analyzed the magnitudes of individual features in the weight vectors. More precisely, the top-5 most important features:
(I) werewolf=0.6652, vampire=0.2394, creature=0.1886, forbidden-love=0.1392, teenagers=0.1372
(II) werewolf=0.6698, vampire=0.2119, monster=0.1531, creature=0.1511, teenagers=0.1279
The top-12 features of both models are also quite similar:
(I) werewolf, vampire, creature, forbidden-love, teenagers, monster, pregnancy, undead, curse, supernatural, mansion, bloodsucker
(II) werewolf, vampire, monster, creature, teenagers, curse, forbidden-love, supernatural, pregnancy, hunting, undead, beast
Two patterns stand out here: first, many movies in the dataset combine the theme with love stories that often involve teenagers, which makes sense given how popular this combination currently is; second, vampires and werewolves are very likely to co-occur in the same movie.
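Extracting such rankings is straightforward; a small sketch, reusing w_sgd, w_lbfgs, and the vocab array from the sketches above:

```python
# Rank features by weight magnitude; `vocab` comes from the data sketch
def top_features(w, vocab, k=12):
    order = np.argsort(-np.abs(w))[:k]    # indices of the largest |weights|
    return [(vocab[i], round(float(w[i]), 4)) for i in order]

print(top_features(w_sgd, vocab))
print(top_features(w_lbfgs, vocab))
```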
Both models learned these patterns regardless of the actual optimization method, with minor differences that show up in the magnitudes of individual weights in W. However, as the correlation of the parameter vectors confirmed, the two solutions are quite close together.
Bottom line: we should be careful with interpretations, since the data at hand was limited. Nevertheless, the results confirm that with proper initialization and hyper-parameters, good solutions can be achieved with both first- and second-order methods. Next, we will study how well the models generalize to unseen data.