Comparing Differently Trained Models
Comparing Differently Trained Models
At the end of the previous post, we mentioned that the solution found by L-BFGS made different errors compared to the model we trained with SGD and momentum. So, one question is what solution is better in terms of generalization and if they focus on different aspects, how do they differ for individual methods.
To make the analysis easier, but to be at least a little realistic, we train a linear SVM classifier (W, bias) for a “werewolf” theme. In other words, all movies with that theme are marked with “+1” and we sample random movies for the ‘rest’ that are marked with -1. For the features, we use the 1,500 most frequent keywords. All random seeds were fixed which means both models start at the same “point”.
In our first experiment, we only care to minimize the errors. The SGD method (I) uses standard momentum and a L1 penalty of 0.0005 in combination with mini-batches. The learning rate and momentum was kept at a fixed value. The L-BFGS method (II) minimizes the same loss function. Both methods were able to get an accuracy of 100% for the training data and the training has been stopped as soon as the error was zero.
(I) loss=1.64000 ||W||=3.56,
bias=-0.60811 (SGD)
(II) loss=0.04711 ||W||=3.75, bias=-0.58073
(L-BFGS)
As we can see, the L2 norm of the final weight vector is
similar, also the bias, but of course we do not care for absolute norms but
rather for the correlation of both solutions. For that reason, we converted both
weight vectors W to unit-norm and determined the cosine similarity: correlation
= W_sgd.T * W_lbfgs = 0.977.
Since we do not have any empirical data for such correlations, we analyzed
the magnitude of the features in the weight vectors. More precisely the top-5
most important features:
(I) werewolf=0.6652, vampire=0.2394,
creature=0.1886, forbidden-love=0.1392, teenagers=0.1372
(II)
werewolf=0.6698, vampire=0.2119, monster=0.1531, creature=0.1511,
teenagers=0.1279
If we also consider the top-12 features of both
models, which are pretty similar,
(I) werewolf, vampire, creature,
forbidden-love, teenagers, monster, pregnancy, undead, curse, supernatural,
mansion, bloodsucker
(II) werewolf, vampire, monster, creature, teenagers,
curse, forbidden-love, supernatural, pregnancy, hunting, undead,
beast
we can see some patterns here: First, a lot of the movies in
the dataset seem to combine the theme with love stories that may involve
teenagers. This makes sense because this is actually a very popular pattern
these days and second, vampires and werewolves are very likely to co-occur in
the same movie.
Those patterns were learned by both models, regardless of the actual
optimization method but with minor differences which can be seen by considering
the magnitude of the individual weights in W. However, as the correlation of the
parameters vectors confirmed, both solutions are pretty close together.
Bottom line, we should be careful with interpretations since the data at hand
was limited, but nevertheless the results confirmed that with proper
initializations and hyper-parameters, good solutions can be both achieved with
1st and 2nd order methods. Next, we will study the ability of models to
generalize for unseen data.
Comparing Differently Trained Models的更多相关文章
- 大规模视觉识别挑战赛ILSVRC2015各团队结果和方法 Large Scale Visual Recognition Challenge 2015
Large Scale Visual Recognition Challenge 2015 (ILSVRC2015) Legend: Yellow background = winner in thi ...
- (转)The Evolved Transformer - Enhancing Transformer with Neural Architecture Search
The Evolved Transformer - Enhancing Transformer with Neural Architecture Search 2019-03-26 19:14:33 ...
- (转) Ensemble Methods for Deep Learning Neural Networks to Reduce Variance and Improve Performance
Ensemble Methods for Deep Learning Neural Networks to Reduce Variance and Improve Performance 2018-1 ...
- Keras vs. PyTorch
We strongly recommend that you pick either Keras or PyTorch. These are powerful tools that are enjoy ...
- Understanding Tensorflow using Go
原文: https://pgaleone.eu/tensorflow/go/2017/05/29/understanding-tensorflow-using-go/ Tensorflow is no ...
- How to handle Imbalanced Classification Problems in machine learning?
How to handle Imbalanced Classification Problems in machine learning? from:https://www.analyticsvidh ...
- understanding backpropagation
几个有助于加深对反向传播算法直观理解的网页,包括普通前向神经网络,卷积神经网络以及利用BP对一般性函数求导 A Visual Explanation of the Back Propagation A ...
- 论文翻译——Deep contextualized word representations
Abstract We introduce a new type of deep contextualized word representation that models both (1) com ...
- 论文翻译——Character-level Convolutional Networks for Text Classification
论文地址 Abstract Open-text semantic parsers are designed to interpret any statement in natural language ...
随机推荐
- Jq_Js_Js、Jq获取浏览器和屏幕各种高度宽度
$(document).ready(function() {alert($(window).height()); //浏览器当前窗口可视区域高度alert($(document).he ...
- Unity5.6之前版本VRTK插件基础交互
一.VR运行环境配置: 安装steam,在steam上安装SteamVR驱动. 在Unity项目中需要导入VRTool插件包(已上传服务器),里面包含两个插件一个是SteamVR插件,一个是VRTK插 ...
- Vue全家桶介绍
一直不清楚全家桶是什么玩意,上网搜了一下,才知道就是平时项目中使用的几个依赖包,下面分享一下 Vue 全家桶介绍 Vue有著名的全家桶系列,包含了vue-router(http://router.vu ...
- 基于神念TGAM的脑波小车(1)
作者声明:此博客是作者的毕设心得,拿来分享. 拿到模块,在网上查了一圈,发现基本没什么有用的资料,有也是一些废话,经过我几个月的攻克,现在已初步搞定,分享给大家. 废话不多说,直接步入正题. 这是通过 ...
- 微软职位内部推荐-Senior Software Engineer-DUT
微软近期Open的职位: Document Understanding and Task (DUT) team in STCA focuses on semantic understanding an ...
- “Linux内核分析”实验三报告
构造一个简单的Linux系统 张文俊+原创作品转载请注明出处+<Linux内核分析>MOOC课程http://mooc.study.163.com/course/USTC-10000290 ...
- C#代码分析--阅读程序,回答问题
阅读下面程序,请回答如下问题: 问题1:这个程序要找的是符合什么条件的数? 问题2:这样的数存在么?符合这一条件的最小的数是什么? 问题3:在电脑上运行这一程序,你估计多长时间才能输出第一个结果?时间 ...
- JAVA面对对象(一)——封装
1.封装思想:将对象的属性和行为封装起来的载体是类,类通常对客户隐藏其实现的细节 2.封装就是将属性私有化(private),并提供公共的方法(public)访问私有属性 3.通过封装,实现对属性数据 ...
- 使用phantomjs进行无界面UI自动化测试
PhantomJS(http://phantomjs.org/) 是一个基于WebKit的服务器端JavaScript API.它全面支持web而不需浏览器支持,其快速.原生支持各种Web标准:DOM ...
- C语言入门:01.C语言概述
一.计算机和软件常识 1.计算机运行原理 (1)硬件基本组成:硬盘.内存.CPU (2)个部件之间的运作协调(下图)