【ML】Two-Stream Convolutional Networks for Action Recognition in Videos

Two-Stream Convolutional Networks for Action Recognition in Videos

Towards Good Practices for Very Deep Two-Stream ConvNets

Note here: it's a learning note on the topic of video representations. This note incorporates two papers about popular two-stream architecture.

Link: http://arxiv.org/pdf/1406.2199v2.pdf

http://arxiv.org/pdf/1507.02159v1.pdf

Motivation: CNN has significantly boosted the performance of object recognition in still images. However, the use of it for video recognition with stacked frames doesn’t outperform the one with individual frame (work by Karpathy), which indicates traditional way of adapting CNN to video clips doesn’t capture the motion well.

Proposed Model:

In order to learn the spatio-temporal features well, this paper proposed a two-stream architecture for video recognition. It passes the spatial information (single static RGB frame) and another temporal information (optical flow of multiple frames) through the ConvNet. Then fuse the parallel outputs of two streams to form the final class score fusion.

The overall pipeline is shown below:

- ConvNet input configurations:

There are some options for the input of temporal stream. The author discussed about utilizing optical flow stacking and trajectory stacking as motional information. The former one considers displacements of each point between consecutive frames, while the latter one focuses on the displacements of every point in the initial frame throughout the entire sequences.

They also mentioned bi-directional optical flow to enhance the capacity of video representations; and mean flow subtraction to avoid the influences of camera motion.

Visualization:

The visualization of filters in this architecture is shown below.

Each column corresponds to a filter, each row – to an input channel.

As we can draw from the image, one single filter composed with half black and half white means to compute spatial derivative; and the filters in a column with black turning into white gradually means to compute temporal derivative.

With the intuition above, we can see how the two-stream architecture captures the spatio-temporal features well.

Improvements:

There is another paper named Towards Good Practices for Very Deep Two-Stream ConvNets, which improves the efficiency of two-stream model in practice.

They argue that previous two-stream model didn’t significantly outperform other hand-crafted features for the mainly two reasons: first, the network is not deep enough as VGGNet&GoogLeNet; second, the lack of plenty training data limits its performance.

Thus, they proposed some suggestions to learn a more powerful two-stream model:

- Pre-training for Two-stream ConvNets: pre-train both spatial and temporal nets on ImageNet.

- Smaller Learning Rate.

- More Data Augmentation Techniques

- High Dropout Ratio: make the training of deep network with small amount of data easier.

- Multi-GPU training.

【ML】Two-Stream Convolutional Networks for Action Recognition in Videos的更多相关文章

【CV论文阅读】Two stream convolutional Networks for action recognition in Vedios
论文的三个贡献 (1)提出了two-stream结构的CNN,由空间和时间两个维度的网络组成. (2)使用多帧的密集光流场作为训练输入,可以提取动作的信息. (3)利用了多任务训练的方法把两个数据集联 ...
【ML】ICLR2016_Delving Deeper into Convolutional Networks
ICLR2016_DELVING DEEPER INTO CONVOLUTIONAL NETWORKS Note here: Ballas recently proposed a novel fram ...
目标检测--Spatial pyramid pooling in deep convolutional networks for visual recognition(PAMI, 2015)
Spatial pyramid pooling in deep convolutional networks for visual recognition 作者: Kaiming He, Xiangy ...
Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Kaiming He, Xiangyu Zh ...
SPPNet论文翻译-空间金字塔池化Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
http://www.dengfanxin.cn/?p=403 原文地址我对物体检测的一篇重要著作SPPNet的论文的主要部分进行了翻译工作.SPPNet的初衷非常明晰,就是希望网络对输入的尺寸更加 ...
深度学习论文翻译解析（九）：Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
论文标题:Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition 标题翻译:用于视觉识别的深度卷积神 ...
【Semantic Segmentation】 Instance-sensitive Fully Convolutional Networks论文解析(转)
这篇文章比较简单,但还是不想写overview,转自: https://blog.csdn.net/zimenglan_sysu/article/details/52451098 另外,读这篇pape ...
【注意力机制】Attention Augmented Convolutional Networks
注意力机制之Attention Augmented Convolutional Networks 原始链接:https://www.yuque.com/lart/papers/aaconv 核心内容 ...
【ML】Predict and Constrain: Modeling Cardinality in Deep Structured Prediction -预测和约束：在深度结构化预测中建模基数
[论文标题]Predict and Constrain: Modeling Cardinality in Deep Structured Prediction (35th-ICML,PMLR) [ ...

随机推荐

11LaTeX学习系列之---LaTeX的特殊字符
目录目录前言 (一)源代码 (二)输出效果目录本系列是有关LaTeX的学习系列,共计19篇,本章节是第11篇. 前一篇:10LaTeX学习系列之---Latex的文档结构后一篇:12LaTe ...
row_number() over() 一句话概括，以及max()函数的一种查询分组中最大值的用法
row_number() over(partition by col1 order by col2) 根据COL1分组可能会有多个组,每组组内根据COL2进行排序.每组内都有自动生成的序号,从1开始, ...
java死锁示例及其发现方法
在java多线程编程中很容易出现死锁,死锁就是多个线程相互之间永久性的等待对方释放锁,这和数据库多个会话之间的死锁类似.下面的代码示例了一个最简单的死锁的例子,线程1和线程2相互之间等待对方释放锁来取 ...
IE浏览器打不开网页的解决方法
前阵子一下子安装了很多软件,后来使用IE游览器的时候,莫名其妙的打不开网页,虽然用其他浏览器(比如谷歌.火狐)可以正常浏览网页,但是由于很多软件内嵌页面都会调用Windows的IE浏览器来加载,所以I ...
Python解析器
当我们编写Python代码时,我们得到的是一个包含Python代码的以.py为扩展名的文本文件.要运行代码,就需要Python解释器去执行.py文件. 由于整个Python语言从规范到解释器都是开源的 ...
(15)Python时间
Jenkins临时空间不足处理办法
环境: Jenkins版本 jenkins-2.89.4Jenkins 主从都在一台主机os版本 redhat7.2 使用yum的方式安装jenkins 发现在7.2上安装,剩余临时空间很小,通过登陆 ...
理解Selection对象
理解Selection对象 Selection对象的属性如下: var selection = window.getSelection(); console.log(selection); 通过上面的 ...
栈(stack)信息
栈在JVM虚拟机中是线程的一块私有空间,比如存储函数的调用信息.局部变量等特性先进后出和后进先出即FIFO 借用网络的一个图,感觉看完就可以了解了最先调用的函数压入栈低,最后压入得函数在栈顶,函 ...
ubuntu 系统升级 cmake
由于Ubuntu14.04的cmake版本为2.8.x,而如果需要cmake3.x版本时,无法生成makefile,有两种方法可以安装cmake3.4.1: 方法1: sudo apt-get ins ...

【ML】Two-Stream Convolutional Networks for Action Recognition in Videos

【ML】Two-Stream Convolutional Networks for Action Recognition in Videos的更多相关文章

随机推荐

热门专题