Two-Stream Convolutional Networks for Action Recognition in Videos

&

Towards Good Practices for Very Deep Two-Stream ConvNets

Note: this is a learning note on video representations. It covers two papers on the popular two-stream architecture.

Link: http://arxiv.org/pdf/1406.2199v2.pdf

http://arxiv.org/pdf/1507.02159v1.pdf

Motivation: CNNs have significantly boosted object recognition performance on still images. However, applying a CNN to stacked video frames does not outperform applying it to individual frames (work by Karpathy et al.), which indicates that this conventional way of adapting CNNs to video clips does not capture motion well.

Proposed Model:

To learn spatio-temporal features well, the first paper proposes a two-stream architecture for video recognition. The spatial stream takes a single static RGB frame, and the temporal stream takes the optical flow computed over multiple frames; each stream is its own ConvNet. The class scores of the two parallel streams are then fused to give the final prediction.
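The fusion step can be sketched in a few lines. This is only a minimal illustration that assumes fusion by averaging the softmax scores of the two streams; the function names, the `temporal_weight` parameter and the toy logits are made up for this note and are not taken from the papers.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def fuse_by_averaging(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Weighted average of the per-stream softmax scores (hypothetical helper)."""
    spatial_scores = softmax(np.asarray(spatial_logits, dtype=float))
    temporal_scores = softmax(np.asarray(temporal_logits, dtype=float))
    return (1.0 - temporal_weight) * spatial_scores + temporal_weight * temporal_scores

# Toy logits for 3 action classes (illustrative values only).
spatial_logits = np.array([2.0, 1.0, 0.1])   # spatial stream prefers class 0
temporal_logits = np.array([0.5, 2.5, 0.3])  # temporal stream prefers class 1

fused = fuse_by_averaging(spatial_logits, temporal_logits)
predicted_class = int(np.argmax(fused))
```

In this toy case the temporal stream is more confident than the spatial one, so the fused prediction follows it. A learned classifier on top of the stacked scores would be another way to fuse.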

The overall pipeline is shown below:

-      ConvNet input configurations:

There are several options for the input of the temporal stream. The authors discuss optical flow stacking and trajectory stacking as motion representations. The former stacks the displacement at each pixel location between consecutive frames, so every channel is sampled at the same location; the latter follows each point of the initial frame along its motion trajectory through the sequence and records the displacements along that path.
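To make the difference concrete, here is a rough sketch of both stacking schemes. It assumes the flow is given as an array of shape (L, H, W, 2) holding (dx, dy) per pixel, and it uses nearest-neighbour sampling with clipping at the image border, which is a simplification chosen for this note. The function names are hypothetical.

```python
import numpy as np

def optical_flow_stacking(flows):
    """Stack L flow fields at fixed pixel locations.

    flows: (L, H, W, 2) with (dx, dy) per pixel.
    Returns (H, W, 2L): channel 2k is dx of frame k, channel 2k+1 is dy.
    """
    L, H, W, _ = flows.shape
    return flows.transpose(1, 2, 0, 3).reshape(H, W, 2 * L)

def trajectory_stacking(flows):
    """Stack L flow fields sampled along the trajectory of each start pixel."""
    L, H, W, _ = flows.shape
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pos_r = rows.astype(float)  # current row of each tracked point
    pos_c = cols.astype(float)  # current column of each tracked point
    out = np.empty((H, W, 2 * L), dtype=flows.dtype)
    for k in range(L):
        # Nearest-neighbour lookup, clipped to the image (a simplification).
        r = np.clip(np.rint(pos_r), 0, H - 1).astype(int)
        c = np.clip(np.rint(pos_c), 0, W - 1).astype(int)
        d = flows[k, r, c]                  # (H, W, 2) displacement at tracked points
        out[..., 2 * k:2 * k + 2] = d
        pos_c += d[..., 0]                  # move the points by dx
        pos_r += d[..., 1]                  # move the points by dy
    return out

# Toy example: 2 flow fields on a 1x4 image.
flows = np.zeros((2, 1, 4, 2))
flows[0, ..., 0] = 1.0                # frame 0: every pixel moves 1 px to the right
flows[1, 0, :, 0] = np.arange(4)      # frame 1: dx equals the column index

stacked = optical_flow_stacking(flows)
tracked = trajectory_stacking(flows)
```

For the pixel in column 0, flow stacking reads the second field at column 0 (dx = 0), while trajectory stacking reads it at column 1, where the point has moved to (dx = 1).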

They also mention bi-directional optical flow to enrich the video representation, and mean flow subtraction to reduce the influence of camera motion.
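Mean flow subtraction is simple enough to show directly. The sketch below assumes the same (L, H, W, 2) flow layout as an illustration and subtracts the mean displacement vector from each flow field; the function name and the toy values are made up for this note.

```python
import numpy as np

def subtract_mean_flow(flows):
    """Subtract the mean (dx, dy) vector from every flow field.

    flows: (L, H, W, 2). A global camera pan adds roughly the same vector to
    every pixel, so removing the per-field mean reduces its influence.
    """
    mean_vec = flows.mean(axis=(1, 2), keepdims=True)  # shape (L, 1, 1, 2)
    return flows - mean_vec

# Toy example: one 2x2 flow field with a camera pan of (3, -1)
# plus extra local motion of (4, 0) at the top-left pixel.
flows = np.zeros((1, 2, 2, 2))
flows[...] = np.array([3.0, -1.0])
flows[0, 0, 0] += np.array([4.0, 0.0])

centered = subtract_mean_flow(flows)
```

After centering, the moving pixel stands out against the background, and each field has zero mean displacement.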

Visualization:

The visualization of filters in this architecture is shown below.

Each column corresponds to a filter, each row – to an input channel.

Two patterns can be read from the image. A filter that is half black and half white within a single input channel computes a spatial derivative of the flow. A filter whose weights change gradually from black to white down a column (across input channels, that is, across time) computes a temporal derivative.

With this intuition, we can see how the two-stream architecture captures spatio-temporal features.
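The two filter patterns can be mimicked with hand-written weights. This is only a toy sketch, assuming that black stands for a negative weight and white for a positive one, and using a 1-D spatial axis for brevity. The helper `apply_filter` and all values are hypothetical.

```python
import numpy as np

def apply_filter(stack, weights):
    """Slide a filter along the spatial axis of a flow stack.

    stack:   (L, W)  rows are input channels (time), columns are positions.
    weights: (L, K)  one filter with the same number of input channels.
    Returns the W - K + 1 filter responses.
    """
    L, W = stack.shape
    K = weights.shape[1]
    return np.array([np.sum(stack[:, i:i + K] * weights) for i in range(W - K + 1)])

# Horizontal displacement over 3 frames on a row of 4 pixels.
stack = np.array([[0.0, 0.0, 2.0, 2.0],
                  [1.0, 1.0, 3.0, 3.0],
                  [2.0, 2.0, 4.0, 4.0]])

# "Half black, half white" inside one channel: a spatial finite difference.
spatial_filter = np.array([[-1.0, 1.0],
                           [0.0, 0.0],
                           [0.0, 0.0]])

# Weights going from negative to positive across channels: a temporal difference.
temporal_filter = np.array([[-1.0],
                            [0.0],
                            [1.0]])

spatial_response = apply_filter(stack, spatial_filter)    # motion boundary in frame 0
temporal_response = apply_filter(stack, temporal_filter)  # change of motion over time
```

The spatial filter fires only at the motion boundary between columns 1 and 2, while the temporal filter responds everywhere because the motion grows over time at every position.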

Improvements:

The second paper, Towards Good Practices for Very Deep Two-Stream ConvNets, shows how to train a stronger two-stream model in practice.

The authors argue that the previous two-stream model did not significantly outperform hand-crafted features for two main reasons: first, the network is not as deep as VGGNet or GoogLeNet; second, the lack of sufficient training data limits its performance.

They therefore propose several practices for training a more powerful two-stream model:

-      Pre-training for Two-stream ConvNets: pre-train both spatial and temporal nets on ImageNet.

-      Smaller Learning Rate.

-      More Data Augmentation Techniques

-      High Dropout Ratio: makes it easier to train a deep network on a small amount of data.

-      Multi-GPU training.
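Pre-training the temporal net on ImageNet raises a practical question: ImageNet filters expect 3 RGB channels, while the flow input has 2L channels. One possible way to bridge this, sketched here as an assumption rather than as the exact recipe, is to average the first-layer RGB filters over the channel axis and copy the result to every flow channel. The function name and shapes below are illustrative.

```python
import numpy as np

def rgb_to_flow_filters(rgb_weights, num_flow_channels):
    """Adapt first-layer RGB filters to a flow input (hypothetical helper).

    rgb_weights: (out_channels, 3, kH, kW) filters from an ImageNet model.
    Returns (out_channels, num_flow_channels, kH, kW), where every flow
    channel receives the RGB-averaged filter.
    """
    mean_filter = rgb_weights.mean(axis=1, keepdims=True)     # (out, 1, kH, kW)
    return np.repeat(mean_filter, num_flow_channels, axis=1)

# Toy first layer: 4 filters of size 3x3 on RGB, adapted to 2L = 20 flow channels.
rng = np.random.default_rng(0)
rgb_weights = rng.normal(size=(4, 3, 3, 3))
flow_weights = rgb_to_flow_filters(rgb_weights, num_flow_channels=20)
```

Only the first layer needs this treatment; the deeper layers do not depend on the number of input channels and can be copied as they are.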
