Learning Temporal Embeddings for Complex Video Analysis

Note: this is a review note on new work from Fei-Fei Li's group on video representations, published at ICCV 2015.

Link: http://www.cv-foundation.org/openaccess/content_iccv_2015/html/Ramanathan_Learning_Temporal_Embeddings_ICCV_2015_paper.html

Motivation:

- Labeled video data is too scarce for learning video representations, so an unsupervised approach is needed.

- Context (temporal structure) is important for video representations.

Proposed model:

- Given a query frame, the model predicts the representation (embedding) of its context.

- Pipeline:

\(f_{vj}=f(S_{vj};W_{e})\): embedding function

(\(W_{e}\) is the only parameter that needs to be trained)

- Training:

\(h_{vj}=\frac{1}{2T}\sum_{t=1}^{T}(f_{v,j+t}+f_{v,j-t})\): context vector
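
The context vector above is simply the mean of the embeddings of the T frames before and the T frames after frame j. Here is a minimal sketch in plain Python, assuming the frame embeddings of one video are already available as a list of vectors. The function name `context_vector` and the toy numbers are illustrative, not taken from the paper.

```python
from typing import List

Vector = List[float]


def context_vector(embeddings: List[Vector], j: int, T: int) -> Vector:
    """Average the embeddings of the T frames before and after frame j.

    Assumes the whole window lies inside the video (T <= j < len - T);
    how the paper handles video boundaries is not covered by this note.
    """
    if T < 1:
        raise ValueError("T must be at least 1")
    if j - T < 0 or j + T >= len(embeddings):
        raise IndexError("context window falls outside the video")
    dim = len(embeddings[j])
    h = [0.0] * dim
    for t in range(1, T + 1):
        # Add the frame t steps ahead and the frame t steps behind.
        for neighbor in (embeddings[j + t], embeddings[j - t]):
            for d in range(dim):
                h[d] += neighbor[d]
    # 2T frames were summed, so divide by 2T.
    return [x / (2 * T) for x in h]


# Toy video: five frames with made-up 2-D embeddings.
frames = [[0.0, 0.0], [1.0, 0.0], [9.0, 9.0], [3.0, 2.0], [4.0, 2.0]]
h = context_vector(frames, j=2, T=2)
print(h)  # [2.0, 1.0]
```

Note that the embedding of the query frame itself (index 2 here) does not enter the average.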

Unsupervised learning objective (SVM-style hinge loss):

\(J(W_{e})=\sum_{v\in V}\sum_{S_{vj}\in v,\;S_{-}\neq S_{vj}}\max(0,\,1-(f_{vj}-f_{-})\cdot h_{vj})\)

(\(f_{vj}\) is the embedding of frame \(S_{vj}\))

(\(f_{-}\) is the embedding of a negative frame \(S_{-}\) that is not closely related to \(S_{vj}\))

(\(h_{vj}\) is the context embedding of frame \(S_{vj}\))
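
To make the objective concrete, the sketch below evaluates the hinge term for one frame against a few negatives. It is a toy illustration in plain Python with hand-picked vectors. The helper names (`hinge_term`, `frame_loss`) are hypothetical, and no gradient step is shown.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def hinge_term(f_pos: Vector, f_neg: Vector, h: Vector) -> float:
    """max(0, 1 - (f_pos - f_neg) . h) for one (frame, negative) pair."""
    diff = [p - n for p, n in zip(f_pos, f_neg)]
    return max(0.0, 1.0 - dot(diff, h))


def frame_loss(f_pos: Vector, negatives: List[Vector], h: Vector) -> float:
    """Sum of hinge terms over all negatives sampled for one frame."""
    return sum(hinge_term(f_pos, f_neg, h) for f_neg in negatives)


h = [1.0, 0.0]              # context vector of the query frame
f_pos = [2.0, 0.0]          # embedding of the query frame
easy_negative = [0.0, 1.0]  # score gap 2.0 >= margin 1, contributes 0
hard_negative = [1.5, 0.0]  # score gap 0.5 < margin 1, contributes 0.5

loss = frame_loss(f_pos, [easy_negative, hard_negative], h)
print(loss)  # 0.5
```

Only negatives whose score against the context comes within the margin of the true frame's score contribute to the loss, which is why the choice of negatives matters.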

The choice of negative frames and of the context range is discussed further below.

Intuition:

This model memorizes the context of each frame. It uses the frame's spatial appearance to form an embedding vector from which the frame's temporal context can be inferred.

Spatial feature learned by the CNN \(\xrightarrow{\;W_{e}\text{ projection}\;}\) temporal feature that embeds context

(\(W_{e}\) memorizes the temporal pattern during training)
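
As a rough illustration of the projection step, the sketch below treats \(W_{e}\) as a plain matrix applied to a CNN feature vector. This is an assumption made for illustration: the note only says that \(W_{e}\) projects the spatial feature, so the purely linear form, the function name `embed`, and all numbers here are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def embed(cnn_feature: Vector, W_e: Matrix) -> Vector:
    """Project a CNN feature with W_e (one row per output dimension)."""
    return [sum(w * x for w, x in zip(row, cnn_feature)) for row in W_e]


# Toy 3-D "CNN feature" projected into a 2-D embedding space.
W_e = [[1.0, 0.0, 1.0],
       [0.0, 2.0, 0.0]]
cnn_feature = [0.5, 1.0, 1.5]
f = embed(cnn_feature, W_e)
print(f)  # [2.0, 2.0]
```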

With this temporal structure, frames that do not look alike can still lie close together in the feature space, as long as they share similar context.

There are two takeaways from the training process:

- Multi-resolution sampling: it is hard to choose one context range (T) for all videos, because videos have different paces: some move quickly, others slowly. The paper proposes a multi-resolution sampling strategy: instead of sampling context frames at a single fixed frame gap, it samples them at several gap lengths. This is a trade-off between semantic relatedness and visual variety.

- Hard negatives: the choice of negative samples matters for a robust model. The natural approach is to sample negative frames from other videos and context frames from the same video, but this can make the model overfit to video-specific, less semantic properties such as lighting, camera characteristics, and background. To avoid this, the paper also samples negative frames from the same video that lie outside the context range.
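
The two sampling ideas above can be sketched on frame indices alone. This is a simplified illustration, not the paper's exact procedure: the gap values, the rule that the context radius equals T times the largest gap, and the function names are all assumptions.

```python
import random
from typing import Dict, List, Sequence


def multi_resolution_context(j: int, num_frames: int, T: int,
                             gaps: Sequence[int]) -> Dict[int, List[int]]:
    """For each gap g, return the indices j +/- g, ..., j +/- T*g inside the video."""
    contexts: Dict[int, List[int]] = {}
    for g in gaps:
        indices = []
        for t in range(1, T + 1):
            for k in (j - t * g, j + t * g):
                if 0 <= k < num_frames:
                    indices.append(k)
        contexts[g] = sorted(indices)
    return contexts


def hard_negative_indices(j: int, num_frames: int, radius: int) -> List[int]:
    """Frames of the same video lying outside the context range of frame j."""
    return [k for k in range(num_frames) if abs(k - j) > radius]


def sample_hard_negatives(j: int, num_frames: int, radius: int,
                          n: int, rng: random.Random) -> List[int]:
    """Draw up to n same-video negatives without replacement."""
    candidates = hard_negative_indices(j, num_frames, radius)
    return rng.sample(candidates, min(n, len(candidates)))


# Toy setup: a 41-frame video, query frame 20, T = 2, three gap lengths.
gaps = (1, 3, 5)
contexts = multi_resolution_context(j=20, num_frames=41, T=2, gaps=gaps)
radius = 2 * max(gaps)  # widest context reaches 10 frames away
candidates = hard_negative_indices(j=20, num_frames=41, radius=radius)
negatives = sample_hard_negatives(20, 41, radius, n=3, rng=random.Random(0))

print(contexts[1])  # [18, 19, 21, 22]
print(contexts[5])  # [10, 15, 25, 30]
```

Small gaps give context frames that look much like the query frame, while large gaps give more visual variety at the cost of weaker semantic relatedness. In a full pipeline these same-video negatives would be mixed with negatives drawn from other videos.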
