[Computer Vision] Resources on Action Recognition
================ Fancy divider ================ This part is from Zhihu ================
Before deep learning, the best-performing method was the Improved Dense Trajectories (IDT) + Fisher vector from the INRIA group; paper and code:
LEAR - Improved Trajectories Video Description
Basically everything from INRIA works well.
The most representative deep learning work is the two-stream network from the VGG group:
http://arxiv.org/abs/1406.2199
In practice its results are not much better than IDT's, and the numbers in the paper are widely complained about as hard to reproduce; it also took me a while of trying before I got comparable figures (a minimal sketch of the fusion step follows at the end of this section).
Many improved methods have since built on these two lines of work. The current state of the art, which is also the intuitive combination, is IDT + two-stream from Xiaoou Tang's group:
http://wanglimin.github.io/papers/WangQT_CVPR15.pdf
There is also Google's LSTM + two-stream, which was very hot a while ago and still attracts a lot of attention:
http://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/43793.pdf
And let me also plug zhongwen's paper:
http://www.cs.cmu.edu/~zhongwen/pdf/MED_CNN.pdf
In the end, you will find that every paper has to compare against IDT.
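To make the two-stream idea above concrete, here is a minimal NumPy sketch of the late-fusion step: softmax scores from the RGB (spatial) stream and the optical-flow (temporal) stream are averaged over time and combined with a weighted sum. The 25 sampled frames, the 1.0/1.5 stream weights, and the function names are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def two_stream_late_fusion(spatial_logits, temporal_logits,
                           w_spatial=1.0, w_temporal=1.5):
    """Fuse per-frame scores of the two streams by weighted averaging.

    spatial_logits:  (num_frames,   num_classes) scores from the RGB stream
    temporal_logits: (num_snippets, num_classes) scores from the flow stream
    The flow stream is often given a higher weight than the RGB stream.
    """
    spatial = softmax(spatial_logits).mean(axis=0)
    temporal = softmax(temporal_logits).mean(axis=0)
    fused = w_spatial * spatial + w_temporal * temporal
    return int(fused.argmax()), fused

# Toy usage: 25 sampled frames / flow snippets, 101 classes (UCF101-sized).
rng = np.random.default_rng(0)
pred, scores = two_stream_late_fusion(rng.normal(size=(25, 101)),
                                      rng.normal(size=(25, 101)))
print(pred)
```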
================ Fancy divider ================ This part is also from Zhihu ================
[1] Action Recognition from a Distributed Representation of Pose and Appearance, CVPR, 2010
[2] Combining Randomization and Discrimination for Fine-Grained Image Categorization, CVPR, 2011
[3] Object and Action Classification with Latent Variables, BMVC, 2011
[4] Human Action Recognition by Learning Bases of Action Attributes and Parts, ICCV, 2011
[5] Learning person-object interactions for action recognition in still images, NIPS, 2011
[6] Weakly Supervised Learning of Interactions between Humans and Objects, PAMI, 2012
[7] Discriminative Spatial Saliency for Image Classification, CVPR, 2012
[8] Expanded Parts Model for Human Attribute and Action Recognition in Still Images, CVPR, 2013
[9] Coloring Action Recognition in Still Images, IJCV, 2013
[10] Semantic Pyramids for Gender and Action Recognition, TIP, 2014
[11] Actions and Attributes from Wholes and Parts, arXiv, 2015
[12] Contextual Action Recognition with R*CNN, arXiv, 2015
[13] Recognizing Actions Through Action-Specific Person Detection, TIP, 2015
I haven't read anything from before 2010. In the years around 2010 (mainly 2011 and 2012) there were three main lines of thought: (1) use the object being interacted with as a cue (person-object interaction) and model the interaction, as in [5] and [6]; (2) build models of pose, classifying by the statistics of poses (or, more broadly, parts), as in [1] and [4] — the poselet line of work, not listed above, is also used quite often; (3) look for discriminative regions and suppress meaningless ones, as in [2] and [7]; [10] and [11] also use this idea.
[9] and [10] both exploit a feature other than SIFT, color names, and describe how to fuse multiple different features for action classification.
[12] explores how to incorporate context (since in action classification the person's bounding box is given).
The newer works all replace SIFT features with CNN features ([11], [12], [13]); as far as results go, [12] is the most recent.
Work on still images is mostly classification; detection work is relatively rare — [4] and [13] both include detection. Perhaps before 2015 classification results were not yet promising enough. Now that classification mAP on PASCAL VOC 2012 has reached 89%, attention will probably shift more toward detection.
================ Fancy divider ================ This part is from the Internet ================
[1] http://lear.inrialpes.fr/software (plenty of good material; worth browsing)
[2] Action Recognition Paper Reading
- Tian, YingLi, et al. "Hierarchical filtered motion for action recognition in crowded videos." Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on 42.3 (2012): 313-323.
- A new 3D interest point detector based on 2D Harris corners and the Motion History Image (MHI). Essentially, 2D Harris points with recent motion are selected as interest points (see the MHI sketch after this entry).
- New descriptors based on HOG over image intensity and over the MHI. Some filtering is performed to remove cluttered motion and to normalize the descriptors.
- Evaluated on the KTH and MSR Action datasets
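Since both the detector and the descriptors above build on the Motion History Image, here is a minimal sketch of MHI computation from frame differencing; the decay constant tau and the threshold are illustrative assumptions.

```python
import numpy as np

def motion_history_images(frames, tau=15.0, thresh=30):
    """Compute an MHI per frame: pixels where the frame difference exceeds
    `thresh` are set to `tau`; elsewhere the history decays by 1 per frame."""
    gray = [f.astype(np.int16) if f.ndim == 2
            else f.astype(np.int16).mean(axis=-1) for f in frames]
    mhi = np.zeros_like(gray[0], dtype=np.float32)
    out = []
    for prev, cur in zip(gray, gray[1:]):
        moving = np.abs(cur - prev) > thresh
        mhi = np.where(moving, tau, np.maximum(mhi - 1.0, 0.0)).astype(np.float32)
        out.append(mhi)
    return out

# Toy usage: 10 random 64x64 grayscale frames.
frames = [np.random.default_rng(i).integers(0, 256, (64, 64)) for i in range(10)]
print(motion_history_images(frames)[-1].max())
```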
- Yuan, Junsong, Zicheng Liu, and Ying Wu. "Discriminative subvolume search for efficient action detection." Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.
- A discriminative matching technique based on mutual information and a nearest-neighbor algorithm
- A tighter upper bound for branch-and-bound to locate the matched action that maximizes mutual information (the same branch-and-bound machinery as in the ESS sketch below)
- The key idea is to decompose the search space into its spatial and temporal parts.
- Lampert, Christoph H., Matthew B. Blaschko, and Thomas Hofmann. "Beyond sliding windows: Object localization by efficient subwindow search." Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008.
- Code online: https://sites.google.com/site/christophlampert/software (Efficient Subwindow Search)
- Reduces the complexity of sliding-window search from O(n^4) to O(n^2) on average
- Branch-and-bound technique (see the sketch after this entry)
- Relies on a bounding function that gives an upper bound on the scoring function over a set of candidate boxes
- Works well with linear classifiers and BOW features.
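Both this entry and the subvolume search above rest on the same best-first branch-and-bound idea, so here is a minimal sketch of ESS for the 2D case. It assumes the per-location weights of a linear BOW classifier have already been splatted onto a grid, so a window's score is the sum of the weights inside it; the function names and splitting details are assumptions.

```python
import heapq
import numpy as np

def ess(weights):
    """Efficient Subwindow Search: return (top, bottom, left, right) of the
    axis-aligned rectangle maximizing the sum of `weights` inside it.

    A search state is a set of rectangles, encoded as an interval for each of
    the four coordinates. Its upper bound is the positive mass of the largest
    rectangle in the set plus the negative mass of the smallest one.
    """
    H, W = weights.shape
    ipos = np.maximum(weights, 0).cumsum(0).cumsum(1)  # integral images for
    ineg = np.minimum(weights, 0).cumsum(0).cumsum(1)  # O(1) rectangle sums

    def rect_sum(ii, t, b, l, r):
        """Sum over rows t..b, cols l..r (inclusive) via the integral image."""
        s = ii[b, r]
        if t > 0: s -= ii[t - 1, r]
        if l > 0: s -= ii[b, l - 1]
        if t > 0 and l > 0: s += ii[t - 1, l - 1]
        return s

    def bound(st):
        tlo, thi, blo, bhi, llo, lhi, rlo, rhi = st
        big = rect_sum(ipos, tlo, bhi, llo, rhi)   # largest rectangle in set
        small = 0.0                                # smallest one may be empty
        if thi <= blo and lhi <= rlo:
            small = rect_sum(ineg, thi, blo, lhi, rlo)
        return big + small

    start = (0, H - 1, 0, H - 1, 0, W - 1, 0, W - 1)
    heap = [(-bound(start), start)]
    while heap:
        _, st = heapq.heappop(heap)
        sizes = [st[2 * i + 1] - st[2 * i] for i in range(4)]
        if max(sizes) == 0:                 # all intervals collapsed: done
            return st[0], st[2], st[4], st[6]
        i = int(np.argmax(sizes))           # split the widest interval in half
        lo, hi = st[2 * i], st[2 * i + 1]
        mid = (lo + hi) // 2
        for part in ((lo, mid), (mid + 1, hi)):
            child = list(st)
            child[2 * i:2 * i + 2] = part
            tlo, _, _, bhi, llo, _, _, rhi = child
            if tlo <= bhi and llo <= rhi:   # prune states with no valid box
                heapq.heappush(heap, (-bound(tuple(child)), tuple(child)))

# Toy usage: the best box should exactly cover the positive blob.
w = -0.1 * np.ones((40, 40)); w[10:20, 15:30] = 1.0
print(ess(w))  # -> (10, 19, 15, 29)
```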
- Li, Li-Jia, et al. "Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification." NIPS. Vol. 2. No. 3. 2010.
- Images are represented as a scale-invariant map of object detector responses
- Detectors are applied to novel images at multiple scales. At each scale, a 3-level spatial pyramid is applied. Responses are concatenated to form the image descriptor (see the sketch after this entry).
- 200 objects are selected from a pool of 1000 objects
- Evaluated on the scene classification task
- L1- and L1/L2-regularized logistic regression is applied to induce sparsity. For the L1/L2 group sparsity, a group is defined for each object, hence object-level sparsity; bear in mind that there are multiple entries in the descriptor for each object. (Marginal improvements.)
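A minimal sketch of how such a descriptor could be assembled from precomputed response maps, following the 3-level (1x1 / 2x2 / 4x4) pyramid described above; the use of max pooling per grid cell and the map counts are assumptions.

```python
import numpy as np

def object_bank_descriptor(response_maps):
    """Concatenate spatial-pyramid max-pooled detector responses.

    response_maps: list of 2D maps, one per (object detector, scale) pair.
    Each map contributes 1 + 4 + 16 = 21 pooled values to the descriptor.
    """
    feats = []
    for rmap in response_maps:
        H, W = rmap.shape
        for level in (1, 2, 4):  # 3-level spatial pyramid
            hs = np.array_split(np.arange(H), level)
            ws = np.array_split(np.arange(W), level)
            for hi in hs:
                for wi in ws:
                    feats.append(rmap[np.ix_(hi, wi)].max())
    return np.asarray(feats)

# Toy usage: 200 detectors x 2 scales, 32x32 response maps.
rng = np.random.default_rng(0)
maps = [rng.random((32, 32)) for _ in range(200 * 2)]
print(object_bank_descriptor(maps).shape)  # (400 * 21,) = (8400,)
```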
- Wang, Heng, et al. "Dense trajectories and motion boundary descriptors for action recognition." International journal of computer vision 103.1 (2013): 60-79.
- Tracks densely sampled points to obtain trajectories, in contrast with local representations. Not truly dense sampling: grid points are filtered by the minimum-eigenvalue criterion (Shi and Tomasi).
- Motion boundaries (derivatives of the optical flow field) to overcome camera motion
- Code online: http://lear.inrialpes.fr/people/wang/dense_trajectories
- The optical flow field is filtered with a median filter; implementation based on OpenCV (see the tracking sketch after this entry)
- Trajectory length is limited to overcome drift; static points and erroneous trajectories are filtered out
- Trajectory shape, HOG, HOF, and MBH descriptors are computed along each trajectory
- KTH (94.2%), Youtube (84.1%), Hollywood2 (58.2%), UCF Sports (88.0%), IXMAS (93.5%), UIUC (98.4%), Olympic Sports (74.1%), UCF50 (84.5%), HMDB51 (46.6%)
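A minimal OpenCV sketch of just the tracking stage (grid sampling, median-filtered Farneback flow, trajectories cut at a fixed length). The eigenvalue-based point filtering, the re-sampling of new points, and the HOG/HOF/MBH descriptor extraction are omitted, and all parameters here are illustrative.

```python
import cv2
import numpy as np

def track_dense_points(frames, step=5, track_len=15):
    """Track grid-sampled points through median-filtered dense optical flow.

    frames: list of BGR frames (np.uint8). Returns a list of trajectories,
    each a list of (x, y) points; trajectories are cut at `track_len` points
    to limit drift, and points that leave the frame are dropped.
    """
    h, w = frames[0].shape[:2]
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    tracks = [[(float(x), float(y))] for x, y in zip(xs.ravel(), ys.ravel())]
    finished = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Median-filter each flow component to suppress outliers.
        u = cv2.medianBlur(flow[..., 0].copy(), 5)
        v = cv2.medianBlur(flow[..., 1].copy(), 5)
        alive = []
        for tr in tracks:
            x, y = tr[-1]
            xi, yi = int(round(x)), int(round(y))
            if not (0 <= xi < w and 0 <= yi < h):
                continue                      # point left the frame
            tr.append((x + float(u[yi, xi]), y + float(v[yi, xi])))
            (finished if len(tr) >= track_len else alive).append(tr)
        tracks, prev = alive, gray
    return finished + tracks
```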
- Liang, Xiaodan, Liang Lin, and Liangliang Cao. "Learning latent spatio-temporal compositional model for human action recognition." Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013.
- Laptev STIP with HOF and HOG, with BOW quantization
- Leaf nodes for detecting action parts
- Or-nodes to account for intra-class variability
- And-nodes to aggregate the actions in a frame
- A root node to identify the temporal composition
- Contextual interactions (connecting leaf nodes)
- Everything is formulated in a latent SVM framework and solved by CCCP
- Since a leaf node can move from one Or-node to another, a reconfiguration step is used to rearrange the feature vector
- Evaluated on the UCF YouTube and Olympic Sports datasets
- Sadanand, Sreemanananth, and Jason J. Corso. "Action bank: A high-level representation of activity in video." Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
- 98.2% KTH, 95.0% UCF Sports, 57.9% UCF50, 26.9% HMDB51
- 205 video clips are used as templates to detect actions in novel videos.
- Detectors are sampled from multiple viewpoints and run at multiple scales
- Detector outputs are max-pooled over the spatio-temporal volume through various pooling units (see the pooling sketch after this entry)
- "Action spotting" is used for the template detector
- Code online: http://www.cse.buffalo.edu/~jcorso/r/actionbank/
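A minimal sketch of volumetric max pooling for one detector's spatio-temporal correlation volume. The octree-style 3-level pyramid (1 + 8 + 64 = 73 values per volume) is my reading of the pooling units; treat the exact scheme as an assumption.

```python
import numpy as np

def volumetric_max_pool(volume, levels=3):
    """Max-pool a (T, H, W) detector response volume over an octree-style
    pyramid: level k splits each axis into 2**k parts (1, 8, 64 cells)."""
    feats = []
    for level in range(levels):
        n = 2 ** level
        for tc in np.array_split(volume, n, axis=0):
            for yc in np.array_split(tc, n, axis=1):
                for xc in np.array_split(yc, n, axis=2):
                    feats.append(xc.max())
    return np.asarray(feats)

# Toy usage: one detector volume -> a 73-dimensional pooled feature.
print(volumetric_max_pool(np.random.default_rng(0).random((16, 32, 32))).shape)
```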
- Liu, Jingen, Benjamin Kuipers, and Silvio Savarese. "Recognizing human actions by attributes." Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011.
- 22 manually selected action attributes as semantic representation
- Data-driven attributes as complementary information
- Attributes as latent variables, analogous to the parts in the DPM model
- Accounts for class matching, attribute matching, and attribute co-occurrence
- STIPs from a 1D Gabor detector; gradient-based descriptors + BOW over the ST volume
- Evaluated on the UIUC, KTH, and Olympic Sports datasets
- Niebles, Juan Carlos, Hongcheng Wang, and Li Fei-Fei. "Unsupervised learning of human action categories using spatial-temporal words." International Journal of Computer Vision 79.3 (2008): 299-318.
- Unsupervised video categorization using pLSA and LDA (see the sketch after this entry)
- Action localization
- Laptev's STIPs are too sparse compared with Dollar's
- Simple gradient-based descriptors, with PCA applied to reduce dimensionality --> relies on the codebook to deal with invariance
- K-means with a Euclidean distance metric
- pLSA or LDA on top of BOW (the number of topics equals the number of categories to be recognized)
- Each STIP is associated with a visual word, hence a topic distribution, so localization is trivial to perform
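A minimal scikit-learn sketch of the topic-model step, using LDA (the paper also evaluates pLSA, which scikit-learn does not provide). The codebook size and the synthetic word counts here are placeholders.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Each video is a bag of quantized spatial-temporal words.
rng = np.random.default_rng(0)
bow = rng.poisson(1.0, size=(60, 500))           # 60 videos x 500 visual words

# Number of topics = number of action categories to recognize.
lda = LatentDirichletAllocation(n_components=6, random_state=0)
topic_dist = lda.fit_transform(bow)              # per-video topic distribution

labels = topic_dist.argmax(axis=1)               # unsupervised category labels
print(labels[:10])
```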
- Laptev, Ivan, et al. "Learning realistic human actions from movies." Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008.
- Videos are annotated by aligning movie transcripts
- Introduces a movie dataset
- Space-time interest points + HOG + HOF around an ST volume
- ST BOW: given a video sequence, there are multiple ways to segment it, each of which is called a channel
- Multi-channel \chi^2 kernel classification; channel selection by greedy shrinking (see the kernel sketch after this entry)
- Evaluated on the KTH (91.8%) and Movie (18.2%-53.3%) datasets
- STIP + HOG and HOF code: http://www.di.ens.fr/~laptev/download.html
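A minimal sketch of the multi-channel chi-square kernel, K(x, y) = exp(-sum_c D_c(x_c, y_c) / A_c), where D_c is the chi-square distance on channel c; taking the normalizer A_c as the mean distance of that channel is a common choice and an assumption here.

```python
import numpy as np

def chi2_dist(x, y, eps=1e-10):
    """Chi-square distance between two (non-negative) histograms."""
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

def multichannel_chi2_kernel(X, channels):
    """Gram matrix for the multi-channel chi^2 kernel.

    X: (n_samples, n_features) stacked histograms; channels: list of index
    arrays, one per channel (i.e. per spatio-temporal grid / descriptor type).
    """
    n = len(X)
    K = np.zeros((n, n))
    for idx in channels:
        D = np.array([[chi2_dist(x[idx], y[idx]) for y in X] for x in X])
        K += D / (D.mean() + 1e-10)       # A_c: mean distance of channel c
    return np.exp(-K)

# Toy usage: 8 samples, two channels of 50 histogram bins each.
rng = np.random.default_rng(0)
X = rng.random((8, 100))
G = multichannel_chi2_kernel(X, [np.arange(50), np.arange(50, 100)])
print(G.shape)
```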
[3] Action Recognition Datasets
Links to Datasets:
- "Free Viewpoint Action Recognition using Motion History Volumes (CVIU Nov./Dec. '06)."
D. Weinland, R. Ronfard, E. Boyer - "Actions as Space-Time Shapes (ICCV '05)."
M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri - "Recognizing Human Actions: A Local SVM Approach (ICPR '04)."
C. Schuldt, I. Laptev and B. Caputo - "Propagation Networks for Recognizing Partially Ordered Sequential Activity (CVPR
'04)."
Y. Shi, Y. Huang, D. Minnen, A. Bobick, I. Essa - "Tracking Multiple Objects through Occlusions (CVPR '05)."
Y. Huang, I. Essa - Sixth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS - ECCV 2004)
Recent Action Recognition Papers:
- D. Weinland, R. Ronfard, E. Boyer (CVIU Nov./Dec. '06), "Free Viewpoint Action Recognition using Motion History Volumes"
  11 actors, each performing 13 actions 3 times: Check Watch, Cross Arms, Scratch Head, Sit Down, Get Up, Turn Around, Walk, Wave, Punch, Kick, Point, Pick Up, Throw. Multiple views from 5 synchronized and calibrated cameras are provided.
- A. Yilmaz, M. Shah (ICCV '05), "Recognizing Human Actions in Videos Acquired by Uncalibrated Moving Cameras"
  18 sequences, 8 actions: 3 x Running, 3 x Bicycling, 3 x Sitting-down, 2 x Walking, 2 x Picking-up, 1 x Waving Hands, 1 x Forehand Stroke, 1 x Backhand Stroke
- Y. Sheikh, M. Shah (ICCV '05), "Exploring the Space of an Action for Human Action Recognition"
  6 actions: Sitting, Standing, Falling, Walking, Dancing, Running
- M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri (ICCV '05), "Actions as Space-Time Shapes"
  81 sequences, 9 actions, 9 people: Running, Walking, Bending, Jumping-Jack, Jumping-Forward-On-Two-Legs, Jumping-In-Place-On-Two-Legs, Galloping-Sideways, Waving-Two-Hands, Waving-One-Hand, Ballet
- A. Yilmaz, M. Shah (CVPR '05), "Action Sketch: A Novel Action Representation"
  28 sequences, 12 actions: 7 x Walking, 4 x Aerobics, 2 x Dancing, 2 x Sit-down, 2 x Stand-up, 2 x Kicking, 2 x Surrender, 2 x Hands-down, 2 x Tennis, 1 x Falling
- E. Shechtman, M. Irani (CVPR '05), "Space-Time Behavioral Correlation"
  Walking, Diving, Jumping, Waving Arms, Waving Hands, Ballet Figure, Water Fountain
- Y. Shi, Y. Huang, D. Minnen, A. Bobick, I. Essa (CVPR '04), "Propagation Networks for Recognition of Partially Ordered Sequential Actions"
  Glucose Monitor Calibration
- C. Schuldt, I. Laptev and B. Caputo (ICPR '04), "Recognizing Human Actions: A Local SVM Approach"
  6 actions x 25 subjects x 4 scenarios
- V. Parameswaran, R. Chellappa (CVPR '03), "View Invariants for Human Action Recognition"
  25 x Walk, 6 x Run, 18 x Sit-down
- D. Minnen, I. Essa, T. Starner (CVPR '03), "Expectation Grammars: Leveraging High-Level Expectations for Activity Recognition"
  Towers of Hanoi (only hands)
- A. Efros, A. Berg, G. Mori, J. Malik (ICCV '03), "Recognizing Actions at a Distance"
  Soccer, Tennis, Ballet
[4] CVPR 2014 Tutorial on Emerging Topics in Human Activity Recognition
[5] http://yangxd.org/projects/surveillance/SED13
[6] Recognition of human actions
Sample sequences for each action (DivX-compressed)
person15_walking_d1_uncomp.avi
person15_jogging_d1_uncomp.avi
person15_running_d1_uncomp.avi
person15_boxing_d1_uncomp.avi
person15_handwaving_d1_uncomp.avi
person15_handclapping_d1_uncomp.avi
Action database in zip-archives (DivX-compressed)
Note: The database is publicly available for non-commercial use. Please refer to [Schuldt, Laptev and Caputo, Proc. ICPR'04, Cambridge, UK] if you use this database in your publications.
walking.zip(242Mb)
jogging.zip(168Mb)
running.zip(149Mb)
boxing.zip(194Mb)
handwaving.zip(218Mb)
handclapping.zip(176Mb)
Related publications
"Recognizing Human Actions: A Local SVM Approach",
Christian Schuldt, Ivan Laptev and Barbara Caputo; in Proc. ICPR'04, Cambridge, UK. [Abstract PDF]"Local Spatio-Temporal Image Features for Motion Interpretation",
Ivan Laptev; PhD Thesis, 2004, Computational Vision and Active Perception Laboratory (CVAP), NADA, KTH, Stockholm [Abstract, PDF]"Local Descriptors for Spatio-Temporal Recognition",
Ivan Laptev and Tony Lindeberg; ECCV Workshop "Spatial Coherence for Visual Motion Analysis" [Abstract, PDF]"Velocity adaptation of space-time interest points",
Ivan Laptev and Tony Lindeberg; in Proc. ICPR'04, Cambridge, UK. [Abstract, PDF]"Space-Time Interest Points",
I. Laptev and T. Lindeberg; in Proc. ICCV'03, Nice, France, pp.I:432-439. [Abstract, PDF]