Many believe that the successes of deep learning on image understanding problems can be replicated in the realm of video understanding. However, due to the scale and temporal nature of video, the span of video understanding problems and the set of proposed deep learning solutions are arguably wider and more diverse than those of their 2D image siblings. Finding, identifying, and predicting actions are a few of the most salient tasks in this emerging and rapidly evolving field. With a pedagogical emphasis, this tutorial introduces and systematizes fundamental topics, basic concepts, and notable examples in supervised video action understanding. Specifically, we clarify a taxonomy of action problems, catalog and highlight video datasets, describe common video data preparation methods, present the building blocks of state-of-the-art deep learning model architectures, and formalize domain-specific metrics to baseline proposed solutions. This tutorial is intended to be accessible to a general computer science audience and assumes a conceptual understanding of supervised learning.