Video understanding is a growing field and a subject of intense research that includes many interesting tasks for understanding both spatial and temporal information, e.g., action detection, action recognition, video captioning, and video retrieval. One of the most challenging problems in video understanding is feature extraction, i.e., extracting a contextual visual representation from a given untrimmed video, due to the long and complicated temporal structure of unconstrained videos. Different from existing approaches, which apply a pre-trained backbone network as a black box to extract the visual representation, our approach aims to extract the most contextual information through an explainable mechanism. As we observe, humans typically perceive a video through the interactions between three main factors: the actors, the relevant objects, and the surrounding environment. Therefore, it is crucial to design a contextual, explainable video representation extraction method that can capture each of these factors and model the relationships between them. In this paper, we discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment. We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human perception-based contextual representation in video understanding. Source code is publicly available at https://github.com/UARK-AICV/Video_Representation.