Video-based computer vision tasks can benefit from estimating the salient regions in each frame and the interactions between those regions. Traditionally, this has been done by identifying object regions in the images using pre-trained models for object detection, object segmentation, and/or object pose estimation. Although using pre-trained models is a viable approach, it has several limitations: the need for exhaustive annotation of object categories, a possible domain gap between datasets, and the bias typically present in pre-trained models. In this work, we exploit the common rationale that a sequence of video frames captures a set of common objects and the interactions between them; hence, a notion of co-segmentation among the video frame features can equip the model to automatically focus on task-specific salient regions and improve the underlying task's performance in an end-to-end manner. To this end, we propose a generic module called the ``Co-Segmentation inspired Attention Module'' (COSAM) that can be plugged into any CNN model to promote co-segmentation-based attention among a sequence of video frame features. We show the application of COSAM in three video-based tasks, namely 1) video-based person re-ID, 2) video captioning, and 3) video action classification, and demonstrate that COSAM captures the task-specific salient regions in video frames, leading to notable performance improvements along with interpretable attention maps across a variety of video-based vision tasks, with possible application to other video-based tasks as well.
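To make the plug-in idea concrete, the sketch below shows one hypothetical way such a co-segmentation inspired attention block could be inserted between stages of a CNN backbone: per-frame feature maps are compared against a clip-level descriptor so that regions shared across frames are up-weighted. The class name CoSegAttention, the reduction ratio, and the pooling and softmax choices are illustrative assumptions, not the authors' exact COSAM design.

```python
import torch
import torch.nn as nn


class CoSegAttention(nn.Module):
    """Hypothetical sketch of a co-segmentation style attention block.

    Each frame's CNN features are compared against a clip-level summary so
    that regions common to the whole clip receive higher attention weights.
    Illustrative re-implementation only; not the authors' exact COSAM module.
    """

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Light projection before computing similarities (assumed design choice).
        self.project = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width) per-frame backbone features
        b, t, c, h, w = x.shape
        feats = self.project(x.reshape(b * t, c, h, w))        # (B*T, C', H, W)
        feats = feats.reshape(b, t, -1, h * w)                 # (B, T, C', HW)

        # Clip-level descriptor: average over frames and spatial positions.
        clip_desc = feats.mean(dim=(1, 3), keepdim=True)       # (B, 1, C', 1)

        # Spatial attention: similarity of each location to the clip descriptor.
        sim = (feats * clip_desc).sum(dim=2)                   # (B, T, HW)
        attn = torch.softmax(sim, dim=-1).reshape(b, t, 1, h, w)

        # Re-weight the original features; the residual term keeps the
        # backbone signal intact when attention is uninformative.
        return x * (1.0 + attn)


# Usage sketch: apply to mid-level features of a clip before the task head.
clip = torch.randn(2, 8, 256, 14, 14)          # (batch, frames, C, H, W)
cosam = CoSegAttention(channels=256)
out = cosam(clip)                               # same shape, re-weighted
```

Because the block preserves the input shape, it can in principle be dropped between any two convolutional stages of an existing backbone without changing the downstream task head.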