Efficient video recognition has become an active research topic amid the explosive growth of multimedia data on the Internet and mobile devices. Most existing methods select salient frames without using class-specific saliency scores, and thus neglect the implicit association between the saliency of a frame and the category it belongs to. To alleviate this issue, we devise a novel Temporal Saliency Query (TSQ) mechanism, which introduces class-specific information to provide fine-grained cues for saliency measurement. Specifically, we model class-specific saliency measurement as a query-response task: for each category, its common pattern serves as a query, and the most salient frames are retrieved as the response. The computed query-frame similarities are then adopted as the frame saliency scores. To this end, we propose the Temporal Saliency Query Network (TSQNet), which includes two instantiations of the TSQ mechanism, one based on visual appearance similarities and one on textual event-object relations. Cross-modality interactions are then imposed to promote information exchange between the two. Finally, we use the class-specific saliencies of the most confident categories generated by the two modalities to select the salient frames. Extensive experiments demonstrate the effectiveness of our method, which achieves state-of-the-art results on the ActivityNet, FCVID and Mini-Kinetics datasets. Our project page is at https://lawrencexia2008.github.io/projects/tsqnet .
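The query-response idea can be illustrated with a minimal sketch. It is not the TSQNet implementation: it covers a single modality, uses cosine similarity as the similarity measure, and aggregates the kept classes by a confidence-weighted sum, all of which are assumptions made for illustration. The function names (`cosine`, `class_specific_saliency`, `select_salient_frames`) and the parameters `top_classes` and `k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def class_specific_saliency(frames, class_queries):
    """Return saliency[c][t]: similarity of frame t to the query of class c."""
    return [[cosine(query, frame) for frame in frames] for query in class_queries]


def select_salient_frames(frames, class_queries, class_confidence,
                          top_classes=1, k=2):
    """Pick k frame indices using the saliencies of the most confident classes.

    Assumption: per-frame scores are a confidence-weighted sum over the
    kept classes; the paper may combine them differently.
    """
    saliency = class_specific_saliency(frames, class_queries)
    # Keep only the most confident categories.
    kept = sorted(range(len(class_queries)),
                  key=lambda c: class_confidence[c], reverse=True)[:top_classes]
    # Aggregate the class-specific saliencies into one score per frame.
    scores = [sum(class_confidence[c] * saliency[c][t] for c in kept)
              for t in range(len(frames))]
    # Take the k highest-scoring frames and return them in temporal order.
    top = sorted(range(len(frames)), key=lambda t: scores[t], reverse=True)[:k]
    return sorted(top)


# Toy example: four 2-D frame features and two class queries.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
queries = [[1.0, 0.0], [0.0, 1.0]]
selected = select_salient_frames(frames, queries, class_confidence=[0.8, 0.2])
```

In this toy setup the confident class is the first one, so the frames closest to its query are selected; swapping the confidences switches the selection to the frames that match the second query.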