Content-based Video Retrieval (CBVR) is used on media-sharing platforms for applications such as video recommendation and filtering. To manage databases that scale to billions of videos, video-level approaches that use fixed-size embeddings are preferred for their efficiency. In this paper, we introduce Video Region Attention Graph Networks (VRAG), which improve the state of the art among video-level methods. We represent videos at a finer granularity via region-level features and encode video spatio-temporal dynamics through region-level relations. Our VRAG captures the relationships between regions based on their semantic content via self-attention and the permutation-invariant aggregation of graph convolution. In addition, we show that the performance gap between video-level and frame-level methods can be reduced by segmenting videos into shots and using shot embeddings for video retrieval. We evaluate VRAG over several video retrieval tasks and achieve a new state of the art for video-level retrieval. Furthermore, our shot-level VRAG attains higher retrieval precision than existing video-level methods and comes closer to the performance of frame-level methods at faster evaluation speeds. Finally, our code will be made publicly available.
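To make the attention-plus-graph mechanism concrete, here is a minimal PyTorch sketch, not the authors' implementation: it applies self-attention over a set of region features, reuses the attention weights as a soft adjacency matrix for a single graph-convolution step, and mean-pools the nodes into one fixed-size video embedding. The class name `RegionAttentionGraphPool`, the feature dimension, and the one-layer design are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RegionAttentionGraphPool(nn.Module):
    """Toy sketch (hypothetical, not the paper's architecture):
    self-attention over region features, one graph-convolution step
    using the attention weights as a soft adjacency matrix, then a
    permutation-invariant mean readout into a fixed-size embedding."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gcn_weight = nn.Linear(dim, dim)  # shared per-node transform

    def forward(self, regions):
        # regions: (batch, num_regions, dim) -- all region features of a
        # video, flattened into one set of graph nodes.
        h, attn_weights = self.attn(regions, regions, regions)
        # Attention weights are row-normalized by softmax, so they act as
        # a soft adjacency matrix A; one GCN-style step computes A @ H @ W.
        h = torch.relu(self.gcn_weight(torch.bmm(attn_weights, h)))
        # Mean over nodes is permutation invariant, so the embedding does
        # not depend on region ordering or the number of regions.
        return h.mean(dim=1)

# Usage: 2 videos, 96 region nodes each, 256-d features.
model = RegionAttentionGraphPool()
emb = model(torch.randn(2, 96, 256))
print(emb.shape)  # torch.Size([2, 256])
```

Because the readout is a mean over nodes, videos of different lengths and region counts all map to embeddings of the same size, which is what enables efficient video-level indexing at scale.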
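The shot-level variant can be illustrated with a similarly hedged sketch, assuming each video is segmented into shots that are embedded independently; the function `shot_level_similarity` and its max-then-mean scoring rule are hypothetical, chosen only to show matching at shot granularity, which is finer than one whole-video embedding yet cheaper than frame-level comparison.

```python
import torch
import torch.nn.functional as F

def shot_level_similarity(query_shots, db_shots):
    """Hypothetical shot-level matching: score a video pair by the
    best-matching shot pairs.
    query_shots: (Q, dim), db_shots: (D, dim) -- one row per shot."""
    q = F.normalize(query_shots, dim=-1)
    d = F.normalize(db_shots, dim=-1)
    sim = q @ d.t()  # (Q, D) cosine similarities between all shot pairs
    # For each query shot, take its best database shot, then average.
    return sim.max(dim=1).values.mean()

# Usage: a 5-shot query video against an 8-shot database video.
score = shot_level_similarity(torch.randn(5, 256), torch.randn(8, 256))
```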