While recent progress in video-text retrieval has been driven by the exploration of better representation learning, in this paper we present a novel multi-space multi-grained supervised learning framework, SUMA, which learns an aligned representation space shared between video and text for video-text retrieval. The shared aligned space is initialized with a finite number of concept clusters, each of which refers to a number of basic concepts (words). With the text data at hand, we update the shared aligned space in a supervised manner using the proposed similarity and alignment losses. Moreover, to enable multi-grained alignment, we incorporate frame representations to better model the video modality and to compute both fine-grained and coarse-grained similarity. Benefiting from the learned shared aligned space and multi-grained similarity, SUMA outperforms existing methods in extensive experiments on several video-text retrieval benchmarks.
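To make the multi-grained similarity concrete, the following is a minimal sketch of how a coarse-grained (video-level) and a fine-grained (frame-level) score could be combined. The function name, the mean-pooling of frames, the max-over-frames matching, and the equal-weight fusion are all illustrative assumptions for exposition, not SUMA's actual formulation.

```python
import numpy as np

def cosine(a, b):
    # Row-wise cosine similarity: a is (n, d), b is (m, d); returns (n, m).
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def multi_grained_similarity(frame_feats, text_feat):
    """frame_feats: (n_frames, d) frame representations; text_feat: (d,) text representation.

    Hypothetical combination of two granularities (not the paper's exact loss):
      - coarse: cosine between the mean-pooled video vector and the text,
      - fine:   cosine between the best-matching single frame and the text.
    """
    video_feat = frame_feats.mean(axis=0)                     # coarse: pool frames
    coarse = float(cosine(video_feat[None], text_feat[None])[0, 0])
    fine = float(cosine(frame_feats, text_feat[None]).max())  # fine: best frame
    return 0.5 * (coarse + fine)                              # equal-weight fusion (assumption)
```

In practice such scores would be computed in a batch over all video-text pairs and fed to a contrastive retrieval loss; the sketch only shows the two granularities for a single pair.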