Multi-modal retrieval is an important problem for many applications, such as recommendation and search. Current benchmarks and even datasets are often manually constructed and consist of mostly clean samples where all modalities are well-correlated with the content. Thus, the current video-text retrieval literature largely focuses on video titles or audio transcripts, while ignoring user comments, since users often tend to discuss topics only vaguely related to the video. Despite the ubiquity of user comments online, there are currently no multi-modal representation learning datasets that include comments. In this paper, we a) introduce a new dataset of videos, titles and comments; b) present an attention-based mechanism that allows the model to learn from sometimes irrelevant data such as comments; c) show that by using comments, our method is able to learn better, more contextualised representations for image, video and audio. Project page: https://unitaryai.github.io/vtc-paper.
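The abstract only names the attention-based mechanism without describing its architecture. As a rough illustration of the general idea, and not the paper's actual implementation, the sketch below shows one way a video embedding could attend over a set of comment embeddings so that weakly related comments receive low weight before fusion. The module name, dimensions, and residual fusion step are all assumptions.

```python
# Hypothetical sketch (not the paper's architecture): scaled dot-product
# attention pooling that lets a video embedding softly select which comment
# embeddings to trust, down-weighting irrelevant comments.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CommentAttentionPool(nn.Module):
    """Aggregate N comment embeddings into one vector, weighted by their
    relevance to the video embedding."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # projects the video embedding
        self.key = nn.Linear(dim, dim)    # projects each comment embedding
        self.scale = dim ** -0.5

    def forward(self, video_emb: torch.Tensor, comment_embs: torch.Tensor) -> torch.Tensor:
        # video_emb: (B, D), comment_embs: (B, N, D)
        q = self.query(video_emb).unsqueeze(1)                           # (B, 1, D)
        k = self.key(comment_embs)                                        # (B, N, D)
        attn = F.softmax((q @ k.transpose(1, 2)) * self.scale, dim=-1)   # (B, 1, N)
        pooled = (attn @ comment_embs).squeeze(1)                         # (B, D)
        # Fuse the comment context with the video embedding (simple residual sum).
        return video_emb + pooled


# Toy usage with random features standing in for encoder outputs.
pool = CommentAttentionPool(dim=512)
video = torch.randn(2, 512)       # batch of 2 video embeddings
comments = torch.randn(2, 8, 512) # 8 comment embeddings per video
fused = pool(video, comments)
print(fused.shape)  # torch.Size([2, 512])
```

The point of this toy design is that irrelevant comments receive small attention weights and therefore contribute little to the fused representation, which is the behaviour the abstract attributes to its mechanism.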