Multi-modal learning from video data has seen increased attention recently, as it allows training semantically meaningful embeddings without human annotation, enabling tasks such as zero-shot retrieval and classification. In this work, we present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and to integrate them into a joint multi-modal representation, yielding an embedding that aggregates multi-modal temporal information. We propose to train the system with a combinatorial loss on everything at once, on single modalities as well as on pairs of modalities, explicitly leaving out any add-ons such as positional or modality encodings. At test time, the resulting model can process and fuse any number of input modalities. Moreover, the implicit properties of the transformer allow it to process inputs of different lengths. To evaluate the proposed approach, we train the model on the large-scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets, obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.
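The following is a minimal sketch, not the authors' implementation, of the two ideas named above: a modality-agnostic fusion transformer that simply concatenates token sequences from any subset of modalities (no positional or modality encodings) and a combinatorial contrastive loss that aligns single modalities and fused pairs. All names, dimensions, and the use of an InfoNCE-style contrastive term are illustrative assumptions; pre-extracted per-modality token features of a shared dimension are assumed as inputs.

```python
# Hypothetical sketch of a modality-agnostic fusion transformer and combinatorial loss.
# Not the paper's code: module names, hyperparameters, and the exact contrastive term
# are assumptions for illustration only.
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionTransformer(nn.Module):
    """Fuse tokens from any subset of modalities by concatenation (no positional or
    modality encodings) and mean-pool the transformer output into one embedding."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, *modalities: torch.Tensor) -> torch.Tensor:
        # Each modality tensor: (batch, seq_len_i, d_model); sequence lengths may differ.
        tokens = torch.cat(modalities, dim=1)      # (batch, sum of seq_len_i, d_model)
        fused = self.encoder(tokens)               # attention exchanges information across modalities
        return F.normalize(self.proj(fused.mean(dim=1)), dim=-1)  # (batch, d_model)


def contrastive_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings from matching clips."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def combinatorial_loss(model: FusionTransformer, video, audio, text) -> torch.Tensor:
    """Contrast every single modality and every fused pair against the disjoint rest,
    e.g. v<->a, v<->at, t<->va, ... (the 'everything at once' combinations)."""
    mods = {"v": video, "a": audio, "t": text}
    emb = {name: model(x) for name, x in mods.items()}                 # single modalities
    emb.update({"".join(p): model(mods[p[0]], mods[p[1]])              # fused pairs
                for p in itertools.combinations(mods, 2)})
    losses = [contrastive_loss(emb[x], emb[y])
              for x, y in itertools.combinations(emb, 2)
              if not set(x) & set(y)]                                  # only disjoint combinations
    return torch.stack(losses).mean()


if __name__ == "__main__":
    B, d = 8, 256
    model = FusionTransformer(d_model=d)
    video = torch.randn(B, 12, d)   # e.g. 12 video tokens per clip
    audio = torch.randn(B, 20, d)   # e.g. 20 audio tokens per clip
    text = torch.randn(B, 6, d)     # e.g. 6 text tokens per clip
    loss = combinatorial_loss(model, video, audio, text)
    loss.backward()
```

Because the fusion step is just token concatenation followed by self-attention, the same model can be called at test time with one, two, or three modalities of arbitrary lengths, which is what the abstract means by processing and fusing any number of input modalities.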