In this report, we propose a video-language pretraining (VLP) based solution \cite{kevin2022egovlp} for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge. In particular, we exploit the recently released Ego4D dataset \cite{grauman2021ego4d} to pioneer egocentric VLP along three axes: the pretraining dataset, the pretraining objective, and the development set. Building on these three designs, we develop a pretrained video-language model that transfers its egocentric video-text representation to the MIR benchmark. Furthermore, we devise an adaptive multi-instance max-margin loss to effectively fine-tune the model and adopt the dual-softmax technique for reliable inference. Our best single model obtains strong performance on the challenge test set with 47.39% mAP and 61.44% nDCG. The code is available at https://github.com/showlab/EgoVLP.
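For concreteness, below is a minimal sketch of the dual-softmax re-ranking idea applied at inference time: the text-to-video and video-to-text softmax-normalized similarity matrices are fused by an element-wise product before ranking. The temperature value and function name here are illustrative assumptions, not the exact EgoVLP implementation (see the released code for details).

```python
import torch

def dual_softmax_rerank(sim: torch.Tensor, temperature: float = 100.0) -> torch.Tensor:
    """Fuse both retrieval directions of a similarity matrix via dual softmax.

    sim: [num_texts, num_videos] matrix of text-video similarity scores.
    Returns a matrix of the same shape whose rankings combine the
    text->video and video->text normalized scores.
    (Sketch only; the temperature is an assumed hyperparameter.)
    """
    # Softmax over videos for each text query (text -> video direction).
    t2v = torch.softmax(sim * temperature, dim=1)
    # Softmax over texts for each video (video -> text direction).
    v2t = torch.softmax(sim * temperature, dim=0)
    # Element-wise product fuses the two directions into one score matrix.
    return t2v * v2t

# Usage: rank videos for each text query with the fused scores.
sim = torch.randn(8, 16)  # toy similarity matrix
ranked = dual_softmax_rerank(sim).argsort(dim=1, descending=True)
```

Intuitively, a video only scores highly for a text query if the query also ranks highly among that video's candidate texts, which suppresses "hub" videos that are similar to many queries.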