In this report, we present the ReLER@ZJU-Alibaba submission to the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2022. Given a video clip and a text query, the goal of the challenge is to localize the temporal segment of the clip in which the answer to the query can be found. To tackle this task, we propose a multi-scale cross-modal transformer and a video frame-level contrastive loss to fully uncover the correlation between language queries and video clips. In addition, we propose two data augmentation strategies to increase the diversity of training samples. Experimental results demonstrate the effectiveness of our method, and our final submission ranked first on the challenge leaderboard.
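
To make the idea of a frame-level contrastive loss concrete, the sketch below shows one plausible InfoNCE-style form. It is an illustration under stated assumptions, not necessarily the formulation used in the submission: the function names (`cosine`, `frame_contrastive_loss`), the use of cosine similarity, the `temperature` value, and the choice to treat frames inside the annotated segment as positives and all other frames as negatives are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def frame_contrastive_loss(query_vec, frame_vecs, inside_mask, temperature=0.1):
    """InfoNCE-style loss over the frames of one clip (illustrative assumption).

    query_vec:   embedding of the language query
    frame_vecs:  one embedding per video frame
    inside_mask: True for frames inside the annotated segment (positives),
                 False for frames outside it (negatives)
    """
    if not any(inside_mask):
        raise ValueError("at least one frame must lie inside the segment")

    # Temperature-scaled similarity of the query to every frame.
    logits = [cosine(query_vec, f) / temperature for f in frame_vecs]

    # Subtract the maximum before exponentiating, for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]

    # Probability mass assigned to positive frames among all frames.
    positive_mass = sum(e for e, inside in zip(exps, inside_mask) if inside)
    total_mass = sum(exps)
    return -math.log(positive_mass / total_mass)


if __name__ == "__main__":
    query = [1.0, 0.0]
    frames = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    # The first frame matches the query and is labeled as inside the segment.
    print(frame_contrastive_loss(query, frames, [True, False, False]))
```

Under these assumptions the loss is small when the frames most similar to the query lie inside the annotated segment, and grows when they lie outside it, which pushes query and in-segment frame representations together.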