Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training

Pre-trained neural models have recently achieved impressive performance in understanding multimodal content. However, pre-training neural models for video and language understanding remains very challenging, especially for Chinese video-language data, for three reasons. First, existing video-language pre-training algorithms mainly focus on the co-occurrence of words and video frames, but ignore other valuable semantic and structural information in video-language content, e.g., sequential order and spatiotemporal relationships. Second, there are conflicts between video-sentence alignment and other proxy tasks. Third, there is a lack of large-scale, high-quality Chinese video-language datasets (e.g., including 10 million unique videos), which are a fundamental condition for the success of pre-training techniques. In this work, we propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training. Besides general proxy tasks such as masked language modeling, VICTOR constructs several novel proxy tasks under the contrastive learning paradigm, making the model more robust and able to capture more complex multimodal semantic and structural relationships from different perspectives. VICTOR is trained on a large-scale Chinese video-language dataset that includes over 10 million complete videos with corresponding high-quality textual descriptions. We apply the pre-trained VICTOR model to a series of downstream applications and demonstrate its superior performance compared with state-of-the-art pre-training methods such as VideoBERT and UniVL. The code and trained checkpoints will be made publicly available to support further research by the community.
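The abstract does not spell out the form of VICTOR's contrastive proxy tasks. As a rough illustration of the general contrastive learning paradigm applied to video-sentence alignment, the sketch below computes an InfoNCE-style loss, a commonly used contrastive objective, over paired video and text embeddings. The function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all assumptions made for illustration; this is not the paper's actual objective or code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(video_embs, text_embs, temperature=0.07):
    """Mean video-to-text InfoNCE loss (illustrative, not VICTOR's loss).

    text_embs[i] is treated as the positive for video_embs[i]; every
    other text in the batch serves as a negative.
    """
    losses = []
    for i, video in enumerate(video_embs):
        logits = [cosine(video, text) / temperature for text in text_embs]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the matching text.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy batch (hypothetical embeddings): each video has one matching text.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
shuffled_texts = [aligned_texts[1], aligned_texts[0]]

aligned_loss = info_nce(videos, aligned_texts)
shuffled_loss = info_nce(videos, shuffled_texts)
print(aligned_loss < shuffled_loss)
```

Correctly paired embeddings yield a lower loss than mismatched ones, which is the signal a contrastive proxy task uses to pull matching video and text representations together and push non-matching ones apart.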
