Video-Language Pretraining (VLP), which aims to learn transferable representations to advance a wide range of video-text downstream tasks, has recently received increasing attention. The best-performing works rely on large-scale, third-person video-text datasets, such as HowTo100M. In this work, we exploit the recently released Ego4D dataset to pioneer Egocentric VLP along three directions. (i) We create EgoClip, a first-person video-text pretraining dataset comprising 3.8M clip-text pairs well-chosen from Ego4D, covering a large variety of human daily activities. (ii) We propose a novel pretraining objective, dubbed EgoNCE, which adapts video-text contrastive learning to the egocentric domain by mining egocentric-aware positive and negative samples. (iii) We introduce EgoMCQ, a development benchmark that is close to EgoClip and hence supports effective validation and fast exploration of our design decisions for EgoClip and EgoNCE. Furthermore, we demonstrate strong performance on five egocentric downstream tasks across three datasets: video-text retrieval on EPIC-KITCHENS-100; action recognition on Charades-Ego; and natural language query, moment query, and object state change classification on the Ego4D challenge benchmarks. The dataset and code are available at https://github.com/showlab/EgoVLP.
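To make the idea of an egocentric-aware contrastive objective concrete, below is a minimal sketch of what an EgoNCE-style loss could look like, assuming a standard InfoNCE base over paired clip and text embeddings in PyTorch. The function name `egonce_style_loss`, the `pos_mask` construction, and the temperature value are illustrative assumptions rather than the paper's exact formulation, and the negative-mining component is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def egonce_style_loss(video_emb, text_emb, pos_mask, temperature=0.05):
    """Illustrative sketch of an egocentric-aware video-text contrastive loss.

    video_emb, text_emb: (B, D) embeddings of paired clips and narrations.
    pos_mask: (B, B) boolean matrix; pos_mask[i, j] is True when text j is
        treated as a positive for video i (e.g. narrations describing similar
        actions), with the diagonal always True for the original pairs.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.T / temperature                      # (B, B) similarity logits

    # Video-to-text direction: softmax over all texts in the batch, then
    # average the log-probabilities of the mined positives for each video.
    log_prob_v2t = F.log_softmax(sim, dim=1)
    loss_v2t = -(log_prob_v2t * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)

    # Text-to-video direction, symmetric to the above.
    log_prob_t2v = F.log_softmax(sim.T, dim=1)
    loss_t2v = -(log_prob_t2v * pos_mask.T).sum(1) / pos_mask.T.sum(1).clamp(min=1)

    return 0.5 * (loss_v2t.mean() + loss_t2v.mean())
```

With `pos_mask` set to the identity matrix, this reduces to the usual symmetric InfoNCE objective; the egocentric adaptation described above amounts to enriching the positive set (and, in the full objective, the negative set) using structure specific to first-person video.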