We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, both in zero-shot and finetuned setups. Most notably, LaViLa obtains an absolute gain of 10.1% on the EGTEA classification and 5.9% on the Epic-Kitchens-100 multi-instance retrieval benchmarks, respectively. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior on increasing pre-training data and model size.
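To make the second stage concrete, below is a minimal sketch (not the authors' released code) of training a dual-encoder video-text embedding contrastively on a batch of clip-narration pairs, where some narrations would come from the LLM narrator. Class and variable names such as `DualEncoder`, `video_feats`, and `text_feats` are hypothetical placeholders, and pre-extracted features stand in for real video and text backbones.

```python
# Minimal sketch of contrastive video-text training on a mixed batch of
# human and auto-generated (LLM-narrated) clip-text pairs. All names here
# are illustrative assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualEncoder(nn.Module):
    """Toy video-text dual encoder with a learnable temperature."""

    def __init__(self, video_dim=512, text_dim=512, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(1 / 0.07).log())

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t, self.logit_scale.exp()


def contrastive_loss(v, t, scale):
    """Symmetric InfoNCE loss: matching clip-narration pairs are positives."""
    logits = scale * v @ t.T
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


# One toy training step; half of the narrations in a real batch would be
# pseudo-narrations produced by the visually conditioned LLM narrator.
model = DualEncoder()
video_feats = torch.randn(8, 512)   # stand-in for video clip features
text_feats = torch.randn(8, 512)    # stand-in for narration embeddings
v, t, scale = model(video_feats, text_feats)
loss = contrastive_loss(v, t, scale)
loss.backward()
print(f"contrastive loss: {loss.item():.3f}")
```

The key point the sketch illustrates is that the auto-generated narrations simply enlarge and densify the set of positive clip-text pairs; the contrastive objective itself is unchanged.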