Visual information is central to conversation: body gestures and facial expressions, for example, contribute to meaning that transcends words alone. To date, however, most neural conversational models are limited to just text. We introduce CHAMPAGNE, a generative model of conversations that can account for visual contexts. To train CHAMPAGNE, we collect and release YTD-18M, a large-scale corpus of 18M video-based dialogues. YTD-18M is constructed from web videos: crucial to our data collection pipeline is a pretrained language model that converts error-prone automatic transcripts to a cleaner dialogue format while maintaining meaning. Human evaluation reveals that YTD-18M is more sensible and specific than prior resources (MMDialog, 1M dialogues), while maintaining visual-groundedness. Experiments demonstrate that 1) CHAMPAGNE learns to conduct conversation from YTD-18M; and 2) when fine-tuned, it achieves state-of-the-art results on four vision-language tasks focused on real-world conversations. We release data, models, and code at https://seungjuhan.me/champagne.