As sharing images is an essential part of instant messaging, there has been active research on learning image-text multi-modal dialogue models. However, training a well-generalized multi-modal dialogue model is challenging because existing multi-modal dialogue datasets are small in scale, cover limited topics, and offer a restricted variety of images per dialogue. In this paper, we present a multi-modal dialogue dataset creation pipeline that matches large-scale images to dialogues based on CLIP similarity. Using this automatic pipeline, we construct DialogCC, a large-scale multi-modal dialogue dataset that covers diverse real-world topics and provides various images per dialogue. With extensive experiments, we demonstrate that training a multi-modal dialogue model on our dataset improves generalization performance. Additionally, existing models trained on our dataset achieve state-of-the-art performance on image and text retrieval tasks. The source code and the dataset will be released after publication.
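As a rough illustration of the CLIP-based matching step described above, the sketch below scores candidate images against dialogue utterances using an off-the-shelf CLIP model from the HuggingFace transformers library. The model checkpoint, example inputs, and similarity threshold are all assumptions for illustration; the paper's actual pipeline may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf CLIP encoder (checkpoint choice is an assumption).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical dialogue utterances and candidate images.
utterances = ["I just adopted a puppy!", "Look at this gorgeous sunset."]
images = [Image.open("dog.jpg"), Image.open("beach.jpg")]

inputs = processor(text=utterances, images=images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Scaled cosine similarities between every utterance and image embedding;
# shape: (num_utterances, num_images).
similarity = outputs.logits_per_text

# Keep (utterance, image) pairs above a threshold; the value here is
# hypothetical, not the one used in the paper.
matches = (similarity > 25.0).nonzero()
print(matches)
```

In a pipeline like this, thresholding the pairwise similarity matrix lets each utterance be matched with zero, one, or several images, which is what enables a varied set of images per dialogue.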