Visual dialog (VisDial) is the task of answering a sequence of questions grounded in an image, using the dialog history as context. Prior work has trained dialog agents solely on VisDial data via supervised learning, or has leveraged pre-training on related vision-and-language datasets. This paper presents a semi-supervised learning approach for visually-grounded dialog, called Generative Self-Training (GST), that leverages unlabeled images from the Web. Specifically, GST first retrieves in-domain images through out-of-distribution detection and generates synthetic dialogs about those images via multimodal conditional text generation. GST then trains a dialog agent on both the synthetic and the original VisDial data. As a result, GST scales the amount of training data up to an order of magnitude larger than VisDial (from 1.2M to 12.9M QA pairs). For robust training on the generated dialogs, we also propose perplexity-based data selection and multimodal consistency regularization. Evaluation on the VisDial v1.0 and v0.9 datasets shows that GST achieves new state-of-the-art results on both. We further observe strong performance gains in the low-data regime (up to 9.35 absolute points on NDCG).
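To make the perplexity-based selection step concrete, below is a minimal, self-contained sketch. The abstract does not specify the exact filtering rule, so this assumes a simple scheme that keeps the fraction of synthetic QA pairs to which the teacher assigns the lowest perplexity; `select_by_perplexity`, `keep_ratio`, and the dictionary fields are illustrative names, not the authors' implementation.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a generated answer from its per-token log-probabilities:
    exp of the negative mean log-likelihood."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def select_by_perplexity(synthetic_qa, keep_ratio=0.5):
    """Keep the keep_ratio fraction of synthetic QA pairs with the lowest
    perplexity, i.e., the pairs the generator is most confident about.
    (keep_ratio is an assumed hyperparameter, not from the paper.)"""
    scored = sorted(synthetic_qa, key=lambda qa: perplexity(qa["logprobs"]))
    return scored[: int(len(scored) * keep_ratio)]

# Usage: each synthetic QA pair carries the teacher's per-token log-probs.
data = [
    {"question": "What color is the dog?", "answer": "brown",
     "logprobs": [-0.1, -0.3]},
    {"question": "Is it raining?", "answer": "maybe possibly",
     "logprobs": [-2.0, -3.5]},
]
print(select_by_perplexity(data))  # keeps only the low-perplexity pair
```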