We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models. Constructing a large-scale labeled image captioning dataset is expensive in terms of labor, time, and cost. Compared with manually annotating all the training samples, collecting uni-modal datasets separately, e.g., a large-scale image dataset and a sentence dataset, is far easier. We leverage such massive unpaired image and caption data, on top of standard paired data, by learning to associate them. To this end, our proposed semi-supervised learning method assigns pseudo-labels to unpaired samples via adversarial learning, under which the joint distribution of images and captions is learned. Our method trains a captioner on the paired data while progressively associating the unpaired data. This approach yields noticeable performance improvements even in challenging scenarios, including out-of-task data (i.e., relational captioning, where the target task differs from that of the unpaired data) and web-crawled data. We also show that our proposed method is theoretically well-motivated and has a desirable global optimality property. Extensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, together with a comprehensive analysis on the scarcely-paired COCO dataset, demonstrate that our semi-supervised learning method with unpaired data is consistently effective compared to competing methods.
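To make the adversarial pseudo-labeling idea above concrete, the following is a minimal illustrative sketch, not the paper's actual implementation. It assumes pre-extracted image and caption features of hypothetical dimensions, a stand-in linear "captioner", an MSE surrogate for the supervised captioning loss (a real captioner would use token-level cross-entropy), and hypothetical helpers `JointDiscriminator` and `assign_pseudo_captions`: a discriminator scores (image, caption) pairs under the joint distribution, each unpaired image is pseudo-labeled with the highest-scoring caption from an unpaired pool, and the captioner is trained on paired data plus an adversarial term on the pseudo-paired data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical feature dimensions; in practice these would come from
# pretrained image and sentence encoders.
IMG_DIM, CAP_DIM, HID = 256, 256, 128

class JointDiscriminator(nn.Module):
    """Scores how plausible an (image, caption) pair is under the true
    joint distribution, as opposed to an assigned pseudo-pair."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + CAP_DIM, HID), nn.ReLU(),
            nn.Linear(HID, 1))

    def forward(self, img, cap):
        return self.net(torch.cat([img, cap], dim=-1)).squeeze(-1)

def assign_pseudo_captions(disc, imgs, caps):
    """Pseudo-label each unpaired image with the pool caption that the
    discriminator currently scores as its most plausible partner."""
    with torch.no_grad():
        n_img, n_cap = imgs.size(0), caps.size(0)
        # Score every (image, caption) combination, then take the argmax.
        img_rep = imgs.unsqueeze(1).expand(n_img, n_cap, IMG_DIM)
        cap_rep = caps.unsqueeze(0).expand(n_img, n_cap, CAP_DIM)
        scores = disc(img_rep.reshape(-1, IMG_DIM),
                      cap_rep.reshape(-1, CAP_DIM)).view(n_img, n_cap)
        best = scores.argmax(dim=1)
    return caps[best]

disc = JointDiscriminator()
captioner = nn.Linear(IMG_DIM, CAP_DIM)  # stand-in for a real captioner
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(captioner.parameters(), lr=1e-4)

# Toy data: a small paired set plus larger unpaired image and caption pools.
paired_img, paired_cap = torch.randn(32, IMG_DIM), torch.randn(32, CAP_DIM)
unpaired_img, caption_pool = torch.randn(64, IMG_DIM), torch.randn(500, CAP_DIM)

# One adversarial round. First, assign pseudo-captions to unpaired images.
pseudo_cap = assign_pseudo_captions(disc, unpaired_img, caption_pool)

# Discriminator step: real joint pairs vs. assigned pseudo-pairs.
d_loss = (F.binary_cross_entropy_with_logits(
              disc(paired_img, paired_cap), torch.ones(32)) +
          F.binary_cross_entropy_with_logits(
              disc(unpaired_img, pseudo_cap), torch.zeros(64)))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Captioner step: supervised loss on paired data plus an adversarial loss
# pushing its outputs on unpaired images toward the learned joint distribution.
g_loss = (F.mse_loss(captioner(paired_img), paired_cap) +
          F.binary_cross_entropy_with_logits(
              disc(unpaired_img, captioner(unpaired_img)), torch.ones(64)))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

In a full training loop the pseudo-assignment, discriminator step, and captioner step would alternate, so the pseudo-pairs improve as the learned joint distribution sharpens.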