Training large-scale image captioning (IC) models demands access to a rich and diverse set of training examples gathered from the wild, often from noisy alt-text data. However, recent modeling approaches to IC often fall short in this setting because they assume a clean annotated dataset (as opposed to the noisier alt-text-based annotations) and employ an end-to-end generation approach, which often lacks both controllability and interpretability. We address these problems by breaking down the task into two simpler, more controllable tasks: skeleton prediction and skeleton-based caption generation. Specifically, we show that selecting content words as skeletons helps in generating improved and denoised captions when leveraging rich yet noisy alt-text-based uncurated datasets. We also show that the predicted English skeletons can be further leveraged cross-lingually to generate non-English captions, and present experimental results covering caption generation in French, Italian, German, Spanish, and Hindi. Finally, we show that skeleton-based prediction allows for better control of certain caption properties, such as length, content, and gender expression, providing a handle to perform human-in-the-loop semi-automatic corrections.