When trained on large-scale datasets, image captioning models can understand the content of images from a general domain but often fail to generate accurate, detailed captions. To improve performance, pretraining-and-finetuning has been a key strategy for image captioning. However, we find that large-scale bidirectional training between image and text enables zero-shot image captioning. In this paper, we introduce Bidirectional Image Text Training in largER Scale, BITTERS, an efficient training and inference framework for zero-shot image captioning. We also propose a new evaluation benchmark, which comprises high-quality datasets and an extensive set of metrics to properly evaluate zero-shot captioning accuracy and societal bias. We additionally provide an efficient finetuning approach for keyword extraction. We show that careful selection of the large-scale training set and model architecture is the key to achieving zero-shot image captioning.