Recent advances in image captioning have focused on scaling the data and model size, substantially increasing the cost of pre-training and fine-tuning. As an alternative to large models, we present SmallCap, which generates a caption conditioned on an input image and related captions retrieved from a datastore. Our model is lightweight and fast to train as the only learned parameters are in newly introduced cross-attention layers between a pre-trained CLIP encoder and GPT-2 decoder. SmallCap can transfer to new domains without additional fine-tuning and exploit large-scale data in a training-free fashion because the contents of the datastore can be readily replaced. Our experiments show that SmallCap, trained only on COCO, has competitive performance on this benchmark, and also transfers to other domains without retraining, solely through retrieval from target-domain data. Further improvement is achieved through the training-free exploitation of diverse human-labeled and web data, which proves effective for other domains, including the nocaps image captioning benchmark, designed to test generalization to unseen visual concepts.
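As a rough illustration of the architecture described above, the sketch below wires a frozen CLIP vision encoder to a GPT-2 decoder through newly added cross-attention layers, which are the only trainable parameters. It is a minimal sketch assuming the HuggingFace transformers API (CLIPVisionModel, GPT2LMHeadModel with add_cross_attention=True); the prompt template, the helper function, and the parameter-name matching are illustrative assumptions, not the authors' released implementation.

# Minimal sketch (not the authors' code): frozen CLIP encoder + GPT-2
# decoder with newly added, trainable cross-attention layers.
# Retrieval is mocked as a list of captions prepended to the prompt.
import torch
from transformers import (
    CLIPVisionModel, CLIPImageProcessor,
    GPT2Config, GPT2LMHeadModel, GPT2Tokenizer,
)

# Frozen CLIP image encoder (ViT-B/32; hidden size 768 matches GPT-2's).
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
encoder.requires_grad_(False)

# GPT-2 decoder with cross-attention added on top of the pre-trained
# weights; only those newly initialized layers will be trained.
config = GPT2Config.from_pretrained("gpt2", add_cross_attention=True)
decoder = GPT2LMHeadModel.from_pretrained("gpt2", config=config)
for name, param in decoder.named_parameters():
    # String match against the module names used by the HuggingFace GPT-2
    # implementation ("crossattention", "ln_cross_attn"); an assumption.
    param.requires_grad = ("crossattention" in name) or ("ln_cross_attn" in name)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

def caption_loss(image, retrieved_captions, target_caption):
    """One training example's loss: the decoder reads the retrieved
    captions as a textual prompt and cross-attends to the frozen CLIP
    features of the image. (Hypothetical helper; prompt wording is
    illustrative, and a real setup would mask prompt tokens in the labels.)"""
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        image_feats = encoder(pixel_values).last_hidden_state
    prompt = ("Similar images show " + " ".join(retrieved_captions)
              + " This image shows " + target_caption)
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = decoder(input_ids=ids, encoder_hidden_states=image_feats, labels=ids)
    return out.loss

In this setup, domain transfer only changes what retrieved_captions contains (i.e., the datastore contents), not any model weights, which is what makes the training-free use of new data possible.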