Image Captioning (IC) has achieved astonishing progress by incorporating various techniques into the CNN-RNN encoder-decoder architecture. However, since CNNs and RNNs do not share a basic network component, such a heterogeneous pipeline is hard to train end-to-end, and the visual encoder learns nothing from the caption supervision. This drawback has inspired researchers to develop a homogeneous architecture that facilitates end-to-end training. The Transformer is a perfect candidate: it has proven its huge potential in both the vision and language domains and can thus serve as the basic component of both the visual encoder and the language decoder in an IC pipeline. Meanwhile, self-supervised learning unleashes the power of the Transformer architecture, as a large-scale pre-trained model generalizes to various tasks, including IC. The success of these large-scale models seems to weaken the importance of the single IC task. However, we demonstrate that IC still has its specific significance in this age by analyzing the connections between IC and some popular self-supervised learning paradigms. Due to the page limitation, we only refer to highly important papers in this short survey; more related works can be found at https://github.com/SjokerLily/awesome-image-captioning.