Current state-of-the-art approaches to image captioning typically generate descriptions in an autoregressive manner, i.e., word by word, which suffers from slow decoding and becomes a bottleneck in real-time applications. Non-autoregressive image captioning with continuous iterative refinement, which eliminates the sequential dependence within a sentence, can achieve performance comparable to its autoregressive counterparts with considerable acceleration. Nevertheless, through a carefully designed experiment, we empirically show that the number of refinement iterations can be effectively reduced when sufficient prior knowledge is provided to the language decoder. To this end, we propose a novel two-stage framework, referred to as Semi-Autoregressive Image Captioning (SAIC), to achieve a better trade-off between performance and speed. The SAIC model keeps the autoregressive property globally while relaxing it locally. Specifically, SAIC first generates an intermittent sequence in an autoregressive manner, i.e., it predicts the first word of every word group in order. Then, conditioned on this partially deterministic prior information and the image features, SAIC fills in all the skipped words non-autoregressively within a single iteration. Experimental results on the MS COCO benchmark demonstrate that our SAIC model outperforms preceding non-autoregressive image captioning models while achieving a competitive inference speedup. Code is available at https://github.com/feizc/SAIC.
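The two-stage decoding described above can be summarized with a minimal sketch. The function and parameter names below (`ar_decoder`, `na_decoder`, `group_size`, `num_groups`) are illustrative assumptions rather than the released SAIC implementation; they only serve to show the global-autoregressive / local-non-autoregressive split.

```python
# Minimal sketch of SAIC-style two-stage decoding (hypothetical interfaces;
# `ar_decoder` and `na_decoder` stand in for the trained decoder components).

from typing import Callable, List

MASK = "<mask>"


def saic_decode(
    image_feats,                                            # pre-extracted image features
    ar_decoder: Callable[[object, List[str]], str],         # predicts the next group-leading word
    na_decoder: Callable[[object, List[str]], List[str]],   # fills every <mask> in one pass
    group_size: int = 3,                                     # K: one autoregressive word per K-word group
    num_groups: int = 5,                                     # number of word groups to emit
) -> List[str]:
    """Autoregressive in global, non-autoregressive in local."""
    # Stage 1: generate the intermittent sequence, i.e. the first word of
    # every word group, one group at a time (autoregressive).
    leading_words: List[str] = []
    for _ in range(num_groups):
        leading_words.append(ar_decoder(image_feats, leading_words))

    # Build the skeleton: each leading word is followed by (K - 1) mask slots.
    skeleton: List[str] = []
    for word in leading_words:
        skeleton.append(word)
        skeleton.extend([MASK] * (group_size - 1))

    # Stage 2: fill all skipped words in a single non-autoregressive iteration,
    # conditioned on the image features and the deterministic prior words.
    return na_decoder(image_feats, skeleton)
```

Because Stage 1 only runs for `num_groups` steps instead of the full caption length, and Stage 2 is a single parallel pass, the sequential cost drops roughly by the group size compared with fully autoregressive decoding.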