We combine a neural image captioner with a Rational Speech Acts (RSA) model to make a system that is pragmatically informative: its objective is to produce captions that are not merely true but also distinguish their inputs from similar images. Previous attempts to combine RSA with neural image captioning require inference that normalizes over the entire set of possible utterances. This poses a serious efficiency problem, previously addressed by sampling a small subset of possible utterances. We instead solve this problem by implementing a version of RSA that operates at the level of characters ("a", "b", "c", ...) during the unrolling of the caption. We find that the utterance-level effect of referential captions can be obtained with only character-level decisions. Finally, we introduce an automatic method for testing the performance of pragmatic speaker models, and show that our model outperforms both a non-pragmatic baseline and a word-level RSA captioner.
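The character-level RSA idea can be illustrated with a toy sketch (not the paper's trained captioner; all distributions and image names below are invented). A base speaker S0 assigns next-character probabilities given an image and the caption prefix; a literal listener L0 inverts S0 by Bayes' rule over the candidate images; the pragmatic speaker S1 reweights S0 by how strongly each character points the listener at the target image:

```python
def pragmatic_char_dist(s0_by_image, target, alpha=1.0, prior=None):
    """Toy incremental RSA step over characters.

    s0_by_image: {image: {char: P_S0(char | image, prefix)}} for one prefix.
    Returns S1(char | target): S0 reweighted by the literal listener's
    posterior that the character refers to the target image.
    """
    images = list(s0_by_image)
    if prior is None:
        prior = {img: 1.0 / len(images) for img in images}
    s1 = {}
    for c in s0_by_image[target]:
        # Literal listener: L0(image | char) ∝ S0(char | image) * prior(image)
        scores = {img: s0_by_image[img].get(c, 0.0) * prior[img]
                  for img in images}
        z = sum(scores.values()) or 1.0
        l0_target = scores[target] / z
        # Pragmatic speaker: S1(char) ∝ L0(target | char)^alpha * S0(char | target)
        s1[c] = (l0_target ** alpha) * s0_by_image[target][c]
    z = sum(s1.values())
    return {c: p / z for c, p in s1.items()}

# Two similar images. For the target, captions could start with "r" ("red ...")
# or "d" ("dog ..."); the distractor's captions overwhelmingly start with "d".
s0 = {"img_red_dog":   {"r": 0.40, "d": 0.60},
      "img_brown_dog": {"r": 0.05, "d": 0.95}}
dist = pragmatic_char_dist(s0, "img_red_dog")
```

In this example the base speaker slightly prefers "d", but the pragmatic speaker boosts "r" because that character better distinguishes the target from the distractor, which is the character-level analogue of the utterance-level referential effect described above.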