Sentence representation models trained only on language may suffer from the grounding problem. Recent work has shown promising results in improving the quality of sentence representations by jointly training them with associated image features. However, the grounding capability remains limited because the architecture connects input sentences to image features only distantly. To further close this gap, we propose applying a self-attention mechanism to the sentence encoder to deepen the grounding effect. Our results on transfer tasks show that self-attentive encoders are better suited for visual grounding, as they learn to exploit specific words with strong visual associations.
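As a concrete illustration of the mechanism the abstract refers to, the sketch below shows one common way to add self-attention on top of a recurrent sentence encoder: each token's hidden state is scored by a small MLP, the scores are normalized with a softmax, and the sentence vector is the attention-weighted sum of hidden states. This is a minimal sketch under assumed choices (bi-directional GRU, layer sizes, and the scoring MLP), not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentiveEncoder(nn.Module):
    """Minimal self-attentive sentence encoder sketch (assumed hyperparameters)."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        # Two-layer MLP that assigns an attention score to each token's hidden state.
        self.attn = nn.Sequential(
            nn.Linear(2 * hidden_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids
        h, _ = self.rnn(self.embed(tokens))           # (batch, seq_len, 2*hidden_dim)
        scores = self.attn(h).squeeze(-1)             # (batch, seq_len)
        weights = F.softmax(scores, dim=-1)           # attention distribution over tokens
        sentence = torch.bmm(weights.unsqueeze(1), h).squeeze(1)  # weighted sum of states
        # The weights reveal which words the encoder emphasizes, e.g. visually salient ones.
        return sentence, weights

# Usage with random token ids and a hypothetical vocabulary size.
enc = SelfAttentiveEncoder(vocab_size=10000)
ids = torch.randint(0, 10000, (2, 12))
vec, attn = enc(ids)
print(vec.shape, attn.shape)  # torch.Size([2, 1024]) torch.Size([2, 12])
```

In a grounded training setup, the resulting sentence vector could be paired with image features (e.g. via a ranking or regression loss), while the attention weights make it possible to inspect which words carry the visual association.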