Automatic evaluations for natural language generation (NLG) conventionally rely on token-level or embedding-level comparisons with text references. This differs from human language processing, in which visual imagination often improves comprehension. In this work, we propose ImaginE, an imagination-based automatic evaluation metric for natural language generation. With the help of CLIP and DALL-E, two cross-modal models pre-trained on large-scale image-text pairs, we automatically generate an image as the embodied imagination for a text snippet and compute the imagination similarity using contextual embeddings. Experiments spanning several text generation tasks demonstrate that adding imagination with ImaginE shows great potential for introducing multi-modal information into NLG evaluation, and improves existing automatic metrics' correlations with human similarity judgments in many circumstances.
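To make the evaluation pipeline concrete, the sketch below shows one way the imagination-similarity step could be computed: each text snippet (candidate and reference) is first rendered into an image by a DALL-E-style text-to-image model (omitted here), and the two images are then compared via cosine similarity of their CLIP embeddings. This is a minimal illustration assuming the HuggingFace `transformers` CLIP checkpoint `openai/clip-vit-base-patch32`; the exact generation procedure and score aggregation used in ImaginE may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint for illustration; ImaginE may use a different CLIP variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()


def clip_image_embedding(image: Image.Image) -> torch.Tensor:
    """Encode an image with CLIP and L2-normalize the embedding."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)


def imagination_similarity(candidate_img: Image.Image, reference_img: Image.Image) -> float:
    """Cosine similarity between the CLIP embeddings of two generated 'imaginations'.

    The images are assumed to have been produced by a text-to-image model
    (e.g., DALL-E) from the candidate and reference text, respectively.
    """
    e_cand = clip_image_embedding(candidate_img)
    e_ref = clip_image_embedding(reference_img)
    return (e_cand * e_ref).sum(dim=-1).item()
```

In practice this image-level score would be combined with a standard text-level metric, so that the final evaluation reflects both the textual and the imagined visual similarity between candidate and reference.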