Automatic evaluation of natural language generation (NLG) conventionally relies on token-level or embedding-level comparisons with reference texts. This differs from human language processing, in which visual imagination often improves comprehension. In this work, we propose ImaginE, an imagination-based automatic evaluation metric for natural language generation. With the help of StableDiffusion, a state-of-the-art text-to-image generator, we automatically generate an image as the embodied imagination for each text snippet and compute the imagination similarity using contextual embeddings. Experiments spanning several text generation tasks show that adding machine-generated images through ImaginE is a promising way to introduce multi-modal information into NLG evaluation, and that it improves the correlation of existing automatic metrics with human similarity judgments in both reference-based and reference-free evaluation scenarios.
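The pipeline described in the abstract (render each text as an image, embed the images, compare the embeddings) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `imagine` and `embed` are hypothetical callables standing in for a text-to-image generator and an image encoder, and cosine similarity is an assumed choice, since the abstract does not name the similarity function.

```python
import math
from typing import Any, Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero embeddings
    return dot / (norm_a * norm_b)


def imagination_similarity(
    reference: str,
    candidate: str,
    imagine: Callable[[str], Any],   # hypothetical: text -> generated image
    embed: Callable[[Any], Vector],  # hypothetical: image -> embedding
) -> float:
    """Compare two texts through the images generated for them.

    Assumption: the score is the cosine similarity of the two image
    embeddings; the source abstract does not specify this detail.
    """
    ref_image = imagine(reference)
    cand_image = imagine(candidate)
    return cosine_similarity(embed(ref_image), embed(cand_image))
```

Passing the generator and encoder in as arguments keeps the sketch independent of any particular model; in practice they would wrap whichever text-to-image and image-encoding models are used.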