Aiming to improve Automatic Speech Recognition (ASR) outputs with a post-processing step, ASR error correction (EC) techniques have been widely developed because of their efficiency in exploiting parallel text data. Previous work focuses mainly on text and/or speech data, which limits the achievable gains when other modalities, such as visual information, are also critical for EC. The challenges are twofold: first, previous work has largely overlooked visual information, so its use in EC remains underexplored; second, the community lacks a high-quality benchmark in which visual information matters for EC models. This paper therefore provides 1) simple yet effective methods, namely gated fusion and image captions as prompts, to incorporate visual information into EC; and 2) a large-scale benchmark dataset, Visual-ASR-EC, in which each training item consists of visual, speech, and text information, and the test data are carefully selected by human annotators to ensure that even humans make mistakes when visual information is missing. Experimental results show that using captions as prompts effectively exploits the visual information and surpasses state-of-the-art methods by up to 1.2% in Word Error Rate (WER), which also indicates that visual information is critical in our proposed Visual-ASR-EC dataset.
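To make the gated-fusion idea concrete, below is a minimal PyTorch sketch in which a learned sigmoid gate blends projected image features into the text hidden states before decoding. All names, dimensions, and the specific gating formulation are illustrative assumptions for exposition, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative gated fusion: a learned gate decides, per position and
    per dimension, how much visual evidence to mix into the text states."""
    def __init__(self, d_model: int, d_visual: int):
        super().__init__()
        self.visual_proj = nn.Linear(d_visual, d_model)  # map image features to model dim
        self.gate = nn.Linear(2 * d_model, d_model)      # gate computed from both modalities

    def forward(self, text_hidden: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, seq_len, d_model); visual_feat: (batch, d_visual)
        v = self.visual_proj(visual_feat).unsqueeze(1).expand_as(text_hidden)
        g = torch.sigmoid(self.gate(torch.cat([text_hidden, v], dim=-1)))
        return g * text_hidden + (1 - g) * v  # convex combination controlled by the gate

# Hypothetical usage: fused states would then feed the EC decoder.
fusion = GatedFusion(d_model=768, d_visual=512)
fused = fusion(torch.randn(2, 10, 768), torch.randn(2, 512))  # -> (2, 10, 768)
```

The captions-as-prompts alternative needs no architectural change: under the same assumptions, a generated caption is simply prepended as text to the ASR hypothesis (e.g., "caption: ... hypothesis: ...") before it is passed to a sequence-to-sequence corrector.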