Benchmarks for language-guided embodied agents typically assume text-based instructions, but deployed agents will encounter spoken instructions. While Automatic Speech Recognition (ASR) models can bridge the input gap, erroneous ASR transcripts can hurt the agents' ability to complete tasks. In this work, we propose training a multimodal ASR model to reduce errors in transcribing spoken instructions by considering the accompanying visual context. We train our model on a dataset of spoken instructions, synthesized from the ALFRED task completion dataset, where we simulate acoustic noise by systematically masking spoken words. We find that utilizing visual observations facilitates masked word recovery, with multimodal ASR models recovering up to 30% more masked words than unimodal baselines. We also find that a text-trained embodied agent successfully completes tasks more often by following transcribed instructions from multimodal ASR models.
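Below is a minimal, hypothetical sketch of the masked-word noise simulation described above, shown at the transcript-token level purely for illustration; the paper masks words in synthesized speech, and the names `mask_words` and `MASK_TOKEN`, the 30% mask rate, and the sample instruction are assumptions, not the authors' implementation.

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical placeholder standing in for a masked audio span


def mask_words(words, mask_rate=0.3, seed=0):
    """Simulate acoustic noise by masking a fraction of spoken words.

    `words` is a list of transcript tokens; each masked position stands in
    for a span of the spoken instruction that the ASR model cannot hear.
    Returns the noisy token sequence and the masked indices.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(round(mask_rate * len(words))))
    masked_idx = set(rng.sample(range(len(words)), n_mask))
    noisy = [MASK_TOKEN if i in masked_idx else w for i, w in enumerate(words)]
    return noisy, sorted(masked_idx)


# Example: an ALFRED-style step instruction with ~30% of words masked
instruction = "walk to the coffee maker on the right".split()
noisy, masked = mask_words(instruction, mask_rate=0.3)
print(" ".join(noisy), "| masked positions:", masked)
```

In this setup, a multimodal ASR model would attempt to recover the masked words by conditioning on the accompanying visual observation (e.g., a visible coffee maker), whereas a unimodal baseline can rely only on the surrounding text or audio context.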