Automatic speech recognition (ASR) systems are becoming omnipresent, ranging from personal assistants and chatbots to home and industrial automation systems. Modern robots are also equipped with ASR capabilities for interacting with humans, as speech is the most natural interaction modality. However, ASR in robots faces additional challenges compared to a personal assistant. Being an embodied agent, a robot must recognize the physical entities around it and therefore reliably recognize speech containing descriptions of such entities. Current ASR systems are often unable to do so due to limitations in ASR training, such as generic datasets and open-vocabulary modeling. Moreover, adverse conditions during inference, such as noisy, accented, and far-field speech, make the transcription inaccurate. In this work, we present a method to incorporate a robot's visual information into an ASR system and improve the recognition of a spoken utterance containing a visible entity. Specifically, we propose a new decoder biasing technique that incorporates the visual context while ensuring the ASR output does not degrade for incorrect context. We achieve a 59% relative reduction in word error rate (WER) compared to an unmodified ASR system.
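To make the biasing idea concrete, the following is a minimal sketch of one way such decoder biasing could be realized, assuming a shallow-fusion-style rescoring of token scores during beam search; the entity list and the constants (`VISIBLE_ENTITIES`, `BIAS`, `FLOOR`) are hypothetical illustrations, not details taken from the paper.

```python
import numpy as np

# Hypothetical illustration: VISIBLE_ENTITIES would come from the robot's
# vision pipeline; BIAS and FLOOR are made-up tuning constants.
VISIBLE_ENTITIES = [["red", "mug"], ["screw", "driver"]]
BIAS = 2.0    # log-space bonus added to context tokens
FLOOR = -8.0  # never boost tokens the base model finds implausible

def biased_step(log_probs: np.ndarray, hypothesis: list[str],
                vocab: list[str]) -> np.ndarray:
    """Rescore one beam-search step's token log-probs with visual context."""
    scores = log_probs.copy()
    for entity in VISIBLE_ENTITIES:
        # Longest suffix of the hypothesis that matches a prefix of the
        # entity's name (0 if the entity has not been started yet).
        k = next(i for i in range(min(len(entity), len(hypothesis)), -1, -1)
                 if hypothesis[len(hypothesis) - i:] == entity[:i])
        if k < len(entity) and entity[k] in vocab:
            tid = vocab.index(entity[k])
            # Boost only acoustically plausible tokens, so an incorrect
            # visual context cannot hijack the beam (the "no degradation"
            # property referred to above).
            if scores[tid] > FLOOR:
                scores[tid] += BIAS
    return scores
```

In a full decoder, a function like `biased_step` would wrap the raw score lookup inside each beam expansion; the plausibility floor is one simple way to keep a wrong visual context from overriding what the acoustic evidence supports.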