This paper describes noisy speech recognition for an augmented reality headset that supports verbal communication in real multiparty conversational environments. A major approach that has been actively studied in simulated environments is to sequentially perform speech enhancement and automatic speech recognition (ASR) with deep neural networks (DNNs) trained in a supervised manner. In our task, however, such a pretrained system fails to work due to the mismatch between the training and test conditions and to the head movements of the user. To enhance only the utterances of a target speaker, we use beamforming based on a DNN speech mask estimator that can adaptively extract the speech components arriving from a particular head-relative direction. We propose a semi-supervised adaptation method that jointly updates the mask estimator and the ASR model at run time, using clean speech signals with ground-truth transcriptions and noisy speech signals with highly confident estimated transcriptions. Comparative experiments with a state-of-the-art distant speech recognition system show that the proposed method significantly improves ASR performance.
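To make the mask-driven beamforming concrete, here is a minimal NumPy sketch of mask-based MVDR beamforming: a time-frequency speech mask (random here, standing in for the DNN estimator's output) weights the spatial covariance estimates, and an MVDR filter is derived per frequency bin. The shapes, the eigenvector-based steering vector, and the regularization constant are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def masked_covariance(stft, mask):
    """Mask-weighted spatial covariance. stft: (M, F, T), mask: (F, T)."""
    phi = np.einsum('ft,mft,nft->fmn', mask, stft, stft.conj())
    return phi / np.maximum(mask.sum(axis=1), 1e-8)[:, None, None]

def mvdr_weights(phi_speech, phi_noise):
    """MVDR weights per frequency bin from (F, M, M) covariances."""
    F, M, _ = phi_speech.shape
    w = np.zeros((F, M), dtype=complex)
    for f in range(F):
        # Steering vector: principal eigenvector of the speech covariance.
        _, eigvecs = np.linalg.eigh(phi_speech[f])
        d = eigvecs[:, -1]
        num = np.linalg.solve(phi_noise[f] + 1e-6 * np.eye(M), d)
        w[f] = num / (d.conj() @ num)   # Phi_n^{-1} d / (d^H Phi_n^{-1} d)
    return w

# Toy usage: random STFTs and a random mask in place of real data/DNN output.
rng = np.random.default_rng(0)
M, F, T = 4, 257, 100                     # mics, frequency bins, frames
X = rng.standard_normal((M, F, T)) + 1j * rng.standard_normal((M, F, T))
speech_mask = rng.uniform(size=(F, T))    # would come from the mask estimator
phi_s = masked_covariance(X, speech_mask)
phi_n = masked_covariance(X, 1.0 - speech_mask)
w = mvdr_weights(phi_s, phi_n)
enhanced = np.einsum('fm,mft->ft', w.conj(), X)   # single-channel enhanced STFT
```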
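The proposed semi-supervised adaptation can be summarized as a joint update step over both models. Below is a self-contained toy sketch in PyTorch, assuming per-frame token posteriors and a mean-posterior confidence score; the toy models, the NLL losses, and the 0.9 threshold are illustrative assumptions rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn

FEAT, VOCAB = 40, 30

class ToyMaskEstimator(nn.Module):
    """Stands in for the DNN speech-mask estimator."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(FEAT, FEAT)
    def forward(self, noisy):                    # (T, FEAT) -> enhanced (T, FEAT)
        return torch.sigmoid(self.net(noisy)) * noisy  # apply estimated mask

class ToyASR(nn.Module):
    """Stands in for the ASR model (per-frame token log-posteriors)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(FEAT, VOCAB)
    def forward(self, feats):                    # (T, FEAT) -> (T, VOCAB)
        return self.net(feats).log_softmax(dim=-1)

def adapt_step(mask_est, asr, opt, clean, clean_labels, noisy, thr=0.9):
    """One joint update on clean supervised data plus, if the decoded
    transcript is confident enough, pseudo-labeled noisy data."""
    nll = nn.NLLLoss()

    # 1) Pseudo-label the noisy utterance; keep it only if confident.
    with torch.no_grad():
        logp = asr(mask_est(noisy))
        conf, pseudo = logp.exp().max(dim=-1)    # per-frame confidence + labels
        confident = conf.mean().item() >= thr

    # 2) Joint update: supervised loss always, pseudo-label loss if confident.
    opt.zero_grad()
    loss = nll(asr(clean), clean_labels)         # clean speech, ground truth
    if confident:
        # Gradients flow through both the mask estimator and the ASR model.
        loss = loss + nll(asr(mask_est(noisy)), pseudo)
    loss.backward()
    opt.step()
    return loss.item(), confident

# Toy usage with random tensors in place of real features and transcripts.
torch.manual_seed(0)
T = 50
m, a = ToyMaskEstimator(), ToyASR()
opt = torch.optim.Adam(list(m.parameters()) + list(a.parameters()), lr=1e-4)
loss, used = adapt_step(m, a, opt, torch.randn(T, FEAT),
                        torch.randint(VOCAB, (T,)), torch.randn(T, FEAT))
```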