This paper describes the practical response- and performance-aware development of online speech enhancement for an augmented reality (AR) headset that helps a user understand conversations held in real noisy, echoic environments (e.g., a cocktail party). One may use a state-of-the-art blind source separation method called fast multichannel nonnegative matrix factorization (FastMNMF), which works well in various environments thanks to its unsupervised nature. Its heavy computational cost, however, prevents its application to real-time processing. In contrast, a supervised beamforming method that uses a deep neural network (DNN) to estimate spatial information of speech and noise readily fits real-time processing, but suffers from drastic performance degradation in mismatched conditions. Given these complementary characteristics, we propose a dual-process robust online speech enhancement method based on DNN-based beamforming with FastMNMF-guided adaptation. FastMNMF (back end) is performed in a mini-batch style, and the resulting pairs of noisy and enhanced speech are used together with the original parallel training data to update the direction-aware DNN (front end) with backpropagation at a computationally allowable interval. This method is used with a blind dereverberation method called weighted prediction error (WPE) to transcribe, in a streaming manner, the noisy reverberant speech of a speaker, who can be detected from video or selected by the user's hand gesture or eye gaze, and to display the transcriptions spatially with an AR technique. Our experiment showed that run-time adaptation improved the word error rate by more than 10 points using only twelve minutes of observation.
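The dual-process idea can be illustrated with a toy scheduler. This is a minimal sketch and not the authors' implementation: the class `DualProcessEnhancer`, its parameters (`batch_frames`, `base_pairs`, `max_pairs`), and the three callables are hypothetical stand-ins for the DNN-based beamformer (front end), FastMNMF (back end), and the backpropagation-based update. For simplicity the back end runs synchronously here, whereas a real-time system would presumably run it in a separate thread or process so that the front end never blocks.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

Frame = List[float]                                  # placeholder for one multichannel frame
Pair = Tuple[Frame, Frame]                           # (noisy frame, enhanced frame)
FrontEnd = Callable[[Frame], Frame]                  # stand-in for the DNN-based beamformer
BackEnd = Callable[[List[Frame]], List[Frame]]       # stand-in for FastMNMF
Adapt = Callable[[FrontEnd, List[Pair]], FrontEnd]   # stand-in for fine-tuning


class DualProcessEnhancer:
    """Toy scheduler: fast front end per frame, slow back end per mini-batch."""

    def __init__(self, front_end: FrontEnd, back_end: BackEnd, adapt: Adapt,
                 batch_frames: int, base_pairs: Sequence[Pair] = (),
                 max_pairs: int = 1000) -> None:
        self.front_end = front_end
        self.back_end = back_end
        self.adapt = adapt
        self.batch_frames = batch_frames      # frames collected before each back-end run
        self.base_pairs = list(base_pairs)    # assumed: original parallel training data
        self.buffer: List[Frame] = []         # noisy frames awaiting the back end
        self.pairs: Deque[Pair] = deque(maxlen=max_pairs)  # bounded run-time pairs
        self.updates = 0                      # number of adaptation steps so far

    def process(self, frame: Frame) -> Frame:
        """Enhance one frame with the current front end, then maybe adapt."""
        enhanced = self.front_end(frame)      # low-latency path
        self.buffer.append(frame)
        if len(self.buffer) >= self.batch_frames:
            self._adapt_on_buffer()
        return enhanced

    def _adapt_on_buffer(self) -> None:
        """Run the back end on the buffered mini-batch and update the front end."""
        noisy, self.buffer = self.buffer, []
        pseudo_clean = self.back_end(noisy)   # unsupervised separation output
        self.pairs.extend(zip(noisy, pseudo_clean))
        # Mix run-time pairs with the original training pairs for the update.
        self.front_end = self.adapt(self.front_end,
                                    self.base_pairs + list(self.pairs))
        self.updates += 1
```

The value of `batch_frames` sets the trade-off the abstract calls a computationally allowable interval: a larger value gives the back end more context per run but delays each update of the front end.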