Target speech extraction, which extracts the speech of a target speaker from a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated, such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity information as an auxiliary clue for single-channel neural network-based speech extraction. We propose a speaker activity-driven speech extraction neural network (ADEnet) and show that it can achieve performance competitive with enrollment-based approaches, without the need for pre-recordings. We further demonstrate the potential of the proposed approach for processing meeting-like recordings, where the speaker activity is obtained from a diarization system. We show that this simple yet practical approach can successfully extract speakers after diarization, which results in improved ASR performance, especially in highly overlapping conditions, with a relative word error rate reduction of up to 25%.
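To make the notion of a speaker activity clue concrete, the sketch below turns diarization-style time segments into a frame-level binary activity signal for one speaker. It is only an illustration under stated assumptions: the function name `segments_to_activity`, the 10 ms frame shift, and the example segments are hypothetical and are not taken from the paper, which may represent and consume the activity information differently.

```python
def segments_to_activity(segments, num_frames, frame_shift=0.01):
    """Convert (start_sec, end_sec) segments into a per-frame 0/1 activity list.

    Assumption: frames are spaced `frame_shift` seconds apart, and a frame is
    marked active if its index falls inside a segment of the target speaker.
    """
    activity = [0] * num_frames
    for start, end in segments:
        # Map segment boundaries in seconds to frame indices, clipped to range.
        first = max(0, int(round(start / frame_shift)))
        last = min(num_frames, int(round(end / frame_shift)))
        for t in range(first, last):
            activity[t] = 1
    return activity


# Hypothetical diarization output for the target speaker (seconds).
target_segments = [(0.0, 0.03), (0.06, 0.08)]
activity = segments_to_activity(target_segments, num_frames=10)
print(activity)
```

A signal of this kind could stand in for an enrollment utterance as the auxiliary input to an extraction network, since it indicates when the target speaker is talking without requiring a pre-recording.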