In recent years, researchers have become increasingly interested in speaker extraction (SE), the task of extracting the speech of a target speaker from a mixture of interfering speakers with the help of auxiliary information about the target speaker. Several forms of auxiliary information have been employed in single-channel SE, such as a speech snippet enrolled from the target speaker or visual information corresponding to the spoken utterance. Many SE studies have reported performance improvements over speaker separation (SS) methods with oracle selection, attributing the gains to the use of auxiliary information. However, such works have not considered state-of-the-art SS methods that have shown impressive separation performance. In this paper, we revisit and examine the role of the auxiliary information in SE. Specifically, we compare the performance of two SE systems (audio-based and video-based) with SS using a common framework that utilizes the state-of-the-art dual-path recurrent neural network as the main learning machine. In addition, we study how much the considered SE systems rely on the auxiliary information by analyzing the systems' output for random auxiliary signals. Experimental evaluation on various datasets suggests that the main purpose of the auxiliary information in the considered SE systems is only to specify the target speaker in the mixture, and that it does not provide a consistent gain in extraction performance compared to the uninformed SS system.
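The random-auxiliary-signal probe mentioned above can be illustrated with a minimal sketch. The snippet below does not reproduce the paper's DPRNN-based systems; `ToySpeakerExtractor`, the embedding dimension, and the dummy waveforms are hypothetical placeholders used only to show how one might compare the output obtained with an enrolled speaker embedding against the output obtained with a random embedding, scoring both with SI-SDR.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a speaker-conditioned extraction network; the paper's
# actual systems condition a DPRNN separator on the auxiliary (audio or visual) cue.
class ToySpeakerExtractor(nn.Module):
    def __init__(self, emb_dim=128, hidden=64):
        super().__init__()
        self.enc = nn.Conv1d(1, hidden, kernel_size=16, stride=8)
        self.cond = nn.Linear(emb_dim, hidden)
        self.mask = nn.Conv1d(hidden, hidden, kernel_size=1)
        self.dec = nn.ConvTranspose1d(hidden, 1, kernel_size=16, stride=8)

    def forward(self, mixture, spk_emb):
        # mixture: (batch, 1, samples), spk_emb: (batch, emb_dim)
        feats = torch.relu(self.enc(mixture))
        feats = feats * torch.sigmoid(self.cond(spk_emb)).unsqueeze(-1)
        return self.dec(torch.sigmoid(self.mask(feats)) * feats)

def si_sdr(est, ref, eps=1e-8):
    # Scale-invariant SDR between an estimated and a reference waveform.
    est, ref = est.flatten(1), ref.flatten(1)
    est = est - est.mean(dim=1, keepdim=True)
    ref = ref - ref.mean(dim=1, keepdim=True)
    proj = (est * ref).sum(1, keepdim=True) / (ref.pow(2).sum(1, keepdim=True) + eps) * ref
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(1) / (noise.pow(2).sum(1) + eps))

model = ToySpeakerExtractor()
mixture = torch.randn(1, 1, 8000)      # dummy 0.5 s mixture at 16 kHz
target = torch.randn(1, 1, 8000)       # dummy reference of the target speaker
enrolled_emb = torch.randn(1, 128)     # embedding from an enrollment utterance
random_emb = torch.randn(1, 128)       # random auxiliary signal for the probe

with torch.no_grad():
    out_enrolled = model(mixture, enrolled_emb)
    out_random = model(mixture, random_emb)

# If the two scores are close, the network is not exploiting the auxiliary input
# beyond selecting a source; a large gap indicates genuine conditioning.
print("SI-SDR with enrolled embedding:", si_sdr(out_enrolled, target).item())
print("SI-SDR with random embedding:  ", si_sdr(out_random, target).item())
```

In a trained system, the same comparison would be run on real mixtures and enrollment utterances; the probe only asks whether replacing the auxiliary input with noise changes which source is extracted and how well it is extracted.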