To extract the voice of a target speaker from a mixture containing other sounds, such as white noise, ambient noise, or the voices of interfering speakers, we extend the Transformer network to attend to the information most relevant to the target speaker, using the characteristics of his or her voice as contextual information. The idea has a natural interpretation in terms of selective attention theory. Specifically, we propose two models that incorporate the voice characteristics into the Transformer, based on different insights into where the feature selection should take place. Both models perform on par with or better than published state-of-the-art models on the speaker extraction task, including when separating the speech of novel speakers not seen during training.
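As an illustration of what it can mean for attention to be steered by a speaker's voice characteristics, the sketch below adds a projected speaker embedding to every query of a single scaled dot-product attention layer. This is only one plausible form of conditioning, chosen for brevity; it is not a reconstruction of either of the two proposed models, whose details the abstract does not give. All names (`speaker_conditioned_attention`, `w_s`, the toy dimensions) are hypothetical, and the code uses plain Python so that it runs without dependencies.

```python
import math
import random


def matvec(vec, mat):
    """Multiply a row vector of length n by an n x m matrix (list of rows)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def speaker_conditioned_attention(frames, speaker, w_q, w_k, w_v, w_s):
    """Self-attention over mixture frames with speaker-biased queries.

    Assumption (not from the paper): the speaker embedding is projected by
    w_s and added to every query, so the attention weights depend on the
    target speaker as well as on the mixture.
    """
    bias = matvec(speaker, w_s)  # same speaker-derived offset for each frame
    queries = [[q + b for q, b in zip(matvec(f, w_q), bias)] for f in frames]
    keys = [matvec(f, w_k) for f in frames]
    values = [matvec(f, w_v) for f in frames]
    scale = math.sqrt(len(keys[0]))

    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)  # one distribution over frames per query
        all_weights.append(weights)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs, all_weights


def random_matrix(rows, cols, rng):
    """Random Gaussian matrix standing in for learned parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes: 5 mixture frames, 4 features each, 3-dimensional speaker embedding.
rng = random.Random(0)
num_frames, d_model, d_speaker = 5, 4, 3
frames = random_matrix(num_frames, d_model, rng)
speaker = [rng.gauss(0.0, 1.0) for _ in range(d_speaker)]
w_q = random_matrix(d_model, d_model, rng)
w_k = random_matrix(d_model, d_model, rng)
w_v = random_matrix(d_model, d_model, rng)
w_s = random_matrix(d_speaker, d_model, rng)

outputs, weights = speaker_conditioned_attention(frames, speaker, w_q, w_k, w_v, w_s)
```

Here `outputs` holds one attended feature vector per mixture frame and `weights` holds the corresponding attention distributions. Changing `speaker` while keeping `frames` fixed changes the weights, which is the sense in which the speaker embedding acts as contextual information in this sketch.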