Target speech extraction has attracted widespread attention in recent years. In this work, we investigate the dynamic interaction between different mixtures and the target speaker in order to exploit discriminative target-speaker clues. We propose an attention mechanism that introduces no additional parameters into the scaling adaptation layer and better adapts the network towards extracting the target speech. Furthermore, by introducing a mixture embedding matrix pooling method, the proposed attention-based scaling adaptation (ASA) exploits the target-speaker clues more efficiently. Experimental results on the spatialized reverberant WSJ0 2-mix dataset demonstrate that the proposed method effectively improves target speech extraction performance. We also find that, under the same network configuration, ASA in the single-channel condition achieves performance gains competitive with those obtained from two-channel mixtures with inter-microphone phase difference (IPD) features.
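To make the idea of a parameter-free attention inside a scaling adaptation layer more concrete, the following is a minimal sketch and not the authors' exact formulation. It assumes that the mixture embedding matrix is average-pooled over blocks of frames, that attention scores are scaled dot products between the pooled mixture embeddings and the target-speaker embedding, and that the resulting context vector modulates the speaker embedding to form the scaling vector. The function names (`pool_frames`, `attention_scaling`, `scale_activations`), the block size, and the way the scaling vector is combined are all hypothetical choices made for illustration.

```python
import math


def pool_frames(mixture_emb, block):
    """Average-pool a T x D embedding matrix over non-overlapping blocks of frames.

    Assumption: average pooling stands in for the paper's
    "mixture embedding matrix pooling", whose details are not given here.
    """
    pooled = []
    for start in range(0, len(mixture_emb), block):
        chunk = mixture_emb[start:start + block]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scaling(mixture_emb, speaker_emb, block=2):
    """Return (weights, scale) using only dot products and a softmax.

    No learnable parameters are involved: the scores are scaled dot
    products between pooled mixture embeddings and the speaker embedding.
    """
    pooled = pool_frames(mixture_emb, block)
    dim = len(speaker_emb)
    scores = [
        sum(m * s for m, s in zip(row, speaker_emb)) / math.sqrt(dim)
        for row in pooled
    ]
    weights = softmax(scores)
    # Attention-weighted summary of the pooled mixture embeddings.
    context = [
        sum(w * row[d] for w, row in zip(weights, pooled)) for d in range(dim)
    ]
    # Assumption: the scaling vector is the speaker embedding modulated
    # element-wise by the mixture context.
    scale = [s * c for s, c in zip(speaker_emb, context)]
    return weights, scale


def scale_activations(activations, scale):
    """Scaling adaptation: multiply every frame element-wise by the scale vector."""
    return [[a * g for a, g in zip(frame, scale)] for frame in activations]


# Toy usage: 4 mixture frames with 2-dimensional embeddings.
mixture = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
speaker = [1.0, 0.0]
weights, scale = attention_scaling(mixture, speaker, block=2)
adapted = scale_activations([[2.0, 3.0]], scale)
```

In this toy input the first pooled block is aligned with the speaker embedding, so it receives the larger attention weight, and the scaling vector depends on the mixture as well as on the speaker. Pooling also shortens the sequence the attention runs over, which is one plausible reading of the efficiency claim.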