Because speech separation already performs well on speech in which two speakers completely overlap, research attention has shifted to more realistic scenarios. However, domain mismatch between training and test conditions, caused by factors such as speaker, content, channel, and environment, remains a severe problem for speech separation. Speaker and environment mismatches have been studied in the existing literature, but few studies address speech content and channel mismatches, and in those studies the effects of language and channel are mostly entangled. In this study, we create several datasets for a range of experiments. The results show that the impact of different languages is small enough to be ignored compared with the impact of different channels. In our experiments, training on data recorded by Android phones leads to the best generalizability. Moreover, we provide a new solution for channel mismatch by evaluating projection, where the channel similarity can be measured and used to effectively select additional training data to improve performance on in-the-wild test data.