Single-channel target speaker separation (TSS) aims to extract a speaker's voice from a mixture of multiple talkers, given an enrollment utterance of that speaker. A typical deep learning TSS framework consists of an upstream model that obtains enrollment speaker embeddings and a downstream model that performs the separation conditioned on those embeddings. In this paper, we look into several important but overlooked aspects of the enrollment embeddings, including the suitability of the widely used speaker identification embeddings, the introduction of the log-mel filterbank and self-supervised embeddings, and the embeddings' cross-dataset generalization capability. Our results show that the speaker identification embeddings can lose relevant information due to a sub-optimal metric, training objective, or common pre-processing. In contrast, both the filterbank and the self-supervised embeddings preserve the integrity of the speaker information, but the former consistently outperforms the latter in a cross-dataset evaluation. The competitive separation and generalization performance of the previously overlooked filterbank embedding is consistent across our study, which calls for future research on better upstream features.
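To make the filterbank option concrete, the following is a minimal sketch of how a log-mel filterbank representation of an enrollment utterance could be computed with NumPy alone. It is an illustration under stated assumptions, not the configuration used in this study: the function names (`log_mel`, `enrollment_embedding`), the 16 kHz sampling rate, the 25 ms window, the 10 ms hop, the 40 mel bands, and the mean pooling over time are all hypothetical choices made here so that the example is self-contained.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style mel scale formula.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = (fft_freqs - left) / (center - left)      # rising edge
        down = (right - fft_freqs) / (right - center)  # falling edge
        fb[i] = np.maximum(0.0, np.minimum(up, down))
    return fb

def log_mel(wave, sr=16000, n_fft=400, hop=160, n_mels=40):
    """Frame-level log-mel features, shape (n_frames, n_mels).

    All default values are assumptions made for this sketch.
    """
    n_frames = 1 + (len(wave) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(n_fft)             # windowed frames
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-10)                         # floor avoids log(0)

def enrollment_embedding(wave, **kwargs):
    # Assumption: mean pooling over time yields one vector per utterance.
    # A downstream separator could equally consume the frame-level features.
    return log_mel(wave, **kwargs).mean(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    enrollment = rng.standard_normal(16000)  # 1 s of noise as a stand-in
    print(log_mel(enrollment).shape)         # (98, 40)
    print(enrollment_embedding(enrollment).shape)  # (40,)
```

Unlike a speaker identification embedding, this representation involves no trained upstream model, so nothing in it is shaped by an identification objective or metric; whether and how it is pooled or further encoded before conditioning the separator is a design choice left open in this sketch.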