Self-supervised learning (SSL) in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks. Since most downstream tasks of speech SSL focus on the content information in speech, the most desirable speech representations should be able to disentangle unwanted variations, such as speaker variations, from the content. However, disentangling speakers is very challenging, because removing the speaker information could easily result in a loss of content as well, and the damage of the latter usually far outweighs the benefit of the former. In this paper, we propose a new SSL method that can achieve speaker disentanglement without severe loss of content. Our approach is adapted from the HuBERT framework, and incorporates disentangling mechanisms to regularize both the teacher labels and the learned representations. We evaluate the benefit of speaker disentanglement on a set of content-related downstream tasks, and observe a consistent and notable performance advantage of our speaker-disentangled representations.
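To make the idea of regularizing learned representations against speaker information concrete, the sketch below shows one generic way such a constraint can be imposed: an adversarial speaker classifier trained through a gradient-reversal layer on top of frame representations, alongside the main masked-prediction (content) loss. This is an illustrative assumption, not the paper's specific mechanism, and all module names, dimensions, and hyperparameters here are hypothetical.

```python
# Illustrative sketch (assumed, not the paper's method): adversarial speaker
# removal via gradient reversal on frame-level representations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class SpeakerAdversary(nn.Module):
    """Predicts speaker identity from pooled frame representations. Because the
    gradient is reversed before reaching the encoder, minimizing this loss pushes
    the encoder to discard speaker cues, while the main content loss (e.g., a
    HuBERT-style masked-prediction objective) preserves content."""

    def __init__(self, dim=768, num_speakers=1000, lambd=0.1):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, num_speakers)
        )

    def forward(self, frame_reps, speaker_ids):
        # frame_reps: (batch, time, dim); mean-pool over time before classifying.
        pooled = GradReverse.apply(frame_reps.mean(dim=1), self.lambd)
        logits = self.classifier(pooled)
        return F.cross_entropy(logits, speaker_ids)


# Hypothetical usage inside a training step:
#   total_loss = masked_prediction_loss + speaker_adv(frame_reps, speaker_ids)
```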