Self-supervised representation learning approaches have grown in popularity due to their ability to train models on large amounts of unlabeled data, and they have demonstrated success in diverse fields such as natural language processing, computer vision, and speech. Previous self-supervised work in the speech domain has disentangled multiple attributes of speech, such as linguistic content, speaker identity, and rhythm. In this work, we introduce a self-supervised approach that disentangles room acoustics from speech and uses the resulting acoustic representation for the downstream task of device arbitration. Our results demonstrate that the proposed approach significantly improves performance over a baseline when labeled training data is scarce, indicating that our pretraining scheme learns to encode room acoustic information while remaining invariant to other attributes of the speech signal.
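The abstract does not specify the pretraining objective, but a common way to learn a representation that encodes one attribute (here, room acoustics) while remaining invariant to others is a contrastive loss: embeddings of utterances recorded in the same room are pulled together, and embeddings from other rooms are pushed apart. The sketch below is a hypothetical InfoNCE-style objective in NumPy, not the authors' actual method; the function name and the anchor/positive/negative framing are our illustrative assumptions.

```python
import numpy as np

def room_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Hypothetical InfoNCE-style loss for room-acoustics pretraining.

    anchor, positive: 1-D embeddings of two utterances from the SAME room.
    negatives: 2-D array of embeddings from OTHER rooms, one per row.
    Lower loss means the anchor is more similar to its positive than to
    the negatives, so minimizing it encodes room identity.
    """
    def l2norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    a, p = l2norm(anchor), l2norm(positive)
    negs = l2norm(negatives)

    # Temperature-scaled cosine similarities.
    pos_sim = np.dot(a, p) / temperature
    neg_sims = negs @ a / temperature

    # Cross-entropy with the positive pair as the target class.
    logits = np.concatenate([[pos_sim], neg_sims])
    return -pos_sim + np.log(np.sum(np.exp(logits)))
```

Under this kind of objective, invariance to speaker and content would come from choosing positives that share a room but differ in speaker and utterance, so the only attribute that consistently distinguishes positives from negatives is the acoustic environment.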