Inspired by recent progress in self-supervised learning for computer vision, which generates supervision through data augmentations, we explore a new general-purpose audio representation learning approach. We propose learning general-purpose audio representations from a single audio segment, without expecting relationships between different time segments of audio samples. To implement this principle, we introduce Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced "viola"), an audio self-supervised learning method based on BYOL for learning general-purpose audio representations. Unlike most previous audio self-supervised learning methods, which rely on the agreement of neighboring audio segments or the disagreement of distant ones, BYOL-A creates contrasts within an augmented pair of segments derived from a single audio segment. With a combination of normalization and augmentation techniques, BYOL-A achieves state-of-the-art results on various downstream tasks. Extensive ablation studies also clarify the contribution of each component and their combinations.
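The core idea of the abstract, forming a training pair from two differently augmented views of one and the same audio segment rather than from two different segments, can be illustrated with a minimal sketch. This is not the authors' implementation: the `normalize`, `augment`, and `two_views` helpers below are hypothetical, the augmentation is a simple stand-in (additive random "background" mixing in a linear domain plus a random gain), and the input is a fake log-mel spectrogram.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    # Standardize to zero mean, unit variance (stand-in for the
    # normalization step mentioned in the abstract).
    return (x - x.mean()) / (x.std() + 1e-8)

def augment(x, rng):
    # Hypothetical augmentation: mix the segment with a random
    # "background" spectrogram in the linearized domain, then apply a
    # random gain. The point is only that each call yields a different
    # random view of the SAME input.
    background = rng.normal(size=x.shape)
    lam = rng.uniform(0.0, 0.4)
    mixed = np.log((1 - lam) * np.exp(x) + lam * np.exp(background) + 1e-8)
    gain = rng.uniform(0.9, 1.1)
    return mixed * gain

def two_views(segment, rng):
    # Create a contrastive pair from a single audio segment: both views
    # originate from the same segment and differ only by augmentation,
    # so no relationship between different time segments is assumed.
    x = normalize(segment)
    return augment(x, rng), augment(x, rng)

spec = rng.normal(size=(64, 96))  # fake log-mel spectrogram (mels x frames)
view1, view2 = two_views(spec, rng)
```

In a BYOL-style setup, `view1` and `view2` would then be fed to the online and target networks, respectively, and the online network trained to predict the target's representation.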