Self-supervised learning approaches have recently achieved great success across a broad spectrum of machine learning problems. In speech processing, one of the most successful recent self-supervised models is wav2vec 2.0. In this paper, we explore the effectiveness of this model on three basic speech classification tasks: speaker change detection, overlapped speech detection, and voice activity detection. First, we concentrate on a single task, speaker change detection, where our proposed system surpasses previously reported results on four different corpora and achieves comparable performance even when trained on out-of-domain data from an artificially designed dataset. We then extend the approach to tackle all three tasks in a single multitask system, which achieves state-of-the-art performance on the AMI corpus. The implementation of the algorithms in this paper is publicly available at https://github.com/mkunes/w2v2_audioFrameClassification.
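The three tasks can be viewed as per-frame binary labels derived from the same speaker segments. The following is a minimal sketch of that framing, not the authors' implementation: the function name `frame_labels`, the 20 ms frame length (roughly the output stride of wav2vec 2.0 on 16 kHz audio), and the rule that any change in the set of active speakers counts as a speaker change are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical segment format: (start in seconds, end in seconds, speaker id).
Segment = Tuple[float, float, str]


def frame_labels(
    segments: List[Segment],
    total_s: float,
    frame_s: float = 0.02,  # assumed frame length, about one wav2vec 2.0 output step
) -> List[Tuple[int, int, int]]:
    """Return one (scd, osd, vad) label triple per frame.

    vad = 1 if at least one speaker is active in the frame
    osd = 1 if at least two speakers are active in the frame
    scd = 1 if the set of active speakers differs from the previous frame
          (a simplified definition assumed for this sketch)
    """
    n_frames = int(round(total_s / frame_s))
    active: List[Set[str]] = [set() for _ in range(n_frames)]

    # Mark every frame covered by each segment with that segment's speaker.
    for start, end, speaker in segments:
        first = max(0, int(round(start / frame_s)))
        last = min(n_frames, int(round(end / frame_s)))
        for i in range(first, last):
            active[i].add(speaker)

    labels: List[Tuple[int, int, int]] = []
    previous: Set[str] = set()
    for i, speakers in enumerate(active):
        vad = int(len(speakers) >= 1)
        osd = int(len(speakers) >= 2)
        scd = int(i > 0 and speakers != previous)
        labels.append((scd, osd, vad))
        previous = speakers
    return labels
```

A multitask system in this framing would predict all three values for every frame from a shared encoder output, while a single-task system would predict only one of them. Calling `frame_labels([(0.0, 0.06, "A"), (0.04, 0.10, "B")], total_s=0.12)` yields six label triples, with the overlap flag set only in the frame where both speakers are active.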