Voice activity detection (VAD) is an essential pre-processing component for speech-related tasks such as automatic speech recognition (ASR). Traditional supervised VAD systems obtain frame-level labels from an ASR pipeline, e.g., by using a hidden Markov model. These ASR models are commonly trained on clean, fully transcribed data, which restricts VAD training to clean or synthetically noised datasets. A major challenge for supervised VAD systems is therefore generalization to noisy, real-world data. This work proposes a data-driven teacher-student approach for VAD that uses vast, unconstrained audio data for training. Unlike previous approaches, it requires only weak labels, and only during teacher training, which enables the use of any real-world, potentially noisy dataset. Our approach first trains a teacher model on a source dataset (Audioset) using clip-level supervision. After training, the teacher provides frame-level guidance to a student model on an unlabeled target dataset. We investigate a multitude of student models trained on mid- to large-sized datasets (Audioset, Voxceleb, NIST SRE). The approach is then evaluated on clean, artificially noised, and real-world data. We observe significant performance gains in artificially noised and real-world scenarios. Lastly, we compare our approach against other unsupervised and supervised VAD methods, demonstrating that our method performs better.
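The two training stages described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the mean pooling, the binary cross-entropy objective, the 0.5 threshold, and all function names are assumptions chosen for illustration. The sketch only shows how a clip-level (weak) label can supervise a teacher that emits per-frame speech probabilities, and how those per-frame outputs can then serve as targets for a student on unlabeled audio.

```python
import math


def clip_level_probability(frame_probs):
    """Pool per-frame speech probabilities into one clip-level probability.

    Mean pooling is an assumption made for this sketch.
    """
    return sum(frame_probs) / len(frame_probs)


def binary_cross_entropy(target, pred, eps=1e-7):
    """Binary cross-entropy for one prediction, clipped for numerical safety."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def teacher_loss(frame_probs, clip_label):
    """Weak supervision: only the pooled clip-level output sees a label."""
    return binary_cross_entropy(clip_label, clip_level_probability(frame_probs))


def student_loss(student_frame_probs, teacher_frame_probs):
    """Frame-level guidance: teacher outputs act as per-frame soft targets."""
    losses = [
        binary_cross_entropy(t, s)
        for t, s in zip(teacher_frame_probs, student_frame_probs)
    ]
    return sum(losses) / len(losses)


def hard_labels(teacher_frame_probs, threshold=0.5):
    """Optional variant: binarize teacher outputs (threshold is hypothetical)."""
    return [1 if p >= threshold else 0 for p in teacher_frame_probs]


# Hypothetical teacher outputs for a four-frame clip from the target dataset.
teacher_out = [0.05, 0.10, 0.90, 0.95]
student_out = [0.10, 0.20, 0.80, 0.90]
print(teacher_loss(teacher_out, clip_label=1.0))
print(student_loss(student_out, teacher_out))
print(hard_labels(teacher_out))
```

In this sketch the student never sees a human-provided label: its targets come entirely from the teacher, which is what allows an unlabeled target dataset to be used in the second stage.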