Robust Voice Activity Detection (VAD) remains a challenging task, especially under noisy, diverse, and unseen acoustic conditions. Beyond algorithmic development, a key limitation in advancing VAD research is the lack of large-scale, systematically controlled, and publicly available datasets. To address this, we introduce LibriVAD, a scalable open-source dataset derived from LibriSpeech and augmented with diverse real-world and synthetic noise sources. LibriVAD enables systematic control over signal-to-noise ratio (SNR), silence-to-speech ratio (SSR), and noise diversity, and is released in three sizes (15 GB, 150 GB, and 1.5 TB) with two variants (LibriVAD-NonConcat and LibriVAD-Concat) to support different experimental setups. We benchmark multiple feature-model combinations, including raw waveform, Mel-frequency cepstral coefficients (MFCC), and gammatone filter bank cepstral coefficients, and introduce the Vision Transformer (ViT) architecture for VAD. Our experiments show that ViT with MFCC features consistently outperforms established VAD models such as the boosted deep neural network and the convolutional long short-term memory deep neural network across seen, unseen, and out-of-distribution (OOD) conditions, including evaluation on the real-world VOiCES dataset. We further analyze the impact of dataset size and SSR on model generalization, showing experimentally that scaling up dataset size and balancing SSR noticeably and consistently enhance VAD performance under OOD conditions. All datasets, trained models, and code are publicly released to foster reproducibility and accelerate progress in VAD research.
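The two controlled quantities named above, SNR and SSR, can be illustrated with a minimal sketch. This is not the released LibriVAD generation pipeline; the helper names and the toy signals are illustrative assumptions, and only the standard definitions (SNR as the speech-to-noise power ratio in dB, SSR as the ratio of silence frames to speech frames) are taken from the abstract:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix.

    Illustrative helper, not part of LibriVAD's published tooling.
    """
    noise = np.resize(noise, speech.shape)  # tile/trim noise to the speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain that brings noise power to p_speech / 10^(snr_db / 10)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def silence_to_speech_ratio(labels: np.ndarray) -> float:
    """SSR from binary frame-level labels (1 = speech, 0 = silence)."""
    n_speech = np.count_nonzero(labels)
    n_silence = labels.size - n_speech
    return n_silence / n_speech

# Toy example: white-noise stand-ins for a 1 s speech clip and a noise clip at 16 kHz.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(8000)
mixed = mix_at_snr(speech, noise, snr_db=5.0)

labels = np.array([1, 1, 0, 0, 1, 0])  # toy frame labels: 3 speech, 3 silence frames
print(silence_to_speech_ratio(labels))  # → 1.0
```

A balanced SSR, which the experiments find helpful for OOD generalization, corresponds to a value near 1.0 in this definition; sweeping `snr_db` over a range of values is one way to realize the systematic SNR control the dataset provides.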