We propose a novel voice activity detection (VAD) model for low-resource environments. Our key idea is to model VAD as a denoising task and to construct a network designed to identify nuisance features for a speech classification task. We train the model to identify irrelevant features and predict the type of speech event simultaneously. Our model contains only 7.8K parameters, outperforms previously proposed methods on the AVA-Speech evaluation set, and provides comparable results on the HAVIC dataset. We present its architecture, experimental results, and an ablation study of the model's components. The code and models are available at https://www.github.com/jsvir/vad.