Voice Activity Detection (VAD) is a fundamental preprocessing step in automatic speech recognition, particularly in the broadcast industry, where a wide variety of audio material and recording conditions is encountered. Motivated by previous studies indicating that x-vector embeddings can be applied to a diverse set of audio classification tasks, we investigate the suitability of x-vectors for discriminating speech from noise. We find that the proposed x-vector-based VAD system achieves the best reported score in detecting clean speech on AVA-Speech, whilst retaining robust VAD performance in the presence of noise and music. Furthermore, we integrate the x-vector-based VAD system into an existing speech-to-text (STT) pipeline and compare its performance on multiple broadcast datasets against a baseline system using WebRTC VAD. Crucially, our proposed x-vector-based VAD improves the accuracy of STT transcription on real-world broadcast audio.