Most automatic speech processing systems are sensitive to the acoustic environment, with degraded performance when applied to noisy or reverberant speech. But how can one tell whether speech is noisy or reverberant? We propose Brouhaha, a pipeline to simulate audio segments recorded in noisy and reverberant conditions. We then use the simulated audio to jointly train the Brouhaha model for voice activity detection, signal-to-noise ratio estimation, and C50 room acoustics prediction. We show how the predicted SNR and C50 values can be used to investigate and help diagnose errors made by automatic speech processing tools (such as pyannote.audio for speaker diarization or OpenAI's Whisper for automatic speech recognition). Both our pipeline and a pretrained model are open source and shared with the speech community.
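For readers unfamiliar with the two acoustic quantities named above, the following minimal sketch computes them from ground-truth signals using their standard textbook definitions. It is not the Brouhaha model or its simulation pipeline: the function names `snr_db` and `c50_db`, and the choice of the strongest impulse-response peak as the direct-sound arrival, are assumptions made for illustration only.

```python
import math


def snr_db(speech, noise):
    """Signal-to-noise ratio in dB: 10 * log10(mean speech power / mean noise power)."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    return 10.0 * math.log10(p_speech / p_noise)


def c50_db(rir, sample_rate):
    """Clarity index C50 in dB for a room impulse response (RIR).

    C50 is the ratio of the energy arriving within 50 ms of the direct
    sound to the energy arriving after that point.
    """
    # Assumption: the strongest peak marks the arrival of the direct sound.
    onset = max(range(len(rir)), key=lambda i: abs(rir[i]))
    split = onset + int(round(0.050 * sample_rate))
    early = sum(x * x for x in rir[onset:split])
    late = sum(x * x for x in rir[split:])
    # Note: an RIR with no energy after 50 ms makes `late` zero and C50 undefined.
    return 10.0 * math.log10(early / late)


if __name__ == "__main__":
    # Toy signals: speech amplitude 2, noise amplitude 1.
    print(snr_db([2.0] * 100, [1.0] * 100))
    # Toy RIR at 1 kHz: a unit direct path followed by a weak late tail.
    print(c50_db([1.0] + [0.0] * 49 + [0.1] * 100, sample_rate=1000))
```

Higher values of either quantity indicate easier conditions: a high SNR means speech dominates the background noise, and a high C50 means most of the room response arrives early, so the speech is less smeared by reverberation.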