Recent work has shown that deep-learning-based models are effective at speech quality prediction and can outperform traditional metrics in various respects. Although such network models have the potential to serve as a surrogate for complex human auditory perception, their predictions may be unstable. This work shows that deep speech quality predictors can be vulnerable to adversarial perturbations: the predicted score can be changed drastically by imperceptible perturbations as small as $-30$ dB relative to the speech input. Beyond exposing this vulnerability of deep speech quality predictors, we further explore and confirm the viability of adversarial training for strengthening model robustness.
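To make the $-30$ dB budget concrete, the sketch below (an assumption, not the paper's exact attack) scales an FGSM-style sign perturbation so that its energy is 30 dB below the clean speech; a real attack would take the sign of the predictor's gradient with respect to the input waveform rather than random noise.

```python
import numpy as np

def scale_to_db(clean, delta, rel_db):
    """Rescale `delta` so its energy is `rel_db` dB relative to `clean`,
    i.e. 10*log10(||delta||^2 / ||clean||^2) == rel_db."""
    gain = np.sqrt(np.sum(clean**2) / np.sum(delta**2) * 10 ** (rel_db / 10))
    return delta * gain

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for 1 s of speech at 16 kHz
grad = rng.standard_normal(16000)     # placeholder for dL/dx from the predictor
delta = scale_to_db(speech, np.sign(grad), -30.0)  # -30 dB perturbation
adversarial = speech + delta

# Perturbation energy relative to the speech, in dB:
rel_db = 10 * np.log10(np.sum(delta**2) / np.sum(speech**2))
print(round(rel_db, 1))  # -30.0
```

At this level the perturbation carries 1/1000 of the signal's energy, which is why it is hard to hear yet, per the result above, can still swing the predicted quality score.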