Towards Robust Waveform-Based Acoustic Models

We propose an approach for learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. Our approach is an instance of vicinal risk minimization, which aims to improve risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples. More specifically, we assume that local neighborhoods centered at training samples can be approximated using a mixture of Gaussians, and demonstrate theoretically that this can incorporate robust inductive bias into the learning process. We characterize the individual mixture components implicitly via data augmentation schemes, designed to address common sources of spurious correlations in acoustic models. To avoid potential confounding effects on robustness due to information loss, which has been associated with standard feature extraction techniques (e.g., FBANK and MFCC features), we focus our evaluation on the waveform-based setting. Our empirical results show that the proposed approach can generalize to unseen noise conditions, with 150% relative improvement in out-of-distribution generalization compared to training using the standard risk minimization principle. Moreover, the results demonstrate competitive performance relative to models learned using a training sample designed to match the acoustic conditions characteristic of test utterances (i.e., optimal vicinal densities).
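As an illustration of the vicinal risk idea described above, the following minimal sketch contrasts the empirical risk (a delta function at each training sample) with a Monte Carlo estimate of the vicinal risk, where each sample is replaced by draws from a Gaussian mixture centered on it. It is not the authors' implementation: the abstract says the mixture components are characterized implicitly through data augmentation schemes, whereas this sketch assumes explicit isotropic Gaussians. The function names, the `sigmas` and `weights` parameters, and the toy linear model are all hypothetical.

```python
import numpy as np

def sample_vicinity(x, sigmas, weights, rng):
    """Draw one point from a Gaussian mixture centered at x.

    Assumption: every component has mean x and an isotropic standard
    deviation taken from `sigmas`; `weights` are the mixing proportions.
    """
    k = rng.choice(len(sigmas), p=weights)   # pick a mixture component
    return x + sigmas[k] * rng.standard_normal(x.shape)

def empirical_risk(loss, model, X, y):
    """Standard risk estimate: average loss at the training samples only."""
    return float(np.mean([loss(model(x), t) for x, t in zip(X, y)]))

def vicinal_risk(loss, model, X, y, sigmas, weights, n_draws=8, seed=0):
    """Monte Carlo estimate of the vicinal risk.

    Each training sample contributes the average loss over `n_draws`
    points sampled from its vicinity, all keeping the original label.
    """
    rng = np.random.default_rng(seed)
    sigmas = np.asarray(sigmas, dtype=float)
    total = 0.0
    for x, t in zip(X, y):
        for _ in range(n_draws):
            total += loss(model(sample_vicinity(x, sigmas, weights, rng)), t)
    return total / (len(X) * n_draws)

# Toy setup (hypothetical): a linear model that fits the training set exactly.
w = np.array([2.0, -1.0])
model = lambda x: float(w @ x)
squared_loss = lambda pred, target: (pred - target) ** 2
X = np.array([[1.0, 2.0], [0.5, -1.0]])
y = [model(x) for x in X]

risk_erm = empirical_risk(squared_loss, model, X, y)
risk_vrm = vicinal_risk(squared_loss, model, X, y,
                        sigmas=[0.1, 0.5], weights=[0.5, 0.5], n_draws=200)
```

In this toy case the empirical risk is zero because the model interpolates the training points, while the vicinal estimate is strictly positive: the model is penalized for its sensitivity to perturbations around each sample. For a linear model with squared loss, that penalty grows with the norm of the weights, which is one simple way to see how vicinal densities can inject an inductive bias toward robustness.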
