Towards Robust Waveform-Based Acoustic Models

We study the problem of learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. First, we characterize data augmentation theoretically as an instance of vicinal risk minimization, which aims at improving risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples. More specifically, we assume that local neighborhoods centered at training samples can be approximated using a mixture of Gaussians, and demonstrate theoretically that this can incorporate robust inductive bias into the learning process. We then specify the individual mixture components implicitly via data augmentation schemes, designed to address common sources of spurious correlations in acoustic models. To avoid potential confounding effects on robustness due to information loss, which has been associated with standard feature extraction techniques (e.g., FBANK and MFCC features), we focus on the waveform-based setting. Our empirical results show that the approach can generalize to unseen noise conditions, with 150% relative improvement in out-of-distribution generalization compared to training using the standard risk minimization principle. Moreover, the results demonstrate competitive performance relative to models learned using a training sample designed to match the acoustic conditions characteristic of test utterances.
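As a rough illustration of the vicinal risk idea described in the abstract, the sketch below contrasts the empirical risk (loss averaged over the training samples themselves) with a Monte Carlo estimate of the vicinal risk (loss averaged over samples drawn from a neighborhood of each training sample). Everything here is an assumption made for illustration: the function names, the isotropic Gaussian mixture with fixed scales, and the toy model and loss are hypothetical and are not the augmentation schemes or the models used in the paper.

```python
import numpy as np


def sample_vicinity(x, rng, scales=(0.01, 0.05), weights=(0.5, 0.5)):
    """Draw one sample from an isotropic Gaussian mixture centred at waveform x.

    The scales and weights are illustrative placeholders, not values from the paper.
    """
    k = rng.choice(len(scales), p=weights)  # pick a mixture component
    return x + scales[k] * rng.standard_normal(x.shape)


def empirical_risk(loss, model, xs, ys):
    """Average loss over the training samples (delta functions at each x_i)."""
    return float(np.mean([loss(model(x), y) for x, y in zip(xs, ys)]))


def vicinal_risk(loss, model, xs, ys, sampler, n_draws=8):
    """Monte Carlo estimate: average loss over n_draws vicinal samples per x_i."""
    total = 0.0
    for x, y in zip(xs, ys):
        for _ in range(n_draws):
            total += loss(model(sampler(x)), y)
    return total / (len(xs) * n_draws)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs = [rng.standard_normal(16) for _ in range(4)]  # toy "waveforms"
    ys = [0.0, 1.0, 0.0, 1.0]

    def model(x):
        return float(np.mean(x ** 2))  # toy predictor: mean energy

    def loss(pred, target):
        return (pred - target) ** 2  # squared error

    print("empirical:", empirical_risk(loss, model, xs, ys))
    print("vicinal:  ", vicinal_risk(loss, model, xs, ys,
                                     lambda x: sample_vicinity(x, rng)))
```

In this sketch, a sampler that returns its input unchanged recovers the empirical risk exactly, which is one way to see data augmentation as replacing each delta function with a density over a neighborhood of the training sample.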
