Although end-to-end automatic speech recognition (E2E ASR) performs well on tasks with abundant paired data, making it robust to noisy and low-resource conditions remains challenging. In this study, we investigate data augmentation methods for E2E ASR in distant-talk scenarios. E2E ASR models are trained on the series of CHiME challenge datasets, which are well suited to studying robustness to noisy and spontaneous speech. We propose three augmentation methods and their combinations: 1) data augmentation using text-to-speech (TTS) data; 2) cycle-consistent generative adversarial network (Cycle-GAN) augmentation, trained to map between two audio characteristics, those of clean speech and those of noisy recordings, so that the training data match the testing condition; and 3) pseudo-label augmentation, in which a pretrained ASR module provides labels that smooth the label distributions. Experimental results on the CHiME-6 and CHiME-4 datasets show that each augmentation method individually improves accuracy on top of conventional SpecAugment, and that combining the methods yields further improvements. Combining all three augmentations on the CHiME-6 task gave a 4.3% word error rate (WER) reduction, a larger reduction than that obtained with SpecAugment.
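For readers unfamiliar with the metric, WER is commonly defined as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration, and this is not the scoring tool used in the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: words are whitespace-separated
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.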