We enhance the vanilla adversarial training method for unsupervised Automatic Speech Recognition (ASR) with a diffusion-GAN. Our model (1) injects instance noise of various intensities into the generator's output and into unlabeled reference text, which is sampled from pretrained phoneme language models under a length constraint, (2) asks diffusion-timestep-dependent discriminators to separate them, and (3) back-propagates the gradients to update the generator. Word/phoneme error rate comparisons with wav2vec-U on the Librispeech (3.1% for test-clean and 5.6% for test-other), TIMIT, and MLS datasets show that our enhancement strategies are effective.
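The core diffusion-GAN step described above can be sketched as follows: sample a diffusion timestep, add Gaussian instance noise whose intensity grows with that timestep to both the generator output and the reference-text representation, and hand both (with the timestep) to a timestep-conditioned discriminator. This is only a minimal NumPy sketch; the noise schedule, the matrix sizes, and all names are hypothetical, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_schedule(t, T=20, sigma_max=0.5):
    # Noise std grows with diffusion timestep t (assumed linear schedule).
    return sigma_max * t / T

def inject_instance_noise(x, t, T=20):
    # Add Gaussian instance noise of timestep-dependent intensity.
    return x + rng.normal(0.0, noise_schedule(t, T), size=x.shape)

# Toy stand-ins: 8 frames over 40 phoneme classes (hypothetical sizes).
fake = rng.random((8, 40))   # generator output (phoneme posteriors)
real = rng.random((8, 40))   # sampled unlabeled reference text

t = int(rng.integers(1, 21))          # sample a diffusion timestep
fake_noisy = inject_instance_noise(fake, t)
real_noisy = inject_instance_noise(real, t)
# A timestep-dependent discriminator would then score
# (fake_noisy, t) vs. (real_noisy, t), and its gradient
# would be back-propagated to update the generator.
```

Because both branches receive noise from the same schedule at the same timestep, the discriminator's task smoothly varies in difficulty, which is the stabilizing effect the diffusion-GAN framing provides.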