Diffusion models have shown exceptional scaling properties in the image synthesis domain, and initial attempts suggest similar benefits when diffusion is applied to unconditional text synthesis. Denoising diffusion models iteratively refine a sampled noise signal until it resembles a coherent signal (such as an image or a written sentence). In this work we investigate whether the benefits of diffusion models can also be realized for speech recognition. To this end, we propose a new way to perform speech recognition using a diffusion model conditioned on pretrained speech features. Specifically, we propose TransFusion: a transcribing diffusion model which iteratively denoises a random character sequence into coherent text corresponding to the transcript of a conditioning utterance. We demonstrate performance comparable to existing high-performing contrastive models on the LibriSpeech speech recognition benchmark. To the best of our knowledge, we are the first to apply denoising diffusion to speech recognition. We also propose new techniques for effectively sampling and decoding multinomial diffusion models. These are necessary because traditional methods of sampling from acoustic models are not applicable to our new discrete diffusion approach. Code and trained models are available: https://github.com/RF5/transfusion-asr
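To make the denoising idea concrete, the following is a minimal, hypothetical sketch of the reverse (denoising) loop for a multinomial diffusion transcriber in PyTorch. The `model` interface, vocabulary size, sequence length, and step count are all assumptions for illustration; the true multinomial diffusion reverse step also mixes the model's prediction with the forward-process noise schedule posterior, which this sketch omits for brevity.

```python
import torch

# Hypothetical constants, chosen only for illustration.
VOCAB_SIZE = 29   # e.g. 26 letters plus space, apostrophe, and padding (assumed)
SEQ_LEN = 128     # transcript length in characters (assumed)
NUM_STEPS = 200   # number of diffusion steps T (assumed)

@torch.no_grad()
def transcribe(model, speech_features):
    """Denoise a random character sequence into a transcript.

    `model(x_t, t, cond)` is an assumed interface returning logits of shape
    (batch, SEQ_LEN, VOCAB_SIZE); `speech_features` are pretrained speech
    features of the conditioning utterance.
    """
    # Start from a fully random character sequence x_T ~ Uniform(vocab).
    x_t = torch.randint(0, VOCAB_SIZE, (1, SEQ_LEN))
    for t in reversed(range(NUM_STEPS)):
        t_batch = torch.full((1,), t, dtype=torch.long)
        # Predict a categorical distribution over characters at each position,
        # conditioned on the speech features.
        logits = model(x_t, t_batch, speech_features)
        probs = logits.softmax(dim=-1)
        if t > 0:
            # Sample the slightly-less-noisy sequence x_{t-1}.
            x_t = torch.distributions.Categorical(probs=probs).sample()
        else:
            # Final step: take the most likely character at each position.
            x_t = probs.argmax(dim=-1)
    return x_t  # integer character indices; map to text via the vocabulary
```

Naive categorical sampling like the above can be unstable for discrete diffusion, which is one motivation for the improved sampling and decoding techniques proposed in the paper.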