Diffusion models have achieved state-of-the-art synthesis quality on visual and audio tasks, and recent works adapt them to textual data by diffusing in the embedding space. However, the mismatch between the continuous data space and the embedding space poses challenges to the diffusion model that have not been carefully explored. In this paper, we conduct systematic studies and identify three such challenges. First, the data distribution is learnable for embeddings, which may lead to the collapse of the loss function. Second, as the norm of an embedding varies between popular and rare words, adding noise of the same scale leads to sub-optimal results. In addition, we find that noise sampled from a standard Gaussian distribution may distract the diffusion process. To address these challenges, we propose Difformer, a Transformer-based denoising diffusion probabilistic model that incorporates three techniques: an anchor loss function, a layer normalization module for embeddings, and a norm factor applied to the Gaussian noise. The techniques are complementary, and together they are critical to boosting model performance. Experiments are conducted on benchmark datasets for two seminal text generation tasks, machine translation and text summarization. The results show that Difformer significantly outperforms embedding-diffusion baselines while achieving results competitive with strong autoregressive baselines.
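A minimal sketch of the forward (noising) step implied by the second and third techniques: embeddings are layer-normalized so popular and rare words have comparable norms, and the Gaussian noise is rescaled by a norm factor before being mixed in. All names here (`layer_norm`, `q_sample`, `noise_factor`) are illustrative assumptions, not taken from the Difformer implementation.

```python
import math
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each embedding vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def q_sample(x0, alpha_bar_t, noise_factor=1.0, rng=None):
    """One standard DDPM forward step with a norm factor F on the noise:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * F * eps."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(x0.shape)
    return (math.sqrt(alpha_bar_t) * x0
            + math.sqrt(1.0 - alpha_bar_t) * noise_factor * eps)

# Usage: noise a batch of 4 token embeddings of dimension 8.
emb = np.random.default_rng(1).standard_normal((4, 8))
x0 = layer_norm(emb)                               # equalize embedding norms
x_t = q_sample(x0, alpha_bar_t=0.5, noise_factor=2.0)  # enlarged noise scale
```

A `noise_factor` greater than 1 enlarges the noise relative to the (now unit-scale) embeddings; the specific value used by the paper is not reproduced here.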