Diffusion models have achieved state-of-the-art synthesis quality on both visual and audio tasks, and recent works further adapt them to textual data by diffusing in the embedding space. In this paper, we conduct systematic studies of the discrepancies between the continuous data space and the embedding space, which have not been carefully explored. First, the data distribution of the embeddings is learnable, which may lead to the collapse of the loss function. Second, as the norm of embeddings varies between popular and rare words, adding noise at the same scale leads to sub-optimal results. In addition, we find that the standard level of noise causes insufficient training of the model. To address these challenges, we propose Difformer, an embedding diffusion model based on the Transformer, which consists of three essential modules: an anchor loss function, a layer normalization module for embeddings, and a noise factor applied to the Gaussian noise. Experiments on two seminal text generation tasks, machine translation and text summarization, demonstrate the superiority of Difformer over the compared embedding diffusion baselines.
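To make the three modules concrete, below is a minimal PyTorch sketch of how they could fit together: layer normalization applied to the (learnable) embeddings before noising, a noise factor scaling the Gaussian noise in the forward process, and an anchor loss tying the denoised prediction back to the vocabulary. The function names `q_sample` and `anchor_loss`, the cross-entropy form of the anchor term, and the multiplicative placement of `noise_factor` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def q_sample(z0, alpha_bar_t, noise_factor=2.0):
    """Forward diffusion on normalized embeddings z0.

    noise_factor scales the Gaussian noise beyond the standard DDPM
    schedule; the abstract argues that the standard noise level trains
    the model insufficiently (exact scaling form is an assumption).
    """
    eps = torch.randn_like(z0)
    return torch.sqrt(alpha_bar_t) * z0 \
        + torch.sqrt(1.0 - alpha_bar_t) * noise_factor * eps

def anchor_loss(x0_pred, target_ids, embedding):
    """Anchor the predicted embeddings to the word embeddings via a
    rounding-style cross-entropy term, discouraging the collapse where
    the learnable embedding distribution shrinks to make the loss trivial."""
    logits = x0_pred @ embedding.weight.t()        # (B, L, V)
    return F.cross_entropy(logits.transpose(1, 2), target_ids)

# Embedding layer normalization: give popular and rare words comparable
# norms so one noise scale affects all tokens similarly.
embedding = torch.nn.Embedding(32000, 512)
layer_norm = torch.nn.LayerNorm(512)

tokens = torch.randint(0, 32000, (4, 16))
z0 = layer_norm(embedding(tokens))                 # normalized embeddings
zt = q_sample(z0, alpha_bar_t=torch.tensor(0.5))   # noised latents
```

In this sketch, a Transformer denoiser would predict `x0_pred` from `zt`, and the anchor term would be added to the usual denoising objective; the weighting between the two losses is left unspecified here.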