The past several years have witnessed the superiority of the Variational Auto-Encoder (VAE) in various text generation tasks. However, due to the sequential nature of text, auto-regressive decoders tend to ignore latent variables and reduce to plain language models, a phenomenon known as the KL-vanishing problem, which deteriorates further when the VAE is combined with Transformer-based structures. To ameliorate this problem, we propose DELLA, a novel variational Transformer framework. DELLA learns a series of layer-wise latent variables, each inferred from those of lower layers and tightly coupled with the hidden states by a low-rank tensor product. In this way, DELLA forces these posterior latent variables to be fused deeply into the whole computation path and hence to carry more information. We theoretically demonstrate that our method can be regarded as entangling the latent variables to avoid posterior information decay across layers, enabling DELLA to attain higher, non-zero KL values even without any annealing or thresholding tricks. Experiments on four unconditional and three conditional generation tasks show that DELLA better alleviates KL vanishing and improves both generation quality and diversity compared with several strong baselines.
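To make the two mechanisms named in the abstract concrete, the sketch below illustrates (1) layer-wise latent variables, each inferred conditioned on the latent of the layer below, and (2) coupling each latent with the layer's hidden states through a low-rank tensor (bilinear) product. This is a minimal illustrative sketch, not the authors' implementation: all module names, dimensions, the mean-pooled posterior input, and the standard-normal prior used for the KL term are assumptions made here for clarity.

```python
# Minimal sketch (not the authors' code) of layer-wise latents and low-rank tensor fusion.
import torch
import torch.nn as nn


class LowRankFusion(nn.Module):
    """Couple a latent vector z with hidden states h via a low-rank bilinear (tensor) product."""

    def __init__(self, d_model: int, d_latent: int, rank: int):
        super().__init__()
        self.proj_h = nn.Linear(d_model, rank, bias=False)    # project hidden states into a rank-r space
        self.proj_z = nn.Linear(d_latent, rank, bias=False)   # project the latent into the same rank-r space
        self.proj_out = nn.Linear(rank, d_model, bias=False)  # map the fused representation back to d_model

    def forward(self, h: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model), z: (batch, d_latent)
        fused = self.proj_h(h) * self.proj_z(z).unsqueeze(1)  # element-wise product = low-rank bilinear term
        return h + self.proj_out(fused)                       # residual connection keeps the original hidden states


class LayerwiseLatents(nn.Module):
    """Infer one latent per layer, conditioning each posterior on the latent of the lower layer."""

    def __init__(self, n_layers: int, d_model: int, d_latent: int, rank: int = 32):
        super().__init__()
        self.d_latent = d_latent
        self.posteriors = nn.ModuleList(
            # each posterior sees this layer's pooled hidden state plus the previous layer's latent
            nn.Linear(d_model + d_latent, 2 * d_latent) for _ in range(n_layers)
        )
        self.fusions = nn.ModuleList(LowRankFusion(d_model, d_latent, rank) for _ in range(n_layers))

    def forward(self, layer_hiddens: list):
        batch = layer_hiddens[0].size(0)
        z_prev = layer_hiddens[0].new_zeros(batch, self.d_latent)
        fused_hiddens, kl_terms = [], []
        for h, posterior, fusion in zip(layer_hiddens, self.posteriors, self.fusions):
            pooled = h.mean(dim=1)                                # (batch, d_model) summary of this layer
            mu, logvar = posterior(torch.cat([pooled, z_prev], dim=-1)).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterised sample of the layer latent
            kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()  # KL to N(0, I) (assumed prior)
            fused_hiddens.append(fusion(h, z))                    # tie the latent into the computation path
            kl_terms.append(kl)
            z_prev = z                                            # lower-layer latent conditions the next posterior
        return fused_hiddens, torch.stack(kl_terms).sum()


# Toy usage: 4 layers of random "hidden states" with batch 2, sequence length 16, width 64.
hiddens = [torch.randn(2, 16, 64) for _ in range(4)]
model = LayerwiseLatents(n_layers=4, d_model=64, d_latent=32)
outs, kl = model(hiddens)
print(outs[0].shape, kl.item())
```

Because every layer's hidden states pass through the fusion step, the decoder cannot reach the output without consuming the latents, which is the intuition behind the higher non-zero KL values reported in the abstract.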