The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimization method. From a distributional view, MLE in fact minimizes the Kullback-Leibler divergence (KLD) between the distribution of the real data and that of the model. However, this approach forces the model to assign non-zero (sometimes large) probability mass to all training samples regardless of their quality. Moreover, in the attempt to cover the low-probability regions of the data distribution, the model systematically overestimates the probability of corrupted text sequences, which we conjecture is one of the main reasons for text degeneration during autoregressive decoding. To remedy this problem, we leverage the total variation distance (TVD), which is known for its robustness to outliers, and develop practical bounds to apply it to language generation. We then introduce the TaiLr objective, which balances the tradeoff in estimating TVD. Intuitively, TaiLr downweights real data samples that receive low model probabilities, with a tunable penalization intensity. Experimental results show that our method alleviates the overestimation of degenerated sequences without sacrificing diversity and improves generation quality on a wide range of text generation tasks.
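To make the downweighting intuition concrete, below is a minimal PyTorch sketch of a token-level reweighted MLE loss. It assumes a weighting of the form w = p / (γ + (1 − γ)·p), where p is the model probability of the gold token and γ ∈ (0, 1] controls the penalization intensity (γ → 1 recovers standard MLE); the function name `tailr_style_loss` and the exact weighting form are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of a TaiLr-style reweighted token-level loss (illustrative, not the
# official implementation). Tokens the model assigns low probability get a
# smaller weight; gamma tunes how strongly they are penalized.
import torch
import torch.nn.functional as F

def tailr_style_loss(logits: torch.Tensor,
                     targets: torch.Tensor,
                     gamma: float = 0.1,
                     ignore_index: int = -100) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); targets: (batch, seq_len) gold token ids."""
    log_probs = F.log_softmax(logits, dim=-1)                       # (B, T, V)
    gold_logp = log_probs.gather(
        -1, targets.clamp(min=0).unsqueeze(-1)).squeeze(-1)         # (B, T)
    p = gold_logp.exp()
    # Downweight low-probability gold tokens; detach so gradients flow only
    # through the log-likelihood term, not through the weight itself.
    weight = (p / (gamma + (1.0 - gamma) * p)).detach()
    mask = (targets != ignore_index).float()
    nll = -(weight * gold_logp) * mask
    return nll.sum() / mask.sum().clamp(min=1.0)
```

With γ close to 1 the weights approach 1 for every token and the loss reduces to ordinary cross-entropy; smaller γ shrinks the contribution of samples the model deems unlikely, which is the behavior the abstract describes.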