While recent work has shown that scores from models trained with the ubiquitous masked language modeling (MLM) objective effectively discriminate between probable and improbable sequences, it remains an open question whether these MLMs specify a principled probability distribution over the space of possible sequences. In this paper, we interpret MLMs as energy-based sequence models and propose two energy parametrizations derivable from trained MLMs. In order to draw samples correctly from these models, we develop a tractable \emph{sampling} scheme based on the Metropolis--Hastings Monte Carlo algorithm. In our approach, samples are proposed from the same masked conditionals used to train the masked language models, and they are accepted or rejected based on their energy values under the target distribution. We validate the effectiveness of the proposed parametrizations by exploring the quality of samples drawn from these energy-based models on the conditional generation task of machine translation. We justify our sampling algorithm both theoretically and empirically: the masked conditionals on their own do not yield a Markov chain whose stationary distribution matches the target distribution, and our approach generates higher-quality samples than other recently proposed undirected generation approaches (Wang et al., 2019; Ghazvininejad et al., 2019).
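For exposition (the abstract itself does not spell this out), the accept/reject step follows the standard Metropolis--Hastings rule. Writing $p(x) \propto \exp(-E(x))$ for the energy-based target and $q(\cdot \mid x_{\setminus i})$ for the MLM's masked conditional at a position $i$, a proposal that replaces $x_i$ with $x'_i$ leaves the surrounding context $x_{\setminus i}$ unchanged, so both proposal terms use the same conditional:
\[
A(x \to x') \;=\; \min\!\left(1,\; \frac{\exp\bigl(-E(x')\bigr)\, q(x_i \mid x_{\setminus i})}{\exp\bigl(-E(x)\bigr)\, q(x'_i \mid x_{\setminus i})}\right).
\]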
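The following is a minimal sketch of this sampling loop, not the paper's implementation. The callables `energy` and `masked_conditional` are hypothetical stand-ins for an energy parametrization derived from a trained MLM and for that MLM's masked conditionals; a real implementation would back both with the model.

```python
import math
import random

def mh_mlm_sampler(seq, energy, masked_conditional, num_steps, rng=random):
    """Metropolis-Hastings over token sequences with MLM-conditional proposals.

    seq: list of token ids (fixed length).
    energy(seq) -> float: energy of seq under the target EBM (lower = more probable).
    masked_conditional(seq, i) -> dict {token_id: prob}: the MLM's distribution
        over position i with seq[i] masked out (assumed to have full support).
    """
    cur_energy = energy(seq)
    for _ in range(num_steps):
        i = rng.randrange(len(seq))      # position to resample
        q = masked_conditional(seq, i)   # proposal: MLM conditional at i
        old_tok = seq[i]
        tokens, probs = zip(*q.items())
        new_tok = rng.choices(tokens, weights=probs, k=1)[0]
        proposal = seq[:i] + [new_tok] + seq[i + 1:]
        prop_energy = energy(proposal)
        # Both states share the context seq[\i], so the proposal terms come
        # from the same conditional q:
        #   log A = (E(x) - E(x')) + log q(old) - log q(new)
        log_accept = (cur_energy - prop_energy
                      + math.log(q[old_tok]) - math.log(q[new_tok]))
        if rng.random() < math.exp(min(0.0, log_accept)):
            seq, cur_energy = proposal, prop_energy
    return seq
```

Drawing proposals from the MLM's own conditionals keeps the proposal distribution close to the target, so acceptance rates stay workable; the correction step is what the abstract argues is necessary, since the conditionals alone do not define a chain with the right stationary distribution.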