Given an input sequence (or prefix), modern language models often assign high probabilities to output sequences that are repetitive, incoherent, or irrelevant to the prefix; as such, model-generated text also contains such artifacts. To address these issues we present RankGen, a 1.2B parameter encoder model for English that scores model generations given a prefix. RankGen can be flexibly incorporated as a scoring function in beam search and used to decode from any pretrained language model. We train RankGen using large-scale contrastive learning to map a prefix close to the ground-truth sequence that follows it and far away from two types of negatives: (1) random sequences from the same document as the prefix, and (2) sequences generated from a large language model conditioned on the prefix. Experiments across four different language models (345M-11B parameters) and two domains show that RankGen significantly outperforms decoding algorithms like nucleus, top-k, and typical sampling, as well as contrastive decoding and search, on both automatic metrics (85.0 vs 77.3 MAUVE over nucleus) as well as human evaluations with English writers (74.5% human preference over nucleus sampling). Analysis reveals that RankGen outputs are more relevant to the prefix and improve continuity and coherence compared to baselines. We release our model checkpoints, code, and human preference data with explanations to facilitate future research.
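To make the training objective and decoding-time use concrete, below is a minimal sketch of a RankGen-style contrastive setup as described above: a prefix embedding is pulled toward the embedding of its ground-truth continuation and pushed away from in-document and model-generated negatives, and the same prefix-suffix score is then used to rerank candidate continuations at decoding time. All names here (rankgen_score, info_nce_loss, the temperature value) are illustrative assumptions, not the released RankGen API or its exact hyperparameters.

```python
# Hedged sketch of a RankGen-style contrastive objective and reranking step.
# Assumes prefix/suffix vectors come from a shared encoder (RankGen itself is a
# 1.2B-parameter encoder); the function names and temperature are hypothetical.

import torch
import torch.nn.functional as F


def rankgen_score(prefix_emb: torch.Tensor, suffix_emb: torch.Tensor) -> torch.Tensor:
    """Dot product between L2-normalized prefix and candidate-suffix embeddings.

    prefix_emb: (batch, dim)
    suffix_emb: (batch, num_candidates, dim)
    returns:    (batch, num_candidates) compatibility scores
    """
    p = F.normalize(prefix_emb, dim=-1)
    s = F.normalize(suffix_emb, dim=-1)
    return torch.einsum("bd,bcd->bc", p, s)


def info_nce_loss(prefix_emb, gold_emb, inbook_neg_emb, generative_neg_emb,
                  temperature=0.05):
    """Contrastive loss over one gold continuation and two kinds of negatives:
    (1) random sequences from the same document, (2) LM-generated continuations,
    mirroring the two negative types described in the abstract."""
    # Candidate 0 is the gold suffix; the rest are negatives.
    candidates = torch.cat(
        [gold_emb.unsqueeze(1), inbook_neg_emb, generative_neg_emb], dim=1
    )  # (batch, 1 + num_negatives, dim)
    logits = rankgen_score(prefix_emb, candidates) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long)  # gold at index 0
    return F.cross_entropy(logits, targets)


# Decoding-time use (e.g., as a scoring function over beam-search candidates):
# encode the prefix and each candidate continuation, then keep the highest-scoring one.
if __name__ == "__main__":
    batch, num_candidates, dim = 2, 4, 8
    prefix_emb = torch.randn(batch, dim)
    candidate_embs = torch.randn(batch, num_candidates, dim)
    best = rankgen_score(prefix_emb, candidate_embs).argmax(dim=-1)
    print(best)  # index of the preferred continuation per prefix
```

The reranking step is model-agnostic: candidates can come from any pretrained language model and any sampling scheme, which is what lets RankGen be dropped into beam search for the LMs evaluated in the paper.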