Given an input sequence (or prefix), modern language models often assign high probabilities to output sequences that are repetitive, incoherent, or irrelevant to the prefix; as such, model-generated text also contains such artifacts. To address these issues we present RankGen, a 1.2B parameter encoder model for English that scores model generations given a prefix. RankGen can be flexibly incorporated as a scoring function in beam search and used to decode from any pretrained language model. We train RankGen using large-scale contrastive learning to map a prefix close to the ground-truth sequence that follows it and far away from two types of negatives: (1) random sequences from the same document as the prefix, and (2) sequences generated from a large language model conditioned on the prefix. Experiments across four different language models (345M-11B parameters) and two domains show that RankGen significantly outperforms decoding algorithms like nucleus, top-k, and typical sampling on both automatic metrics (85.0 vs 77.3 MAUVE) as well as human evaluations with English writers (74.5% human preference over nucleus sampling). Analysis reveals that RankGen outputs are more relevant to the prefix and improve continuity and coherence compared to baselines. We release our model checkpoints, code, and human preference data with explanations to facilitate future research.
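To make the training setup above concrete, here is a minimal sketch (not the authors' released code) of the contrastive objective described in the abstract: the prefix embedding should score its true continuation higher than both in-document negatives and model-generated negatives. The encoder producing the embeddings, the function names, and the tensor shapes are assumptions for illustration only; RankGen itself uses a 1.2B parameter encoder.

```python
# Illustrative sketch of the contrastive setup described in the abstract.
# Assumes prefix/candidate embeddings have already been produced by an encoder;
# all names and shapes here are hypothetical, not RankGen's actual API.
import torch
import torch.nn.functional as F

def rankgen_score(prefix_emb: torch.Tensor, candidate_emb: torch.Tensor) -> torch.Tensor:
    """Score each candidate continuation by dot product with the prefix embedding."""
    # prefix_emb: (B, d), candidate_emb: (N, d) -> scores: (B, N)
    return prefix_emb @ candidate_emb.T

def contrastive_loss(prefix_emb, positive_emb, negative_embs):
    """
    prefix_emb:    (B, d)    embeddings of prefixes
    positive_emb:  (B, d)    embeddings of the ground-truth continuations
    negative_embs: (B, K, d) embeddings of K negatives per prefix
                   (random same-document sequences + LM-generated sequences)
    """
    # Stack the positive at index 0 followed by the K negatives.
    candidates = torch.cat([positive_emb.unsqueeze(1), negative_embs], dim=1)   # (B, 1+K, d)
    logits = torch.einsum("bd,bkd->bk", prefix_emb, candidates)                 # (B, 1+K)
    labels = torch.zeros(prefix_emb.size(0), dtype=torch.long, device=logits.device)
    # Cross-entropy pushes the prefix close to its true continuation and away from negatives.
    return F.cross_entropy(logits, labels)
```

At decoding time, the same score can act as the ranking signal the abstract mentions: sample several candidate continuations from any pretrained language model, embed them, and keep the candidates with the highest `rankgen_score` (used greedily or inside beam search).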