Large language models (LLMs) trained with the next-token-prediction objective, such as GPT-3 and PaLM, have revolutionized natural language processing in recent years by showing impressive zero-shot and few-shot capabilities across a wide range of tasks. In this work, we propose a simple technique that significantly boosts the performance of LLMs without adding computational cost. Our key observation is that, by performing next-token prediction with randomly selected past tokens masked out, we can improve the quality of the learned representations for downstream language understanding tasks. We hypothesize that randomly masking past tokens prevents over-attending to recent tokens and encourages attention to tokens in the distant past. By randomly masking input tokens in the PaLM model, we significantly improve the zero-shot SuperGLUE performance of 1B and 8B PaLM from 55.7 to 59.2 and from 61.6 to 64.0, respectively. Our largest 8B model matches the score of PaLM with an average score of 64, despite the fact that PaLM is trained on a much larger dataset (780B tokens) of high-quality conversation and webpage data, while ours is trained on the smaller C4 dataset (180B tokens). Experimental results show that our method also improves PaLM's zero- and few-shot performance on a diverse suite of tasks, including commonsense reasoning, natural language inference, and cloze completion. Moreover, we show that our technique also improves representation learning, yielding significant gains in PaLM's finetuning results.
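As a rough sketch of the masking idea described above (not the paper's implementation), the snippet below builds a standard causal attention mask and then randomly hides a subset of past tokens from all later positions. The fixed mask probability, the per-token drop decision shared across query positions, and the rule that a token always attends to itself are illustrative assumptions rather than details specified in this abstract.

```python
import jax
import jax.numpy as jnp


def random_causal_mask(rng, seq_len, mask_prob=0.15):
    """Causal attention mask with a random subset of past tokens hidden.

    Returns a [seq_len, seq_len] boolean array where entry (i, j) is True
    if query position i may attend to key position j.
    """
    # Standard causal mask: position i attends to positions j <= i.
    causal = jnp.tril(jnp.ones((seq_len, seq_len), dtype=bool))

    # Drop each past token (as an attention key) with probability mask_prob;
    # the same drop decision is shared by every later query position.
    drop = jax.random.bernoulli(rng, p=mask_prob, shape=(seq_len,))
    keep = ~drop[None, :]

    # A token is never hidden from itself, so every query keeps at least one key.
    keep = keep | jnp.eye(seq_len, dtype=bool)

    return causal & keep


# Example: a 6-token sequence with ~15% of past tokens randomly hidden.
mask = random_causal_mask(jax.random.PRNGKey(0), seq_len=6)
print(mask.astype(jnp.int32))
```

In a decoder-only Transformer, such a boolean mask would be applied to the attention logits (masked positions set to a large negative value) before the softmax, so training still uses the ordinary next-token-prediction loss and adds no computational cost.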