Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find that instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human-written summaries.