The recent success of zero- and few-shot prompting with models like GPT-3 has led to a paradigm shift in NLP research. In this paper, we study its impact on text summarization, focusing on the classic benchmark domain of news summarization. First, we investigate how zero-shot GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3 summaries, but that they also do not suffer from common dataset-specific issues such as poor factuality. Next, we study what this means for evaluation, particularly the role of gold-standard test sets. Our experiments show that both reference-based and reference-free automatic metrics, e.g., recently proposed QA- or entailment-based factuality approaches, cannot reliably evaluate zero-shot summaries. Finally, we discuss future research challenges beyond generic summarization, specifically keyword- and aspect-based summarization, and show how dominant fine-tuning approaches compare to zero-shot prompting in these settings. To support further research, we release: (a) a corpus of 10K generated summaries from fine-tuned and zero-shot models across 4 standard summarization benchmarks, and (b) 1K human preference judgments and rationales comparing different systems for generic and keyword-based summarization.
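As context for the metric discussion above, the following is a minimal, illustrative sketch of how a reference-based metric such as ROUGE is typically computed, using Google's open-source rouge-score package; the example texts are hypothetical placeholders, not drawn from our released corpus or evaluation setup.

    # Illustrative only: scoring a candidate summary against a gold reference
    # with ROUGE, via the open-source `rouge-score` package
    # (pip install rouge-score). The texts below are hypothetical placeholders.
    from rouge_score import rouge_scorer

    reference = "Officials confirmed the bridge will reopen to traffic next month."
    candidate = "The bridge is set to reopen next month, officials said."

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, candidate)  # signature: score(target, prediction)
    for name, s in scores.items():
        print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F={s.fmeasure:.3f}")

Such metrics reward n-gram overlap with a single gold reference, which is precisely the property that makes them unreliable for judging stylistically different zero-shot summaries.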