Very large language models such as GPT-3 have shown impressive performance across a wide variety of tasks, including text summarization. In this paper, we show that this strong performance extends to opinion summarization. We explore several pipeline methods for applying GPT-3 to summarize a large collection of user reviews in a zero-shot fashion, notably approaches based on recursive summarization and selecting salient content to summarize through supervised clustering or extraction. On two datasets, an aspect-oriented summarization dataset of hotel reviews and a generic summarization dataset of Amazon and Yelp reviews, we show that the GPT-3 models achieve very strong performance in human evaluation. We argue that standard evaluation metrics do not reflect this, and evaluate against several new measures targeting faithfulness, factuality, and genericity to contrast these different methods.