Text summarization has been a crucial problem in natural language processing (NLP) for several decades. It aims to condense lengthy documents into shorter versions while retaining the most critical information. Various methods have been proposed for text summarization, including extractive and abstractive approaches. The emergence of large language models (LLMs) such as GPT-3 and ChatGPT has recently generated significant interest in applying these models to summarization tasks. Recent studies \cite{goyal2022news, zhang2023benchmarking} have shown that LLM-generated news summaries are already on par with those written by humans. However, the performance of LLMs on more practical applications, such as aspect-based or query-based summarization, remains underexplored. To fill this gap, we evaluate ChatGPT on four widely used benchmark datasets covering diverse summarization settings: Reddit posts, news articles, dialogue meetings, and stories. Our experiments reveal that ChatGPT's performance is comparable to that of traditional fine-tuning methods in terms of ROUGE scores. Moreover, we highlight several distinctive differences between ChatGPT-generated summaries and human references, providing insights into ChatGPT's capabilities across diverse summarization tasks. Our findings call for new research directions in this area, and we plan to systematically examine the characteristics of ChatGPT-generated summaries through extensive human evaluation.
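The ROUGE scores used in the comparison above measure n-gram overlap between a candidate summary and a human reference. The sketch below is a simplified, self-contained ROUGE-N F1 over whitespace tokens, for illustration only; the paper's actual evaluations would typically use a standard implementation (e.g. the `rouge-score` package) with stemming and proper tokenization:

```python
from collections import Counter

def rouge_n(reference: str, candidate: str, n: int = 1) -> float:
    """Simplified ROUGE-N F1: n-gram overlap between candidate and reference."""
    def ngrams(text: str) -> Counter:
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    if not ref or not cand:
        return 0.0
    # Clipped overlap: each n-gram counts at most as often as in the reference.
    overlap = sum((ref & cand).values())
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

print(round(rouge_n("the cat sat on the mat", "the cat is on the mat"), 3))  # → 0.833
```

A higher F1 indicates greater lexical overlap with the reference, which is why ROUGE can reward extractive summaries yet miss paraphrased but faithful abstractive ones, motivating the human evaluation proposed above.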