Evaluating text summarization is a challenging problem, and existing evaluation metrics are far from satisfactory. In this study, we explored ChatGPT's ability to perform human-like summarization evaluation using four human evaluation methods on five datasets. We found that ChatGPT completed the annotations relatively smoothly with Likert-scale scoring, pairwise comparison, the Pyramid method, and binary factuality evaluation, and that it outperformed commonly used automatic evaluation metrics on some datasets. Furthermore, we discussed the impact of different prompts, compared its performance with that of human evaluation, and analyzed the generated explanations and invalid responses.
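To make the Likert-scale setting concrete, the sketch below shows one way such an evaluation prompt could be issued to a chat model through the OpenAI Python client. It is a minimal illustration, not the authors' actual prompt or pipeline: the prompt wording, the `coherence` dimension, the model name, and the `likert_score` helper are all assumptions introduced here.

```python
# Minimal sketch of Likert-scale summary scoring with a chat model.
# Prompt text, model name, and scoring dimension are illustrative assumptions,
# not the prompts used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def likert_score(source: str, summary: str, dimension: str = "coherence") -> str:
    """Ask the model to rate one summary on a 1-5 Likert scale."""
    prompt = (
        f"Evaluate the {dimension} of the following summary of the source "
        "document on a Likert scale from 1 (worst) to 5 (best). "
        "Reply with a single integer.\n\n"
        f"Source document:\n{source}\n\n"
        f"Summary:\n{summary}\n\n"
        "Score:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for "ChatGPT"; exact model is an assumption
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring
    )
    return response.choices[0].message.content.strip()


# Example usage:
# print(likert_score("Full source article text ...", "Candidate summary text ..."))
```

The other settings described above (pairwise comparison, Pyramid, binary factuality) would follow the same pattern with a different prompt template and output format.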