We report the results of the DialogSum Challenge, the shared task on summarizing real-life scenario dialogues at INLG 2022. Four teams participate in this shared task and three of them submit their system reports, exploring different methods to improve the performance of dialogue summarization. Although there is a significant improvement over the baseline models in terms of automatic evaluation metrics such as ROUGE scores, human evaluation from multiple aspects reveals that a salient gap remains between model-generated outputs and human-annotated summaries. These findings demonstrate the difficulty of dialogue summarization and suggest that more fine-grained evaluation metrics are needed.