The ROUGE metric is commonly used to evaluate extractive summarization, but it has been criticized for its lack of semantic awareness and for ignoring the ranking quality of the extractive summarizer. Previous research introduced a gain-based automated metric called Sem-nCG that addresses these issues, as it is both rank-aware and semantic-aware. However, Sem-nCG does not consider the amount of redundancy present in a model summary, and it currently does not support evaluation with multiple reference summaries. A good model summary must balance importance and diversity, but finding a metric that captures both aspects is challenging. In this paper, we propose a redundancy-aware Sem-nCG metric and demonstrate how the revised metric can also be used to evaluate model summaries against multiple references, which previous research did not support. Experimental results demonstrate that the revised Sem-nCG metric correlates more strongly with human judgments than the original Sem-nCG metric and the traditional ROUGE and BERTScore metrics in both single-reference and multiple-reference scenarios.