Automatic evaluation metrics have facilitated the rapid development of automatic summarization methods by providing instant and fair assessments of summary quality. Most of these metrics were developed for the general domain, especially news and meeting notes, or for other language-generation tasks. However, they are also applied to evaluate summarization systems in other domains, such as biomedical question summarization. To better understand whether commonly used evaluation metrics are capable of evaluating automatic summarization in the biomedical domain, we conduct human evaluations of summary quality along four different aspects of a biomedical question summarization task. Based on the human judgments, we identify noteworthy features of current automatic metrics and of summarization systems. We also release a dataset of our human annotations to aid research on summarization evaluation metrics in the biomedical domain.