Constructing human-curated annotated datasets for abstractive text summarization (ATS) is time-consuming and expensive: each instance requires a human annotator to read a long document and compose a shorter summary that preserves the key information conveyed by the original. Active Learning (AL) is a technique developed to reduce the amount of annotation required to reach a given level of machine learning model performance. In information extraction and text classification, AL can reduce annotation labor severalfold. Despite this potential for aiding expensive annotation, to the best of our knowledge, there have been no effective AL query strategies for ATS. This stems from the fact that many AL strategies rely on uncertainty estimation, whereas, as we show in this work, uncertain instances are usually noisy, and selecting them can degrade model performance compared to passive annotation. We address this problem by proposing the first effective AL query strategy for ATS, based on diversity principles. We show that, for a given annotation budget, using our strategy during AL annotation improves model performance in terms of ROUGE and consistency scores. Additionally, we analyze the effect of self-learning and show that it can further increase model performance.
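To make the diversity principle concrete, the sketch below shows one common way to select a diverse batch of unlabeled documents: greedy k-center (farthest-point) selection over document embeddings. This is an illustrative example of diversity-based selection in general, not the paper's specific strategy; the function name and the use of Euclidean distance over precomputed embeddings are assumptions for the sake of the example.

```python
import numpy as np

def diversity_select(embeddings, budget, seed=0):
    """Greedy k-center selection over document embeddings.

    Illustrative sketch of a diversity-based AL query strategy
    (not the paper's exact method): at each step, pick the unlabeled
    document whose embedding is farthest from the selected set.
    """
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    selected = [int(rng.integers(n))]  # seed the batch with a random document
    # Distance from every document to its nearest selected document.
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < budget:
        nxt = int(np.argmax(dists))  # farthest remaining point = most diverse
        selected.append(nxt)
        dists = np.minimum(
            dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        )
    return selected
```

In an AL loop, the returned indices would be sent to annotators, and the model retrained on the newly labeled summaries; uncertainty plays no role here, which is the point of a purely diversity-driven strategy.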