Training learnable metrics using modern language models has recently emerged as a promising method for the automatic evaluation of machine translation. However, existing human evaluation datasets for text simplification are limited by a scarcity of annotations, unitary simplification types, and outdated system outputs, making them unsuitable for this approach. To address these issues, we introduce the SIMPEVAL corpus, which contains: SIMPEVAL_ASSET, comprising 12K human ratings on 2.4K simplifications from 24 systems, and SIMPEVAL_2022, a challenging simplification benchmark consisting of over 1K human ratings of 360 simplifications, including generations from GPT-3.5. Training on SIMPEVAL_ASSET, we present LENS, a Learnable Evaluation Metric for Text Simplification. Extensive empirical results show that LENS correlates better with human judgment than existing metrics, paving the way for future progress in the evaluation of text simplification. To create the SIMPEVAL datasets, we introduce RANK & RATE, a human evaluation framework that rates simplifications from several models in a list-wise manner via an interactive interface, ensuring both consistency and accuracy in the evaluation process. Our metric, dataset, and annotation toolkit are available at https://github.com/Yao-Dou/LENS.