Modern text simplification (TS) relies heavily on the availability of gold-standard data to build machine learning models. However, existing studies show that parallel TS corpora contain inaccurate simplifications and incorrect alignments. Additionally, evaluation is usually performed by comparing system output to the gold standard using metrics such as BLEU or SARI. A major limitation is that these metrics do not match human judgements, and performance varies greatly across datasets and linguistic phenomena. Furthermore, our research shows that the test and training subsets of parallel datasets differ significantly. In this work, we investigate existing TS corpora, providing new insights that will motivate the improvement of existing state-of-the-art TS evaluation methods. Our contributions include an analysis of TS corpora based on the modifications used for simplification and an empirical study of TS model performance on better-distributed datasets. We demonstrate that by improving the distribution of TS datasets, we can build more robust TS models.