A New Tool for Efficiently Generating Quality Estimation Datasets

Building training data for quality estimation (QE) is expensive and requires significant human labor. In this study, we take a data-centric approach to QE and propose a fully automatic pseudo-QE dataset generation tool that produces QE datasets from only a monolingual or parallel corpus as input. The generated data improves QE performance either through data augmentation or by broadening the applicability of QE across multiple language pairs. We intend to publicly release this user-friendly tool, as we believe it offers the community a new, inexpensive way to develop QE datasets.
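
To make the idea of a pseudo-QE label concrete, the following is a minimal sketch and not the released tool. It assumes that a machine translation of each source sentence is already available from some MT system, and that a sentence-level score can be approximated by the word-level edit distance between that translation and the reference side of a parallel corpus, normalized by reference length. This is a simplified stand-in for HTER: real HTER relies on human post-edits and TER-style shift operations, both omitted here. The function names `word_edit_distance`, `pseudo_hter`, and `build_pseudo_qe` are hypothetical.

```python
from typing import List, Tuple


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance (insert, delete, substitute; no shifts)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,           # delete a hypothesis word
                            curr[j - 1] + 1,       # insert a reference word
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def pseudo_hter(mt_output: str, reference: str) -> float:
    """Assumed pseudo label: edit distance / reference length, capped at 1.0."""
    hyp, ref = mt_output.split(), reference.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return min(1.0, word_edit_distance(hyp, ref) / len(ref))


def build_pseudo_qe(
    triples: List[Tuple[str, str, str]]
) -> List[Tuple[str, str, float]]:
    """Map (source, mt_output, reference) triples to (source, mt_output, score)."""
    return [(src, mt, pseudo_hter(mt, ref)) for src, mt, ref in triples]
```

Under these assumptions, each output row pairs a source sentence and its machine translation with a score in [0, 1], where 0 means the translation matches the reference exactly. For a monolingual corpus, the reference side would first have to be produced by some other step, which this sketch does not cover.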
