Scientific publications are the primary means of communicating research discoveries, so writing quality is of crucial importance. However, prior work studying the human editing process in this domain has focused mainly on the abstract or introduction sections, resulting in an incomplete picture. In this work, we provide a complete computational framework for studying text revision in scientific writing. We first introduce arXivEdits, a new annotated corpus of 751 full papers from arXiv with gold sentence alignments across their multiple revised versions, as well as fine-grained span-level edits and their underlying intentions for 1,000 sentence pairs. It supports our data-driven analysis, which unveils the common strategies researchers practice when revising their papers. To scale up the analysis, we also develop automatic methods to extract revisions at the document, sentence, and word levels. A neural CRF sentence alignment model trained on our corpus achieves 93.8 F1, enabling reliable matching of sentences between different versions. We formulate edit extraction as a span alignment problem, and our proposed method extracts more fine-grained and explainable edits than the commonly used diff algorithm. An intention classifier trained on our dataset achieves 78.9 F1 on the fine-grained intent classification task. Our data and system are released at tiny.one/arxivedits.