Many tasks revolve around editing a document, whether code or text. We formulate the revision similarity problem to unify a wide range of machine learning evaluation problems whose goal is to assess a revision to an existing document. We observe that revisions usually change only a small portion of an existing document, so the existing document and its immediate revisions share most of their content. We formulate five adequacy criteria for revision similarity measures, designed to align them with human judgement. We show that popular pairwise measures, like BLEU, fail to meet these criteria because their scores are dominated by the shared content: they report high similarity between two revisions when humans would assess them as quite different. This is a fundamental flaw that we address. We propose a novel static measure, Excision Score (ES), which uses longest common subsequence (LCS) alignment to remove the content that the ground-truth and predicted revisions share with the existing document, before comparing only the remaining divergent regions. This is analogous to a surgeon creating a sterile field to focus on the work area. We use approximation to speed the standard cubic LCS computation to quadratic. In code-editing evaluation, where static measures are often used as a cheap proxy for passing tests, we demonstrate that ES surpasses existing measures. When aligned with test execution on HumanEvalFix, ES improves over its nearest competitor, SARI, by 12% Pearson correlation, and over standard measures like BLEU by more than 21%. The key criterion is invariance to shared context: when we perturb HumanEvalFix to increase the shared context, ES's improvement over SARI grows to 20%, and to more than 30% over standard measures. ES also handles corner cases that other measures do not, such as correctly aligning moved code blocks and appropriately rewarding matching insertions or deletions.
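To make the excision idea concrete, here is a minimal, illustrative Python sketch, not the paper's implementation: it uses difflib's matching blocks as a rough stand-in for the LCS alignment, and a simple sequence ratio in place of ES's scoring of the divergent regions; the function names are ours.

```python
# Sketch of the excision idea: strip tokens a revision shares with the
# original document, then compare only the divergent remainders.
from difflib import SequenceMatcher

def excise_shared(original_tokens, revision_tokens):
    """Return the revision tokens that do NOT align with the original."""
    sm = SequenceMatcher(a=original_tokens, b=revision_tokens, autojunk=False)
    kept, prev_end = [], 0
    for _, b_start, size in sm.get_matching_blocks():
        kept.extend(revision_tokens[prev_end:b_start])  # keep divergent span
        prev_end = b_start + size                        # skip shared span
    return kept

def excision_similarity(original, ground_truth, prediction):
    """Compare only the divergent regions of the two revisions."""
    gt_div = excise_shared(original, ground_truth)
    pred_div = excise_shared(original, prediction)
    if not gt_div and not pred_div:
        return 1.0  # both revisions leave the original unchanged
    return SequenceMatcher(a=gt_div, b=pred_div, autojunk=False).ratio()

# Example: the large shared context no longer dominates the score.
original     = "def add(a, b):\n    return a - b\n".split()
ground_truth = "def add(a, b):\n    return a + b\n".split()
prediction   = "def add(a, b):\n    return a * b\n".split()
print(excision_similarity(original, ground_truth, prediction))  # 0.0
```

In this toy example the two revisions make different one-token edits, so the score is 0.0, whereas a pairwise measure computed over the full revisions would be inflated by the shared function signature and body.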