Version control relies on commit messages to convey the rationale for code changes, but these messages are often low quality and, more critically, inconsistent with their diffs, a problem known as message-code inconsistency (MCI). MCIs mislead reviewers, hinder maintenance, contaminate research datasets, and may obscure security patches. Yet no dedicated benchmark exists to evaluate models for MCI detection. We introduce CODEFUSE-COMMITEVAL, the first benchmark designed for MCI detection with large language models (LLMs). Building on the ApacheCM dataset, chosen for its diversity and quality, we generate seven types of inconsistent messages through rule-guided mutations of originally consistent commits and apply two-fold validation to verify both positive and negative samples. Using this labeled dataset of message-diff pairs, we evaluate six state-of-the-art open-source LLMs under a vanilla setting and with three augmentation strategies: few-shot prompting, chain-of-thought, and extended context. Results show that models detect inconsistent commits more reliably than consistent ones (average Recall 85.95%, Precision 80.28%, Specificity 63.8%); gpt-oss-20B performs best overall but consumes more than twice as many tokens as the other models. Augmentation effects vary: adjacent context helps larger models but adds noise for smaller ones; few-shot prompting improves accuracy and reduces token use, yet increases universally incorrect predictions; chain-of-thought boosts precision and specificity at the cost of recall and higher token consumption. Type-wise analysis reveals higher detectability for component, file-path, and operation inconsistencies, but lower accuracy and higher token cost for intent-level "purpose" inconsistencies. CODEFUSE-COMMITEVAL provides a rigorous foundation for measuring, comparing, and advancing MCI detection, highlighting the need for richer context and balanced data to capture high-level semantic gaps.
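For reference, a minimal sketch of the reported metrics, assuming (as the recall/specificity framing above suggests) that inconsistent commits are the positive class and consistent commits the negative class:

\begin{align*}
\text{Recall} &= \frac{TP}{TP + FN} &&\text{(inconsistent commits correctly flagged)}\\
\text{Precision} &= \frac{TP}{TP + FP} &&\text{(flagged commits that are truly inconsistent)}\\
\text{Specificity} &= \frac{TN}{TN + FP} &&\text{(consistent commits correctly accepted)}
\end{align*}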