The Repeat Offenders: Characterizing and Predicting Extremely Bug-Prone Source Methods

Identifying the small subset of source code that repeatedly attracts bugs is critical for reducing long-term maintenance effort. We define ExtremelyBuggy methods as those involved in more than one bug fix and present the first large-scale study of their prevalence, characteristics, and predictability. Using a dataset of over 1.25 million methods from 98 open-source Java projects, we find that ExtremelyBuggy methods constitute only a tiny fraction of all methods, yet frequently account for a disproportionately large share of bugs. At their inception, these methods are significantly larger, more complex, less readable, and less maintainable than both singly-buggy and non-buggy methods. However, despite these measurable differences, a comprehensive evaluation of five machine learning models shows that early prediction of ExtremelyBuggy methods remains highly unreliable due to data imbalance, project heterogeneity, and the fact that many bugs emerge through subsequent evolution rather than initial implementation. To complement these quantitative findings, we conduct a thematic analysis of 265 ExtremelyBuggy methods, revealing recurring visual issues (e.g., confusing control flow, poor readability), contextual roles (e.g., core logic, data transformation, external resource handling), and common defect patterns (e.g., faulty conditionals, fragile error handling, misuse of variables). These results highlight the need for richer, evolution-aware representations of code and provide actionable insights for practitioners seeking to prioritize high-risk methods early in the development lifecycle.
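
The definition above reduces to a simple counting rule, sketched below in Python. This is only an illustration, not the study's actual pipeline: the input format (pairs of bug-fix commit ID and method ID), the function name `label_methods`, and the sample identifiers are all hypothetical, and the label strings simply mirror the three groups named in the abstract.

```python
from collections import Counter


def label_methods(all_methods, bug_fix_touches):
    """Label each method by the number of distinct bug-fix commits touching it.

    all_methods: iterable of method IDs (hypothetical identifiers).
    bug_fix_touches: iterable of (commit_id, method_id) pairs, one per
        bug-fix commit that modified the method (assumed input format).
    """
    # Deduplicate pairs so a method changed twice in one commit counts once.
    fix_counts = Counter(method for _, method in set(bug_fix_touches))

    labels = {}
    for method in all_methods:
        n = fix_counts[method]
        if n == 0:
            labels[method] = "non-buggy"
        elif n == 1:
            labels[method] = "singly-buggy"
        else:
            # More than one bug fix: the abstract's ExtremelyBuggy category.
            labels[method] = "ExtremelyBuggy"
    return labels


# Made-up example data, for illustration only.
methods = ["Parser.parse", "Cache.get", "Util.trim"]
touches = [
    ("c1", "Parser.parse"),
    ("c2", "Parser.parse"),
    ("c2", "Cache.get"),
]
labels = label_methods(methods, touches)
```

With this made-up input, `Parser.parse` is touched by two bug-fix commits and lands in the ExtremelyBuggy group, `Cache.get` is singly-buggy, and `Util.trim` is non-buggy.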
