Identifying defects at commit check-in time helps prevent them from being introduced into software. Current defect identification approaches either rely on manually crafted features such as change metrics or involve training expensive machine learning or deep learning models. Because they rely on a complex underlying model, these approaches are often not explainable, which means developers cannot understand the models' predictions. An approach that is not explainable might not be adopted in real-life development environments because developers lack trust in its results. Furthermore, because of their extensive training process, these approaches cannot readily learn from new examples as they arrive, making them unsuitable for fast online prediction. To address these limitations, we propose an approach called IRJIT that employs information retrieval on source code and labels new commits as buggy or clean based on their similarity to past buggy or clean commits. Our approach is online and explainable: it can learn from new data without retraining, and developers can see the documents that support a prediction. Through an evaluation on 8 open-source projects, we show that IRJIT achieves AUC and F1 scores close to those of the state-of-the-art machine learning approach JITLine, without considerable retraining.
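The retrieval idea described above can be illustrated with a minimal sketch. It is not the IRJIT implementation: the abstract does not name the search engine, ranking function, or voting rule, so the TF-IDF weighting, cosine similarity, majority vote over the top k matches, and all names here (`CommitIndex`, `add`, `predict`, `k`) are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(code):
    """Split changed source code into lowercase identifier-like tokens."""
    return re.findall(r"[A-Za-z_]\w*", code.lower())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CommitIndex:
    """Hypothetical in-memory index of past commits with known labels."""

    def __init__(self):
        self.docs = []        # (commit_id, term frequencies, label)
        self.df = Counter()   # number of indexed commits containing each term

    def add(self, commit_id, changed_code, label):
        # "Learning" a new example is just indexing it: no model is retrained.
        tf = Counter(tokenize(changed_code))
        self.docs.append((commit_id, tf, label))
        self.df.update(tf.keys())

    def _weights(self, tf):
        # Smoothed TF-IDF (an assumed weighting, always positive).
        n = len(self.docs)
        return {t: c * (math.log((1 + n) / (1 + self.df[t])) + 1.0)
                for t, c in tf.items()}

    def predict(self, changed_code, k=3):
        """Return (label, evidence): majority label of the k most similar
        past commits, plus those commits as (score, commit_id, label)."""
        if not self.docs:
            raise ValueError("index is empty")
        query = self._weights(Counter(tokenize(changed_code)))
        scored = [(cosine(query, self._weights(tf)), commit_id, label)
                  for commit_id, tf, label in self.docs]
        evidence = sorted(scored, reverse=True)[:k]
        votes = Counter(label for _, _, label in evidence)
        return votes.most_common(1)[0][0], evidence


index = CommitIndex()
index.add("c1", "buffer[index + 1] = value; index = index + 1", "buggy")
index.add("c2", "logger.info(message); return result", "clean")
label, evidence = index.predict("buffer[index + 1] = other", k=1)
```

The `evidence` list is what makes such a prediction inspectable: a developer can open the retrieved commits and judge whether the match is meaningful. Calling `add` on each newly labeled commit keeps the index current, which is the sense in which a retrieval-based approach can work online.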