Fault localization (FL) is an important first step in software debugging and is still mostly manual in current practice. Many methods have been proposed over the years to automate FL, including information retrieval (IR)-based techniques, which localize a fault based on the similarity between the bug report and the source code. Newer variations of IR-based FL (IRFL) also leverage the history of bug reports during localization. However, all existing IRFL techniques limit themselves to the current project's data (local data). In this study, we introduce Globug, an IRFL framework consisting of methods that use models pre-trained on global data (extracted from open-source benchmark projects). In Globug, we investigate two heuristics: (a) the effect of global data on a state-of-the-art IRFL technique, namely BugLocator, and (b) the application of a word embedding technique (Doc2Vec) together with global data. Our large-scale experiment on 51 software projects shows that using global data improves BugLocator by 6.6% and 4.8% on average in terms of MRR (Mean Reciprocal Rank) and MAP (Mean Average Precision), respectively, with improvements of over 14% in a majority of the cases (64% and 54% in terms of MRR and MAP, respectively). This improvement is significant compared to the improvement rates that five other state-of-the-art IRFL tools provide over BugLocator. In addition, training the models globally is a one-time offline task that adds no overhead to BugLocator's run-time fault localization. Our study, however, shows that a word embedding-based global solution did not further improve the results.
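For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how MRR and MAP are commonly computed for a file-level fault localizer. It is not taken from Globug or BugLocator. The function names, the file names, and the choice to normalize average precision by the total number of truly buggy files are assumptions made for illustration, and some IRFL studies normalize differently.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranked: List[str], buggy: Set[str]) -> float:
    """Return 1 / rank of the first truly buggy file, or 0.0 if none is ranked."""
    for position, path in enumerate(ranked, start=1):
        if path in buggy:
            return 1.0 / position
    return 0.0


def average_precision(ranked: List[str], buggy: Set[str]) -> float:
    """Average precision@k over the ranks k at which a buggy file appears.

    Assumption: the sum is divided by the total number of buggy files.
    """
    hits = 0
    precision_sum = 0.0
    for position, path in enumerate(ranked, start=1):
        if path in buggy:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(buggy) if buggy else 0.0


def evaluate(rankings: Dict[str, List[str]],
             ground_truth: Dict[str, Set[str]]) -> Dict[str, float]:
    """Average both metrics over all bug reports (keys of `rankings`)."""
    if not rankings:
        return {"MRR": 0.0, "MAP": 0.0}
    rr = [reciprocal_rank(r, ground_truth[bug]) for bug, r in rankings.items()]
    ap = [average_precision(r, ground_truth[bug]) for bug, r in rankings.items()]
    return {"MRR": sum(rr) / len(rr), "MAP": sum(ap) / len(ap)}


if __name__ == "__main__":
    # Hypothetical bug reports and file names, for illustration only.
    rankings = {
        "bug-1": ["A.java", "B.java", "C.java"],
        "bug-2": ["D.java", "E.java", "F.java"],
    }
    ground_truth = {
        "bug-1": {"B.java"},
        "bug-2": {"D.java", "F.java"},
    }
    print(evaluate(rankings, ground_truth))
```

In this toy input, the first report has its only buggy file at rank 2, and the second has buggy files at ranks 1 and 3. MRR therefore averages 1/2 and 1, and MAP averages 1/2 and (1/1 + 2/3)/2.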