Duplicate bug report detection is the task of identifying similar bug reports in a bug tracking system. Knowing that a bug has already been reported reduces the effort spent on debugging and on identifying the root cause. Rule-based and query-based solutions recommend a long list of potentially similar bug reports with no clear ranking, and triage engineers are reluctant to spend time going through such a list. This deters the use of duplicate bug report retrieval solutions. In this paper, we propose a solution that combines several NLP techniques. Our approach considers both unstructured and structured attributes of a bug report, such as summary, description, severity, impacted products, platforms, and categories. It uses a custom data transformer, a deep neural network, and a non-generalizing machine learning method to retrieve existing duplicate bug reports. We performed numerous experiments on large data sources containing thousands of bug reports and show that the proposed solution achieves a recall@5 of 70%.
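As an illustration of the retrieval step and the recall@5 metric mentioned above, the following is a minimal sketch and not the implementation used in the paper. It assumes that each bug report has already been mapped to a numeric vector (here, hypothetical toy vectors stand in for the output of the data transformer and neural network), that retrieval is a cosine-similarity nearest-neighbor lookup, and that recall@k counts a query as a hit when at least one known duplicate appears among its top k results. The names `cosine`, `top_k`, `recall_at_k`, and all bug identifiers are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the ids of the k stored reports most similar to the query vector."""
    ranked = sorted(index, key=lambda bug_id: cosine(query, index[bug_id]), reverse=True)
    return ranked[:k]


def recall_at_k(retrieved: Dict[str, List[str]], duplicates: Dict[str, Set[str]], k: int) -> float:
    """Fraction of queries with at least one known duplicate in their top-k list (assumed definition)."""
    hits = sum(1 for q, ranked in retrieved.items() if duplicates[q] & set(ranked[:k]))
    return hits / len(retrieved) if retrieved else 0.0


# Hypothetical toy vectors standing in for learned bug report representations.
index = {
    "BUG-1": [1.0, 0.0, 0.0],
    "BUG-2": [0.0, 1.0, 0.0],
    "BUG-3": [0.0, 0.0, 1.0],
}
queries = {
    "NEW-1": [0.9, 0.1, 0.0],
    "NEW-2": [0.0, 0.2, 0.9],
}
# Hypothetical ground truth: which stored report each new report duplicates.
duplicates = {"NEW-1": {"BUG-1"}, "NEW-2": {"BUG-2"}}

retrieved = {q: top_k(vec, index, k=2) for q, vec in queries.items()}
score_at_1 = recall_at_k(retrieved, duplicates, k=1)
score_at_2 = recall_at_k(retrieved, duplicates, k=2)
print(retrieved, score_at_1, score_at_2)
```

In this toy setup the known duplicate of `NEW-2` is ranked second, so it counts as a miss at k=1 and a hit at k=2, which is why a short ranked list such as the top 5 can be more useful to a triage engineer than a long unranked one.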