Background: Bug reports are essential to the software development life cycle. They help developers track and resolve issues, but they are often difficult to process because of their complexity, which can delay resolution and degrade software quality. Aims: This study investigates whether large language models (LLMs) can help developers automatically decompose complex bug reports into smaller, self-contained units that are easier to understand and address. Method: We conducted an empirical study on 127 resolved privacy-related bug reports collected from Apache Jira. We evaluated ChatGPT and DeepSeek under different prompting strategies: we first tested both LLMs with zero-shot prompts, then applied improved few-shot prompts that include demonstrations, measuring their ability to decompose bug reports. Results: Our findings show that LLMs can decompose bug reports, but their overall performance still requires improvement and depends strongly on prompt quality. With zero-shot prompts, both studied LLMs (ChatGPT and DeepSeek) performed poorly. After prompt tuning, ChatGPT's true decomposition rate increased by 140\% and DeepSeek's by 163.64\%. Conclusions: LLMs show potential for helping developers analyze and decompose complex bug reports, but their accuracy and bug understanding still need improvement.
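The prompting setup can be pictured with a minimal sketch. The snippet below, written against an OpenAI-compatible chat API, shows how a zero-shot prompt and an improved few-shot prompt with a demonstration might be assembled for bug-report decomposition; the model name, system instruction, and demonstration pair are illustrative assumptions, not the prompts or data used in the study.

```python
# Minimal sketch (not the study's exact prompts): zero-shot vs. few-shot
# prompting for bug-report decomposition via an OpenAI-compatible chat API.
# Model name, system instruction, and the demonstration pair are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment; DeepSeek exposes a compatible endpoint

SYSTEM = (
    "You are a software maintenance assistant. Decompose the given bug report "
    "into smaller, self-contained sub-reports, each describing one issue."
)

# Hypothetical demonstration pair used only in the few-shot (improved) prompt.
DEMO_REPORT = "Login page leaks the session token in the URL and also crashes on empty input."
DEMO_DECOMPOSITION = (
    "1. Session token is exposed in the login URL (privacy leak).\n"
    "2. Login page crashes when the input fields are empty."
)

def decompose(report: str, few_shot: bool = False, model: str = "gpt-4o") -> str:
    """Ask the LLM to decompose a bug report; optionally prepend a demonstration."""
    messages = [{"role": "system", "content": SYSTEM}]
    if few_shot:
        messages += [
            {"role": "user", "content": f"Bug report:\n{DEMO_REPORT}"},
            {"role": "assistant", "content": DEMO_DECOMPOSITION},
        ]
    messages.append({"role": "user", "content": f"Bug report:\n{report}"})
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

# Example usage: compare zero-shot and few-shot decompositions of one Jira report.
# report_text = open("report.txt").read()
# print(decompose(report_text))                 # zero-shot
# print(decompose(report_text, few_shot=True))  # few-shot with demonstration
```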