AI agents assist developers with a range of coding tasks, such as developing new features, fixing bugs, and reviewing code. Developers can write a GitHub issue and assign it to an AI agent such as Copilot for implementation. Based on the issue and its related discussion, the agent devises an implementation plan and executes it. However, the performance of AI agents and LLMs depends heavily on the input they receive. For instance, a GitHub issue that is unclear or poorly scoped may not lead to an implementation that is eventually merged. GitHub Copilot provides a set of best-practice recommendations, but they are limited and high-level. In this paper, we build a set of 32 detailed criteria for measuring the quality of GitHub issues and their suitability for AI agents. We compare GitHub issues that lead to a merged pull request with those that lead to a closed pull request. We then build an interpretable machine learning model that predicts the likelihood of a GitHub issue resulting in a merged pull request. We observe that merged pull requests tend to originate from issues that are shorter and well scoped, that give clear guidance and hints about the relevant artifacts, and that provide guidance on how to perform the implementation. Issues with external references, including configuration, context setup, dependencies, or external APIs, are associated with lower merge rates. The model, which achieves a median AUC of 72%, helps users identify how to improve a GitHub issue so that it is more likely to result in a merged pull request by Copilot. Our results shed light on quality metrics relevant to writing GitHub issues and motivate future studies to further investigate the writing of GitHub issues as a first-class software engineering activity in the era of AI teammates.