Extract Method refactoring (EMR) is important for improving code readability and maintainability, yet automating it remains challenging, and in practice it is still performed largely by hand. Recent advances in open-source, resource-efficient Large Language Models (LLMs) offer promising new approaches to automating such high-level tasks. In this work, we critically evaluate five state-of-the-art open-source LLMs, spanning 3B to 8B parameters, on the EMR task for Python code. We systematically assess functional correctness and code quality using automated metrics, and we investigate the impact of prompting strategy by comparing one-shot prompting to a Recursive Criticism and Improvement (RCI) approach. RCI-based prompting consistently outperforms one-shot prompting in both test pass rates and refactoring quality. The best-performing models, Deepseek-Coder-RCI and Qwen2.5-Coder-RCI, achieve test pass percentage (TPP) scores of 0.829 and 0.808 while reducing lines of code (LOC) per method from 12.103 to 6.192 and 5.577, and cyclomatic complexity (CC) from 4.602 to 3.453 and 3.294, respectively. A developer survey of RCI-generated refactorings shows over 70% acceptance, with Qwen2.5-Coder rated highest across all evaluation criteria. In contrast, the original code was rated below the neutral point, particularly for readability and maintainability, underscoring the benefits of automated refactoring guided by quality-focused prompts. While traditional metrics such as CC and LOC provide useful signals, they often diverge from human judgment, emphasizing the need for human-in-the-loop evaluation. Our open-source benchmark offers a foundation for future research on automated refactoring with LLMs.
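For readers unfamiliar with the task, the following minimal before/after sketch (not drawn from the benchmark; the function names and data are illustrative) shows the kind of Extract Method refactoring evaluated above, where pulling a computation into its own helper reduces per-method LOC and CC:

```python
# Before: one method that both computes and reports a total
# (hypothetical example; not from the paper's dataset).
def report(orders):
    total = 0
    for o in orders:
        total += o["price"] * o["qty"]
    print(f"Total: {total:.2f}")

# After: the summation is extracted into its own method, so each
# method is shorter and has lower cyclomatic complexity.
def order_total(orders):
    return sum(o["price"] * o["qty"] for o in orders)

def report(orders):
    print(f"Total: {order_total(orders):.2f}")
```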
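To make the RCI prompting strategy concrete, here is a minimal sketch of a critique-then-improve loop. It assumes a hypothetical `query_llm` callable and invented prompt wording; the paper's actual prompts, round counts, and model interface may differ:

```python
# Minimal sketch of a Recursive Criticism and Improvement (RCI) loop
# for Extract Method refactoring. `query_llm`, the prompts, and
# `max_rounds` are assumptions for illustration only.

def rci_extract_method(code: str, query_llm, max_rounds: int = 2) -> str:
    """Ask an LLM to apply Extract Method, then have it repeatedly
    critique and improve its own output."""
    # Initial request, comparable to the one-shot baseline.
    refactored = query_llm(
        f"Apply the Extract Method refactoring to this Python code:\n{code}"
    )
    for _ in range(max_rounds):
        # Criticism step: the model reviews its own refactoring.
        critique = query_llm(
            "Review the following refactoring for correctness and "
            f"readability, and list any problems:\n{refactored}"
        )
        # Improvement step: the model revises based on that critique.
        refactored = query_llm(
            "Improve the refactoring below to address this critique.\n"
            f"Critique:\n{critique}\nCode:\n{refactored}"
        )
    return refactored
```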