Towards Realistic Project-Level Code Generation via Multi-Agent Collaboration and Semantic Architecture Modeling

In recent years, Large Language Models (LLMs) have achieved remarkable progress in automated code generation. In real-world software engineering, the growing demand for rapid iteration and continuous delivery underscores the importance of project-level code generation, where LLMs are expected to generate complete software projects directly from complex user requirements. Existing studies have made initial explorations but still face three key limitations: unrealistic datasets and unreliable evaluation metrics that fail to reflect real-world complexity; the semantic gap between human-written requirements and machine-interpretable structures; and difficulties in managing hierarchical dependencies and maintaining quality throughout the generation process. To address these limitations, we first introduce CodeProjectEval, a project-level code generation dataset built from 18 real-world repositories, averaging 12.7 files and 2,388.6 lines of code per task and supplemented with documentation and executable test cases for automatic evaluation. We further propose ProjectGen, a multi-agent framework that decomposes project generation into architecture design, skeleton generation, and code filling stages, with iterative refinement and memory-based context management. Within this framework, we introduce the Semantic Software Architecture Tree (SSAT), a structured and semantically rich representation that bridges user requirements and source code implementation. Experiments show that ProjectGen achieves state-of-the-art performance: it passes 52 of 124 test cases on DevBench, a small-scale project-level code generation dataset, a 57% improvement over the baseline approaches, and 310 test cases on CodeProjectEval, roughly a tenfold improvement over the baselines.
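
To make the idea of a tree that links requirements to code more concrete, here is a minimal sketch of what such a structure and the skeleton-generation step could look like. The abstract does not give the SSAT schema, so the `SSATNode` fields, the three node kinds, and the `render_skeleton` helper are all hypothetical names chosen for illustration, not ProjectGen's actual design.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SSATNode:
    """One node of a hypothetical semantic architecture tree (assumed schema)."""

    name: str
    kind: str  # "dir", "file", or "function" (illustrative categories only)
    description: str = ""  # natural-language semantics attached to the node
    children: list[SSATNode] = field(default_factory=list)


def render_skeleton(root: SSATNode) -> dict[str, str]:
    """Walk the tree and return {file path: skeleton source with stub bodies}."""
    files: dict[str, str] = {}

    def visit(node: SSATNode, prefix: str) -> None:
        if node.kind == "dir":
            # Directories only contribute a path segment.
            for child in node.children:
                visit(child, f"{prefix}{node.name}/")
        elif node.kind == "file":
            # Each function child becomes a stub whose docstring carries
            # the node's description; a later stage would fill the body.
            stubs = [
                f'def {fn.name}():\n'
                f'    """{fn.description}"""\n'
                f'    raise NotImplementedError\n'
                for fn in node.children
                if fn.kind == "function"
            ]
            header = f'"""{node.description}"""\n\n'
            files[prefix + node.name] = header + "\n\n".join(stubs)

    visit(root, "")
    return files


# Hypothetical example tree for a tiny calculator project.
example_tree = SSATNode(
    "calc",
    "dir",
    "A tiny calculator package",
    [
        SSATNode(
            "ops.py",
            "file",
            "Arithmetic operations.",
            [SSATNode("add", "function", "Return the sum of two numbers.")],
        )
    ],
)

example_files = render_skeleton(example_tree)
```

In this sketch, the architecture-design stage would produce the tree, `render_skeleton` stands in for skeleton generation, and a code-filling stage would replace each `raise NotImplementedError` stub. Keeping the description on every node is one plausible way to carry requirement semantics down to the function level.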
