In this paper, we evaluate the capability of LLM agents to generate code for real-world problems. Specifically, we explore code synthesis for microservice-based applications, a widely used architectural pattern. We define a standard template for specifying these applications, and we propose a metric for scoring the difficulty of a specification: the higher the score, the harder it is to generate code for that specification. Our experimental results show that agents using strong LLMs (such as o3-mini) perform fairly well on medium-difficulty specifications but poorly on more difficult ones. The drop in performance stems from more intricate business logic, greater use of external services, database integration, and the inclusion of non-functional capabilities such as authentication. We analyze the errors in LLM-synthesized code and report on the key challenges LLM agents face in generating code for these specifications. Finally, we show that a fine-grained approach to code generation improves the correctness of the generated code.
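As a purely illustrative sketch, and not the metric actually proposed in the paper, a difficulty score of this kind could be computed as a weighted sum over the difficulty drivers listed above (business-logic size, external services, database integration, and non-functional capabilities such as authentication). The feature set, the weights, and the `Spec` and `difficulty_score` names below are all our own assumptions for illustration.

```python
# Hypothetical illustration only: a difficulty score as a weighted sum of
# specification features. Features and weights are assumed, not the paper's.
from dataclasses import dataclass

@dataclass
class Spec:
    endpoints: int            # number of API endpoints in the specification
    external_services: int    # external services the application must call
    uses_database: bool       # whether persistent storage is required
    requires_auth: bool       # non-functional capability: authentication

def difficulty_score(spec: Spec) -> float:
    """Higher score means it is harder to synthesize code for the spec."""
    score = 1.0 * spec.endpoints
    score += 2.0 * spec.external_services
    score += 3.0 if spec.uses_database else 0.0
    score += 3.0 if spec.requires_auth else 0.0
    return score

# Example: 4 endpoints, 2 external services, a database, and authentication
print(difficulty_score(Spec(4, 2, True, True)))  # -> 14.0
```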