In recent years, large language models (LLMs) have been widely integrated into software engineering workflows, supporting tasks such as code generation. However, while these models often produce functionally correct outputs, we still lack a systematic understanding and evaluation of their non-functional qualities. Existing studies focus mainly on whether generated code passes the tests rather than whether it passes with quality. Guided by the ISO/IEC 25010 quality model, this study conducts three complementary investigations: a systematic review of 108 papers, two industry workshops with practitioners from multiple organizations, and an empirical analysis of patching real-world software issues with three LLMs. Motivated by insights from both the literature and practitioners, the empirical study examines the quality of generated patches in terms of security, maintainability, and performance efficiency. Across the literature, we find that security and performance efficiency dominate academic attention, while maintainability and other quality attributes are understudied. In contrast, industry experts prioritize maintainability and readability, warning that generated code may accelerate the accumulation of technical debt. In our evaluation of functionally correct patches generated by the three LLMs, improvements in one quality dimension often come at the cost of others, and runtime and memory results show high variance across models and optimization strategies. Overall, our findings reveal a mismatch among academic focus, industry priorities, and model performance, highlighting the urgent need to integrate quality assurance mechanisms into LLM code generation pipelines so that future generated code not only passes tests but truly passes with quality.