Exploring and Unleashing the Power of Large Language Models in CI/CD Configuration Translation

Continuous Integration (CI) is a cornerstone of modern collaborative software development, and numerous CI platforms are available. Differences in maintenance overhead, reliability, and integration depth with code-hosting platforms make migration between CI platforms a common practice. A central step in migration is translating CI configurations, which is challenging because CI configurations are intrinsically complex and because the translator must understand the semantic differences and relationships across CI platforms. Recent advances in applying large language models (LLMs) to software engineering highlight their potential for CI configuration translation. In this paper, we present a study on LLM-based CI configuration translation, focusing on the migration from Travis CI to GitHub Actions. First, using 811 migration records, we quantify the effort involved and find that developers read an average of 38 lines of Travis configuration and write 58 lines of GitHub Actions configuration, with nearly half of the migrations requiring multiple commits. We further analyze translations produced by four LLMs and identify 1,121 issues grouped into four categories: logic inconsistencies (38%), platform discrepancies (32%), environment errors (25%), and syntax errors (5%). Finally, we evaluate three enhancement strategies and show that combining guideline-based prompting with iterative refinement achieves the best performance, reaching a Build Success Rate of 75.5%, nearly a threefold improvement over GPT-4o with a basic prompt.
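To make the best-performing combination more concrete, the following is a minimal sketch of how guideline-based prompting and iterative refinement might be wired together. It is an illustration under stated assumptions, not the procedure used in the study: the function `refine_translation`, its parameters, the prompt wording, and the stubs `fake_generate` and `fake_validate` are all hypothetical, and the stubs stand in for a real LLM call and a real build or lint check.

```python
from typing import Callable, Optional, Tuple


def refine_translation(
    travis_config: str,
    guidelines: str,
    generate: Callable[[str], str],            # hypothetical LLM call: prompt -> workflow text
    validate: Callable[[str], Optional[str]],  # hypothetical check: None if OK, else an error message
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Return the last candidate workflow and whether it passed validation."""
    # Guideline-based prompting: the guidelines are placed in the prompt itself.
    base_prompt = (
        "Translate this Travis CI configuration to GitHub Actions.\n"
        f"Guidelines:\n{guidelines}\n\n"
        f"Travis configuration:\n{travis_config}"
    )
    candidate = generate(base_prompt)

    # Iterative refinement: feed each reported problem back to the generator.
    for _ in range(max_rounds):
        error = validate(candidate)
        if error is None:
            return candidate, True
        candidate = generate(
            f"{base_prompt}\n\nPrevious attempt:\n{candidate}\n\n"
            f"Reported problem:\n{error}\nRevise the workflow to fix it."
        )
    # Check the final revision before reporting the outcome.
    return candidate, validate(candidate) is None


# Stand-in stubs (assumptions) so the sketch runs without a model or a CI runner.
def fake_generate(prompt: str) -> str:
    if "Reported problem" in prompt:
        return "jobs:\n  build:\n    runs-on: ubuntu-latest\n    steps: []"
    return "jobs:\n  build:\n    steps: []"


def fake_validate(workflow: str) -> Optional[str]:
    return None if "runs-on" in workflow else "job 'build' is missing 'runs-on'"


workflow, ok = refine_translation(
    "language: python\nscript: pytest",
    "Every job needs a runs-on key.",
    fake_generate,
    fake_validate,
)
```

In this sketch the validator is passed in as a callable, so a syntax check or an actual build run could be substituted without changing the loop; which feedback signal works best is a question for the evaluation rather than something this example settles.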
