The rapid iteration cycles of modern live-service games make regression testing indispensable for maintaining quality and stability. However, existing regression testing approaches face critical limitations, especially in common gray-box settings where full source code access is unavailable: they heavily rely on manual effort for test case construction, struggle to maintain growing suites plagued by redundancy, and lack efficient mechanisms for prioritizing relevant tests. These challenges result in excessive testing costs, limited automation, and insufficient bug detection. To address these issues, we propose SAGE, a semanticaware regression testing framework for gray-box game environments. SAGE systematically addresses the core challenges of test generation, maintenance, and selection. It employs LLM-guided reinforcement learning for efficient, goal-oriented exploration to automatically generate a diverse foundational test suite. Subsequently, it applies a semantic-based multi-objective optimization to refine this suite into a compact, high-value subset by balancing cost, coverage, and rarity. Finally, it leverages LLM-based semantic analysis of update logs to prioritize test cases most relevant to version changes, enabling efficient adaptation across iterations. We evaluate SAGE on two representative environments, Overcooked Plus and Minecraft, comparing against both automated baselines and human-recorded test cases. Across all environments, SAGE achieves superior bug detection with significantly lower execution cost, while demonstrating strong adaptability to version updates.
翻译:现代实时服务游戏的快速迭代周期使得回归测试成为维持质量和稳定性的必要环节。然而,现有回归测试方法面临关键局限,尤其是在无法获取完整源代码的常见灰盒场景中:它们严重依赖人工构建测试用例,难以维护因冗余而日益膨胀的测试集,且缺乏高效机制来优先处理相关测试。这些挑战导致测试成本过高、自动化程度有限以及缺陷检测不足。为解决这些问题,我们提出SAGE,一种面向灰盒游戏环境的语义感知回归测试框架。SAGE系统性地应对测试生成、维护与选择的核心挑战。它采用LLM引导的强化学习进行高效、目标导向的探索,自动生成多样化的基础测试集;随后通过基于语义的多目标优化,在成本、覆盖率和稀有性之间取得平衡,将测试集精炼为紧凑的高价值子集;最后,利用基于LLM的更新日志语义分析,优先处理与版本变更最相关的测试用例,实现跨迭代的高效适配。我们在两个代表性环境(Overcooked Plus和Minecraft)上评估SAGE,并与自动化基线及人工记录的测试用例进行比较。在所有环境中,SAGE以显著更低的执行成本实现了更优的缺陷检测能力,同时展现出对版本更新的强大适应性。