Large Language Models (LLMs) are increasingly used to refactor unit tests, improving readability and structure while preserving behavior. Evaluating such refactorings, however, remains difficult: metrics like CodeBLEU penalize beneficial renamings and edits, while semantic similarity measures overlook readability and modularity. We propose CTSES, a first step toward human-aligned evaluation of refactored tests. CTSES combines CodeBLEU, METEOR, and ROUGE-L into a composite score that balances semantics, lexical clarity, and structural alignment. Evaluated on more than 5,000 refactorings of Defects4J and SF110 tests generated by GPT-4o and Mistral-Large, CTSES reduces false negatives and provides more interpretable signals than individual metrics. Our emerging results indicate that CTSES offers a proof of concept for composite approaches and demonstrates their promise in bridging automated metrics and developer judgment.
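To make the composite idea concrete, the following is a minimal sketch of how per-test CodeBLEU, METEOR, and ROUGE-L scores could be blended into a single value. The weights and the function name `ctses_score` are illustrative assumptions, not the weighting scheme defined by the paper, and the metric values are assumed to be pre-computed by existing tooling.

```python
# Illustrative sketch of a composite score over pre-computed metric values.
# The equal-weight scheme below is an assumption for demonstration only;
# it is not the calibration used by CTSES in the paper.

def ctses_score(codebleu: float, meteor: float, rouge_l: float,
                weights: tuple[float, float, float] = (1/3, 1/3, 1/3)) -> float:
    """Combine three [0, 1]-scaled metric scores into one composite value."""
    w_cb, w_me, w_rl = weights
    total = w_cb + w_me + w_rl
    # Weighted average keeps the composite on the same [0, 1] scale.
    return (w_cb * codebleu + w_me * meteor + w_rl * rouge_l) / total


# Hypothetical usage: scores for one refactored test against its original.
print(ctses_score(codebleu=0.62, meteor=0.81, rouge_l=0.74))
```

Treating each metric as one dimension (semantics, lexical clarity, structural alignment) and averaging them is what allows a beneficial rename, which depresses CodeBLEU alone, to be offset by the lexical and structural signals.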