Automating the adaptation of software engineering (SE) research artifacts across datasets is essential for scalability and reproducibility, yet it remains largely unstudied. Recent advances in large language model (LLM)-based multi-agent systems, such as GitHub Copilot's agent mode, promise to automate complex development workflows through coordinated reasoning, code generation, and tool interaction. This paper presents the first empirical study of how state-of-the-art multi-agent systems perform on dataset adaptation tasks. We evaluate Copilot, backed by GPT-4.1 and Claude Sonnet 4, on adapting SE research artifacts from benchmark repositories including ROCODE and LogHub2.0. Through a five-stage evaluation pipeline (file comprehension, code editing, command generation, validation, and final execution), we measure success rates, analyze failure patterns, and assess prompt-based interventions designed to improve agent performance. Results show that current systems can identify key files and generate partial adaptations but rarely produce functionally correct implementations. Prompt-level interventions, especially providing execution error messages and reference code, substantially improve structural similarity to ground truth (from 7.25% to 67.14%), highlighting the importance of contextual and feedback-driven guidance. Our findings reveal both the promise and the limitations of today's multi-agent LLM systems for dataset adaptation, and suggest concrete directions for building more reliable, self-correcting agents in future SE research.