To optimize the reasoning and problem-solving capabilities of Large Language Models (LLMs), we propose a novel cloud-edge collaborative architecture that enables a structured multi-agent prompting framework. This framework comprises three specialized components: GuideLLM, a lightweight model deployed at the edge to provide methodological guidance; SolverLLM, a more powerful model hosted in the cloud and responsible for generating code solutions; and JudgeLLM, an automated evaluator that assesses solution correctness and quality. To demonstrate the effectiveness of this architecture in realistic settings, we introduce RefactorCoderQA, a comprehensive benchmark for measuring and improving LLM performance on multi-domain coding tasks. Motivated by the limited domain coverage of existing benchmarks, RefactorCoderQA systematically spans multiple technical domains, including Software Engineering, Data Science, Machine Learning, and Natural Language Processing, using authentic coding challenges sourced from Stack Overflow. We further present RefactorCoder-MoE, a mixture-of-experts (MoE) code language model based on DeepSeek-Coder-7B-Instruct and fine-tuned on RefactorCoderQA with QLoRA for domain-specific coding question answering. Extensive experiments show that RefactorCoder-MoE outperforms all evaluated open-source and commercial baselines, reaching an overall accuracy of 76.84%.
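To make the three-agent workflow concrete, the following Python sketch illustrates one plausible orchestration of the GuideLLM, SolverLLM, and JudgeLLM stages. All names here (`query_model`, the model endpoint identifiers, and the prompt templates) are hypothetical placeholders for illustration; the paper does not prescribe a specific API.

```python
# Minimal sketch of the GuideLLM -> SolverLLM -> JudgeLLM pipeline.
# Model identifiers and prompt wording are illustrative assumptions,
# not the paper's actual implementation.
from dataclasses import dataclass


@dataclass
class Verdict:
    correct: bool
    feedback: str


def query_model(model: str, prompt: str) -> str:
    """Placeholder for a call to an edge- or cloud-hosted LLM endpoint."""
    raise NotImplementedError  # e.g., an HTTP request to a serving backend


def solve(problem: str) -> tuple[str, Verdict]:
    # 1) GuideLLM (lightweight, edge): produce methodological guidance.
    guidance = query_model(
        "guide-llm-edge",
        f"Outline a step-by-step solution strategy for:\n{problem}",
    )
    # 2) SolverLLM (powerful, cloud): generate code conditioned on the guidance.
    solution = query_model(
        "solver-llm-cloud",
        f"Guidance:\n{guidance}\n\nTask:\n{problem}\n\nWrite the code solution:",
    )
    # 3) JudgeLLM: automatically assess correctness and quality.
    verdict_text = query_model(
        "judge-llm",
        f"Task:\n{problem}\n\nSolution:\n{solution}\n\n"
        "Answer 'yes' or 'no': is the solution correct? Then explain briefly.",
    )
    verdict = Verdict(
        correct=verdict_text.strip().lower().startswith("yes"),
        feedback=verdict_text,
    )
    return solution, verdict
```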
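Likewise, a minimal sketch of how a base code model can be adapted with QLoRA is shown below, using the Hugging Face `transformers` and `peft` libraries. The checkpoint id, LoRA rank, and target modules are assumptions for illustration; the paper's reported configuration may differ.

```python
# Sketch of QLoRA adaptation: a 4-bit NF4-quantized base model with trainable
# LoRA adapters. Hyperparameters below are illustrative, not the paper's.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "deepseek-ai/deepseek-coder-7b-instruct-v1.5"  # assumed checkpoint id

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # QLoRA: quantize the frozen base
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the LoRA adapters are updated
```

With this setup, standard supervised fine-tuning on RefactorCoderQA-style question-answer pairs updates only the low-rank adapters, keeping the quantized base weights frozen.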