Crowdsourced model evaluation platforms such as Chatbot Arena allow humans to assess the quality of model responses in real time. In the coding domain, however, manually examining the quality of LLM-generated content is extremely challenging, as it requires understanding long chunks of raw code and mentally simulating their execution. To address this challenge, we introduce BigCodeArena, an open human evaluation platform for code generation backed by a comprehensive, on-the-fly execution environment. Built on top of Chatbot Arena, BigCodeArena enables the execution of LLM-generated code and allows humans to interact with both the execution process and its outcomes. We collected over 14,000 raw code-centric conversation sessions across 10 widely used LLMs, spanning 10 programming languages and 8 types of execution environments. Among these conversations, we identified more than 4,700 multi-turn samples with pairwise human preferences. Further analysis uncovers underexplored preferences across LLMs in fine-grained domains characterized by task, language, and framework. To systematically examine the code understanding and generation capabilities of frontier LLMs, we curated two benchmarks from the collected data: BigCodeReward and AutoCodeArena. For BigCodeReward, we post-processed the 4,700 conversations to evaluate the consistency between reward models and human preferences. The evaluation shows that most LLMs judge coding preferences more reliably when execution results are available. Inspired by these findings, we propose AutoCodeArena, an automatic Elo rating benchmark designed to assess the coding quality of LLMs without human involvement. We find that proprietary LLMs such as GPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead in code generation among recently released models.
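To make the Elo-style ranking concrete, the sketch below derives ratings from a handful of hypothetical pairwise preference outcomes. It is a minimal illustration only: the K-factor of 32, the base rating of 1000, and the sequential online updates are assumptions chosen for the example, not AutoCodeArena's exact procedure (arena-style leaderboards often fit a Bradley-Terry model over all battles instead).

```python
from collections import defaultdict

def elo_ratings(battles, k=32, base=1000):
    """Compute Elo-style ratings from pairwise preference outcomes.

    `battles` is an iterable of (model_a, model_b, winner) tuples,
    where winner is "a", "b", or "tie".
    """
    ratings = defaultdict(lambda: float(base))
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the Elo logistic model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        # Symmetric update: what model_a gains, model_b loses (and vice versa).
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

# Example with three hypothetical pairwise judgments.
print(elo_ratings([
    ("gpt-5", "claude-sonnet-4", "a"),
    ("claude-opus-4", "gpt-5", "tie"),
    ("claude-sonnet-4", "claude-opus-4", "b"),
]))
```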