Large Language Models (LLMs) have demonstrated strong capabilities in code generation, potentially boosting developer productivity. However, their widespread adoption remains limited by high computational costs, significant energy demands, and security risks such as data leakage and adversarial attacks. As a lighter-weight alternative, Small Language Models (SLMs) offer faster inference, lower deployment overhead, and better adaptability to domain-specific tasks, making them an attractive option for real-world applications. While prior research has benchmarked LLMs on competitive programming tasks, such evaluations often focus narrowly on metrics like Elo scores or pass rates, overlooking deeper insights into model behavior, failure patterns, and problem diversity. Furthermore, the potential of SLMs to tackle complex tasks such as competitive programming remains underexplored. In this study, we benchmark five open SLMs (LLAMA 3.2 3B, GEMMA 2 9B, GEMMA 3 12B, DEEPSEEK-R1 14B, and PHI-4 14B) on 280 Codeforces problems spanning Elo ratings from 800 to 2100 and covering 36 distinct topics. All models were tasked with generating Python solutions. PHI-4 14B achieved the best performance among the SLMs, with a pass@3 of 63.6%, approaching the proprietary O3-MINI-HIGH (86.8%). In addition, we evaluated PHI-4 14B on C++ and found that combining outputs from both languages raises its aggregated pass@3 to 73.6%. A qualitative analysis of PHI-4 14B's incorrect outputs revealed that some failures stemmed from minor implementation issues, such as mishandled edge cases or faulty variable initialization, rather than deeper reasoning flaws.
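The abstract does not define its metrics, but pass@k is presumably the standard unbiased estimator of Chen et al. (2021). The Python sketch below illustrates that estimator, together with one plausible reading of the cross-language aggregation (a problem counts as solved if any of its Python or C++ samples passes); the function names and the aggregation rule are illustrative assumptions, not the paper's actual evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples, drawn without replacement from n
    generations of which c are correct, passes the tests."""
    if n - c < k:
        # Fewer incorrect samples than k: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def aggregated_pass_rate(py_solved: list[bool], cpp_solved: list[bool]) -> float:
    """Assumed aggregation rule (not from the paper): per problem, count
    it as solved if any Python or any C++ sample passed, then report the
    solved fraction over all problems."""
    solved = [p or c for p, c in zip(py_solved, cpp_solved, strict=True)]
    return sum(solved) / len(solved)
```

For example, a model that produces 1 correct solution out of 3 samples for a problem yields pass_at_k(3, 1, 3) = 1.0 for that problem, and the reported pass@3 would be the average over all 280 problems.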