Large Language Models (LLMs) have demonstrated strong capabilities in code generation, potentially boosting developer productivity. However, their widespread adoption remains limited by high computational costs, significant energy demands, and security risks such as data leakage and adversarial attacks. As a lighter-weight alternative, Small Language Models (SLMs) offer faster inference, lower deployment overhead, and better adaptability to domain-specific tasks, making them an attractive option for real-world applications. While prior research has benchmarked LLMs on competitive programming tasks, such evaluations often focus narrowly on metrics like Elo scores or pass rates, overlooking deeper insights into model behavior, failure patterns, and problem diversity. Furthermore, the potential of SLMs to tackle complex tasks such as competitive programming remains underexplored. In this study, we benchmark five open SLMs (LLAMA 3.2 3B, GEMMA 2 9B, GEMMA 3 12B, DEEPSEEK-R1 14B, and PHI-4 14B) on 280 Codeforces problems spanning Elo ratings from 800 to 2100 and covering 36 distinct topics. All models were tasked with generating Python solutions. PHI-4 14B achieved the best performance among the SLMs, with a pass@3 of 63.6%, approaching the proprietary O3-MINI-HIGH (86.8%). In addition, we evaluated PHI-4 14B on C++ and found that combining outputs from both languages raises its aggregated pass@3 to 73.6%. A qualitative analysis of PHI-4 14B's incorrect outputs revealed that some failures stemmed from minor implementation issues, such as mishandled edge cases or faulty variable initialization, rather than deeper reasoning flaws.
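The abstract does not define its metrics, but pass@k is presumably the standard unbiased estimator of Chen et al. (2021). The Python sketch below illustrates that estimator, together with one plausible reading of the cross-language aggregation (a problem counts as solved if any of its Python or C++ samples passes); the function names and the aggregation rule are illustrative assumptions, not the paper's actual evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples, drawn without replacement from n
    generations of which c are correct, passes the tests."""
    if n - c < k:
        # Fewer incorrect samples than k: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def aggregated_pass_rate(py_solved: list[bool], cpp_solved: list[bool]) -> float:
    """Assumed aggregation rule (not from the paper): per problem, count
    it as solved if any Python or any C++ sample passed, then report the
    solved fraction over all problems."""
    solved = [p or c for p, c in zip(py_solved, cpp_solved, strict=True)]
    return sum(solved) / len(solved)
```

For example, a model that produces 1 correct solution out of 3 samples for a problem yields pass_at_k(3, 1, 3) = 1.0 for that problem, and the reported pass@3 would be the average over all 280 problems.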