Large language models (LLMs) have transformed sentiment analysis, yet balancing accuracy, efficiency, and explainability remains a critical challenge. This study presents the first comprehensive evaluation of DeepSeek-R1, an open-source reasoning model, against OpenAI's GPT-4o and GPT-4o-mini. We test the full 671B-parameter model and its distilled variants, systematically documenting few-shot learning curves. Our experiments show that DeepSeek-R1 achieves a 91.39\% F1 score on 5-class sentiment classification and 99.31\% accuracy on binary tasks with just 5 shots, an eightfold improvement in few-shot efficiency over GPT-4o. Architecture-specific distillation effects emerge: a 32B Qwen2.5-based model outperforms the 70B Llama-based variant by 6.69 percentage points. While its reasoning process reduces throughput, DeepSeek-R1 offers superior explainability via transparent, step-by-step reasoning traces, establishing it as a powerful, interpretable open-source alternative.
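For reference, the 5-class F1 figure reported above is conventionally a macro-averaged F1 (the unweighted mean of per-class F1 scores). The following is a minimal sketch of that metric under this assumption; it is not the paper's evaluation code, and the labels shown are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        # Count true positives, false positives, and false negatives for class c
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical 5-class sentiment labels for illustration
y_true = ["pos", "neg", "neutral", "very_pos", "pos", "very_neg"]
y_pred = ["pos", "neg", "neutral", "pos", "pos", "very_neg"]
print(round(macro_f1(y_true, y_pred), 4))  # → 0.76
```

Macro averaging weights every sentiment class equally, so rare classes (e.g. "very negative") influence the score as much as common ones, which is why it is the standard choice for imbalanced multi-class sentiment benchmarks.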