Recently, DeepSeek has been the focus of attention both within and beyond the AI community. An interesting problem is how DeepSeek compares to other large language models (LLMs). There are many tasks an LLM can perform, and in this paper, we use the task of "predicting an outcome using a short text" for comparison. We consider two settings: an authorship classification setting and a citation classification setting. In the first, the goal is to determine whether a short text is written by a human or by AI. In the second, the goal is to classify a citation into one of four types using its textual content. For each experiment, we compare DeepSeek with $4$ popular LLMs: Claude, Gemini, GPT, and Llama. We find that, in terms of classification accuracy, DeepSeek outperforms Gemini, GPT, and Llama in most cases, but underperforms Claude. We also find that DeepSeek is comparatively slower than the others but has a low usage cost, while Claude is much more expensive than all the others. Finally, we find that in terms of similarity, the outputs of DeepSeek are most similar to those of Gemini and Claude (and among all $5$ LLMs, Claude and Gemini have the most similar outputs). In this paper, we also present a fully labeled dataset that we collected ourselves, and propose a recipe for using LLMs together with a recent dataset, MADStat, to generate new datasets. The datasets in our paper can be used as benchmarks for future studies of LLMs.