Retrieval-Augmented Generation (RAG) is emerging as a powerful technique for enhancing the capabilities of generative AI models by reducing hallucination. The growing prominence of RAG alongside Large Language Models (LLMs) has accordingly sparked interest in comparing the question-answering (QA) performance of different LLMs across domains. This study compares four open-source LLMs, Mistral-7b-instruct, LLaMa2-7b-chat, Falcon-7b-instruct, and Orca-mini-v3-7b, with OpenAI's widely used GPT-3.5 on QA tasks over the computer science literature, each supported by RAG. The evaluation uses accuracy and precision for binary questions, and ranking by a human expert, ranking by Google's Gemini model, and cosine similarity for long-answer questions. GPT-3.5 paired with RAG answers both binary and long-answer questions effectively, reaffirming its standing as an advanced LLM. Among the open-source LLMs, Mistral AI's Mistral-7b-instruct paired with RAG outperforms the rest on both question types. Orca-mini-v3-7b records the shortest average response latency of the open-source models, whereas Meta's LLaMa2-7b-chat records the longest. These results indicate that, given adequate infrastructure, open-source LLMs can keep pace with proprietary models such as GPT-3.5.
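For the long-answer metric, a minimal sketch of cosine-similarity scoring follows. The abstract does not specify the embedding pipeline used, so the vectors below are placeholders; in practice they would come from a sentence-embedding model applied to the generated and reference answers.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors: a.b / (|a||b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical usage: embed the reference answer and the model's answer
# with any sentence-embedding model, then compare the two vectors.
reference = np.array([0.12, 0.88, 0.45])  # placeholder embedding
generated = np.array([0.10, 0.80, 0.50])  # placeholder embedding
print(f"cosine similarity: {cosine_similarity(reference, generated):.3f}")
```

A score near 1 indicates the generated answer is semantically close to the reference; scores near 0 indicate unrelated content.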