ChatGPT and GPT-4 have recently emerged and attracted immense global attention for their unparalleled performance in language processing. Despite demonstrating impressive capabilities on a variety of open-domain tasks, their adequacy in highly specific fields such as radiology remains untested. Owing to its specificity and complexity, radiology presents unique linguistic phenomena distinct from open-domain data. Assessing the performance of large language models (LLMs) in such specialized domains is crucial not only for a thorough evaluation of their overall capability but also for gaining insight into future model design: whether models should be generic or domain-specific. To this end, in this study we evaluate ChatGPT/GPT-4 on a radiology natural language inference (NLI) task and compare them against models fine-tuned specifically on task-related data samples. We also conduct a comprehensive investigation of ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty. Our results show that 1) GPT-4 outperforms ChatGPT on the radiology NLI task; and 2) specifically fine-tuned models require significant amounts of data samples to achieve performance comparable to ChatGPT/GPT-4. These findings demonstrate that constructing a generic model capable of solving diverse tasks across different domains is feasible.