ChatGPT and GPT-4 have recently emerged and attracted immense global attention for their unparalleled performance in language processing. Despite demonstrating impressive capabilities on a variety of open-domain tasks, their adequacy in highly specific fields such as radiology remains untested. Owing to its specificity and complexity, radiology presents unique linguistic phenomena distinct from open-domain data. Assessing the performance of large language models (LLMs) in such specialized domains is crucial not only for a thorough evaluation of their overall capability but also for gaining insight into future model design: whether models should be generic or domain-specific. To this end, in this study we evaluate ChatGPT/GPT-4 on a radiology natural language inference (NLI) task and compare them against models fine-tuned specifically on task-related data samples. We also conduct a comprehensive investigation of ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty. Our results show that 1) GPT-4 outperforms ChatGPT on the radiology NLI task; and 2) specifically fine-tuned models require significant amounts of data samples to achieve performance comparable to ChatGPT/GPT-4. These findings demonstrate that constructing a generic model capable of solving diverse tasks across different domains is feasible.