Recent advances in large language models (LLMs) have led to highly capable models such as OpenAI's ChatGPT. These models have exhibited exceptional performance on a variety of tasks, such as question answering, essay writing, and code generation. However, their effectiveness in the healthcare sector remains uncertain. In this study, we investigate the potential of ChatGPT to aid clinical text mining by examining its ability to extract structured information from unstructured healthcare texts, with a focus on biological named entity recognition and relation extraction. Our preliminary results indicate that employing ChatGPT directly for these tasks yields poor performance and raises privacy concerns associated with uploading patients' information to the ChatGPT API. To overcome these limitations, we propose a new training paradigm: generating a large volume of high-quality, labeled synthetic data with ChatGPT and fine-tuning a local model on it for the downstream task. Our method substantially improves downstream performance, raising the F1-score from 23.37% to 63.99% for named entity recognition and from 75.86% to 83.59% for relation extraction. Furthermore, generating data with ChatGPT can significantly reduce the time and effort required for data collection and labeling, and mitigates data privacy concerns. In summary, the proposed framework presents a promising solution for enhancing the applicability of LLMs to clinical text mining.
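The generate-then-fine-tune paradigm described above can be sketched in two steps: prompt the LLM for synthetic clinical sentences with inline entity annotations, then convert those annotations into token-level BIO labels for fine-tuning a local NER model. The following is a minimal illustrative sketch, not the paper's actual pipeline; the `[text](TYPE)` annotation format, the `build_prompt` helper, and the entity types shown are all assumptions introduced here for illustration.

```python
# Hypothetical sketch of the generate-then-fine-tune paradigm.
# Step 1: build a prompt asking the LLM for inline-annotated synthetic sentences.
# Step 2: parse annotations like "[metformin](Drug)" into (token, BIO-tag) pairs,
# the format a local token-classification model is typically fine-tuned on.
import re

def build_prompt(entity_types, n_sentences=5):
    """Prompt (assumed format) asking an LLM for synthetic labeled sentences."""
    return (
        f"Generate {n_sentences} synthetic clinical sentences. Mark each entity "
        f"inline as [text](TYPE), using only these types: {', '.join(entity_types)}."
    )

# Matches inline annotations of the form [entity text](EntityType).
ANNOT = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

def parse_annotated(sentence):
    """Convert an inline-annotated sentence into (token, BIO-tag) pairs."""
    tokens, tags = [], []
    pos = 0
    for m in ANNOT.finditer(sentence):
        # Plain text before the annotation gets the "O" (outside) tag.
        for tok in sentence[pos:m.start()].split():
            tokens.append(tok)
            tags.append("O")
        # Entity tokens get B- (begin) then I- (inside) tags.
        ent_type = m.group(2)
        for i, tok in enumerate(m.group(1).split()):
            tokens.append(tok)
            tags.append(("B-" if i == 0 else "I-") + ent_type)
        pos = m.end()
    for tok in sentence[pos:].split():
        tokens.append(tok)
        tags.append("O")
    return list(zip(tokens, tags))

pairs = parse_annotated(
    "Patients with [type 2 diabetes](Disease) received [metformin](Drug) daily."
)
# pairs now holds e.g. ("type", "B-Disease"), ("diabetes", "I-Disease"),
# ("metformin", "B-Drug"), with "O" tags on the surrounding tokens.
```

In this scheme only synthetic text ever reaches the LLM API, so real patient records never leave the local environment; the resulting (token, tag) pairs feed directly into standard fine-tuning of a local token-classification model.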