Recent advances in large language models (LLMs) have led to highly capable models such as OpenAI's ChatGPT, which exhibit exceptional performance on a variety of tasks, including question answering, essay composition, and code generation. However, their effectiveness in the healthcare sector remains uncertain. In this study, we investigate the potential of ChatGPT to aid in clinical text mining by examining its ability to extract structured information from unstructured healthcare texts, with a focus on biological named entity recognition and relation extraction. Our preliminary results indicate that employing ChatGPT directly for these tasks yields poor performance and raises privacy concerns, since patients' information must be uploaded to the ChatGPT API. To overcome these limitations, we propose a new training paradigm: generating a large quantity of high-quality, labeled synthetic data with ChatGPT and fine-tuning a local model for the downstream task. Our method significantly improves downstream performance, raising the F1-score from 23.37% to 63.99% on named entity recognition and from 75.86% to 83.59% on relation extraction. Furthermore, generating data with ChatGPT substantially reduces the time and effort required for data collection and labeling, while mitigating data privacy concerns. In summary, the proposed framework is a promising solution for enhancing the applicability of LLMs to clinical text mining.
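One step of the proposed paradigm, turning LLM-generated synthetic sentences into training data for a local NER model, can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: it assumes a hypothetical prompt convention in which ChatGPT marks entities inline as `[span](Label)`, and converts each sentence into token-level BIO tags suitable for fine-tuning.

```python
import re

# Hypothetical inline-annotation convention for synthetic data, e.g.
#   "The patient received [aspirin](Drug) for [headache](Symptom)."
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

def annotated_to_bio(text):
    """Convert an inline-annotated synthetic sentence into
    (token, BIO-tag) pairs for NER fine-tuning."""
    tokens, tags = [], []
    pos = 0
    for m in ANNOTATION.finditer(text):
        # Plain text before the annotation gets "O" (outside) tags.
        for tok in text[pos:m.start()].split():
            tokens.append(tok)
            tags.append("O")
        # The annotated span gets B-<Label> on its first token,
        # I-<Label> on the rest.
        label = m.group(2)
        for i, tok in enumerate(m.group(1).split()):
            tokens.append(tok)
            tags.append(("B-" if i == 0 else "I-") + label)
        pos = m.end()
    # Trailing plain text after the last annotation.
    for tok in text[pos:].split():
        tokens.append(tok)
        tags.append("O")
    return list(zip(tokens, tags))

pairs = annotated_to_bio(
    "The patient received [aspirin](Drug) for [severe headache](Symptom) ."
)
# e.g. ("aspirin", "B-Drug"), ("severe", "B-Symptom"), ("headache", "I-Symptom")
```

The resulting token/tag pairs can then be fed to any standard token-classification fine-tuning loop on a locally hosted model, so that no patient text ever leaves the local environment at inference time.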