Following the global COVID-19 pandemic, the number of scientific papers studying the virus has grown massively, leading to increased interest in automated literature review. We present a clinical text mining system that improves on previous efforts in three ways. First, it can recognize over 100 different entity types, including social determinants of health, anatomy, risk factors, and adverse events, in addition to other commonly used clinical and biomedical entities. Second, the text processing pipeline includes assertion status detection, which distinguishes between clinical facts that are present, absent, conditional, or about someone other than the patient. Third, the deep learning models used are more accurate than those previously available: they leverage an integrated pipeline of state-of-the-art pretrained named entity recognition models and improve on the previous best-performing benchmarks for assertion status detection. We illustrate the extraction of trends and insights, such as the most frequent disorders and symptoms and the most common vital signs and EKG findings, from the COVID-19 Open Research Dataset (CORD-19). The system is built using the Spark NLP library, which natively supports scaling to distributed clusters, leveraging GPUs, configurable and reusable NLP pipelines, healthcare-specific embeddings, and the ability to train models to support new entity types or human languages with no code changes.
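To make the role of assertion status in trend extraction concrete, the following is a minimal sketch in plain Python; it does not use the Spark NLP API. It assumes an upstream pipeline has already produced entity mentions, each tagged with an entity type and an assertion status. The `Mention` class, the `top_asserted` helper, and the label strings (`"Symptom"`, `"Disorder"`, `"present"`, `"absent"`, `"someone_else"`) are hypothetical names chosen for illustration, not the system's actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One extracted entity mention (hypothetical schema for illustration)."""
    text: str       # surface form found in the paper
    entity: str     # entity type label, e.g. "Symptom" (assumed label name)
    assertion: str  # assertion status, e.g. "present" (assumed label name)


def top_asserted(mentions, entity, assertion="present", k=3):
    """Return the k most frequent mentions of one entity type,
    keeping only those with the requested assertion status."""
    counts = Counter(
        m.text.lower()  # fold case so "Fever" and "fever" count together
        for m in mentions
        if m.entity == entity and m.assertion == assertion
    )
    return counts.most_common(k)


# Toy input; these mentions are made up and are not CORD-19 results.
mentions = [
    Mention("fever", "Symptom", "present"),
    Mention("Fever", "Symptom", "present"),
    Mention("cough", "Symptom", "present"),
    Mention("dyspnea", "Symptom", "absent"),
    Mention("diabetes", "Disorder", "someone_else"),
    Mention("pneumonia", "Disorder", "present"),
]

print(top_asserted(mentions, "Symptom"))   # [('fever', 2), ('cough', 1)]
print(top_asserted(mentions, "Disorder"))  # [('pneumonia', 1)]
```

Filtering on assertion status before counting keeps negated findings ("no dyspnea") and facts about someone other than the patient (a family history of diabetes) out of the frequency tables. This is why assertion detection matters when aggregating over a large corpus.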