A Study of Social and Behavioral Determinants of Health in Lung Cancer Patients Using Transformers-based Natural Language Processing Models

Social and behavioral determinants of health (SBDoH) play an important role in shaping people's health. In clinical research studies, especially comparative effectiveness studies, failure to adjust for SBDoH factors can cause confounding and misclassification errors in both statistical analyses and machine learning-based models. However, few studies have examined SBDoH factors in clinical outcomes, because current electronic health record (EHR) systems lack structured SBDoH information, while much of that information is documented in clinical narratives. Natural language processing (NLP) is thus the key technology for extracting such information from unstructured clinical text, yet no mature clinical NLP system focuses on SBDoH. In this study, we examined two state-of-the-art transformer-based NLP models, BERT and RoBERTa, for extracting SBDoH concepts from clinical narratives; applied the best-performing model to extract SBDoH concepts from a lung cancer screening patient cohort; and examined the differences in SBDoH information between the NLP-extracted results and structured EHRs (SBDoH information captured in standard vocabularies such as International Classification of Diseases codes). The experimental results show that the BERT-based NLP model achieved the best strict and lenient F1-scores of 0.8791 and 0.8999, respectively. The comparison between NLP-extracted SBDoH information and structured EHRs in the lung cancer patient cohort of 864 patients with 161,933 clinical notes of various types showed that much more detailed information about smoking, education, and employment was captured only in clinical narratives, and that both clinical narratives and structured EHRs are needed to construct a more complete picture of patients' SBDoH factors.
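
The strict and lenient F1-scores reported above can be illustrated with a minimal sketch. It assumes a common convention for concept extraction: a strict match requires identical span boundaries and label, while a lenient match only requires overlapping boundaries with the same label. The abstract does not state the exact matching rule used, so this convention, the `Span` layout, the function names, and the toy spans are all illustrative assumptions and not the authors' evaluation code.

```python
from typing import List, Tuple

# Hypothetical span layout: (start offset, exclusive end offset, concept label).
Span = Tuple[int, int, str]


def _overlaps(a: Span, b: Span) -> bool:
    """True if two spans share a label and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def f1_score(gold: List[Span], pred: List[Span], lenient: bool = False) -> float:
    """F1 over extracted concepts under strict or lenient matching (assumed rules)."""
    if lenient:
        # Lenient: a span counts as correct if it overlaps any same-label span.
        matched_pred = sum(any(_overlaps(p, g) for g in gold) for p in pred)
        matched_gold = sum(any(_overlaps(g, p) for p in pred) for g in gold)
    else:
        # Strict: boundaries and label must be identical.
        matched_pred = matched_gold = len(set(gold) & set(pred))
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy annotations (invented for illustration only).
gold = [(0, 13, "Smoking"), (20, 30, "Employment"), (40, 52, "Education")]
pred = [(0, 13, "Smoking"), (22, 30, "Employment"), (60, 70, "Education")]

print(f"strict F1:  {f1_score(gold, pred):.4f}")
print(f"lenient F1: {f1_score(gold, pred, lenient=True):.4f}")
```

In this toy case the employment prediction has a shifted start offset, so it counts only under lenient matching. The same effect explains why a lenient score is never lower than the strict score for the same predictions.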
