SODA: A Natural Language Processing Package to Extract Social Determinants of Health for Cancer Studies

Zehao Yu, Xi Yang, Chong Dang, Prakash Adekkanattu, Braja Gopal Patra, Yifan Peng, Jyotishman Pathak, Debbie L. Wilson, Ching-Yuan Chang, Wei-Hsuan Lo-Ciganic, Thomas J. George, William R. Hogan, Yi Guo, Jiang Bian, Yonghui Wu

Objective: We aim to develop an open-source natural language processing (NLP) package, SODA (i.e., SOcial DeterminAnts), with pre-trained transformer models to extract social determinants of health (SDoH) for cancer patients, examine the generalizability of SODA to a new disease domain (i.e., opioid use), and evaluate the extraction rate of SDoH in cancer populations. Methods: We identified SDoH categories and attributes and developed an SDoH corpus using clinical notes from a general cancer cohort. We compared four transformer-based NLP models for extracting SDoH, examined how well the models generalize to a cohort of patients prescribed opioids, and explored customization strategies to improve performance. We applied the best NLP model to extract 19 categories of SDoH from the breast (n=7,971), lung (n=11,804), and colorectal cancer (n=6,240) cohorts. Results and Conclusion: We developed a corpus of 629 cancer patients' notes with annotations of 13,193 SDoH concepts/attributes from 19 categories of SDoH. The Bidirectional Encoder Representations from Transformers (BERT) model achieved the best strict/lenient F1 scores: 0.9216 and 0.9441 for SDoH concept extraction, and 0.9617 and 0.9626 for linking attributes to SDoH concepts. Fine-tuning the NLP models using new annotations from opioid use patients improved the strict/lenient F1 scores from 0.8172/0.8502 to 0.8312/0.8679. The extraction rates varied greatly across the 19 categories of SDoH: 10 SDoH could be extracted from >70% of cancer patients, but 9 SDoH had a low extraction rate (<70% of cancer patients). The SODA package with pre-trained transformer models is publicly available at https://github.com/uf-hobiinformatics-lab/SDoH_SODA.
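The abstract reports strict and lenient F1 scores without defining them. The sketch below illustrates one common convention for span-level evaluation, assuming that a strict match requires identical boundaries and label while a lenient match requires only overlapping boundaries and the same label. The `Span` type, the `f1_score` helper, the greedy one-to-one matching, and the example labels are all hypothetical and are not taken from the SODA package; the authors' evaluation script may differ.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # SDoH category name (hypothetical labels below)


def _matches(pred: Span, gold: Span, lenient: bool) -> bool:
    """Strict: same label and identical boundaries. Lenient: same label and any overlap."""
    if pred.label != gold.label:
        return False
    if lenient:
        return pred.start < gold.end and gold.start < pred.end
    return pred.start == gold.start and pred.end == gold.end


def f1_score(preds: List[Span], golds: List[Span], lenient: bool = False) -> float:
    """Span-level F1 with greedy one-to-one matching of predictions to gold spans."""
    unmatched = list(golds)
    true_pos = 0
    for p in preds:
        for g in unmatched:
            if _matches(p, g, lenient):
                unmatched.remove(g)  # each gold span can be matched at most once
                true_pos += 1
                break
    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(golds) if golds else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the second prediction overlaps its gold span but misses the left boundary.
gold = [Span(0, 10, "Tobacco_use"), Span(20, 30, "Employment")]
pred = [Span(0, 10, "Tobacco_use"), Span(22, 30, "Employment")]

print(f1_score(pred, gold))                # strict: only the first span counts
print(f1_score(pred, gold, lenient=True))  # lenient: both spans count
```

Under this convention the lenient score is never lower than the strict score, which is consistent with the pairs reported above (for example, 0.9216 strict versus 0.9441 lenient for concept extraction).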
