Social determinants of health (SDOH) affect health outcomes, and knowledge of SDOH can inform clinical decision-making. Automatically extracting SDOH information from clinical text requires data-driven information extraction models trained on annotated corpora that are heterogeneous and frequently include critical SDOH. This work presents a new corpus with SDOH annotations, a novel active learning framework, and the first extraction results on the new corpus. The Social History Annotation Corpus (SHAC) includes 4,480 social history sections with detailed annotation for 12 SDOH characterizing the status, extent, and temporal information of 18K distinct events. We introduce a novel active learning framework that selects samples for annotation using a surrogate text classification task as a proxy for a more complex event extraction task. The active learning framework successfully increases the frequency of health risk factors and improves automatic extraction of these events over undirected annotation. An event extraction model trained on SHAC achieves high extraction performance for substance use status (0.82-0.93 F1), employment status (0.81-0.86 F1), and living status type (0.81-0.93 F1) on data from three institutions.
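To make the sample-selection idea concrete, the following is a minimal sketch of how a surrogate text classifier could direct annotation toward notes likely to contain health risk factors. It is an illustration under stated assumptions, not the framework's actual implementation: the abstract does not specify the surrogate labels or the selection criterion, so the label names in `TARGET_LABELS`, the max-probability scoring rule in `risk_score`, and the `select_for_annotation` helper are all hypothetical.

```python
from typing import Dict, List

# Hypothetical surrogate labels; the abstract does not list the actual label set.
TARGET_LABELS = ("drug_current", "alcohol_current", "tobacco_current")


def risk_score(label_probs: Dict[str, float]) -> float:
    """Score a note by its highest predicted probability among the target labels.

    Assumed scoring rule: labels missing from the prediction count as 0.0.
    """
    return max(label_probs.get(label, 0.0) for label in TARGET_LABELS)


def select_for_annotation(pool: Dict[str, Dict[str, float]], batch_size: int) -> List[str]:
    """Return the ids of the batch_size highest-scoring unlabeled notes.

    Ties are broken by note id so the selection is deterministic.
    """
    ranked = sorted(pool, key=lambda note_id: (-risk_score(pool[note_id]), note_id))
    return ranked[:batch_size]


# Toy pool: surrogate-classifier probabilities for three unlabeled notes.
pool = {
    "note_a": {"drug_current": 0.05, "alcohol_current": 0.10},
    "note_b": {"drug_current": 0.80, "alcohol_current": 0.20},
    "note_c": {"tobacco_current": 0.60},
}
batch = select_for_annotation(pool, batch_size=2)
print(batch)
```

In this toy pool, the two notes with the highest predicted risk-factor probabilities are chosen for annotation. In a full active learning loop, the selected notes would receive event annotations, and the surrogate classifier would be retrained before the next round of selection.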