This work is motivated by the scarcity of tools for accurate, unsupervised information extraction from unstructured clinical notes in computationally underrepresented languages such as Czech. We introduce a stepping stone to a broad array of downstream tasks, such as summarising or integrating individual patient records, extracting structured information for national cancer registry reporting, or building semi-structured semantic patient representations for computing patient embeddings. More specifically, we present a method for unsupervised extraction of semantically labelled textual segments from clinical notes and evaluate it on a dataset of Czech breast cancer patients provided by the Masaryk Memorial Cancer Institute (the largest Czech hospital specialising in oncology). Our goal was to extract, classify (i.e., label) and cluster segments of the free-text notes that correspond to specific clinical features (e.g., family background, comorbidities or toxicities). The presented results demonstrate the practical relevance of the proposed approach for building more sophisticated extraction and analytical pipelines deployed on Czech clinical notes.