In this work, we propose a novel problem formulation for the de-identification of unstructured clinical text. We formulate de-identification as a sequence-to-sequence learning problem rather than a token classification problem. Our approach is inspired by the recent state-of-the-art performance of sequence-to-sequence learning models for named entity recognition. In early experiments, our proposed approach achieved a recall of 98.91% on the i2b2 dataset. This performance is comparable to that of current state-of-the-art models for unstructured clinical text de-identification.
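To make the contrast between the two formulations concrete, the following is a minimal sketch of how a single annotated sentence could be framed under each one. The abstract does not specify the output format of the sequence-to-sequence model, so the placeholder scheme (`[NAME]`, `[DATE]`), the `Span` tuple, the helper names, and the example sentence are all illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical span type: (start_token_index, end_token_index_exclusive, phi_category)
Span = Tuple[int, int, str]


def to_token_labels(tokens: List[str], spans: List[Span]) -> List[str]:
    """Token-classification view: one BIO label per input token."""
    labels = ["O"] * len(tokens)
    for start, end, category in spans:
        labels[start] = f"B-{category}"
        for i in range(start + 1, end):
            labels[i] = f"I-{category}"
    return labels


def to_seq2seq_target(tokens: List[str], spans: List[Span]) -> str:
    """Sequence-to-sequence view: the target is an output string.

    Assumption: each PHI span is replaced by a bracketed category placeholder.
    """
    span_at = {start: (end, category) for start, end, category in spans}
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i in span_at:
            end, category = span_at[i]
            out.append(f"[{category}]")
            i = end  # skip the tokens covered by this span
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


# Made-up example sentence, not taken from the i2b2 dataset.
tokens = "Mr. Smith was admitted on 03/14/2012 .".split()
spans = [(1, 2, "NAME"), (5, 6, "DATE")]

token_labels = to_token_labels(tokens, spans)
seq2seq_target = to_seq2seq_target(tokens, spans)
```

Under the token-classification view the model predicts one label per input token, whereas under the sequence-to-sequence view it generates the whole output string, so the output length need not match the input length.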