Large Transformers pretrained over clinical notes from Electronic Health Records (EHR) have afforded substantial gains in performance on predictive clinical tasks. The cost of training such models (and the necessity of data access to do so), coupled with their utility, motivates parameter sharing, i.e., the release of pretrained models such as ClinicalBERT. While most efforts have used deidentified EHR, many researchers have access to large sets of sensitive, non-deidentified EHR with which they might train a BERT model (or similar). Would it be safe to release the weights of such a model if they did? In this work, we design a battery of approaches intended to recover Personal Health Information (PHI) from a trained BERT. Specifically, we attempt to recover patient names and the conditions with which they are associated. We find that simple probing methods are not able to meaningfully extract sensitive information from a BERT trained over the MIMIC-III corpus of EHR. However, more sophisticated "attacks" may succeed in doing so. To facilitate such research, we make our experimental setup and baseline probing models available at https://github.com/elehman16/exposing_patient_data_release