Balancing the needs of data privacy and predictive utility is a central challenge for machine learning in healthcare. In particular, privacy concerns have led to a dearth of public datasets, complicated the construction of multi-hospital cohorts, and limited the utilization of external machine learning resources. To remedy this, new methods are required to enable data owners, such as hospitals, to share their datasets publicly while preserving both patient privacy and modeling utility. We propose NeuraCrypt, a private encoding scheme based on random deep neural networks. NeuraCrypt encodes raw patient data using a randomly constructed neural network known only to the data owner, and publishes both the encoded data and associated labels publicly. From a theoretical perspective, we demonstrate that sampling from a sufficiently rich family of encoding functions offers a well-defined and meaningful notion of privacy against a computationally unbounded adversary with full knowledge of the underlying data distribution. We propose to approximate this family of encoding functions through random deep neural networks. Empirically, we demonstrate the robustness of our encoding to a suite of adversarial attacks and show that NeuraCrypt achieves accuracy competitive with non-private baselines on a variety of x-ray tasks. Moreover, we demonstrate that multiple hospitals, using independent private encoders, can collaborate to train improved x-ray models. Finally, we release a challenge dataset to encourage the development of new attacks on NeuraCrypt.
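To make the encoding idea concrete, the sketch below shows one plausible instantiation of a fixed, randomly initialized patch encoder in numpy. The patch size, hidden width, depth, and per-image patch shuffling are illustrative assumptions, not the authors' exact architecture; the key point is that the random weights act as the data owner's secret key and are never published.

```python
import numpy as np

def make_neuracrypt_encoder(patch_size=4, hidden_dim=16, depth=2, seed=0):
    """Build a fixed, randomly initialized patch encoder.

    This is a minimal sketch of the random-deep-network encoding idea:
    the randomly drawn weights play the role of the data owner's private
    key. Hyperparameters here are illustrative, not the paper's.
    """
    rng = np.random.default_rng(seed)
    dims = [patch_size * patch_size] + [hidden_dim] * depth
    # Random linear maps, scaled to keep activations well-conditioned.
    weights = [rng.standard_normal((a, b)) / np.sqrt(a)
               for a, b in zip(dims[:-1], dims[1:])]

    def encode(image):
        h, w = image.shape
        # Split the image into non-overlapping patches and flatten each.
        patches = (image.reshape(h // patch_size, patch_size,
                                 w // patch_size, patch_size)
                        .transpose(0, 2, 1, 3)
                        .reshape(-1, patch_size * patch_size))
        x = patches
        for W in weights[:-1]:
            x = np.maximum(x @ W, 0.0)  # ReLU between random linear maps
        x = x @ weights[-1]
        # Shuffle patch order per image so spatial layout is not exposed.
        return x[rng.permutation(len(x))]

    return encode

# The hospital keeps `encode` private and releases only its outputs.
encode = make_neuracrypt_encoder()
encoded = encode(np.zeros((8, 8)))  # one 8x8 toy "x-ray"
```

A downstream model is then trained directly on the published (encoded, label) pairs; because each hospital samples its own encoder, collaboration across sites uses independently keyed encodings, as in the multi-hospital experiments above.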