The need for data privacy and security, enforced through increasingly strict data protection regulations, makes it difficult to use healthcare data for machine learning. In particular, transferring data between hospitals is often not permissible, so cross-site pooling of data is not an option. The Personal Health Train (PHT) concept proposed within the GO-FAIR initiative implements an 'algorithm to the data' paradigm, which ensures that distributed data can be accessed for analysis without transferring any sensitive data. We present PHT-meDIC, an open-source implementation of the PHT concept that is deployed in production. Containerization allows us to easily deploy even complex data analysis pipelines (e.g., genomics, image analysis) across multiple sites in a secure and scalable manner. We discuss the underlying technological concepts, security models, and governance processes. The implementation has been successfully applied to distributed analyses of large-scale data, including applications of deep neural networks to medical image data.
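The 'algorithm to the data' idea can be illustrated with a minimal sketch in which an analysis visits each site in turn and carries away only aggregate values. Everything here is a hypothetical illustration: the `Train` class, the `dispatch` function, and the station names are assumptions and not part of the PHT-meDIC API, and the containerization, security, and governance layers described above are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Train:
    """Hypothetical 'train': carries the algorithm and aggregate state only."""
    count: int = 0
    total: float = 0.0
    route_log: list = field(default_factory=list)

    def run_at(self, station_name, local_values):
        # Runs inside the station; raw values are read here but never stored
        # on the train. Only the running count and sum leave the site.
        self.count += len(local_values)
        self.total += sum(local_values)
        self.route_log.append(station_name)

    def result(self):
        # Mean over all sites, computed without pooling the raw data.
        return self.total / self.count


def dispatch(train, stations):
    """Send the train along a route of stations (assumed: dict name -> data)."""
    for name, values in stations.items():
        train.run_at(name, values)
    return train


# Toy data standing in for site-local records (made up for illustration).
stations = {"hospital_a": [1.0, 2.0, 3.0], "hospital_b": [4.0, 5.0]}
train = dispatch(Train(), stations)
pooled_mean = train.result()
```

In this toy setting the train returns the same mean that pooling all records would give, while each station's individual values stay local. A real deployment would additionally need to consider whether the aggregates themselves can leak sensitive information.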