Machine learning benefits from large training datasets, which may not always be possible for any single entity to collect, especially when the data is privacy-sensitive. In many contexts, such as healthcare and finance, separate parties may wish to collaborate and learn from each other's data but are prevented from doing so by privacy regulations. Some regulations prevent parties from explicitly sharing data, e.g., by pooling their datasets in a central location (confidentiality). Others also limit implicit sharing of data, e.g., through model predictions (privacy). There is currently no method that enables machine learning in such a setting, where both confidentiality and privacy must be preserved to prevent both explicit and implicit sharing of data. Federated learning only provides confidentiality, not privacy, since the shared gradients still contain private information. Differentially private learning assumes unreasonably large datasets. Furthermore, both of these learning paradigms produce a central model whose architecture was previously agreed upon by all parties, rather than enabling collaborative learning in which each party learns and improves its own local model. We introduce Confidential and Private Collaborative (CaPC) learning, the first method provably achieving both confidentiality and privacy in a collaborative setting. We leverage secure multi-party computation (MPC), homomorphic encryption (HE), and other techniques in combination with privately aggregated teacher models. We demonstrate how CaPC allows participants to collaborate without having to explicitly join their training sets or train a central model. Each party is able to improve the accuracy and fairness of its model, even in settings where each party already has a model that performs well on its own dataset, or where datasets are not IID and model architectures are heterogeneous across parties.
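To make the "privately aggregated teacher models" ingredient concrete, the following is a minimal sketch of PATE-style noisy aggregation: each answering party (a "teacher") votes with its local model's predicted label, Laplace noise is added to the vote counts, and only the noisy argmax is released. All function names and parameters here are illustrative, and the sketch deliberately omits the MPC/HE layer that CaPC adds so that teachers never see the query, or their own votes, in plaintext.

```python
import math
import random

def sample_laplace(rng, scale):
    # Draw one sample from a zero-mean Laplace distribution via inverse CDF.
    u = rng.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def noisy_aggregate(teacher_votes, num_classes, noise_scale, rng=None):
    """Return the noisy-argmax label for one query.

    teacher_votes: list of class labels, one vote per answering party.
    noise_scale:   Laplace scale; larger values give stronger privacy
                   for individual teachers at the cost of accuracy.
    """
    rng = rng or random.Random()
    counts = [0] * num_classes
    for v in teacher_votes:
        counts[v] += 1
    # Perturb each count before taking the argmax, so the released label
    # reveals little about any single teacher's training data.
    noisy_counts = [c + sample_laplace(rng, noise_scale) for c in counts]
    return max(range(num_classes), key=lambda k: noisy_counts[k])

# Example: five teachers vote for class 0, one for class 1; with small
# noise the clear majority label is returned.
label = noisy_aggregate([0, 0, 0, 0, 0, 1], num_classes=2,
                        noise_scale=1e-6, rng=random.Random(0))
```

The querying party would use labels obtained this way to improve its own local model, which is how each participant keeps a heterogeneous architecture rather than training a shared central model.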