Large machine learning models with improved predictions have become widely available in the chemical sciences. However, these models do not provide the privacy protection required in commercial settings, which prevents potentially very valuable data from being used by others. Encrypting the prediction process addresses this problem: the model is evaluated double-blind, so that neither the training data nor the query data can be extracted. Contemporary ML models based on fully homomorphic encryption or federated learning, however, are either too expensive for practical use or gain speed at the cost of weaker security. We have implemented secure and computationally feasible encrypted machine learning models using oblivious transfer, enabling secure predictions of molecular quantum properties across chemical compound space. Nevertheless, we find that encrypted predictions using kernel ridge regression models are a million times more expensive than predictions without encryption. This demonstrates a dire need for a compact machine learning model architecture, including molecular representation and kernel matrix size, that minimizes model evaluation costs.
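To illustrate why representation length and kernel matrix size drive evaluation cost, the following minimal plaintext sketch evaluates a kernel ridge regression prediction and counts its multiplications. It is not the implementation described above: the Gaussian kernel, the function names, and the toy values are all assumptions made for illustration, and the encrypted protocol itself is not shown. The only premise is that the cost of a secure evaluation grows with the number of arithmetic operations in the underlying plaintext prediction.

```python
import math

def gaussian_kernel(x, y, sigma):
    """Gaussian kernel between two representation vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def krr_predict(query, training_reps, alphas, sigma):
    """Plaintext KRR prediction: sum over i of alpha_i * k(query, x_i)."""
    return sum(a * gaussian_kernel(query, x, sigma)
               for a, x in zip(alphas, training_reps))

def multiplication_count(n_train, rep_dim):
    """Rough multiplications per prediction: one squaring per representation
    component per training point, plus one product with each alpha."""
    return n_train * (rep_dim + 1)

# Toy data: hypothetical values, not taken from the paper.
training_reps = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
alphas = [0.5, -0.2, 0.1]
prediction = krr_predict([0.0, 0.0], training_reps, alphas, sigma=1.0)

# Cost comparison for two hypothetical model sizes.
small_model = multiplication_count(n_train=1000, rep_dim=100)
large_model = multiplication_count(n_train=10000, rep_dim=1000)
```

In this rough count the operations scale with the product of training set size and representation length, so shrinking either one reduces the work that an encrypted evaluation would have to carry out.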