Experiments in chemistry carry high material and computational costs. Institutions therefore treat chemical data as a valuable asset, and few efforts have been made to construct large public datasets for machine learning. A further challenge is that different institutions are interested in different classes of molecules, which produces heterogeneous data that conventional distributed training cannot easily combine. In this work, we introduce federated heterogeneous molecular learning to address these challenges. Federated learning allows end-users to collaboratively build a global model while the training data remain distributed across isolated clients. Because related research is lacking, we first simulate a heterogeneous federated learning benchmark (FedChem) by jointly applying scaffold splitting and latent Dirichlet allocation to existing datasets, which yields heterogeneously distributed client data. Our results on FedChem show that significant learning challenges arise when molecules are heterogeneous across clients. We then propose a method to alleviate the problem, Federated Learning by Instance reweighTing (FLIT(+)). FLIT(+) aligns local training across heterogeneous clients by improving performance on uncertain samples. Comprehensive experiments on FedChem validate the advantages of this method over other federated learning schemes. FedChem should enable a new type of collaboration for improving AI in chemistry that mitigates concerns about sharing valuable chemical data.
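To make the benchmark construction more concrete, the following is a minimal sketch of how scaffold groups might be distributed across clients with Dirichlet-distributed proportions. It is an illustration under stated assumptions, not the FedChem procedure itself: the function names, the `scaffold_of` input (a precomputed scaffold key per molecule), and the concentration parameter `alpha` are all hypothetical. A smaller `alpha` tends to concentrate each scaffold group on fewer clients, which gives more heterogeneous client data.

```python
import random
from collections import defaultdict


def dirichlet_sample(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:
        # Guard against numerical underflow for very small alpha:
        # fall back to putting all mass on one random category.
        one_hot = [0.0] * k
        one_hot[rng.randrange(k)] = 1.0
        return one_hot
    return [d / total for d in draws]


def partition_by_scaffold(scaffold_of, num_clients, alpha, seed=0):
    """Assign molecule indices to clients, one scaffold group at a time.

    scaffold_of: list where scaffold_of[i] is the scaffold key of molecule i
                 (assumed to be precomputed elsewhere; hypothetical input).
    alpha:       Dirichlet concentration (assumed knob for heterogeneity).
    Returns a list of num_clients lists of molecule indices.
    """
    rng = random.Random(seed)

    # Group molecule indices by their scaffold key.
    groups = defaultdict(list)
    for idx, key in enumerate(scaffold_of):
        groups[key].append(idx)

    clients = [[] for _ in range(num_clients)]
    for key in sorted(groups):
        # Each scaffold group gets its own client proportions.
        probs = dirichlet_sample(alpha, num_clients, rng)
        for idx in groups[key]:
            owner = rng.choices(range(num_clients), weights=probs, k=1)[0]
            clients[owner].append(idx)
    return clients


if __name__ == "__main__":
    # Toy scaffold keys standing in for real scaffold strings.
    toy_scaffolds = ["A"] * 10 + ["B"] * 10 + ["C"] * 10
    for cid, members in enumerate(partition_by_scaffold(toy_scaffolds, 3, 0.1)):
        print(cid, sorted(members))
```

In this sketch every molecule is assigned to exactly one client, and a fixed `seed` makes the partition reproducible, so the same simulated split can be reused when comparing federated learning schemes.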