Recent advances in deep learning have accelerated its use in applications such as cellular image analysis and molecular discovery. In molecular discovery, the generative adversarial network (GAN), which comprises a generator that produces new molecules and a discriminator that distinguishes generated molecules from existing ones, is one of the premier technologies because it can learn efficiently from a large molecular data set and generate novel molecules that preserve similar properties. However, pharmaceutical companies may be unwilling or unable to share their local data sets because molecular data sets are geo-distributed and sensitive, which makes it impossible to train GANs in a centralized manner. In this paper, we propose the Graph convolutional network in Generative Adversarial Networks via Federated learning (GraphGANFed) framework, which integrates a graph convolutional network (GCN), a GAN, and federated learning (FL) into a single system that generates novel molecules without sharing local data sets. In GraphGANFed, the discriminator is implemented as a GCN to better capture features from molecules represented as molecular graphs, and FL is used to train both the discriminator and the generator in a distributed manner to preserve data privacy. Extensive simulations on three benchmark data sets demonstrate the feasibility and effectiveness of GraphGANFed. The molecules generated by GraphGANFed achieve high novelty (= 100) and diversity (> 0.9). The simulation results also indicate that 1) a lower-complexity discriminator model can better avoid mode collapse on a smaller data set, 2) there is a tradeoff among the different evaluation metrics, and 3) choosing an appropriate dropout ratio for the generator and discriminator can avoid mode collapse.
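To illustrate how FL can train a shared discriminator and generator without moving local data, the following is a minimal sketch of one server-side aggregation step. It assumes a FedAvg-style weighted average of client parameters; the abstract does not state which aggregation rule GraphGANFed uses, so this choice, along with the function name `federated_average` and the parameter name `gcn.w`, is a hypothetical illustration and not the authors' implementation.

```python
from typing import Dict, List

# Model parameters as a mapping from (hypothetical) parameter name to flat values.
Weights = Dict[str, List[float]]


def federated_average(client_weights: List[Weights],
                      client_sizes: List[int]) -> Weights:
    """Average client parameters, weighted by local data set size.

    Assumption: FedAvg-style aggregation. Only parameters are shared
    with the server; the local molecular data sets never leave the clients.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one data set size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    averaged: Weights = {}
    for name, reference in client_weights[0].items():
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(len(reference))
        ]
    return averaged


# Two clients holding 100 and 300 molecules, each with a locally trained
# copy of the same (toy) discriminator parameter.
client_a = {"gcn.w": [1.0, 2.0]}
client_b = {"gcn.w": [3.0, 4.0]}
global_discriminator = federated_average([client_a, client_b], [100, 300])
print(global_discriminator)
```

In this sketch the same routine would be applied separately to the discriminator and generator parameters in each communication round, after which the averaged models are sent back to the clients for further local training.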