Determining the aqueous solubility of molecules is a vital step in many pharmaceutical, environmental, and energy storage applications. Despite decades of effort, developing a solubility prediction model that is accurate enough for many of these applications remains a challenge. The goal of this study is to develop a general model capable of predicting the solubility of a broad range of organic molecules. Using the largest currently available solubility dataset, we implement deep learning-based models that predict solubility from molecular structure. We explore four molecular representations: molecular descriptors, simplified molecular-input line-entry system (SMILES) strings, molecular graphs, and three-dimensional (3D) atomic coordinates. These are paired with four neural network architectures: fully connected neural networks (FCNNs), recurrent neural networks (RNNs), graph neural networks (GNNs), and SchNet. We find that models using molecular descriptors achieve the best performance, with GNN models also performing well. We carry out an extensive error analysis to understand which molecular properties influence model performance, a feature analysis to understand which information about molecular structure is most valuable for prediction, and a transfer learning and data size study to understand the impact of data availability on model performance.
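To make the SMILES-based representation concrete, the following is a minimal sketch of how a SMILES string might be converted into a fixed-length integer sequence suitable as input to an RNN. The abstract does not describe the tokenization used in the study, so the token pattern, the helper names (`tokenize_smiles`, `build_vocab`, `encode`), and the padding convention are all assumptions made for illustration.

```python
import re

# Assumed token pattern (not taken from the study): bracket atoms such as
# [NH3+], the two-letter halogens Br and Cl, two-digit ring closures such as
# %12, and otherwise any single character.
_TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of tokens."""
    return _TOKEN_RE.findall(smiles)


def build_vocab(smiles_list):
    """Map each distinct token to an integer id; id 0 is reserved for padding."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize_smiles(s)})
    return {tok: i + 1 for i, tok in enumerate(tokens)}


def encode(smiles, vocab, max_len):
    """Encode a SMILES string as ids, truncated or zero-padded to max_len.

    Raises KeyError if the string contains a token missing from vocab.
    """
    ids = [vocab[tok] for tok in tokenize_smiles(smiles)][:max_len]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    # Ethanol and chloroethane as toy inputs.
    vocab = build_vocab(["CCO", "CCl"])
    print(tokenize_smiles("CCl"))
    print(encode("CCO", vocab, max_len=5))
```

In a sequence model, the resulting integer ids would typically be passed through an embedding layer before the recurrent layers. Descriptor, graph, and 3D-coordinate representations require a cheminformatics toolkit and are not sketched here.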