The solvation free energy of organic molecules is a critical parameter in determining emergent properties such as solubility, liquid-phase equilibrium constants, and the pKa and redox potentials in an organic redox flow battery. In this work, we present a machine learning (ML) model that learns and predicts the aqueous solvation free energy of an organic molecule using Gaussian process regression with a new molecular graph kernel. To investigate how well the ML model captures the electrostatic interaction, the nonpolar interaction contribution of the solvent, and the conformational entropy of the solute in the solvation free energy, we test three data sets built with implicit or explicit water solvent models and with the conformational entropy contribution of the solute. We demonstrate that our ML model predicts the solvation free energy of molecules with chemical accuracy, achieving a mean absolute error of less than 1 kcal/mol for subsets of the QM9 dataset and the FreeSolv database. To address the general data scarcity problem for graph-based ML models, we propose a dimension reduction algorithm based on the distance between molecular graphs, which can be used to examine the diversity of a molecular data set. It provides a promising way to build a minimal training set that improves prediction for test sets where the space of molecular structures is predetermined.
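As a rough illustration of the two ingredients named above, the following minimal sketch performs Gaussian process regression from a precomputed kernel matrix and derives a kernel-induced distance that is then embedded in two dimensions. Everything specific here is an assumption made for illustration: `toy_kernel` is an RBF kernel on hypothetical descriptor vectors standing in for the molecular graph kernel, classical multidimensional scaling stands in for the dimension reduction algorithm (whose details are not given here), and the data are synthetic rather than solvation free energies.

```python
import numpy as np


def toy_kernel(X, Y, length_scale=1.0):
    """Stand-in for a molecular graph kernel (assumption): RBF on descriptor vectors."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def gpr_fit_predict(K_train, y_train, K_test_train, noise=1e-6):
    """Gaussian process posterior mean from precomputed kernel matrices."""
    y_mean = y_train.mean()  # center targets so a zero-mean prior is reasonable
    L = np.linalg.cholesky(K_train + noise * np.eye(len(y_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    return K_test_train @ alpha + y_mean


def kernel_distance(K):
    """Kernel-induced distance d_ij = sqrt(K_ii + K_jj - 2 K_ij)."""
    diag = np.diag(K)
    d2 = diag[:, None] + diag[None, :] - 2.0 * K
    return np.sqrt(np.maximum(d2, 0.0))


def classical_mds(D, n_components=2):
    """Classical MDS (assumed stand-in): embed points so distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D**2) @ J                # double-centered squared distances
    w, v = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:n_components]  # largest eigenvalues first
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))


# Synthetic data: hypothetical descriptors and targets, not real molecules.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.5, -2.0, 0.5])
X_train, y_train, X_test, y_test = X[:30], y[:30], X[30:], y[30:]

# Regression from the precomputed train/train and test/train kernel blocks.
y_pred = gpr_fit_predict(toy_kernel(X_train, X_train), y_train,
                         toy_kernel(X_test, X_train))
mae = np.mean(np.abs(y_pred - y_test))

# Distance matrix over the whole set and its 2D embedding for inspecting diversity.
D = kernel_distance(toy_kernel(X, X))
embedding = classical_mds(D, n_components=2)
```

Because the regression only consumes kernel matrices, replacing `toy_kernel` with any positive semidefinite kernel between molecular graphs would leave the remaining steps unchanged; the 2D embedding can then be plotted to judge how well a candidate training set covers the test molecules.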