Machine Learning (ML) models have, in contrast to their usefulness in molecular dynamics studies, had limited success as surrogate potentials for reaction barrier search. This is due to the scarcity of training data in the relevant transition-state regions of chemical space. Currently available datasets for training ML models on small molecular systems almost exclusively contain configurations at or near equilibrium. In this work, we present Transition1x, a dataset containing 9.6 million Density Functional Theory (DFT) calculations of forces and energies of molecular configurations on and around reaction pathways at the wB97x/6-31G(d) level of theory. The data were generated by running Nudged Elastic Band (NEB) calculations with DFT on 10k reactions while saving intermediate calculations. We train state-of-the-art equivariant graph message-passing neural network models on Transition1x and cross-validate on the popular ANI1x and QM9 datasets. We show that ML models cannot learn features in transition-state regions solely by training on hitherto popular benchmark datasets. Transition1x is a new, challenging benchmark that will provide an important step towards developing next-generation ML force fields that also work for reactive systems and far away from equilibrium configurations.