Machine learning (ML) outperforms traditional approaches in many molecular design tasks. ML models usually predict molecular properties from a 2D chemical graph or a single 3D structure, but neither of these representations accounts for the ensemble of 3D conformers accessible to a molecule. Property prediction could be improved by using conformer ensembles as input, but there is no large-scale dataset that contains graphs annotated with high-quality conformers and experimental data. Here we use first-principles simulations to generate accurate conformers for over 430,000 molecules, including 300,000 with experimental data for the inhibition of various pathogens. The Geometric Ensemble Of Molecules (GEOM) dataset contains over 33 million molecular conformers labeled with their relative energies and statistical probabilities at room temperature. GEOM will assist in the development of models that predict properties from conformer ensembles, and generative models that sample 3D conformations.
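As a minimal sketch of how relative energies relate to room-temperature statistical probabilities, the snippet below applies plain Boltzmann weighting to one molecule's conformers. It assumes energies in kcal/mol, a temperature of 298.15 K, and no degeneracy or entropy corrections. The function names and the example energies are hypothetical, and the dataset's own weighting scheme may differ.

```python
import math

# Gas constant in kcal/(mol*K); assumes energies are given in kcal/mol.
R_KCAL_PER_MOL_K = 0.0019872041


def boltzmann_weights(relative_energies, temperature=298.15):
    """Return normalized Boltzmann probabilities for a list of conformer energies.

    Assumption: simple exp(-E/RT) weighting with no degeneracy factors.
    """
    rt = R_KCAL_PER_MOL_K * temperature
    # Shift by the minimum energy so the largest exponent is exactly zero.
    e_min = min(relative_energies)
    factors = [math.exp(-(e - e_min) / rt) for e in relative_energies]
    total = sum(factors)
    return [f / total for f in factors]


def ensemble_average(values, weights):
    """Probability-weighted average of a per-conformer property."""
    return sum(v * w for v, w in zip(values, weights))


# Hypothetical relative energies (kcal/mol) for three conformers of one molecule.
energies = [0.0, 0.5, 1.2]
weights = boltzmann_weights(energies)
```

A model that consumes conformer ensembles could use such weights to pool per-conformer features or predictions, for example with `ensemble_average(per_conformer_values, weights)`.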