Many important problems involving molecular property prediction from 3D structures have limited data, posing a generalization challenge for neural networks. In this paper, we describe a pre-training technique that utilizes large datasets of 3D molecular structures at equilibrium to learn meaningful representations for downstream tasks. Inspired by recent advances in noise regularization, our pre-training objective is based on denoising. Relying on the well-known link between denoising autoencoders and score-matching, we also show that the objective corresponds to learning a molecular force field -- arising from approximating the physical state distribution with a mixture of Gaussians -- directly from equilibrium structures. Our experiments demonstrate that using this pre-training objective significantly improves performance on multiple benchmarks, achieving a new state-of-the-art on the majority of targets in the widely used QM9 dataset. Our analysis then provides practical insights into the effects of different factors -- dataset sizes, model size and architecture, and the choice of upstream and downstream datasets -- on pre-training.
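The denoising objective and its link to score-matching can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes isotropic Gaussian noise with a single scale `sigma` added to atom coordinates, and a hypothetical model callable `predict_noise` that maps noisy coordinates to a per-atom noise estimate. The names `denoising_loss` and `gaussian_score` are illustrative only.

```python
import numpy as np

def denoising_loss(predict_noise, positions, sigma, rng):
    """Mean squared error between predicted and true coordinate noise.

    positions: (n_atoms, 3) array of equilibrium coordinates.
    predict_noise: hypothetical model, maps noisy coordinates to a noise estimate.
    sigma: assumed noise scale (a hyperparameter in this sketch).
    """
    noise = rng.normal(scale=sigma, size=positions.shape)
    noisy = positions + noise                # perturbed structure
    predicted = predict_noise(noisy)         # model's guess at the added noise
    loss = float(np.mean((predicted - noise) ** 2))
    return loss, noisy, noise

def gaussian_score(noisy, positions, sigma):
    """Score (gradient of the log-density) of one Gaussian centered on `positions`.

    For a Boltzmann-like density p(x) proportional to exp(-E(x)), the score
    equals -grad E(x), i.e. a force. Here the score is -noise / sigma**2, so
    regressing the noise matches the score up to a constant factor.
    """
    return -(noisy - positions) / sigma**2

# Usage with a trivial stand-in model that always predicts zero noise.
rng = np.random.default_rng(0)
positions = rng.normal(size=(5, 3))          # toy "molecule" with 5 atoms
loss, noisy, noise = denoising_loss(lambda x: np.zeros_like(x), positions, 0.1, rng)
score = gaussian_score(noisy, positions, 0.1)
```

In this sketch, a model trained to predict the noise is implicitly estimating the score of the Gaussian-smoothed distribution around each equilibrium structure. This is the sense in which denoising can be read as learning a force field.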