Recent years have seen the advent of molecular simulation datasets that are orders of magnitude larger and more diverse. These new datasets differ substantially in four aspects of complexity: (1) chemical diversity (number of different elements), (2) system size (number of atoms per sample), (3) dataset size (number of data samples), and (4) domain shift (similarity of the training and test set). Despite these large differences, benchmarks on small and narrow datasets remain the predominant method of demonstrating progress in graph neural networks (GNNs) for molecular simulation, likely due to cheaper training compute requirements. This raises the question: does GNN progress on small and narrow datasets translate to these more complex datasets? This work investigates this question by first developing the GemNet-OC model based on the large Open Catalyst 2020 (OC20) dataset. GemNet-OC outperforms the previous state-of-the-art on OC20 by 16% while reducing training time by a factor of 10. We then compare the impact of 18 model components and hyperparameter choices on performance across multiple datasets. We find that the resulting model would be drastically different depending on the dataset used for making model choices. To isolate the source of this discrepancy, we study six subsets of the OC20 dataset that individually test each of the above-mentioned four dataset aspects. We find that results on the OC-2M subset correlate well with the full OC20 dataset while being substantially cheaper to train on. Our findings challenge the common practice of developing GNNs solely on small datasets, and highlight ways of achieving fast development cycles and generalizable results via moderately sized, representative datasets such as OC-2M and efficient models such as GemNet-OC. Our code and pretrained model weights are open-sourced.