Machine-learning models are increasingly used to predict properties of atoms in chemical systems. There have been major advances in developing descriptors and regression frameworks for this task, typically starting from (relatively) small sets of quantum-mechanical reference data. Larger datasets of this kind are becoming available, but remain expensive to generate. Here we demonstrate the use of a large dataset that we have "synthetically" labelled with per-atom energies from an existing ML potential model. The cheapness of this process, compared to the quantum-mechanical ground truth, allows us to generate millions of datapoints, in turn enabling rapid experimentation with atomistic ML models from the small- to the large-data regime. This approach allows us to compare regression frameworks in depth and to explore visualisation based on learned representations. We also show that learning synthetic data labels can be a useful pre-training task for subsequent fine-tuning on small datasets. In the future, we expect that our open-sourced dataset, and similar ones, will be useful in rapidly exploring deep-learning models in the limit of abundant chemical data.
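As a minimal sketch of the labelling idea described above (not the authors' exact pipeline): any calculator that exposes per-atom energies can act as the cheap "teacher". Here ASE's built-in EMT calculator stands in for the ML potential, and the structure pool, output filename, and array key are purely illustrative assumptions.

```python
# Sketch: label a pool of structures with per-atom ("local") energies from a
# cheap surrogate model, producing "synthetic" regression targets without any
# quantum-mechanical calculations. Requires only ASE.
from ase.build import bulk
from ase.calculators.emt import EMT  # stand-in; in practice an ML potential
from ase.io import write

frames = []
for a in (3.5, 3.6, 3.7):  # hypothetical pool of unlabelled structures
    atoms = bulk("Cu", "fcc", a=a, cubic=True).repeat((2, 2, 2))
    atoms.rattle(stdev=0.05, seed=0)  # perturb positions for diversity
    atoms.calc = EMT()
    # Per-atom energies from the surrogate become the synthetic labels.
    atoms.set_array("synthetic_local_energy", atoms.get_potential_energies())
    atoms.calc = None  # detach the calculator before writing
    frames.append(atoms)

# Extended-XYZ preserves the custom per-atom array alongside the structures.
write("synthetically_labelled.xyz", frames)
```

Because the surrogate is orders of magnitude cheaper than the quantum-mechanical ground truth, swapping in an actual ML-potential calculator lets this loop scale to millions of labelled datapoints.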