Data-driven approaches such as deep learning can produce predictive models for material properties with exceptional accuracy and efficiency. In many applications, however, data are sparse, severely limiting model accuracy and applicability. To improve predictions, techniques such as transfer learning and multitask learning have been used. The performance of multitask learning models depends on the strength of the underlying correlations between tasks and the completeness of the data set. Standard multitask models tend to underperform when trained on sparse data sets with weakly correlated properties. To address this gap, we fuse deep-learned embeddings generated by independent pretrained single-task models, resulting in a multitask model that inherits rich, property-specific representations. By reusing (rather than retraining) these embeddings, the resulting fused model outperforms standard multitask models and can be extended with fewer trainable parameters. We demonstrate this technique on a widely used benchmark data set of quantum chemistry data for small molecules as well as a newly compiled sparse data set of experimental data collected from the literature and our own quantum chemistry and thermochemical calculations.
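The fusion idea described above can be illustrated with a minimal sketch, assuming the details: the two "encoders" below are hypothetical stand-ins for pretrained single-task models whose weights are frozen, their embeddings are concatenated, and only a small multitask head is trained (here a linear least-squares fit on synthetic data, not the authors' actual architecture or data).

```python
# Minimal NumPy-only sketch of embedding fusion for multitask learning.
# The frozen projections W_a and W_b stand in for pretrained single-task
# encoders; only the multitask head on the fused embedding is trained.
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features = 200, 16
X = rng.normal(size=(n_samples, n_features))

# Stand-ins for frozen, pretrained single-task encoders (weights fixed).
W_a = rng.normal(size=(n_features, 8))   # encoder for property A
W_b = rng.normal(size=(n_features, 8))   # encoder for property B

def fuse(X):
    """Concatenate the frozen embeddings; no encoder weights are retrained."""
    return np.concatenate([np.tanh(X @ W_a), np.tanh(X @ W_b)], axis=1)

Z = fuse(X)                              # fused embedding, shape (200, 16)

# Two synthetic, weakly correlated target properties.
Y = np.stack([X[:, 0] + 0.1 * rng.normal(size=n_samples),
              X[:, 1] + 0.1 * rng.normal(size=n_samples)], axis=1)

# Only the multitask head is trainable: one linear map on the fused embedding.
head, *_ = np.linalg.lstsq(Z, Y, rcond=None)
pred = Z @ head

print(pred.shape)  # one prediction column per property
```

Because the encoders stay frozen, the trainable parameter count is just the head's weights (16 × 2 here), which is how the fused model can be extended to new properties cheaply.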