Deep learning has achieved remarkable success in learning representations for molecules, which is crucial for various biochemical applications ranging from property prediction to drug design. However, training Deep Neural Networks (DNNs) from scratch often requires abundant labeled molecules, which are expensive to acquire in the real world. To alleviate this issue, tremendous efforts have been devoted to Molecular Pre-trained Models (MPMs), where DNNs are pre-trained on large-scale unlabeled molecular databases and then fine-tuned on specific downstream tasks. Despite this progress, a systematic review of this fast-growing field is still lacking. In this paper, we present the first survey summarizing the current progress of MPMs. We first highlight the limitations of training molecular representation models from scratch to motivate the study of MPMs. Next, we systematically review recent advances on this topic from several key perspectives, including molecular descriptors, encoder architectures, pre-training strategies, and applications. Finally, we highlight the challenges and promising avenues for future research, providing a useful resource for both the machine learning and scientific communities.