GNNs and chemical fingerprints are the predominant approaches to representing molecules for property prediction. In NLP, however, transformers have become the de facto standard for representation learning thanks to their strong downstream task transfer. In parallel, the software ecosystem around transformers is maturing rapidly, with libraries like HuggingFace and BertViz enabling streamlined training and introspection. In this work, we make one of the first attempts to systematically evaluate transformers on molecular property prediction tasks via our ChemBERTa model. ChemBERTa scales well with pretraining dataset size, offering competitive downstream performance on MoleculeNet and useful attention-based visualization modalities. Our results suggest that transformers offer a promising avenue of future work for molecular representation learning and property prediction. To facilitate these efforts, we release a curated dataset of 77M SMILES from PubChem suitable for large-scale self-supervised pretraining.
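To illustrate the streamlined workflow the HuggingFace ecosystem enables, the sketch below loads a RoBERTa-style masked-LM checkpoint, encodes a SMILES string, and exposes per-layer attention maps of the kind consumable by BertViz. The checkpoint name is illustrative; substitute the released ChemBERTa weights.

```python
# Minimal sketch: encoding a SMILES string with a RoBERTa-style masked-LM
# checkpoint via HuggingFace Transformers. The checkpoint name below is
# assumed for illustration; substitute the released ChemBERTa weights.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "seyonec/ChemBERTa-zinc-base-v1"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name, output_attentions=True)

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, as a SMILES string
inputs = tokenizer(smiles, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.logits: per-token vocabulary scores from the masked-LM head
# outputs.attentions: tuple of per-layer attention tensors, usable for
# attention-based visualization (e.g. BertViz's head_view)
print(outputs.logits.shape)
print(len(outputs.attentions), outputs.attentions[0].shape)
```

For downstream property prediction, the same backbone would typically be fine-tuned with a sequence-classification or regression head on MoleculeNet tasks.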