Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. We show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several downstream tasks from ten benchmark datasets, while performing competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties.
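To make the architectural ingredients concrete, the sketch below illustrates in generic PyTorch how rotary positional embeddings and a linear (kernelized) attention mechanism can be combined in a single transformer encoder layer over tokenized SMILES. This is not the MoLFormer implementation: the layer sizes, the elu+1 feature map (as in Katharopoulos et al.), the half-split rotary variant, and the toy vocabulary are assumptions chosen for brevity.

```python
# Minimal, illustrative sketch (not the authors' code) of an encoder layer
# combining rotary positional embeddings with linear attention over SMILES
# tokens. Dimensions and the elu+1 feature map are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def rotary_embed(x: torch.Tensor) -> torch.Tensor:
    """Apply rotary positional embeddings to x of shape (batch, seq, heads, dim)."""
    b, seq_len, heads, dim = x.shape
    half = dim // 2
    # Per-pair frequencies and per-position rotation angles.
    freqs = 10000 ** (-torch.arange(0, half, device=x.device).float() / half)
    angles = torch.arange(seq_len, device=x.device).float()[:, None] * freqs[None, :]
    cos, sin = angles.cos()[None, :, None, :], angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class LinearAttentionEncoderLayer(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        # Split into heads: (batch, seq, heads, d_head).
        q = q.view(b, s, self.n_heads, self.d_head)
        k = k.view(b, s, self.n_heads, self.d_head)
        v = v.view(b, s, self.n_heads, self.d_head)
        # Rotary embeddings are applied to queries and keys only.
        q, k = rotary_embed(q), rotary_embed(k)
        # Linear attention: replace softmax with a positive feature map,
        # giving O(seq_len) cost instead of O(seq_len^2).
        q, k = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("bshd,bshe->bhde", k, v)           # sum_s phi(k) v^T
        z = 1.0 / (torch.einsum("bshd,bhd->bsh", q, k.sum(dim=1)) + 1e-6)
        attn = torch.einsum("bshd,bhde,bsh->bshe", q, kv, z)
        x = x + self.out(attn.reshape(b, s, d))
        return x + self.ff(self.norm2(x))


# Usage: encode a toy batch of SMILES token ids (vocabulary size assumed).
vocab_size, d_model = 1000, 256
embed = nn.Embedding(vocab_size, d_model)
layer = LinearAttentionEncoderLayer(d_model)
tokens = torch.randint(0, vocab_size, (2, 32))   # e.g. tokenized, padded SMILES
print(layer(embed(tokens)).shape)                # torch.Size([2, 32, 256])
```

The linear-attention formulation is what keeps the per-layer cost linear in sequence length, which, together with highly distributed training, is what makes pretraining on 1.1 billion SMILES sequences tractable; the rotary embeddings supply the relative positional signal within each sequence.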