SELFormer: Molecular Representation Learning via SELFIES Language Models

Automated computational analysis of the vast chemical space is critical for numerous fields of research such as drug discovery and materials science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data. One approach to efficiently learning molecular representations is to process string-based notations of chemicals via natural language processing (NLP) algorithms. The majority of the methods proposed so far utilize SMILES notation for this purpose; however, SMILES is associated with numerous problems related to validity and robustness, which may prevent the model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model that utilizes a 100% valid, compact and expressive notation, SELFIES, as input, in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation revealed that SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based chemical language models, on predicting the aqueous solubility of molecules and adverse drug reactions. We also visualized the molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate molecules with differing structural properties. We shared SELFormer as a programmatic tool, together with its datasets and pre-trained models. Overall, our research demonstrates the benefit of using the SELFIES notation in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features.

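To make the idea of feeding a string-based notation to an NLP model more concrete, the sketch below shows one way a SELFIES string could be split into symbols and mapped to integer ids. It is an illustration only: the abstract does not describe SELFormer's tokenizer, so the helper names (`split_symbols`, `build_vocab`, `encode`), the special tokens and the padding scheme are all assumptions, and only the Python standard library is used.

```python
import re

# Every SELFIES symbol is written in square brackets, e.g. [C], [=O], [Ring1].
SYMBOL = re.compile(r"\[[^\[\]]+\]")


def split_symbols(selfies_string):
    """Split a SELFIES string into its bracketed symbols (hypothetical helper)."""
    return SYMBOL.findall(selfies_string)


def build_vocab(strings, specials=("<pad>", "<unk>")):
    """Assign an integer id to each symbol seen in the corpus.

    The special tokens are an assumption made for this sketch.
    """
    vocab = {tok: i for i, tok in enumerate(specials)}
    for s in strings:
        for sym in split_symbols(s):
            vocab.setdefault(sym, len(vocab))
    return vocab


def encode(selfies_string, vocab, max_len):
    """Map symbols to ids, then truncate or right-pad to max_len."""
    ids = [vocab.get(sym, vocab["<unk>"]) for sym in split_symbols(selfies_string)]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# Ethanol (SMILES CCO) and acetaldehyde (SMILES CC=O) written in SELFIES.
corpus = ["[C][C][O]", "[C][C][=O]"]
vocab = build_vocab(corpus)

print(split_symbols("[C][C][=O]"))      # ['[C]', '[C]', '[=O]']
print(encode("[C][C][O]", vocab, 5))    # [2, 2, 3, 0, 0]
print(encode("[C][N]", vocab, 3))       # [2, 1, 0]; [N] is unseen, so it maps to <unk>
```

Because each bracketed symbol is a natural token boundary, a fixed-size vocabulary of symbols is enough to turn any SELFIES string into an id sequence that a transformer can consume.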