We apply a Transformer architecture, specifically BERT, to learn flexible, high-quality molecular representations for drug discovery problems. We study the impact of using different combinations of self-supervised tasks for pre-training, and present our results on established Virtual Screening and QSAR benchmarks. We show that: i) the selection of appropriate self-supervised task(s) for pre-training has a significant impact on performance in subsequent downstream tasks such as Virtual Screening; ii) using auxiliary tasks with more domain relevance for Chemistry, such as learning to predict calculated molecular properties, increases the fidelity of our learnt representations; iii) finally, the molecular representations learnt by our model `MolBert' improve upon the current state of the art on the benchmark datasets.
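To make the multi-task pre-training idea concrete, below is a minimal sketch (not the authors' code) of combining a BERT-style masked-token objective with an auxiliary regression head that predicts calculated molecular properties. The three RDKit descriptors, the head sizes, and the loss weighting `w` are illustrative assumptions; the shared encoder producing `token_states` and `pooled` is left abstract.

```python
# Hypothetical sketch of multi-task pre-training heads: masked-token
# prediction plus regression on calculated molecular properties.
# Descriptor choice, head architecture, and loss weighting are assumptions.
import torch
import torch.nn as nn
from rdkit import Chem
from rdkit.Chem import Descriptors


def property_targets(smiles: str) -> torch.Tensor:
    """Calculated molecular properties used as auxiliary regression targets."""
    mol = Chem.MolFromSmiles(smiles)
    return torch.tensor([
        Descriptors.MolWt(mol),    # molecular weight
        Descriptors.MolLogP(mol),  # Wildman-Crippen logP
        Descriptors.TPSA(mol),     # topological polar surface area
    ])


class MultiTaskHead(nn.Module):
    """Heads on top of a shared BERT-style encoder."""

    def __init__(self, hidden: int, vocab: int, n_props: int = 3):
        super().__init__()
        self.mlm = nn.Linear(hidden, vocab)       # per-token vocabulary logits
        self.props = nn.Linear(hidden, n_props)   # pooled-sequence property regression

    def forward(self, token_states: torch.Tensor, pooled: torch.Tensor):
        return self.mlm(token_states), self.props(pooled)


def pretraining_loss(mlm_logits, mlm_labels, prop_preds, prop_labels, w: float = 1.0):
    """Cross-entropy on masked tokens plus weighted MSE on calculated properties."""
    ce = nn.functional.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,  # positions that were not masked
    )
    mse = nn.functional.mse_loss(prop_preds, prop_labels)
    return ce + w * mse
```

In this setup the encoder must learn representations that both recover masked SMILES tokens and carry enough physicochemical information to regress the calculated properties, which is one plausible reading of why such domain-relevant auxiliary tasks increase representation fidelity.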