We discover a robust self-supervised strategy tailored to molecular representations for generative masked language models through a series of in-depth ablations. Using this pre-training strategy, we train BARTSmiles, a BART-like model trained with an order of magnitude more compute than previous self-supervised molecular representations. In-depth evaluations show that BARTSmiles consistently outperforms other self-supervised representations across classification, regression, and generation tasks, setting a new state of the art on 11 tasks. We then quantitatively show that, when applied to the molecular domain, the BART objective learns representations that implicitly encode our downstream tasks of interest. For example, by selecting seven neurons from a frozen BARTSmiles, we can obtain a model whose performance is within two percentage points of the fully fine-tuned model on the ClinTox task. Lastly, we show that standard attribution interpretability methods, when applied to BARTSmiles, highlight certain substructures that chemists use to explain specific properties of molecules. The code and the pretrained model are publicly available.
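To make the frozen-probe claim concrete, here is a minimal sketch of the idea of selecting a handful of neurons from frozen encoder activations and fitting a linear probe on top of them. It assumes a feature matrix of pooled BARTSmiles activations and binary labels have already been extracted; the random arrays below are hypothetical stand-ins, and the selection criterion and probe are illustrative, not the paper's exact procedure.

```python
# Sketch: probe a small subset of frozen-encoder neurons with a linear model.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1024))     # stand-in for pooled frozen BARTSmiles activations
y = rng.integers(0, 2, size=500)     # stand-in for ClinTox-style binary labels

# Keep only k neurons (k = 7, echoing the abstract) and train a linear probe;
# the encoder weights themselves are never updated.
probe = make_pipeline(
    SelectKBest(mutual_info_classif, k=7),
    LogisticRegression(max_iter=1000),
)
print(cross_val_score(probe, X, y, cv=5, scoring="roc_auc").mean())
```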