Deep generative models of molecules have grown immensely in popularity; trained on relevant datasets, these models are used to search through chemical space. The downstream utility of generative models for the inverse design of novel functional compounds depends on their ability to learn a training distribution of molecules. The simplest example is a language model that takes the form of a recurrent neural network and generates molecules using a string representation. More sophisticated are graph generative models, which sequentially construct molecular graphs and typically achieve state-of-the-art results. However, recent work has shown that language models are more capable than once thought, particularly in the low-data regime. In this work, we investigate the capacity of simple language models to learn distributions of molecules. For this purpose, we introduce several challenging generative modeling tasks by compiling especially complex distributions of molecules. On each task, we evaluate the language models against two widely used graph generative models. The results demonstrate that language models are powerful generative models, capable of adeptly learning complex molecular distributions, and that they outperform the graph models. Language models can accurately generate: distributions of the highest-scoring penalized LogP molecules in ZINC15, multi-modal molecular distributions, as well as the largest molecules in PubChem.
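To make the abstract's central object concrete, the following is a minimal sketch (not the paper's exact implementation) of the kind of model described: a character-level recurrent language model over string representations of molecules such as SMILES. The class name `SmilesLM`, the `sample` helper, the vocabulary handling, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of a character-level RNN language model over SMILES strings.
# All names and hyperparameters here are illustrative, not the paper's setup.
import torch
import torch.nn as nn

class SmilesLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len) integer-encoded SMILES characters
        h, state = self.rnn(self.embed(tokens), state)
        return self.out(h), state  # next-character logits

@torch.no_grad()
def sample(model, start_idx, end_idx, max_len=100):
    """Autoregressively sample one molecule, one character at a time."""
    token = torch.tensor([[start_idx]])
    state, generated = None, []
    for _ in range(max_len):
        logits, state = model(token, state)
        token = torch.multinomial(logits[:, -1].softmax(-1), 1)
        if token.item() == end_idx:  # stop at the end-of-string token
            break
        generated.append(token.item())
    return generated  # indices to map back to SMILES characters
```

Trained with a standard next-character cross-entropy objective on a corpus of SMILES strings, such a model defines a distribution over molecules; sampling proceeds autoregressively as in `sample` above.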