Methods for designing organic materials with desired properties have high potential impact across fields such as medicine, renewable energy, petrochemical engineering, and agriculture. However, using generative modeling to design substances with desired properties is difficult because candidate compounds must satisfy multiple constraints, including synthetic accessibility and other metrics that are intuitive to domain experts but challenging to quantify. We propose C5T5, a novel self-supervised pretraining method that enables transformers to make zero-shot select-and-replace edits, altering organic substances towards desired property values. C5T5 operates on IUPAC names -- a standardized molecular representation that intuitively encodes rich structural information for organic chemists but that has been largely ignored by the ML community. Our technique requires no edited molecule pairs to train and only a rough estimate of molecular properties, and it has the potential to model long-range dependencies and symmetric molecular structures more easily than graph-based methods. C5T5 also provides a powerful interface to domain experts: it grants users fine-grained control over the generative process by selecting and replacing IUPAC name fragments, which enables experts to leverage their intuitions about structure-activity relationships. We demonstrate C5T5's effectiveness on four physical properties relevant for drug discovery, showing that it learns successful and chemically intuitive strategies for altering molecules towards desired property values.
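The select-and-replace interface described above can be illustrated with a minimal sketch. It assumes a T5-style infilling format: the selected IUPAC name fragment is swapped for a sentinel token and a property token is prepended to indicate the desired value. The sentinel `<extra_id_0>`, the property token `<logP:high>`, and both helper functions are hypothetical illustrations, not the authors' actual vocabulary or code. The replacement fragment is hand-picked here, standing in for what a trained model would generate.

```python
SENTINEL = "<extra_id_0>"  # assumed T5-style sentinel token


def make_infill_prompt(iupac_name: str, fragment: str, property_token: str) -> str:
    """Mask the first occurrence of `fragment` and prepend a property token."""
    if fragment not in iupac_name:
        raise ValueError(f"{fragment!r} does not occur in {iupac_name!r}")
    masked = iupac_name.replace(fragment, SENTINEL, 1)
    return f"{property_token} {masked}"


def apply_replacement(prompt: str, replacement: str) -> str:
    """Drop the property token and fill the sentinel with `replacement`."""
    _, masked = prompt.split(" ", 1)
    return masked.replace(SENTINEL, replacement, 1)


# The expert selects the "methyl" fragment and asks for a (hypothetical)
# high-logP edit; a trained model would then propose the replacement.
prompt = make_infill_prompt("2-methylpropan-1-ol", "methyl", "<logP:high>")
edited = apply_replacement(prompt, "chloro")  # hand-picked stand-in output
print(prompt)
print(edited)
```

Because the user chooses which fragment to mask, the rest of the name is left untouched, which is what gives the expert fine-grained control over where the edit happens.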