Motivation: The development of novel compounds targeting proteins of interest is one of the most important tasks in the pharmaceutical industry. Deep generative models have been applied to targeted molecular design and have shown promising results. Recently, target-specific molecule generation has been viewed as a translation between the protein language and the chemical language. However, such a model is limited by the availability of interacting protein-ligand pairs. On the other hand, large amounts of unlabeled protein sequences and chemical compounds are available and have been used to train language models that learn useful representations. In this study, we propose exploiting pretrained biochemical language models to initialize (i.e. warm start) targeted molecule generation models. We investigate two warm-start strategies: (i) a one-stage strategy, where the initialized model is trained on targeted molecule generation, and (ii) a two-stage strategy with pre-finetuning on molecular generation followed by target-specific training. We also compare two decoding strategies for generating compounds: beam search and sampling.

Results: The results show that the warm-started models perform better than a baseline model trained from scratch. The two proposed warm-start strategies achieve similar results with respect to widely used benchmark metrics. However, docking evaluation of the compounds generated for a number of novel proteins suggests that the one-stage strategy generalizes better than the two-stage strategy. Additionally, we observe that beam search outperforms sampling in both docking evaluation and benchmark metrics for assessing compound quality.

Availability and implementation: The source code is available at https://github.com/boun-tabi/biochemical-lms-for-drug-design and the materials are archived in Zenodo at https://doi.org/10.5281/zenodo.6832145.
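The two decoding strategies compared above can be illustrated with a minimal, self-contained sketch. The toy `next_token_probs` model below is a hypothetical stand-in for the paper's chemical language model (the real model decodes SMILES strings conditioned on a protein sequence); it exists only to show how beam search keeps the highest-scoring partial sequences while sampling draws tokens stochastically from the same distribution.

```python
import math
import random

# Tiny SMILES-like vocabulary; a hypothetical stand-in for the real
# chemical-language vocabulary used by the paper's models.
VOCAB = ["C", "O", "N", "<eos>"]


def next_token_probs(prefix):
    # Hypothetical conditional distribution: longer prefixes increasingly
    # favour the end-of-sequence token. A real model would condition on the
    # target protein and the generated prefix.
    base = [0.5, 0.2, 0.1, 0.2 + 0.05 * len(prefix)]
    z = sum(base)
    return {tok: p / z for tok, p in zip(VOCAB, base)}


def beam_search(beam_width=2, max_len=5):
    # Deterministically keep the `beam_width` partial sequences with the
    # highest cumulative log-probability at every step.
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished beam carries over
                continue
            for tok, p in next_token_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return ["".join(t for t in seq if t != "<eos>") for seq, _ in beams]


def sample(max_len=5, seed=0):
    # Draw each token at random according to the model's distribution
    # instead of maximizing it, yielding more diverse outputs.
    rng = random.Random(seed)
    seq = []
    for _ in range(max_len):
        probs = next_token_probs(seq)
        tok = rng.choices(list(probs), weights=list(probs.values()))[0]
        if tok == "<eos>":
            break
        seq.append(tok)
    return "".join(seq)
```

With this toy distribution, `beam_search()` always returns the same top candidates, whereas repeated calls to `sample()` with different seeds yield varied strings; this determinism-versus-diversity trade-off is what the paper's docking and benchmark comparison of the two strategies probes.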