Retrosynthesis prediction is one of the fundamental challenges in organic chemistry and related fields. The goal is to find reactant molecules that can synthesize a given product molecule. To solve this task, we propose a new graph-to-graph transformation model, G2GT, in which the graph encoder and graph decoder are built upon the standard transformer structure. We also show that self-training, a powerful data augmentation method that utilizes unlabeled molecule data, can significantly improve the model's performance. Inspired by the reaction type label and ensemble learning, we propose a novel weak ensemble method to enhance diversity. We combine beam search, nucleus sampling, and top-k sampling to further improve inference diversity, and we propose a simple ranking algorithm to retrieve the final top-10 results. We achieve new state-of-the-art results on both the USPTO-50K dataset, with a top-1 accuracy of 54%, and the larger USPTO-full dataset, with a top-1 accuracy of 50%, along with competitive top-10 results.
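The two sampling strategies named above can be illustrated with a minimal sketch. This is not G2GT's implementation: it shows only the generic top-k and nucleus (top-p) truncation of a single next-token distribution. The function names, the toy probability vector, and the values of `k` and `p` are all assumptions made for illustration.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


def sample(filtered, rng):
    """Draw one token index from a filtered, renormalized distribution."""
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution over four hypothetical tokens.
probs = [0.5, 0.3, 0.15, 0.05]
top2 = top_k_filter(probs, k=2)        # keeps token indices 0 and 1
nucleus = nucleus_filter(probs, p=0.9)  # keeps token indices 0, 1 and 2
token = sample(nucleus, random.Random(0))
```

Top-k fixes the number of candidate tokens, whereas nucleus sampling lets that number vary with how peaked the distribution is. That difference is one reason the two can yield different candidate sets when used alongside beam search.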