Forming a molecular candidate set that contains a wide range of potentially effective compounds is crucial to the success of drug discovery. While most databases and machine-learning-based generation models aim to optimize particular chemical properties, there is limited literature on how to properly measure how much of the chemical space the included or generated candidates cover. This problem is challenging because there are no formal criteria for selecting good measures of chemical space coverage. In this paper, we propose a novel evaluation framework for measures of the chemical space based on two analyses: an axiomatic analysis with three intuitive axioms that a good measure should obey, and an empirical analysis of the correlation between a measure and a proxy gold standard. Using this framework, we identify #Circles, a new measure of chemical space coverage that is superior to existing measures both analytically and empirically. We further evaluate how well existing databases and generation models cover the chemical space in terms of #Circles. The results suggest that many generation models fail to explore a larger space than existing databases, which opens new opportunities for improving generation models by encouraging exploration.
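The abstract does not spell out how #Circles is computed. As a rough illustration only, the sketch below assumes a definition along the lines of "the size of the largest subset of molecules whose pairwise distances all exceed a threshold t" and approximates it greedily, which gives a lower bound rather than the exact value. The function names, the use of Tanimoto distance over sets of fingerprint bits, and the toy data are all assumptions made for this example, not details taken from the text above.

```python
from typing import FrozenSet, List, Sequence


def tanimoto_distance(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto (Jaccard) distance between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / len(a | b)


def greedy_num_circles(fingerprints: Sequence[FrozenSet[int]], threshold: float) -> int:
    """Greedy lower bound on the assumed #Circles definition.

    A molecule becomes a new circle center only if its distance to every
    existing center is strictly greater than `threshold`.
    """
    centers: List[FrozenSet[int]] = []
    for fp in fingerprints:
        if all(tanimoto_distance(fp, c) > threshold for c in centers):
            centers.append(fp)
    return len(centers)


# Hypothetical toy fingerprints, not real molecules.
toy_set = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),  # distance 0.5 to the first fingerprint
    frozenset({7, 8, 9}),  # shares no bits with the first two
    frozenset({7, 8, 9}),  # exact duplicate of the previous one
]

print(greedy_num_circles(toy_set, threshold=0.75))
print(greedy_num_circles(toy_set, threshold=0.4))
```

Under this assumed definition, duplicates and near-duplicates do not raise the count, so adding more molecules in an already covered region leaves the measure unchanged. The greedy result depends on the order of the input, and a real pipeline would use fingerprints from a cheminformatics toolkit rather than hand-written sets.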