Neural network (NN) models have the potential to speed up the drug discovery process and reduce its failure rates. The success of NN models requires uncertainty quantification (UQ), because drug discovery explores chemical space beyond the training data distribution. Standard NN models do not provide uncertainty information. Methods that combine Bayesian models with NN models address this issue, but they are difficult to implement and more expensive to train. Some methods require changing the NN architecture or training procedure, which limits the selection of NN models. Moreover, predictive uncertainty can come from different sources. It is important to be able to model the different types of predictive uncertainty separately, as the model can take different actions depending on the source of uncertainty. In this paper, we examine UQ methods that estimate different sources of predictive uncertainty for NN models aimed at drug discovery. We use our prior knowledge of chemical compounds to design the experiments. Using a visualization method, we create non-overlapping and chemically diverse partitions from a collection of chemical compounds. These partitions are used as training and test set splits to explore NN model uncertainty. We demonstrate how the uncertainties estimated by the selected methods describe different sources of uncertainty under different partitions and featurization schemes, and how they relate to prediction error.