Quantitative structure-activity relationship (QSAR) regression models are commonly used to predict the biological activities of compounds from their molecular descriptors. Predictions from QSAR models can help, for example, to optimize molecular structure, prioritize compounds for further experimental testing, and estimate toxicity. In addition to an accurate estimate of the activity, it is highly desirable to obtain an estimate of the uncertainty associated with the prediction, e.g., a prediction interval (PI) that contains the true molecular activity with a pre-specified probability, say 70%, 90% or 95%. The challenge is that most machine learning (ML) algorithms that achieve superior predictive performance require add-on methods to estimate the uncertainty of their predictions. The development of such methods is an active area of research in the statistical and ML communities, but their implementation for QSAR modeling remains limited. Conformal prediction (CP) is a promising approach: it is agnostic to the prediction algorithm and can produce valid prediction intervals under some weak assumptions on the data distribution. We propose computationally efficient CP algorithms tailored to the most advanced ML models, including Deep Neural Networks and Gradient Boosting Machines. The validity and efficiency of the proposed conformal predictors are demonstrated on a diverse collection of QSAR datasets as well as in simulation studies.
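To make the idea of a model-agnostic prediction interval concrete, the following is a minimal sketch of generic split (inductive) conformal prediction for regression. It is not the tailored algorithms proposed in this work, which the abstract does not specify. The function names (`conformal_quantile`, `split_conformal_intervals`, `fit_line`, `make_data`), the synthetic one-descriptor data, and the least-squares line standing in for a Deep Neural Network or Gradient Boosting Machine are all assumptions made for illustration. The weak assumption in this variant is exchangeability of the calibration and test points.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    # 1-indexed rank ceil((n + 1) * (1 - alpha)) among the sorted scores.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]


def split_conformal_intervals(predict, x_cal, y_cal, x_new, alpha=0.1):
    """Wrap any fitted point predictor with intervals of half-width q."""
    # Nonconformity score: absolute residual on held-out calibration data.
    scores = [abs(y - predict(x)) for x, y in zip(x_cal, y_cal)]
    q = conformal_quantile(scores, alpha)
    return [(predict(x) - q, predict(x) + q) for x in x_new]


def fit_line(xs, ys):
    """One-descriptor least squares; a hypothetical stand-in for any regressor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def make_data(n, rng):
    """Synthetic 'descriptor -> activity' pairs (illustrative, not QSAR data)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]
    return xs, ys


rng = random.Random(0)
x_train, y_train = make_data(200, rng)  # used only to fit the model
x_cal, y_cal = make_data(200, rng)      # used only to calibrate the intervals
x_test, y_test = make_data(1000, rng)

predict = fit_line(x_train, y_train)
intervals = split_conformal_intervals(predict, x_cal, y_cal, x_test, alpha=0.1)
coverage = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, y_test)) / len(y_test)
print(f"empirical coverage at the 90% nominal level: {coverage:.3f}")
```

Because the interval is built only from held-out residuals, `predict` can be replaced by any fitted model without changing the calibration step. The coverage guarantee is marginal (averaged over compounds), and this simple score gives every compound the same interval width; score functions that adapt the width per compound are a separate design choice not shown here.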