Adverse events are a serious issue in drug development, and many machine learning methods have been developed to predict them. Random-split cross-validation is the de facto standard for building and evaluating machine learning models, but it should be used with care in adverse event prediction because it tends to give overoptimistic estimates compared with the real-world situation. The time split, which partitions the data along the time axis, is considered suitable for real-world prediction. However, the differences in model performance between the time and random splits are not fully understood. To clarify these differences, we compared model performance under the two splits using eight types of compound information as input, eight adverse events as targets, and six machine learning algorithms. The random split yielded higher area under the curve (AUC) values than the time split for six of the eight targets. The chemical spaces of the training and test datasets in the time split were similar, suggesting that the concept of applicability domain is insufficient to explain the differences arising from the splitting strategy. The AUC differences were smaller for the protein interaction dataset than for the other datasets. Subsequent detailed analyses suggested a danger of confounding when knowledge-based information is used with the time split. These findings indicate the importance of understanding the differences between the time and random splits in adverse event prediction, and suggest that appropriate use of splitting strategies and careful interpretation of results are necessary for the real-world prediction of adverse events.
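The contrast between the two splitting strategies can be illustrated with a minimal sketch. It is not the authors' pipeline: the record fields (`id`, `date`, `label`), the toy data, the cutoff date, and the function names `random_split` and `time_split` are all assumptions made for illustration, and the abstract does not state which date defines the time axis.

```python
import random
from datetime import date


def random_split(records, test_fraction=0.2, seed=0):
    """Shuffle the records and hold out a random fraction as the test set."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def time_split(records, cutoff):
    """Train on records dated before the cutoff and test on the rest."""
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test


# Hypothetical toy data: one compound per year, with an alternating label.
records = [
    {"id": i, "date": date(2000 + i, 1, 1), "label": i % 2}
    for i in range(10)
]

train_r, test_r = random_split(records, test_fraction=0.2, seed=0)
train_t, test_t = time_split(records, cutoff=date(2008, 1, 1))
```

In the random split, test compounds can predate training compounds, so the model may be evaluated on data older than what it was trained on. In the time split, every test record is at least as recent as the cutoff, which is closer to predicting adverse events for compounds that do not yet exist at training time.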