Investigation of a Data Split Strategy Involving the Time Axis in Adverse Event Prediction Using Machine Learning

Adverse events are a serious issue in drug development, and many machine learning methods have been developed to predict them. Random-split cross-validation is the de facto standard for model building and evaluation in machine learning, but care should be taken in adverse event prediction because this approach tends to be overoptimistic compared with the real-world situation. The time split, which divides the data along the time axis, is considered suitable for real-world prediction. However, the differences in model performance obtained using the time and random splits are not fully understood. To understand these differences, we compared model performance between the time and random splits using eight types of compound information as input, eight adverse events as targets, and six machine learning algorithms. The random split showed higher area under the curve (AUC) values than did the time split for six of the eight targets. The chemical spaces of the training and test datasets of the time split were similar, suggesting that the concept of applicability domain is insufficient to explain the differences derived from the splitting. The AUC differences were smaller for the protein interaction dataset than for the other datasets. Subsequent detailed analyses suggested a danger of confounding when knowledge-based information is used in the time split. These findings indicate the importance of understanding the differences between the time and random splits in adverse event prediction and suggest that appropriate use of the splitting strategies and careful interpretation of results are necessary for the real-world prediction of adverse events.
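
The contrast between the two splitting strategies described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the record layout, the `year` field, the 25% test fraction, and the function names are all assumptions chosen for illustration. A random split shuffles compounds before holding some out, while a time split trains on the oldest compounds and tests on the newest ones.

```python
import random


def random_split(records, test_fraction=0.25, seed=0):
    """Shuffle the records and hold out a random fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def time_split(records, test_fraction=0.25, time_key="year"):
    """Train on the oldest records and test on the most recent ones."""
    ordered = sorted(records, key=lambda r: r[time_key])
    n_test = int(round(len(ordered) * test_fraction))
    split_at = len(ordered) - n_test
    return ordered[:split_at], ordered[split_at:]


# Toy data: hypothetical compounds, each tagged with a hypothetical year.
compounds = [{"id": f"cpd{i}", "year": 2000 + i} for i in range(12)]

rand_train, rand_test = random_split(compounds)
time_train, time_test = time_split(compounds)

# In the time split, every training compound predates every test compound.
latest_train_year = max(r["year"] for r in time_train)
earliest_test_year = min(r["year"] for r in time_test)
```

Under the time split, the model never sees information from the period it is evaluated on, which mimics prospective use; under the random split, old and new compounds are mixed across both sets.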
