A critical look at the current train/test split in machine learning

The randomized or cross-validated split of training and testing sets has been adopted as the gold standard of machine learning for decades. These split protocols rest on two assumptions: (i) the dataset is fixed and eternally static, so that different machine learning algorithms or models can be evaluated on it; (ii) a complete set of annotated data is available to researchers or industrial practitioners. In this article, however, we take a closer, critical look at the split protocol itself and point out its weaknesses and limitations, especially for industrial applications. In many real-world problems, assumption (ii) does not hold. For instance, in interdisciplinary applications like drug discovery, annotating data often requires real lab experiments, which are costly in both time and money. In other words, it can be very difficult or even impossible to satisfy assumption (ii). We assess this problem, revisit the paradigm of active learning, and investigate its potential for solving problems under unconventional train/test split protocols. We further propose a new adaptive active learning architecture (AAL) that involves an adaptation policy, in contrast to traditional active learning, which only unidirectionally adds data points to the training pool. We justify our points primarily through an extensive investigation of an interdisciplinary drug-protein binding problem. We additionally evaluate AAL on more conventional machine learning benchmark datasets such as CIFAR-10 to demonstrate the generalizability and efficacy of the new framework.
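To make the contrast concrete, here is a minimal sketch of the traditional pool-based active learning loop that the abstract describes, in which labeled points are only ever added to the training pool. It is not the AAL architecture, whose adaptation policy is not specified in this abstract. The nearest-centroid classifier, the margin-based query rule, the synthetic 2-D pool, and the names `make_pool`, `oracle`, and `active_learning` are all illustrative assumptions.

```python
import random


def make_pool(n, seed=0):
    """Build a synthetic 2-D pool; the first two points seed one example per class."""
    rng = random.Random(seed)
    pool = [(-1.0, -1.0), (1.0, 1.0)]
    pool += [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    return pool


def oracle(p):
    """Hypothetical stand-in for a costly annotation step (e.g. a lab experiment)."""
    return 1 if p[0] + p[1] > 0 else 0


def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def fit_centroids(points, labels):
    """Nearest-centroid 'model': the mean point of each class."""
    sums, counts = {}, {}
    for (x, y), lab in zip(points, labels):
        sx, sy = sums.get(lab, (0.0, 0.0))
        sums[lab] = (sx + x, sy + y)
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: (sums[lab][0] / counts[lab], sums[lab][1] / counts[lab])
            for lab in sums}


def predict(centroids, p):
    return min(centroids, key=lambda lab: sq_dist(p, centroids[lab]))


def margin(centroids, p):
    """Gap between the two closest centroids; a small gap means high uncertainty."""
    d = sorted(sq_dist(p, c) for c in centroids.values())
    return d[1] - d[0]


def active_learning(pool, seed_idx, budget):
    """Query the most uncertain unlabeled point `budget` times; labels are only added."""
    labeled = {i: oracle(pool[i]) for i in seed_idx}
    for _ in range(budget):
        idx = list(labeled)
        centroids = fit_centroids([pool[i] for i in idx], [labeled[i] for i in idx])
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        query = min(unlabeled, key=lambda i: margin(centroids, pool[i]))
        labeled[query] = oracle(pool[query])  # one-directional: never removed
    idx = list(labeled)
    centroids = fit_centroids([pool[i] for i in idx], [labeled[i] for i in idx])
    return centroids, labeled
```

Calling `active_learning(make_pool(50), [0, 1], budget=10)` spends ten oracle queries on top of the two seed labels. In this setting there is no fixed, fully annotated train/test split: the labeled set grows only as the annotation budget is spent.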


