The randomized or cross-validated split of training and testing sets has been the gold standard of machine learning for decades. These split protocols rest on two assumptions: (i) the dataset is fixed and static, so that different machine learning algorithms or models can be evaluated against it; (ii) a complete set of annotated data is available to researchers or industrial practitioners. In this article, we take a closer, critical look at the split protocol itself and point out its weaknesses and limitations, especially for industrial applications. In many real-world problems, assumption (ii) does not hold. For instance, in interdisciplinary applications such as drug discovery, annotating data often requires real lab experiments, which are costly in both time and money. In other words, it can be very difficult or even impossible to satisfy assumption (ii). We assess this problem, revisit the paradigm of active learning, and investigate its potential for solving problems under unconventional train/test split protocols. We further propose a new adaptive active learning architecture (AAL) that involves an adaptation policy, in contrast to traditional active learning, which only unidirectionally adds data points to the training pool. We justify our points primarily through an extensive investigation of an interdisciplinary drug-protein binding problem. We additionally evaluate AAL on more conventional machine learning benchmark datasets such as CIFAR-10 to demonstrate the generalizability and efficacy of the new framework.
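To make the contrast with a fixed train/test split concrete, the following is a minimal sketch of traditional pool-based active learning, in which labels are requested only for points the model selects and the training pool grows in one direction. It does not implement the AAL adaptation policy, which the abstract does not specify. The synthetic data, the logistic-regression model, the uncertainty-sampling rule, and all names (`fit_logreg`, `predict_proba`, `active_learning_loop`, `oracle`) are assumptions chosen for illustration, not the authors' setup.

```python
import numpy as np

def fit_logreg(X, y, lr=0.5, steps=300):
    """Fit a plain logistic regression by gradient descent (illustrative model)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    """Predicted probability of class 1 for each row of X."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def active_learning_loop(X, oracle, init_idx, rounds=5, batch=10):
    """Pool-based loop: each round labels the `batch` most uncertain pool points.

    `oracle(i)` stands in for the expensive annotation step (e.g. a lab
    experiment); it is a hypothetical callback, assumed for this sketch.
    """
    labeled = list(init_idx)
    labels = {i: oracle(i) for i in labeled}
    for _ in range(rounds):
        w = fit_logreg(X[labeled], np.array([labels[i] for i in labeled]))
        pool = np.array([i for i in range(len(X)) if i not in labels])
        # Uncertainty sampling: smallest distance of the predicted probability from 0.5.
        distance = np.abs(predict_proba(w, X[pool]) - 0.5)
        chosen = pool[np.argsort(distance)[:batch]]
        for i in chosen:
            labels[int(i)] = oracle(int(i))
            labeled.append(int(i))  # points are only ever added, never removed
    w = fit_logreg(X[labeled], np.array([labels[i] for i in labeled]))
    return w, labeled

# Synthetic two-class data standing in for an unlabeled pool (assumption).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (200, 2)), rng.normal(1.0, 1.0, (200, 2))])
y_true = np.array([0] * 200 + [1] * 200)

w, labeled = active_learning_loop(X, lambda i: y_true[i], init_idx=[0, 1, 2, 200, 201, 202])
acc = float(np.mean((predict_proba(w, X) > 0.5) == y_true))
print(f"labels requested: {len(labeled)} of {len(X)}, accuracy on the full pool: {acc:.2f}")
```

In this sketch only the indices in `labeled` ever pass through `oracle`, so the annotation cost is bounded by the initial set plus `rounds * batch` queries rather than by the size of the full dataset.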