The acquisition of labels for supervised learning can be expensive. In order to improve the sample-efficiency of neural network regression, we study active learning methods that adaptively select batches of unlabeled data for labeling. We present a framework for constructing such methods out of (network-dependent) base kernels, kernel transformations and selection methods. Our framework encompasses many existing Bayesian methods based on Gaussian Process approximations of neural networks as well as non-Bayesian methods. Additionally, we propose to replace the commonly used last-layer features with sketched finite-width Neural Tangent Kernels, and to combine them with a novel clustering method. To evaluate different methods, we introduce an open-source benchmark consisting of 15 large tabular regression data sets. Our proposed method outperforms the state-of-the-art on our benchmark, scales to large data sets, and works out-of-the-box without adjusting the network architecture or training code. We provide open-source code that includes efficient implementations of all kernels, kernel transformations, and selection methods, and can be used for reproducing our results.