One of the most fundamental aspects of any machine learning algorithm is the training data used by the algorithm. We introduce the novel concept of $\epsilon$-approximation of datasets, obtaining datasets that are either much smaller than, or significant corruptions of, the original training data while maintaining similar model performance. We introduce a meta-learning algorithm called Kernel Inducing Points (KIP) for obtaining such remarkable datasets, inspired by recent developments in the correspondence between infinitely-wide neural networks and kernel ridge regression (KRR). For KRR tasks, we demonstrate that KIP can compress datasets by one or two orders of magnitude, significantly improving over previous dataset distillation and subset selection methods while obtaining state-of-the-art results for MNIST and CIFAR-10 classification. Furthermore, our KIP-learned datasets are transferable to the training of finite-width neural networks even beyond the lazy-training regime, which leads to state-of-the-art results for neural network dataset distillation with potential applications to privacy preservation.
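To make the KIP objective concrete, the sketch below illustrates the core idea under simplifying assumptions: a small learned support set is optimized by gradient descent so that kernel ridge regression fit on it predicts well on the original (target) training data. For brevity it uses a plain RBF kernel rather than the infinite-width neural network kernels used in the paper, and all function and parameter names (`rbf_kernel`, `kip_loss`, `kip_step`, `reg`, `lr`) are illustrative, not the authors' implementation.

```python
import jax
import jax.numpy as jnp

def rbf_kernel(x1, x2, gamma=1.0):
    # Pairwise RBF kernel between two batches of flattened inputs.
    sq = jnp.sum(x1**2, 1)[:, None] + jnp.sum(x2**2, 1)[None, :] - 2.0 * x1 @ x2.T
    return jnp.exp(-gamma * sq)

def kip_loss(support, target_x, target_y, reg=1e-6):
    # KRR is fit on the learned support set and evaluated on the fixed
    # target set; this prediction error is the quantity KIP minimizes.
    xs, ys = support
    k_ss = rbf_kernel(xs, xs)
    k_ts = rbf_kernel(target_x, xs)
    alpha = jnp.linalg.solve(k_ss + reg * jnp.eye(xs.shape[0]), ys)
    preds = k_ts @ alpha
    return jnp.mean((preds - target_y) ** 2)

@jax.jit
def kip_step(support, target_x, target_y, lr=0.1):
    # One gradient step taken on the support images (and labels) themselves.
    loss, grads = jax.value_and_grad(kip_loss)(support, target_x, target_y)
    new_support = jax.tree_util.tree_map(lambda p, g: p - lr * g, support, grads)
    return new_support, loss
```

In use, the support set would be initialized from a small random subset of the training data (e.g. tens of images per class) and `kip_step` iterated with minibatches of target data, yielding a distilled dataset one to two orders of magnitude smaller than the original.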