Finding valuable training data points for deep neural networks has been a core research challenge with many applications. In recent years, various techniques have been proposed for calculating the "value" of individual training data points in order to explain trained models. However, the value of a training data point also depends on the other data points selected for training, a notion that existing methods do not explicitly capture. In this paper, we study the problem of selecting high-value subsets of training data. The key idea is to design a learnable framework for online subset selection that can be trained on mini-batches of training data, making our method scalable. This yields a parameterized convex subset selection problem that is amenable to differentiable convex programming, allowing us to learn the parameters of the selection model in end-to-end training. Using this framework, we design an online alternating-minimization algorithm that jointly learns the parameters of the selection model and the ML model. Extensive evaluation on a synthetic dataset and three standard datasets shows that our algorithm consistently finds higher-value subsets of training data than recent state-of-the-art methods, sometimes by around 20%. The selected subsets are also useful for identifying mislabelled training data. The running time of our algorithm is comparable to that of existing valuation methods.
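To make the described pipeline concrete, the following is a minimal sketch, not the authors' exact formulation: it poses a relaxed subset-selection problem as a differentiable convex layer (using the cvxpylayers library) inside an alternating-minimization loop. The scoring network `score_net`, the ML model `model`, the budget `k`, the quadratic regularizer `eps`, and the synthetic data loader are all illustrative assumptions, as is the use of low per-point training loss as a simple proxy for value.

```python
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

n, k, eps = 32, 8, 1e-2  # mini-batch size, subset budget, regularization strength

# Parameterized convex program: choose a relaxed indicator w in [0, 1]^n that
# maximizes learned per-point scores s under a cardinality budget sum(w) == k.
w = cp.Variable(n)
s = cp.Parameter(n)
problem = cp.Problem(
    cp.Minimize(-s @ w + eps * cp.sum_squares(w)),
    [w >= 0, w <= 1, cp.sum(w) == k],
)
select = CvxpyLayer(problem, parameters=[s], variables=[w])  # differentiable in s

score_net = torch.nn.Linear(10, 1)  # hypothetical selection model
model = torch.nn.Linear(10, 1)      # hypothetical ML model
opt_sel = torch.optim.Adam(score_net.parameters(), lr=1e-3)
opt_ml = torch.optim.Adam(model.parameters(), lr=1e-3)

# Synthetic stand-in for a real training set.
data = torch.utils.data.TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
loader = torch.utils.data.DataLoader(data, batch_size=n, drop_last=True)

def per_point_loss(x, y):
    return torch.nn.functional.mse_loss(model(x), y, reduction="none").squeeze(-1)

for x, y in loader:
    # Step 1: update the selection model; gradients flow through the convex layer,
    # steering scores toward points with low model loss (a simple proxy for value).
    scores = score_net(x).squeeze(-1)
    (w_star,) = select(scores)
    sel_loss = (w_star * per_point_loss(x, y)).mean()
    opt_sel.zero_grad()
    sel_loss.backward()
    opt_sel.step()

    # Step 2: update the ML model on the re-weighted mini-batch, selection fixed.
    with torch.no_grad():
        (w_star,) = select(score_net(x).squeeze(-1))
    ml_loss = (w_star * per_point_loss(x, y)).mean()
    opt_ml.zero_grad()
    ml_loss.backward()
    opt_ml.step()
```

The relaxation from binary to box-constrained weights is what makes the selection step a convex program; a small quadratic term keeps the problem well-conditioned for the differentiable solver.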