Sample re-weighting strategies offer a promising way to deal with imperfect training data in machine learning, such as noisily labeled or class-imbalanced data. One such strategy formulates a bi-level optimization problem, called the meta re-weighting problem, whose goal is to optimize performance on a small set of perfect pivotal samples, called meta samples. Many approaches have been proposed to solve this problem efficiently. However, all of them assume that a perfect meta sample set is already provided, whereas we observe that the choice of the meta sample set is performance-critical. In this paper, we study how to learn to identify such a meta sample set from a large, imperfect training set; the selected samples are subsequently cleaned and used to optimize performance in the meta re-weighting setting. We propose a learning framework that reduces the meta sample selection problem to a weighted K-means clustering problem through rigorous theoretical analysis. Within this framework, we propose two clustering methods, the Representation-based clustering method (RBC) and the Gradient-based clustering method (GBC), which balance performance and computational efficiency. Empirical studies demonstrate the performance advantage of our methods over various baseline methods.
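To make the reduction to weighted K-means more concrete, the following is a minimal sketch of generic weighted K-means (Lloyd-style iterations) followed by picking the sample nearest to each centroid as a meta sample candidate. It is an illustration under stated assumptions, not the RBC or GBC procedure itself: the abstract does not specify the feature vectors, the per-sample weights, or the selection rule, so `X`, `w`, `weighted_kmeans`, and `select_candidates` are hypothetical names introduced here.

```python
import numpy as np


def weighted_kmeans(X, w, k, n_iter=50, seed=0):
    """Generic weighted K-means; X is (n, d), w is (n,) non-negative weights."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct samples (an arbitrary choice here).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every centroid: (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            mask = labels == j
            # Each centroid is the weighted mean of its assigned samples.
            if mask.any() and w[mask].sum() > 0:
                new_centers[j] = np.average(X[mask], axis=0, weights=w[mask])
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    # Recompute assignments so labels match the returned centroids.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)


def select_candidates(X, centers):
    """Index of the sample nearest to each centroid (assumed selection rule)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # np.unique drops duplicates if one sample is nearest to two centroids.
    return np.unique(d2.argmin(axis=0))
```

In this sketch, `X` could hold learned representations or per-sample gradients, loosely mirroring the representation-based and gradient-based variants named above, and the selected indices would then be the samples sent for cleaning. How the weights are derived is left open, since that depends on the analysis in the paper.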