A noisy training set usually degrades the generalization and robustness of neural networks. In this paper, we propose a novel, theoretically guaranteed clean-sample selection framework for learning with noisy labels. Specifically, we first present a Scalable Penalized Regression (SPR) method that models the linear relation between network features and one-hot labels. In SPR, clean data are identified by the zero mean-shift parameters solved in the regression model. We theoretically show that SPR can recover clean data under certain conditions. In general scenarios, these conditions may no longer hold, and some noisy data are falsely selected as clean. To address this problem, we propose a data-adaptive method, Scalable Penalized Regression with Knockoff filters (Knockoffs-SPR), which provably controls the False-Selection-Rate (FSR) among the selected clean data. To improve efficiency, we further present a splitting algorithm that divides the whole training set into small pieces that can be solved in parallel, making the framework scalable to large datasets. While Knockoffs-SPR can be regarded as a sample selection module for a standard supervised training pipeline, we further combine it with a semi-supervised algorithm to exploit the support of noisy data as unlabeled data. Experimental results on several benchmark datasets and real-world noisy datasets demonstrate the effectiveness of our framework and validate the theoretical results of Knockoffs-SPR. Our code and pre-trained models will be released.
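The mean-shift selection idea behind SPR can be illustrated with a minimal sketch. This is not the paper's actual SPR solver (which works on network features and one-hot labels with a scalable splitting scheme); it is a toy scalar-regression version under assumed names (`mean_shift_selection`, `soft_threshold`, penalty `lam` are illustrative). The model is y = Xβ + γ + ε with an L1 penalty on the per-sample shift γ: samples whose γ_i is driven to zero are treated as clean.

```python
import numpy as np

def soft_threshold(x, lam):
    # Proximal operator of the L1 penalty.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def mean_shift_selection(X, y, lam=0.5, n_iter=100):
    """Toy mean-shift penalized regression (illustrative, not the paper's SPR).

    Model: y = X @ beta + gamma + eps, with an L1 penalty on gamma.
    Samples whose mean-shift parameter gamma_i is shrunk to zero are
    treated as clean; a nonzero gamma_i flags a likely noisy label.
    """
    n, _ = X.shape
    gamma = np.zeros(n)
    for _ in range(n_iter):
        # beta step: least squares on the shift-corrected targets.
        beta, *_ = np.linalg.lstsq(X, y - gamma, rcond=None)
        # gamma step: soft-threshold the residuals.
        gamma = soft_threshold(y - X @ beta, lam)
    clean_mask = gamma == 0.0
    return clean_mask, gamma

# Toy data: 50 samples, 5 features; the first 5 targets are corrupted.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
beta_true = rng.normal(size=5)
y = X @ beta_true + 0.05 * rng.normal(size=50)
y[:5] += 3.0  # inject label noise
clean_mask, gamma = mean_shift_selection(X, y, lam=0.5)
```

On this toy data the corrupted samples keep large residuals, so their γ_i survive the soft-thresholding and they are excluded from the clean set, while the uncorrupted samples are retained. The Knockoffs-SPR extension described above additionally controls the FSR of this selection, which the plain penalized regression does not.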