Training on web-scale data can take months. But most computation and time are wasted on redundant and noisy points that are already learnt or not learnable. To accelerate training, we introduce Reducible Holdout Loss Selection (RHO-LOSS), a simple but principled technique which selects approximately those points for training that most reduce the model's generalization loss. As a result, RHO-LOSS mitigates the weaknesses of existing data selection methods: techniques from the optimization literature typically select 'hard' (e.g. high loss) points, but such points are often noisy (not learnable) or less task-relevant. Conversely, curriculum learning prioritizes 'easy' points, but such points need not be trained on once learnt. In contrast, RHO-LOSS selects points that are learnable, worth learning, and not yet learnt. RHO-LOSS trains in far fewer steps than prior art, improves accuracy, and speeds up training on a wide range of datasets, hyperparameters, and architectures (MLPs, CNNs, and BERT). On the large web-scraped image dataset Clothing-1M, RHO-LOSS trains in 18x fewer steps and reaches 2% higher final accuracy than uniform data shuffling.
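As a rough illustration of the selection rule described above, the following is a minimal sketch, assuming a PyTorch classification setup: the reducible holdout loss of a candidate point is approximated as its current training loss minus the loss of a separate model trained on a holdout set (the "irreducible" loss), and the highest-scoring points in a candidate batch are kept for the gradient step. The function name, the `irreducible_loss_model` argument, and the surrounding interface are illustrative, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def rho_loss_select(model, irreducible_loss_model, xb, yb, k):
    """Select the k candidate points with the highest reducible holdout loss.

    Reducible loss = current training loss - irreducible (holdout-model) loss.
    It is high for points that are learnable (low irreducible loss) but not
    yet learnt (high training loss), and low for noisy or redundant points.
    """
    with torch.no_grad():
        # Per-point loss of the model being trained on each candidate.
        train_loss = F.cross_entropy(model(xb), yb, reduction="none")
        # Per-point loss of the holdout-trained (irreducible loss) model.
        irreducible_loss = F.cross_entropy(
            irreducible_loss_model(xb), yb, reduction="none"
        )
    reducible_loss = train_loss - irreducible_loss
    top_idx = torch.topk(reducible_loss, k).indices
    return xb[top_idx], yb[top_idx]
```

In a training loop, one would draw a large candidate batch from the loader, call a selection routine like the sketch above to keep only the top-k points, and take the usual gradient step on that smaller selected batch.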