Data augmentation is becoming essential for improving regression performance in critical applications including manufacturing, climate prediction, and finance. Existing augmentation techniques largely focus on classification tasks and do not readily apply to regression. In particular, the recent Mixup techniques succeed in improving classification performance, which is reasonable given the characteristics of the classification task, but they have limitations in regression. We show that mixing examples separated by large data distances via linear interpolation can have increasingly negative effects on model performance. Our key idea is therefore to limit the distance between examples that are mixed. We propose RegMix, a data augmentation framework for regression that learns, for each example, how many nearest neighbors it should be mixed with for the best model performance, using a validation set. Our experiments on both synthetic and real datasets show that RegMix outperforms state-of-the-art data augmentation baselines applicable to regression.
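The core idea above, mixing each example only with its nearest neighbors via linear interpolation, can be sketched as follows. This is a minimal illustration, not the authors' implementation: RegMix learns a per-example neighbor count on a validation set, whereas this sketch assumes a single fixed `k` and a Beta-distributed interpolation weight (as in standard Mixup).

```python
import numpy as np

def mixup_within_neighbors(X, y, k=5, alpha=2.0, rng=None):
    """Mix each example with one of its k nearest neighbors.

    Simplified sketch of distance-limited mixup for regression;
    `k` and `alpha` are illustrative hyperparameters, not the
    learned per-example values used by RegMix.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    # Pairwise Euclidean distances (fine for small data; use a
    # KD-tree or approximate nearest neighbors at scale).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # never mix an example with itself
    nn = np.argsort(d, axis=1)[:, :k]            # k nearest neighbors per row
    j = nn[np.arange(n), rng.integers(0, k, n)]  # pick one neighbor per row
    lam = rng.beta(alpha, alpha, size=n)[:, None]  # interpolation weights
    X_mix = lam * X + (1 - lam) * X[j]
    y_mix = lam[:, 0] * y + (1 - lam[:, 0]) * y[j]
    return X_mix, y_mix
```

Because both inputs and targets are interpolated with the same weight, a target that is linear in the inputs is preserved exactly under mixing; restricting the mix to nearby neighbors keeps this local-linearity assumption plausible even when the global regression function is nonlinear.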