Title: Towards Understanding How Data Augmentation Works with Imbalanced Data


Damien A. Dablain, Nitesh V. Chawla

Data augmentation forms the cornerstone of many modern machine learning training pipelines, yet the mechanisms by which it works are not clearly understood. Much of the research on data augmentation (DA) has focused on improving existing techniques, examining its regularization effects in the context of neural network over-fitting, or investigating its impact on features. Here, we undertake a holistic examination of the effect of DA on three different classifiers that are commonly used in supervised classification of imbalanced data: convolutional neural networks, support vector machines, and logistic regression models. We support our examination with testing on three image and five tabular datasets. Our research indicates that DA, when applied to imbalanced data, produces substantial changes in model weights, support vectors, and feature selection, even though it may yield only relatively modest changes to global metrics such as balanced accuracy or F1 measure. We hypothesize that DA works by facilitating variances in data, so that machine learning models can associate changes in the data with labels. By diversifying the range of feature amplitudes that a model must recognize to predict a label, DA improves a model's capacity to generalize when learning with imbalanced data.
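
To make the abstract's terms concrete, the following is a minimal sketch, not the authors' method. It computes the two global metrics named above (balanced accuracy and F1) and applies a deliberately simple stand-in augmentation: oversampling the minority class with Gaussian jitter. The function names, the `noise_scale` parameter, and the toy data are all assumptions introduced for illustration; the paper's actual augmentation techniques and datasets are not specified in this abstract.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class; returns 0.0 when it is undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def augment_minority(X, y, minority=1, noise_scale=0.1, seed=0):
    """Hypothetical stand-in DA for binary labels: copy random minority
    rows with Gaussian jitter until both classes have the same count."""
    rng = random.Random(seed)
    minority_rows = [row for row, label in zip(X, y) if label == minority]
    n_majority = len(y) - len(minority_rows)
    X_aug, y_aug = [list(row) for row in X], list(y)
    for _ in range(n_majority - len(minority_rows)):
        base = rng.choice(minority_rows)
        X_aug.append([v + rng.gauss(0.0, noise_scale) for v in base])
        y_aug.append(minority)
    return X_aug, y_aug


def feature_range(X, y, label, j):
    """Spread (max - min) of feature j within one class."""
    vals = [row[j] for row, t in zip(X, y) if t == label]
    return max(vals) - min(vals)


# Toy imbalanced data (assumed): 8 majority rows, 2 minority rows, 1 feature.
X = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [1.0], [1.1]]
y = [0] * 8 + [1] * 2

X_aug, y_aug = augment_minority(X, y)
print("minority count before/after:", y.count(1), y_aug.count(1))
print("minority feature range before:", feature_range(X, y, 1, 0))
print("minority feature range after: ", feature_range(X_aug, y_aug, 1, 0))
```

Because the original minority rows are kept, the minority class's feature range after augmentation can only stay the same or grow. That is one simple reading of "diversifying the range of feature amplitudes"; whether it changes a trained model's weights or metrics depends on the classifier and data, which this sketch does not test.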
