We provide results that exactly quantify how data augmentation affects the convergence rate and variance of estimates. They lead to some unexpected findings: contrary to common intuition, data augmentation may increase rather than decrease the uncertainty of estimates, such as the empirical prediction risk. Our main theoretical tool is a limit theorem for functions of randomly transformed, high-dimensional random vectors. The proof draws on work in probability on the noise stability of functions of many variables. The pathological behavior we identify is not a consequence of complex models; it can occur even in the simplest settings, and one of our examples is a linear ridge regressor with two parameters. On the other hand, our results also show that data augmentation can have real, quantifiable benefits.