Influence functions estimate the effect of individual training points on a model's predictions on test data; they were adapted to deep learning by Koh and Liang [2017]. They have been used to detect data poisoning, to identify helpful and harmful examples, to estimate the influence of groups of datapoints, and more. Recently, Ilyas et al. [2022] introduced a linear regression method, termed datamodels, to predict the effect of training points on a model's outputs on test data. The current paper seeks to provide a better theoretical understanding of such interesting empirical phenomena. The primary tools are harmonic analysis and the notion of noise stability. Contributions include: (a) an exact characterization of the learnt datamodel in terms of Fourier coefficients; (b) an efficient method to estimate the residual error and the quality of the optimum linear datamodel without having to train the datamodel; (c) new insights into when influences of groups of datapoints may or may not add up linearly.
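As a rough illustration (not the authors' code), a datamodel in the sense of Ilyas et al. [2022] is a linear map from a 0/1 indicator vector of the training subset to the model's output on a fixed test point, fit by regression over many retrained subsets. A minimal sketch on synthetic data, where the subset masks and per-point effects are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_subsets = 50, 2000

# Binary masks: S[i, j] = 1 iff training point j is in subset i.
# (Synthetic stand-in for random training subsets.)
S = (rng.random((n_subsets, n_train)) < 0.5).astype(float)

# Synthetic "model output on the test point" for each subset:
# a hidden per-point effect plus observation noise.
true_w = rng.normal(size=n_train) / n_train
y = S @ true_w + 0.01 * rng.normal(size=n_subsets)

# Fit the linear datamodel by least squares (with an intercept term).
X = np.hstack([S, np.ones((n_subsets, 1))])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
w_hat, b_hat = coef[:-1], coef[-1]

# The residual error measures how well any linear datamodel can fit;
# estimating this quantity without the fit is contribution (b) above.
residual = np.mean((X @ coef - y) ** 2)
print(residual)
```

In practice `y` comes from actually retraining the model on each subset, which is the expensive step the paper's residual-estimation method aims to sidestep.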