We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data. For any fixed "target" example $x$, training set $S$, and learning algorithm, a datamodel is a parameterized function $2^S \to \mathbb{R}$ that, for any subset $S' \subset S$, predicts the outcome of training a model on $S'$ and evaluating on $x$, using only information about which examples of $S$ are contained in $S'$. Despite the potential complexity of the underlying process being approximated (e.g., end-to-end training and evaluation of deep neural networks), we show that even simple linear datamodels can successfully predict model outputs. We then demonstrate that datamodels give rise to a variety of applications, such as: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space. Data for this paper (including pre-computed datamodels as well as raw predictions from four million trained deep neural networks) is available at https://github.com/MadryLab/datamodels-data.
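To make the definition concrete, the following is a minimal sketch of a linear datamodel for one target example, assuming each subset $S'$ is encoded as a 0/1 indicator vector over $S$. The function names (`sample_masks`, `simulate_outputs`, `fit_linear_datamodel`, `demo`), the subsampling fraction `alpha`, and the small ridge penalty are illustrative assumptions, not taken from the paper. In particular, `simulate_outputs` is a synthetic stand-in for the expensive step of training a model on $S'$ and evaluating it on $x$, and the regularized least-squares fit is chosen here only for simplicity.

```python
import numpy as np


def sample_masks(n_train, n_subsets, alpha, rng):
    """Draw random subsets S' of S, one per row, as 0/1 indicator vectors."""
    return (rng.random((n_subsets, n_train)) < alpha).astype(np.float64)


def simulate_outputs(masks, true_w, noise_std, rng):
    """Hypothetical stand-in for 'train on S', evaluate on x'.

    A real pipeline would train one model per row of `masks`; here the
    output is a noisy linear function of the mask so the sketch runs fast.
    """
    return masks @ true_w + noise_std * rng.standard_normal(masks.shape[0])


def fit_linear_datamodel(masks, outputs, ridge=1e-3):
    """Fit weights w and bias b so that masks @ w + b approximates outputs."""
    X = np.hstack([masks, np.ones((masks.shape[0], 1))])  # append bias column
    reg = ridge * np.eye(X.shape[1])
    reg[-1, -1] = 0.0  # leave the bias term unpenalized
    theta = np.linalg.solve(X.T @ X + reg, X.T @ outputs)
    return theta[:-1], theta[-1]


def predict(masks, w, b):
    """Predict the model output on x for each subset in `masks`."""
    return masks @ w + b


def demo(seed=0, n_train=20, n_subsets=2000, alpha=0.5):
    """Fit on some subsets, then check predictions on held-out subsets."""
    rng = np.random.default_rng(seed)
    true_w = rng.standard_normal(n_train)  # synthetic per-example influence

    train_masks = sample_masks(n_train, n_subsets, alpha, rng)
    train_out = simulate_outputs(train_masks, true_w, 0.1, rng)
    w, b = fit_linear_datamodel(train_masks, train_out)

    test_masks = sample_masks(n_train, 500, alpha, rng)
    test_out = simulate_outputs(test_masks, true_w, 0.1, rng)
    pred = predict(test_masks, w, b)
    return float(np.corrcoef(pred, test_out)[0, 1])


if __name__ == "__main__":
    print("held-out correlation:", demo())
```

In this sketch each fitted weight `w[i]` can be read as the estimated effect of including training example `i` on the output for $x$, which is the quantity that counterfactual predictions about dataset changes would rely on.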