Understanding how changes in training data affect a trained model is critical to building trust across the stages of a machine learning pipeline: from cleaning poor-quality samples and identifying important samples to collect during data preparation, to calibrating the uncertainty of model predictions, to interpreting why certain model behaviors emerge during deployment. In this paper, we present a framework, Data2Model, for predicting the output model of a learning algorithm given its input data points. Specifically, Data2Model learns a parameterized function that takes a dataset $S$ as input and predicts the model obtained by training on $S$. Despite the potential complexity of the underlying end-to-end training process being approximated, we show that a neural network-based set function class can successfully predict the trained model from its training data. We introduce novel global and local regularization techniques for preventing overfitting and rigorously characterize the expressive power of neural networks (NNs) in approximating the end-to-end training process. We perform extensive empirical investigations and demonstrate that Data2Model gives rise to a wide range of applications that boost the interpretability and accountability of machine learning (ML), such as data valuation, data selection, memorization quantification, and model calibration.
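To make the core idea concrete, the following is a minimal sketch of what a permutation-invariant set function mapping a dataset to model parameters could look like, in the style of a DeepSets architecture: each example is embedded independently, the embeddings are pooled, and a readout maps the pooled representation to a predicted parameter vector. All dimensions, weights, and function names here are hypothetical illustrations, not the paper's actual implementation, and the weights are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: each training example is encoded as a d-dim
# vector (e.g., features concatenated with a label encoding); the target
# "model" is a p-dim parameter vector.
d, h, p = 8, 16, 4

# Randomly initialized weights of the two sub-networks; in Data2Model
# these would be trained to mimic the end-to-end learning algorithm.
W_phi = rng.normal(size=(d, h))  # per-example embedding network
W_rho = rng.normal(size=(h, p))  # readout to predicted parameters

def predict_model(S):
    """Set function: embed each example, mean-pool, then read out
    a predicted parameter vector for the model trained on S."""
    z = np.tanh(S @ W_phi)   # phi: applied to each example independently
    pooled = z.mean(axis=0)  # permutation-invariant aggregation over S
    return pooled @ W_rho    # rho: pooled embedding -> model parameters

S = rng.normal(size=(32, d))  # a toy "dataset" of 32 encoded examples
theta_hat = predict_model(S)  # predicted parameters, shape (p,)
```

Because the aggregation is a mean over examples, the prediction is invariant to the ordering of the dataset, which is the structural property that makes a set function class a natural fit for approximating a training process.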