Exploiting the unintended memorization of ML models to benefit real-world applications is a growing research direction, with recent efforts including user auditing, dataset ownership inference, and forgotten-data measurement. From the standpoint of ML model development, we introduce a process named data origin inference to help ML developers locate missing or faulty data origins in the training set without maintaining burdensome metadata. We formally define data origin and the data origin inference task in the development of ML models (mainly neural networks). We then propose a novel inference strategy that combines embedding-space multiple instance classification with shadow training. Our diverse use cases cover language, visual, and structured data, with various kinds of data origin (e.g., business, county, movie, mobile user, text author). A comprehensive performance analysis of the proposed strategy examines the referenced target-model layers, the amount of testing data available per origin, and, within shadow training, the implementations of feature extraction and of the shadow models. Our best inference accuracy reaches 98.96% in the language use case, where the target model is a transformer-based deep neural network. Furthermore, we give a statistical analysis across the different kinds of data origin to investigate which kinds of origin are likely to be inferred correctly.
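To make the proposed pipeline concrete, the following is a minimal sketch of the shadow-training plus embedding-space multiple-instance-classification idea. Everything here is an illustrative assumption rather than the paper's exact method: the embedding shift that stands in for memorization, the mean-pooling bag aggregator, and the logistic-regression inference model are all placeholders.

```python
# Hedged sketch: shadow training + embedding-space multiple instance
# classification for data origin inference. All names, shapes, and the
# synthetic "memorization shift" are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def embed(samples, member):
    # Stand-in for embeddings taken from a referenced target-model layer.
    # Origins seen during training (members) are assumed to produce a
    # shifted embedding distribution due to unintended memorization.
    return samples + (1.0 if member else 0.0)

def bag_feature(embeddings):
    # Multiple-instance step: aggregate one origin's sample embeddings
    # into a single bag-level feature vector (mean pooling here).
    return embeddings.mean(axis=0)

def train_logreg(X, y, lr=0.1, steps=500):
    # Minimal logistic regression serving as the inference model,
    # trained purely on shadow bags with known membership labels.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

# Shadow training: build bags whose membership we control ourselves.
bags, labels = [], []
for member in (0, 1) * 50:
    samples = rng.normal(size=(20, 8))          # 20 samples, 8-dim embeddings
    bags.append(bag_feature(embed(samples, bool(member))))
    labels.append(member)
w, b = train_logreg(np.array(bags), np.array(labels))

# Query: decide whether an unseen origin's data was in the training set.
query = bag_feature(embed(rng.normal(size=(20, 8)), member=True))
is_member = bool(query @ w + b > 0)
print(is_member)
```

The sketch mirrors the abstract's framing: per-origin sample sets become bags in embedding space, and the shadow models supply labeled bags so that an inference classifier can be trained without ground-truth metadata for the target model.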