Unintended memorization of information at various granularities has attracted academic attention in recent years, e.g., through membership inference and property inference. How to turn this privacy leakage around to benefit real-world applications is a growing research direction; current efforts include dataset ownership inference and user auditing. From the perspective of the data lifecycle and ML model production, we propose an inference process named Data Provenance Inference, which infers the generation, collection, or processing property of the ML training data, to help ML developers locate gaps in the training data without maintaining burdensome metadata. We formally define data provenance and the data provenance inference task in ML training. We then propose a novel inference strategy that combines embedded-space multiple instance classification with shadow learning. Comprehensive evaluations cover language, visual, and structured data in black-box and white-box settings, with diverse kinds of data provenance (i.e., business, county, movie, user). Our best inference accuracy reaches 98.96% on the white-box text model when "author" is the data provenance. The experimental results indicate that, in general, inference performance is positively correlated with the amount of reference data available for inference, and with the depth and the number of parameters of the accessed layer. Furthermore, we give a post-hoc statistical analysis of the data provenance definition to explain when our proposed method works well.