This paper presents a novel collaborative generative modeling (CGM) framework that incentivizes self-interested parties to contribute data to a common pool for training a generative model (e.g., a GAN), from which synthetic data are drawn and distributed to the parties as rewards commensurate with their contributions. Distributing synthetic data as rewards (instead of trained models or money) offers task- and model-agnostic benefits for downstream learning tasks and is less likely to violate data privacy regulations. To realize the framework, we first propose a data valuation function based on maximum mean discrepancy (MMD) that values data by its quantity and by its quality, measured as its closeness to the true data distribution, and we provide theoretical results that guide the choice of kernel in this valuation function. We then formulate the reward scheme as a linear optimization problem whose solution guarantees certain incentives, such as fairness, in the CGM framework. Finally, we devise a weighted sampling algorithm that generates the synthetic data distributed to each party as its reward, such that the combined value of the party's own data and its synthetic data matches the reward value assigned to it by the reward scheme. We empirically show on simulated and real-world datasets that the parties' synthetic data rewards are commensurate with their contributions.
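To make the MMD ingredient of the abstract concrete, the following is a minimal sketch of a standard biased estimator of the squared MMD between two samples under a Gaussian (RBF) kernel. It is not the paper's valuation function, which the abstract does not specify; the kernel choice, the `lengthscale` parameter, the helper names, and the toy Gaussian data are all assumptions made for illustration. It only shows the underlying idea that data closer to a reference distribution yields a smaller MMD.

```python
import math
import random


def rbf_kernel(a, b, lengthscale=1.0):
    """Gaussian (RBF) kernel between two points given as equal-length sequences."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * lengthscale**2))


def mean_kernel(xs, ys, lengthscale=1.0):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, lengthscale) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, lengthscale=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (
        mean_kernel(xs, xs, lengthscale)
        + mean_kernel(ys, ys, lengthscale)
        - 2.0 * mean_kernel(xs, ys, lengthscale)
    )


def sample_gaussian(rng, mean, n, dim=2):
    """Draw n points from an isotropic unit-variance Gaussian centred at `mean`."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


# Toy data (hypothetical): a reference sample standing in for the true
# distribution, one party whose data matches it, and one whose data is shifted.
rng = random.Random(0)
reference = sample_gaussian(rng, 0.0, 200)
close_party = sample_gaussian(rng, 0.0, 100)
far_party = sample_gaussian(rng, 3.0, 100)

mmd_close = mmd_squared(close_party, reference)
mmd_far = mmd_squared(far_party, reference)
print(f"close: {mmd_close:.4f}, far: {mmd_far:.4f}")
```

In this toy setting the party whose data follows the reference distribution obtains the smaller squared MMD, which is the sense in which an MMD-based score can reflect data quality. How the paper turns MMD into a valuation that also accounts for quantity is not stated in the abstract and is not reproduced here.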