Polynomial magic! Hermite polynomials for private data generation

Kernel mean embedding is a useful tool for comparing probability measures. However, it relies on infinite-dimensional features, which are challenging to handle in the context of differentially private data generation. A recent work proposes approximating the kernel mean embedding of the data distribution with finite-dimensional random features, which makes the sensitivity of the features analytically tractable. More importantly, this approach significantly reduces the privacy cost compared to other known privatization methods (e.g., DP-SGD), because the approximate kernel mean embedding of the data distribution is privatized only once and can then be reused throughout the training of a generator without incurring any further privacy cost. However, the required number of random features is excessively high, often ten thousand to a hundred thousand, which worsens the sensitivity of the approximate kernel mean embedding. To improve the sensitivity, we propose replacing random features with Hermite polynomial features. Unlike random features, Hermite polynomial features are ordered: the features at low orders contain more information about the distribution than those at high orders. Hence, a relatively low order of Hermite polynomial features can approximate the mean embedding of the data distribution more accurately than a significantly higher number of random features. As a result, using Hermite polynomial features, we significantly improve the privacy-accuracy trade-off, as reflected in the high quality and diversity of the generated data when tested on several heterogeneous tabular datasets and several image benchmark datasets.
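The abstract does not spell out how the Hermite features are constructed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes one-dimensional inputs and uses Mehler's formula, which expands a Gaussian-type kernel as a series in probabilists' Hermite polynomials with weights `rho**c / c!` that shrink with the order `c`. It further assumes data clipped to `[-bound, bound]`, neighbouring datasets that differ by replacing one record, and the classic Gaussian mechanism (valid for `0 < epsilon < 1`). The names `hermite_features`, `mehler_kernel`, `private_mean_embedding`, `rho`, and `bound` are hypothetical and chosen for illustration.

```python
import math

import numpy as np
from numpy.polynomial.hermite_e import hermevander


def hermite_features(x, order, rho):
    """Map 1-D inputs to weighted probabilists' Hermite features.

    Feature c is sqrt(rho**c / c!) * He_c(x), so the inner product of two
    feature vectors equals the Mehler series truncated at `order`.
    Assumes 0 < rho < 1.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    weights = np.array(
        [math.sqrt(rho**c / math.factorial(c)) for c in range(order + 1)]
    )
    # hermevander returns He_0..He_order evaluated at each x: shape (n, order + 1)
    return hermevander(x, order) * weights


def mehler_kernel(x, y, rho):
    """Closed form of the infinite Mehler series (requires |rho| < 1)."""
    expo = -(rho**2 * (x**2 + y**2) - 2 * rho * x * y) / (2 * (1 - rho**2))
    return np.exp(expo) / math.sqrt(1 - rho**2)


def private_mean_embedding(data, order, rho, bound, epsilon, delta, rng):
    """Release a noisy mean embedding once via the Gaussian mechanism.

    Assumption: replace-one neighbouring datasets and 0 < epsilon < 1.
    """
    clipped = np.clip(np.asarray(data, dtype=float), -bound, bound)
    mean_emb = hermite_features(clipped, order, rho).mean(axis=0)
    # Every series term is non-negative, so the truncated squared norm is at
    # most k(x, x), which is at most k(bound, bound) for |x| <= bound.
    feat_norm = math.sqrt(mehler_kernel(bound, bound, rho))
    sensitivity = 2.0 * feat_norm / len(clipped)
    sigma = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
    noisy = mean_emb + rng.normal(0.0, sigma, size=mean_emb.shape)
    return noisy, sensitivity


# Usage: a low truncation order already tracks the closed-form kernel closely.
x = np.array([0.7, -1.2])
y = np.array([-0.3, 0.4])
approx = (hermite_features(x, 20, 0.5) * hermite_features(y, 20, 0.5)).sum(axis=1)
exact = mehler_kernel(x, y, 0.5)
print("max truncation error:", np.max(np.abs(approx - exact)))

rng = np.random.default_rng(0)
data = rng.normal(size=1000)
noisy_emb, sens = private_mean_embedding(
    data, order=20, rho=0.5, bound=3.0, epsilon=0.5, delta=1e-5, rng=rng
)
print("embedding length:", noisy_emb.shape[0], "sensitivity:", sens)
```

In this sketch the sensitivity depends on the clipping bound and the dataset size rather than on a random draw of features, and the noisy embedding is computed once, so a generator could be trained against `noisy_emb` repeatedly without touching the raw data again.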
