Privacy-preserving Inference of Group Mean Difference in Zero-inflated Right Skewed Data with Partitioning and Censoring

We examine privacy-preserving inference of group mean differences in zero-inflated right-skewed (zirs) data. Zero inflation and right skewness are typical characteristics of ad click and purchase data collected from e-commerce and social media platforms, where individual user data must also be protected. In this work, we develop likelihood-based and model-free approaches to analyzing zirs data with formal privacy guarantees. We first apply partitioning and censoring (PAC) to "regularize" the zirs data, yielding what we call the PAC data. We expect inferences based on the PAC data to have better inferential properties and more robust privacy protection than inferences drawn directly from the raw data. We conduct theoretical analysis to establish the MSE consistency of the privacy-preserving estimators obtained from the proposed approaches on the PAC data, and we examine their rate of convergence in the number of partitions and the privacy loss parameters. The theoretical results also suggest that the sampling error of the PAC data, rather than the sanitization error, is the limiting factor in the convergence rate. We conduct extensive simulation studies to compare the inferential utility of the proposed approaches across different types of zirs data, sample size and partition size combinations, censoring scenarios, mean differences, privacy budgets, and privacy loss composition schemes. We also apply the methods to obtain privacy-preserving inference for the group mean difference in a real digital ads click-through data set. Based on the theoretical and empirical results, we make recommendations regarding the usage of these methods in practice.
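To make the partition-then-censor idea concrete, the following is a minimal illustrative sketch and not the estimators developed in this work. It assumes one simple reading of PAC: records are randomly split into partitions, each partition mean is censored to a bounded interval, and the average of the censored means is sanitized with the Laplace mechanism. The function names `pac_private_mean` and `pac_private_mean_diff`, the censoring bound `upper`, the even split of the privacy budget between groups, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def pac_private_mean(x, n_partitions, upper, epsilon, rng):
    """Hypothetical sketch: partition, censor partition means, add Laplace noise.

    Assumes 1 <= n_partitions <= len(x) and non-negative data.
    """
    x = np.asarray(x, dtype=float)
    # Partition: shuffle the records and split them into near-equal groups.
    parts = np.array_split(rng.permutation(x), n_partitions)
    # Censor: clip each partition mean to [0, upper] so that it is bounded.
    means = np.clip([p.mean() for p in parts], 0.0, upper)
    # Replacing one record changes one censored partition mean by at most
    # `upper`, so the average of the means changes by at most upper / n_partitions.
    sensitivity = upper / n_partitions
    return means.mean() + rng.laplace(scale=sensitivity / epsilon)


def pac_private_mean_diff(x1, x2, n_partitions, upper, epsilon, rng):
    """Difference of two sanitized group means.

    Assumption: the budget is split evenly between the two groups
    (a conservative, sequential-composition style allocation).
    """
    m1 = pac_private_mean(x1, n_partitions, upper, epsilon / 2, rng)
    m2 = pac_private_mean(x2, n_partitions, upper, epsilon / 2, rng)
    return m1 - m2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 10_000
    # Simulated zirs data (assumed form): a point mass at zero mixed with
    # a log-normal positive part.
    g1 = rng.binomial(1, 0.30, n) * rng.lognormal(0.0, 1.0, n)
    g2 = rng.binomial(1, 0.25, n) * rng.lognormal(0.0, 1.0, n)
    est = pac_private_mean_diff(g1, g2, n_partitions=100, upper=2.0,
                                epsilon=1.0, rng=rng)
    print("sanitized mean difference:", est)
    print("raw mean difference:      ", g1.mean() - g2.mean())
```

In this sketch the Laplace noise scale shrinks as the number of partitions grows, while each partition mean becomes noisier because it averages fewer records. That trade-off is only meant to illustrate why the number of partitions and the privacy loss parameters both appear in a convergence analysis of this kind.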
