Random forests are a popular method for classification and regression due to their versatility. However, this flexibility can come at the cost of user privacy, since training a random forest requires many data queries, often on small, identifiable subsets of the training data. Privatizing these queries typically incurs a high utility cost, in large part because queries on small subsets of the data are easily corrupted by added noise. In this paper, we propose DiPriMe forests, a novel tree-based ensemble method for differentially private regression and classification that accommodates real-valued or categorical covariates. We generate splits using a differentially private version of the median, which encourages balanced leaf nodes. By avoiding low-occupancy leaf nodes, we avoid low signal-to-noise ratios when privatizing the leaf-node sufficient statistics. We show theoretically and empirically that the resulting algorithm achieves high utility while ensuring differential privacy.
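To illustrate the kind of primitive the abstract refers to, here is a minimal sketch of a differentially private median computed with the exponential mechanism, a standard construction in the DP literature. This is an assumption for illustration only, not necessarily the exact mechanism used in DiPriMe forests: we score each data point by how close its rank is to the middle of the sorted sample (a utility with sensitivity 1 under add/remove of one record) and sample a candidate with probability proportional to `exp(epsilon * utility / 2)`.

```python
import numpy as np

def dp_median(x, epsilon, rng=None):
    """Differentially private median via the exponential mechanism.

    Utility of the i-th order statistic is the negative distance of its
    rank from the median rank (n - 1) / 2; this utility has sensitivity 1
    when one record is added or removed.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    ranks = np.arange(n)
    utility = -np.abs(ranks - (n - 1) / 2)
    # Exponential mechanism: P(i) proportional to exp(eps * u(i) / (2 * sensitivity))
    logits = epsilon * utility / 2.0
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    i = rng.choice(n, p=probs)
    return x[i]
```

With a large privacy budget the output concentrates on the true sample median; with a small budget it spreads over nearby order statistics, which is exactly the noise/utility trade-off the abstract discusses for split selection.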