Linear $L_1$-regularized models have remained one of the simplest and most effective tools in data analysis, especially in information retrieval problems where n-grams over text with TF-IDF or Okapi feature values are a strong and easy baseline. Over the past decade, screening rules have risen in popularity as a way to reduce the runtime of producing the sparse regression weights of $L_1$ models. However, despite the growing need for privacy-preserving models in information retrieval, to the best of our knowledge no differentially private screening rule exists. In this paper, we develop the first differentially private screening rule for linear and logistic regression. In doing so, we uncover difficulties in making a private screening rule that is useful in practice, owing to the amount of noise that must be added to ensure privacy. We provide theoretical arguments and experimental evidence that this difficulty arises from the screening step itself rather than from the private optimizer. Based on our results, we highlight that developing an effective private $L_1$ screening method remains an open problem in the differential privacy literature.