Ensuring that analyses performed on a dataset are representative of the entire population is one of the central problems in statistics. Most classical techniques assume that the dataset is independent of the analyst's query, and they break down in the common setting where a dataset is reused for multiple adaptively chosen queries. This problem of \emph{adaptive data analysis} was formalized in the seminal works of Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014). We identify a remarkably simple set of assumptions under which the queries will continue to be representative even when chosen adaptively: the only requirements are that each query takes as input a random subsample and outputs few bits. This result shows that the noise inherent in subsampling is sufficient to guarantee that query responses generalize. The simplicity of this subsampling-based framework allows it to model a variety of real-world scenarios not covered by prior work. Beyond its simplicity, the framework is also useful: we demonstrate this by designing mechanisms for two foundational tasks, statistical queries and median finding. In particular, our mechanism for answering the broadly applicable class of statistical queries is both extremely simple and state of the art in many parameter regimes.
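To make the two requirements concrete, the following is a minimal illustrative sketch, not the mechanism from the paper. It assumes a statistical query is a function `phi` mapping a data point to a value in [0, 1], and it estimates the mean of `phi` by averaging many one-bit responses, each computed from a fresh random subsample. The names `subsampled_query` and `answer_statistical_query`, the subsample size of 1, and the parameter `num_votes` are all hypothetical choices made for illustration.

```python
import random


def subsampled_query(dataset, query, subsample_size, rng):
    """Run `query` on a uniformly random subsample drawn without replacement."""
    subsample = rng.sample(dataset, subsample_size)
    return query(subsample)


def answer_statistical_query(dataset, phi, num_votes, rng):
    """Estimate the mean of phi over the dataset from one-bit subsampled queries.

    Illustrative assumption: each vote looks at a subsample of size 1 and
    returns a single bit obtained by randomized rounding of phi(x).
    """
    def one_bit(subsample):
        x = subsample[0]
        # Randomized rounding: output 1 with probability phi(x), else 0.
        return 1 if rng.random() < phi(x) else 0

    votes = sum(
        subsampled_query(dataset, one_bit, 1, rng) for _ in range(num_votes)
    )
    return votes / num_votes


if __name__ == "__main__":
    rng = random.Random(0)
    data = list(range(100))
    # Hypothetical query: what fraction of points are below 30?
    estimate = answer_statistical_query(
        data, lambda x: 1.0 if x < 30 else 0.0, 2000, rng
    )
    print(estimate)
```

Each individual query here sees only a random subsample and reveals a single bit, which is the structural condition the abstract describes. The choice of how many votes to average and how large each subsample should be is left open in this sketch.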