Estimating the empirical distribution of a scalar-valued data set is a fundamental task. In this paper, we tackle the problem of estimating an empirical distribution in a setting with two challenging features. First, the algorithm does not directly observe the data; instead, it may only ask a limited number of threshold queries about each sample. Second, the data are not assumed to be independent and identically distributed; instead, we allow an arbitrary process to generate the samples, including an adaptive adversary. These considerations are relevant, for example, when modeling a seller who experiments with posted prices to estimate the distribution of consumers' willingness to pay for a product: offering a price and observing a consumer's purchase decision is equivalent to asking a single threshold query about their value, and the distribution of consumers' values may be non-stationary over time, as early adopters may differ markedly from late adopters. Our main result quantifies, to within a constant factor, the sample complexity of estimating the empirical CDF of a sequence of elements of $[n]$, up to $\varepsilon$ additive error, using one threshold query per sample. The complexity depends only logarithmically on $n$, and our result can be interpreted as extending the existing logarithmic-complexity results for noisy binary search to the more challenging setting where the noise is non-stochastic. Along the way to designing our algorithm, we consider a more general model in which the algorithm is allowed to make a limited number of simultaneous threshold queries on each sample. We solve this problem using Blackwell's Approachability Theorem and the exponential weights method. As a side result of independent interest, we characterize the minimum number of simultaneous threshold queries required by deterministic CDF estimation algorithms.
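To make the query model concrete, the following is a minimal sketch of a naive baseline, not the algorithm of this paper. It asks one uniformly random threshold query per sample and importance-weights the answers. The function names (`threshold_query`, `empirical_cdf`, `naive_cdf_estimate`) are hypothetical, and the sketch assumes the sample sequence is fixed in advance, so it does not model an adaptive adversary. The variance of this baseline at a fixed threshold is at most about $n/T$ for $T$ samples, so it needs a number of samples that grows roughly linearly in $n$, in contrast with the logarithmic dependence stated above.

```python
import random


def threshold_query(x, t):
    """Answer one threshold query: is the hidden sample x at most t?"""
    return x <= t


def empirical_cdf(samples, n):
    """Exact empirical CDF over [n] = {1, ..., n}: F(t) = |{i : x_i <= t}| / T."""
    total = len(samples)
    return [sum(1 for x in samples if x <= t) / total for t in range(1, n + 1)]


def naive_cdf_estimate(samples, n, rng):
    """Naive baseline (an assumption for illustration, NOT the paper's method).

    Each sample is probed at a single threshold drawn uniformly from 1..n.
    The "yes" answers recorded at threshold t, scaled by n / T, give an
    unbiased estimate of F(t). The estimate is not forced to be monotone
    and may exceed 1.
    """
    total = len(samples)
    yes_counts = [0] * n
    for x in samples:
        t = rng.randint(1, n)  # the only threshold asked about this sample
        if threshold_query(x, t):
            yes_counts[t - 1] += 1
    # Scale by n because each threshold is chosen with probability 1 / n.
    return [n * count / total for count in yes_counts]
```

Calling `naive_cdf_estimate(samples, n, random.Random(0))` and comparing the result entrywise with `empirical_cdf(samples, n)` gives the additive error that the $\varepsilon$ guarantee refers to.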