In classical statistics and distribution testing, it is often assumed that elements can be sampled from some distribution $P$, and that when an element $x$ is sampled, the probability $P(x)$ of sampling $x$ is also known. Recent work in distribution testing has shown that many algorithms are robust in the sense that they still produce correct output if the elements are drawn from any distribution $Q$ that is sufficiently close to $P$. This phenomenon raises interesting questions: under what conditions is a "noisy" distribution $Q$ sufficient, and what is the algorithmic cost of coping with this noise? We investigate these questions for the problem of estimating the sum of a multiset of $N$ real values $x_1, \ldots, x_N$. This problem is well studied in the statistical literature in the case $P = Q$, where the Hansen-Hurwitz estimator is frequently used. We assume that for some known distribution $P$, values are sampled from a distribution $Q$ that is pointwise $\gamma$-close to $P$. For every positive integer $k$ we define an estimator $\zeta_k$ for $\mu = \sum_i x_i$ whose bias is proportional to $\gamma^k$ (our $\zeta_1$ reduces to the classical Hansen-Hurwitz estimator). As a special case, we show that if $Q$ is pointwise $\gamma$-close to uniform and all $x_i \in \{0, 1\}$, then for any $\epsilon > 0$ we can estimate $\mu$ to within additive error $\epsilon N$ using $m = \Theta(N^{1-\frac{1}{k}} / \epsilon^{2/k})$ samples, where $k = \left\lceil (\log \epsilon)/(\log \gamma)\right\rceil$. We show that this sample complexity is essentially optimal. Our bounds show that the sample complexity need not vary uniformly with the desired error parameter $\epsilon$: for some values of $\epsilon$, perturbations in its value have no asymptotic effect on the sample complexity, while for other values, any decrease in its value results in an asymptotically larger sample complexity.
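To make the setting concrete, here is a minimal Python sketch of the classical Hansen-Hurwitz estimator (the $k = 1$ case mentioned above) and of the bias it picks up when samples actually come from a noisy $Q$ while the weights use the known $P$. The function names, the toy values, and the particular $Q$ are illustrative assumptions, as is the reading of "pointwise $\gamma$-close" as $|Q(i) - P(i)| \le \gamma P(i)$, which the abstract does not spell out. The higher-order estimators $\zeta_k$ for $k \ge 2$ are not defined in the abstract and are not reproduced here.

```python
import random


def hansen_hurwitz(samples, x, p):
    """Hansen-Hurwitz estimate of sum(x): average of x[i] / p[i] over sampled indices."""
    return sum(x[i] / p[i] for i in samples) / len(samples)


def expected_estimate(x, p, q):
    """Exact expectation of the estimator when indices are drawn from q but weighted by p."""
    return sum(q[i] * x[i] / p[i] for i in range(len(x)))


def demo(seed=0, m=20000):
    """Toy run: P is uniform, Q is a hypothetical gamma-close perturbation of P."""
    rng = random.Random(seed)
    x = [1, 0, 1, 1, 0, 1, 0, 0]          # values in {0, 1}; true sum is 4
    n = len(x)
    gamma = 0.2
    p = [1.0 / n] * n                      # known distribution P (uniform)
    # Assumed noise model: first half over-sampled, second half under-sampled.
    q = [(1 + gamma) / n] * (n // 2) + [(1 - gamma) / n] * (n // 2)
    samples = rng.choices(range(n), weights=q, k=m)
    return sum(x), hansen_hurwitz(samples, x, p), expected_estimate(x, p, q)


if __name__ == "__main__":
    true_sum, estimate, expectation = demo()
    print("true sum:", true_sum)
    print("empirical estimate:", estimate)
    print("exact expectation under Q:", expectation)
```

When $Q = P$ the expectation equals $\mu$ exactly, so the estimator is unbiased. In this toy instance the expectation under the noisy $Q$ moves from 4 to roughly 4.4, a shift of at most $\gamma \mu$, and drawing more samples does not shrink it. That matches the abstract's description of a bias proportional to $\gamma^k$ with $k = 1$.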