Frequency estimation in data streams is one of the classical problems in streaming algorithms. Following much research, there are now almost matching upper and lower bounds for the trade-off needed between the number of samples and the space complexity of the algorithm, when the data streams are adversarial. However, in the case where the data stream is given in a random order, or is stochastic, only weaker lower bounds exist. In this work we close this gap, up to logarithmic factors. In order to do so we consider the needle problem, which is a natural hard problem for frequency estimation studied in (Andoni et al. 2008, Crouch et al. 2016). Here, the goal is to distinguish between two distributions over data streams with $t$ samples. The first is uniform over a large enough domain. The second is a planted model; a secret ''needle'' is uniformly chosen, and then each element in the stream equals the needle with probability $p$, and otherwise is uniformly chosen from the domain. It is simple to design streaming algorithms that distinguish the distributions using space $s \approx 1/(p^2 t)$. It was unclear if this is tight, as the existing lower bounds are weaker. We close this gap and show that the trade-off is near optimal, up to a logarithmic factor. Our proof builds and extends classical connections between streaming algorithms and communication complexity, concretely multi-party unique set-disjointness. We introduce two new ingredients that allow us to prove sharp bounds. The first is a lower bound for an asymmetric version of multi-party unique set-disjointness, where players receive input sets of different sizes, and where the communication of each player is normalized relative to their input length. The second is a combinatorial technique that allows to sample needles in the planted model by first sampling intervals, and then sampling a uniform needle in each interval.
翻译:数据流的频率估计是流动算法的经典问题之一。 经过许多研究, 现在在数据流为对立时, 数据流的样本数量和算法空间复杂性之间, 几乎可以匹配所需的交易量的上下界限。 但是, 在数据流按随机顺序排列的数据流时, 或者是随机的, 只有较弱的下界存在。 在这项工作中, 我们缩小了这个差距, 最高是对流系数。 为了这样, 我们考虑针头的问题, 这是在( Antoni 等人, 2008, Crouch 等人, 2016) 所研究的频率估算中一个自然的正常问题。 在这里, 目标是将数据流的分布与数据流之间的比例区分为两个, 以美元为基数为基数。 在一个配置模型中, 秘密的“ 需求”, 然后将每一个元素与针针数相匹配, 以美元为基数, 以其他方式从域中统一选择。 设计一个简单的算法, 使用空间 $ $ etro, 1/ cload 1/ (p) rideal rial rial rial ride) 的计算, rideal rideal 。 rideal deliver 立为一种算法 。