串流中的独特元素: (文本) 书中的比值( 文本) 书中的比值 (Distinct Elements in Streams: An Algorithm for the (Text) Book)

from arxiv, The earlier version of the paper, as published at ESA-22, contained error in the proof of Claim 4. The current revised version fixes the error in the proof. The authors decided to forgo the old convention of alphabetical ordering of authors in favor of a randomized ordering, denoted by \textcircled{r}

Given a data stream $\mathcal{D} = \langle a_1, a_2, \ldots, a_m \rangle$ of $m$ elements where each $a_i \in [n]$, the Distinct Elements problem is to estimate the number of distinct elements in $\mathcal{D}$. Distinct Elements has been a subject of theoretical and empirical investigations over the past four decades resulting in space optimal algorithms for it. All the current state-of-the-art algorithms are, however, beyond the reach of an undergraduate textbook owing to their reliance on the usage of notions such as pairwise independence and universal hash functions. We present a simple, intuitive, sampling-based space-efficient algorithm whose description and the proof are accessible to undergraduates with the knowledge of basic probability theory.

翻译：鉴于数据流 $\ mathcal{D} =\ langle a_ 1, a_ 2,\ ldots, a_m\rangle $1美元元元元,其中每个$_i\ in[n]美元,不同的元素问题在于估计不同元素的数量,以$mathcal{D}$计算。不同的元素在过去四十年中一直是理论和经验调查的主题,由此形成了空间的最佳算法。然而,目前所有最先进的算法都超出了本科本科教科书的可及范围,因为这些算法依赖于对称独立和通用散列函数等概念的使用。我们提出了一个简单、直观、基于抽样的空间效率算法,其描述和证据可以让具有基本概率理论知识的本科生使用。