Given a data stream $\mathcal{D} = \langle a_1, a_2, \ldots, a_m \rangle$ of $m$ elements where each $a_i \in [n]$, the Distinct Elements problem is to estimate the number of distinct elements in $\mathcal{D}$. Distinct Elements has been a subject of theoretical and empirical investigations over the past four decades resulting in space optimal algorithms for it. All the current state-of-the-art algorithms are, however, beyond the reach of an undergraduate textbook owing to their reliance on the usage of notions such as pairwise independence and universal hash functions. We present a simple, intuitive, sampling-based space-efficient algorithm whose description and the proof are accessible to undergraduates with the knowledge of basic probability theory.
翻译:鉴于数据流 $\ mathcal{D} =\ langle a_ 1, a_ 2,\ ldots, a_m\rangle $1美元元元元,其中每个$_i\ in[n]美元,不同的元素问题在于估计不同元素的数量,以$mathcal{D}$计算。不同的元素在过去四十年中一直是理论和经验调查的主题,由此形成了空间的最佳算法。然而,目前所有最先进的算法都超出了本科本科教科书的可及范围,因为这些算法依赖于对称独立和通用散列函数等概念的使用。我们提出了一个简单、直观、基于抽样的空间效率算法,其描述和证据可以让具有基本概率理论知识的本科生使用。