\emph{Cardinality Estimation} (aka \emph{Distinct Elements}) is a classic problem in sketching with many industrial applications. Although sketching \emph{algorithms} are fairly simple, analyzing the cardinality \emph{estimators} is notoriously difficult, and even today state-of-the-art sketches such as HyperLogLog and (compressed) \PCSA{} are not covered in graduate-level Big Data courses. In this paper we define a class of \emph{generalized remaining area} (\tGRA) estimators, and observe that HyperLogLog, LogLog, and some estimators for PCSA are merely instantiations of \tGRA{} for various integral values of $\tau$. We then analyze the limiting relative variance of \tGRA{} estimators. It turns out that the standard estimators for HyperLogLog and PCSA can be improved by choosing a \emph{fractional} value of $\tau$. The resulting estimators come \emph{very} close to the Cram\'{e}r-Rao lower bounds for HyperLogLog and PCSA derived from their Fisher information. Although the Cram\'{e}r-Rao lower bound \emph{can} be achieved with the Maximum Likelihood Estimator (MLE), the MLE is cumbersome to compute and to update dynamically. In contrast, \tGRA{} estimators are trivial to update in constant time. Our presentation assumes only basic calculus and probability, and no complex analysis~\cite{FlajoletM85,DurandF03,FlajoletFGM07}.