Given $n$ samples from a population of individuals belonging to different species, what is the number $U$ of hitherto unseen species that would be observed if $\lambda n$ new samples were collected? This is an important problem in many scientific endeavors, and it has been the subject of recent works introducing non-parametric estimators of $U$ that are minimax near-optimal and consistent all the way up to $\lambda\asymp\log n$. These works do not rely on any assumption on the underlying unknown distribution $p$ of the population; while this yields a theory in its greatest generality, worst-case distributions may severely hamper the estimation of $U$ in concrete applications. In this paper, we consider the problem of strengthening the non-parametric framework for estimating $U$. Inspired by the estimation of rare probabilities in extreme value theory, and motivated by the ubiquity of power-law type distributions in many natural and social phenomena, we make use of a semi-parametric assumption of regular variation of index $\alpha\in(0,1)$ for the tail behaviour of $p$. Under this assumption, we introduce an estimator of $U$ that is simple, linear in the sampling information, computationally efficient, and scalable to massive datasets. Then, uniformly over our class of regularly varying tail distributions, we show that the proposed estimator has provable guarantees: i) it is minimax near-optimal, up to a factor that is a power of $\log n$; ii) it is consistent all the way up to $\log \lambda\asymp n^{\alpha/2}/\sqrt{\log n}$, and this range is the best possible. This work presents the first study on the estimation of the unseen under regularly varying tail distributions. A numerical illustration of our methodology is presented on synthetic and real data.
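To make the target quantity $U$ concrete, the following is a minimal simulation sketch, and not the estimator proposed in the paper. It draws $n$ samples and then $\lambda n$ new samples from a synthetic population, and counts the species that appear only in the new samples. The function names, the truncation to a finite number of species, and the choice $p_j \propto j^{-1/\alpha}$ (a standard example of a distribution whose tail is regularly varying of index $\alpha$) are assumptions made purely for illustration.

```python
import random


def power_law_weights(num_species, alpha):
    """Unnormalized weights p_j proportional to j^(-1/alpha), j = 1..num_species.

    Assumption: truncating to finitely many species only approximates
    a regularly varying tail of index alpha in (0, 1).
    """
    return [j ** (-1.0 / alpha) for j in range(1, num_species + 1)]


def count_unseen(first_sample, new_sample):
    """Number U of distinct species in new_sample that are absent from first_sample."""
    return len(set(new_sample) - set(first_sample))


def simulate_unseen(n, lam, alpha, num_species=100_000, seed=0):
    """Draw n samples, then int(lam * n) new samples, and return the realized U."""
    rng = random.Random(seed)
    species = range(num_species)
    weights = power_law_weights(num_species, alpha)
    first = rng.choices(species, weights=weights, k=n)
    new = rng.choices(species, weights=weights, k=int(lam * n))
    return count_unseen(first, new)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the call.
    print(simulate_unseen(n=1_000, lam=5, alpha=0.5))
```

In a simulation like this the value of $U$ is observable because the new samples are actually drawn; the estimation problem in the paper is to predict $U$ from the first $n$ samples alone.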