Consider a random sample $(X_{1},\ldots,X_{n})$ from an unknown discrete distribution $P=\sum_{j\geq1}p_{j}\delta_{s_{j}}$ on a countable alphabet $\mathbb{S}$, and let $(Y_{n,j})_{j\geq1}$ be the empirical frequencies of distinct symbols $s_{j}$'s in the sample. We consider the problem of estimating the $r$-order missing mass, which is a discrete functional of $P$ defined as $$\theta_{r}(P;\mathbf{X}_{n})=\sum_{j\geq1}p^{r}_{j}I(Y_{n,j}=0).$$ This is generalization of the missing mass whose estimation is a classical problem in statistics, being the subject of numerous studies both in theory and methods. First, we introduce a nonparametric estimator of $\theta_{r}(P;\mathbf{X}_{n})$ and a corresponding non-asymptotic confidence interval through concentration properties of $\theta_{r}(P;\mathbf{X}_{n})$. Then, we investigate minimax estimation of $\theta_{r}(P;\mathbf{X}_{n})$, which is the main contribution of our work. We show that minimax estimation is not feasible over the class of all discrete distributions on $\mathbb{S}$, and not even for distributions with regularly varying tails, which only guarantee that our estimator is consistent for $\theta_{r}(P;\mathbf{X}_{n})$. This leads to introduce the stronger assumption of second-order regular variation for the tail behaviour of $P$, which is proved to be sufficient for minimax estimation of $\theta_r(P;\mathbf{X}_{n})$, making the proposed estimator an optimal minimax estimator of $\theta_{r}(P;\mathbf{X}_{n})$. Our interest in the $r$-order missing mass arises from forensic statistics, where the estimation of the $2$-order missing mass appears in connection to the estimation of the likelihood ratio $T(P,\mathbf{X}_{n})=\theta_{1}(P;\mathbf{X}_{n})/\theta_{2}(P;\mathbf{X}_{n})$, known as the "fundamental problem of forensic mathematics". We present theoretical guarantees to nonparametric estimation of $T(P,\mathbf{X}_{n})$.
翻译:暂无翻译