Consider a finite sample from an unknown distribution over a countable alphabet. Unobserved events are alphabet symbols that do not appear in the sample. Estimating the probabilities of unobserved events is a basic problem in statistics and related fields, and it has been studied extensively in the context of point estimation. In this work we introduce a novel interval estimation scheme for unobserved events. Our framework applies selective inference to construct confidence intervals (CIs) for the desired set of parameters. Interestingly, we show that the obtained CIs are dimension-free: they do not grow with the alphabet size. Further, we show that these CIs are (almost) tight, in the sense that they cannot be improved further without violating the prescribed coverage rate. We demonstrate the performance of our scheme in synthetic and real-world experiments, showing a significant improvement over the alternatives. Finally, we apply our scheme to large alphabet modeling. We introduce a novel simultaneous CI scheme for large alphabet distributions that outperforms currently known methods while maintaining the prescribed coverage rate.
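To make the problem setting concrete, the following is a minimal Python sketch of two classical baselines. It is not the scheme proposed in this work. The first function is the Good-Turing point estimate of the missing mass (the total probability of all unobserved symbols), computed as the fraction of the sample made up of symbols seen exactly once. The second is the exact one-sided upper bound for a single symbol that was fixed before sampling and then never observed. The function names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def good_turing_missing_mass(sample):
    """Good-Turing point estimate of the total probability of unobserved symbols.

    The estimate is N1 / n, where N1 is the number of distinct symbols
    that appear exactly once and n is the sample size.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / n


def zero_count_upper_bound(n, alpha=0.05):
    """Upper confidence bound for the probability p of ONE symbol fixed in advance.

    A symbol with probability p is missed by n i.i.d. draws with
    probability (1 - p)**n. Solving (1 - p)**n = alpha gives the bound,
    which is roughly 3 / n for alpha = 0.05 (the "rule of three").
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    return 1.0 - alpha ** (1.0 / n)


if __name__ == "__main__":
    sample = list("aabbbcd")  # toy sample; 'c' and 'd' appear once
    print(good_turing_missing_mass(sample))
    print(zero_count_upper_bound(n=100, alpha=0.05))
```

The second bound holds only when the symbol is chosen before the data are seen. Here the unobserved symbols are identified after looking at the sample, so the guarantee does not carry over directly, which is why the abstract frames the task as a selective inference problem.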