Understanding the shape of a distribution of data is of interest to people in a great variety of fields, as it may affect the types of algorithms used for that data. We study one such problem in the framework of distribution property testing, characterizing the number of samples required to to distinguish whether a distribution has a certain property or is far from having that property. In particular, given samples from a distribution, we seek to characterize the tail of the distribution, that is, understand how many elements appear infrequently. We develop an algorithm based on a careful bucketing scheme that distinguishes light-tailed distributions from non-light-tailed ones with respect to a definition based on the hazard rate, under natural smoothness and ordering assumptions. We bound the number of samples required for this test to succeed with high probability in terms of the parameters of the problem, showing that it is polynomial in these parameters. Further, we prove a hardness result that implies that this problem cannot be solved without any assumptions.
翻译:了解数据分布的形状对于众多领域的人来说是有意义的,因为这可能会影响数据所使用的算法类型。我们在分配属性测试的框架内研究一个这样的问题,确定为区分分配是否具有某种属性或远非该属性而需要的样本数量。特别是,根据分布的样本,我们试图描述分布的尾部,即了解有多少元素不经常出现。我们根据一种谨慎的桶装计划制定了一种算法,根据自然平稳和定购假设,区分基于危险率的定义的轻量散发与非浅量的分布。我们将这项测试所需的样本数量按问题参数的高度概率加以限制,表明其在这些参数中是多元的。此外,我们证明一个硬性的结果意味着这个问题在没有任何假设的情况下是无法解决的。