Bump hunting deals with finding meaningful data subsets, known as bumps, in sample spaces. These have traditionally been conceived as modal or concave regions in the graph of the underlying density function. We define an abstract bump construct based on curvature functionals of the probability density. Then, we explore several alternative characterizations involving derivatives up to second order. In particular, a suitable implementation of Good and Gaskins' original concave bumps is proposed in the multivariate case. Moreover, we bring to exploratory data analysis concepts like the mean curvature and the Laplacian, which have produced good results in applied domains. Our methodology addresses the approximation of the curvature functional with a plug-in kernel density estimator. We provide theoretical results that ensure the asymptotic consistency of bump boundaries in the Hausdorff distance with affordable convergence rates. We also present asymptotically valid and consistent confidence regions bounding curvature bumps. The theory is illustrated through several use cases in sports analytics with datasets from the NBA, MLB and NFL. We conclude that the different curvature instances effectively combine to generate insightful visualizations.
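To make the plug-in idea concrete, the following is a minimal sketch for one of the curvature functionals mentioned above, the Laplacian. It is not the authors' implementation. It assumes a Gaussian kernel with a single isotropic bandwidth `h`, and it assumes that a Laplacian bump is the region where the estimated Laplacian is negative. The function names `kde_laplacian` and `laplacian_bump_mask` are hypothetical, and bandwidth selection is left out entirely.

```python
import numpy as np

def kde_laplacian(points, data, h):
    """Laplacian of a Gaussian kernel density estimate (plug-in estimator).

    Assumptions (not taken from the source): Gaussian kernel, one isotropic
    bandwidth h. `points` has shape (m, d) and `data` has shape (n, d).
    """
    points = np.atleast_2d(np.asarray(points, dtype=float))
    data = np.atleast_2d(np.asarray(data, dtype=float))
    d = data.shape[1]
    # Pairwise differences between evaluation points and observations: (m, n, d)
    diff = points[:, None, :] - data[None, :, :]
    sq = np.sum(diff ** 2, axis=-1)  # squared distances, shape (m, n)
    # Gaussian kernel with covariance h^2 * I evaluated at each difference
    kern = np.exp(-sq / (2.0 * h ** 2)) / (2.0 * np.pi * h ** 2) ** (d / 2.0)
    # Laplacian of that kernel at u is kern(u) * (|u|^2 / h^4 - d / h^2)
    return np.mean(kern * (sq / h ** 4 - d / h ** 2), axis=1)

def laplacian_bump_mask(points, data, h):
    """Flag points in the estimated bump, assumed here to be {x : Laplacian < 0}."""
    return kde_laplacian(points, data, h) < 0.0

# Toy check with a single observation at the origin and h = 1 in two dimensions:
# the estimated Laplacian is negative at the observation and positive farther away.
single = np.array([[0.0, 0.0]])
query = np.array([[0.0, 0.0], [2.0, 0.0]])
values = kde_laplacian(query, single, h=1.0)
mask = laplacian_bump_mask(query, single, h=1.0)

# Simulated sample (illustrative only, not one of the paper's datasets)
rng = np.random.default_rng(0)
sample = rng.normal(size=(2000, 2))
centre_value = kde_laplacian(np.array([[0.0, 0.0]]), sample, h=0.5)
```

In practice the mask would be evaluated on a grid and its boundary drawn as a contour. Other curvature functionals would need the full estimated Hessian rather than just its trace.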