We consider the estimation of densities in multiple subpopulations, where the available sample size varies greatly across subpopulations. This problem occurs in epidemiology, for example, where different diseases may share similar pathogenic mechanisms but differ in their prevalence. Without specifying a parametric form, our proposed method pools information from the population and estimates the density in each subpopulation in a data-driven fashion. Drawing from functional data analysis, we construct low-dimensional approximating density families, in the form of exponential families, from the principal modes of variation in the log-densities. Subpopulation densities are then fitted within the approximating families based on likelihood principles and shrinkage. The approximating families become more flexible as the number of components increases and can approximate arbitrary infinite-dimensional densities. We also derive convergence results for the density estimates based on discrete observations. The proposed methods are shown to be interpretable and efficient in simulations as well as in applications to electronic medical record and rainfall data.
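As a rough illustration of the construction described above, the following Python sketch builds a low-dimensional exponential family from the principal modes of pooled log-densities and then fits a small subpopulation sample by penalized maximum likelihood. It is not the authors' implementation. The function names (`build_family`, `density`, `fit_theta`, `objective`, `integrate`), the ridge penalty standing in for the shrinkage step, the damped Newton solver, and the simulated Gaussian log-densities are all assumptions made for illustration.

```python
import numpy as np

def integrate(y, x):
    """Trapezoidal rule along the last axis."""
    return np.sum((y[..., 1:] + y[..., :-1]) * np.diff(x), axis=-1) / 2.0

def build_family(log_dens, grid, K):
    """Mean log-density and first K principal modes of pooled log-densities.

    log_dens has shape (n_subpopulations, n_grid).
    """
    mu = log_dens.mean(axis=0)
    _, _, vt = np.linalg.svd(log_dens - mu, full_matrices=False)
    phi = vt[:K]
    # Rescale each mode to unit L2 norm as a function on the grid.
    phi = phi / np.sqrt(integrate(phi**2, grid))[:, None]
    return mu, phi

def density(theta, mu, phi, grid):
    """Family member: f(x) proportional to exp(mu(x) + sum_k theta_k phi_k(x))."""
    g = mu + theta @ phi
    f = np.exp(g - g.max())  # subtract the max for numerical stability
    return f / integrate(f, grid)

def objective(theta, t_bar, mu, phi, grid, lam):
    """Mean log-likelihood (up to a theta-free constant) minus a ridge penalty."""
    g = mu + theta @ phi
    log_norm = g.max() + np.log(integrate(np.exp(g - g.max()), grid))
    return theta @ t_bar - log_norm - 0.5 * lam * (theta @ theta)

def fit_theta(sample, mu, phi, grid, lam=0.1, n_iter=50):
    """Penalized maximum likelihood via damped Newton steps (assumed solver)."""
    # Sufficient statistics: sample means of the modes, by linear interpolation.
    t_bar = np.array([np.interp(sample, grid, p).mean() for p in phi])
    theta = np.zeros(len(phi))
    for _ in range(n_iter):
        f = density(theta, mu, phi, grid)
        m = integrate(phi * f, grid)  # E_theta[phi]
        second = integrate(phi[:, None, :] * phi[None, :, :] * f, grid)
        hess = second - np.outer(m, m) + lam * np.eye(len(phi))
        step = np.linalg.solve(hess, t_bar - m - lam * theta)
        base, scale = objective(theta, t_bar, mu, phi, grid, lam), 1.0
        # Halve the step until the objective no longer decreases.
        while scale > 1e-8 and objective(theta + scale * step, t_bar, mu, phi, grid, lam) < base:
            scale /= 2.0
        if scale <= 1e-8:
            break
        theta = theta + scale * step
    return theta

# Simulated stand-in for well-sampled subpopulations: unit-variance Gaussian
# log-densities with different means (an assumption, not data from the paper).
rng = np.random.default_rng(0)
grid = np.linspace(-6.0, 6.0, 241)
means = rng.normal(0.0, 1.0, size=30)
log_dens = -0.5 * (grid[None, :] - means[:, None]) ** 2 - 0.5 * np.log(2 * np.pi)

mu, phi = build_family(log_dens, grid, K=1)
sample = rng.normal(1.0, 1.0, size=15)  # a small subpopulation sample
theta_hat = fit_theta(sample, mu, phi, grid)
f_hat = density(theta_hat, mu, phi, grid)
```

In this toy setting the centered log-densities are affine in x, so a single component already spans the unit-variance Gaussian location family on the grid. With real data, the pooled log-densities would themselves have to be estimated first, and the number of components K and the penalty weight `lam` would need to be chosen from the data.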