We introduce a novel approach, based on statistical significance, for clustering hierarchical data using semi-parametric linear mixed-effects models designed for responses whose distributions belong to the exponential family (e.g., Poisson and Bernoulli). Within the family of semi-parametric mixed-effects models, a latent clustering structure of the highest-level units can be identified by assuming that the random effects follow a discrete distribution with an unknown number of support points. We identify this structure by computing α-level confidence regions for the estimated support points and treating as distinct clusters only those support points that are statistically different. At each iteration of a tailored Expectation-Maximization algorithm, the two closest estimated support points whose confidence regions overlap are merged into one. Unlike related state-of-the-art methods, which rely on arbitrary thresholds to decide when close discrete masses should be merged, the proposed approach relies on conventional statistical confidence levels, thereby avoiding discretionary tuning parameters. To demonstrate the effectiveness of our approach, we apply it to data from the Programme for International Student Assessment (PISA - OECD) to cluster countries based on the rate of innumeracy in schools. Additionally, we provide and discuss a simulation study and a comparison with classical parametric and state-of-the-art models.
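The merging rule described above can be illustrated with a minimal one-dimensional sketch. This is not the authors' implementation: it assumes scalar support points, Wald-type normal confidence intervals built from given standard errors, and a weight-averaged location for the merged point. The function names `confidence_interval` and `merge_closest_overlapping` are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def confidence_interval(point, std_err, alpha=0.05):
    """Two-sided (1 - alpha) normal interval around one support point (assumed Wald-type)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return point - z * std_err, point + z * std_err


def merge_closest_overlapping(points, std_errs, weights, alpha=0.05):
    """Perform at most one merge: collapse the closest pair whose intervals overlap.

    Returns (new_points, new_weights, merged_flag). Standard errors are not
    returned because, in an EM setting, they would be re-estimated afterwards.
    """
    intervals = [confidence_interval(p, s, alpha) for p, s in zip(points, std_errs)]
    best = None  # (distance, i, j) of the closest overlapping pair found so far
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            lo_i, hi_i = intervals[i]
            lo_j, hi_j = intervals[j]
            overlap = lo_i <= hi_j and lo_j <= hi_i
            dist = abs(points[i] - points[j])
            if overlap and (best is None or dist < best[0]):
                best = (dist, i, j)
    if best is None:
        # No overlapping pair: all support points are treated as distinct clusters.
        return list(points), list(weights), False
    _, i, j = best
    total = weights[i] + weights[j]
    # Assumption: the merged location is the weight-averaged location of the pair.
    merged_point = (weights[i] * points[i] + weights[j] * points[j]) / total
    keep = [k for k in range(len(points)) if k not in (i, j)]
    new_points = [points[k] for k in keep] + [merged_point]
    new_weights = [weights[k] for k in keep] + [total]
    return new_points, new_weights, True


if __name__ == "__main__":
    pts, wts, merged = merge_closest_overlapping(
        [0.0, 0.1, 2.0], [0.1, 0.1, 0.1], [0.3, 0.3, 0.4]
    )
    print(pts, wts, merged)
```

In this toy input the first two support points have overlapping 95% intervals and are merged, while the third stays separate. In a full procedure this step would presumably be repeated inside the EM iterations, with support points, weights, and standard errors re-estimated after each merge, until no confidence regions overlap.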