Maximum likelihood estimates (MLEs) are asymptotically normally distributed, and this property is used in meta-analyses to test the heterogeneity of estimates, either for a single cluster or for several sub-groups. More recently, MLEs for associations between risk factors and diseases have been hierarchically clustered to search for diseases with shared underlying causes, but the approach needs an objective statistical criterion to determine the optimum number and composition of clusters. Conventional statistical tests are briefly reviewed before considering the posterior distribution associated with partitioning data into clusters. The posterior distribution is calculated by marginalising out the unknown cluster centres, and differs from the likelihood associated with mixture models. The calculation is equivalent to that used to obtain the Bayesian Information Criterion (BIC), but is exact, without a Laplace approximation. The result includes a sum-of-squares term, together with terms that depend on the number and composition of clusters and penalise the number of free parameters in the model. The usual BIC is shown to be unsuitable for clustering applications unless the number of items in every cluster is sufficiently large.
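The marginalisation described above can be illustrated with a minimal sketch. It is not the paper's derivation: it assumes univariate estimates `x[i]` with known standard errors `s[i]`, and a Gaussian prior on each cluster centre with hypothetical mean `m0` and standard deviation `tau`. Under those assumptions the integral over the centre has a closed form, made up of a weighted sum-of-squares term (Cochran's Q within the cluster) plus terms that depend on how the items are grouped. The function names and the example data are illustrative only.

```python
import math

def cluster_log_marginal(x, s, m0=0.0, tau=10.0):
    """Log marginal likelihood of one cluster, centre integrated out.

    Assumes x[i] ~ Normal(mu, s[i]^2) and mu ~ Normal(m0, tau^2);
    the prior (m0, tau) is an assumption of this sketch.
    """
    w = [1.0 / si ** 2 for si in s]            # inverse-variance weights
    W = sum(w)                                 # total precision of the cluster
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / W
    # Weighted sum of squares about the weighted mean (Cochran's Q)
    Q = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    n = len(x)
    # Integrating over mu leaves a Normal density for xbar
    # with variance tau^2 + 1/W.
    v = tau ** 2 + 1.0 / W
    log_pred = -0.5 * math.log(2 * math.pi * v) - 0.5 * (xbar - m0) ** 2 / v
    return (-0.5 * n * math.log(2 * math.pi)
            - sum(math.log(si) for si in s)
            - 0.5 * Q
            + 0.5 * math.log(2 * math.pi / W)
            + log_pred)

def partition_log_marginal(x, s, labels, m0=0.0, tau=10.0):
    """Sum the per-cluster log marginals over a partition given by labels."""
    total = 0.0
    for k in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == k]
        total += cluster_log_marginal([x[i] for i in idx],
                                      [s[i] for i in idx], m0, tau)
    return total

# Illustrative data (made up): two well-separated groups of estimates.
x = [0.10, 0.20, 0.15, 2.00, 2.10]
s = [0.1] * 5
two_clusters = partition_log_marginal(x, s, [0, 0, 0, 1, 1])
one_cluster = partition_log_marginal(x, s, [0, 0, 0, 0, 0])
print(two_clusters > one_cluster)
```

Each extra cluster adds a penalty through the `0.5 * log(2 * pi / W)` and `log_pred` terms, so the split is preferred only when it reduces the sum of squares by enough to offset it. Because the penalty depends on the total precision `W` of each cluster, not just on the number of clusters, it changes with cluster composition as well as cluster count.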