How many data clusters are in the Galaxy data set? Bayesian cluster analysis in action

In model-based clustering, the Galaxy data set is often used as a benchmark to study the performance of different modeling approaches. Aitkin (2001) compares maximum likelihood and Bayesian analyses of the Galaxy data set and expresses reservations about the Bayesian approach because the prior assumptions imposed remain rather obscure while playing a major role in the results obtained and the conclusions drawn. This paper addresses Aitkin's concerns about the Bayesian approach by shedding light on how the specified priors influence the estimated number of clusters. We perform a sensitivity analysis of different prior specifications for the mixture of finite mixtures model, i.e., the mixture model that includes a prior on the number of components. We combine an extensive set of prior specifications in a full factorial design and assess their impact on the estimated number of clusters for the Galaxy data set. The results highlight interaction effects between the prior specifications and indicate which specifications are recommended to obtain a sparse clustering solution. A simulation study with artificial data provides further empirical evidence to support the recommendations. A clear understanding of the impact of the prior specifications removes a barrier to the use of Bayesian methods, namely the complexity of selecting suitable priors. In addition, the regularizing properties of the priors may be exploited intentionally to obtain a clustering solution that meets the prior expectations and needs of the application.
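As a rough illustration of what a full factorial design over prior specifications means, the following Python sketch enumerates every combination of a few factor levels. The factor names and levels (`prior_on_K`, `dirichlet_alpha`, `component_variance_prior`) are hypothetical placeholders chosen for this example and are not the settings used in the paper; in an actual sensitivity analysis, each resulting configuration would be passed to a model-fitting routine and the estimated number of clusters recorded.

```python
from itertools import product

# Hypothetical factors and levels, for illustration only.
# These are NOT the prior specifications studied in the paper.
factors = {
    "prior_on_K": ["uniform", "truncated_poisson", "geometric"],
    "dirichlet_alpha": [0.01, 1.0, 4.0],
    "component_variance_prior": ["diffuse", "informative"],
}


def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


design = full_factorial(factors)

# The number of runs is the product of the numbers of levels per factor.
print(len(design))
for config in design[:3]:
    print(config)
```

Because every level of each factor is crossed with every level of the others, such a design makes it possible to examine interaction effects between prior specifications rather than only their individual effects.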
