Model-based clustering of categorical data based on the Hamming distance

A model-based approach is developed for clustering categorical data with no natural ordering. The proposed method uses the Hamming distance to define a family of probability mass functions for modeling the data. The elements of this family are then used as kernels of a finite mixture model with an unknown number of components. Conjugate Bayesian inference is derived for the parameters of the Hamming distribution model. The mixture is framed in a Bayesian nonparametric setting, and a transdimensional blocked Gibbs sampler is developed to provide full Bayesian inference on the number of clusters, their structure, and the group-specific parameters, which simplifies computation relative to customary reversible jump algorithms. When the number of components is fixed, the proposed model includes a parsimonious latent class model as a special case. Model performance is assessed via a simulation study and reference datasets, showing improvements in clustering recovery over existing approaches.
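As a rough illustration of how a Hamming distance can define a probability mass function, the sketch below assumes a kernel proportional to exp(-d(x, c) / sigma), where d is the Hamming distance to a center c and sigma > 0 is a single shared scale. This particular form, and the names `hamming_pmf`, `center`, `sigma`, and `n_levels`, are assumptions made for the example and are not taken from the abstract; the model in the paper may be parameterized differently.

```python
import math
from itertools import product


def hamming_distance(x, c):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(c):
        raise ValueError("x and c must have the same length")
    return sum(xi != ci for xi, ci in zip(x, c))


def hamming_pmf(x, center, sigma, n_levels):
    """Probability of x under an assumed kernel exp(-d(x, center) / sigma).

    n_levels[j] is the number of categories of attribute j. Because the
    kernel factorizes over attributes, the normalizing constant is
    prod_j (1 + (n_levels[j] - 1) * exp(-1 / sigma)).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    d = hamming_distance(x, center)
    w = math.exp(-1.0 / sigma)
    log_norm = sum(math.log1p((m - 1) * w) for m in n_levels)
    return math.exp(-d / sigma - log_norm)


# Hypothetical example: three attributes with 2, 3 and 4 categories.
n_levels = [2, 3, 4]
center = (0, 1, 2)
support = list(product(*(range(m) for m in n_levels)))
total = sum(hamming_pmf(x, center, 0.7, n_levels) for x in support)
```

Under this assumed form, the probability is largest at the center and decays with the number of mismatched attributes, and a small sigma concentrates the mass on the center. Summing over the whole support (`total` above) gives a quick check that the normalizing constant is right.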
