Bi-clustering is a useful approach for analyzing biological data when observations come from heterogeneous groups and have a large number of features. We outline a general Bayesian approach to bi-clustering problems in moderate to high dimensions and propose three Bayesian bi-clustering models for categorical data, which increase in complexity in how they model the distributions of features across bi-clusters. Our proposed methods apply to a wide range of scenarios: from situations where clusters are distinguishable only in a small subset of features but are masked by a large amount of noise, to situations where different groups of data are identified by different sets of features or where the data exhibit hierarchical structure. Through simulation studies, we show that our methods outperform existing (bi-)clustering methods both in identifying clusters and in recovering feature distributional patterns across bi-clusters. We apply our methods to two genetic datasets, though their area of application is broader.