Biclustering is the task of simultaneously clustering the rows and columns of the data matrix into different subgroups such that the rows and columns within a subgroup exhibit similar patterns. In this paper, we consider the case of producing block-diagonal biclusters. We provide a new formulation of the biclustering problem based on the idea of minimizing the empirical clustering risk. We develop and prove a consistency result with respect to the empirical clustering risk. Since the optimization problem is combinatorial in nature, finding the global minimum is computationally intractable. In light of this fact, we propose a simple and novel algorithm that finds a local minimum by alternating the use of an adapted version of the k-means clustering algorithm between columns and rows. We evaluate and compare the performance of our algorithm to other related biclustering methods on both simulated data and real-world gene expression data sets. The results demonstrate that our algorithm is able to detect meaningful structures in the data and outperform other competing biclustering methods in various settings and situations.
翻译:将数据矩阵的行和列同时组合成不同的分组,使分组内的行和列呈现类似的模式。 在本文中,我们考虑了生成区块对角双组群的情况。我们根据尽量减少实证群集风险的想法,对双组群问题提出了新的配方。我们在实证群集风险方面制定并证明一个一致的结果。由于优化问题的性质是组合性的,发现全球最小值在计算上是难以解决的。根据这一事实,我们提出了一个简单和新颖的算法,通过在列和行之间互换使用调整后的 k 比例组群算法,找到一个本地最小值。我们评估和比较了我们的算法在模拟数据和现实世界的基因表达数据集方面的其他相关双组集法的性能。结果显示,我们的算法能够发现数据中有意义的结构,并超越不同环境和情况中其他相互竞争的双组群集法。