Biclustering algorithms partition data and covariates simultaneously, providing new insights in several domains, such as the analysis of gene expression data to discover new biological functions. This paper develops a new model-free biclustering algorithm in abstract spaces using the notions of energy distance (ED) and maximum mean discrepancy (MMD), two distances between probability distributions capable of handling complex data such as curves or graphs. The proposed method can learn more general and complex cluster shapes than most existing approaches in the literature, which usually focus on detecting differences in mean and variance. Although the biclustering configurations of our approach are constrained to form disjoint structures at the datum and covariate levels, the results are competitive. Given a proper kernel choice, our method performs comparably to state-of-the-art methods in the scenarios where those methods are optimal, and outperforms them when cluster differences are concentrated in higher-order moments. The model's performance has been tested on several simulated and real-world datasets. Finally, new theoretical consistency results are established using tools from optimal transport theory.
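To make the two distances named in the abstract concrete, the following is a minimal sketch of their standard plug-in (V-statistic) estimators. It is not the biclustering algorithm itself. It assumes Euclidean data and a Gaussian kernel with a fixed bandwidth, whereas the paper works in abstract spaces with a user-chosen kernel; the function names `pairwise_dist`, `energy_distance`, and `mmd_squared` are illustrative, not taken from the paper.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def energy_distance(x, y):
    """Plug-in estimate of 2 E|X - Y| - E|X - X'| - E|Y - Y'|."""
    return (2.0 * pairwise_dist(x, y).mean()
            - pairwise_dist(x, x).mean()
            - pairwise_dist(y, y).mean())


def mmd_squared(x, y, bandwidth=1.0):
    """Biased plug-in estimate of squared MMD.

    Assumption: Gaussian kernel with the given bandwidth.
    """
    def kernel(a, b):
        return np.exp(-pairwise_dist(a, b) ** 2 / (2.0 * bandwidth ** 2))

    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()


# Toy data: x and y share the same mean but differ in spread,
# while z is an independent sample from the same distribution as x.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 2))
y = rng.normal(0.0, 3.0, size=(200, 2))
z = rng.normal(0.0, 1.0, size=(200, 2))

ed_xy, ed_xz = energy_distance(x, y), energy_distance(x, z)
mmd_xy, mmd_xz = mmd_squared(x, y), mmd_squared(x, z)
```

In this toy setting both estimates should be noticeably larger for the pair (x, y) than for (x, z), even though x and y have equal means. This is the kind of distributional difference, beyond a shift in the mean, that the abstract says the method can exploit.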