Scaling pattern mining through non-overlapping variable partitioning

Biclustering algorithms play a central role in the biotechnological and biomedical domains. The patterns they extract support the discovery of putative regulatory modules, essential to understanding diseases, aiding therapy research, and advancing biological knowledge. However, given the NP-hard nature of the biclustering task, algorithms with optimality guarantees tend to scale poorly on high-dimensional data. To address this, we propose a pipeline for clustering-based vertical partitioning that takes into consideration both parallelization and cross-partition pattern merging needs. Given a specific type of pattern coherence, the clusters are built based on the likelihood that variables form patterns of that type. The patterns extracted from each cluster are then merged into a final set of closed patterns. This approach is evaluated using five published datasets. Results show that, for some of the tested datasets, execution times improve by a statistically significant margin when variables are clustered by their likelihood of forming specific types of patterns, as opposed to partitions based on dissimilarity or random assignment. This work offers a point of departure for studying the efficiency impact of vertical partitioning criteria across the different stages of pattern mining and biclustering algorithms. Availability: All the code is freely available at https://github.com/JupitersMight/pattern_merge under the MIT license.
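The partition, mine, and merge idea can be illustrated with a minimal Python sketch. It is not the code from the repository above. It assumes constant-coherence patterns over discrete values, a brute-force closed-pattern miner, and hand-picked partitions in place of the likelihood-based clustering step. The names `mine_closed`, `project`, and `merge`, and the toy data, are hypothetical.

```python
from functools import reduce


def mine_closed(transactions, min_support):
    """Brute-force closed patterns: closed itemsets are exactly the
    non-empty intersections of subsets of transactions.
    Returns {supporting_row_ids: itemset}."""
    closed = {t for t in transactions if t}
    frontier = set(closed)
    while frontier:
        new = {a & b for a in frontier for b in closed} - closed - {frozenset()}
        closed |= new
        frontier = new
    patterns = {}
    for items in closed:
        rows = frozenset(i for i, t in enumerate(transactions) if items <= t)
        if len(rows) >= min_support:
            patterns[rows] = items
    return patterns


def project(transactions, columns):
    """Vertical partitioning: keep only the (column, value) items of `columns`."""
    return [frozenset((c, v) for c, v in t if c in columns) for t in transactions]


def merge(left, right, min_support):
    """Merge closed patterns from two disjoint column partitions.
    Candidates are each pattern alone plus every cross-partition pair
    (rows intersected, items united); candidates sharing a row set are
    united, which yields the closed pattern for that row set."""
    merged = {}

    def add(rows, items):
        if len(rows) >= min_support:
            merged[rows] = merged.get(rows, frozenset()) | items

    for rows, items in list(left.items()) + list(right.items()):
        add(rows, items)
    for rows_a, items_a in left.items():
        for rows_b, items_b in right.items():
            add(rows_a & rows_b, items_a | items_b)
    return merged


# Toy discretized data (assumed for illustration): rows are samples, columns are variables.
data = [
    {"g1": 1, "g2": 0, "g3": 1, "g4": 1},
    {"g1": 1, "g2": 0, "g3": 1, "g4": 0},
    {"g1": 1, "g2": 1, "g3": 1, "g4": 1},
    {"g1": 0, "g2": 0, "g3": 0, "g4": 1},
]
transactions = [frozenset(row.items()) for row in data]
partitions = [{"g1", "g2"}, {"g3", "g4"}]  # hand-picked, non-overlapping
min_support = 2

per_partition = [mine_closed(project(transactions, cols), min_support)
                 for cols in partitions]
merged = reduce(lambda a, b: merge(a, b, min_support), per_partition)
direct = mine_closed(transactions, min_support)  # mining without partitioning
```

On this toy input the merged result matches the patterns mined directly from the unpartitioned data. Each call to `mine_closed` on a partition is independent of the others, so in principle the per-partition step could run in parallel, while the cost of `merge` grows with the number of pattern pairs across partitions.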
