Data-driven discovery in the biological sciences currently depends on finding segmentation strategies for multivariate data that yield sensible descriptions of the data. Clustering is only one of several approaches, and it sometimes falls short because it is difficult to choose reasonable cutoffs or the number of clusters to form, or because the approach fails to preserve the topological properties of the original system in its clustered form. In this work, we show how a simple metric for evaluating connectivity clustering leads to an optimised segmentation of biological data. The novelty of the work lies in a simple optimisation method for clustering crowded data. The resulting clustering approach relies only on metrics derived from the inherent properties of the clustering itself. The new method is easy to implement and makes optimised clustering more accessible. We discuss how the clustering optimisation strategy corresponds to the viable information content yielded by the final segmentation. We further elaborate on how the clustering results at the optimal solution correspond to prior knowledge of three different data sets.