Order preserving hierarchical agglomerative clustering

Partial orders and directed acyclic graphs are data structures that arise naturally in many domains and applications, where they represent ordered relations between entities. Examples include task dependencies in a project plan, transaction order in distributed ledgers, and execution sequences of tasks in computer programs. We study the problem of order preserving hierarchical clustering of such ordered data: if $a < b$ in the original data, and $[a]$ and $[b]$ denote their respective clusters, then $[a] < [b]$ must hold in the resulting clustering. The clustering is similarity based, uses standard linkage functions such as single and complete linkage, and extends classical hierarchical clustering. To achieve this, we define the output of classical hierarchical clustering on strictly ordered data to be partial dendrograms: sub-trees of classical dendrograms with several connected components. We then construct an embedding of partial dendrograms over a set into the family of ultrametrics over the same set. An optimal hierarchical clustering is defined as the partial dendrogram whose ultrametric is closest to the original dissimilarity measure, measured in the $p$-norm. The method thus combines classical hierarchical clustering with ultrametric fitting. We use a reference implementation for experiments on both synthetic random data and real-world data from a database of machine parts. Compared with existing methods, our method performs better in these experiments in both cluster quality and order preservation.
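The order preservation condition and the ultrametric comparison described in the abstract can be illustrated with a small Python sketch. This is not the reference implementation mentioned in the abstract: it is a simplified greedy variant that always takes the cheapest admissible single-linkage merge, where a merge is treated as admissible if the relation induced on the clusters stays acyclic. The function names, the toy data, and the `cap` value assigned to pairs that are never merged are all assumptions made for illustration.

```python
from itertools import combinations

def is_acyclic(clusters, order):
    """True if the relation induced on clusters by `order` has no cycle."""
    index = {x: i for i, c in enumerate(clusters) for x in c}
    edges = {i: set() for i in range(len(clusters))}
    for a, b in order:
        if index[a] == index[b]:
            return False  # a < b inside one cluster: order cannot be preserved
        edges[index[a]].add(index[b])
    indegree = {i: 0 for i in edges}
    for targets in edges.values():
        for j in targets:
            indegree[j] += 1
    ready = [i for i, k in indegree.items() if k == 0]
    seen = 0
    while ready:  # Kahn's algorithm: every node is removed iff there is no cycle
        i = ready.pop()
        seen += 1
        for j in edges[i]:
            indegree[j] -= 1
            if indegree[j] == 0:
                ready.append(j)
    return seen == len(clusters)

def single_linkage(A, B, d):
    return min(d[frozenset((a, b))] for a in A for b in B)

def greedy_order_preserving(points, d, order):
    """Greedy sketch (assumption): merge the closest admissible pair until none is left."""
    clusters = [frozenset([x]) for x in points]
    merges = []  # (level, merged cluster), in merge order
    while True:
        candidates = []
        for A, B in combinations(clusters, 2):
            rest = [C for C in clusters if C not in (A, B)]
            if is_acyclic(rest + [A | B], order):
                candidates.append((single_linkage(A, B, d), sorted(A | B), A, B))
        if not candidates:
            return clusters, merges  # several components remain: a partial dendrogram
        level, _, A, B = min(candidates, key=lambda c: (c[0], c[1]))
        clusters = [C for C in clusters if C not in (A, B)] + [A | B]
        merges.append((level, A | B))

def ultrametric(points, merges, cap):
    """Merge level per pair; `cap` (hypothetical) is used for pairs never merged."""
    u = {}
    for x, y in combinations(points, 2):
        levels = [lvl for lvl, C in merges if x in C and y in C]
        u[frozenset((x, y))] = min(levels) if levels else cap
    return u

def p_norm_gap(u, d, p=2):
    return sum(abs(u[k] - d[k]) ** p for k in d) ** (1 / p)

# Toy data (assumed): a < b and c < d, with a and b the closest pair.
points = ["a", "b", "c", "d"]
order = {("a", "b"), ("c", "d")}
raw = {("a", "b"): 1, ("a", "c"): 2, ("b", "d"): 3,
       ("b", "c"): 4, ("a", "d"): 5, ("c", "d"): 6}
d = {frozenset(k): v for k, v in raw.items()}

clusters, merges = greedy_order_preserving(points, d, order)
u = ultrametric(points, merges, cap=7)
gap = p_norm_gap(u, d, p=2)
print(sorted(sorted(c) for c in clusters), gap)
```

In this toy example the closest pair, a and b, is never merged because a < b; the sketch ends with the clusters {a, c} and {b, d}, which satisfy [a] < [b] and [c] < [d]. A greedy pass commits to a single merge sequence, whereas the abstract defines the optimal clustering as the partial dendrogram whose ultrametric is closest to the original dissimilarity, so `p_norm_gap` here only scores the one result the greedy pass happens to produce.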
