Hierarchical clustering has been studied and used extensively as a method for data analysis. More recently, Dasgupta [2016] defined a precise objective function: given a set of $n$ data points with a weight $w_{i,j}$ for each pair of items $i$ and $j$ denoting their similarity/dissimilarity, the goal is to build a recursive (tree-like) partitioning of the data points (items) into successively smaller clusters. He defined the cost of a tree $T$ to be $Cost(T) = \sum_{i,j \in [n]} \big(w_{i,j} \times |T_{i,j}| \big)$, where $T_{i,j}$ is the subtree rooted at the least common ancestor of $i$ and $j$, and presented the first approximation algorithm for this objective. Moseley and Wang [2017] then considered the dual of Dasgupta's objective function for similarity-based weights and showed that both random partitioning and average linkage achieve approximation ratio $1/3$, which has been improved in a series of works to $0.585$ [Alon et al. 2020]. Later, Cohen-Addad et al. [2019] considered the same objective function as Dasgupta's but for dissimilarity-based metrics, called $Rev(T)$, and showed that both random partitioning and average linkage achieve ratio $2/3$, which has since been only slightly improved to $0.667078$ [Charikar et al., SODA 2020]. Our first main result considers $Rev(T)$ and presents a more delicate algorithm, with a careful analysis, that achieves a $0.71604$ approximation. We also introduce a new objective function for dissimilarity-based clustering. For any tree $T$, let $H_{i,j}$ be the number of common ancestors of $i$ and $j$. Intuitively, items that are similar are expected to remain within the same cluster as deep in the tree as possible. So, for dissimilarity-based metrics, we propose minimizing the cost $Cost_H(T) = \sum_{i,j \in [n]} \big(w_{i,j} \times H_{i,j} \big)$ and present a $1.3977$-approximation algorithm for this objective.
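As a small worked illustration of these two objectives (the tree and the weights below are hypothetical, not taken from the paper; $|T_{i,j}|$ is taken to be the number of leaves of the subtree, as in Dasgupta's formulation, and the sums run over unordered pairs), consider three items $a$, $b$, $c$ with dissimilarities $w_{a,b}=1$, $w_{a,c}=4$, $w_{b,c}=3$, and let $T$ be the tree whose root separates $\{a,b\}$ from $\{c\}$ and whose next level separates $a$ from $b$. The pair $(a,b)$ meets at a node with $2$ leaves and has $2$ common ancestors, while the pairs $(a,c)$ and $(b,c)$ meet at the root with $3$ leaves and have $1$ common ancestor each, so
\[
Rev(T) = 1\cdot 2 + 4\cdot 3 + 3\cdot 3 = 23, \qquad Cost_H(T) = 1\cdot 2 + 4\cdot 1 + 3\cdot 1 = 9.
\]
For the alternative tree $T'$ whose root instead separates $\{a,c\}$ from $\{b\}$, the same computation gives $Rev(T') = 4\cdot 2 + 1\cdot 3 + 3\cdot 3 = 20$ and $Cost_H(T') = 4\cdot 2 + 1\cdot 1 + 3\cdot 1 = 12$; both objectives thus prefer the tree that separates the most dissimilar pairs as early as possible.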