Comparison-based learning addresses the problem of learning when, instead of explicit features or pairwise similarities, one only has access to comparisons of the form: \emph{Object $A$ is more similar to $B$ than to $C$.} Recently, it has been shown that, in Hierarchical Clustering, single and complete linkage can be directly implemented using only such comparisons while several algorithms have been proposed to emulate the behaviour of average linkage. Hence, finding hierarchies (or dendrograms) using only comparisons is a well understood problem. However, evaluating their meaningfulness when no ground-truth nor explicit similarities are available remains an open question. In this paper, we bridge this gap by proposing a new revenue function that allows one to measure the goodness of dendrograms using only comparisons. We show that this function is closely related to Dasgupta's cost for hierarchical clustering that uses pairwise similarities. On the theoretical side, we use the proposed revenue function to resolve the open problem of whether one can approximately recover a latent hierarchy using few triplet comparisons. On the practical side, we present principled algorithms for comparison-based hierarchical clustering based on the maximisation of the revenue and we empirically compare them with existing methods.
翻译:以比较为基础的学习解决了学习问题,当时只有一个人可以比较形式: \ emph{Object $A$比美元美元更接近于美元。}最近,已经表明,在等级类集中,仅可直接使用这种比较来实施单一和完整的联系,同时提出若干算法以效仿平均联系的行为。因此,只使用比较就能找到等级(或登杜拉克)是一个广为人知的问题。然而,在没有地面真相或明显相似之处时评价其意义仍然是一个尚未解决的问题。在本文中,我们提出一个新的收入功能,允许一个人仅用比较来衡量弯曲仪的好坏程度,以此弥补这一差距。我们表明,这一功能与Dasgupta的等级类集成本密切相关,使用相似之处。在理论方面,我们使用拟议的收入函数来解决一个公开的问题,即一个人是否能够利用很少的三重比较来恢复前位等级。在实际方面,我们提出根据现有等级收入的比较方法进行原则性推算。