Comparison-based learning addresses the problem of learning when, instead of explicit features or pairwise similarities, one only has access to comparisons of the form: \emph{Object $A$ is more similar to $B$ than to $C$.} Recently, it has been shown that, in Hierarchical Clustering, single and complete linkage can be directly implemented using only such comparisons while several algorithms have been proposed to emulate the behaviour of average linkage. Hence, finding hierarchies (or dendrograms) using only comparisons is a well understood problem. However, evaluating their meaningfulness when no ground-truth nor explicit similarities are available remains an open question. In this paper, we bridge this gap by proposing a new revenue function that allows one to measure the goodness of dendrograms using only comparisons. We show that this function is closely related to Dasgupta's cost for hierarchical clustering that uses pairwise similarities. On the theoretical side, we use the proposed revenue function to resolve the open problem of whether one can approximately recover a latent hierarchy using few triplet comparisons. On the practical side, we present principled algorithms for comparison-based hierarchical clustering based on the maximisation of the revenue and we empirically compare them with existing methods.
翻译:比较学习解决了学习问题,当我们只有形式为“物体$A$比$B$更相似,比$C$更相似”的比较而非显式特征或成对相似性时。最近,已经证明,在层次聚类中,可以直接使用这样的比较实现单一和完全链接,同时已经提出了几种算法来模拟平均链接的行为。因此,在只有比较而没有基准真实度或显式相似性的情况下,找到只使用比较来表达层次关系(或树形图)被充分理解。然而,在没有基准真实度或显式相似性的情况下,评估其有意义的程度仍然是一个未解决的问题。在本文中,我们提出了一个新的收益函数,通过它可以用只有比较的信息来测量树形图的好坏。我们展示了这个函数与使用成对相似性的层次聚类中Dasgupta损失相关。在理论上,我们使用所提出的收益函数解决了一个开放性问题,即在使用少量三元组比较恢复潜在层次是否可行。在实践上,我们基于收益的最大化提出了比较层次聚类的原则性算法,并与现有方法进行了实证比较。