We connect the problem of semi-supervised clustering to constrained Markov aggregation, i.e., the task of partitioning the state space of a Markov chain. We achieve this connection by considering every data point in the dataset as an element of the Markov chain's state space, by defining the transition probabilities between states via similarities between corresponding data points, and by incorporating semi-supervision information as hard constraints in a Hartigan-style algorithm. The introduced Constrained Markov Clustering (CoMaC) is an extension of a recent information-theoretic framework for (unsupervised) Markov aggregation to the semi-supervised case. Instantiating CoMaC for certain parameter settings further generalizes two previous information-theoretic objectives for unsupervised clustering. Our results indicate that CoMaC is competitive with the state-of-the-art. Furthermore, our approach is less sensitive to hyperparameter settings than the unsupervised counterpart, which is especially attractive in the semi-supervised setting characterized by little labeled data.
翻译:我们把半监督的集群问题与限制的Markov聚合联系起来,即将马尔科夫链条的国家空间分割任务。我们通过将数据集中的每个数据点作为Markov链条国家空间的一个要素来实现这一联系,通过相应的数据点之间的相似点来界定各州之间的过渡概率,并将半监督信息作为硬性限制纳入Hartigan式算法。引入的 Constractive Markov 集群(CoMaC)是(不受监督的)Markov 集合的最新信息理论框架的延伸,而这是半监督案例的延伸。为某些参数设置而强化的COMaC进一步概括了先前两个未监督的集群的信息理论目标。我们的结果表明,CoMaC与State-art相比具有竞争力。此外,我们的方法对超参数设置的敏感度比未监督的对应方要低,在以小标签数据为特征的半监督环境中,这种环境特别有吸引力。