We introduce an evolutionary algorithm called recombinator-$k$-means for optimizing the highly non-convex $k$-means problem. Its defining feature is that its crossover step involves all the members of the current generation, stochastically recombining them with a repurposed variant of the $k$-means++ seeding algorithm. The recombination also uses a reweighting mechanism that realizes a progressively sharper stochastic selection policy and ensures that the population eventually coalesces into a single solution. We compare this scheme with a state-of-the-art alternative, a more standard genetic algorithm with deterministic pairwise-nearest-neighbor crossover and an elitist selection policy, of which we also provide an augmented and efficient implementation. Extensive tests on large and challenging datasets (both synthetic and real-world) show that, for fixed population sizes, recombinator-$k$-means is generally superior in terms of the optimization objective, at the cost of a more expensive crossover step. When the population sizes of the two algorithms are adjusted to match their running times, we find that for short times the (augmented) pairwise-nearest-neighbor method is always superior, while at longer times recombinator-$k$-means matches it and, on the most difficult examples, overtakes it. We conclude that the reweighted whole-population recombination is more costly, but generally better at escaping local minima. Moreover, it is algorithmically simpler and more general (it could be applied even to $k$-medians or $k$-medoids, for example). Our implementations are publicly available.
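To make the whole-population crossover more concrete, here is a minimal Python sketch of one generation. It is not the authors' implementation, and the abstract does not specify the details. The exponential reweighting by cost, the sharpness parameter `beta`, and all function names (`weighted_kmeanspp`, `recombine`, `next_generation`) are assumptions made purely for illustration. The idea shown is that the centroids of all current solutions are pooled, weighted by the quality of the solution they came from, and a $k$-means++-style seeding over that weighted pool picks $k$ starting centroids, which Lloyd's iterations then refine.

```python
import numpy as np

def kmeans_cost(X, C):
    # Sum of squared distances from each point to its nearest centroid.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def lloyd(X, C, iters=20):
    # Plain Lloyd iterations; an empty cluster keeps its previous centroid.
    C = C.copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(len(C)):
            pts = X[labels == j]
            if len(pts) > 0:
                C[j] = pts.mean(axis=0)
    return C

def weighted_kmeanspp(P, w, k, rng):
    # k-means++-style seeding over candidates P with prior weights w:
    # each pick is sampled with probability proportional to
    # w * (squared distance to the nearest candidate already chosen).
    n = len(P)
    first = rng.choice(n, p=w / w.sum())
    chosen = [first]
    d2 = ((P - P[first]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        p = w * d2
        if p.sum() <= 0:  # degenerate pool: fall back to the prior weights
            p = w.copy()
        nxt = rng.choice(n, p=p / p.sum())
        chosen.append(nxt)
        d2 = np.minimum(d2, ((P - P[nxt]) ** 2).sum(axis=1))
    return P[chosen].copy()

def recombine(X, population, costs, beta, rng):
    # Pool the centroids of ALL current solutions and weight each centroid
    # by the quality of its parent solution (assumed exponential form).
    k = population[0].shape[0]
    pool = np.vstack(population)
    costs = np.asarray(costs, dtype=float)
    scale = max(costs.mean(), 1e-12)
    sol_w = np.exp(-beta * (costs - costs.min()) / scale)
    w = np.repeat(sol_w, k)
    seeds = weighted_kmeanspp(pool, w, k, rng)
    return lloyd(X, seeds)

def next_generation(X, population, beta, rng):
    # Every child is drawn from the whole current generation.
    costs = [kmeans_cost(X, C) for C in population]
    return [recombine(X, population, costs, beta, rng) for _ in population]
```

In this sketch, calling `next_generation` repeatedly with a growing `beta` concentrates the weights on the best solutions, which is one plausible way to obtain a progressively sharper selection that lets the population coalesce. The schedule for `beta` and the stopping rule are left open here.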