We introduce an evolutionary algorithm called recombinator-$k$-means for optimizing the highly non-convex $k$-means problem. Its defining feature is that its crossover step involves all the members of the current generation, stochastically recombining them with a repurposed variant of the $k$-means++ seeding algorithm. The recombination also uses a reweighting mechanism that realizes a progressively sharper stochastic selection policy and ensures that the population eventually coalesces into a single solution. We compare this scheme with a state-of-the-art alternative, a more standard genetic algorithm with deterministic pairwise-nearest-neighbor crossover and an elitist selection policy, of which we also provide an augmented and efficient implementation. Extensive tests on large and challenging datasets (both synthetic and real-world) show that for fixed population sizes recombinator-$k$-means is generally superior in terms of the optimization objective, at the cost of a more expensive crossover step. When adjusting the population sizes of the two algorithms to match their running times, we find that for short times the (augmented) pairwise-nearest-neighbor method is always superior, while at longer times recombinator-$k$-means matches it and, on the most difficult examples, overtakes it. We conclude that the reweighted whole-population recombination is more costly, but generally better at escaping local minima. Moreover, it is algorithmically simpler and more general (it could be applied even to $k$-medians or $k$-medoids, for example). Our implementations are publicly available at \href{https://github.com/carlobaldassi/RecombinatorKMeans.jl}{https://github.com/carlobaldassi/RecombinatorKMeans.jl}.
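The whole-population crossover described above can be illustrated with a minimal sketch. It is not the authors' implementation (which is in Julia, at the repository linked above), and it rests on several assumptions: the function names `solution_weights`, `weighted_kmeanspp` and `recombine` are hypothetical, the exponential weighting with a sharpness parameter `beta` is only one plausible way to obtain a progressively sharper selection policy, and the local optimization that would normally follow the seeding is omitted. The sketch pools the centroids of every solution in the population, gives each centroid the weight of its parent solution, and draws $k$ new centroids from the pool by weighted $k$-means++ ($D^2$) sampling.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def solution_weights(costs, beta):
    """Assumed reweighting: exp(-beta * (cost - best cost)).

    beta = 0 gives uniform weights; larger beta favours low-cost solutions.
    This form is an illustrative assumption, not taken from the abstract.
    """
    best = min(costs)
    return [math.exp(-beta * (c - best)) for c in costs]


def weighted_kmeanspp(points, weights, k, rng):
    """Pick k points from `points` by weighted k-means++ (D^2) sampling."""
    n = len(points)
    if k > n:
        raise ValueError("k cannot exceed the number of candidate points")
    # First centre: sampled in proportion to the weights alone.
    first = rng.choices(range(n), weights=weights, k=1)[0]
    chosen = [first]
    # d2[i] is the squared distance from point i to its nearest chosen centre.
    d2 = [sq_dist(p, points[first]) for p in points]
    while len(chosen) < k:
        scores = [w * d for w, d in zip(weights, d2)]
        if sum(scores) > 0.0:
            nxt = rng.choices(range(n), weights=scores, k=1)[0]
        else:
            # Degenerate pool (all candidates coincide with chosen centres):
            # fall back to any index that has not been used yet.
            nxt = rng.choice([i for i in range(n) if i not in chosen])
        chosen.append(nxt)
        d2 = [min(d, sq_dist(p, points[nxt])) for p, d in zip(points, d2)]
    return [points[i] for i in chosen]


def recombine(population, costs, k, beta, rng):
    """Build one new set of k centroids from the whole population."""
    pool, pool_weights = [], []
    for centroids, w in zip(population, solution_weights(costs, beta)):
        for c in centroids:
            pool.append(c)          # every centroid of every solution
            pool_weights.append(w)  # inherits its parent solution's weight
    return weighted_kmeanspp(pool, pool_weights, k, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    population = [
        [(0.0, 0.0), (10.0, 10.0)],
        [(0.1, 0.0), (9.9, 10.0)],
        [(5.0, 5.0), (5.1, 5.0)],
    ]
    costs = [1.0, 1.2, 50.0]  # made-up objective values for illustration
    print(recombine(population, costs, k=2, beta=1.0, rng=rng))
```

In this sketch, raising `beta` from one generation to the next concentrates the sampling on the centroids of the best solutions, which is one way a population could be driven to coalesce into a single solution.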