We propose a new variant of the k-median problem, where the objective function models not only the cost of assigning data points to cluster representatives, but also a penalty term for disagreement among the representatives. We motivate this novel problem by applications where we are interested in clustering data while avoiding selecting representatives that are too far from each other. For example, we may want to summarize a set of news sources, but avoid selecting ideologically-extreme articles in order to reduce polarization. To solve the proposed k-median formulation we adopt the local-search algorithm of Arya et al. We show that the algorithm provides a provable approximation guarantee, which becomes constant under an assumption on the minimum number of points for each cluster. We experimentally evaluate our problem formulation and proposed algorithm on datasets inspired by the motivating applications. In particular, we experiment with data extracted from Twitter, the US Congress voting records, and popular news sources. The results show that our objective can lead to choosing less polarized groups of representatives without significant loss in representation fidelity.
翻译:我们建议了 k- 中间问题的新变体, 客观功能模型不仅向分组代表分配数据的成本指向数据组代表, 而且还规定了代表之间意见分歧的处罚条件。 我们通过应用来激励这个新问题,我们感兴趣的是将数据分组,同时避免选择彼此相距太远的代表。 例如, 我们可能想要总结一组新闻来源, 但避免选择意识形态极端文章来减少两极分化。 为了解决拟议的 k- 中间公式, 我们采用了Arya et al 的本地搜索算法。 我们显示, 算法提供了一种可验证的近似保证, 在每个组最低点数的假设下, 这种保证会保持不变。 我们实验性地评估我们的问题配置, 并提议由激励性应用程序所启发的数据集的算法。 特别是, 我们实验从Twitter、 美国国会投票记录和流行新闻来源提取的数据。 结果显示, 我们的目标可以导致选择极化程度较低的代表群体, 而不会在真实性方面遭受重大损失。