Diversity maximization aims to select a diverse and representative subset of items from a large dataset. It is a fundamental optimization task that finds applications in data summarization, feature selection, web search, recommender systems, and elsewhere. However, in a setting where data items are associated with different groups according to sensitive attributes like sex or race, it is possible that algorithmic solutions for this task, if left unchecked, will under- or over-represent some of the groups. Therefore, we are motivated to address the problem of \emph{max-min diversification with fairness constraints}, aiming to select $k$ items to maximize the minimum distance between any pair of selected items while ensuring that the number of items selected from each group falls within predefined lower and upper bounds. In this work, we propose an exact algorithm based on integer linear programming that is suitable for small datasets as well as a $\frac{1-\varepsilon}{5}$-approximation algorithm for any $\varepsilon \in (0, 1)$ that scales to large datasets. Extensive experiments on real-world datasets demonstrate the superior performance of our proposed algorithms over existing ones.
翻译:多样性最大化的目的是从大型数据集中选择一个多样化和有代表性的子集,这是一项基本的优化任务,在数据总和、特征选择、网络搜索、推荐系统和其他方面找到应用。然而,在数据项目根据性别或种族等敏感属性与不同组群相联系的环境下,如果数据项目不加以限制,这项任务的算法解决方案可能会低或过份地代表一些组。因此,我们有志于解决具有公平性制约的\emph{max-min多样化问题,目的是选择美元项目,以最大限度地增加所选项目组合之间的最小距离,同时确保从每个组中选定的项目数量在预先确定的下限和上限范围内。在这项工作中,我们建议基于整数线性编程的精确算法,适合于小数据集,以及用于任何 $\ varepsilon \ e e e, 1, 1, 美元, 1美元, 以至大型数据集。在现实世界数据集上进行广泛的实验,展示我们目前提议的高级算法的性。