We study the problem of extracting a small subset of representative items from a large data stream. In many data mining and machine learning applications such as social network analysis and recommender systems, this problem can be formulated as maximizing a monotone submodular function subject to a cardinality constraint $k$. In this work, we consider the setting where data items in the stream belong to one of several disjoint groups and investigate the optimization problem with an additional \emph{fairness} constraint that limits selection to a given number of items from each group. We then propose efficient algorithms for the fairness-aware variant of the streaming submodular maximization problem. In particular, we first give a $ (\frac{1}{2}-\varepsilon) $-approximation algorithm that requires $ O(\frac{1}{\varepsilon} \log \frac{k}{\varepsilon}) $ passes over the stream for any constant $ \varepsilon>0 $. Moreover, we give a single-pass streaming algorithm that has the same approximation ratio of $(\frac{1}{2}-\varepsilon)$ when unlimited buffer sizes and post-processing time are permitted, and discuss how to adapt it to more practical settings where the buffer sizes are bounded. Finally, we demonstrate the efficiency and effectiveness of our proposed algorithms on two real-world applications, namely \emph{maximum coverage on large graphs} and \emph{personalized recommendation}.
翻译:我们研究从大数据流中提取代表项目的小子集的问题。 在许多数据挖掘和机器学习应用程序中, 比如社交网络分析和推荐系统, 这个问题可以被描述为在基度限制下最大化单调子模块功能 $k$。 在这项工作中, 我们考虑流中的数据项目属于多个脱节组之一的设置, 并使用额外的 emph{ 公平} 限制来调查优化问题, 将选择限制在每个组的某个特定项目中。 我们然后为流出子模式最大化问题的公平认知变量提出有效的算法 。 特别是, 我们首先给出一个$ (\ frac{ 1\\\\\\\\\\\\\\\\\\\\\\ varepsilon) 的组合。 $ (\\ frafrac\ varepsilon) 的算法需要 O(\\\\\\\ { 1\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\