Submodular functions are at the core of many machine learning and data mining tasks. The underlying submodular functions for many of these tasks are decomposable, i.e., each is a sum of several simple submodular functions. In many data-intensive applications, however, the number of underlying submodular functions in the original function is so large that processing it requires a prohibitively large amount of time, and/or the function does not even fit in main memory. To overcome this issue, we introduce the notion of sparsification for decomposable submodular functions, whose objective is to obtain an accurate approximation of the original function as a (weighted) sum of only a few submodular functions. Our main result is a polynomial-time randomized sparsification algorithm such that the expected number of functions used in the output is independent of the number of underlying submodular functions in the original function. We also study the effectiveness of our algorithm under various constraints such as matroid and cardinality constraints. We complement our theoretical analysis with an empirical study of the performance of our algorithm.
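For concreteness, the objects above can be written as follows; the symbols $f_i$, $w_i$, $N$, and the multiplicative $(1 \pm \varepsilon)$ guarantee are one natural formalization of the "accurate approximation" described in the abstract, used here for illustration rather than fixed by it:
\[
f(S) = \sum_{i=1}^{N} f_i(S),
\qquad
\hat{f}(S) = \sum_{i=1}^{N} w_i\, f_i(S) \ \text{with only a few } w_i \neq 0,
\]
\[
(1-\varepsilon)\, f(S) \;\le\; \hat{f}(S) \;\le\; (1+\varepsilon)\, f(S)
\quad \text{for all } S \subseteq V,
\]
where each $f_i \colon 2^V \to \mathbb{R}_{\ge 0}$ is a simple submodular function on the ground set $V$.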