The optimization of submodular functions constitutes a viable way to perform clustering. Strong approximation guarantees and feasible optimization w.r.t. streaming data make this clustering approach favorable. Technically, submodular functions map subsets of data to real values, which indicate how "representative" a specific subset is. Optimal sets might then be used to partition the data space and to infer clusters. Exemplar-based clustering is one of the possible submodular functions, but suffers from high computational complexity. However, for practical applications, the particular real-time or wall-clock run-time is decisive. In this work, we present a novel way to evaluate this particular function on GPUs, which keeps the necessities of optimizers in mind and reduces wall-clock run-time. To discuss our GPU algorithm, we investigated both the impact of different run-time critical problem properties, like data dimensionality and the number of data points in a subset, and the influence of required floating-point precision. In reproducible experiments, our GPU algorithm was able to achieve competitive speedups of up to 72x depending on whether multi-threaded computation on CPUs was used for comparison and the type of floating-point precision required. Half-precision GPU computation led to large speedups of up to 452x compared to single-precision, single-thread CPU computations.
翻译:优化子模块函数是一种可行的组合方式。 强大的近似保障和可行的优化 w.r.t. t. 流动数据使这种组合法变得有利。 在技术上, 子模块函数将数据子集映射成真实值, 表明“ 代表” 一个特定子集如何。 最佳组合可以用来分割数据空间和推断组。 Exmplar 组合是可能的子模块功能之一, 但却具有高计算复杂性。 但是, 对于实际应用来说, 特定的实时或墙上时钟运行时间是决定性的。 在这项工作中, 我们提出了一个创新的方法来评估这个特定功能。 在GPUs上, 它将保持优化者的需要, 并减少小时运行时间。 为了讨论我们的 GPU 算法, 我们研究了不同运行时的关键问题特性的影响, 比如数据维度和子集中的数据点数量, 以及要求的浮动点精确度的影响。 在可复制的实验中, 我们的 GPU 算法能够达到最多72x 的竞争性加速度, 取决于是否将优化者放在心上, MIPS 4 类的计算中, 需要进行大型的移动的C- 级的计算。