Conditional gradient methods (CGM) are widely used in modern machine learning. The overall running time of CGM usually consists of two parts: the number of iterations and the cost of each iteration. Most efforts focus on reducing the number of iterations as a means to reduce the overall running time. In this work, we focus instead on improving the per-iteration cost of CGM. The bottleneck step in most CGM variants is maximum inner product search (MaxIP), which requires a linear scan over the parameters. In practice, approximate MaxIP data structures are found to be helpful heuristics. However, nothing is known theoretically about combining approximate MaxIP data structures with CGM. In this work, we answer this question positively by providing a formal framework that combines locality-sensitive hashing (LSH) type approximate MaxIP data structures with CGM algorithms. As a result, we give the first algorithms whose cost per iteration is sublinear in the number of parameters for several fundamental optimization methods, e.g., Frank-Wolfe, the Herding algorithm, and policy gradient.
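To make the bottleneck concrete, the following is a minimal sketch (not from the paper) of a Frank-Wolfe loop over the convex hull of n atoms, where the linear minimization oracle at each step reduces to a MaxIP query against the negative gradient. The `maxip` argument marks the exact linear scan that an LSH-based approximate MaxIP index would replace; all names and the toy objective are illustrative assumptions.

```python
import numpy as np

def exact_maxip(query, atoms):
    """Exact MaxIP: a linear scan over the n atoms.
    This O(n*d) scan per step is the per-iteration bottleneck."""
    scores = atoms @ query                # inner product with every atom
    return atoms[np.argmax(scores)]

def frank_wolfe(grad_f, atoms, x0, T, maxip=exact_maxip):
    """Frank-Wolfe over conv(atoms). Each iteration calls a MaxIP oracle
    on the negative gradient; passing an approximate (e.g., LSH-based)
    MaxIP routine as `maxip` is the substitution the abstract describes.
    This sketch only shows where such a data structure plugs in."""
    x = x0.copy()
    for t in range(T):
        g = grad_f(x)
        s = maxip(-g, atoms)              # LMO: argmin_v <g, v> = argmax_v <-g, v>
        gamma = 2.0 / (t + 2)             # standard Frank-Wolfe step size
        x = (1 - gamma) * x + gamma * s
    return x

# Toy usage: approximate the projection of a point onto conv(atoms).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    atoms = rng.normal(size=(1000, 16))
    atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)
    target = rng.normal(size=16)
    x_hat = frank_wolfe(lambda x: x - target, atoms, atoms[0], T=100)
```

With an exact scan, each iteration costs time linear in the number of atoms; swapping in a sublinear approximate MaxIP query is exactly the per-iteration saving the framework formalizes.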