Optimization is at the heart of machine learning, statistics, and several applied scientific disciplines. Proximal algorithms form a class of methods that are broadly applicable and particularly well suited to nonsmooth, constrained, large-scale, and distributed optimization problems. Essentially five proximal algorithms are currently known, each proposed in a seminal work: forward-backward splitting, Tseng splitting, Douglas-Rachford splitting, the alternating direction method of multipliers, and the more recent Davis-Yin splitting. Such methods sit at a higher level of abstraction than gradient-based methods and have deep roots in nonlinear functional analysis. In this paper, we show that all of these algorithms can be derived as different discretizations of a single differential equation, namely the simple gradient flow, which dates back to Cauchy (1847). An important ingredient behind many of the success stories in machine learning is "accelerating" the convergence of first-order methods. However, accelerated methods are notoriously difficult to analyze, counterintuitive, and lack an underlying guiding principle. We show that applying similar discretization schemes to Newton's classical equation of motion with an additional dissipative force, which we refer to as the accelerated gradient flow, yields accelerated variants of all these proximal algorithms; most of these are new, although some recover known methods in the literature. Moreover, we extend these algorithms to stochastic optimization settings, allowing us to make connections with Langevin and Fokker-Planck equations. Similar ideas apply to the simpler cases of gradient descent, the heavy ball method, and Nesterov's method. These results thus provide a unified framework from which several optimization methods can be derived from basic physical systems.
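To make the two dynamical systems referenced above concrete, a minimal sketch in standard notation follows; the trajectory $X(t)$, damping coefficient $\eta(t)$, and step size $h$ are illustrative symbols introduced here and are not fixed by this abstract.

% Gradient flow (Cauchy, 1847): continuous-time steepest descent on f
\[ \dot{X}(t) = -\nabla f\big(X(t)\big) \]
% Accelerated gradient flow: Newton's equation of motion with an additional
% dissipative (damping) force \eta(t)\dot{X}(t)
\[ \ddot{X}(t) + \eta(t)\,\dot{X}(t) = -\nabla f\big(X(t)\big) \]
% One example of a discretization: an implicit (backward) Euler step of the
% gradient flow with step size h is exactly the classical proximal point iteration
\[ x_{k+1} = x_k - h\,\nabla f(x_{k+1}) \;=\; \operatorname{prox}_{hf}(x_k), \qquad \operatorname{prox}_{hf}(x) := \arg\min_{z}\Big\{ f(z) + \tfrac{1}{2h}\lVert z-x\rVert^{2} \Big\}. \]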