We investigate the fundamental optimization question of minimizing a target function $f$, whose gradients are expensive to compute or have limited availability, given access to some auxiliary side function $h$ whose gradients are cheap or more available. This formulation captures many settings of practical relevance, such as i) re-using batches in SGD, ii) transfer learning, iii) federated learning, and iv) training with compressed models/dropout. We propose two generic new algorithms that apply in all these settings and prove that this framework yields a benefit under only an assumption on the Hessian similarity between the target and the side function. The benefit is obtained when this similarity measure is small; we also show a potential benefit from stochasticity when the auxiliary noise is correlated with that of the target function.
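To make the setup concrete, the sketch below illustrates one natural (and standard) way to exploit cheap gradients of a side function $h$ when gradients of the target $f$ are expensive: run inner steps on $h$ shifted by an occasional correction $\nabla f - \nabla h$ evaluated at an anchor point. This is not the paper's proposed algorithms, only a minimal illustration of why Hessian similarity between $f$ and $h$ matters; all function and parameter names here are hypothetical.

```python
import numpy as np

def shifted_gradient_descent(grad_f, grad_h, x0, lr=0.2, outer_steps=20, inner_steps=50):
    """Illustrative surrogate-gradient loop (not the paper's algorithms).

    Expensive gradients of the target f are queried only at anchor points,
    while cheap gradients of the side function h drive the inner updates.
    When the Hessian similarity ||grad^2 f - grad^2 h|| is small, the fixed
    correction grad_f(anchor) - grad_h(anchor) keeps inner steps aligned with f.
    """
    x = x0.copy()
    for _ in range(outer_steps):
        anchor = x.copy()
        # One expensive f-gradient per outer round.
        correction = grad_f(anchor) - grad_h(anchor)
        for _ in range(inner_steps):
            # Many cheap h-gradients plus the fixed shift.
            x = x - lr * (grad_h(x) + correction)
    return x

# Toy example: f and h are quadratics with similar Hessians.
A_f = np.array([[3.0, 0.2], [0.2, 2.0]])
A_h = np.array([[2.9, 0.1], [0.1, 2.1]])
b = np.array([1.0, -1.0])
grad_f = lambda x: A_f @ x - b
grad_h = lambda x: A_h @ x - b

x_hat = shifted_gradient_descent(grad_f, grad_h, x0=np.zeros(2))
print(x_hat, np.linalg.solve(A_f, b))  # close to the true minimizer of f
```

In this toy example the outer iteration contracts at a rate governed by how close the two Hessians are, which mirrors the abstract's claim that a benefit is obtained when the Hessian similarity measure is small.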