We study the power of learning via mini-batch stochastic gradient descent (SGD) on the population loss, and batch Gradient Descent (GD) on the empirical loss, of a differentiable model or neural network, and ask what learning problems can be learnt using these paradigms. We show that SGD and GD can always simulate learning with statistical queries (SQ), but their ability to go beyond that depends on the precision $\rho$ of the gradient calculations relative to the mini-batch size $b$ (for SGD) and sample size $m$ (for GD). With fine enough precision relative to the mini-batch size, namely when $b \rho$ is small enough, SGD can go beyond SQ learning and simulate any sample-based learning algorithm, and thus its learning power is equivalent to that of PAC learning; this extends prior work that achieved this result for $b=1$. Similarly, with fine enough precision relative to the sample size $m$, GD can also simulate any sample-based learning algorithm based on $m$ samples. In particular, with polynomially many bits of precision (i.e. when $\rho$ is exponentially small), SGD and GD can both simulate PAC learning regardless of the mini-batch size. On the other hand, when $b \rho^2$ is large enough, the power of SGD is equivalent to that of SQ learning.
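As an illustrative aside, the following is a minimal sketch (not from the paper) of the kind of $\rho$-precision mini-batch SGD update the abstract refers to: the gradient is averaged over a mini-batch of $b$ samples and each coordinate is then rounded to the nearest multiple of $\rho$ before the step is taken. The helpers `grad_fn` and `sample_batch`, and the specific rounding rule, are assumptions made purely for illustration.

```python
import numpy as np

def rho_precision_sgd(grad_fn, sample_batch, w0, b, rho, lr=0.1, steps=1000):
    """Mini-batch SGD where the averaged gradient is quantized to precision rho.

    grad_fn(w, z)   -- per-example gradient at parameters w (illustrative assumption)
    sample_batch(b) -- draws a fresh mini-batch of b examples (illustrative assumption)
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        batch = sample_batch(b)
        # Average the per-example gradients over the mini-batch of size b.
        g = np.mean([grad_fn(w, z) for z in batch], axis=0)
        # Round each coordinate to the nearest multiple of rho,
        # modelling gradient access at precision rho.
        g = rho * np.round(g / rho)
        w = w - lr * g
    return w

if __name__ == "__main__":
    # Toy usage: least-squares regression on synthetic data.
    rng = np.random.default_rng(0)
    w_true = np.array([1.0, -2.0])

    def sample_batch(b):
        X = rng.normal(size=(b, 2))
        y = X @ w_true + 0.01 * rng.normal(size=b)
        return list(zip(X, y))

    def grad_fn(w, z):
        x, y = z
        return 2.0 * (x @ w - y) * x  # gradient of the squared loss

    w_hat = rho_precision_sgd(grad_fn, sample_batch, w0=np.zeros(2), b=8, rho=1e-4)
    print(w_hat)
```

When $\rho$ is tiny relative to the mini-batch size, the rounding step discards almost no information about the individual samples, which is the regime in which SGD can emulate sample-based (PAC) learning; coarser rounding collapses the update toward a statistical-query-style estimate of the population gradient.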