Deep neural networks (DNNs) are powerful tools for compressing and distilling information. Their scale and complexity, often involving billions of interdependent parameters, render direct microscopic analysis difficult. Under such circumstances, a common strategy is to identify slow variables that average out the erratic behavior of the fast microscopic variables. Here, we identify a similar separation of scales occurring in fully trained, finitely over-parameterized deep convolutional neural networks (CNNs) and fully connected networks (FCNs). Specifically, we show that DNN layers couple only through the second moment (kernels) of their activations and pre-activations. Moreover, the latter fluctuate in a nearly Gaussian manner. For infinite-width DNNs these kernels are inert, while for finite-width ones they adapt to the data and yield a tractable data-aware Gaussian Process. The resulting thermodynamic theory of deep learning yields accurate predictions in various settings. In addition, it provides new ways of analyzing and understanding DNNs in general.
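As a minimal sketch of the objects referred to above (the notation here is illustrative rather than taken from the paper), the kernel of layer $l$ is the second moment of its pre-activations over a pair of inputs $x_\mu, x_\nu$,
\begin{equation}
K^{(l)}_{\mu\nu} \;=\; \mathbb{E}\!\left[ h^{(l)}(x_\mu)\, h^{(l)}(x_\nu) \right],
\end{equation}
and in the infinite-width limit these kernels follow the standard architecture-determined recursion
\begin{equation}
K^{(l+1)}_{\mu\nu} \;=\; \sigma_b^2 \;+\; \sigma_w^2\, \mathbb{E}_{h \sim \mathcal{N}\left(0,\, K^{(l)}\right)}\!\left[ \phi(h_\mu)\, \phi(h_\nu) \right],
\end{equation}
independent of the training data. The claim above is that at finite width the kernels instead adapt to the data, with the network outputs still described by a Gaussian Process governed by these adapted kernels.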