Modern deep neural networks (DNNs) are typically trained with a global cross-entropy loss in a supervised, end-to-end manner: every neuron must store its outgoing weights, and training alternates between a forward pass (computation) and a top-down backward pass (learning), which is biologically implausible. Greedy layer-wise training offers an alternative that removes the need for a global cross-entropy loss and end-to-end backpropagation. By avoiding the computation of intermediate gradients and the storage of intermediate outputs, it reduces memory usage and helps mitigate issues such as vanishing or exploding gradients. However, most existing layer-wise training approaches have been evaluated only on relatively small datasets with simple deep architectures. In this paper, we first systematically analyze the training dynamics of popular convolutional neural networks (CNNs) trained by stochastic gradient descent (SGD) through an information-theoretic lens. Our findings reveal that networks converge layer-by-layer from bottom to top and that the flow of information adheres to a Markov information bottleneck principle. Building on these observations, we propose a novel layer-wise training approach based on the recently developed deterministic information bottleneck (DIB) and the matrix-based R\'enyi's $\alpha$-order entropy functional. Specifically, each layer is trained jointly with an auxiliary classifier that connects directly to the output layer, enabling the learning of minimal sufficient task-relevant representations. We empirically validate the effectiveness of our training procedure on CIFAR-10 and CIFAR-100 using modern deep CNNs, and further demonstrate its applicability to a practical traffic sign recognition task. Our approach not only outperforms existing layer-wise training baselines but also achieves performance comparable to SGD.
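For concreteness, the sketch below illustrates the matrix-based R\'enyi's $\alpha$-order entropy functional referenced above: entropies and mutual information are estimated directly from the eigenvalues of normalized kernel Gram matrices computed on a mini-batch, with no explicit density estimation. This is a minimal NumPy illustration under assumed choices (Gaussian kernel with width sigma, $\alpha = 2$, trade-off $\beta$), not the authors' released implementation; the final line sketches a DIB-style objective $H_\alpha(T) - \beta I_\alpha(T; Y)$ of the kind a layer-wise auxiliary loss could minimize.

```python
import numpy as np

def gram_matrix(x, sigma=1.0):
    """Gaussian (RBF) Gram matrix for a mini-batch x of shape (n, d)."""
    sq = np.sum(x ** 2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def renyi_entropy(k, alpha=2.0):
    """Matrix-based Renyi entropy: S_alpha(A) = 1/(1-alpha) * log2(sum_i lambda_i(A)^alpha),
    where A is the trace-normalized Gram matrix."""
    n = k.shape[0]
    d = np.sqrt(np.diag(k))
    a = k / np.outer(d, d) / n  # A_ij = K_ij / (n * sqrt(K_ii * K_jj)), so tr(A) = 1
    eigvals = np.clip(np.linalg.eigvalsh(a), 0.0, None)
    return np.log2(np.sum(eigvals ** alpha) + 1e-12) / (1.0 - alpha)

def joint_renyi_entropy(kx, ky, alpha=2.0):
    """Joint entropy from the Hadamard (element-wise) product of two Gram matrices."""
    return renyi_entropy(kx * ky, alpha)

def mutual_information(kx, ky, alpha=2.0):
    """I_alpha(X; Y) = S_alpha(X) + S_alpha(Y) - S_alpha(X, Y)."""
    return renyi_entropy(kx, alpha) + renyi_entropy(ky, alpha) - joint_renyi_entropy(kx, ky, alpha)

# Toy usage on hypothetical data: a layer's activations T and one-hot labels Y.
rng = np.random.default_rng(0)
t = rng.normal(size=(64, 32))                 # hypothetical layer activations (batch of 64)
y = np.eye(10)[rng.integers(0, 10, size=64)]  # hypothetical one-hot labels
kt, ky = gram_matrix(t), gram_matrix(y)
dib_loss = renyi_entropy(kt) - 5.0 * mutual_information(kt, ky)  # H(T) - beta * I(T; Y), beta illustrative
print(dib_loss)
```

The eigenvalues of the normalized Gram matrix sum to one and thus play the role of a probability distribution, which is why the estimator needs only pairwise kernel evaluations on the mini-batch rather than an explicit model of the activation density.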