Backpropagation (BP) is the cornerstone of today's deep learning algorithms, but it is inefficient partly because of backward locking: the weights of one layer cannot be updated until the weight updates of all subsequent layers are complete. Consequently, it is challenging to apply parallel computing or a pipeline structure to update the weights in different layers simultaneously. In this paper, we introduce a novel learning structure called associated learning (AL), which modularizes the network into smaller components, each with a local objective. Because the objectives are mutually independent, AL can learn the parameters in different layers independently and simultaneously, making it feasible to apply a pipeline structure to improve the training throughput. Specifically, this pipeline structure reduces the training time complexity from O(nl), the complexity of training with BP and stochastic gradient descent (SGD), to O(n + l), where n is the number of training instances and l is the number of hidden layers. Surprisingly, even though most of the parameters in AL do not directly interact with the target variable, training deep models with this method yields accuracies comparable to those of models trained with typical BP, in which all parameters are used to predict the target variable. Consequently, given the scalability and the predictive power demonstrated in our experiments, AL deserves further study to determine better hyperparameter settings, such as activation function selection, learning rate scheduling, and weight initialization, so that we can accumulate experience with AL as we have over the years with typical BP. Additionally, perhaps our design can also inspire new network designs for deep learning. Our implementation is available at https://github.com/SamYWK/Associated_Learning.
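To make the backward-unlocking idea concrete, below is a minimal PyTorch sketch of training two network components with mutually independent local objectives. The component sizes, the per-component classification heads (head1, head2), and the cross-entropy losses are illustrative assumptions, not the paper's exact associated-learning objectives; see the repository above for the actual implementation.

```python
# Minimal sketch of per-component local objectives (illustrative only;
# AL's actual local targets differ -- see the linked repository).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two "components", each with its own local head and local objective.
f1, head1 = nn.Linear(32, 64), nn.Linear(64, 10)
f2, head2 = nn.Linear(64, 64), nn.Linear(64, 10)
opt1 = torch.optim.SGD(list(f1.parameters()) + list(head1.parameters()), lr=0.1)
opt2 = torch.optim.SGD(list(f2.parameters()) + list(head2.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 32)          # a toy mini-batch
y = torch.randint(0, 10, (16,))  # toy labels

h1 = torch.relu(f1(x))
loss1 = loss_fn(head1(h1), y)     # local objective for component 1

h2 = torch.relu(f2(h1.detach()))  # detach(): no gradient flows back to f1
loss2 = loss_fn(head2(h2), y)     # local objective for component 2

# The two losses share no parameters, so the two update steps below
# are mutually independent.
opt1.zero_grad(); loss1.backward(); opt1.step()
opt2.zero_grad(); loss2.backward(); opt2.step()
```

Because loss1 and loss2 share no parameters and no gradient flows between the components, the two update steps could in principle run on different pipeline stages at the same time: after a fill-up phase of l steps, one mini-batch finishes per step, which is the intuition behind the O(n + l) training time.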