在机器学习中,使用基于梯度的学习方法和反向传播训练人工神经网络时,会遇到梯度消失的问题。在这种方法中,每个神经网络的权值在每次迭代训练时都得到一个与误差函数对当前权值的偏导数成比例的更新。问题是,在某些情况下,梯度会极小,有效地阻止权值的改变。在最坏的情况下,这可能会完全阻止神经网络进一步的训练。作为问题原因的一个例子,传统的激活函数,如双曲正切函数的梯度在范围(0,1),而反向传播通过链式法则计算梯度。这样做的效果是将n个这些小数字相乘来计算n层网络中“前端”层的梯度,这意味着梯度(误差信号)随着n的增加呈指数递减,而前端层的训练非常缓慢。

最新论文

Understanding the patterns of misclassified ImageNet images is particularly important, as it could guide us to design deep neural networks (DNN) that generalize better. However, the richness of ImageNet imposes difficulties for researchers to visually find any useful patterns of misclassification. Here, to help find these patterns, we propose "Superclassing ImageNet dataset". It is a subset of ImageNet which consists of 10 superclasses, each containing 7-116 related subclasses (e.g., 52 bird types, 116 dog types). By training neural networks on this dataset, we found that: (i) Misclassifications are rarely across superclasses, but mainly among subclasses within a superclass. (ii) Ensemble networks trained each only on subclasses of a given superclass perform better than the same network trained on all subclasses of all superclasses. Hence, we propose a two-stage Super-Sub framework, and demonstrate that: (i) The framework improves overall classification performance by 3.3%, by first inferring a superclass using a generalist superclass-level network, and then using a specialized network for final subclass-level classification. (ii) Although the total parameter storage cost increases to a factor N+1 for N superclasses compared to using a single network, with finetuning, delta and quantization aware training techniques this can be reduced to 0.2N+1. Another advantage of this efficient implementation is that the memory cost on the GPU during inference is equivalent to using only one network. The reason is we initiate each subclass-level network through addition of small parameter variations (deltas) to the superclass-level network. (iii) Finally, our framework promises to be more scalable and generalizable than the common alternative of simply scaling up a vanilla network in size, since very large networks often suffer from overfitting and gradient vanishing.

0
0
下载
预览
参考链接
Top
微信扫码咨询专知VIP会员