We perform an empirical study of the behaviour of deep networks when their activation functions are pushed to become fully linear in some of their feature channels through a sparsity prior on the overall number of nonlinear units in the network. To measure the depth of the resulting partially linearized network, we compute the average number of active nonlinearities encountered along a path in the network graph. In experiments on CNNs with sparsified PReLUs on typical image classification tasks, we make several observations: under sparsity pressure, the remaining nonlinear units organize into distinct structures, forming core networks of nearly constant effective depth and width, which in turn depend on task difficulty. We consistently observe a slow decay of performance with decreasing depth until the onset of a rapid collapse in accuracy, allowing for surprisingly shallow networks at moderate losses in accuracy that outperform baseline networks of similar depth, even after the baselines' width is increased to a comparable number of parameters. In terms of training, we observe a nonlinear advantage: reducing nonlinearity after training yields better performance than reducing it before training, in line with previous findings on linearized training, but with a gap that depends on task difficulty and vanishes for easy problems.
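The following is a minimal, hypothetical sketch (not the authors' code) of the two ingredients described above: a channel-wise PReLU whose slopes are pulled toward 1 (the identity) by an L1-style sparsity penalty, and a rough effective-depth estimate for a purely sequential CNN. The class names, the exact penalty form, and the linearity threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparsePReLU(nn.PReLU):
    """PReLU with one slope per channel; a slope of exactly 1 makes the unit linear."""
    def __init__(self, num_channels):
        super().__init__(num_parameters=num_channels, init=0.25)

    def nonlinearity_penalty(self):
        # Assumed penalty form: pull each slope toward 1, i.e. toward a fully linear channel.
        return torch.abs(1.0 - self.weight).sum()

def effective_depth(model, tol=1e-3):
    """Average number of active nonlinearities along a path through a sequential network:
    sum over layers of the fraction of channels whose slope still differs from 1."""
    depth = 0.0
    for m in model.modules():
        if isinstance(m, SparsePReLU):
            depth += (torch.abs(1.0 - m.weight) > tol).float().mean().item()
    return depth

# Training-time usage (assumed): add the penalty to the task loss.
# loss = criterion(model(x), y) + lam * sum(
#     m.nonlinearity_penalty() for m in model.modules() if isinstance(m, SparsePReLU))
```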