To improve how neural networks function, it is crucial to understand their learning process. The information bottleneck theory of deep learning proposes that neural networks achieve good generalization by compressing their representations to disregard information that is not relevant to the task. However, empirical evidence for this theory is conflicting, as compression was only observed when networks used saturating activation functions; networks with non-saturating activation functions achieved comparable task performance but did not show compression. In this paper we develop more robust mutual information estimation techniques that adapt to the hidden activity of neural networks and produce more sensitive measurements of activations from all activation functions, especially unbounded ones. Using these adaptive estimation techniques, we explore compression in networks with a range of different activation functions. With two improved estimation methods, we first show that saturation of the activation function is not required for compression, and that the amount of compression varies between different activation functions. We also find that compression varies considerably across different network initializations. Second, we see that L2 regularization leads to significantly increased compression while preventing overfitting. Finally, we show that only compression of the last layer is positively correlated with generalization.
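To make the idea of an adaptive mutual information estimator concrete, here is a minimal sketch of one way such an estimator could look, using quantile-based bins so that unbounded activations (e.g. ReLU) are covered as well as saturating ones. The function name, the quantile-binning scheme, and the plug-in entropy estimates are illustrative assumptions for this sketch, not the exact estimators used in the paper.

```python
import numpy as np

def mutual_information_adaptive_bins(hidden, labels, n_bins=30):
    """Estimate I(T; Y) between hidden-layer activations T and labels Y
    by discretizing each unit's activations with quantile (adaptive) bins.

    hidden: (n_samples, n_units) array of activations
    labels: (n_samples,) array of integer class labels in {0, ..., K-1}
    """
    n_samples, n_units = hidden.shape

    # Adaptive binning: bin edges follow the empirical quantiles of each
    # unit's activations, so the bins track the observed range of the
    # activations instead of assuming a fixed, bounded interval.
    digitized = np.empty((n_samples, n_units), dtype=np.int64)
    for j in range(n_units):
        edges = np.quantile(hidden[:, j], np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
        digitized[:, j] = np.digitize(hidden[:, j], np.unique(edges))

    # Treat each distinct discretized activation pattern as one state of T.
    _, t_states = np.unique(digitized, axis=0, return_inverse=True)

    def entropy(states):
        # Plug-in (empirical) entropy in bits.
        _, counts = np.unique(states, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    # I(T; Y) = H(T) + H(Y) - H(T, Y), all from empirical estimates.
    labels = np.asarray(labels, dtype=np.int64)
    joint = t_states.astype(np.int64) * (labels.max() + 1) + labels
    return entropy(t_states) + entropy(labels) - entropy(joint)
```

Tracking such an estimate of I(T; Y) (and, analogously, I(X; T)) over the course of training is the kind of measurement used to decide whether a given layer compresses its representation.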