One unexpected technique that emerged in recent years consists of training a Deep Network (DN) with a Self-Supervised Learning (SSL) method, and then using this network on downstream tasks with its last few layers entirely removed. This usually skimmed-over trick is actually critical for SSL methods to achieve competitive performance. For example, on ImageNet classification, more than 30 percentage points can be gained this way. This is a little vexing, as one would hope that the network layer at which invariance is explicitly enforced by the SSL criterion during training (the last layer) should be the one to use for best generalization performance downstream. But it seems not to be, and this study sheds some light on why. This trick, which we name Guillotine Regularization (GR), is in fact a generically applicable form of regularization that has also been used to improve generalization performance in transfer learning scenarios. In this work, through theory and experiments, we formalize GR and identify the underlying reasons behind its success in SSL methods. Our study shows that this trick is essential to SSL performance for two main reasons: (i) improper data augmentations used to define the positive pairs during training, and/or (ii) suboptimal selection of the hyper-parameters of the SSL loss.
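To make the trick concrete, the following is a minimal sketch of how Guillotine Regularization looks in a typical SSL pipeline, assuming a SimCLR-style backbone-plus-projector architecture; the names (`Encoder`, `projector`, `embed_dim`, the MLP sizes) are illustrative, not the paper's reference implementation.

```python
# Minimal sketch of Guillotine Regularization (GR) in a SimCLR-style setup.
# Assumption: the SSL loss is applied to the projector output z, while the
# backbone representation h is what gets reused downstream.
import torch
import torch.nn as nn
import torchvision


class Encoder(nn.Module):
    """Backbone + projector head; the SSL criterion sees only the projector output."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.backbone = torchvision.models.resnet50(weights=None)
        feat_dim = self.backbone.fc.in_features  # 2048 for ResNet-50
        self.backbone.fc = nn.Identity()         # expose raw backbone features
        # These are the "last few layers" that GR removes after pretraining:
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, 2048),
            nn.ReLU(inplace=True),
            nn.Linear(2048, embed_dim),
        )

    def forward(self, x: torch.Tensor):
        h = self.backbone(x)   # representation kept for downstream tasks
        z = self.projector(h)  # embedding where SSL invariance is enforced
        return h, z


model = Encoder()
x = torch.randn(4, 3, 224, 224)
h, z = model(x)

# During pretraining, the SSL loss (e.g., InfoNCE) is computed on z.
# Guillotine Regularization: for the downstream task, discard the projector
# and fit a probe on h rather than z.
linear_probe = nn.Linear(h.shape[1], 1000)  # e.g., ImageNet's 1000 classes
logits = linear_probe(h.detach())
```

The design point the abstract highlights is exactly this split: invariance is enforced at `z`, yet probing `h` (the guillotined representation) is what yields the large downstream gains.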