We introduce submodel co-training, a regularization method related to co-training, self-distillation and stochastic depth. Given a neural network to be trained, for each sample we implicitly instantiate two altered networks, ``submodels'', with stochastic depth: we activate only a subset of the layers. Each network serves as a soft teacher to the other, providing a loss that complements the regular loss given by the one-hot label. Our approach, dubbed cosub, uses a single set of weights and does not involve a pre-trained external model or temporal averaging. Experimentally, we show that submodel co-training is effective for training backbones for recognition tasks such as image classification and semantic segmentation. Our approach is compatible with multiple architectures, including RegNet, ViT, PiT, XCiT, Swin and ConvNext, and improves their results in comparable settings. For instance, a ViT-B pretrained with cosub on ImageNet-21k obtains 87.4% top-1 acc. @448 on ImageNet-val.
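To make the idea concrete, below is a minimal PyTorch sketch of what a cosub training step could look like. It assumes a backbone in which stochastic depth is active at train time (here created via timm's `drop_path_rate`, an assumption for illustration), so that two forward passes of the same weights implicitly sample two different submodels. The loss weighting `LAMBDA` and the detached soft targets are illustrative choices, not necessarily the paper's exact recipe.

```python
import torch
import torch.nn.functional as F
import timm  # assumed: any backbone with stochastic depth would work

# Illustrative hyper-parameter: weight of the submodel-to-submodel soft loss.
LAMBDA = 0.5

model = timm.create_model("vit_base_patch16_224", drop_path_rate=0.1, num_classes=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def cosub_step(images, labels):
    """One submodel co-training step (sketch).

    Two forward passes with the same weights; since stochastic depth is
    stochastic in train mode, each pass activates a different random
    subset of layers, i.e. a different submodel.
    """
    model.train()
    logits_a = model(images)   # submodel A
    logits_b = model(images)   # submodel B (independent layer-drop mask)

    # Regular loss from the one-hot labels, averaged over both submodels.
    ce = 0.5 * (F.cross_entropy(logits_a, labels) + F.cross_entropy(logits_b, labels))

    # Each submodel serves as a soft teacher for the other (targets detached).
    soft_a = F.cross_entropy(logits_a, F.softmax(logits_b.detach(), dim=-1))
    soft_b = F.cross_entropy(logits_b, F.softmax(logits_a.detach(), dim=-1))
    cosub = 0.5 * (soft_a + soft_b)

    loss = (1 - LAMBDA) * ce + LAMBDA * cosub
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that a single set of weights is updated: the two submodels are never materialized separately, they only differ through the stochastic-depth masks sampled in the two forward passes.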