While adversarial training has been extensively studied for ResNet architectures and low-resolution datasets like CIFAR, much less is known for ImageNet. Given the recent debate about whether transformers are more robust than convnets, we revisit adversarial training on ImageNet, comparing ViTs and ConvNeXts. Extensive experiments show that minor changes in architecture, most notably replacing the PatchStem with a ConvStem, and in the training scheme have a significant impact on the achieved robustness. These changes not only increase robustness in the seen $\ell_\infty$ threat model, but even more so improve generalization to the unseen $\ell_1/\ell_2$ threat models. Our modified ConvNeXt, ConvNeXt + ConvStem, yields the most robust models across different ranges of model parameters and FLOPs.
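The PatchStem-to-ConvStem swap can be illustrated with a minimal PyTorch sketch. This is an illustration under stated assumptions, not the exact stem used in the paper: the module names `PatchStem` and `ConvStem` and the ConvNeXt-T-like configuration (embedding dimension 96, overall 4x downsampling) are chosen here for concreteness, and the actual ConvStem may differ in depth and normalization. The core idea is replacing the single large-stride patchify convolution with a short stack of overlapping stride-2 3x3 convolutions that reach the same output resolution.

```python
import torch
import torch.nn as nn


class PatchStem(nn.Module):
    """Standard patchify stem: a single strided convolution that splits
    the image into non-overlapping patches (stride = kernel = patch size)."""

    def __init__(self, in_ch=3, embed_dim=96, patch_size=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        return self.proj(x)


class ConvStem(nn.Module):
    """Convolutional stem (illustrative): a stack of overlapping stride-2
    3x3 convolutions with normalization and non-linearity, achieving the
    same overall 4x downsampling as the patchify stem above."""

    def __init__(self, in_ch=3, embed_dim=96):
        super().__init__()
        mid = embed_dim // 2
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, mid, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(mid),
            nn.GELU(),
            nn.Conv2d(mid, embed_dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):
        return self.stem(x)


# Both stems map a 224x224 image to a 56x56 feature map, so the rest
# of the network (ViT or ConvNeXt) is unchanged by the swap:
x = torch.randn(1, 3, 224, 224)
assert PatchStem()(x).shape == ConvStem()(x).shape == (1, 96, 56, 56)
```

Intuitively, the overlapping stride-2 convolutions yield a smoother mapping from pixels to features than non-overlapping patchification, which is one plausible reason such a drop-in change affects the achieved robustness.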