We revisit large kernel design in modern convolutional neural networks (CNNs). Inspired by recent advances in vision transformers (ViTs), in this paper, we demonstrate that using a few large convolutional kernels instead of a stack of small kernels could be a more powerful paradigm. We suggest five guidelines, e.g., applying re-parameterized large depth-wise convolutions, to design efficient high-performance large-kernel CNNs. Following the guidelines, we propose RepLKNet, a pure CNN architecture whose kernel size is as large as 31x31, in contrast to the commonly used 3x3. RepLKNet greatly closes the performance gap between CNNs and ViTs, e.g., achieving results comparable to or better than Swin Transformer on ImageNet and a few typical downstream tasks, with lower latency. RepLKNet also shows good scalability to big data and large models, obtaining 87.8% top-1 accuracy on ImageNet and 56.0% mIoU on ADE20K, which is highly competitive among state-of-the-art models with similar sizes. Our study further reveals that, in contrast to small-kernel CNNs, large-kernel CNNs have much larger effective receptive fields and higher shape bias rather than texture bias. Code & models at https://github.com/megvii-research/RepLKNet.
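As a minimal sketch of the re-parameterization idea mentioned above (a small parallel depth-wise kernel merged into the large one at inference), the PyTorch snippet below zero-pads a small kernel to the large kernel size and fuses the two branches into a single convolution. The function name and shapes are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn.functional as F

def merge_depthwise_kernels(large_w, large_b, small_w, small_b):
    """Fuse a parallel small depth-wise branch into the large depth-wise kernel.

    large_w: (C, 1, K, K) large depth-wise weights, e.g. K = 31
    small_w: (C, 1, k, k) small depth-wise weights, e.g. k = 5 (k <= K, same parity)
    """
    K, k = large_w.shape[-1], small_w.shape[-1]
    pad = (K - k) // 2
    # Zero-pad the small kernel so both branches share the same spatial support,
    # then add the weights and biases; convolution is linear in the kernel.
    merged_w = large_w + F.pad(small_w, [pad, pad, pad, pad])
    merged_b = large_b + small_b
    return merged_w, merged_b

# Usage: the fused kernel reproduces the sum of the two branches.
C, K, k = 8, 31, 5
x = torch.randn(1, C, 56, 56)
lw, lb = torch.randn(C, 1, K, K), torch.randn(C)
sw, sb = torch.randn(C, 1, k, k), torch.randn(C)
mw, mb = merge_depthwise_kernels(lw, lb, sw, sb)
y_two_branch = (F.conv2d(x, lw, lb, padding=K // 2, groups=C)
                + F.conv2d(x, sw, sb, padding=k // 2, groups=C))
y_fused = F.conv2d(x, mw, mb, padding=K // 2, groups=C)
assert torch.allclose(y_two_branch, y_fused, atol=1e-4)
```

In practice the small-kernel branch (and any batch normalization) is only kept during training; after fusion, inference runs a single large depth-wise convolution.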