Rapid advances in Vision Transformers (ViTs) have refreshed state-of-the-art performance on various vision tasks, overshadowing conventional CNN-based models. This has ignited several recent striking-back studies in the CNN world showing that pure CNN models can achieve performance as good as ViT models when carefully tuned. While encouraging, designing such high-performance CNN models is challenging, requiring non-trivial prior knowledge of network design. To this end, a novel framework termed Mathematical Architecture Design for Deep CNN (DeepMAD) is proposed to design high-performance CNN models in a principled way. In DeepMAD, a CNN network is modeled as an information processing system whose expressiveness and effectiveness can be analytically formulated from its structural parameters. A constrained mathematical programming (MP) problem is then proposed to optimize these structural parameters. The MP problem can be easily solved by off-the-shelf MP solvers on CPUs with a small memory footprint. In addition, DeepMAD is a pure mathematical framework: no GPU or training data is required during network design. The superiority of DeepMAD is validated on multiple large-scale computer vision benchmark datasets. Notably, on ImageNet-1k, using only conventional convolutional layers, DeepMAD achieves 0.7% and 1.5% higher top-1 accuracy than ConvNeXt and Swin at the Tiny level, and 0.8% and 0.9% higher at the Small level.
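The abstract does not give DeepMAD's actual expressiveness and effectiveness formulas, but the overall workflow it describes (optimize structural parameters under a constrained MP problem using an off-the-shelf CPU solver) can be illustrated with a toy sketch. Here the objective, the budget constraint, and the stage count `L` are all hypothetical placeholders, not the paper's formulation.

```python
# Hypothetical sketch of a DeepMAD-style constrained MP problem.
# The true expressiveness/effectiveness scores are not given in the
# abstract; a toy log-width proxy and a quadratic parameter-budget
# constraint are assumed here purely for illustration.
import numpy as np
from scipy.optimize import minimize

L = 5  # number of network stages (assumed)

def neg_expressiveness(w):
    # Toy proxy objective: total log channel width, standing in for
    # the analytical expressiveness DeepMAD derives from structure.
    return -np.sum(np.log(w))

def budget(w):
    # Inequality constraint (>= 0 means feasible): keep the summed
    # quadratic width cost under a fixed parameter budget.
    return 1e6 - np.sum(w ** 2)

res = minimize(
    neg_expressiveness,
    x0=np.full(L, 100.0),                      # initial per-stage widths
    bounds=[(16.0, 1024.0)] * L,               # width search range
    constraints=[{"type": "ineq", "fun": budget}],
    method="SLSQP",                            # off-the-shelf CPU solver
)
widths = np.round(res.x).astype(int)
print(widths)
```

With a symmetric objective and budget like this, the solver distributes the budget evenly across stages; the paper's actual MP problem would instead trade off its derived expressiveness and effectiveness terms under real model-size constraints.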