Although scaling laws and many empirical results suggest that increasing the size of Vision Transformers often improves performance, accuracy does not always increase monotonically with scale, and training behavior can degrade. Focusing on ViT-B/16 trained on ImageNet-1K, we study two simple parameter-reduction strategies applied to the MLP blocks, each removing 32.7\% of the baseline parameters. Our \emph{GroupedMLP} variant shares MLP weights between adjacent transformer blocks and achieves 81.47\% top-1 accuracy while maintaining the baseline computational cost. Our \emph{ShallowMLP} variant halves the MLP hidden dimension and reaches 81.25\% top-1 accuracy with a 38\% increase in inference throughput. Both models outperform the 86.6M-parameter baseline (81.05\%) and exhibit substantially improved training stability, reducing peak-to-final accuracy degradation from 0.47\% to 0.03\%--0.06\%. These results suggest that, for ViT-B/16 on ImageNet-1K with a standard training recipe, the model operates in an overparameterized regime in which reducing MLP capacity does not harm performance and can even slightly improve it. More broadly, our findings suggest that architectural constraints such as parameter sharing and reduced width can act as useful inductive biases, and they highlight the importance of how parameters are allocated when designing Vision Transformers. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/parameter-efficient-vit-mlps.
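To make the two variants concrete, the sketch below illustrates the underlying ideas in PyTorch; it is an assumption-laden illustration, not the released implementation (see the repository above), and the names \texttt{MLP}, \texttt{GroupedBlockPair}, and \texttt{hidden\_ratio} are hypothetical. It assumes the standard ViT-B/16 layout, where each MLP expands the 768-dimensional embedding by a factor of 4, so halving the hidden width (ShallowMLP) or reusing one MLP across two adjacent blocks (GroupedMLP) each removes a comparable fraction of the MLP parameters while GroupedMLP leaves FLOPs unchanged.

\begin{verbatim}
# Hypothetical sketch of the two MLP parameter-reduction strategies
# described in the abstract; module names and details are illustrative.
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Standard ViT MLP block: Linear -> GELU -> Linear."""

    def __init__(self, dim: int = 768, hidden_ratio: float = 4.0):
        super().__init__()
        hidden = int(dim * hidden_ratio)
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))


# ShallowMLP idea: halve the hidden dimension (ratio 4.0 -> 2.0),
# cutting the MLP parameters of each block roughly in half.
shallow_mlp = MLP(dim=768, hidden_ratio=2.0)


class GroupedBlockPair(nn.Module):
    """GroupedMLP idea: two transformer blocks sharing one MLP.

    Each block keeps its own attention and LayerNorms; the MLP is
    instantiated once and applied in both blocks, so computation per
    token is unchanged while MLP parameters are halved across the pair.
    """

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.attn1 = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])
        self.shared_mlp = MLP(dim)  # one MLP reused by both blocks

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norms[0](x)
        x = x + self.attn1(h, h, h, need_weights=False)[0]
        x = x + self.shared_mlp(self.norms[1](x))
        h = self.norms[2](x)
        x = x + self.attn2(h, h, h, need_weights=False)[0]
        x = x + self.shared_mlp(self.norms[3](x))
        return x


if __name__ == "__main__":
    tokens = torch.randn(2, 197, 768)  # (batch, 196 patches + CLS, dim)
    print(GroupedBlockPair()(tokens).shape)  # torch.Size([2, 197, 768])
\end{verbatim}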