Recently, MLP-like vision models have achieved promising performance on mainstream visual recognition tasks. In contrast with vision transformers and CNNs, the success of MLP-like models shows that simple information fusion operations among tokens and channels can yield good representation power for deep recognition models. However, existing MLP-like models fuse tokens through static fusion operations, lacking adaptability to the contents of the tokens being mixed; hence such static fusion procedures are not sufficiently effective. To this end, this paper presents an efficient MLP-like network architecture, dubbed DynaMixer, that resorts to dynamic information fusion. Critically, we propose a procedure, on which the DynaMixer model relies, to dynamically generate mixing matrices by leveraging the contents of all the tokens to be mixed. To reduce the time complexity and improve the robustness, a dimensionality reduction technique and a multi-segment fusion mechanism are adopted. Our proposed DynaMixer model (97M parameters) achieves 84.3\% top-1 accuracy on the ImageNet-1K dataset without extra training data, performing favorably against the state-of-the-art vision MLP models. When the number of parameters is reduced to 26M, it still achieves 82.7\% top-1 accuracy, surpassing existing MLP-like models of similar capacity. The implementation of DynaMixer will be made available to the public.
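The content-dependent mixing step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `dynamixer_op`, the parameter names, and the single-head, single-segment simplification are all illustrative assumptions; the real model additionally applies the multi-segment fusion mechanism and mixes along both rows and columns of the token grid.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def dynamixer_op(X, W_reduce, W_gen):
    """Hypothetical sketch of dynamic token mixing.

    X:        (N, D) matrix of N tokens with D channels.
    W_reduce: (D, d) projection with d << D -- the dimensionality
              reduction that cuts the cost of generating weights.
    W_gen:    (N*d, N*N) projection that maps the reduced, flattened
              token contents to an N x N matrix of mixing logits.
    """
    N, D = X.shape
    Xr = X @ W_reduce                    # (N, d): reduced token features
    logits = Xr.reshape(-1) @ W_gen      # (N*N,): depends on ALL tokens
    M = softmax(logits.reshape(N, N))    # row-stochastic mixing matrix
    return M @ X                         # (N, D): dynamically mixed tokens
```

The key contrast with a static token-mixing MLP is that the mixing matrix `M` here is recomputed from the input tokens at every forward pass, rather than being a fixed learned parameter.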