Borrowing from the transformer models that revolutionized the field of natural language processing, self-supervised feature learning for visual tasks has also achieved state-of-the-art results using these extremely deep, isotropic networks. However, the typical AI researcher does not have the resources to evaluate, let alone train, a model with several billion parameters and quadratic self-attention activations. To facilitate further research, it is therefore necessary to identify which properties of these huge transformer models can be adequately studied at a scale accessible to the typical researcher. One interesting characteristic of transformer models is that they remove most of the inductive biases present in classical convolutional networks. In this work, we analyze the effect of these and other inductive biases on small to moderately sized isotropic networks used for self-supervised visual feature learning, and show that their removal is not always ideal.