Transformers have gained increasing popularity in a wide range of applications, including Natural Language Processing (NLP), Computer Vision, and Speech Recognition, because of their powerful representational capacity. However, harnessing this representational capacity effectively requires a large amount of data, strong regularization, or both, to mitigate overfitting. Recently, the power of the Transformer has been unlocked by self-supervised pretraining strategies based on masked autoencoders, which rely on reconstructing masked inputs, directly or contrastively, from unmasked content. This pretraining strategy, used in BERT models in NLP, Wav2Vec models in Speech, and, recently, MAE models in Vision, forces the model to learn about relationships between the content in different parts of the input using autoencoding-related objectives. In this paper, we propose a novel, but surprisingly simple, alternative to content reconstruction: predicting locations from content, without providing positional information for it. Doing so requires the Transformer to understand the positional relationships between different parts of the input from their content alone. This amounts to an efficient implementation where the pretext task is a classification problem among all possible positions for each input token. We experiment on both Vision and Speech benchmarks, where our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods. Our method also enables Transformers trained without position embeddings to outperform ones trained with full position information.
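To make the pretext task concrete, below is a minimal sketch, in PyTorch, of position prediction framed as an N-way classification over all possible positions for each input token. This is not the authors' implementation: the module name `PositionPredictor`, the hyperparameters, and the patch-grid dimensions are illustrative assumptions, and only the high-level idea (content tokens fed without position embeddings, a per-token head classifying their true positions) follows the description above.

```python
# Sketch of the position-prediction pretext task (assumed structure, not the
# paper's code): a Transformer encoder sees content embeddings WITHOUT any
# position embeddings, and a linear head classifies, for each token, which of
# the N possible positions it came from.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PositionPredictor(nn.Module):
    def __init__(self, embed_dim=256, num_positions=196, depth=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Pretext head: one logit per candidate position for each token.
        self.position_head = nn.Linear(embed_dim, num_positions)

    def forward(self, tokens):
        # tokens: (batch, num_tokens, embed_dim) content embeddings only;
        # no position embeddings are added anywhere.
        features = self.encoder(tokens)
        return self.position_head(features)  # (batch, num_tokens, num_positions)


def position_prediction_loss(logits, targets):
    # logits: (batch, num_tokens, num_positions); targets: (batch, num_tokens)
    # holding each token's true position index. Standard cross-entropy makes
    # this an N-way classification problem per token.
    num_positions = logits.shape[-1]
    return F.cross_entropy(logits.reshape(-1, num_positions), targets.reshape(-1))


if __name__ == "__main__":
    model = PositionPredictor()
    patches = torch.randn(2, 196, 256)            # e.g. a 14x14 grid of patch embeddings
    targets = torch.arange(196).expand(2, 196)    # each patch's true grid index
    loss = position_prediction_loss(model(patches), targets)
    loss.backward()
    print(float(loss))
```

Because no positional information is provided to the encoder, the model is permutation-equivariant in its inputs, so it can only solve this classification task by inferring each token's location from content relationships alone, which is the point of the pretext objective.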