Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, we use a fast convolutional decoder, and we amortize the effort of building teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec, which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders with 16.4x lower pre-training time; on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time; and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy yields an ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
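The efficiency ideas named above (the teacher builds contextualized targets once per sample, while the student encodes only unmasked tokens across several masked versions of that sample) can be illustrated with a minimal NumPy sketch. This is not the actual data2vec 2.0 model: the teacher here is a stand-in moving average instead of a transformer, and the "decoder" is a trivial mean predictor; only the amortization structure is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, M = 16, 8, 4            # tokens, feature dim, masked versions per sample
x = rng.normal(size=(T, D))   # one sample's token embeddings (hypothetical)

def teacher_targets(tokens):
    """Stand-in for the teacher: a moving average as 'contextualization'."""
    k = np.ones(3) / 3.0
    return np.stack([np.convolve(tokens[:, d], k, mode="same")
                     for d in range(tokens.shape[1])], axis=1)

targets = teacher_targets(x)  # computed ONCE, reused for all M student passes

losses = []
for _ in range(M):
    masked = rng.choice(T, size=T // 2, replace=False)  # mask half the tokens
    keep = np.ones(T, dtype=bool)
    keep[masked] = False
    visible = x[keep]         # the student never encodes the masked tokens
    # Decoder stand-in: predict each masked target from the visible-token mean.
    pred = np.tile(visible.mean(axis=0), (len(masked), 1))
    losses.append(float(np.mean((pred - targets[masked]) ** 2)))

print(f"mean regression loss over {M} masked versions: {np.mean(losses):.3f}")
```

Because the teacher pass is reused across the M student passes, its cost per training example shrinks by roughly a factor of M, which is the amortization the abstract refers to.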