While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.
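To make the self-distillation objective concrete, here is a minimal sketch in PyTorch, under stated assumptions: `TinyTransformerEncoder` is a stand-in for the paper's Transformer backbone, masked positions are zeroed rather than replaced by a learned mask embedding, and the values of `mask_prob`, `top_k`, and the EMA decay `tau` are illustrative, not the paper's settings. The actual data2vec also uses modality-specific feature encoders and target normalization, which are omitted here.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTransformerEncoder(nn.Module):
    """Stand-in encoder that returns the hidden states of every layer."""
    def __init__(self, dim=64, depth=4, heads=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)
        ])

    def forward(self, x):
        hidden = []
        for layer in self.layers:
            x = layer(x)
            hidden.append(x)
        return hidden  # list of (B, T, dim) tensors, one per layer


def self_distillation_step(student, teacher, x, mask_prob=0.15, top_k=2, tau=0.999):
    """One data2vec-style step: the student predicts teacher targets at masked positions."""
    B, T, D = x.shape
    mask = torch.rand(B, T) < mask_prob  # boolean mask over time steps

    # Teacher encodes the FULL input; targets average the top-K layer outputs,
    # so they are contextualized and carry information from the entire input.
    with torch.no_grad():
        teacher_hidden = teacher(x)
        target = torch.stack(teacher_hidden[-top_k:]).mean(0)

    # Student sees only the masked view of the same input.
    x_masked = x.clone()
    x_masked[mask] = 0.0  # simple stand-in for a learned [MASK] embedding
    student_out = student(x_masked)[-1]

    # Regression loss on masked positions only (the paper uses a smooth L1 loss).
    loss = F.smooth_l1_loss(student_out[mask], target[mask])

    # The teacher tracks the student via an exponential moving average of its weights.
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(tau).add_(p_s, alpha=1 - tau)
    return loss


student = TinyTransformerEncoder()
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

x = torch.randn(2, 16, 64)  # a batch of already-embedded inputs (any modality)
loss = self_distillation_step(student, teacher, x)
loss.backward()
```

Because the targets are latent representations rather than words, visual tokens, or speech units, the same loss applies unchanged to any modality; only the input embedding step would differ.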