Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities. We present i-Code, a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations. In this framework, data from each modality are first given to pretrained single-modality encoders. The encoder outputs are then integrated with a multimodal fusion network, which uses novel attention mechanisms and other architectural innovations to effectively combine information from the different modalities. The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning. Unlike previous research using only video for pretraining, the i-Code framework can dynamically process single, dual, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space. Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11% and demonstrating the power of integrative multimodal pretraining.
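To make the described architecture concrete, below is a minimal, hypothetical sketch of the overall flow the abstract outlines: pretrained single-modality encoders feeding a shared fusion stack, plus an InfoNCE-style pairwise loss illustrating cross-modality contrastive learning. All names (`ICodeSketch`, `FusionBlock`, `pairwise_contrastive_loss`), the hidden size, and the use of plain self-attention are assumptions for illustration, not the authors' implementation; the paper's actual fusion network uses its own attention mechanisms, and masked modality unit modeling is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionBlock(nn.Module):
    """One transformer-style fusion layer: tokens from all supplied modalities
    attend to one another (a stand-in for the paper's fusion network)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        attended, _ = self.attn(x, x, x)
        x = self.norm1(x + attended)
        return self.norm2(x + self.ff(x))


class ICodeSketch(nn.Module):
    """Hypothetical skeleton: pretrained single-modality encoders feed a shared
    fusion stack; any subset of {vision, speech, language} may be present."""

    def __init__(self, encoders: dict, dim: int = 768, layers: int = 2):
        super().__init__()
        # e.g. {"vision": ..., "speech": ..., "language": ...}; each encoder is
        # assumed to return a [batch, tokens, dim] tensor.
        self.encoders = nn.ModuleDict(encoders)
        self.fusion = nn.Sequential(*[FusionBlock(dim) for _ in range(layers)])

    def forward(self, inputs: dict) -> torch.Tensor:
        # Encode whichever modalities were supplied, concatenate along the
        # sequence axis, and let the fusion layers mix them.
        tokens = [self.encoders[name](x) for name, x in inputs.items()]
        fused = self.fusion(torch.cat(tokens, dim=1))
        # Mean-pool to one joint vector per example.
        return fused.mean(dim=1)


def pairwise_contrastive_loss(za: torch.Tensor, zb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE loss between pooled representations of two modalities,
    illustrating (not reproducing) cross-modality contrastive learning."""
    za = F.normalize(za, dim=-1)
    zb = F.normalize(zb, dim=-1)
    logits = za @ zb.t() / temperature
    labels = torch.arange(za.size(0), device=za.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```

Because the fusion stack only sees whatever token sequences are concatenated at forward time, the same module handles single-, dual-, and triple-modality inputs, mirroring the flexibility the abstract describes.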