We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text. Powered by the iterative latent cross-attention of Perceiver, our framework scales with linear complexity, in contrast to the quadratic complexity of self-attention used in many state-of-the-art transformer-based models. To further improve the efficiency of our framework, we also study applying LayerDrop to cross-attention layers and introduce a mixed-stream architecture for cross-modal retrieval. We evaluate Perceiver-VL on diverse video-text and image-text benchmarks, where it achieves the lowest GFLOPs and latency while maintaining competitive performance. In addition, we provide comprehensive analyses of various aspects of our framework, including pretraining data, scalability of latent size and input size, dropping cross-attention layers at inference to reduce latency, modality aggregation strategy, positional encoding, and weight initialization strategy. Our code and checkpoints are available at: https://github.com/zinengtang/Perceiver_VL
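To illustrate why latent cross-attention scales linearly with input length, the following PyTorch sketch shows the basic pattern: a small, fixed set of learned latents attends to a long multimodal input sequence, so cost grows with the input length times the (constant) number of latents rather than with the input length squared. This is a minimal illustration under assumed dimensions and module names, not the authors' implementation; see the released code at the URL above for the actual model.

```python
import torch
import torch.nn as nn


class LatentCrossAttention(nn.Module):
    """Minimal sketch of a Perceiver-style latent cross-attention block.

    A fixed number of latent vectors (queries) attends over the full input
    sequence (keys/values), so the attention cost is O(num_latents * seq_len),
    i.e. linear in seq_len, instead of O(seq_len^2) for full self-attention.
    """

    def __init__(self, dim=256, num_latents=128, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, inputs):
        # inputs: (batch, seq_len, dim), e.g. concatenated video and text tokens
        b = inputs.size(0)
        z = self.latents.unsqueeze(0).expand(b, -1, -1)  # (batch, num_latents, dim)
        # Latents attend to the full input: linear in seq_len
        z, _ = self.cross_attn(query=z, key=inputs, value=inputs)
        # Self-attention only among latents: cost independent of seq_len
        z, _ = self.self_attn(z, z, z)
        return z


# Usage: 4,096 input tokens are summarized into 128 latent vectors.
x = torch.randn(2, 4096, 256)
out = LatentCrossAttention()(x)
print(out.shape)  # torch.Size([2, 128, 256])
```

In the full framework this block is applied iteratively, with the latents carried across layers, which is what keeps the overall compute budget linear in the input size.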