Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for general task modeling. However, existing attempts at generalist models are inadequate in both versatility and performance. In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance. Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model. The encoded representations are transformed by a task-agnostic decoder. Different tasks are formulated as a unified maximum likelihood estimation problem. We further propose an improved optimizer to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks requiring large batch-size training. After being jointly trained on various tasks, Uni-Perceiver v2 is capable of directly handling downstream tasks without any task-specific adaptation. Results show that Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Meanwhile, compared with commonly recognized strong baselines that require task-specific fine-tuning, Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
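The abstract does not spell out the unmixed sampling strategy. One plausible reading, taken here as an assumption, is that each training iteration draws a single task and fills the entire batch from it, so that tasks needing large batches get the full batch size. The sketch below only illustrates that scheduling idea; the task names, weights, and helper functions are hypothetical and are not taken from the paper or its code.

```python
import random

def unmixed_task_schedule(task_weights, num_iters, seed=0):
    """Pick exactly one task per iteration, in proportion to its weight.

    Assumption: "unmixed" means a batch never combines samples
    from different tasks.
    """
    rng = random.Random(seed)
    names = list(task_weights)
    weights = [task_weights[name] for name in names]
    return [rng.choices(names, weights=weights, k=1)[0] for _ in range(num_iters)]

def make_batch(task, batch_size):
    """Hypothetical stand-in for a per-task data loader."""
    return [(task, index) for index in range(batch_size)]

# Hypothetical tasks and sampling weights, for illustration only.
task_weights = {"detection": 2.0, "classification": 1.0, "retrieval": 1.0}

schedule = unmixed_task_schedule(task_weights, num_iters=8)
batches = [make_batch(task, batch_size=4) for task in schedule]

for step, batch in enumerate(batches):
    tasks_in_batch = {task for task, _ in batch}
    print(step, sorted(tasks_in_batch))  # always a single task per step
```

A mixed strategy would instead split every batch across tasks, leaving each task with only a fraction of the batch size per iteration.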

