Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for general task modeling. However, existing attempts at generalist models are inadequate in both versatility and performance. In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance. Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model. The encoded representations are transformed by a task-agnostic decoder. Different tasks are formulated as a unified maximum likelihood estimation problem. We further propose an improved optimizer to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks requiring large batch-size training. After being jointly trained on various tasks, Uni-Perceiver v2 is capable of directly handling downstream tasks without any task-specific adaptation. Results show that Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Meanwhile, compared with commonly recognized strong baselines that require task-specific fine-tuning, Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
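The abstract does not spell out the unified maximum likelihood objective or the unmixed sampling strategy, so the following is only a minimal sketch under stated assumptions. It assumes the likelihood of a target is a softmax over cosine similarities between an encoded input and a set of encoded candidate targets, and it reads "unmixed" as meaning that each training iteration draws its whole batch from a single task. The function names, the `temperature` parameter, and the task names are hypothetical and are not taken from the paper.

```python
import math
import random


def log_likelihood(query, candidates, target_index, temperature=0.07):
    """Log-probability of one candidate under a softmax over cosine similarities.

    Assumption: the unified objective scores every candidate target against
    the encoded input; the abstract does not confirm this exact form.
    Vectors are plain lists of floats and must be non-zero.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm

    logits = [cosine(query, c) / temperature for c in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return logits[target_index] - log_z


def unmixed_task_schedule(task_weights, num_iterations, seed=0):
    """Pick exactly one task per iteration (hypothetical reading of 'unmixed').

    Every sample in an iteration's batch would then come from that one task,
    which keeps the per-task batch size large.
    """
    rng = random.Random(seed)
    names = list(task_weights)
    weights = [task_weights[name] for name in names]
    return [rng.choices(names, weights=weights, k=1)[0] for _ in range(num_iterations)]


if __name__ == "__main__":
    query = [1.0, 0.0]
    candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    # Training would minimize the negative of this value for the true target.
    print(-log_likelihood(query, candidates, target_index=0))
    # Hypothetical task names and sampling weights.
    print(unmixed_task_schedule({"classification": 2.0, "retrieval": 1.0}, 5))
```

Under this reading, classification, retrieval, and similar tasks differ only in what the candidate set contains, while the loss itself stays the same. The improved optimizer mentioned in the abstract is not sketched here because the abstract gives no details about it.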