Biological intelligence systems of animals perceive the world by integrating information from different modalities and processing it simultaneously for various tasks. In contrast, current machine learning research follows a task-specific paradigm, leading to inefficient collaboration between tasks and high marginal costs of developing perception models for new tasks. In this paper, we present a generic perception architecture named Uni-Perceiver, which processes a variety of modalities and tasks with unified modeling and shared parameters. Specifically, Uni-Perceiver encodes task inputs and targets from arbitrary modalities into a unified representation space with a modality-agnostic Transformer encoder and lightweight modality-specific tokenizers. Different perception tasks are modeled with the same formulation: finding the maximum-likelihood target for each input via the similarity of their representations. The model is pre-trained on several uni-modal and multi-modal tasks and evaluated on a variety of downstream tasks, including novel tasks that did not appear in the pre-training stage. Results show that our pre-trained model, without any tuning, achieves reasonable performance even on novel tasks. Prompt tuning on just 1% of downstream task data improves performance to a level close to state-of-the-art methods, and full-data fine-tuning further delivers results on par with or better than the state of the art. Code shall be released.
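To make the unified formulation concrete, the following is a minimal sketch of "maximum-likelihood target via representation similarity": the input and every candidate target are encoded by the same modality-agnostic encoder, and a softmax over scaled cosine similarities yields a likelihood over targets. This is an assumed illustration, not the paper's actual implementation; the toy encoder, pooling, temperature value, and function names are all hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in for the modality-agnostic Transformer encoder:
# any module mapping a token sequence to one pooled representation.
class ToyEncoder(torch.nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, tokens):            # tokens: (seq_len, dim)
        return self.proj(tokens).mean(0)  # pooled representation: (dim,)

def most_likely_target(encoder, x_tokens, candidate_targets, tau=0.07):
    """Unified task formulation (sketch): encode the input and each
    candidate target with the SAME encoder, then pick the target whose
    representation is most similar to the input's, i.e. the maximum-
    likelihood target under a softmax over scaled cosine similarities."""
    z_x = F.normalize(encoder(x_tokens), dim=-1)                   # (dim,)
    z_y = F.normalize(
        torch.stack([encoder(y) for y in candidate_targets]), dim=-1
    )                                                              # (n, dim)
    probs = (z_y @ z_x / tau).softmax(dim=0)                       # p(target | input)
    return probs.argmax().item()

# Usage: e.g. image classification, where candidate targets would be
# tokenized class names (random tensors here stand in for tokenizer outputs).
enc = ToyEncoder()
x = torch.randn(16, 64)                            # tokenized image (hypothetical)
targets = [torch.randn(4, 64) for _ in range(10)]  # tokenized class names
print(most_likely_target(enc, x, targets))
```

Under this view, changing the task only changes what the candidate targets are (class names, captions, masked tokens), which is what lets one set of shared parameters serve many tasks.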