Large-scale models have exhibited remarkable capabilities across diverse domains, including automated medical services and intelligent customer support. However, as most large models are trained on single-modality corpora, enabling them to effectively process and understand multimodal signals remains a significant challenge. Current research often focuses on designing task-specific or scenario-specific tuning strategies, which limits the scalability and versatility. To address this limitation, we propose a unified framework that concurrently handles multiple tasks and modalities. In this framework, all modalities and tasks are represented as unified tokens and trained using a single, consistent approach. To enable efficient multitask processing, we introduce a novel tuning strategy termed neural tuning, inspired by the concept of sparse distributed representation in the human brain, where only specific subsets of neurons are activated for each task. Furthermore, to advance research in multimodal and multitask learning, we present a new benchmark, MMUD, which includes samples annotated with multiple task labels spanning reasoning segmentation, referring segmentation, image captioning, and text-to-image generation. By applying neural tuning to pretrained large models on the MMUD benchmark, we demonstrate the ability to handle multiple tasks simultaneously in a streamlined and efficient manner. All models, code, and datasets will be released publicly upon publication, fostering further research and innovation in this field.
翻译:暂无翻译