We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning. Based on the transformer encoder-decoder architecture, our UniT model encodes each input modality with an encoder and makes predictions on each task with a shared decoder over the encoded input representations, followed by task-specific output heads. The entire model is jointly trained end-to-end with losses from each task. Compared to previous efforts on multi-task learning with transformers, we share the same model parameters across all tasks instead of separately fine-tuning task-specific models and handle a much higher variety of tasks across different domains. In our experiments, we learn 7 tasks jointly over 8 datasets, achieving strong performance on each task with significantly fewer parameters. Our code is available in MMF at https://mmf.sh.
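To make the encoder / shared-decoder / task-head layout described above concrete, here is a minimal PyTorch sketch. This is not the released MMF implementation: the class name `UniTSketch`, the per-task learned query embeddings, and all dimensions and head definitions are illustrative assumptions.

```python
import torch
import torch.nn as nn


class UniTSketch(nn.Module):
    """Toy version of the UniT layout: one encoder per input modality,
    a single decoder shared across all tasks, and task-specific heads."""

    def __init__(self, task_heads, d_model=256, nhead=8,
                 num_layers=6, num_queries=100):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        # One encoder per modality (the paper pairs a convolutional backbone
        # with a transformer for images and uses a BERT-style encoder for text;
        # plain transformer encoders stand in for both here).
        self.encoders = nn.ModuleDict({
            "vision": nn.TransformerEncoder(enc_layer, num_layers),
            "text": nn.TransformerEncoder(enc_layer, num_layers),
        })
        # Decoder whose parameters are shared by every task.
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # Learned query embeddings, one set per task (illustrative choice).
        self.task_queries = nn.ParameterDict({
            name: nn.Parameter(torch.randn(num_queries, d_model))
            for name in task_heads
        })
        # Task-specific output heads, e.g. answer classifiers for VQA or
        # box/class predictors for detection.
        self.heads = nn.ModuleDict(task_heads)

    def forward(self, features_by_modality, task):
        # Encode each modality present in the sample, then concatenate the
        # encoded sequences so the decoder can attend over all of them.
        encoded = torch.cat(
            [self.encoders[m](f) for m, f in features_by_modality.items()],
            dim=1,
        )
        queries = self.task_queries[task].unsqueeze(0).expand(
            encoded.size(0), -1, -1)
        hidden = self.decoder(queries, encoded)  # shared decoder
        return self.heads[task](hidden)          # task-specific prediction


# Hypothetical usage: two tasks, pre-extracted 256-d input features.
heads = {
    "vqa": nn.Linear(256, 3129),   # e.g. a VQA answer classifier
    "snli_ve": nn.Linear(256, 3),  # entailment / neutral / contradiction
}
model = UniTSketch(task_heads=heads)
img = torch.randn(2, 49, 256)      # 2 images, 49 visual tokens
txt = torch.randn(2, 16, 256)      # 2 questions, 16 text tokens
logits = model({"vision": img, "text": txt}, task="vqa")  # (2, 100, 3129)
```

Joint end-to-end training, as the abstract describes, would then iterate over batches drawn from the different datasets and combine the per-task losses into a single optimization objective.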