There has been a recent explosion of computer vision models which perform many tasks and are composed of an image encoder (usually a ViT) and an autoregressive decoder (usually a Transformer). However, most of this work simply presents one system and its results, leaving many questions regarding design decisions and trade-offs of such systems unanswered. In this work, we aim to provide such answers. We take a close look at autoregressive decoders for multi-task learning in multimodal computer vision, including classification, captioning, visual question answering, and optical character recognition. Through extensive systematic experiments, we study the effects of task and data mixture, training and regularization hyperparameters, conditioning type and specificity, modality combination, and more. Importantly, we compare these to well-tuned single-task baselines to highlight the cost incurred by multi-tasking. A key finding is that a small decoder learned on top of a frozen pretrained encoder works surprisingly well. We call this setup locked-image tuning with decoder (LiT-decoder). It can be seen as teaching a decoder to interact with a pretrained vision model via natural language.
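For concreteness, the LiT-decoder setup described above can be sketched as a frozen image encoder whose output tokens a small autoregressive Transformer decoder cross-attends to. The following is a minimal, illustrative PyTorch sketch, not the paper's implementation; names such as LiTDecoder, d_model, and lm_head are assumptions for illustration, and the encoder's token width is assumed to already match the decoder width.

    import torch
    import torch.nn as nn

    class LiTDecoder(nn.Module):
        """Illustrative sketch (not the paper's code): a small autoregressive
        Transformer decoder on top of a frozen, pretrained image encoder."""

        def __init__(self, image_encoder, vocab_size=32_000, d_model=512,
                     n_heads=8, n_layers=6):
            super().__init__()
            # "Locked-image": the pretrained encoder (e.g. a ViT) is frozen.
            self.encoder = image_encoder
            for p in self.encoder.parameters():
                p.requires_grad_(False)
            # Small decoder; a real model would also add positional embeddings.
            self.embed = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, n_layers)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, images, text_tokens):
            # No gradients flow into the encoder; only the decoder is trained.
            with torch.no_grad():
                memory = self.encoder(images)   # (B, num_patches, d_model) assumed
            x = self.embed(text_tokens)         # (B, seq_len, d_model)
            causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
            x = self.decoder(x, memory, tgt_mask=causal)
            return self.lm_head(x)              # next-token logits

Trained with a standard next-token cross-entropy loss, such a decoder learns to produce task outputs (labels, captions, answers) as text, which is one way to read the paper's framing of "teaching a decoder to interact with a pretrained vision model via natural language."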