Previous multi-task dense prediction studies developed complex pipelines, such as multi-stage multi-modal distillation or searching for task-relational contexts for each task. The core insight behind these methods is to maximize the mutual effects among tasks. Inspired by recent query-based Transformers, we propose a simple pipeline named Multi-Query Transformer (MQTransformer), which is equipped with multiple queries from different tasks to facilitate reasoning among tasks and to simplify the cross-task interaction pipeline. Instead of modeling the dense per-pixel context among different tasks, we seek a task-specific proxy that performs cross-task reasoning via multiple queries, where each query encodes the task-related context. The MQTransformer is composed of three key components: a shared encoder, a cross-task query attention module, and a shared decoder. First, we model each task with a task-relevant query; both the task-specific feature output by the feature extractor and the task-relevant query are fed into the shared encoder, so that the task-relevant query is encoded from the task-specific feature. Second, we design a cross-task query attention module to reason about the dependencies among the task-relevant queries, which lets the module focus only on query-level interaction. Finally, we use a shared decoder to gradually refine the image features with the reasoned query features from different tasks. Extensive experimental results on two dense prediction datasets (NYUD-v2 and PASCAL-Context) show that the proposed method is effective and achieves state-of-the-art results. Code and models are available at https://github.com/yangyangxu0/MQTransformer.
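The three-stage data flow described in the abstract (shared encoder, cross-task query attention, shared decoder) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `attend` and `mq_forward`, the task names, and the tensor sizes are hypothetical, and the attention used here is a single-head, projection-free simplification with a residual connection. In particular, concatenating all task queries for joint self-attention, and letting each task's features attend only to its own reasoned queries in the decoder, are assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Single-head scaled dot-product attention with a residual connection.

    Simplification (assumption): no learned projections, no layer norm.
    `queries` is a list of vectors; `memory` serves as both keys and values.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in memory]
        weights = softmax(scores)
        context = [sum(w * v[d] for w, v in zip(weights, memory)) for d in range(dim)]
        out.append([qi + ci for qi, ci in zip(q, context)])
    return out


def mq_forward(task_features, task_queries):
    """Hypothetical forward pass mirroring the three stages in the abstract."""
    tasks = list(task_features)

    # 1) Shared encoder: each task's queries read from that task's features.
    encoded = {t: attend(task_queries[t], task_features[t]) for t in tasks}

    # 2) Cross-task query attention: interaction happens only among queries,
    #    never among dense per-pixel features (joint self-attention is assumed).
    all_queries = [q for t in tasks for q in encoded[t]]
    reasoned_all = attend(all_queries, all_queries)
    reasoned, start = {}, 0
    for t in tasks:
        count = len(encoded[t])
        reasoned[t] = reasoned_all[start:start + count]
        start += count

    # 3) Shared decoder: per-pixel features are refined by the reasoned queries.
    return {t: attend(task_features[t], reasoned[t]) for t in tasks}


if __name__ == "__main__":
    random.seed(0)
    num_pixels, num_queries, dim = 6, 2, 4  # illustrative sizes only
    tasks = ["depth", "semseg"]             # hypothetical task names
    feats = {t: [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_pixels)]
             for t in tasks}
    queries = {t: [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]
               for t in tasks}
    refined = mq_forward(feats, queries)
    print({t: (len(refined[t]), len(refined[t][0])) for t in tasks})
```

The point of the sketch is the cost structure: with M queries per task and T tasks, the cross-task step operates on T·M vectors rather than on every pixel of every task, which is the simplification the abstract attributes to query-level interaction.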