We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) better or competitive finetuned performance compared to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing). Code, demo, video, and visualization are available at https://x-decoder-vl.github.io.