Segmentation has traditionally been formulated as a complete-label pixel classification task: predict a class for each pixel from a fixed set of predefined semantic categories shared by all images or videos. Under this formulation, standard architectures inevitably run into difficulties in more realistic settings where the number of categories scales up (e.g., beyond 1k). At the same time, a typical image or video contains only a few categories, i.e., a small subset of the complete label set. Motivated by this observation, we propose to decompose segmentation into two sub-problems: (i) image-level or video-level multi-label classification and (ii) pixel-level rank-adaptive selected-label classification. Given an input image or video, our framework first performs multi-label classification over the complete label set, then sorts the labels by their class confidence scores and selects a small subset. A rank-adaptive pixel classifier then performs pixel-wise classification over only the selected labels, using a set of rank-oriented learnable temperature parameters to adjust the pixel classification scores. Our approach is conceptually general and can be used to improve various existing segmentation frameworks by adding only a lightweight multi-label classification head and a rank-adaptive pixel classifier. We demonstrate the effectiveness of our framework with competitive experimental results across four tasks: image semantic segmentation, image panoptic segmentation, video instance segmentation, and video semantic segmentation. In particular, with our RankSeg, Mask2Former gains +0.8%/+0.7%/+0.7% on the ADE20K panoptic segmentation, YouTubeVIS 2019 video instance segmentation, and VSPW video semantic segmentation benchmarks, respectively.
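The two-stage inference described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `rank_adaptive_segment`, the dot-product pixel classifier, and the choice to divide each rank's logit by its temperature are all assumptions made for illustration, and the image-level confidence scores are taken as given rather than produced by a trained multi-label head.

```python
import numpy as np

def rank_adaptive_segment(pixel_feats, class_embed, cls_scores, temperatures):
    """Hypothetical sketch of selected-label, rank-adaptive pixel classification.

    pixel_feats:  (H, W, D) per-pixel features
    class_embed:  (C, D) one embedding per class in the complete label set
    cls_scores:   (C,) image-level multi-label confidence scores
    temperatures: (k,) one positive temperature per rank (learned in training;
                  passed in as constants here)
    """
    k = temperatures.shape[0]
    # Sort the complete label set by descending confidence and keep the top k.
    order = np.argsort(-cls_scores, kind="stable")
    selected = order[:k]
    # Classify each pixel over the selected labels only: logits are (H, W, k).
    logits = pixel_feats @ class_embed[selected].T
    # Assumption: the logit of the rank-r label is divided by temperatures[r].
    logits = logits / temperatures
    # Map the winning rank back to the original class index.
    return selected[np.argmax(logits, axis=-1)], selected
```

Because the pixel classifier only scores `k` labels instead of all `C`, the per-pixel cost in this sketch depends on the size of the selected subset rather than on the size of the complete label set.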