Visual tasks vary widely in their output formats and the contents they concern, which makes it hard to process them with an identical structure. One main obstacle lies in the high-dimensional outputs of object-level visual tasks. In this paper, we propose an object-centric vision framework, Obj2Seq. Obj2Seq takes objects as basic units and regards most object-level visual tasks as sequence-generation problems over objects. These visual tasks can therefore be decoupled into two steps: first, recognize objects of given categories; then, generate a sequence for each of these objects. The definition of the output sequence varies across tasks, and the model is supervised by matching these sequences with ground-truth targets. Obj2Seq can flexibly determine the input categories to satisfy customized requirements, and can be easily extended to different visual tasks. On MS COCO, Obj2Seq achieves 45.7% AP on object detection, 89.0% AP on multi-label classification, and 65.0% AP on human pose estimation. These results demonstrate its potential for general application to different visual tasks. Code is available at: https://github.com/CASIA-IVA-Lab/Obj2Seq.
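The two-step decoupling can be pictured as follows. This is a minimal sketch only, assuming a category-conditioned query decoder and a per-object sequence head; the class names (`Obj2SeqSketch`), the module choices, and the tensor shapes are illustrative assumptions, not the actual implementation in the linked repository.

```python
import torch
import torch.nn as nn

class Obj2SeqSketch(nn.Module):
    """Schematic of the two-step pipeline: (1) recognize objects of the
    requested categories, (2) generate an output sequence per object."""

    def __init__(self, num_classes: int, seq_len: int, dim: int = 256):
        super().__init__()
        self.class_embed = nn.Embedding(num_classes, dim)   # category prompts
        self.decoder = nn.TransformerDecoderLayer(d_model=dim, nhead=8)
        self.seq_head = nn.Linear(dim, seq_len)             # per-object output sequence

    def forward(self, image_tokens: torch.Tensor, category_ids: torch.Tensor):
        # image_tokens: (num_tokens, batch, dim) features from a backbone
        # category_ids: (num_queries, batch) categories requested by the user
        queries = self.class_embed(category_ids)            # class-aware object queries
        objects = self.decoder(queries, image_tokens)       # step 1: recognize objects
        return self.seq_head(objects)                       # step 2: one sequence each


# Usage: request objects of (hypothetical) categories 0 and 3; each yields
# one output sequence, e.g. 4 box coordinates for detection.
model = Obj2SeqSketch(num_classes=80, seq_len=4)
tokens = torch.randn(196, 1, 256)                           # e.g. a 14x14 feature map
seqs = model(tokens, torch.tensor([[0], [3]]))              # -> shape (2, 1, 4)
```

Because the input categories are explicit embeddings rather than a fixed classifier head, switching tasks amounts to redefining the output sequence (boxes, keypoints, or class scores) while the recognition step stays unchanged.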