We present a conceptually simple, flexible, and universal visual perception head for various visual tasks, e.g., classification, object detection, instance segmentation and pose estimation, and for different frameworks, such as one-stage or two-stage pipelines. Our approach effectively identifies an object in an image while simultaneously generating a high-quality bounding box, a contour-based segmentation mask, or a set of keypoints. The method, called UniHead, views different visual perception tasks as dispersible points learning via the transformer encoder architecture. Given a fixed spatial coordinate, UniHead adaptively scatters it to different spatial points and reasons about their relations with a transformer encoder. It directly outputs the final set of predictions in the form of multiple points, allowing us to perform different visual tasks in different frameworks with the same head design. We show extensive evaluations on ImageNet classification and on all three tracks of the COCO suite of challenges: object detection, instance segmentation and pose estimation. Without bells and whistles, UniHead can unify these visual tasks via a single head design and achieve performance comparable to expert models developed for each task. We hope our simple and universal UniHead will serve as a solid baseline and help promote universal visual perception research. Code and models are available at https://github.com/Sense-X/UniHead.
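The core idea above, scattering a fixed spatial coordinate into a set of dispersible points whose relations are reasoned about by a transformer encoder, can be illustrated with a minimal sketch. This is not the authors' implementation; all module names, feature sizes, and the number of points are illustrative assumptions.

```python
# A minimal sketch of dispersible points learning (assumed design, not
# the official UniHead code): a fixed anchor coordinate is adaptively
# scattered into K points, point-wise features are sampled from the
# feature map, and a transformer encoder reasons about their relations
# before emitting per-point predictions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DispersiblePointsHead(nn.Module):
    def __init__(self, feat_dim=64, num_points=9, num_outputs=2):
        super().__init__()
        # Predict 2-D offsets that scatter the anchor into K points.
        self.offset_fc = nn.Linear(feat_dim, num_points * 2)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Per-point output (e.g. a coordinate refinement or a score).
        self.pred_fc = nn.Linear(feat_dim, num_outputs)
        self.num_points = num_points

    def forward(self, feat_map, anchor_xy):
        # feat_map: (B, C, H, W); anchor_xy: (B, 2) in [-1, 1] coords.
        B, C, _, _ = feat_map.shape
        anchor_feat = F.grid_sample(
            feat_map, anchor_xy.view(B, 1, 1, 2), align_corners=False
        ).view(B, C)
        offsets = self.offset_fc(anchor_feat).view(B, self.num_points, 2)
        points = anchor_xy.unsqueeze(1) + offsets      # scattered points
        point_feats = F.grid_sample(
            feat_map, points.view(B, 1, self.num_points, 2),
            align_corners=False
        ).squeeze(2).permute(0, 2, 1)                  # (B, K, C)
        related = self.encoder(point_feats)            # relation reasoning
        return points, self.pred_fc(related)           # per-point outputs

head = DispersiblePointsHead()
feats = torch.randn(2, 64, 16, 16)
anchors = torch.zeros(2, 2)                            # image centers
pts, preds = head(feats, anchors)
print(pts.shape, preds.shape)
```

With a shared head like this, only the per-point output interpretation changes across tasks: points can parameterize a box, trace a contour mask, or directly serve as keypoints.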