GaTector: A Unified Framework for Gaze Object Prediction

Gaze object prediction (GOP) is a newly proposed task that aims to discover the objects that humans are staring at. It has significant practical applications but still lacks a unified solution framework. An intuitive solution is to add an object detection branch to an existing gaze prediction method. However, previous gaze prediction methods usually use two different networks to extract features from the scene image and the head image, which leads to a heavy network architecture and prevents the branches from being jointly optimized. In this paper, we build a novel framework named GaTector to tackle the gaze object prediction problem in a unified way. In particular, a specific-general-specific (SGS) feature extractor is first proposed, which uses a shared backbone to extract general features for both scene and head images. To better account for the specificity of inputs and tasks, SGS introduces two input-specific blocks before the shared backbone and three task-specific blocks after it. Specifically, a novel defocus layer is designed to generate object-specific features for the object detection task without losing information or requiring extra computation. Moreover, an energy aggregation loss is introduced to guide the gaze heatmap to concentrate on the stared box. Finally, we propose a novel mDAP metric that can reveal the difference between boxes even when they share no overlapping area. Extensive experiments on the GOO dataset verify the superiority of our method in all three tracks, i.e., object detection, gaze estimation, and gaze object prediction.
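
To make the idea of an energy aggregation loss more concrete, the following is a minimal sketch of one way such a loss could be written. The abstract does not give the formula, so the formulation here (one minus the fraction of heatmap energy that falls inside the stared box) is an assumption made for illustration, as are the function name `energy_aggregation_loss`, the box convention, and the `eps` parameter. It is not the authors' implementation.

```python
from typing import List, Tuple

Heatmap = List[List[float]]        # rows of non-negative heatmap values
Box = Tuple[int, int, int, int]    # (x1, y1, x2, y2) in heatmap cells, end-exclusive


def energy_aggregation_loss(heatmap: Heatmap, box: Box, eps: float = 1e-8) -> float:
    """Return 1 - (energy inside the box / total energy).

    Assumed formulation for illustration only; the paper's exact
    definition is not stated in the abstract.
    """
    x1, y1, x2, y2 = box
    # Total energy over the whole heatmap.
    total = sum(sum(row) for row in heatmap)
    # Energy restricted to rows y1..y2-1 and columns x1..x2-1.
    inside = sum(sum(row[x1:x2]) for row in heatmap[y1:y2])
    # eps guards against division by zero for an all-zero heatmap.
    return 1.0 - inside / (total + eps)


if __name__ == "__main__":
    uniform = [[1.0] * 4 for _ in range(4)]
    peaked = [[0.0] * 4 for _ in range(4)]
    peaked[1][1] = 1.0
    stared_box = (0, 0, 2, 2)
    # A heatmap spread over the whole image is penalised more than one
    # whose energy sits inside the stared box.
    print(energy_aggregation_loss(uniform, stared_box))
    print(energy_aggregation_loss(peaked, stared_box))
```

Under this assumed definition the loss approaches 0 when all heatmap energy lies inside the stared box and approaches 1 when none does, which matches the stated goal of concentrating the gaze heatmap on that box.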
