We propose a novel one-stage Transformer-based Semantic and Spatial Refined Transformer (SSRT) to solve the Human-Object Interaction (HOI) detection task, which requires localizing humans and objects and predicting their interactions. Unlike previous Transformer-based HOI approaches, which mostly focus on improving the design of the decoder outputs for the final detection, SSRT introduces two new modules that help select the most relevant object-action pairs within an image and refine the queries' representation using rich semantic and spatial features. These enhancements lead to state-of-the-art results on the two most popular HOI benchmarks: V-COCO and HICO-DET.
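To make the query-refinement idea concrete, below is a minimal sketch, not the authors' implementation, of how decoder queries could attend to semantic and spatial support features via cross-attention. It assumes PyTorch, and the module name `QueryRefiner`, the feature dimensions, and the way support features are concatenated are all illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical sketch of refining HOI queries with semantic/spatial support
# features via cross-attention; names and shapes are assumptions, not SSRT's code.
import torch
import torch.nn as nn

class QueryRefiner(nn.Module):
    """Refine decoder queries by attending to concatenated support features."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, queries, semantic_feats, spatial_feats):
        # Concatenate semantic (e.g. label embeddings) and spatial (e.g. box-layout)
        # support tokens along the sequence dimension.
        support = torch.cat([semantic_feats, spatial_feats], dim=1)
        attended, _ = self.cross_attn(queries, support, support)
        queries = self.norm1(queries + attended)
        return self.norm2(queries + self.ffn(queries))

# Toy usage: 100 HOI queries refined by 16 semantic and 16 spatial support tokens.
refiner = QueryRefiner()
q = torch.randn(2, 100, 256)
out = refiner(q, torch.randn(2, 16, 256), torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 100, 256])
```

The design choice sketched here, injecting side information into the queries before the final detection heads, reflects the abstract's claim that SSRT refines query representations with rich semantic and spatial cues rather than only reworking the decoder outputs.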