Human-object interaction (HOI) detection tackles the joint localization and classification of human-object interactions. Existing HOI transformers either adopt a single decoder for triplet prediction or use two parallel decoders to detect individual objects and interactions separately, composing triplets through a matching process. In contrast, we decouple triplet prediction into human-object pair detection and interaction classification. Our main motivation is that accurately detecting human-object instances and accurately classifying interactions require learning representations that focus on different regions. To this end, we present Disentangled Transformer, in which both the encoder and the decoder are disentangled to facilitate learning of the two sub-tasks. To associate the predictions of the disentangled decoders, we first generate a unified representation for HOI triplets with a base decoder and then use it as the input feature of each disentangled decoder. Extensive experiments show that our method outperforms prior work on two public HOI benchmarks by a sizeable margin. Code will be available.
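As a rough illustration of the data flow described above, the following toy sketch wires a base decoder to two disentangled decoders. It is not the released implementation: the attention has no learned weights, and all names (`cross_attention`, `disentangled_forward`, `shared_memory`, `instance_memory`, `interaction_memory`) are hypothetical. It also assumes that the base decoder reads a shared encoder output while each disentangled decoder reads its own task-specific encoder output, which the abstract does not specify.

```python
import math
import random


def cross_attention(queries, memory):
    """Toy single-head dot-product attention with a residual add (no learned weights)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        attended = [
            sum(w * m[i] for w, m in zip(weights, memory)) / total for i in range(dim)
        ]
        outputs.append([qi + ai for qi, ai in zip(q, attended)])
    return outputs


def random_tokens(rng, count, dim):
    """Stand-in for encoder outputs or query embeddings (hypothetical data)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


def disentangled_forward(shared_memory, instance_memory, interaction_memory, hoi_queries):
    # Base decoder: one unified representation per candidate HOI triplet.
    unified = cross_attention(hoi_queries, shared_memory)
    # Both disentangled decoders take the same unified representation as input,
    # so slot i of each output refers to the same triplet and no matching step is needed.
    instance_feats = cross_attention(unified, instance_memory)        # human-object pair detection
    interaction_feats = cross_attention(unified, interaction_memory)  # interaction classification
    return unified, instance_feats, interaction_feats


rng = random.Random(0)
dim, num_tokens, num_queries = 8, 16, 4
shared_memory = random_tokens(rng, num_tokens, dim)       # assumed shared encoder output
instance_memory = random_tokens(rng, num_tokens, dim)     # assumed instance-encoder output
interaction_memory = random_tokens(rng, num_tokens, dim)  # assumed interaction-encoder output
hoi_queries = random_tokens(rng, num_queries, dim)

unified, instance_feats, interaction_feats = disentangled_forward(
    shared_memory, instance_memory, interaction_memory, hoi_queries
)
```

Because both branches are indexed by the same query slots, the pair-detection output and the interaction output for slot `i` can be combined directly into a triplet prediction.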