Computer vision applications such as visual relationship detection and human-object interaction can be formulated as a composite (structured) set detection problem in which both the parts (subject, object, and predicate) and the sum (the triplet as a whole) are to be detected in a hierarchical fashion. In this paper, we present a new approach, denoted Part-and-Sum detection Transformer (PST), to perform end-to-end visual composite set detection. Unlike existing Transformers, in which queries are at a single level, we simultaneously model the joint part and sum hypotheses/interactions with composite queries and attention modules. We explicitly incorporate sum queries to enable better modeling of the part-and-sum relations that are absent in standard Transformers. Our approach also uses novel tensor-based part queries and vector-based sum queries, and models their joint interaction. We report experiments on two vision tasks, visual relationship detection and human-object interaction, and demonstrate that PST achieves state-of-the-art results among single-stage models, while nearly matching the results of custom-designed two-stage models.
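To make the idea of composite queries more concrete, the following is a minimal sketch, not the PST implementation. It assumes that part queries form a tensor of shape `(N, 3, d)` (one subject, object, and predicate slot per hypothesis) and that sum queries form a matrix of shape `(N, d)` (one vector per triplet). The joint interaction is approximated here by a single-head self-attention over all part and sum tokens together, with no learned projections. The function names, shapes, and the attention layout are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    # Single-head scaled dot-product self-attention.
    # Assumption: learned Q/K/V projections are omitted for brevity.
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def composite_attention(part_queries, sum_queries):
    # Hypothetical joint update: flatten the part tensor to (N*3, d),
    # append the N sum vectors, attend over all tokens at once, then
    # split the result back into part and sum queries.
    n, p, d = part_queries.shape
    tokens = np.concatenate(
        [part_queries.reshape(n * p, d), sum_queries], axis=0
    )
    out = self_attention(tokens)
    return out[: n * p].reshape(n, p, d), out[n * p:]


rng = np.random.default_rng(0)
part_q = rng.normal(size=(4, 3, 8))  # 4 hypotheses x 3 parts x 8 dims
sum_q = rng.normal(size=(4, 8))      # 4 hypotheses x 8 dims

new_part_q, new_sum_q = composite_attention(part_q, sum_q)
print(new_part_q.shape, new_sum_q.shape)  # (4, 3, 8) (4, 8)
```

In this sketch every part token can attend to every sum token and vice versa, which is one simple way to let part-level and triplet-level hypotheses inform each other. A real model would add learned projections, multiple heads, and cross-attention to image features.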