This paper presents a novel approach for inferring relationships between objects in visual scenes. It explicitly exploits an informative hierarchical structure that can be imposed to divide the object and relationship categories into disjoint super-categories. Specifically, our proposed method incorporates a Bayes prediction head, enabling joint predictions of the super-category as the type of relationship between the two objects, along with the detailed relationship within that super-category. This design reduces the impact of class imbalance problems. Furthermore, we also modify the supervised contrastive learning to adapt our hierarchical classification scheme. Experimental evaluations on the Visual Genome and OpenImage V6 datasets demonstrate that this factorized approach allows a relatively simple model to achieve competitive performance, particularly in predicate classification and zero-shot tasks.
翻译:暂无翻译