Current video-based scene graph generation (VidSGG) methods perform poorly on under-represented predicates due to the inherently biased distribution of the training data. In this paper, we take a closer look at the predicates and observe that most visual relations (e.g., sit_above) involve both an actional pattern (sit) and a spatial pattern (above), while the distribution bias is much less severe at the pattern level. Based on this insight, we propose a decoupled label learning (DLL) paradigm that tackles the intractable visual relation prediction from a pattern-level perspective. Specifically, DLL decouples the predicate labels and adopts separate classifiers to learn actional and spatial patterns, respectively. The patterns are then combined and mapped back to the predicate. Moreover, we propose a knowledge-level label decoupling method that transfers non-target knowledge from head predicates to tail predicates within the same pattern to calibrate the distribution of tail classes. We validate the effectiveness of DLL on the commonly used VidSGG benchmark, i.e., VidVRD. Extensive experiments demonstrate that DLL offers a remarkably simple yet highly effective solution to the long-tailed problem, achieving state-of-the-art VidSGG performance.
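To make the pattern-level decoupling concrete, the sketch below shows one way separate actional and spatial classifiers could be combined and mapped back to predicate scores. This is a minimal illustration under our own assumptions: the names (DecoupledRelationHead, predicate_to_patterns) and the simple logit-sum combination rule are hypothetical, not the paper's actual implementation.

```python
# Minimal sketch of the decoupled label learning (DLL) idea: two classifiers
# over the decoupled pattern label spaces, recombined into predicate scores.
# All names here are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class DecoupledRelationHead(nn.Module):
    """Scores a predicate via its actional and spatial patterns separately."""

    def __init__(self, feat_dim, num_actional, num_spatial, predicate_to_patterns):
        super().__init__()
        # Separate classifiers for the actional and spatial pattern spaces.
        self.actional_cls = nn.Linear(feat_dim, num_actional)
        self.spatial_cls = nn.Linear(feat_dim, num_spatial)
        # predicate_to_patterns[p] = (actional_idx, spatial_idx),
        # e.g. sit_above -> (index of "sit", index of "above").
        act_idx, spa_idx = zip(*predicate_to_patterns)
        self.register_buffer("act_idx", torch.tensor(act_idx))
        self.register_buffer("spa_idx", torch.tensor(spa_idx))

    def forward(self, relation_feat):
        # Pattern-level predictions on the relation feature.
        act_logits = self.actional_cls(relation_feat)   # (B, num_actional)
        spa_logits = self.spatial_cls(relation_feat)    # (B, num_spatial)
        # Combine the two patterns and map back to predicate scores;
        # summing the corresponding pattern logits is one plausible choice.
        pred_logits = act_logits[:, self.act_idx] + spa_logits[:, self.spa_idx]
        return act_logits, spa_logits, pred_logits


# Usage example with 4 hypothetical predicates built from 3 actional
# and 2 spatial patterns.
head = DecoupledRelationHead(
    feat_dim=512, num_actional=3, num_spatial=2,
    predicate_to_patterns=[(0, 0), (0, 1), (1, 0), (2, 1)],
)
act_logits, spa_logits, pred_logits = head(torch.randn(8, 512))
```

During training, the two pattern classifiers could be supervised with the decoupled actional and spatial labels, while the recombined predicate scores are used at inference; the knowledge-level label decoupling for head-to-tail transfer within a pattern is not shown here.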