Video relation detection forms a new and challenging problem in computer vision, where subjects and objects need to be localized spatio-temporally and a predicate label needs to be assigned if and only if there is an interaction between the two. Despite recent progress in video relation detection, overall performance remains marginal and it is still unclear which factors are key to solving the problem. Following examples set in the object detection and action localization literature, we perform a deep dive into the error diagnosis of current video relation detection approaches. We introduce a diagnostic tool for analyzing the sources of detection errors. Our tool evaluates and compares current approaches beyond the single scalar metric of mean Average Precision by defining error types specific to video relation detection, which we use for false positive analyses. Moreover, in a false negative analysis we examine different factors that influence performance, including relation length, the number of subject/object/predicate instances, and subject/object size. Finally, we present the effect on video relation detection performance of an oracle fix for each error type. On two video relation benchmarks, we show where current approaches excel and where they fall short, allowing us to pinpoint the most important future directions in the field. The tool is available at \url{https://github.com/shanshuo/DiagnoseVRD}.
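As a minimal sketch of the kind of error categorization such a diagnostic tool performs, the Python snippet below assigns a single predicted relation triplet to an illustrative error type based on trajectory overlap and label agreement. The error names, the trajectory format (per-frame boxes keyed by frame index), and the 0.5 vIoU threshold are assumptions for illustration only, not the exact taxonomy or implementation used in the released tool.

\begin{verbatim}
# Illustrative sketch: categorize one predicted relation against ground truth.
# Trajectories are assumed to be dicts {frame_id: [x1, y1, x2, y2]}.

def box_iou(a, b):
    """Spatial IoU between two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def viou(traj_a, traj_b):
    """Volumetric IoU: per-frame IoU summed over the temporal union."""
    frames = set(traj_a) | set(traj_b)
    shared = set(traj_a) & set(traj_b)
    return sum(box_iou(traj_a[f], traj_b[f]) for f in shared) / (len(frames) + 1e-8)

def classify_error(pred, gts, viou_thr=0.5):
    """Assign an illustrative error type to one predicted relation.

    pred / gts entries: (subject_label, predicate, object_label,
                         subject_trajectory, object_trajectory)
    """
    best = None
    for gt in gts:
        # Both subject and object trajectories must overlap the ground truth.
        overlap = min(viou(pred[3], gt[3]), viou(pred[4], gt[4]))
        if best is None or overlap > best[1]:
            best = (gt, overlap)
    if best is None:
        return "background"        # no ground-truth relation in this video
    gt, overlap = best
    labels_match = pred[:3] == gt[:3]
    if overlap >= viou_thr and labels_match:
        return "correct"
    if overlap >= viou_thr:
        return "classification"    # well localized, wrong triplet label(s)
    if labels_match:
        return "localization"      # right triplet, poor spatio-temporal overlap
    return "confusion"             # wrong label and poor overlap
\end{verbatim}

In this hypothetical scheme, counting how often each label occurs across all predictions, and measuring the mAP gained when each error type is corrected by an oracle, would yield the kind of false positive and oracle-fix analyses the abstract describes.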