By design, average precision (AP) for object detection aims to treat all classes independently: AP is computed independently per category and averaged. On the one hand, this is desirable as it treats all classes, from rare to frequent, equally. On the other hand, it ignores cross-category confidence calibration, a key property in real-world use cases. Unfortunately, we find that on imbalanced, large-vocabulary datasets, the default implementation of AP is neither category independent, nor does it directly reward properly calibrated detectors. In fact, we show that the default implementation produces a gameable metric, where a simple, nonsensical re-ranking policy can improve AP by a large margin. To address these limitations, we introduce two complementary metrics. First, we present a simple fix to the default AP implementation, ensuring that it is truly independent across categories, as originally intended. We benchmark recent advances in large-vocabulary detection and find that many reported gains do not translate to improvements under our new per-class independent evaluation, suggesting that recent improvements may arise from difficult-to-interpret changes to cross-category rankings. Given the importance of reliably benchmarking cross-category rankings, we consider a pooled version of AP (AP-pool) that rewards properly calibrated detectors by directly comparing cross-category rankings. Finally, we revisit classical approaches for calibration and find that explicitly calibrating detectors improves the state of the art on AP-pool by 1.7 points.
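The contrast between the two metrics can be made concrete with a small sketch. The code below is not the paper's evaluation code; it is a toy illustration, assuming detections are represented as (score, is_true_positive) tuples grouped by category, with non-interpolated AP as the area under the precision-recall curve. The helper names (average_precision, mean_ap, pooled_ap) are illustrative, not from the paper.

```python
def average_precision(dets, num_gt):
    """Non-interpolated AP for one ranked list of detections.
    dets: list of (score, is_true_positive); num_gt: number of ground-truth boxes."""
    dets = sorted(dets, key=lambda d: d[0], reverse=True)
    tp, ap = 0, 0.0
    for rank, (_, is_tp) in enumerate(dets, start=1):
        if is_tp:
            tp += 1
            ap += tp / rank  # precision at each recall step
    return ap / num_gt if num_gt else 0.0

def mean_ap(dets_by_cat, gt_by_cat):
    """Default AP: computed per category, then averaged. Invariant to any
    monotone rescaling of scores within a single category."""
    aps = [average_precision(dets_by_cat[c], gt_by_cat[c]) for c in gt_by_cat]
    return sum(aps) / len(aps)

def pooled_ap(dets_by_cat, gt_by_cat):
    """AP-pool-style metric: all detections ranked in one list, so
    cross-category score calibration directly affects the result."""
    pooled = [d for dets in dets_by_cat.values() for d in dets]
    return average_precision(pooled, sum(gt_by_cat.values()))

# Toy example (hypothetical data): two categories, one ground-truth box each.
dets = {"cat": [(0.9, True), (0.3, False)],
        "dog": [(0.8, False), (0.2, True)]}
gt = {"cat": 1, "dog": 1}
print(mean_ap(dets, gt), pooled_ap(dets, gt))    # 0.75 0.75

# Deflating one category's scores preserves its within-category ranking,
# so mean AP is unchanged, but the pooled ranking degrades.
dets["cat"] = [(s / 10, tp) for s, tp in dets["cat"]]
print(mean_ap(dets, gt), pooled_ap(dets, gt))    # 0.75 ~0.583
```

The second print illustrates the core point of the abstract: per-category AP cannot see miscalibration across categories, while a pooled ranking penalizes it directly.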