Test sets are an integral part of evaluating models and gauging progress in object recognition, and more broadly in computer vision and AI. Existing test sets for object recognition, however, suffer from shortcomings such as bias toward ImageNet's characteristics and idiosyncrasies (e.g., ImageNet-V2), being limited to certain types of stimuli (e.g., indoor scenes in ObjectNet), and underestimating model performance (e.g., ImageNet-A). To mitigate these problems, we introduce a new test set, called D2O, which is sufficiently different from existing test sets. Images are a mix of generated images and images crawled from the web. They are diverse, unmodified, and representative of real-world scenarios, and they cause state-of-the-art models to misclassify them with high confidence. To emphasize generalization, our dataset by design does not come paired with a training set. It contains 8,060 images spread across 36 categories, 29 of which appear in ImageNet. The best Top-1 accuracy on our dataset is around 60%, much lower than the 91% best Top-1 accuracy on ImageNet. We find that popular vision APIs perform very poorly in detecting objects over D2O categories such as ``faces'', ``cars'', and ``cats''. Our dataset also comes with a ``miscellaneous'' category, over which we test image tagging models. Overall, our investigations demonstrate that the D2O test set contains a mix of images with varied levels of difficulty and is predictive of the average-case performance of models. It can challenge object recognition models for years to come and can spur more research in this fundamental area.
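The abstract reports Top-1 accuracy as the headline metric (around 60% on D2O vs. 91% on ImageNet). As a minimal sketch of how this metric is computed, here is a small Python function; the class scores and labels below are made-up illustrative data, not figures from the paper.

```python
def top1_accuracy(predictions, labels):
    """Fraction of samples whose highest-scoring class matches the true label."""
    correct = sum(
        1
        for scores, y in zip(predictions, labels)
        if max(range(len(scores)), key=scores.__getitem__) == y
    )
    return correct / len(labels)


# Toy example: 4 samples, 3 classes (hypothetical model scores).
preds = [
    [0.1, 0.7, 0.2],  # argmax = 1
    [0.8, 0.1, 0.1],  # argmax = 0
    [0.3, 0.3, 0.4],  # argmax = 2
    [0.5, 0.4, 0.1],  # argmax = 0, but label is 1 -> misclassified
]
labels = [1, 0, 2, 1]

print(top1_accuracy(preds, labels))  # 0.75
```

In practice the same computation is applied per-image over the full test set; a model's Top-1 score on D2O is simply this fraction over all 8,060 images.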