Contemporary vision benchmarks predominantly consider tasks on which humans can achieve near-perfect performance. However, humans are frequently presented with visual data that they cannot classify with 100% certainty, and models trained on standard vision benchmarks perform poorly when evaluated on such data. To address this issue, we introduce a procedure for creating datasets of ambiguous images and use it to produce SQUID-E ("Squidy"), a collection of noisy images extracted from videos. All images are annotated with ground-truth values, and a test set is annotated with human uncertainty judgments. We use this dataset to characterize human uncertainty in vision tasks and to evaluate existing visual event classification models. Experimental results suggest that existing vision models are not sufficiently equipped to produce meaningful outputs for ambiguous images, and that datasets of this nature can be used to assess and improve such models through model training and direct evaluation of model calibration. These findings motivate large-scale ambiguous dataset creation and further research focusing on noisy visual data.