Trained computer vision models are assumed to solve vision tasks by imitating the human behavior encoded in training labels. Most recent vision research focuses on measuring model performance on standardized benchmarks, while little work has examined the perceptual differences between humans and machines. To fill this gap, our study first quantifies and analyzes the statistical distributions of mistakes from the two sources. We then rank tasks by difficulty and compare human and machine expertise at each level. Even when humans and machines achieve similar overall accuracy, their answer distributions may differ. Leveraging this perceptual difference, we empirically demonstrate a post-hoc human-machine collaboration that outperforms either humans or machines alone.
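The abstract does not specify the mechanism of the post-hoc collaboration, so the following is only a minimal illustrative sketch of one common arbitration scheme: trust the model when its softmax confidence is high, and defer to the human answer otherwise. The function name `combine_predictions` and the `threshold` value are hypothetical, not taken from the paper.

```python
import numpy as np

def combine_predictions(model_probs, human_preds, threshold=0.9):
    """Hypothetical post-hoc arbitration: accept the model's prediction
    when its softmax confidence clears `threshold` (a cutoff one would
    tune on validation data); otherwise fall back to the human answer."""
    model_preds = model_probs.argmax(axis=1)
    confident = model_probs.max(axis=1) >= threshold
    return np.where(confident, model_preds, human_preds)

# Toy usage: 4 examples, 3 classes.
probs = np.array([[0.95, 0.03, 0.02],
                  [0.40, 0.35, 0.25],
                  [0.10, 0.85, 0.05],
                  [0.33, 0.33, 0.34]])
humans = np.array([0, 2, 1, 1])
print(combine_predictions(probs, humans))  # -> [0 2 1 1]
```

Such a scheme can outperform either source alone precisely when, as the abstract notes, human and machine answer distributions differ: each source covers examples where the other tends to err.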