For the task of image classification, researchers work arduously to develop the next state-of-the-art (SOTA) model, each benchmarking their own performance against that of their predecessors and peers. Unfortunately, the metric most frequently used to describe a model's performance, average categorization accuracy, is often reported in isolation. As the number of classes increases, as in fine-grained visual categorization (FGVC), the amount of information conveyed by average accuracy alone dwindles. Its most glaring weakness is its failure to describe the model's performance on a class-by-class basis, but average accuracy also fails to describe how performance may vary from one trained model to another of the same architecture on the same dataset, both averaged across all categories and at the per-class level. We first demonstrate the magnitude of these variations across models and across class distributions based on attributes of the data, comparing results on different visual domains and different per-class image distributions, including long-tailed distributions and few-shot subsets. We then analyze the impact that various FGVC methods have on overall and per-class variance. From this analysis, we highlight the importance of reporting and comparing methods based on information beyond overall accuracy, and we point out techniques that mitigate variance in FGVC results.
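As a rough illustration of the gap between average accuracy and per-class behavior, the sketch below computes overall accuracy, per-class accuracy, and their spread across two trained runs. The function names (`per_class_accuracy`, `summarize_runs`) and the toy labels are hypothetical and chosen only for this example; they are not taken from the paper's code or experiments.

```python
import numpy as np


def per_class_accuracy(y_true, y_pred, num_classes):
    """Accuracy for each class; NaN for classes with no ground-truth samples."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    acc = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            acc[c] = np.mean(y_pred[mask] == c)
    return acc


def summarize_runs(y_true, runs, num_classes):
    """Mean and standard deviation of overall and per-class accuracy across runs."""
    y_true = np.asarray(y_true)
    per_class = np.stack(
        [per_class_accuracy(y_true, preds, num_classes) for preds in runs]
    )
    overall = np.array([np.mean(np.asarray(preds) == y_true) for preds in runs])
    return {
        "overall_mean": overall.mean(),
        "overall_std": overall.std(),
        "per_class_mean": per_class.mean(axis=0),
        "per_class_std": per_class.std(axis=0),
    }


# Toy data (assumed): two runs of the same architecture on the same test set.
y_true = [0, 0, 0, 0, 1, 1]
run_a = [0, 0, 0, 0, 1, 0]  # perfect on class 0, half right on class 1
run_b = [0, 0, 0, 1, 1, 1]  # one error on class 0, perfect on class 1

summary = summarize_runs(y_true, [run_a, run_b], num_classes=2)
print(summary)
```

In this toy case both runs score the same overall accuracy (5 of 6 correct), so the overall standard deviation is zero, yet the per-class accuracies differ between the runs. That difference is the kind of information a single average accuracy figure hides.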