An implicit but pervasive hypothesis of modern computer vision research is that convolutional neural network (CNN) architectures that perform better on ImageNet will also perform better on other vision datasets. We challenge this hypothesis through an extensive empirical study in which we train 500 sampled CNN architectures on ImageNet as well as on 8 other image classification datasets from a wide array of application domains. The relationship between architecture and performance varies wildly depending on the dataset; for some datasets, the performance correlation with ImageNet is even negative. Clearly, it is not enough to optimize architectures solely for ImageNet when aiming for progress that is relevant to all applications. Therefore, we identify two dataset-specific performance indicators: the cumulative width across layers and the total depth of the network. Lastly, we show that the range of dataset variability covered by ImageNet can be significantly extended by adding ImageNet subsets restricted to a few classes.
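To make the notion of "performance correlation with ImageNet" concrete, the sketch below shows one standard way such a cross-dataset correlation can be computed: a Spearman rank correlation between per-architecture accuracies on ImageNet and on a target dataset. This is an illustrative example with synthetic stand-in accuracies, not the paper's exact evaluation pipeline; the array names and the generated data are assumptions.

```python
# Illustrative sketch: quantifying how well the ImageNet ranking of
# architectures carries over to another dataset. The accuracy arrays
# below are synthetic stand-ins, one entry per sampled architecture.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# 500 architectures with weakly related accuracies on the two datasets
# (hypothetical values for demonstration only).
imagenet_acc = rng.uniform(0.60, 0.80, size=500)
target_acc = 0.5 * imagenet_acc + rng.normal(0.0, 0.05, size=500)

# Spearman's rho measures rank agreement: +1 means the ImageNet ranking
# transfers perfectly, ~0 means it is uninformative, and negative values
# mean architectures that rank higher on ImageNet tend to rank lower
# on the target dataset.
rho, pvalue = spearmanr(imagenet_acc, target_acc)
print(f"Spearman rank correlation: {rho:.3f} (p = {pvalue:.2g})")
```

A rank correlation is a natural choice here because the claim under test concerns the *ordering* of architectures (does a better ImageNet model stay better elsewhere?), not the absolute accuracy values.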