Recent research has investigated the shape and texture biases of pre-trained deep neural networks (DNNs) in image classification. Those works test how much a trained DNN relies on specific image cues such as texture. The present study shifts the focus to the influence of cues during training, analyzing what DNNs can learn from shape, texture, and color cues in the absence of the others, and investigating their individual and combined influence on learning success. We analyze these cue influences at multiple levels by decomposing datasets into cue-specific versions. Addressing semantic segmentation, we learn the given task from these reduced-cue datasets, creating cue experts. Early fusion of cues is performed by constructing appropriate datasets. This is complemented by a late fusion of experts, which allows us to study location-dependent cue influence at the pixel level. Experiments on Cityscapes, PASCAL Context, and a synthetic CARLA dataset show that while no single cue dominates, the shape + color expert predominantly improves the prediction of small objects and border pixels. The cue performance order is consistent for the tested convolutional and transformer architectures, indicating similar cue extraction capabilities, although pre-trained transformers are reported to be more biased towards shape than convolutional neural networks.
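The abstract does not specify how the late fusion of cue experts is computed; a minimal sketch, assuming a simple averaging of per-pixel class probabilities and using the expert with the highest per-pixel confidence as a proxy for location-dependent cue influence, could look as follows. The function and expert names are hypothetical, not taken from the paper.

```python
import numpy as np


def softmax(logits, axis=0):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def late_fuse(expert_logits):
    """Fuse cue experts by averaging per-pixel class probabilities.

    expert_logits: dict mapping expert name -> logit array of shape (C, H, W).
    Returns the fused class map (H, W) and, per pixel, the name of the expert
    with the highest top-class confidence (an illustrative proxy for
    location-dependent cue influence; the paper's actual scheme may differ).
    """
    names = list(expert_logits)
    probs = np.stack([softmax(expert_logits[n]) for n in names])  # (E, C, H, W)
    fused = probs.mean(axis=0)                    # ensemble probabilities (C, H, W)
    prediction = fused.argmax(axis=0)             # fused segmentation map (H, W)
    dominant = probs.max(axis=1).argmax(axis=0)   # most confident expert per pixel
    return prediction, np.array(names)[dominant]


# Toy usage: 3 cue experts, 5 classes, a 4x4 image.
rng = np.random.default_rng(0)
logits = {name: rng.normal(size=(5, 4, 4))
          for name in ("shape", "texture", "shape+color")}
pred, influence = late_fuse(logits)
print(pred.shape, influence[0, 0])
```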