Most action recognition models today are highly parameterized and evaluated on datasets whose classes are predominantly spatially distinct. It has also been shown that 2D Convolutional Neural Networks (CNNs) tend to be biased toward texture rather than shape in still image recognition tasks. Taken together, this raises the suspicion that large video models partly learn spurious correlations rather than tracking relevant shapes over time and inferring generalizable semantics from their movement. A natural way to avoid parameter explosion when learning visual patterns over time is to make use of recurrence. In this article, we empirically study whether the choice of low-level temporal modeling has consequences for texture bias and cross-domain robustness. To enable a lightweight and systematic assessment of the ability to capture temporal structure that is not revealed by single frames, we provide the Temporal Shape (TS) dataset, as well as modified domains of Diving48, allowing for the investigation of texture bias in video models. We find that, across a variety of model sizes, convolutional-recurrent and attention-based models show better out-of-domain robustness on TS than 3D CNNs. Domain shift experiments on Diving48 indicate that 3D CNNs and attention-based models exhibit more texture bias than convolutional-recurrent models. Moreover, qualitative examples suggest that convolutional-recurrent models learn more of the correct class attributes from the diving data than the other two model types at the same global validation performance.
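To make the recurrence argument concrete, the sketch below contrasts the two low-level temporal modeling choices in PyTorch. It is an illustrative assumption, not the exact architectures evaluated here: `ConvGRUCell` and all layer sizes are hypothetical. The point it demonstrates is that a 3D convolution's parameter count grows linearly with its temporal extent, while a convolutional-recurrent cell reuses one set of 2D-conv gates at every time step, so its parameter count is independent of clip length.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell: one set of 2D-conv gates is reused
    at every time step, so parameter count is independent of clip length."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)

    def forward(self, x, h):
        # z: update gate, r: reset gate (standard GRU equations, conv form)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

C, H, W, hid = 3, 56, 56, 32  # illustrative sizes, not the paper's settings
cell = ConvGRUCell(C, hid)
count = lambda m: sum(p.numel() for p in m.parameters())

# A 3D conv spanning t_len frames grows linearly in parameters with t_len;
# the recurrent cell's parameter count stays fixed.
for t_len in (4, 16, 64):
    conv3d = nn.Conv3d(C, hid, kernel_size=(t_len, 3, 3))
    print(f"t_len={t_len:3d}  3D conv params: {count(conv3d):6d}  "
          f"ConvGRU params: {count(cell):6d}")

# Unrolling the recurrent cell over a 16-frame clip:
clip = torch.randn(1, 16, C, H, W)            # (batch, time, channels, H, W)
h = torch.zeros(1, hid, H, W)
for t in range(16):
    h = cell(clip[:, t], h)
print("final hidden state:", tuple(h.shape))  # (1, 32, 56, 56)
```

The same trade-off holds when layers are stacked: the recurrent model extends its temporal receptive field by unrolling over more frames at no parameter cost, whereas widening a 3D CNN's temporal coverage requires larger kernels or deeper temporal stacks.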