This paper introduces video domain generalization, in which most video classification networks degenerate due to a lack of exposure to target domains with divergent distributions. We observe that global temporal features are less generalizable because of the temporal domain shift: videos from unseen domains may exhibit an unexpected absence or misalignment of temporal relations. This finding motivates us to solve video domain generalization by effectively learning local-relation features of different timescales, which are more generalizable, and exploiting them along with global-relation features to maintain discriminability. This paper presents the VideoDG framework with two technical contributions. The first is a new deep architecture named the Adversarial Pyramid Network, which improves the generalizability of video features by progressively capturing local-relation, global-relation, and cross-relation features. On the basis of the pyramid features, the second contribution is a new and robust approach to adversarial data augmentation that bridges different video domains by improving the diversity and quality of the augmented data. We construct three video domain generalization benchmarks in which domains are divided according to different datasets, different consequences of actions, or different camera views, respectively. VideoDG consistently outperforms combinations of previous video classification models and existing domain generalization methods on all benchmarks.
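To make the relation-pyramid idea concrete, the following is a minimal sketch assuming frame-level features from a 2-D backbone. It stands in for local-relation modules with temporal convolutions of different kernel sizes (timescales) and for the global-relation module with self-attention; the `RelationPyramidSketch` class, its `timescales` parameter, and the cross-relation fusion are illustrative assumptions, not the paper's exact Adversarial Pyramid Network.

```python
import torch
import torch.nn as nn

class RelationPyramidSketch(nn.Module):
    """Toy sketch of a pyramid over temporal relations.

    Local-relation features are modeled with 1-D temporal convolutions of
    increasing kernel size (short to long timescales); the global-relation
    feature is one self-attention pass over all frames. Cross-relation is
    approximated by attending from each local timescale to the global
    context. All module choices are illustrative, not the paper's design.
    """

    def __init__(self, dim: int = 256, timescales=(3, 5, 7)):
        super().__init__()
        self.local_convs = nn.ModuleList(
            [nn.Conv1d(dim, dim, k, padding=k // 2) for k in timescales]
        )
        self.global_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim) frame-level features from a 2-D CNN backbone.
        x = frames.transpose(1, 2)                       # (batch, dim, time)
        locals_ = [conv(x).transpose(1, 2) for conv in self.local_convs]
        global_, _ = self.global_attn(frames, frames, frames)
        # Fuse each local timescale with the global context (cross-relation).
        fused = [self.cross_attn(l, global_, global_)[0] for l in locals_]
        pyramid = torch.stack(fused + [global_], dim=1)  # (batch, levels, time, dim)
        return pyramid.mean(dim=(1, 2))                  # pooled video representation


if __name__ == "__main__":
    feats = torch.randn(2, 16, 256)              # 2 clips, 16 frames, 256-d features
    print(RelationPyramidSketch()(feats).shape)  # torch.Size([2, 256])
```

In this sketch, pooling over pyramid levels simply averages the local and global branches; the actual framework additionally applies adversarial data augmentation on top of such pyramid features, which is not reproduced here.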