Learning transferable and domain adaptive feature representations from videos is important for video-related tasks such as action recognition. Existing video domain adaptation methods mainly rely on adversarial feature alignment, a technique derived from the RGB image space. However, video data are usually associated with multi-modal information, e.g., RGB and optical flow, and it thus remains a challenge to design a better method that accounts for cross-modal inputs under the cross-domain adaptation setting. To this end, we propose a unified framework for video domain adaptation that simultaneously regularizes cross-modal and cross-domain feature representations. Specifically, we treat each modality in a domain as a view and leverage contrastive learning with properly designed sampling strategies. As a result, our objectives regularize feature spaces that originally lack connections across modalities or are poorly aligned across domains. We conduct experiments on domain adaptive action recognition benchmark datasets, i.e., UCF, HMDB, and EPIC-Kitchens, and demonstrate the effectiveness of our components against state-of-the-art algorithms.
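To make the cross-modal objective concrete, below is a minimal sketch (not the authors' exact formulation) of an InfoNCE-style contrastive loss in PyTorch, where the RGB and optical-flow embeddings of the same clip form a positive pair and the other clips in the batch serve as negatives; the function name `cross_modal_info_nce` and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cross_modal_info_nce(rgb_feat, flow_feat, temperature=0.07):
    """Illustrative InfoNCE loss treating RGB and flow clips of the same
    video as positive views and other clips in the batch as negatives.

    rgb_feat, flow_feat: (B, D) embeddings from modality-specific encoders.
    """
    # L2-normalize so that dot products become cosine similarities
    rgb = F.normalize(rgb_feat, dim=1)
    flow = F.normalize(flow_feat, dim=1)

    # (B, B) similarity matrix; diagonal entries are the positive pairs
    logits = rgb @ flow.t() / temperature
    targets = torch.arange(rgb.size(0), device=rgb.device)

    # Symmetrize over RGB-to-flow and flow-to-RGB directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In this sketch the positive sampling is purely within-domain (same clip, two modalities); the paper's full framework additionally designs sampling strategies that connect source and target domains, which are not shown here.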