Although action recognition has achieved impressive results in recent years, both the collection and annotation of video training data remain time-consuming and cost-intensive. Therefore, image-to-video adaptation has been proposed to exploit label-free web image sources for adaptation to unlabeled target videos. This poses two major challenges: (1) spatial domain shift between web images and video frames; (2) modality gap between image and video data. To address these challenges, we propose Cycle Domain Adaptation (CycDA), a cycle-based approach for unsupervised image-to-video domain adaptation that, on the one hand, leverages the joint spatial information in images and videos and, on the other hand, trains an independent spatio-temporal model to bridge the modality gap. We alternate between spatial and spatio-temporal learning, with knowledge transfer between the two in each cycle. We evaluate our approach on benchmark datasets for image-to-video as well as mixed-source domain adaptation, achieving state-of-the-art results and demonstrating the benefits of our cyclic adaptation. Code is available at \url{https://github.com/wlin-at/CycDA}.
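To make the alternating scheme concrete, below is a minimal PyTorch sketch of one possible realization of the cycle: a spatial (2D) model is trained on labeled web images, its predictions pseudo-label the unlabeled target frames, a spatio-temporal (3D) model is then trained on the pseudo-labeled clips, and its predictions are fed back to supervise the next spatial stage. The model definitions, data shapes, hyper-parameters, and the pseudo-labeling form of knowledge transfer are illustrative assumptions, not the paper's actual architecture or training pipeline.

\begin{verbatim}
# Minimal sketch of a CycDA-style alternating training loop (assumes PyTorch).
import torch
import torch.nn as nn

num_classes, num_cycles = 10, 3

# Spatial model: classifies single frames / web images (2D).
spatial_model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_classes))

# Spatio-temporal model: classifies video clips (3D).
video_model = nn.Sequential(
    nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, num_classes))

# Dummy data standing in for labeled web images and unlabeled target videos.
web_images = torch.randn(32, 3, 64, 64)
web_labels = torch.randint(0, num_classes, (32,))
target_clips = torch.randn(16, 3, 8, 64, 64)   # (N, C, T, H, W)
target_frames = target_clips[:, :, 0]          # one frame per clip

ce = nn.CrossEntropyLoss()
video_pseudo_labels = None

for cycle in range(num_cycles):
    # Stage A: spatial learning on web images, plus pseudo-labels fed back
    # from the video model in later cycles.
    opt_s = torch.optim.Adam(spatial_model.parameters(), lr=1e-3)
    for _ in range(5):
        loss = ce(spatial_model(web_images), web_labels)
        if video_pseudo_labels is not None:
            loss = loss + ce(spatial_model(target_frames), video_pseudo_labels)
        opt_s.zero_grad(); loss.backward(); opt_s.step()

    # Knowledge transfer: spatial model pseudo-labels the unlabeled frames.
    with torch.no_grad():
        frame_pseudo_labels = spatial_model(target_frames).argmax(dim=1)

    # Stage B: spatio-temporal learning on pseudo-labeled target clips.
    opt_v = torch.optim.Adam(video_model.parameters(), lr=1e-3)
    for _ in range(5):
        loss = ce(video_model(target_clips), frame_pseudo_labels)
        opt_v.zero_grad(); loss.backward(); opt_v.step()

    # Knowledge transfer back: video model refines the pseudo-labels used to
    # supervise the spatial model in the next cycle.
    with torch.no_grad():
        video_pseudo_labels = video_model(target_clips).argmax(dim=1)
\end{verbatim}

In this sketch the pseudo-labels play the role of the cross-model knowledge transfer; the actual method may additionally employ domain alignment between web images and video frames, which is omitted here for brevity.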