Over the last few years, Unsupervised Domain Adaptation (UDA) techniques have acquired remarkable importance and popularity in computer vision. However, compared to the extensive literature available for images, the video domain remains relatively unexplored, even though the performance of action recognition models is heavily affected by domain shift. In this paper, we propose a simple and novel UDA approach for video action recognition. Our approach leverages recent advances in spatio-temporal transformers to build a robust source model that generalises better to the target domain. Furthermore, our architecture learns domain-invariant features thanks to the introduction of a novel alignment loss term derived from the Information Bottleneck principle. We report results on two video action recognition benchmarks for UDA, showing state-of-the-art performance on HMDB$\leftrightarrow$UCF, as well as on the more challenging Kinetics$\rightarrow$NEC-Drone benchmark. This demonstrates the effectiveness of our method in handling different levels of domain shift. The source code is available at https://github.com/vturrisi/UDAVT.
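The abstract does not spell out the alignment term, so the following is only a minimal, hypothetical sketch of how an Information-Bottleneck-inspired alignment loss between source and target clip features could be structured, not the paper's actual formulation. The function name `ib_alignment_loss`, the moment-matching alignment surrogate, and the `beta` weight are all assumptions introduced here for illustration.

```python
import torch


def ib_alignment_loss(src_feats: torch.Tensor, tgt_feats: torch.Tensor, beta: float = 1e-3) -> torch.Tensor:
    """Illustrative IB-style alignment loss (assumption, not the paper's exact term).

    src_feats, tgt_feats: (batch, dim) clip-level features from the transformer backbone.
    """
    # Compression surrogate: keep the latent representation compact,
    # loosely mimicking the I(X; Z) penalty of the Information Bottleneck.
    compression = src_feats.pow(2).mean() + tgt_feats.pow(2).mean()

    # Domain alignment: match first- and second-order batch statistics
    # of source and target features to encourage domain-invariant representations.
    mean_gap = (src_feats.mean(dim=0) - tgt_feats.mean(dim=0)).pow(2).sum()
    var_gap = (src_feats.var(dim=0) - tgt_feats.var(dim=0)).pow(2).sum()

    return beta * compression + mean_gap + var_gap
```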