Unsupervised video domain adaptation is a practical yet challenging task. In this work, for the first time, we tackle it from a disentanglement perspective. Our key idea is to disentangle the domain-related information from the data during the adaptation process. Specifically, we consider the generation of cross-domain videos from two sets of latent factors: one encoding the static, domain-related information, and the other encoding the temporal, semantic-related information. A Transfer Sequential VAE (TranSVAE) framework is then developed to model this generation process. To make the latent factors better serve adaptation, we further propose several objectives to constrain them in TranSVAE. Extensive experiments on the UCF-HMDB, Jester, and Epic-Kitchens datasets verify the effectiveness and superiority of TranSVAE over several state-of-the-art methods. Code is publicly available at https://github.com/ldkong1205/TranSVAE.
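To make the two-branch factorization concrete, below is a minimal PyTorch sketch of a sequential VAE with a static, domain-related latent z_d (one per video) and per-frame dynamic latents z_t. All module names, layer sizes, and the use of pre-extracted frame features are illustrative assumptions on my part, not the authors' implementation; the actual TranSVAE additionally imposes the adaptation objectives mentioned above, which are omitted here.

```python
import torch
import torch.nn as nn

class TranSVAESketch(nn.Module):
    """Hypothetical sketch of the two-set latent factorization:
    a static latent z_d shared across all frames of a video (domain-related)
    and per-frame dynamic latents z_t (temporal/semantic-related)."""

    def __init__(self, feat_dim=512, hid_dim=256, zd_dim=64, zt_dim=64):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hid_dim, batch_first=True)
        # Static (domain) branch: one latent per video, from the final hidden state.
        self.zd_mu = nn.Linear(hid_dim, zd_dim)
        self.zd_logvar = nn.Linear(hid_dim, zd_dim)
        # Dynamic (semantic) branch: one latent per frame.
        self.zt_mu = nn.Linear(hid_dim, zt_dim)
        self.zt_logvar = nn.Linear(hid_dim, zt_dim)
        # Decoder reconstructs frame features from [z_d, z_t].
        self.dec = nn.Sequential(
            nn.Linear(zd_dim + zt_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, feat_dim),
        )

    @staticmethod
    def reparam(mu, logvar):
        # Standard VAE reparameterization trick.
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, x):                      # x: (B, T, feat_dim)
        h, (h_n, _) = self.rnn(x)              # h: (B, T, hid), h_n: (1, B, hid)
        z_d = self.reparam(self.zd_mu(h_n[-1]), self.zd_logvar(h_n[-1]))  # (B, zd_dim)
        z_t = self.reparam(self.zt_mu(h), self.zt_logvar(h))              # (B, T, zt_dim)
        # Broadcast the static latent to every frame and decode jointly.
        z_d_rep = z_d.unsqueeze(1).expand(-1, z_t.size(1), -1)
        recon = self.dec(torch.cat([z_d_rep, z_t], dim=-1))
        return recon, z_d, z_t

x = torch.randn(4, 8, 512)                     # 4 videos, 8 frames of 512-d features
recon, z_d, z_t = TranSVAESketch()(x)
```

Intuitively, once the two factor sets are disentangled, the static z_d absorbs the domain gap while z_t carries the domain-invariant semantics used for the downstream task.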