Open-set Unsupervised Video Domain Adaptation (OUVDA) is the task of adapting an action recognition model from a labelled source domain to an unlabelled target domain that contains "target-private" categories, i.e., categories present in the target but absent in the source. In this work we deviate from prior work, which trains specialized open-set classifiers or uses weighted adversarial learning, by proposing to use a pre-trained Language and Vision Model (CLIP). CLIP is well suited for OUVDA due to its rich representation and zero-shot recognition capabilities. However, rejecting target-private instances with CLIP's zero-shot protocol requires oracle knowledge of the target-private label names. Since such knowledge is unavailable in practice, we propose AutoLabel, which automatically discovers and generates object-centric compositional candidate target-private class names. Despite its simplicity, we show that CLIP, when equipped with AutoLabel, can satisfactorily reject target-private instances, thereby facilitating better alignment between the shared classes of the two domains. The code is available.
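To make the rejection protocol concrete, the following is a minimal sketch of open-set rejection via CLIP zero-shot scoring, assuming the OpenAI `clip` package. The candidate target-private names are hand-written placeholders here purely for illustration; in AutoLabel they would be discovered automatically as object-centric compositions rather than supplied by an oracle.

```python
import torch
import clip
from PIL import Image

# Minimal sketch: score an instance against shared class names plus
# candidate target-private names, and reject it as target-private if
# its best zero-shot match is one of the candidates.

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

shared_classes = ["archery", "basketball", "fencing"]            # known from source
candidate_private = ["person riding horse", "person climbing rope"]  # hypothetical placeholders
all_names = shared_classes + candidate_private

text = clip.tokenize([f"a video of {c}" for c in all_names]).to(device)

# A single frame stands in for a video clip here; in practice frame
# features would be pooled over time.
image = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)

pred = sims.argmax().item()
if pred >= len(shared_classes):
    print("rejected as target-private:", all_names[pred])
else:
    print("shared class:", all_names[pred])
```

Instances kept as shared-class predictions can then be aligned across domains, while rejected ones are held out of the alignment objective.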