In an effort to reduce annotation costs in action recognition, unsupervised video domain adaptation methods have been proposed that aim to adapt a predictive model from a labelled dataset (i.e., source domain) to an unlabelled dataset (i.e., target domain). In this work we address a more realistic scenario, called open-set unsupervised video domain adaptation (OUVDA), where the target dataset contains "unknown" semantic categories that are not shared with the source. The challenge lies in aligning the shared classes of the two domains while separating the shared classes from the unknown ones. We propose to address OUVDA with a unified contrastive learning framework that learns discriminative and well-clustered features. We also propose a video-oriented temporal contrastive loss that enables our method to better cluster the feature space by exploiting the freely available temporal information in video data. We show that a discriminative feature space facilitates better separation of the unknown classes, and thereby allows us to use a simple similarity-based score to identify them. We conduct a thorough experimental evaluation on multiple OUVDA benchmarks and show the effectiveness of our proposed method against the prior art.
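To make the two key ingredients concrete, the sketch below shows a generic InfoNCE-style temporal contrastive loss and a similarity-based unknown score. This is a minimal illustration under assumed conventions, not the paper's exact formulation: it assumes each video contributes two clip embeddings sampled at different timestamps (the positive pair), with clips from other videos serving as negatives, and it scores a target feature as "unknown" when its maximum cosine similarity to the source class prototypes is low. All names (`temporal_contrastive_loss`, `unknown_score`, `prototypes`) are hypothetical.

```python
# Minimal sketch of a temporal contrastive loss and a similarity-based
# unknown score; an assumed InfoNCE-style stand-in, not the paper's method.
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(z_t1: torch.Tensor,
                              z_t2: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """z_t1, z_t2: (B, D) embeddings of two clips per video, taken at
    different timestamps. Same-video pairs are positives, rest negatives."""
    z_t1 = F.normalize(z_t1, dim=1)
    z_t2 = F.normalize(z_t2, dim=1)
    logits = z_t1 @ z_t2.t() / temperature  # (B, B) scaled cosine similarities
    targets = torch.arange(z_t1.size(0), device=z_t1.device)
    # Symmetrized InfoNCE: each clip must identify its temporal counterpart
    # among all clips in the batch.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def unknown_score(z: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Max cosine similarity of target features z (N, D) to source class
    prototypes (C, D); low values suggest an 'unknown' category."""
    z = F.normalize(z, dim=1)
    prototypes = F.normalize(prototypes, dim=1)
    return (z @ prototypes.t()).max(dim=1).values
```

In this reading, the contrastive loss pulls temporally related clips of the same video together and pushes other videos apart, producing the well-clustered feature space on which a simple prototype-similarity threshold can separate shared from unknown classes.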