Unsupervised Domain Adaptation (UDA) transfers knowledge from labeled source data to unlabeled target data of the same categories. However, UDA for first-person action recognition remains under-explored, owing to a lack of datasets and limited consideration of first-person video characteristics. This paper addresses this problem. Firstly, we propose two small-scale first-person video domain adaptation datasets: ADL$_{small}$ and GTEA-KITCHEN. Secondly, we introduce channel-temporal attention blocks that capture channel-wise and temporal-wise relationships and model the inter-dependencies between them, which are important to first-person vision. Finally, we propose a Channel-Temporal Attention Network (CTAN) that integrates these blocks into existing architectures. CTAN outperforms baselines on the two proposed datasets and on the existing EPIC$_{cvpr20}$ dataset.
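To make the channel-temporal attention idea concrete, below is a minimal, hypothetical PyTorch sketch of such a block, assuming SE-style (squeeze-and-excitation) gating applied separately along the channel and temporal axes of a video feature map and combined with a residual connection. The abstract does not specify the actual CTAN block design, so the class name, branch structure, and reduction ratio here are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a channel-temporal attention block; the actual
# CTAN design is not given in the abstract, so this is only illustrative.
import torch
import torch.nn as nn


class ChannelTemporalAttention(nn.Module):
    """Gates a (B, C, T, H, W) video feature map along channels and time."""

    def __init__(self, channels: int, frames: int, reduction: int = 8):
        super().__init__()
        # Channel branch: squeeze (T, H, W), then excite channels.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        # Temporal branch: squeeze (C, H, W), then excite frames.
        self.temporal_fc = nn.Sequential(
            nn.Linear(frames, max(frames // 2, 1)),
            nn.ReLU(inplace=True),
            nn.Linear(max(frames // 2, 1), frames),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        # Channel attention weights from a global average over (T, H, W).
        ch = x.mean(dim=(2, 3, 4))                  # (B, C)
        ch = self.channel_fc(ch).view(b, c, 1, 1, 1)
        # Temporal attention weights from a global average over (C, H, W).
        tp = x.mean(dim=(1, 3, 4))                  # (B, T)
        tp = self.temporal_fc(tp).view(b, 1, t, 1, 1)
        # Residual gating keeps the block easy to drop into a backbone,
        # matching the paper's goal of integrating into existing models.
        return x + x * ch * tp


if __name__ == "__main__":
    block = ChannelTemporalAttention(channels=64, frames=8)
    clip = torch.randn(2, 64, 8, 4, 4)              # (B, C, T, H, W)
    print(block(clip).shape)                        # torch.Size([2, 64, 8, 4, 4])
```

The residual formulation means the block defaults to a near-identity transform early in training, which is one common reason such attention modules can be inserted into pretrained video architectures without destabilizing them.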