This report describes the technical details of our submission to the EPIC-Kitchens 2021 Unsupervised Domain Adaptation Challenge for Action Recognition. The EPIC-Kitchens dataset is more difficult than other video domain adaptation datasets because it involves multiple tasks and more modalities. First, we employ a transformer to capture spatial information from each modality. Second, we employ a temporal attention module to model temporal inter-dependencies. Third, we employ an adversarial domain adaptation network to learn features that generalize across the labeled source domain and the unlabeled target domain. Finally, we combine multiple modalities through a three-stream network with late fusion to improve performance. Our network achieves performance comparable to the state-of-the-art baseline TA$^3$N and outperforms it on top-1 accuracy for the verb class and on top-5 accuracy for all three tasks (verb, noun, and action). Under the team name xy9, our submission achieved 5th place in terms of top-1 accuracy for the verb class and all top-5 accuracies.
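The late-fusion step mentioned above can be illustrated with a minimal sketch. It assumes that each of the three streams produces a vector of class logits and that fusion is a (by default uniform) weighted average of the per-stream softmax probabilities. The abstract does not specify the fusion rule or name the modalities, so the averaging scheme, the stream names `rgb`, `flow`, and `audio`, and the helper names `late_fusion` and `top_k` are all assumptions made for illustration, not the submission's actual implementation.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(stream_logits, weights=None):
    """Fuse per-stream predictions by a weighted average of probabilities.

    stream_logits: one list of class logits per stream (hypothetical input).
    weights: optional per-stream weights; uniform if omitted (an assumption).
    """
    if weights is None:
        weights = [1.0 / len(stream_logits)] * len(stream_logits)
    probs = [softmax(logits) for logits in stream_logits]
    n_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(n_classes)
    ]


def top_k(scores, k):
    """Return the indices of the k highest-scoring classes, best first."""
    return sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]


# Made-up logits for four classes from three hypothetical streams.
rgb = [2.0, 1.0, 0.1, 0.0]
flow = [0.5, 2.5, 0.2, 0.1]
audio = [0.3, 1.2, 1.0, 0.0]

fused = late_fusion([rgb, flow, audio])
print("top-1:", top_k(fused, 1), "top-2:", top_k(fused, 2))
```

Because fusion happens on the output scores, each stream can be trained and adapted independently, and top-1 and top-5 accuracy are then computed from the ranking of the fused scores.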