We study the task of learning robust feature representations that generalize well across multiple datasets for action recognition. We build our method on Transformers for their efficacy. Although video action recognition has seen great progress in the past decade, training a single model that performs well across multiple datasets remains challenging yet valuable. Here, we propose a novel multi-dataset training paradigm, MultiTrain, built on two new loss terms, an informative loss and a projection loss, that aim to learn robust representations for action recognition. In particular, the informative loss maximizes the expressiveness of the feature embedding, while the projection loss for each dataset mines the intrinsic relations between classes across datasets. We verify the effectiveness of our method on five challenging datasets: Kinetics-400, Kinetics-700, Moments-in-Time, ActivityNet, and Something-Something-v2. Extensive experimental results show that our method consistently improves on state-of-the-art performance. Code and models are released.
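The abstract names the two loss terms but does not give their formulas, so the following is only a minimal sketch of how such a multi-dataset objective could be assembled, not the released implementation. Everything specific in it is an assumption made for illustration: the informative loss is stood in for by a variance-plus-decorrelation penalty on the shared embedding, the projection loss by a cross-entropy on another dataset's class scores mapped through a hypothetical matrix `proj` into the current dataset's label space, and the names `head_own`, `head_other`, `w_info`, and `w_proj` are likewise hypothetical.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def informative_loss(z, eps=1e-4):
    """ASSUMED stand-in: penalize low per-dimension spread and
    correlated dimensions, so the embedding stays expressive."""
    z = z - z.mean(axis=0, keepdims=True)
    std = np.sqrt(z.var(axis=0) + eps)
    variance_term = np.maximum(0.0, 1.0 - std).mean()
    cov = (z.T @ z) / (len(z) - 1)
    off_diag = cov - np.diag(np.diag(cov))
    covariance_term = (off_diag ** 2).sum() / z.shape[1]
    return variance_term + covariance_term

def projection_loss(other_logits, proj, labels):
    """ASSUMED stand-in: map another dataset's class scores into this
    dataset's label space with `proj`, then score against the labels."""
    return cross_entropy(other_logits @ proj, labels)

def multi_dataset_loss(z, labels, head_own, head_other, proj,
                       w_info=1.0, w_proj=1.0):
    """Classification loss on the sample's own dataset head plus the
    two auxiliary terms (the weights are hypothetical hyperparameters)."""
    own = cross_entropy(z @ head_own, labels)
    info = informative_loss(z)
    proj_term = projection_loss(z @ head_other, proj, labels)
    return own + w_info * info + w_proj * proj_term

# Toy usage: random arrays stand in for the shared Transformer features.
rng = np.random.default_rng(0)
batch, dim, n_own, n_other = 16, 8, 5, 7
z = rng.normal(size=(batch, dim))              # shared embedding
labels = rng.integers(0, n_own, size=batch)    # labels from the "own" dataset
head_own = rng.normal(size=(dim, n_own))       # classifier for own dataset
head_other = rng.normal(size=(dim, n_other))   # classifier for another dataset
proj = rng.normal(size=(n_other, n_own))       # other classes -> own classes
total = multi_dataset_loss(z, labels, head_own, head_other, proj)
```

In this sketch a collapsed embedding (all samples mapped to the same vector) receives a high informative penalty, while the projection term lets class scores learned on one dataset contribute to predictions on another; how the paper actually realizes either idea should be taken from its method section and released code.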