This report presents the technical details of our submission to the EPIC-Kitchens-100 Action Recognition Challenge 2021. To participate in the challenge, we deployed spatio-temporal feature extraction and aggregation models that we developed recently: GSF and XViT. GSF is an efficient spatio-temporal feature extraction module that can be plugged into 2D CNNs for video action recognition. XViT is a convolution-free video feature extractor based on the transformer architecture. We design an ensemble of GSF and XViT model families, with different backbones and pretraining, to generate the prediction scores. Our submission, visible on the public leaderboard, achieved a top-1 action recognition accuracy of 44.82% using only RGB input.
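As a rough illustration of how an ensemble of models can produce a single set of prediction scores, the following sketch averages per-model softmax scores for one clip. The fusion rule (a weighted average), the function names `softmax`, `ensemble_scores`, and `top1`, the optional `weights` parameter, and the toy logits are all assumptions made for this example; the abstract does not state how the GSF and XViT scores are actually combined.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_scores(per_model_logits, weights=None):
    """Weighted average of per-model softmax scores for one clip.

    per_model_logits: one list of logits per model, all with the same class count.
    weights: hypothetical per-model weights; uniform if omitted.
    """
    n_models = len(per_model_logits)
    if weights is None:
        weights = [1.0] * n_models
    norm = sum(weights)
    n_classes = len(per_model_logits[0])
    fused = [0.0] * n_classes
    for logits, w in zip(per_model_logits, weights):
        probs = softmax(logits)
        for c in range(n_classes):
            fused[c] += w * probs[c] / norm
    return fused


def top1(scores):
    """Index of the highest-scoring class."""
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up logits standing in for one GSF-style and one XViT-style model.
gsf_logits = [2.0, 1.0, 0.1]
xvit_logits = [0.5, 2.5, 0.3]

fused = ensemble_scores([gsf_logits, xvit_logits])
prediction = top1(fused)
```

Averaging probabilities rather than raw logits keeps models with different output scales comparable, which is one common reason to prefer it when mixing CNN and transformer families; other fusion rules are equally plausible here.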