The ubiquity of camera-enabled mobile devices has led to large amounts of unlabelled video data being produced at the edge. Although various self-supervised learning (SSL) methods have been proposed to harvest latent spatio-temporal representations from such data for task-specific training, practical challenges, including privacy concerns and communication costs, prevent SSL from being deployed at scale. To mitigate these issues, we propose applying Federated Learning (FL) to the task of video SSL. In this work, we evaluate the performance of current state-of-the-art (SOTA) video-SSL techniques and identify their shortcomings when integrated into a large-scale FL setting simulated with the Kinetics-400 dataset. We then propose a novel federated SSL framework for video, dubbed FedVSSL, that integrates different aggregation strategies and partial weight updating. Extensive experiments demonstrate the effectiveness and significance of FedVSSL, which outperforms the centralized SOTA on the downstream retrieval task by 6.66% on UCF-101 and 5.13% on HMDB-51.