The rapid growth of short-form video platforms heightens the need for privacy-preserving content moderation, as cloud-based pipelines expose raw videos to privacy risks, high bandwidth costs, and inference latency. To address these challenges, we propose an on-device federated learning framework for video violence detection that integrates self-supervised VideoMAE representations, LoRA-based parameter-efficient adaptation, and defense-in-depth privacy protection. Our approach reduces the trainable parameter count to 5.5M (~3.5% of a 156M backbone) and incorporates DP-SGD with configurable privacy budgets and secure aggregation. In experiments on RWF-2000 with 40 clients, our method achieves 77.25% accuracy without privacy protection and 65-66% under strong differential privacy, while reducing communication cost by $28.3\times$ compared to full-model federated learning. The code is available at https://github.com/zyt-599/FedVideoMAE.
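As a rough sanity check of the figures quoted above, the sketch below works through the parameter and communication budget. The rounded counts (156M frozen backbone, 5.5M trainable LoRA parameters) and the fp32 upload assumption are illustrative only; the exact numbers follow from the layer shapes in the released code.

```python
# Minimal sketch (not the released implementation): back-of-the-envelope
# check of the trainable-parameter fraction and per-round communication
# savings when clients upload only LoRA adapters instead of the full model.

BACKBONE_PARAMS = 156_000_000   # frozen VideoMAE backbone (~156M), rounded
LORA_PARAMS = 5_500_000         # trainable LoRA adapters (~5.5M), rounded
BYTES_PER_PARAM = 4             # assuming fp32 updates

trainable_fraction = LORA_PARAMS / BACKBONE_PARAMS
print(f"trainable fraction: {trainable_fraction:.1%}")        # ~3.5%

# Full-model FL uploads the whole backbone each round; LoRA FL uploads
# only the adapter weights.
full_upload = BACKBONE_PARAMS * BYTES_PER_PARAM
lora_upload = LORA_PARAMS * BYTES_PER_PARAM
print(f"full-model upload per client/round: {full_upload / 1e6:.0f} MB")
print(f"LoRA upload per client/round:       {lora_upload / 1e6:.0f} MB")

# ~28x, consistent with the reported 28.3x reduction; the exact ratio
# depends on the precise parameter counts.
print(f"communication reduction: {full_upload / lora_upload:.1f}x")
```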