With the increasing amount of multimedia data on modern mobile systems and IoT infrastructures, harnessing these rich multimodal data without breaching user privacy has become a critical issue. Federated learning (FL) serves as a privacy-conscious alternative to centralized machine learning. However, existing FL methods extended to multimodal data all rely on model aggregation at the single-modality level, which forces the server and clients to use an identical model architecture for each modality. This limits the global model in both model complexity and data capacity, not to mention task diversity. In this work, we propose Contrastive Representation Ensemble and Aggregation for Multimodal FL (CreamFL), a multimodal federated learning framework that enables training larger server models from clients with heterogeneous model architectures and data modalities, while communicating only knowledge on a public dataset. To achieve better multimodal representation fusion, we design a global-local cross-modal ensemble strategy to aggregate client representations. To mitigate local model drift caused by two unprecedented heterogeneous factors stemming from multimodal discrepancy (modality gap and task gap), we further propose an inter-modal and an intra-modal contrast to regularize local training, which complement the information of the absent modality for uni-modal clients and steer local clients towards global consensus. Thorough evaluations and ablation studies on image-text retrieval and visual question answering tasks demonstrate the superiority of CreamFL over state-of-the-art FL methods and its practical value.