Federated learning has been proposed as an alternative to centralized machine learning, since its client-server structure provides better privacy protection and scalability in real-world applications. In many applications, such as smart homes with IoT devices, local data on clients are generated from different modalities, such as sensory, visual, and audio data. Existing federated learning systems only work on local data from a single modality, which limits their scalability. In this paper, we propose a multimodal and semi-supervised federated learning framework that trains autoencoders to extract shared or correlated representations from different local data modalities on clients. In addition, we propose a multimodal FedAvg algorithm to aggregate local autoencoders trained on different data modalities. We use the learned global autoencoder for a downstream classification task with the help of auxiliary labelled data on the server. We empirically evaluate our framework on different modalities, including sensory data, depth camera videos, and RGB camera videos. Our experimental results demonstrate that introducing data from multiple modalities into federated learning can improve its accuracy. In addition, we can use labelled data from only one modality for supervised learning on the server and apply the learned model to test data from other modalities while achieving decent accuracy (e.g., approximately 70% at best), especially when combining contributions from both unimodal and multimodal clients.
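The multimodal FedAvg aggregation described above can be sketched as a per-modality, sample-count-weighted average of client autoencoder parameters. This is a minimal illustration, not the paper's exact implementation: the function name, the dictionary-based data layout, and the flattened-parameter representation are assumptions made for brevity.

```python
import numpy as np

def multimodal_fedavg(client_updates):
    """Aggregate per-modality autoencoder weights across clients.

    client_updates: list of (weights_by_modality, n_samples) pairs, where
    weights_by_modality maps a modality name (e.g. "sensor", "rgb") to a
    flat numpy array of that modality's autoencoder parameters. A unimodal
    client contributes one entry; a multimodal client contributes one
    entry per local modality.
    """
    totals, counts = {}, {}
    for weights_by_modality, n_samples in client_updates:
        for modality, weights in weights_by_modality.items():
            totals[modality] = totals.get(modality, 0.0) + n_samples * weights
            counts[modality] = counts.get(modality, 0) + n_samples
    # Sample-count-weighted average per modality, as in standard FedAvg.
    return {m: totals[m] / counts[m] for m in totals}
```

For example, a sensor-only client and a sensor+RGB client each submit their local weights; the server averages the sensor autoencoders across both clients while the RGB autoencoder comes from the multimodal client alone.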