The ubiquity of microphone-enabled devices has led to large amounts of unlabelled audio data being produced at the edge. Integrating self-supervised learning (SSL) and federated learning (FL) into one coherent system can potentially offer data privacy guarantees while also advancing the quality and robustness of speech representations. In this paper, we provide a first-of-its-kind systematic study of the feasibility and complexities of training speech SSL models under FL scenarios from the perspective of algorithmic, hardware, and system limits. Despite the high potential of their combination, we find that existing system constraints and algorithmic behaviour make SSL and FL systems nearly impossible to build today. Yet critically, our results indicate specific performance bottlenecks and research opportunities that would allow this situation to be reversed. While our analysis suggests that, given existing trends in hardware, hybrid SSL and FL speech systems will not be viable until 2027, we believe this study can act as a roadmap to accelerate work towards reaching this milestone much earlier.