Training Automatic Speech Recognition (ASR) models under federated learning (FL) settings has recently attracted considerable attention. However, the FL scenarios presented in the literature are often artificial and fail to capture the complexity of real FL systems. In this paper, we construct a challenging and realistic federated ASR experimental setup consisting of clients with heterogeneous data distributions, using the French and Italian sets of the CommonVoice dataset, a large heterogeneous dataset containing thousands of different speakers, acoustic environments and noises. We present the first empirical study of an attention-based sequence-to-sequence End-to-End (E2E) ASR model with three aggregation weighting strategies -- standard FedAvg, loss-based aggregation and a novel word error rate (WER)-based aggregation -- compared in two realistic FL scenarios: cross-silo with 10 clients and cross-device with 2K and 4K clients. Our analysis of E2E ASR trained from heterogeneous and realistic federated acoustic models provides the foundations for future research and development of realistic FL-based ASR applications.
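To make the three aggregation weighting strategies concrete, the following is a minimal sketch of how per-client weights might be computed and applied on the server. The abstract does not give the exact formulas, so everything here is an assumption for illustration: FedAvg is weighted by client sample counts (its standard form), while the loss-based and WER-based variants are assumed to use a softmax over the negated loss and over (1 - WER) respectively. The names `fedavg_weights`, `loss_based_weights`, `wer_based_weights` and `aggregate` are hypothetical, and parameters are represented as plain lists of floats rather than real model tensors.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical container: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def normalize(scores: Sequence[float]) -> List[float]:
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    return [s / total for s in scores]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax (subtracts the max before exponentiating)."""
    peak = max(scores)
    return normalize([math.exp(s - peak) for s in scores])


def fedavg_weights(num_samples: Sequence[int]) -> List[float]:
    """Standard FedAvg: weight each client by its share of training samples."""
    return normalize(num_samples)


def loss_based_weights(losses: Sequence[float]) -> List[float]:
    """ASSUMPTION: softmax over negated losses, so lower loss gives more weight."""
    return softmax([-loss for loss in losses])


def wer_based_weights(wers: Sequence[float]) -> List[float]:
    """ASSUMPTION: softmax over (1 - WER), so lower WER gives more weight."""
    return softmax([1.0 - wer for wer in wers])


def aggregate(client_params: Sequence[Params], weights: Sequence[float]) -> Params:
    """Weighted average of client parameters, computed per parameter entry."""
    if len(client_params) != len(weights):
        raise ValueError("need exactly one weight per client")
    merged: Params = {}
    for name in client_params[0]:
        size = len(client_params[0][name])
        merged[name] = [
            sum(w * params[name][i] for w, params in zip(weights, client_params))
            for i in range(size)
        ]
    return merged


if __name__ == "__main__":
    # Two toy clients with made-up statistics, purely for illustration.
    clients = [{"w": [0.0, 2.0]}, {"w": [4.0, 6.0]}]
    print(aggregate(clients, fedavg_weights([100, 300])))
    print(aggregate(clients, loss_based_weights([0.5, 2.0])))
    print(aggregate(clients, wer_based_weights([0.2, 0.6])))
```

Only the weighting function changes between strategies; the aggregation step is the same weighted average in all three cases. In a cross-device setting with thousands of clients, the server would presumably apply this to the subset of clients sampled in each round.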