The analysis of data stored across multiple sites has become increasingly common, raising new concerns about the security of data storage and communication. Federated learning, which does not require centralizing data, is a common approach to avoiding heavy data transfer, securing valuable data, and protecting personal information. How to aggregate the information obtained from analyzing data at separate local sites has therefore become an important statistical issue. Commonly used averaging methods may be unsuitable because of data nonhomogeneity and incomparable results among individual sites, and applying them may lose information obtained from the individual analyses. Using a sequential method in federated learning with distributed computing can facilitate the integration and accelerate the analysis process. We develop a data-driven method for efficiently and effectively aggregating valuable information by analyzing local data, without incurring the information-security risks and heavy transfer costs of data communication. In addition, when applied to generalized linear models, the proposed method can preserve the properties of classical sequential adaptive design, such as data-driven sample size and estimation precision. We illustrate the proposed method with numerical studies of simulated data and an application to COVID-19 data collected from 32 hospitals in Mexico.