Distributed data naturally arise in scenarios involving multiple sources of observations, each stored at a different location. Directly pooling all the data together is often prohibited due to limited bandwidth and storage, or due to privacy protocols. This paper introduces a new robust distributed algorithm for fitting linear regressions when the data are subject to heavy-tailed and/or asymmetric errors with finite second moments. The algorithm communicates only gradient information at each iteration and is therefore communication-efficient. Statistically, the resulting estimator achieves the same nonasymptotic error bound as the centralized estimator, as if all the data were pooled together and came from a distribution with sub-Gaussian tails. Under a finite $(2+\delta)$-th moment condition, we derive a Berry-Esseen bound for the distributed estimator, based on which we construct robust confidence intervals. Numerical studies further confirm that, compared with extant distributed methods, the proposed methods achieve near-optimal accuracy with low variability and better coverage with narrower confidence intervals.
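To make the communication pattern concrete, here is a minimal sketch of a gradient-only distributed robust regression loop. It assumes a Huber-type loss with a fixed robustification parameter `tau`, plain gradient-descent updates with step size `lr`, and synthetic heavy-tailed data; these are illustrative assumptions, not the paper's exact loss, tuning scheme, or update rule.

```python
# A minimal sketch of distributed robust linear regression where each round
# of communication transmits only the p-dimensional local gradients.
# Assumptions (not specified in the abstract): Huber-type loss with fixed
# robustification parameter tau, gradient-descent updates with step size lr.
import numpy as np

rng = np.random.default_rng(0)

def huber_grad(X, y, beta, tau):
    """Gradient of the Huber loss on one machine's local data."""
    r = y - X @ beta                    # residuals
    psi = np.clip(r, -tau, tau)         # clipped (robustified) residuals
    return -X.T @ psi / len(y)

# Simulate m machines, each holding n local observations with heavy-tailed errors.
m, n, p = 10, 500, 5
beta_true = np.ones(p)
machines = []
for _ in range(m):
    X = rng.standard_normal((n, p))
    eps = rng.standard_t(df=3, size=n)  # heavy-tailed (t_3) errors
    machines.append((X, X @ beta_true + eps))

beta, tau, lr = np.zeros(p), 2.0, 0.5
for _ in range(200):
    # Each machine sends only its local gradient; the center averages and updates.
    grads = [huber_grad(X, y, beta, tau) for X, y in machines]
    beta -= lr * np.mean(grads, axis=0)

print("estimation error:", np.linalg.norm(beta - beta_true))
```

The per-round communication cost is one $p$-dimensional vector per machine, independent of the local sample size, which is what makes the scheme communication-efficient.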