We propose a distributed bootstrap method for simultaneous inference on high-dimensional massive data that are stored and processed across many machines. The method produces an $\ell_\infty$-norm confidence region based on a communication-efficient de-biased lasso, and we propose an efficient cross-validation approach to tune the method at every iteration. We prove a lower bound $\tau_{\min}$ on the number of communication rounds that guarantees statistical accuracy and efficiency. Furthermore, $\tau_{\min}$ increases only logarithmically with the number of workers and the intrinsic dimensionality, while remaining nearly invariant to the nominal dimensionality. We test our theory with extensive simulation studies and a variable screening task on a semi-synthetic dataset based on the US Airline On-Time Performance dataset. The code to reproduce the numerical results is available on GitHub: https://github.com/skchao74/Distributed-bootstrap.
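To make the notion of an $\ell_\infty$-norm confidence region concrete, the following is a minimal toy sketch and not the method of this paper: it replaces the de-biased lasso with a plain coordinate-wise mean, uses a single communication round, and applies a Gaussian multiplier bootstrap to the per-worker summaries. All names (`local_mean`, `linf_region`) and parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def local_mean(rows):
    """Coordinate-wise mean of one worker's data (a list of equal-length rows)."""
    n = len(rows)
    d = len(rows[0])
    return [sum(r[l] for r in rows) / n for l in range(d)]


def linf_region(worker_data, alpha=0.05, n_boot=500, rng=None):
    """Return (center, radius) of a region {theta: max_l |center_l - theta_l| <= radius}.

    Toy assumption: every worker holds the same number of rows, and the
    target is the mean vector, not a lasso-type estimator.
    """
    rng = rng or random.Random(0)
    k = len(worker_data)        # number of workers
    n = len(worker_data[0])     # rows per worker
    # Each worker sends only its d-dimensional local mean to the master.
    means = [local_mean(rows) for rows in worker_data]
    d = len(means[0])
    center = [sum(m[l] for m in means) / k for l in range(d)]
    # Scaled deviations of the local means from the pooled mean.
    dev = [[math.sqrt(n) * (m[l] - center[l]) for l in range(d)] for m in means]
    # Multiplier bootstrap: one Gaussian weight per worker, max over coordinates.
    stats = []
    for _ in range(n_boot):
        eps = [rng.gauss(0.0, 1.0) for _ in range(k)]
        stat = max(
            abs(sum(eps[j] * dev[j][l] for j in range(k))) / math.sqrt(k)
            for l in range(d)
        )
        stats.append(stat)
    stats.sort()
    idx = min(n_boot - 1, math.ceil((1 - alpha) * n_boot) - 1)
    # Rescale the bootstrap quantile by the square root of the total sample size.
    radius = stats[idx] / math.sqrt(n * k)
    return center, radius


# Usage: 20 workers, 50 rows each, 5 coordinates, true mean zero.
data_rng = random.Random(1)
workers = [
    [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(50)]
    for _ in range(20)
]
center, radius = linf_region(workers)
```

Because the region is defined through the maximum over coordinates, a single radius covers all coordinates simultaneously, which is what makes it usable for tasks such as variable screening. In this toy version the bootstrap runs entirely on the master from the communicated summaries, so no raw data leave the workers.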