As datasets grow larger, they are often distributed across multiple machines that compute in parallel and communicate with a central machine through short messages. In this paper, we focus on sparse regression and propose a new procedure for conducting selective inference with distributed data. Although many distributed procedures exist for point estimation in the sparse setting, few options are available for estimating uncertainties or conducting hypothesis tests based on the estimated sparsity. We solve a sparse generalized linear regression on each machine, which then communicates a selected set of predictors to the central machine. The central machine uses these selected predictors to form a generalized linear model (GLM). To conduct inference in the selected GLM, our proposed procedure bases approximately valid selective inference on an asymptotic likelihood. The proposal requires only low-dimensional aggregated information from each machine, which is merged at the central machine for selective inference. By reusing these low-dimensional summary statistics from the local machines, our procedure achieves higher power while keeping the communication cost low. The method also offers a solution to the notorious p-value lottery problem that arises when model selection is repeated on random splits of the data.
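The local-selection and central-merge steps described above can be sketched in a minimal form. The sketch below assumes a Gaussian linear model with the lasso as the local sparse-regression solver, and it pools the raw data at the central machine purely for illustration; the paper's actual procedure communicates only low-dimensional summary statistics and adjusts the downstream inference for the selection event, which this sketch does not do. All function names (`lasso_cd`, `local_selection`) are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent with soft-thresholding.
    Minimizes (1/2)||y - X b||^2 + lam * n * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding predictor j
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam * n, 0.0) / col_sq[j]
    return beta

def local_selection(X, y, lam):
    """One machine: fit a sparse regression, report selected predictors."""
    beta = lasso_cd(X, y, lam)
    return set(np.flatnonzero(np.abs(beta) > 1e-8))

rng = np.random.default_rng(0)
n, p, K = 200, 10, 4                    # per-machine samples, predictors, machines
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]        # three true signals

selected = set()
shards = []
for _ in range(K):                      # local step on each machine
    X = rng.standard_normal((n, p))
    y = X @ beta_true + rng.standard_normal(n)
    selected |= local_selection(X, y, lam=0.1)   # central machine merges the sets
    shards.append((X, y))

# Central machine: form a model on the union of selected predictors and refit.
E = sorted(selected)
XE = np.vstack([X[:, E] for X, _ in shards])
yE = np.concatenate([y for _, y in shards])
beta_hat, *_ = np.linalg.lstsq(XE, yE, rcond=None)
print("selected:", E)
print("refit coefficients:", np.round(beta_hat, 2))
```

Note that naively refitting on the selected predictors, as done here, ignores the selection event; the paper's contribution is to correct the resulting inference using an asymptotic likelihood built from the machines' summary statistics.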