Large datasets are now often spread over many machines that compute in parallel and communicate with a central machine through short messages. In this paper, we consider a sparse regression setting and develop a new procedure for selective inference with distributed data. While many distributed procedures exist for point estimation in the sparse setting, few options exist for estimating uncertainties or conducting hypothesis tests in models based on the estimated sparsity. We solve a generalized linear regression on each machine, and each machine communicates a selected set of predictors to the central machine. The central machine then forms a generalized linear model with the selected predictors. How do we conduct selective inference for the selected regression coefficients? Is it possible to reuse distributed data, in an aggregated form, for selective inference? Our proposed procedure bases approximately valid selective inference on an asymptotic likelihood. The proposal requires only aggregated, relatively low-dimensional information from each machine, which is merged at the central machine to construct selective inference. Our procedure is also broadly applicable as a solution to the p-value lottery problem that arises with model selection on random splits of data.
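To make the selection and communication step concrete, the following is a minimal sketch and not the procedure proposed in the paper. It rests on several assumptions that the abstract does not state: a Gaussian linear model stands in for the generalized linear model, the lasso (fitted by cyclic coordinate descent) stands in for the sparse regression solved on each machine, and the central machine is assumed to take the union of the communicated sets. All names (`lasso_support`, `simulate`, `selected_union`) and all parameter values are hypothetical. The sketch stops at model selection; it does not implement the selective inference itself.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_support(X, y, lam, n_iter=200):
    """Indices of nonzero lasso coefficients (assumed Gaussian linear model).

    Minimizes (1 / (2n)) * ||y - X b||^2 + lam * ||b||_1
    by cyclic coordinate descent.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta; beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual
            # (the residual with coordinate j's contribution added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n
            rho += col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    return {j for j in range(p) if beta[j] != 0.0}


# Hypothetical simulation settings, chosen only for illustration.
random.seed(0)
p, n_local, n_machines = 10, 100, 3
true_beta = [3.0, -3.0] + [0.0] * (p - 2)


def simulate(n):
    """Draw n observations from a sparse Gaussian linear model."""
    X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(x[j] * true_beta[j] for j in range(p)) + random.gauss(0.0, 1.0)
         for x in X]
    return X, y


# Each machine selects predictors on its own data and sends only the
# selected index set (a short message) to the central machine.
local_supports = []
for _ in range(n_machines):
    X_k, y_k = simulate(n_local)
    local_supports.append(lasso_support(X_k, y_k, lam=0.5))

# Assumed aggregation rule: the central machine keeps the union.
selected_union = set().union(*local_supports)
print(sorted(selected_union))
```

Refitting a model on `selected_union` with the same data and reporting classical p-values would ignore the fact that the data chose the model. That gap is what the selective inference in the abstract addresses.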