An increasingly urgent task in the analysis of networks is to develop statistical models that incorporate contextual information in the form of covariates while respecting degree heterogeneity and sparsity. In this paper, we propose a new parameter-sparse random graph model for density-sparse directed networks, with parameters that explicitly account for all of these features. The resulting objective function is akin to that of high-dimensional logistic regression, with the key difference that the link probabilities are allowed to go to zero at a certain rate to accommodate sparse networks. We show that, under appropriate conditions, an estimator obtained from the familiar penalized likelihood, with an $\ell_1$ penalty to achieve parameter sparsity, can alleviate the curse of dimensionality and, crucially, is selection and rate consistent. Interestingly, inference on the covariate parameter can be conducted straightforwardly after model fitting, without the kind of debiasing commonly employed in $\ell_1$-penalized likelihood estimation. Simulation and data analysis corroborate our theoretical findings. In developing our model, we provide the first result highlighting the fallacy of what we call data-selective inference, a common practice of artificially truncating the sample by throwing away nodes based on their connections, by examining the resulting estimation bias theoretically in the Erd\H{o}s-R\'enyi model and empirically in the stochastic block model.
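The bias from data-selective inference in the Erd\H{o}s-R\'enyi case can be illustrated with a small simulation. The following is a minimal sketch, not the analysis carried out in the paper: it assumes that truncation means dropping every node with no incoming or outgoing link, and the function names and the values of `n`, `p`, and `reps` are hypothetical choices made only for illustration. It compares the density estimate computed on all nodes with the one computed after truncation.

```python
import random


def simulate_directed_er(n, p, rng):
    """Directed Erdos-Renyi graph: each ordered pair (i, j), i != j, is linked with probability p."""
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and rng.random() < p}


def density(edges, nodes):
    """Fraction of ordered pairs within `nodes` that are linked (the ER maximum likelihood estimate)."""
    m = len(nodes)
    if m < 2:
        return 0.0
    kept = sum(1 for (i, j) in edges if i in nodes and j in nodes)
    return kept / (m * (m - 1))


def compare_estimates(n=200, p=0.005, reps=50, seed=0):
    """Average the density estimate over `reps` graphs, with and without truncation."""
    rng = random.Random(seed)
    full_sum, truncated_sum = 0.0, 0.0
    for _ in range(reps):
        edges = simulate_directed_er(n, p, rng)
        # Assumed truncation rule: keep only nodes with at least one in- or out-link.
        connected = {node for edge in edges for node in edge}
        full_sum += density(edges, set(range(n)))
        truncated_sum += density(edges, connected)
    return full_sum / reps, truncated_sum / reps


if __name__ == "__main__":
    full_hat, truncated_hat = compare_estimates()
    print(f"true p = 0.005, full-sample estimate = {full_hat:.5f}, "
          f"truncated-sample estimate = {truncated_hat:.5f}")
```

Under this truncation rule the number of links is unchanged while the number of candidate pairs shrinks, so the truncated estimate can only be at least as large as the full-sample one, and the gap widens as the network becomes sparser.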