Applying regression models to network data in which node attributes are the dependent variables poses a methodological challenge. As has been well studied, naive regression neither properly accounts for community structure nor accounts for the dependent variable acting as both model outcome and covariate. To address this methodological gap, we propose a network regression model motivated by the important observation that, when a network is modular, controlling for community structure can significantly account for the meaningful correlation between observations induced by network connections. We propose a generalized estimating equation (GEE) approach to learn model parameters based on clusters defined through any single-membership community detection algorithm applied to the observed network. We provide a necessary condition on the network size and edge formation probabilities to establish the asymptotic normality of the estimated model parameters under the assumption that the graph structure is a stochastic block model. We evaluate the performance of our approach through simulations and apply it to estimate the joint impact of baseline covariates and network effects on the COVID-19 incidence rate among countries connected by a network of commercial airline traffic. We find that the network effect has some influence at the beginning of the pandemic, whereas after the travel ban took effect the percentage of urban population has more influence on the incidence rate than the network effect.
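
To make the clustering idea concrete, the following is a minimal sketch of a linear GEE fit in which the clusters are community labels. It is not the estimator developed in this work: it assumes the community labels have already been produced by some single-membership detection algorithm, uses a single covariate with an intercept, omits any network-effect term, and adopts an independence working correlation with a cluster-robust (sandwich) covariance. The function name `gee_independence` and its arguments are hypothetical.

```python
def gee_independence(x, y, community):
    """Linear GEE sketch: intercept + one covariate, identity link,
    independence working correlation, clusters = community labels.

    Returns (beta, cov): beta = [intercept, slope]; cov is the 2x2
    cluster-robust sandwich covariance. Illustrative assumption only.
    """
    n = len(y)
    # "Bread": A = sum_i X_i X_i^T for design rows X_i = (1, x_i).
    a00, a01 = float(n), float(sum(x))
    a11 = float(sum(xi * xi for xi in x))
    det = a00 * a11 - a01 * a01
    inv = [[a11 / det, -a01 / det], [-a01 / det, a00 / det]]

    # Solve the estimating equation sum_i X_i (y_i - X_i^T beta) = 0.
    b0 = float(sum(y))
    b1 = float(sum(xi * yi for xi, yi in zip(x, y)))
    beta = [inv[0][0] * b0 + inv[0][1] * b1,
            inv[1][0] * b0 + inv[1][1] * b1]

    # Per-community score: s_c = sum over nodes in c of X_i * residual_i.
    scores = {}
    for xi, yi, c in zip(x, y, community):
        resid = yi - (beta[0] + beta[1] * xi)
        s = scores.setdefault(c, [0.0, 0.0])
        s[0] += resid
        s[1] += xi * resid

    # "Meat": B = sum_c s_c s_c^T, so within-community correlation is kept.
    meat = [[0.0, 0.0], [0.0, 0.0]]
    for s in scores.values():
        for i in range(2):
            for j in range(2):
                meat[i][j] += s[i] * s[j]

    # Sandwich covariance: A^{-1} B A^{-1}.
    tmp = [[sum(inv[i][k] * meat[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
    cov = [[sum(tmp[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
    return beta, cov


# Toy usage: six nodes split into two communities (made-up numbers).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 2.9, 5.1, 6.8, 9.3, 10.9]
labels = [0, 0, 0, 1, 1, 1]
beta, cov = gee_independence(x, y, labels)
```

With an independence working correlation the point estimate coincides with ordinary least squares; only the covariance changes, because residuals are summed within each community before being squared. A richer working correlation or an explicit network-effect covariate would alter the estimating equation itself.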