Collecting complete network data is expensive, time-consuming, and often infeasible. Aggregated Relational Data (ARD), which capture information about a social network by asking respondents questions of the form ``How many people with trait X do you know?'', provide a low-cost alternative when collecting complete network data is not possible. Rather than asking about connections between each pair of individuals directly, ARD record the number of contacts with a given trait that each respondent knows. Despite widespread use and a growing literature on ARD methodology, there is still no systematic understanding of when and why ARD should accurately recover features of the unobserved network. This paper provides such a characterization by deriving conditions under which statistics about the unobserved network (or functions of these statistics, such as regression coefficients) can be consistently estimated using ARD. We do this by first providing consistent estimates of network model parameters for three commonly used probabilistic models: the beta-model with node-specific unobserved effects, the stochastic block model with unobserved community structure, and latent geometric space models with unobserved latent locations. A key observation behind these results is that cross-group link probabilities for a collection of (possibly unobserved) groups identify the model parameters, meaning that ARD are sufficient for parameter estimation. With these estimated parameters, it is possible to simulate graphs from the fitted distribution and analyze the distribution of network statistics. We can then characterize conditions under which the simulated networks based on ARD allow for consistent estimation of unobserved network statistics, such as eigenvector centrality, or of response functions of the unobserved network, such as regression coefficients.
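To make the identification idea concrete, the following is a minimal sketch and not the paper's estimator. It assumes a stochastic block model in which group memberships are observed (the paper also covers unobserved groups), and it uses a simple moment estimate of the cross-group link probabilities. The function names (`simulate_sbm`, `ard_counts`, `estimate_link_probs`) and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def simulate_sbm(groups, P, rng):
    """Draw an undirected graph where i and j link with probability P[g_i, g_j]."""
    n = len(groups)
    probs = P[np.ix_(groups, groups)]                  # n x n matrix of pairwise link probabilities
    upper = np.triu(rng.random((n, n)) < probs, k=1)   # sample each unordered pair once
    return (upper | upper.T).astype(int)               # symmetrize, no self-loops

def ard_counts(A, groups, K):
    """ARD response y[i, k]: how many of respondent i's contacts are in group k."""
    membership = np.eye(K, dtype=int)[groups]          # n x K one-hot membership
    return A @ membership

def estimate_link_probs(y, groups, K):
    """Moment estimate of cross-group link probabilities using ARD only.

    Assumption for this sketch: group labels of respondents are observed.
    """
    sizes = np.bincount(groups, minlength=K)
    membership = np.eye(K, dtype=int)[groups]
    totals = membership.T @ y                          # totals[a, b]: contacts in b reported by members of a
    pairs = np.outer(sizes, sizes).astype(float)       # ordered pairs across groups
    pairs[np.diag_indices(K)] = sizes * (sizes - 1)    # ordered pairs within a group
    return totals / pairs

rng = np.random.default_rng(0)
K, n = 3, 600                                          # illustrative sizes, not from the paper
groups = rng.integers(0, K, size=n)
P = np.array([[0.10, 0.02, 0.01],
              [0.02, 0.08, 0.03],
              [0.01, 0.03, 0.12]])

A = simulate_sbm(groups, P, rng)                       # the "unobserved" network
y = ard_counts(A, groups, K)                           # what the survey would collect
P_hat = estimate_link_probs(y, groups, K)

# Simulate graphs from the fitted model and compare a simple network statistic.
true_density = A.sum() / (n * (n - 1))
sim_density = np.mean(
    [simulate_sbm(groups, P_hat, rng).sum() / (n * (n - 1)) for _ in range(20)]
)
print(np.round(P_hat, 3))
print(true_density, sim_density)
```

In this sketch the adjacency matrix `A` is used only to generate the ARD responses `y`; estimation and re-simulation rely on `y` and the group labels alone. Density stands in for richer statistics such as eigenvector centrality, which could be computed on each simulated graph in the same way.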