We consider the problem of learning the interaction strength between the nodes of a network from dependent binary observations residing on these nodes, generated from a Markov Random Field (MRF). In practice, such observations may be corrupted or noisy, particularly in larger networks, so it is important to estimate the parameters of the underlying true MRF robustly to account for this inherent contamination in the observed data. However, it is well known that classical likelihood and pseudolikelihood based approaches are highly sensitive to even a small amount of data contamination. In this paper, we therefore propose a density power divergence (DPD) based robust generalization of the computationally efficient maximum pseudolikelihood (MPL) estimator of the interaction strength parameter, and derive its rate of consistency under the pure model. Moreover, we show that the gross error sensitivities of the proposed DPD based estimators are significantly smaller than that of the MPL estimator, thereby theoretically justifying the greater (local) robustness of the former under contaminated settings. We also demonstrate the superior (finite sample) performance of the DPD based variants over the traditional MPL estimator on a number of synthetically generated contaminated network datasets. Finally, we apply our proposed DPD based estimators to learn the network interaction strength in several real datasets from the diverse domains of social science, neurobiology and genomics.