We consider a general statistical estimation problem in which binary labels across different observations are not independent conditioned on their feature vectors, but dependent. This captures settings where, e.g., the observations are collected on a spatial domain, a temporal domain, or a social network, each of which induces dependencies. We model these dependencies in the language of Markov Random Fields and, importantly, allow them to be substantial, i.e., we do not assume that the Markov Random Field capturing these dependencies is in high temperature. As our main contribution, we provide algorithms and statistically efficient estimation rates for this model, giving several instantiations of our bounds in logistic regression, sparse logistic regression, and neural network settings with dependent data. Our estimation guarantees follow from novel results for estimating the parameters (i.e., external fields and interaction strengths) of Ising models from a {\em single} sample. We evaluate our estimation approach on real networked data, showing that it outperforms standard regression approaches that ignore dependencies on three text classification datasets: Cora, Citeseer, and Pubmed.
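For concreteness, one way such a model could be written in the logistic regression case is sketched here. The notation is an assumption made for illustration and is not taken from the abstract: labels $y \in \{-1,+1\}^n$, feature vectors $x_i$, a symmetric interaction matrix $A$ with zero diagonal (e.g. derived from a network), a regression parameter $\theta$ that sets the external fields $\theta^\top x_i$, and an interaction strength $\beta$.

\[
\Pr[y \mid x_1,\dots,x_n] \;\propto\; \exp\Big( \sum_{i=1}^{n} (\theta^\top x_i)\, y_i \;+\; \beta \sum_{i<j} A_{ij}\, y_i y_j \Big),
\qquad
\Pr[y_i = +1 \mid y_{-i}, x] \;=\; \frac{1}{1 + \exp\!\big(-2\,(\theta^\top x_i + \beta \sum_{j \neq i} A_{ij}\, y_j)\big)}.
\]

Under this assumed parameterization, setting $\beta = 0$ recovers independent logistic regression, with $\Pr[y_i = +1 \mid x_i] = 1/(1 + \exp(-2\,\theta^\top x_i))$, while a large $\beta$ corresponds to the strongly dependent (low-temperature) regime that the abstract explicitly allows.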