Minimum spanning trees (MSTs) are used in a variety of fields, from computer science to geography. Infectious disease researchers have used them to infer the transmission pathway of certain pathogens. However, these are often the MSTs of sample networks, not population networks, and surprisingly little is known about what can be inferred about a population MST from a sample MST. We prove that if $n$ nodes (the sample) are selected uniformly at random from a complete graph with $N$ nodes and unique edge weights (the population), the probability that an edge is in the population graph's MST given that it is in the sample graph's MST is $\frac{n}{N}$. We use simulation to investigate this conditional probability for $G(N,p)$ graphs, Barab\'{a}si-Albert (BA) graphs, graphs whose nodes are distributed in $\mathbb{R}^2$ according to a bivariate standard normal distribution, and an empirical HIV genetic distance network. Broadly, results for the complete, $G(N,p)$, and normal graphs are similar, and results for the BA and empirical HIV graphs are similar. We recommend that researchers use an edge-weighted random walk to sample nodes from the population so that they maximize the probability that an edge is in the population MST given that it is in the sample MST.
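The complete-graph result can be checked numerically. The following is a minimal sketch, not the authors' simulation code: it assumes unique edge weights drawn as a random permutation of ranks, uses a plain Kruskal implementation, and estimates the fraction of sample-MST edges that also lie in the population MST. The function names `kruskal_mst` and `estimate_conditional_probability`, and the parameters `trials` and `seed`, are hypothetical and chosen only for illustration.

```python
import itertools
import random


def kruskal_mst(nodes, weight):
    """Return the MST edge set of the complete graph induced on `nodes`.

    `weight` maps each edge (u, v) with u < v to a unique weight.
    """
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    mst = set()
    candidate_edges = itertools.combinations(sorted(nodes), 2)
    for u, v in sorted(candidate_edges, key=weight.__getitem__):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two components, so keep it
            parent[root_u] = root_v
            mst.add((u, v))
    return mst


def estimate_conditional_probability(N, n, trials, seed=0):
    """Estimate P(edge in population MST | edge in sample MST)."""
    rng = random.Random(seed)
    edges = list(itertools.combinations(range(N), 2))
    hits = 0
    total = 0
    for _ in range(trials):
        # Assumption: unique weights are a random permutation of ranks.
        ranks = list(range(len(edges)))
        rng.shuffle(ranks)
        weight = dict(zip(edges, ranks))

        population_mst = kruskal_mst(range(N), weight)
        sample = rng.sample(range(N), n)  # uniform sample without replacement
        sample_mst = kruskal_mst(sample, weight)

        hits += len(sample_mst & population_mst)
        total += len(sample_mst)
    return hits / total


if __name__ == "__main__":
    N, n = 30, 10
    estimate = estimate_conditional_probability(N, n, trials=300)
    print(f"estimate = {estimate:.3f}, n/N = {n / N:.3f}")
```

Under these assumptions the estimate should land close to $\frac{n}{N}$, up to Monte Carlo noise. The sketch covers only the complete-graph case with uniform node sampling; the $G(N,p)$, BA, spatial, and HIV networks, and the edge-weighted random walk sampler, would need their own graph generators and sampling routines.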