Modeling the distribution of high dimensional data by a latent tree graphical model is a prevalent approach in multiple scientific domains. A common task is to infer the underlying tree structure, given only observations of its terminal nodes. Many algorithms for tree recovery are computationally intensive, which limits their applicability to trees of moderate size. For large trees, a common approach, termed divide-and-conquer, is to recover the tree structure in two steps. First, recover the structure separately of multiple, possibly random subsets of the terminal nodes. Second, merge the resulting subtrees to form a full tree. Here, we develop Spectral Top-Down Recovery (STDR), a deterministic divide-and-conquer approach to infer large latent tree models. Unlike previous methods, STDR partitions the terminal nodes in a non random way, based on the Fiedler vector of a suitable Laplacian matrix related to the observed nodes. We prove that under certain conditions, this partitioning is consistent with the tree structure. This, in turn, leads to a significantly simpler merging procedure of the small subtrees. We prove that STDR is statistically consistent and bound the number of samples required to accurately recover the tree with high probability. Using simulated data from several common tree models in phylogenetics, we demonstrate that STDR has a significant advantage in terms of runtime, with improved or similar accuracy.
翻译:以隐形树图形模型模拟高维数据分布是多个科学领域的普遍做法。 一项共同任务是推断树底结构, 仅对其终端节点进行观察。 许多树的恢复算法是计算密集的, 限制了其对中小树的可适用性。 对于大树, 一种共同的方法, 称为分裂和征服, 是将树结构分成两步。 首先, 将终端节点的多个、 可能是随机的子集分开, 其次, 其次, 其次, 合并成一整棵树。 在这里, 我们开发了光谱顶层恢复( STDR), 这是一种决定性的分化分化法, 用以推断大树的树形模型。 与以前的方法不同, STDR 以非随机的方式将终端节点分割成非随机的。 我们证明, 在某些条件下, 此间隔断结构与树底结构结构结构相符合。 而这又导致小子树底树的合并程序大大简化。 我们证明, STDR 与许多相似的模型具有统计性、 和固定的精确性, 在树底模型中, 的模型中, 我们用大量的模型显示。