Modeling the distribution of high dimensional data by a latent tree graphical model is a common approach in multiple scientific domains. A common task is to infer the underlying tree structure given only observations of the terminal nodes. Many algorithms for tree recovery are computationally intensive, which limits their applicability to trees of moderate size. For large trees, a common approach, termed divide-and-conquer, is to recover the tree structure in two steps. First, recover the structure separately for multiple randomly selected subsets of the terminal nodes. Second, merge the resulting subtrees to form a full tree. Here, we develop Spectral Top-Down Recovery (STDR), a divide-and-conquer approach for inference of large latent tree models. Unlike previous methods, STDR's partitioning step is non-random. Instead, it is based on the Fiedler vector of a suitable Laplacian matrix related to the observed nodes. We prove that under certain conditions this partitioning is consistent with the tree structure. This, in turn leads to a significantly simpler merging procedure of the small subtrees. We prove that STDR is statistically consistent, and bound the number of samples required to accurately recover the tree with high probability. Using simulated data from several common tree models in phylogenetics, we demonstrate that STDR has a significant advantage in terms of runtime, with improved or similar accuracy.
翻译:以隐形树图形模型模拟高维数据分布是多个科学领域的常见方法。 共同的任务是推断只对终端节点进行观测的树底结构。 许多树的恢复算法是计算密集的, 限制了其对中等大小树木的适用性。 对于大树而言, 一种共同的方法, 称为分裂和征服, 是将树结构分成分为两步。 首先, 将结构分别用于终端节点中随机选定的多个子集。 第二, 将由此产生的亚树合并成一整棵树。 在这里, 我们开发了光谱顶层恢复( STDR), 这是一种区分和征服方法, 用以推断大型潜在树型模型。 与以往的方法不同, STDR 的分区步骤是非随机的。 相反, 它以与所观察到的节点相关的合适的 Laplacecian 矩阵的Fiedler 矢量为基础。 我们证明, 在某些条件下, 此间隔断层结构与树结构相符合。 这反过来导致小子树的合并程序大大简化。 我们证明, STDR的精确的精确性模型与我们使用了一些共同的模型。