We provide high-probability finite-sample complexity guarantees for learning the hidden non-parametric structure of tree-shaped graphical models, whose hidden and observable nodes are discrete random variables with either finite or countable alphabets. We study a fundamental quantity called the (noisy) information threshold, which arises naturally from the error analysis of the Chow-Liu algorithm and, as we discuss, provides explicit necessary and sufficient conditions on sample complexity by effectively summarizing the difficulty of the tree-structure learning problem. Specifically, we show that the finite sample complexity of the Chow-Liu algorithm for ensuring exact structure recovery from noisy data is inversely proportional to the information threshold squared (provided it is positive), and scales near-logarithmically with the number of nodes for a given probability of failure. Conversely, we show that if the number of samples is less than an absolute constant times the inverse of the information threshold squared, then no algorithm can recover the hidden tree structure with probability greater than one half. As a consequence, our upper and lower bounds match with respect to the information threshold, indicating that it is a fundamental quantity for the problem of learning hidden tree-structured models; moreover, the Chow-Liu algorithm with noisy data as input achieves the optimal rate with respect to the information threshold. Lastly, as a byproduct of our analysis, we resolve the problem of tree structure learning in the presence of non-identically distributed observation noise, providing conditions for the convergence of the Chow-Liu algorithm under this setting as well.
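For concreteness, the Chow-Liu procedure referenced above reduces structure learning to a maximum-weight spanning tree over pairwise empirical mutual information. The following is a minimal self-contained sketch (empirical MI plus Kruskal's algorithm with union-find); the function names and the plug-in MI estimator are illustrative choices, not the paper's exact implementation, which operates on noisy observations of the hidden variables.

```python
import itertools
from collections import Counter
from math import log

def empirical_mi(samples, i, j):
    """Plug-in estimate of mutual information between coordinates i and j."""
    n = len(samples)
    pij = Counter((s[i], s[j]) for s in samples)
    pi = Counter(s[i] for s in samples)
    pj = Counter(s[j] for s in samples)
    mi = 0.0
    for (a, b), c in pij.items():
        # sum over observed pairs of p(a,b) * log(p(a,b) / (p(a) p(b)))
        mi += (c / n) * log((c / n) / ((pi[a] / n) * (pj[b] / n)))
    return mi

def chow_liu_tree(samples, p):
    """Maximum-weight spanning tree over pairwise empirical MI (Kruskal)."""
    edges = sorted(
        ((empirical_mi(samples, i, j), i, j)
         for i, j in itertools.combinations(range(p), 2)),
        reverse=True,
    )
    parent = list(range(p))  # union-find forest over the p nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding (i, j) creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

The error analysis in the abstract asks when the empirical MI ordering agrees with the true one on the decisive edges; the information threshold quantifies the smallest such MI gap the estimator must resolve.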