Kernel techniques are among the most popular and powerful approaches in data science. Among the key features that make kernels ubiquitous are (i) the number of domains they have been designed for, (ii) the Hilbert structure of the function class associated with kernels, which facilitates their statistical analysis, and (iii) their ability to represent probability distributions without loss of information. These properties give rise to the immense success of the Hilbert-Schmidt independence criterion (HSIC), which is able to capture joint independence of random variables under mild conditions and permits closed-form estimators with quadratic computational complexity (w.r.t. the sample size). To alleviate the quadratic computational bottleneck in large-scale applications, multiple HSIC approximations have been proposed; however, these estimators are restricted to $M=2$ random variables, do not extend naturally to the $M\ge 2$ case, and lack theoretical guarantees. In this work, we propose an alternative Nystr\"om-based HSIC estimator which handles the $M\ge 2$ case, prove its consistency, and demonstrate its applicability in multiple contexts, including synthetic examples, dependency testing of media annotations, and causal discovery.
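For orientation, the quadratic-cost baseline mentioned above can be sketched for $M=2$ as the standard biased (V-statistic) estimator $\frac{1}{n^2}\operatorname{tr}(KHLH)$, where $K$ and $L$ are Gram matrices and $H$ is the centering matrix. This is not the Nystr\"om-based estimator proposed in this work; the Gaussian kernel, the bandwidth value, and the function names `gaussian_gram` and `hsic_v_statistic` are assumptions made purely for illustration.

```python
import numpy as np


def gaussian_gram(x, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel (illustrative kernel choice)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n samples of dimension 1
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative round-off
    return np.exp(-d2 / (2.0 * bandwidth ** 2))


def hsic_v_statistic(x, y, bandwidth=1.0):
    """Biased HSIC estimate tr(K H L H) / n^2 for two variables.

    Time and memory are quadratic in the sample size n, which is the
    bottleneck that approximate estimators aim to remove.
    """
    K = gaussian_gram(x, bandwidth)
    L = gaussian_gram(y, bandwidth)
    n = K.shape[0]
    # Double-centering K equals H K H without forming H explicitly.
    Kc = K - K.mean(axis=0, keepdims=True) - K.mean(axis=1, keepdims=True) + K.mean()
    # tr(H K H L) equals the elementwise sum of (H K H) * L because L is symmetric.
    return float(np.sum(Kc * L)) / n ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(200)
    print("dependent pair:  ", hsic_v_statistic(x, x))
    print("independent pair:", hsic_v_statistic(x, rng.standard_normal(200)))
```

On a dependent pair the estimate should be clearly larger than on an independent pair, where it shrinks toward zero as the sample size grows.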