Data-Driven Representations for Testing Independence: Modeling, Analysis and Connection with Mutual Information Estimation

This work addresses the problem of testing the independence of two continuous, finite-dimensional random variables through the design of a data-driven partition. The empirical log-likelihood statistic is adopted to approximate the sufficient statistics of an oracle test against independence (a test that knows the two hypotheses). We show that approximating the sufficient statistics of the oracle test yields a learning criterion for designing a data-driven partition, and that this criterion connects with the problem of mutual information estimation. Applying these ideas to a data-dependent tree-structured partition (TSP), we derive conditions on the TSP's parameters that achieve a strongly consistent, distribution-free test of independence over the family of probabilities equipped with a density. Complementing this result, we present finite-length results showing that our TSP scheme can detect the scenario of independence structurally with the data-driven partition, together with new sampling complexity bounds for this detection. Finally, experimental analyses provide evidence of our scheme's advantage for testing independence over strategies that do not use data-driven representations.

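To make the link between the empirical log-likelihood statistic and mutual information estimation concrete, the following is a minimal sketch and not the paper's TSP scheme. It assumes scalar variables and swaps in a simpler data-driven partition: a product grid whose cell boundaries are the empirical quantiles of each variable. On any finite partition, the average log-likelihood ratio between the empirical joint distribution and the product of its empirical marginals equals the plug-in mutual information of the quantized variables, which is what the code computes. The function names, the `n_bins` parameter, and the `threshold` parameter are all hypothetical choices made for illustration.

```python
import numpy as np


def quantile_bins(x, n_bins):
    """Assign each sample to one of n_bins cells with roughly equal counts."""
    # Interior empirical quantiles act as data-driven cell boundaries.
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    return np.searchsorted(edges, x, side="right")


def plugin_mutual_information(x, y, n_bins=8):
    """Plug-in mutual information (in nats) over a quantile product partition.

    This is the empirical log-likelihood ratio statistic of the quantized
    sample: empirical joint versus product of empirical marginals.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ix = quantile_bins(x, n_bins)
    iy = quantile_bins(y, n_bins)

    # Empirical joint distribution over the n_bins x n_bins cells.
    joint = np.zeros((n_bins, n_bins))
    np.add.at(joint, (ix, iy), 1.0)
    joint /= joint.sum()

    # Empirical marginals and their product.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    product = px @ py

    # Sum only over occupied cells (0 * log 0 is taken as 0).
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))


def reject_independence(x, y, threshold, n_bins=8):
    """Declare dependence when the statistic exceeds a user-chosen threshold.

    The threshold is a hypothetical parameter here; the paper derives its own
    conditions for the TSP scheme, which this sketch does not reproduce.
    """
    return plugin_mutual_information(x, y, n_bins) > threshold
```

In this sketch the statistic stays close to zero for independent samples and grows when the variables are dependent, so a small positive threshold separates the two cases for large samples. How the threshold and the partition resolution should scale with the sample size is exactly what the paper's consistency conditions address, and it is left unspecified here.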