Evaluating whether data streams were generated by the same distribution is at the heart of many machine learning problems, for example change detection. This is particularly relevant for data generated by dynamical systems, since such systems are essential for many real-world processes in biomedical, economic, or engineering applications. While kernel two-sample tests are powerful for comparing independent and identically distributed random variables, no established method exists for comparing dynamical systems. The key problem is the critical independence assumption, which is inherently violated in dynamical systems. We propose a novel two-sample test for dynamical systems by addressing three core challenges: we (i) introduce a novel notion of mixing that captures autocorrelations in a relevant metric, (ii) propose an efficient way to estimate the speed of mixing purely from data, and (iii) integrate these into established kernel two-sample tests. The result is a data-driven method for comparing dynamical systems that is easy to use in practice and comes with sound theoretical guarantees. In an example application to anomaly detection from human walking data, we show that the test readily applies without the need for feature engineering, heuristics, or human expert knowledge.
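For orientation, the following is a minimal sketch of the standard kernel two-sample test for i.i.d. data that the abstract takes as its starting point. It is not the proposed test for dynamical systems. It computes the biased maximum mean discrepancy (MMD) estimate with a Gaussian kernel and calibrates it by permutation. The function names, the fixed bandwidth, and the number of permutations are illustrative assumptions. The permutation step is valid only when samples are exchangeable, which is exactly the independence assumption that autocorrelated trajectories violate.

```python
import numpy as np


def gaussian_gram(z, bandwidth):
    """Gaussian kernel Gram matrix for the rows of z (shape: n x d)."""
    sq_norms = np.sum(z**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * z @ z.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd2_biased(gram, idx_x, idx_y):
    """Biased MMD^2 estimate from a pooled Gram matrix and two index sets."""
    k_xx = gram[np.ix_(idx_x, idx_x)].mean()
    k_yy = gram[np.ix_(idx_y, idx_y)].mean()
    k_xy = gram[np.ix_(idx_x, idx_y)].mean()
    return k_xx + k_yy - 2.0 * k_xy


def mmd_permutation_test(x, y, bandwidth=1.0, n_permutations=500, seed=0):
    """Return (MMD^2 statistic, permutation p-value) for samples x and y.

    Assumes i.i.d. samples; illustrative baseline only.
    """
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    n_x, n_total = len(x), len(pooled)
    gram = gaussian_gram(pooled, bandwidth)  # computed once, reused below

    indices = np.arange(n_total)
    observed = mmd2_biased(gram, indices[:n_x], indices[n_x:])

    # Count how often a random relabelling gives a statistic at least as large.
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(n_total)
        if mmd2_biased(gram, perm[:n_x], perm[n_x:]) >= observed:
            exceed += 1

    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    x = rng.normal(0.0, 1.0, size=(100, 1))
    y = rng.normal(2.0, 1.0, size=(100, 1))  # shifted mean
    stat, p = mmd_permutation_test(x, y)
    print(f"MMD^2 = {stat:.4f}, p-value = {p:.4f}")
```

Applied directly to consecutive states of a single trajectory, this baseline would treat correlated samples as independent, so its p-values could not be trusted. That gap is what motivates estimating the speed of mixing from data before testing.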