Evaluating whether data streams are drawn from the same distribution is at the heart of various machine learning problems. This is particularly relevant for data generated by dynamical systems, since such systems underlie many real-world processes in biomedicine, economics, and engineering. While kernel two-sample tests are powerful for comparing independent and identically distributed random variables, no established method exists for comparing dynamical systems. The main obstacle is that the independence assumption is inherently violated. We propose a two-sample test for dynamical systems by addressing three core challenges: we (i) introduce a novel notion of mixing that captures autocorrelations in a relevant metric, (ii) propose an efficient way to estimate the speed of mixing relying purely on data, and (iii) integrate these into established kernel two-sample tests. The result is a data-driven method that is straightforward to use in practice and comes with sound theoretical guarantees. In an example application to anomaly detection from human walking data, we show that the test is readily applicable without any human expert knowledge or feature engineering.
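For readers unfamiliar with the kernel two-sample tests the abstract builds on, the following is a minimal sketch of the standard i.i.d. case: an unbiased estimate of the squared maximum mean discrepancy (MMD) with a Gaussian kernel, calibrated by a permutation test. It is not the authors' procedure. The function names, the fixed `bandwidth`, and the `thin` helper are assumptions made for illustration; in particular, `thin` keeps every `gap`-th sample as a crude way to weaken autocorrelation, with `gap` standing in for a mixing-time estimate that the abstract says is obtained from data but does not specify.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a (m, d) and b (n, d)."""
    sq = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def mmd2_unbiased(x, y, bandwidth):
    """Unbiased estimate of the squared MMD between samples x and y."""
    m, n = len(x), len(y)
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    # Diagonal terms are excluded from the within-sample averages.
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * kxy.mean()


def permutation_test(x, y, bandwidth=1.0, n_perm=200, seed=0):
    """Permutation p-value for H0: x and y share a distribution.

    Valid for exchangeable (e.g. i.i.d.) samples only; autocorrelated
    trajectories violate this assumption.
    """
    rng = np.random.default_rng(seed)
    observed = mmd2_unbiased(x, y, bandwidth)
    pooled = np.vstack([x, y])
    m = len(x)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        stat = mmd2_unbiased(pooled[idx[:m]], pooled[idx[m:]], bandwidth)
        exceed += int(stat >= observed)
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)


def thin(series, gap):
    """Hypothetical helper: keep every gap-th row of a trajectory."""
    return series[::gap]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(0.0, 1.0, size=(100, 1))
    y = rng.normal(3.0, 1.0, size=(100, 1))
    stat, p = permutation_test(x, y)
    print(f"MMD^2 = {stat:.3f}, p = {p:.3f}")
```

Inputs are two-dimensional arrays with one sample per row. When the samples come from trajectories rather than independent draws, the permutation null is no longer justified, which is the gap the abstract's mixing-based approach is meant to close.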