In the real world, data streams are ubiquitous -- think of network traffic or sensor data. Mining patterns, e.g., outliers or clusters, from such data must take place in real time. This is challenging because (1) streams often have high dimensionality, and (2) the data characteristics may change over time. Existing approaches tend to focus on only one of these aspects, either high dimensionality or the specifics of the streaming setting. For static data, a common approach to dealing with high dimensionality -- known as subspace search -- extracts low-dimensional, 'interesting' projections (subspaces), in which patterns are easier to find. In this paper, we address both Challenges (1) and (2) by generalising subspace search to data streams. Our approach, Streaming Greedy Maximum Random Deviation (SGMRD), monitors interesting subspaces in high-dimensional data streams. It leverages novel multivariate dependency estimators and monitoring techniques based on bandit theory. We show that the benefits of SGMRD are twofold: (i) it monitors subspaces efficiently, and (ii) this improves the results of downstream data mining tasks, such as outlier detection. Our experiments on synthetic and real-world data demonstrate that SGMRD outperforms its competitors by a large margin.