Modern large-scale data farms consist of hundreds of thousands of storage devices spread across distributed infrastructure. The devices used in modern data centers (such as controllers, links, SSDs, and HDDs) can fail because of hardware as well as software problems. Such failures or anomalies can be detected by monitoring component activity with machine learning techniques. To use these techniques, researchers need ample historical data from devices in both normal and failure modes to train the algorithms. In this work, we address two problems: 1) the lack of storage data for such methods, which we tackle by creating a simulator, and 2) the application of existing online algorithms that can more quickly detect a failure in one of the components. We created a Go-based (golang) package for simulating the behavior of modern storage infrastructure. The software is based on the discrete-event modeling paradigm and captures the structure and dynamics of high-level storage system building blocks. The package's flexible structure allows us to model a real-world storage system with a configurable number of components. The primary area of interest is exploring the storage machine's behavior under stress testing or during medium- or long-term operation in order to observe failures of its components. To discover failures in the time series generated by the simulator, we modified a change-point detection algorithm that works in online mode. The goal of change-point detection is to discover changes in the distribution of a time series. This work describes an approach to failure detection in time series data based on direct density ratio estimation via binary classifiers.
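The idea of estimating a density ratio with a binary classifier can be illustrated with a minimal Go sketch. It is not the package or the algorithm described in this work: the classifier is swapped for a one-dimensional logistic regression, the synthetic series stands in for simulator output, and the names `makeSeries`, `changeScore`, and `detect`, as well as the window size and threshold, are assumptions chosen for illustration. A classifier is trained to separate a reference window from the adjacent test window. With equal window sizes its logit approximates the log density ratio, and the mean logit over the test window serves as a change score.

```go
package main

import (
	"fmt"
	"math"
)

// makeSeries builds a deterministic toy series (hypothetical stand-in for
// simulator output): bounded pseudo-noise plus a mean shift from index cp on.
func makeSeries(n, cp int, shift float64) []float64 {
	s := make([]float64, n)
	for t := range s {
		ft := float64(t)
		s[t] = 0.5*math.Sin(1.3*ft) + 0.3*math.Cos(2.9*ft)
		if t >= cp {
			s[t] += shift
		}
	}
	return s
}

func sigmoid(z float64) float64 { return 1 / (1 + math.Exp(-z)) }

// changeScore fits a 1-D logistic regression separating ref (label 0) from
// test (label 1) and returns the mean logit over test. The windows must have
// the same non-zero length, so the logit approximates log(p_test/p_ref).
func changeScore(ref, test []float64) float64 {
	n := len(ref)
	total := float64(2 * n)
	// Standardize with pooled statistics to keep gradient descent stable.
	mean, sd := 0.0, 0.0
	for i := 0; i < n; i++ {
		mean += ref[i] + test[i]
	}
	mean /= total
	for i := 0; i < n; i++ {
		sd += (ref[i]-mean)*(ref[i]-mean) + (test[i]-mean)*(test[i]-mean)
	}
	sd = math.Sqrt(sd/total) + 1e-12

	// Batch gradient descent on the L2-regularized log-loss (assumed settings).
	const lr, l2, epochs = 0.5, 0.01, 200
	w, b := 0.0, 0.0
	for e := 0; e < epochs; e++ {
		gw, gb := 0.0, 0.0
		for i := 0; i < n; i++ {
			zr := (ref[i] - mean) / sd
			zt := (test[i] - mean) / sd
			pr := sigmoid(w*zr + b) // label 0: residual is p - 0
			pt := sigmoid(w*zt + b) // label 1: residual is p - 1
			gw += pr*zr + (pt-1)*zt
			gb += pr + (pt - 1)
		}
		w -= lr * (gw/total + l2*w)
		b -= lr * gb / total
	}

	score := 0.0
	for i := 0; i < n; i++ {
		score += w*(test[i]-mean)/sd + b
	}
	return score / float64(n)
}

// detect slides two adjacent windows of length win over the series and
// returns the number of samples seen when the score first exceeds threshold,
// or -1 if it never does.
func detect(series []float64, win int, threshold float64) int {
	for end := 2 * win; end <= len(series); end++ {
		ref := series[end-2*win : end-win]
		test := series[end-win : end]
		if changeScore(ref, test) > threshold {
			return end
		}
	}
	return -1
}

func main() {
	series := makeSeries(300, 200, 3.0)
	fmt.Println("alarm after", detect(series, 50, 1.0), "samples")
}
```

In this sketch the score stays near zero while both windows come from the same regime and rises once the test window fills with post-change samples, so the alarm is raised with a delay of at most one window length. A more flexible classifier would be needed to pick up changes that a linear decision boundary cannot separate, such as a change in variance only.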