Data cleaning is a crucial part of every data analysis exercise. Yet the currently available R packages do not provide fast and robust methods for cleaning and preparing time series data. The open-source package tsrobprep introduces efficient methods for handling missing values and outliers using model-based approaches. For data imputation, a probabilistic replacement model is proposed, which may consist of autoregressive components and external inputs. For outlier detection, a clustering algorithm based on finite mixture modelling is introduced, which uses typical time-series properties as features. By assigning each observation a probability of being an outlying data point, the degree of outlyingness can be determined. The methods are robust and fully tunable. Moreover, the auto_data_cleaning function allows the data preprocessing to be carried out in a single step, without manual tuning, while providing suitable results. The primary motivation of the package is the preprocessing of energy system data; however, the package is also suited to other moderate and large-sized time series data sets. We present applications to electricity load, wind and solar power data.
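To make the idea of a per-observation outlier probability concrete, the following is a minimal sketch in Python. It is not the tsrobprep implementation. It assumes a single feature (the deviation of each point from the median of its seasonal position) and a two-component Gaussian mixture fitted by EM, whereas the package is described as using several time-series features. The function names `seasonal_residuals` and `outlier_probabilities`, the synthetic hourly series, and the injected outliers are all hypothetical and serve only as illustration.

```python
import numpy as np


def seasonal_residuals(y, season):
    """Deviation of each point from the median of its seasonal position."""
    y = np.asarray(y, dtype=float)
    res = np.empty_like(y)
    for k in range(season):
        res[k::season] = y[k::season] - np.median(y[k::season])
    return res


def outlier_probabilities(x, n_iter=200):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Returns the posterior probability that each point belongs to the
    wider component, read here as its degree of outlyingness.
    """
    x = np.asarray(x, dtype=float)
    # Robust starting values: both components centred on the median,
    # one narrow (regular data) and one wide (candidate outliers).
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med)) + 1e-9
    mu = np.array([med, med])
    sigma = np.array([mad, 5.0 * mad])
    weight = np.array([0.95, 0.05])
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability
        # (the constant -0.5*log(2*pi) cancels in the normalisation).
        logp = np.log(weight) - np.log(sigma) - 0.5 * ((x[:, None] - mu) / sigma) ** 2
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means and standard deviations.
        nk = resp.sum(axis=0) + 1e-12
        weight = nk / nk.sum()
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    wide = int(np.argmax(sigma))
    return resp[:, wide]


# Synthetic hourly series with a daily cycle and two injected outliers.
rng = np.random.default_rng(0)
t = np.arange(24 * 30)
y = 10 + 3 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.3, t.size)
y[100] += 15.0
y[400] -= 12.0

prob = outlier_probabilities(seasonal_residuals(y, season=24))
print("P(outlier) at injected points:", prob[100], prob[400])
print("median P(outlier):", np.median(prob))
```

Returning a probability rather than a binary flag lets the user choose the cut-off, which is one way to read the tunability the abstract mentions.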