Data cleaning is a crucial part of every data analysis exercise. Yet the currently available R packages do not provide fast and robust methods for cleaning and preparing time series data. The open-source package tsrobprep introduces efficient methods for handling missing values and outliers using model-based approaches. For data imputation, a probabilistic replacement model is proposed, which may consist of autoregressive components and external inputs. For outlier detection, a clustering algorithm based on finite mixture modelling is introduced, which uses time series properties, namely the gradient and the underlying seasonality, as features. The procedure can return, for each observation, the probability of being an outlier as well as the specific cause of an outlier assignment in terms of the provided feature space. The methods are robust and fully tunable. Moreover, the auto_data_cleaning function allows the data preprocessing to be carried out in a single step, without extensive tuning, while still providing suitable results. The primary motivation of the package is the preprocessing of energy system data. We present applications to electricity load, wind, and solar power data.
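To make the two outlier features concrete, the following is a minimal Python sketch, not the tsrobprep implementation and not its API. It computes a gradient feature (first difference) and a seasonal feature (deviation from the median of the same seasonal phase), then flags observations with a robust median/MAD threshold. That threshold is a deliberately simpler stand-in for the finite mixture clustering described above, so it reports a cause per flagged observation but no outlier probability. The function names, the season length of 4, the threshold of 5.0, and the toy series are all assumptions made for illustration.

```python
from statistics import median


def mad_scores(values):
    """Robust z-scores: |x - median| / (1.4826 * MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    # Assumption: fall back to unit scale when the MAD is zero,
    # which happens when more than half of the values are identical.
    scale = 1.4826 * mad if mad > 0 else 1.0
    return [abs(v - med) / scale for v in values]


def outlier_features(series, season):
    """Return the gradient and seasonal-deviation features of a series."""
    # Gradient: first difference, with 0.0 for the first observation.
    gradient = [0.0] + [b - a for a, b in zip(series, series[1:])]
    # Seasonality: deviation from the median of the same seasonal phase.
    phase_median = [median(series[p::season]) for p in range(season)]
    seasonal_dev = [x - phase_median[t % season] for t, x in enumerate(series)]
    return gradient, seasonal_dev


def flag_outliers(series, season, threshold=5.0):
    """List, per observation, the features whose robust score exceeds threshold."""
    gradient, seasonal_dev = outlier_features(series, season)
    g_score = mad_scores(gradient)
    s_score = mad_scores(seasonal_dev)
    flags = []
    for t in range(len(series)):
        causes = []
        if g_score[t] > threshold:
            causes.append("gradient")
        if s_score[t] > threshold:
            causes.append("seasonality")
        flags.append(causes)
    return flags


# Toy data: a period-4 pattern repeated six times, with one injected spike.
series = [10.0, 20.0, 30.0, 20.0] * 6
series[9] = 100.0
flags = flag_outliers(series, season=4)
print(flags[9])
```

In this toy series only index 9 is flagged, and both features are reported as causes. A mixture-model approach would instead assign each observation a posterior probability of belonging to an outlier component, which is what allows the probabilistic output mentioned above.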