Data pre-treatment plays a significant role in improving data quality and thus allows accurate information to be extracted from raw data. Outlier detection is one of the most commonly used pre-treatment techniques, and the so-called $3\sigma$ method is a common way to identify outliers. As shown in the manuscript, it does not identify all outliers, which can distort the overall statistics of the data. This can have a significant impact on further data analysis and can reduce the accuracy of predictive models. Many outlier detection techniques exist; however, beyond the theoretical work, they all need to be evaluated in case studies. Two types of outliers were considered: short-term outliers (erroneous data, noise) and long-term outliers (e.g. malfunctioning for longer periods). The data were taken from the vacuum distillation unit (VDU) of an Asian refinery and included 40 physical sensors (temperature, pressure and flow rate). To identify the short-term outliers we used a modified $3\sigma$ method: the sensor data are divided into chunks determined by change points, and $3\sigma$ thresholds are calculated within each chunk, where the data are close to normally distributed. We have shown that this piecewise $3\sigma$ method detects short-term outliers better than the $3\sigma$ method applied to the entire time series. Nevertheless, it does not perform well for long-term outliers, which can represent another state in the data. For these we used principal component analysis (PCA) with Hotelling's $T^2$ statistic. The PCA results were then checked with the DBSCAN clustering method. The outliers that were visually obvious and correctly detected by the PCA method were also correctly identified by DBSCAN, which supports the consistency and accuracy of the PCA method.
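The difference between the global and the piecewise $3\sigma$ thresholds can be illustrated with a minimal sketch. It is not the implementation used in the manuscript: the function names, the synthetic signal and the assumption that the change points are already known (here, a single change point at index 200) are all hypothetical and chosen only for illustration.

```python
import numpy as np


def global_three_sigma(x):
    """Flag points farther than 3 standard deviations from the overall mean."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - x.mean()) > 3 * x.std(ddof=1)


def piecewise_three_sigma(x, change_points):
    """Flag points outside mean +/- 3*std computed separately in each chunk.

    `change_points` are indices where a new chunk starts. They are assumed
    to be supplied by some change-point detection step (not shown here).
    """
    x = np.asarray(x, dtype=float)
    flags = np.zeros(x.size, dtype=bool)
    bounds = [0, *sorted(change_points), x.size]
    for start, stop in zip(bounds[:-1], bounds[1:]):
        chunk = x[start:stop]
        if chunk.size < 2:
            continue  # too few points to estimate a standard deviation
        mu, sd = chunk.mean(), chunk.std(ddof=1)
        flags[start:stop] = np.abs(chunk - mu) > 3 * sd
    return flags


# Synthetic sensor signal (illustrative only): two operating levels with a
# small deterministic ripple, plus one short-term spike in the first level.
ripple = 0.1 * np.sin(np.arange(400))
signal = np.where(np.arange(400) < 200, 0.0, 10.0) + ripple
signal[50] = 3.0  # short-term outlier

global_flags = global_three_sigma(signal)
piecewise_flags = piecewise_three_sigma(signal, change_points=[200])
```

In this toy signal the level shift inflates the overall standard deviation, so the global threshold misses the spike at index 50, while the per-chunk threshold flags it. A long-term outlier that fills a whole chunk would not be flagged by either function, which is consistent with the use of a separate multivariate method for that case.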