An outlier is an observation or data point that lies far from the rest of the data points in a dataset; equivalently, it lies far from the center of mass of the observations. Outliers can skew statistical measures and data distributions, giving a misleading representation of the underlying data and relationships. Removing outliers from the training dataset before modeling has been seen to give better predictions. As machine learning advances, outlier detection models are advancing rapidly as well. The goal of this work is to highlight and compare some existing outlier detection techniques so that data scientists can use this information to select an outlier detection algorithm when building a machine learning model.
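To make the claim about skewed statistical measures concrete, the following is a minimal sketch, not a method taken from this work. It assumes a small made-up sample and uses Tukey's interquartile-range rule as one simple, commonly used detector; the helper name `iqr_outliers`, the multiplier `k=1.5`, and the sample values are all illustrative assumptions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample: seven well-behaved points plus one extreme value.
clean = [10, 11, 12, 13, 14, 15, 16]
with_outlier = clean + [100]

# The mean is pulled strongly toward the extreme value ...
mean_clean = statistics.mean(clean)
mean_dirty = statistics.mean(with_outlier)

# ... while the median barely moves.
median_clean = statistics.median(clean)
median_dirty = statistics.median(with_outlier)

# The IQR rule flags the extreme value as an outlier.
flagged = iqr_outliers(with_outlier)

print(mean_clean, mean_dirty)      # mean before and after adding the outlier
print(median_clean, median_dirty)  # median before and after
print(flagged)
```

In this toy sample a single extreme point shifts the mean far more than the median, which is the kind of distortion that motivates detecting outliers before modeling.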