Imbalanced data is a common problem in classification. It often introduces bias into the decisions and policies that rely on the resulting models, so it is vital to understand the factors that cause imbalance in the data (class imbalance). Such hidden biases and imbalances can lead to data tyranny and pose a major challenge to a data democracy. This chapter addresses two essential statistical elements: the degree of class imbalance and the complexity of the concept. Addressing them helps build the foundations of a data democracy. Furthermore, statistical measures appropriate to these scenarios are discussed and applied to a real-life dataset (car insurance claims). Finally, popular data-level methods such as random oversampling, random undersampling, the synthetic minority oversampling technique (SMOTE), Tomek links, and others are implemented in Python, and their performance is compared.
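As a rough illustration of two ideas named in the abstract, the degree of class imbalance and random over/undersampling, here is a minimal NumPy sketch. It is not the chapter's implementation: the function names `imbalance_ratio` and `random_resample` are hypothetical, the imbalance measure is assumed to be the majority-to-minority count ratio, and the toy labels stand in for the car insurance claims data, which is not reproduced here.

```python
import numpy as np


def imbalance_ratio(y):
    """Majority class count divided by minority class count (assumed measure)."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.min()


def random_resample(X, y, mode="over", seed=0):
    """Balance classes by random oversampling ("over") or undersampling ("under")."""
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    # Oversampling grows every class to the largest class size;
    # undersampling shrinks every class to the smallest class size.
    target = counts.max() if mode == "over" else counts.min()
    picked = []
    for cls in classes:
        idx = np.flatnonzero(y == cls)
        if len(idx) < target:
            # Keep all originals and add random duplicates (with replacement).
            extra = rng.choice(idx, size=target - len(idx), replace=True)
            idx = np.concatenate([idx, extra])
        elif len(idx) > target:
            # Keep a random subset (without replacement).
            idx = rng.choice(idx, size=target, replace=False)
        picked.append(idx)
    picked = np.concatenate(picked)
    return X[picked], y[picked]


# Toy data (hypothetical): 90 negatives and 10 positives, 3 features.
X = np.random.default_rng(42).normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

X_over, y_over = random_resample(X, y, mode="over")
X_under, y_under = random_resample(X, y, mode="under")

print("imbalance ratio:", imbalance_ratio(y))
print("after oversampling:", np.bincount(y_over))
print("after undersampling:", np.bincount(y_under))
```

Random oversampling keeps every original row and duplicates minority rows, while random undersampling discards majority rows. SMOTE and Tomek links need neighbour computations and are usually taken from a dedicated library rather than written by hand.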