This paper presents EvoSplit, a new evolutionary approach for distributing multi-label data sets into disjoint subsets for supervised machine learning. Currently, data set providers divide a data set either randomly or by iterative stratification, a method that aims to preserve the label (or label pair) distribution of the original data set in each of the subsets. With the same aim, this paper first introduces a single-objective evolutionary approach that seeks a split maximizing the similarity for either distribution taken on its own. Second, it presents a new multi-objective evolutionary algorithm that maximizes the similarity for both distributions (label and label pair) simultaneously. Both approaches are validated on well-known multi-label data sets as well as on large image data sets currently used in computer vision and machine learning applications. Compared with iterative stratification, EvoSplit produces better splits according to several measures: Label Distribution, Label Pair Distribution, Examples Distribution, and the number of folds and fold-label pairs with zero positive examples.
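To make the single-objective idea concrete, the following is a minimal sketch in Python and not the EvoSplit algorithm itself. It assumes a simplified fitness (the mean absolute difference between each fold's label frequencies and those of the whole data set) in place of the paper's Label Distribution measure, and a basic swap-mutation hill climber in place of the paper's evolutionary operators. The names `label_frequencies`, `distribution_gap`, and `evolve_split` are hypothetical.

```python
import random


def label_frequencies(labels, indices, n_labels):
    """Fraction of the examples in `indices` that carry each label."""
    counts = [0] * n_labels
    for i in indices:
        for lab in labels[i]:
            counts[lab] += 1
    size = max(len(indices), 1)  # guard against an empty fold
    return [c / size for c in counts]


def distribution_gap(labels, assignment, n_folds, n_labels):
    """Assumed surrogate fitness: mean absolute difference between
    per-fold and overall label frequencies (lower is better)."""
    overall = label_frequencies(labels, range(len(labels)), n_labels)
    gap = 0.0
    for fold in range(n_folds):
        members = [i for i, a in enumerate(assignment) if a == fold]
        freqs = label_frequencies(labels, members, n_labels)
        gap += sum(abs(p - q) for p, q in zip(freqs, overall))
    return gap / (n_folds * n_labels)


def evolve_split(labels, n_folds, n_labels, generations=2000, seed=0):
    """Hill climber with swap mutation; an illustrative stand-in for a
    single-objective evolutionary search over fold assignments."""
    rng = random.Random(seed)
    n = len(labels)
    # Start from a size-balanced random assignment of examples to folds.
    best = [i % n_folds for i in range(n)]
    rng.shuffle(best)
    best_gap = distribution_gap(labels, best, n_folds, n_labels)
    for _ in range(generations):
        child = best[:]
        a, b = rng.sample(range(n), 2)
        # Swapping two examples keeps every fold size unchanged.
        child[a], child[b] = child[b], child[a]
        gap = distribution_gap(labels, child, n_folds, n_labels)
        if gap <= best_gap:  # accept the child if it is no worse
            best, best_gap = child, gap
    return best, best_gap


# Toy data: each example is a list of label indices.
toy_labels = [[0], [0], [1], [1]]
split, gap = evolve_split(toy_labels, n_folds=2, n_labels=2, generations=200)
print(split, gap)
```

A multi-objective variant would track a second gap computed over label pairs and keep the non-dominated assignments rather than a single best one; that extension is omitted here.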