Many of the best statistical classification algorithms are binary classifiers that can distinguish between only two classes. The number of possible ways of generalizing binary classification to multi-class problems increases exponentially with the number of classes. There is some indication that the best method depends on the dataset. Hence, we are particularly interested in data-driven solution design, whether based on prior considerations or on empirical examination of the data. Here we demonstrate how a recursive control language can be used to describe a multitude of different partitioning strategies in multi-class classification, including those in most common use. We use it both to manually construct new partitioning configurations and to examine those that have been automatically designed. Eight different strategies were tested on eight different datasets using a support vector machine (SVM) as the base binary classifier. Numerical results suggest that a one-size-fits-all solution consisting of one-versus-one is appropriate for most datasets. However, three datasets showed better accuracy with other methods. The best solution for the most improved dataset exploited a property of the data to produce an uncertainty coefficient 36% higher (0.016 absolute gain) than one-versus-one. For the same dataset, an adaptive solution that empirically examined the data was also more accurate than one-versus-one while being faster.
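To make the notion of a partitioning strategy concrete, the following minimal sketch writes the two most common schemes, one-versus-rest and one-versus-one, as coding matrices, and counts the distinct ways a single binary classifier can split a set of classes. It is an illustration under stated assumptions and not the recursive control language described above: the function names are hypothetical, and the matrix convention (+1 and -1 for the two sides of a binary problem, 0 for a class left out) is a common one assumed here.

```python
from itertools import combinations


def one_vs_rest(n_classes):
    """One row per binary problem: +1 for the singled-out class, -1 for all others."""
    return [[1 if c == k else -1 for c in range(n_classes)]
            for k in range(n_classes)]


def one_vs_one(n_classes):
    """One row per unordered pair of classes: +1 and -1 for the pair, 0 elsewhere."""
    rows = []
    for a, b in combinations(range(n_classes), 2):
        row = [0] * n_classes
        row[a], row[b] = 1, -1
        rows.append(row)
    return rows


def count_binary_splits(n_classes):
    """Ways to divide all classes into two non-empty groups (order of groups ignored)."""
    return 2 ** (n_classes - 1) - 1


if __name__ == "__main__":
    for row in one_vs_one(4):
        print(row)
    print(count_binary_splits(4), "possible two-group splits of 4 classes")
```

For n classes, one-versus-rest needs n binary classifiers and one-versus-one needs n(n-1)/2. The count of two-group splits already grows as 2^(n-1) - 1 for a single binary problem, which gives a sense of why the number of complete multi-class schemes grows so quickly.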