Rotation forest is a tree-based ensemble that transforms subsets of attributes before constructing each tree. We present an empirical comparison of classifiers for problems with only real-valued features. We evaluate classifiers from three families of algorithms: support vector machines, tree-based ensembles, and neural networks. We compare classifiers on unseen data based on the quality of the decision rule (using classification error), the ability to rank cases (area under the receiver operating characteristic curve), and the probability estimates (using negative log likelihood). We conclude that, in answer to the question posed in the title, yes, rotation forest is significantly more accurate on average than competing techniques when compared on three distinct sets of datasets. The same pattern of results is observed when tuning classifiers on the train data using a grid search. We investigate why rotation forest does so well by testing whether the characteristics of the data can be used to differentiate classifier performance. We assess the impact of the design features of rotation forest through an ablative study that transforms random forest into rotation forest. We identify the major limitation of rotation forest as its scalability, particularly in the number of attributes. To overcome this problem we develop a model to predict the train time of the algorithm and hence propose a contract version of rotation forest in which a run time cap is set {\em a priori}. We demonstrate that on large problems rotation forest can be made an order of magnitude faster without significant loss of accuracy, and that there is no real benefit (on average) from tuning the ensemble. We conclude that, without any domain knowledge to indicate an algorithm preference, rotation forest should be the default algorithm of choice for problems with continuous attributes.
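The three evaluation criteria named above (classification error, area under the ROC curve, and negative log likelihood) can be illustrated with a minimal sketch for the two-class case. This is not the authors' evaluation code: the function names are hypothetical, and the choice of the natural logarithm, averaging over cases, and clipping probabilities at `eps` are assumptions made here for illustration.

```python
import math


def classification_error(y_true, y_pred):
    """Fraction of cases whose predicted label differs from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as half).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def negative_log_likelihood(y_true, p_pos, eps=1e-15):
    """Mean negative natural-log probability assigned to the true class.

    Assumption: probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for t, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p if t == 1 else 1.0 - p)
    return total / len(y_true)


# Toy data (invented for illustration): true labels and P(class = 1).
y_true = [0, 0, 1, 1]
p_pos = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if p >= 0.5 else 0 for p in p_pos]

err = classification_error(y_true, y_pred)   # 0.25: one of four cases is wrong
auc = auroc(y_true, p_pos)                    # 0.75: three of four pairs ordered correctly
nll = negative_log_likelihood(y_true, p_pos)
```

The three measures respond to different things: error depends only on the thresholded labels, AUROC only on the ordering of the scores, and negative log likelihood on the probability values themselves, which is why a comparison may report all three.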