Support Vector Machine (SVM) is one of the most popular classification methods and a de facto reference for many machine learning approaches. Its performance is determined by parameter selection, which is usually achieved through a time-consuming grid-search cross-validation procedure. There exist, however, several unsupervised heuristics that exploit the characteristics of the dataset to select parameters instead of using class label information. Although an order of magnitude faster, unsupervised heuristics are scarcely used under the assumption that their results are significantly worse than those of grid search. To challenge that assumption, we conducted a wide study of various heuristics for SVM parameter selection on over thirty datasets, in both supervised and semi-supervised scenarios. In most cases, grid-search cross-validation did not achieve a significant advantage over the heuristics. In particular, heuristic parameter selection may be preferable for high-dimensional and unbalanced datasets, or when only a small number of examples is available. Our results also show that using a heuristic to determine the starting point of a subsequent cross-validation does not yield significantly better results than the default starting point.
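To make the contrast concrete, the following is a minimal sketch of one well-known unsupervised heuristic of the kind discussed above: the median-distance heuristic for the RBF kernel width. The abstract does not name the specific heuristics studied, so this example is purely illustrative; it sets the kernel parameter gamma from the median pairwise squared distance of the inputs, using no class labels and no cross-validation.

```python
import numpy as np

def median_heuristic_gamma(X):
    """Unsupervised choice of gamma for the RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2), based on the median
    pairwise squared Euclidean distance of the data.

    This is an illustrative heuristic, not necessarily one of
    those evaluated in the study.
    """
    X = np.asarray(X, dtype=float)
    sq_norms = (X ** 2).sum(axis=1)
    # Squared pairwise distances via the expansion ||x||^2 + ||y||^2 - 2 x.y
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)  # guard against small negative round-off
    # Median over distinct pairs only (strict upper triangle)
    med = np.median(d2[np.triu_indices_from(d2, k=1)])
    return 1.0 / (2.0 * med)
```

Because the computation needs only a single pass over the pairwise distances, it runs in a fraction of the time a grid search over (C, gamma) would take, which is the speed advantage the study quantifies against any loss in accuracy.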