Before any binary classification model is put into practice, it is important to validate its performance on a proper test set. Without the frame of reference that a baseline method provides, it is impossible to determine whether a score is `good' or `bad'. The goal of this paper is to examine all baseline methods that are independent of feature values and to determine which of them is `best' and why. Identifying the optimal baseline models simplifies a crucial selection decision in the evaluation process. We prove that the recently proposed Dutch Draw baseline is the best input-independent classifier (independent of feature values) for all positional-invariant measures (independent of sequence order), assuming that the samples are randomly shuffled. The Dutch Draw baseline is therefore the optimal baseline under these intuitive requirements and should be used in practice.
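To make the idea of an input-independent baseline concrete, the following is a minimal sketch for one measure only, accuracy. It assumes the usual description of a Dutch Draw classifier: a fixed number k of the M test samples is chosen uniformly at random and labelled positive, and the rest are labelled negative, without looking at any feature values. The function names `expected_accuracy` and `baseline_accuracy` are hypothetical and not taken from the paper. The baseline is then the best expected score over all choices of k.

```python
from fractions import Fraction


def expected_accuracy(num_pos: int, num_neg: int, k: int) -> Fraction:
    """Expected accuracy when k samples, drawn uniformly at random, are labelled positive.

    num_pos and num_neg are the true class counts in the test set.
    Illustrative helper; the name and signature are assumptions.
    """
    total = num_pos + num_neg
    if total == 0:
        raise ValueError("the test set must contain at least one sample")
    if not 0 <= k <= total:
        raise ValueError("k must lie between 0 and the number of samples")
    # True positives follow a hypergeometric distribution with mean k * P / M.
    expected_tp = Fraction(k * num_pos, total)
    # Every predicted positive that is not a true positive is a false positive.
    expected_tn = num_neg - (k - expected_tp)
    return (expected_tp + expected_tn) / total


def baseline_accuracy(num_pos: int, num_neg: int) -> tuple[Fraction, int]:
    """Return the best expected accuracy over all k, and the k that attains it."""
    total = num_pos + num_neg
    best_k = max(
        range(total + 1),
        key=lambda k: expected_accuracy(num_pos, num_neg, k),
    )
    return expected_accuracy(num_pos, num_neg, best_k), best_k


if __name__ == "__main__":
    score, k = baseline_accuracy(num_pos=30, num_neg=70)
    print(f"baseline accuracy = {float(score):.2f}, attained at k = {k}")
```

For accuracy the expected score is linear in k, so the optimum lies at k = 0 or k = M, which recovers the familiar majority-class baseline max(P, N) / M. Other measures would need their own expected-score formula; this sketch does not cover them.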