Ensemble classifiers have been widely investigated in the artificial intelligence and machine learning communities. Majority voting and weighted majority voting are two commonly used combination schemes in ensemble learning. However, our understanding of these schemes is incomplete at best, and some of their properties are even misunderstood. In this paper, we formally establish a group of properties of the two schemes under a dataset-level geometric framework. Two key factors, the performance of every component base classifier and the dissimilarity between each pair of component classifiers, are evaluated with the same metric: the Euclidean distance. Consequently, ensembling becomes a deterministic problem, and the performance of an ensemble can be calculated directly by a formula. We prove several theorems of interest and explain their implications for ensembles. In particular, we compare and contrast the effect of the number of component classifiers on the two types of combination schemes. We also conduct an empirical investigation to verify that the theoretical results hold when other metrics, such as accuracy, are used. We believe that these results are useful for understanding the fundamental properties of the two combination schemes and the principles of ensemble classifiers in general. They are also helpful for investigating practical issues in ensemble classifiers, such as predicting ensemble performance and selecting a small number of base classifiers to obtain efficient and effective ensembles.
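To make the setup concrete, the following is a minimal Python sketch of the two combination schemes under the dataset-level geometric view described above: each classifier's predictions on an n-instance dataset form a vector, and both a classifier's performance (its distance to the true label vector) and the dissimilarity between two classifiers are measured with the same Euclidean distance. The binary 0/1 label encoding and the inverse-distance weighting rule here are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def euclidean(u, v):
    """Euclidean distance between two prediction vectors."""
    return np.linalg.norm(np.asarray(u, dtype=float) - np.asarray(v, dtype=float))

def majority_vote(preds):
    """Unweighted majority vote over component prediction vectors.

    preds: array of shape (m, n) with m classifiers, n instances, entries in {0, 1}.
    """
    preds = np.asarray(preds)
    return (preds.mean(axis=0) >= 0.5).astype(int)

def weighted_majority_vote(preds, weights):
    """Weighted majority vote; one non-negative weight per classifier."""
    preds = np.asarray(preds, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # normalize so the 0.5 threshold applies
    return (w @ preds >= 0.5).astype(int)

# Toy example: three base classifiers on five instances.
y = np.array([1, 0, 1, 1, 0])             # ground-truth label vector
preds = np.array([
    [1, 0, 1, 0, 0],                      # errs on instance 3
    [1, 1, 1, 1, 0],                      # errs on instance 1
    [1, 0, 1, 1, 0],                      # agrees with y everywhere
])

# Performance of each component: Euclidean distance to the truth vector.
dists = np.array([euclidean(p, y) for p in preds])
# Dissimilarity between a pair of components uses the same metric, e.g.:
d01 = euclidean(preds[0], preds[1])
# Illustrative weighting: closer to the truth -> larger weight (assumption).
weights = 1.0 / (dists + 1e-9)

print("majority vote:         ", majority_vote(preds))
print("weighted majority vote:", weighted_majority_vote(preds, weights))
print("ensemble distance to y:", euclidean(majority_vote(preds), y))
```

In this geometric picture the ensemble's own prediction vector is again a point in the same space, which is why its performance can be read off directly as a distance rather than estimated empirically.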