Ensembles are a widely used and effective technique in machine learning, and their success depends on a key element: diversity. The relationship between diversity and generalization, however, is not fully understood and remains an open research question. To reveal the effect of diversity on the generalization of classification ensembles, we investigate three issues: how to measure diversity, how the proposed diversity relates to the generalization error, and how to use this relationship for ensemble pruning. To measure diversity, we use an error decomposition inspired by regression ensembles, which decomposes the error of a classification ensemble into accuracy and diversity terms. We then formulate the relationship between the measured diversity and ensemble performance through the theorem of margin and generalization, and observe that the generalization error is reduced effectively only when the measured diversity increases within a few specific ranges; in other ranges, larger diversity is less beneficial to the generalization of the ensemble. In addition, we propose two pruning methods based on diversity management that exploit this relationship; they can increase diversity appropriately and shrink the ensemble without much loss in performance. Empirical results support the proposed relationship between diversity and ensemble generalization error and demonstrate the effectiveness of the proposed pruning methods.
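The abstract says the diversity measure is inspired by error decomposition in regression ensembles. As a point of reference only, the sketch below checks the classical ambiguity decomposition for a weighted-average regression ensemble under squared loss: the ensemble error equals the weighted average member error (the accuracy term) minus the weighted average spread of members around the ensemble prediction (the diversity term). This is not the classification decomposition proposed in the paper, which the abstract does not spell out. The function name `ambiguity_decomposition` and the toy data are assumptions made for illustration.

```python
import random


def ambiguity_decomposition(preds, weights, y):
    """Return (ensemble_error, avg_member_error, avg_ambiguity) for one example.

    preds   : member predictions f_i(x)
    weights : non-negative weights w_i, assumed to sum to 1
    y       : true target value
    """
    # Weighted-average ensemble prediction.
    f_bar = sum(w * f for w, f in zip(weights, preds))
    # Squared error of the ensemble itself.
    ensemble_error = (f_bar - y) ** 2
    # Accuracy term: weighted average squared error of the members.
    avg_member_error = sum(w * (f - y) ** 2 for w, f in zip(weights, preds))
    # Diversity term: weighted average squared deviation from the ensemble.
    avg_ambiguity = sum(w * (f - f_bar) ** 2 for w, f in zip(weights, preds))
    return ensemble_error, avg_member_error, avg_ambiguity


# Toy data (hypothetical): five noisy predictors of the target y = 1.0.
random.seed(0)
y = 1.0
preds = [y + random.gauss(0.0, 0.5) for _ in range(5)]
weights = [1.0 / len(preds)] * len(preds)

ens_err, acc_term, div_term = ambiguity_decomposition(preds, weights, y)

# The identity holds exactly, up to floating-point rounding.
assert abs(ens_err - (acc_term - div_term)) < 1e-12
```

In the regression setting, more diversity at fixed member accuracy always lowers the ensemble error. The abstract reports that the classification case is less uniform, with diversity helping only in certain ranges.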