The decision tree is an effective classification approach in data mining and machine learning. In applications, test costs and misclassification costs should be considered while inducing decision trees. Recently, cost-sensitive learning algorithms based on ID3, such as CS-ID3, IDX, and $\lambda$-ID3, have been proposed to deal with this issue; however, these algorithms handle only symbolic data. In this paper, we develop a decision tree algorithm inspired by C4.5 for numeric data. Our algorithm addresses two major issues. First, we develop the test-cost-weighted information gain ratio as the heuristic information; guided by this heuristic, the algorithm selects, at each split, the attribute that provides a higher gain ratio at a lower test cost. Second, we design a post-pruning strategy that considers the tradeoff between the test costs and misclassification costs of the generated decision tree, thereby reducing the total cost. Experimental results indicate that (1) our algorithm is stable and effective; (2) the post-pruning technique reduces the total cost significantly; and (3) the competition strategy is effective in obtaining a cost-sensitive decision tree with low cost.
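To illustrate the kind of heuristic the abstract describes, the sketch below combines a standard C4.5 gain ratio for a binary split on a numeric attribute with a test-cost penalty. The exact weighting function used in the paper is not specified in the abstract, so dividing the gain ratio by `test_cost ** lam` (and the names `cost_weighted_gain_ratio` and `lam`) are assumptions made purely for illustration, not the authors' formula.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(rows, labels, attr_index, threshold):
    """C4.5-style gain ratio for a binary split on a numeric attribute."""
    left = [lab for row, lab in zip(rows, labels) if row[attr_index] <= threshold]
    right = [lab for row, lab in zip(rows, labels) if row[attr_index] > threshold]
    n = len(labels)
    if not left or not right:
        return 0.0
    # Information gain of the split.
    split_entropy = sum(len(part) / n * entropy(part) for part in (left, right))
    gain = entropy(labels) - split_entropy
    # Split information penalizes unbalanced splits, as in C4.5.
    split_info = -sum(len(part) / n * math.log2(len(part) / n) for part in (left, right))
    return gain / split_info if split_info > 0 else 0.0


def cost_weighted_gain_ratio(rows, labels, attr_index, threshold, test_cost, lam=1.0):
    """Hypothetical cost weighting: discount the gain ratio by a power of the
    attribute's test cost, so that a cheaper attribute with a comparable gain
    ratio is preferred when choosing the splitting attribute."""
    return gain_ratio(rows, labels, attr_index, threshold) / (test_cost ** lam)
```

Under this kind of weighting, the attribute chosen at each node maximizes the cost-weighted score rather than the plain gain ratio, which is the tradeoff the heuristic in the abstract aims at.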