We give the first algorithm that maintains an approximate decision tree over an arbitrary sequence of insertions and deletions of labeled examples, with strong guarantees on the worst-case running time per update request. For instance, we show how to maintain a decision tree where every vertex has Gini gain within an additive $\alpha$ of the optimum by performing $O\Big(\frac{d\,(\log n)^4}{\alpha^3}\Big)$ elementary operations per update, where $d$ is the number of features and $n$ the maximum size of the active set (the net result of the update requests). We give similar bounds for the information gain and the variance gain. In fact, all these bounds are corollaries of a more general result, stated in terms of decision rules -- functions that, given a set $S$ of labeled examples, decide whether to split $S$ or predict a label. Decision rules give a unified view of greedy decision tree algorithms regardless of the example and label domains, and lead to a general notion of $\epsilon$-approximate decision trees that, for natural decision rules such as those used by ID3 or C4.5, implies the gain approximation guarantees above. The heart of our work provides a deterministic algorithm that, given any decision rule and any $\epsilon > 0$, maintains an $\epsilon$-approximate tree using $O\!\left(\frac{d\, f(n)}{n} \operatorname{poly}\frac{h}{\epsilon}\right)$ operations per update, where $f(n)$ is the complexity of evaluating the rule over a set of $n$ examples and $h$ is the maximum height of the maintained tree.
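For concreteness, the Gini gain referred to above is the standard impurity decrease (a textbook definition, not restated in the abstract): for a set $S$ of labeled examples with label frequencies $p_1,\dots,p_k$ and a candidate split of $S$ into $S_1$ and $S_2$,
\[
G(S) = 1 - \sum_{i=1}^{k} p_i^2,
\qquad
\operatorname{gain}(S; S_1, S_2) = G(S) - \frac{|S_1|}{|S|}\,G(S_1) - \frac{|S_2|}{|S|}\,G(S_2),
\]
and a vertex satisfies the guarantee when the split it uses has gain within an additive $\alpha$ of the best split available at that vertex.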