Decision forest algorithms typically model data by recursively learning a binary tree structure in which every node splits the feature space into two sub-regions, routing examples into the left or right branch accordingly. In axis-aligned decision forests, the "decision" to route an input example is the result of evaluating a condition on a single dimension of the feature space. Such conditions are learned using efficient, often greedy algorithms that optimize a local loss function. For example, a node's condition may be a threshold function applied to a numerical feature, and its parameter may be learned by sweeping over the set of values available at that node and choosing a threshold that maximizes some measure of purity. Crucially, whether an algorithm exists to learn and evaluate conditions for a given feature type determines whether a decision forest can model that feature type at all. For example, decision forests today cannot consume textual features directly -- such features must first be transformed into summary statistics. In this work, we set out to bridge that gap. We define a condition specific to categorical-set features -- defined as unordered sets of categorical variables -- and present an algorithm to learn it, thereby equipping decision forests with the ability to model text directly, albeit without preserving sequential order. Our algorithm is efficient during training, and the resulting conditions are fast to evaluate with our extension of the QuickScorer inference algorithm. Experiments on benchmark text classification datasets demonstrate the utility and effectiveness of our proposal.
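To make the two kinds of node conditions mentioned above concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm: a numerical threshold split learned by sweeping candidate values and maximizing Gini impurity reduction (one possible purity measure), and a toy categorical-set condition that routes an example by testing whether its unordered set of items intersects a reference set. All function names and the intersection-style condition are hypothetical choices for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(values, labels):
    """Sweep the feature values available at a node and return the threshold
    that maximizes the reduction in Gini impurity (a local purity measure)."""
    parent = gini(labels)
    best_gain, best_t = 0.0, None
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not left or not right:
            continue  # degenerate split, skip
        n = len(labels)
        child = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        gain = parent - child
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain


def categorical_set_condition(example_items, reference_items):
    """Toy categorical-set condition: route the example to the left branch if
    its (unordered) set of items intersects a learned reference set."""
    return bool(set(example_items) & set(reference_items))


if __name__ == "__main__":
    # Numerical feature: find a purity-maximizing threshold.
    x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    y = [0, 0, 0, 1, 1, 1]
    print(best_threshold(x, y))  # -> (3.0, 0.5)

    # Categorical-set feature, e.g. a bag of tokens from a text field.
    print(categorical_set_condition({"cheap", "flights"}, {"flights", "hotel"}))  # -> True
```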