Interactions between several features sometimes play an important role in prediction tasks. But taking all the interactions into consideration will lead to an extremely heavy computational burden. For categorical features, the situation is more complicated since the input will be extremely high-dimensional and sparse if one-hot encoding is applied. Inspired by association rule mining, we propose a method that selects interactions of categorical features, called Random Intersection Chains. It uses random intersections to detect frequent patterns, then selects the most meaningful ones among them. At first a number of chains are generated, in which each node is the intersection of the previous node and a random chosen observation. The frequency of patterns in the tail nodes is estimated by maximum likelihood estimation, then the patterns with largest estimated frequency are selected. After that, their confidence is calculated by Bayes' theorem. The most confident patterns are finally returned by Random Intersection Chains. We show that if the number and length of chains are appropriately chosen, the patterns in the tail nodes are indeed the most frequent ones in the data set. We analyze the computation complexity of the proposed algorithm and prove the convergence of the estimators. The results of a series of experiments verify the efficiency and effectiveness of the algorithm.
翻译:几个特性之间的相互作用有时在预测任务中扮演重要角色。 但是, 将所有相互作用都考虑在内, 会导致极其沉重的计算负担。 对于绝对特性, 情况会更加复杂, 因为如果应用一热编码, 输入的频率会非常高且稀疏。 受关联规则开采的启发, 我们提出一种方法, 选择绝对特性的相互作用, 称为随机交叉链条。 它使用随机交叉点来探测经常模式, 然后在其中选择最有意义的模式。 起初, 生成了许多链条, 其中每个节点是前节点的交叉点, 以及随机选择的观测。 尾节点的规律频率会以最大的可能性估计来估计, 然后选择最大估计频率的规律。 之后, 它们的信心会由 Bayes 的圆点计算 。 最自信的模式最终会由随机交叉链条返回 。 我们显示, 如果链条的数量和长度选择得当, 尾节点中的模式确实是数据集中最常见的。 我们分析提议的算法的复杂性, 并证明测算器的趋近的结果。