Modern pattern recognition tasks use complex algorithms that exploit large datasets to make more accurate predictions than traditional algorithms such as decision trees or k-nearest neighbors, which are better suited to describing simple structures. While increased accuracy is often crucial, lower complexity also has value. This paper proposes a training data selection algorithm that identifies multiple subsets with simple structures. A learning algorithm trained on such a subset can classify an instance belonging to that subset more accurately than the traditional learning algorithms. In other words, while existing pattern recognition algorithms attempt to learn a global mapping function that represents the entire dataset, we argue that an ensemble of simple local patterns may describe the data better. The sub-setting algorithm therefore identifies multiple subsets with simple local patterns by finding similar instances in the neighborhood of an instance. This motivation is similar to that of gradient boosted trees, but it focuses on model explainability, which boosted trees lack. The proposed algorithm thus balances accuracy and explainability by identifying a limited number of subsets with simple structures. We applied the proposed algorithm to the international stroke dataset to predict the probability of survival. Our bottom-up sub-setting algorithm performed on average 15% better than the top-down decision tree learned on the entire dataset. The decision trees learned on the identified subsets use some features that the whole-dataset decision tree leaves unused, and each subset represents a distinct population of data.
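The idea of replacing one global model with several simple local models can be illustrated with a minimal sketch. This is not the paper's sub-setting algorithm, whose details the abstract does not give. It assumes Euclidean distance, a fixed neighborhood size `k`, hand-picked seed instances, a depth-1 decision tree (a stump) as the simple learner, and routing of a new instance to the subset with the nearest seed. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def majority(labels):
    # Most frequent label; ties go to the label seen first.
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y):
    """Fit a depth-1 decision tree: the single (feature, threshold) split
    with the fewest training errors, or no split if none helps."""
    default = majority(y)
    best = (None, None, default, default)  # feature, threshold, left label, right label
    best_errors = sum(label != default for label in y)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(label != l_lab for label in left)
                      + sum(label != r_lab for label in right))
            if errors < best_errors:
                best, best_errors = (f, t, l_lab, r_lab), errors
    return best

def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    if f is None:
        return l_lab
    return l_lab if row[f] <= t else r_lab

def build_local_models(X, y, seeds, k):
    """Assumed scheme: for each seed index, train a stump on the seed's
    k nearest training instances (the seed itself included)."""
    models = []
    for s in seeds:
        order = sorted(range(len(X)), key=lambda i: euclidean(X[s], X[i]))
        idx = order[:k]
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        models.append((X[s], stump))
    return models

def predict(models, row):
    # Route the instance to the local model whose seed is nearest.
    _, stump = min(models, key=lambda m: euclidean(m[0], row))
    return predict_stump(stump, row)

def accuracy(predict_fn, X, y):
    return sum(predict_fn(row) == label for row, label in zip(X, y)) / len(y)

# Toy data: the label depends on feature 1, but the rule flips between
# the two clusters along feature 0, so no single stump fits both.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 0], [10, 1], [11, 0], [11, 1]]
y = [0, 1, 0, 1, 1, 0, 1, 0]

global_stump = fit_stump(X, y)
models = build_local_models(X, y, seeds=[0, 4], k=4)

global_acc = accuracy(lambda r: predict_stump(global_stump, r), X, y)
local_acc = accuracy(lambda r: predict(models, r), X, y)
```

On this toy data each local stump remains a single readable rule, while the global stump cannot capture both clusters at once. The gap here is a property of the constructed example and says nothing about the results reported for the stroke dataset.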